diff mbox series

[V1,vfio,9/9] vfio/virtio: Introduce a vfio driver over virtio devices

Message ID 20231017134217.82497-10-yishaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Introduce a vfio driver over virtio devices | expand

Commit Message

Yishai Hadas Oct. 17, 2023, 1:42 p.m. UTC
Introduce a vfio driver over virtio devices to support the legacy
interface functionality for VFs.

Background, from the virtio spec [1].
--------------------------------------------------------------------
In some systems, there is a need to support a virtio legacy driver with
a device that does not directly support the legacy interface. In such
scenarios, a group owner device can provide the legacy interface
functionality for the group member devices. The driver of the owner
device can then access the legacy interface of a member device on behalf
of the legacy member device driver.

For example, with the SR-IOV group type, group members (VFs) can not
present the legacy interface in an I/O BAR in BAR0 as expected by the
legacy pci driver. If the legacy driver is running inside a virtual
machine, the hypervisor executing the virtual machine can present a
virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
legacy driver accesses to this I/O BAR and forwards them to the group
owner device (PF) using group administration commands.
--------------------------------------------------------------------

Specifically, this driver adds support for a virtio-net VF to be exposed
as a transitional device to a guest driver and allows the legacy IO BAR
functionality on top.

This allows a VM which uses a legacy virtio-net driver in the guest to
work transparently over a VF which its driver in the host is that new
driver.

The driver can be extended easily to support some other types of virtio
devices (e.g virtio-blk), by adding in a few places the specific type
properties as was done for virtio-net.

For now, only the virtio-net use case was tested and as such we introduce
the support only for such a device.

Practically,
Upon probing a VF for a virtio-net device, in case its PF supports
legacy access over the virtio admin commands and the VF doesn't have BAR
0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
transitional device with I/O BAR in BAR 0.

The existence of the simulated I/O bar is reported later on by
overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
exposes itself as a transitional device by overwriting some properties
upon reading its config space.

Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
guest may use it via read/write calls according to the virtio
specification.

Any read/write towards the control parts of the BAR will be captured by
the new driver and will be translated into admin commands towards the
device.

Any data path read/write access (i.e. virtio driver notifications) will
be forwarded to the physical BAR which its properties were supplied by
the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
probing/init flow.

With that code in place a legacy driver in the guest has the look and
feel as if having a transitional device with legacy support for both its
control and data path flows.

[1]
https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 MAINTAINERS                      |   7 +
 drivers/vfio/pci/Kconfig         |   2 +
 drivers/vfio/pci/Makefile        |   2 +
 drivers/vfio/pci/virtio/Kconfig  |  15 +
 drivers/vfio/pci/virtio/Makefile |   4 +
 drivers/vfio/pci/virtio/main.c   | 577 +++++++++++++++++++++++++++++++
 6 files changed, 607 insertions(+)
 create mode 100644 drivers/vfio/pci/virtio/Kconfig
 create mode 100644 drivers/vfio/pci/virtio/Makefile
 create mode 100644 drivers/vfio/pci/virtio/main.c

Comments

Alex Williamson Oct. 17, 2023, 8:24 p.m. UTC | #1
On Tue, 17 Oct 2023 16:42:17 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> +			      const struct pci_device_id *id)
> +{
> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> +	struct virtiovf_pci_core_device *virtvdev;
> +	int ret;
> +
> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> +		ops = &virtiovf_acc_vfio_pci_tran_ops;


This is still an issue for me, it's a very narrow use case where we
have a modern device and want to enable legacy support.  Implementing an
IO BAR and mangling the device ID seems like it should be an opt-in,
not standard behavior for any compatible device.  Users should
generally expect that the device they see in the host is the device
they see in the guest.  They might even rely on that principle.

We can't use the argument that users wanting the default device should
use vfio-pci rather than virtio-vfio-pci because we've already defined
the algorithm by which libvirt should choose a variant driver for a
device.  libvirt will choose this driver for all virtio-net devices.

This driver effectively has the option to expose two different profiles
for the device, native or transitional.  We've discussed profile
support for variant drivers previously as an equivalent functionality
to mdev types, but the only use case for this currently is out-of-tree.
I think this might be the opportunity to define how device profiles are
exposed and selected in a variant driver.

Jason had previously suggested a devlink interface for this, but I
understand that path had been shot down by devlink developers.  Another
obvious option is sysfs, where we might imagine an optional "profiles"
directory, perhaps under vfio-dev.  Attributes of "available" and
"current" could allow discovery and selection of a profile similar to
mdev types.

Is this where we should head with this or are there other options to
confine this transitional behavior?

BTW, what is "acc" in virtiovf_acc_vfio_pci_ops?

> +
> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> +				     &pdev->dev, ops);
> +	if (IS_ERR(virtvdev))
> +		return PTR_ERR(virtvdev);
> +
> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
> +	if (ret)
> +		goto out;
> +	return 0;
> +out:
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> +	/* Only virtio-net is supported/tested so far */
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = virtiovf_pci_table,
> +	.probe = virtiovf_pci_probe,
> +	.remove = virtiovf_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");

Not yet "family" per the device table.  Thanks,

Alex
Yishai Hadas Oct. 18, 2023, 9:01 a.m. UTC | #2
On 17/10/2023 23:24, Alex Williamson wrote:
> On Tue, 17 Oct 2023 16:42:17 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>> +static int virtiovf_pci_probe(struct pci_dev *pdev,
>> +			      const struct pci_device_id *id)
>> +{
>> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
>> +	struct virtiovf_pci_core_device *virtvdev;
>> +	int ret;
>> +
>> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
>> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
>> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
>
> This is still an issue for me, it's a very narrow use case where we
> have a modern device and want to enable legacy support.  Implementing an
> IO BAR and mangling the device ID seems like it should be an opt-in,
> not standard behavior for any compatible device.  Users should
> generally expect that the device they see in the host is the device
> they see in the guest.  They might even rely on that principle.

Users here mainly refer to cloud operators.

We may assume, I believe, that they will be fine with seeing a 
transitional device in the guest as they would like to get the legacy IO 
support for their system.

However, we can still consider supplying a configuration knob in the 
device layer (e.g. in the DPU side) to let a cloud operator turning off 
the legacy capability.

In that case upon probe() of the vfio-virtio driver, we'll just pick-up 
the default vfio-pci 'ops' and in the guest we may have the same device 
ID as of in the host.

With that approach we may not require a HOST side control (i.e. sysfs, 
etc.), but stay with a s device control based on its user manual.

At the end, we don't expect any functional issue nor any compatible 
problem with the new driver, both modern and legacy drivers can work in 
the guest.

Can that work for you ?

>
> We can't use the argument that users wanting the default device should
> use vfio-pci rather than virtio-vfio-pci because we've already defined
> the algorithm by which libvirt should choose a variant driver for a
> device.  libvirt will choose this driver for all virtio-net devices.
>
> This driver effectively has the option to expose two different profiles
> for the device, native or transitional.  We've discussed profile
> support for variant drivers previously as an equivalent functionality
> to mdev types, but the only use case for this currently is out-of-tree.
> I think this might be the opportunity to define how device profiles are
> exposed and selected in a variant driver.
>
> Jason had previously suggested a devlink interface for this, but I
> understand that path had been shot down by devlink developers.  Another
> obvious option is sysfs, where we might imagine an optional "profiles"
> directory, perhaps under vfio-dev.  Attributes of "available" and
> "current" could allow discovery and selection of a profile similar to
> mdev types.

Referring to the sysfs option,

Do you expect the sysfs data to effect the libvirt decision ? may that 
require changes in libvirt ?

In addition,
May that be too late as the sysfs entry will be created upon driver 
binding by libvirt or that we have in mind some other option to control 
with that ?

Jason,
Can you please comment here as well ?

> Is this where we should head with this or are there other options to
> confine this transitional behavior?
>
> BTW, what is "acc" in virtiovf_acc_vfio_pci_ops?

"acc" is just a short-cut to "access", see also here[1] a similar usage.

[1] 
https://elixir.bootlin.com/linux/v6.6-rc6/source/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c#L1380

>
>> +
>> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>> +				     &pdev->dev, ops);
>> +	if (IS_ERR(virtvdev))
>> +		return PTR_ERR(virtvdev);
>> +
>> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
>> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
>> +	if (ret)
>> +		goto out;
>> +	return 0;
>> +out:
>> +	vfio_put_device(&virtvdev->core_device.vdev);
>> +	return ret;
>> +}
>> +
>> +static void virtiovf_pci_remove(struct pci_dev *pdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>> +
>> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
>> +	vfio_put_device(&virtvdev->core_device.vdev);
>> +}
>> +
>> +static const struct pci_device_id virtiovf_pci_table[] = {
>> +	/* Only virtio-net is supported/tested so far */
>> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
>> +	{}
>> +};
>> +
>> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
>> +
>> +static struct pci_driver virtiovf_pci_driver = {
>> +	.name = KBUILD_MODNAME,
>> +	.id_table = virtiovf_pci_table,
>> +	.probe = virtiovf_pci_probe,
>> +	.remove = virtiovf_pci_remove,
>> +	.err_handler = &vfio_pci_core_err_handlers,
>> +	.driver_managed_dma = true,
>> +};
>> +
>> +module_pci_driver(virtiovf_pci_driver);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
>> +MODULE_DESCRIPTION(
>> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
> Not yet "family" per the device table.  Thanks,

Right

How about dropping the word "family" and say instead ".. for VIRTIO 
devices" as we have in the Kconfig in that patch [1] ?

[1] "This provides support for exposing VIRTIO VF devices .."

Yishai

> Alex
>
Alex Williamson Oct. 18, 2023, 12:51 p.m. UTC | #3
On Wed, 18 Oct 2023 12:01:57 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 17/10/2023 23:24, Alex Williamson wrote:
> > On Tue, 17 Oct 2023 16:42:17 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:  
> >> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> >> +			      const struct pci_device_id *id)
> >> +{
> >> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> >> +	struct virtiovf_pci_core_device *virtvdev;
> >> +	int ret;
> >> +
> >> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> >> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> >> +		ops = &virtiovf_acc_vfio_pci_tran_ops;  
> >
> > This is still an issue for me, it's a very narrow use case where we
> > have a modern device and want to enable legacy support.  Implementing an
> > IO BAR and mangling the device ID seems like it should be an opt-in,
> > not standard behavior for any compatible device.  Users should
> > generally expect that the device they see in the host is the device
> > they see in the guest.  They might even rely on that principle.  
> 
> Users here mainly refer to cloud operators.
> 
> We may assume, I believe, that they will be fine with seeing a 
> transitional device in the guest as they would like to get the legacy IO 
> support for their system.
> 
> However, we can still consider supplying a configuration knob in the 
> device layer (e.g. in the DPU side) to let a cloud operator turning off 
> the legacy capability.

This is a driver that implements to the virtio standard, so I don't see
how we can assume that the current use case is the only use case we'll
ever see.  Therefore we cannot assume this will only be consumed by a
specific cloud operator making use of NVIDIA hardware.  Other vendors
may implement this spec for other environments.  We might even see an
implementation of a virtual virtio-net device with SR-IOV.

> In that case upon probe() of the vfio-virtio driver, we'll just pick-up 
> the default vfio-pci 'ops' and in the guest we may have the same device 
> ID as of in the host.
> 
> With that approach we may not require a HOST side control (i.e. sysfs, 
> etc.), but stay with a s device control based on its user manual.
> 
> At the end, we don't expect any functional issue nor any compatible 
> problem with the new driver, both modern and legacy drivers can work in 
> the guest.
> 
> Can that work for you ?

This is not being proposed as an NVIDIA specific driver, we can't make
such claims relative to all foreseeable implementations of virtio-net.

> > We can't use the argument that users wanting the default device should
> > use vfio-pci rather than virtio-vfio-pci because we've already defined
> > the algorithm by which libvirt should choose a variant driver for a
> > device.  libvirt will choose this driver for all virtio-net devices.
> >
> > This driver effectively has the option to expose two different profiles
> > for the device, native or transitional.  We've discussed profile
> > support for variant drivers previously as an equivalent functionality
> > to mdev types, but the only use case for this currently is out-of-tree.
> > I think this might be the opportunity to define how device profiles are
> > exposed and selected in a variant driver.
> >
> > Jason had previously suggested a devlink interface for this, but I
> > understand that path had been shot down by devlink developers.  Another
> > obvious option is sysfs, where we might imagine an optional "profiles"
> > directory, perhaps under vfio-dev.  Attributes of "available" and
> > "current" could allow discovery and selection of a profile similar to
> > mdev types.  
> 
> Referring to the sysfs option,
> 
> Do you expect the sysfs data to effect the libvirt decision ? may that 
> require changes in libvirt ?

We don't have such changes in libvirt for mdev, other than the ability
of the nodedev information to return available type information.
Generally the mdev type is configured outside of libvirt, which falls
into the same sort of configuration as necessary to enable migration on
mlx5-vfio-pci.

It's possible we could allows a default profile which would be used if
the open_device callback is used without setting a profile, but we need
to be careful of vGPU use cases where profiles consume resources and a
default selection may affect other devices.

> In addition,
> May that be too late as the sysfs entry will be created upon driver 
> binding by libvirt or that we have in mind some other option to control 
> with that ?

No different than mlx5-vfio-pci, there's a necessary point between
binding the driver and using the device where configuration is needed.

> Jason,
> Can you please comment here as well ?
> 
> > Is this where we should head with this or are there other options to
> > confine this transitional behavior?
> >
> > BTW, what is "acc" in virtiovf_acc_vfio_pci_ops?  
> 
> "acc" is just a short-cut to "access", see also here[1] a similar usage.
> 
> [1] 
> https://elixir.bootlin.com/linux/v6.6-rc6/source/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c#L1380

Per the Kconfig:

	  This provides generic PCI support for HiSilicon ACC devices
	  using the VFIO framework.

Therefore I understood acc in this use case to be a formal reference to
the controller name.

> >> +
> >> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> >> +				     &pdev->dev, ops);
> >> +	if (IS_ERR(virtvdev))
> >> +		return PTR_ERR(virtvdev);
> >> +
> >> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> >> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
> >> +	if (ret)
> >> +		goto out;
> >> +	return 0;
> >> +out:
> >> +	vfio_put_device(&virtvdev->core_device.vdev);
> >> +	return ret;
> >> +}
> >> +
> >> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> >> +
> >> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
> >> +	vfio_put_device(&virtvdev->core_device.vdev);
> >> +}
> >> +
> >> +static const struct pci_device_id virtiovf_pci_table[] = {
> >> +	/* Only virtio-net is supported/tested so far */
> >> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
> >> +	{}
> >> +};
> >> +
> >> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> >> +
> >> +static struct pci_driver virtiovf_pci_driver = {
> >> +	.name = KBUILD_MODNAME,
> >> +	.id_table = virtiovf_pci_table,
> >> +	.probe = virtiovf_pci_probe,
> >> +	.remove = virtiovf_pci_remove,
> >> +	.err_handler = &vfio_pci_core_err_handlers,
> >> +	.driver_managed_dma = true,
> >> +};
> >> +
> >> +module_pci_driver(virtiovf_pci_driver);
> >> +
> >> +MODULE_LICENSE("GPL");
> >> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> >> +MODULE_DESCRIPTION(
> >> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");  
> > Not yet "family" per the device table.  Thanks,  
> 
> Right
> 
> How about dropping the word "family" and say instead ".. for VIRTIO 
> devices" as we have in the Kconfig in that patch [1] ?
> 
> [1] "This provides support for exposing VIRTIO VF devices .."

Are we realistically extending this beyond virtio-net?  Maybe all the
descriptions should be limited to what is actually supported as
proposed rather than aspirational goals.  Thanks,

Alex
Parav Pandit Oct. 18, 2023, 1:06 p.m. UTC | #4
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, October 18, 2023 6:22 PM

> Are we realistically extending this beyond virtio-net?  Maybe all the descriptions
> should be limited to what is actually supported as proposed rather than
> aspirational goals.  Thanks,
Virtio blk would the second user of it.
The series didn't cover the test of virtio-blk mainly due to time limitation due to which Yishai only included the net device id.
Jason Gunthorpe Oct. 18, 2023, 4:33 p.m. UTC | #5
On Tue, Oct 17, 2023 at 02:24:48PM -0600, Alex Williamson wrote:
> On Tue, 17 Oct 2023 16:42:17 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> > +static int virtiovf_pci_probe(struct pci_dev *pdev,
> > +			      const struct pci_device_id *id)
> > +{
> > +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> > +	struct virtiovf_pci_core_device *virtvdev;
> > +	int ret;
> > +
> > +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> > +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> > +		ops = &virtiovf_acc_vfio_pci_tran_ops;
> 
> This is still an issue for me, it's a very narrow use case where we
> have a modern device and want to enable legacy support.  Implementing an
> IO BAR and mangling the device ID seems like it should be an opt-in,
> not standard behavior for any compatible device.  Users should
> generally expect that the device they see in the host is the device
> they see in the guest.  They might even rely on that principle.

I think this should be configured when the VF is provisioned. If the
user does not want legacy IO bar support then the VFIO VF function
should not advertise the capability, and they won't get driver
support.

I think that is a very reasonable way to approach this - it is how we
approached similar problems for mlx5. The provisioning interface is
what "profiles" the VF, regardless of if VFIO is driving it or not.

> We can't use the argument that users wanting the default device should
> use vfio-pci rather than virtio-vfio-pci because we've already defined
> the algorithm by which libvirt should choose a variant driver for a
> device.  libvirt will choose this driver for all virtio-net devices.

Well, we can if the use case is niche. I think profiling a virtio VF
to support legacy IO bar emulation and then not wanting to use it is
a niche case.

The same argument is going come with live migration. This same driver
will still bind and enable live migration if the virtio function is
profiled to support it. If you don't want that in your system then
don't profile the VF for migration support.

> This driver effectively has the option to expose two different profiles
> for the device, native or transitional.  We've discussed profile
> support for variant drivers previously as an equivalent functionality
> to mdev types, but the only use case for this currently is out-of-tree.
> I think this might be the opportunity to define how device profiles are
> exposed and selected in a variant driver.

Honestly, I've been trying to keep this out of VFIO...

The function is profiled when it is created, by whatever created
it. As in the other thread we have a vast amount of variation in what
is required to provision the function in the first place. "Legacy IO
BAR emulation support" is just one thing. virtio-net needs to be
hooked up to real network and get a MAC, virtio-blk needs to be hooked
up to real storage and get a media. At a minimum. This is big and
complicated.

It may not even be the x86 running VFIO that is doing this
provisioning, the PCI function may come pre-provisioned from a DPU.

It feels better to keep that all in one place, in whatever external
thing is preparing the function before giving it to VFIO. VFIO is
concerned with operating a prepared function.

When we get to SIOV it should not be VFIO that is
provisioning/creating functions. The owning driver should be doing
this and routing the function to VFIO (eg with an aux device or
otherwise)

This gets back to the qemu thread on the grace patch where we need to
ask how does the libvirt world see this, given there is no good way to
generically handle all scenarios without a userspace driver to operate
elements.

> Jason had previously suggested a devlink interface for this, but I
> understand that path had been shot down by devlink developers.  

I think we go some things support but supporting all things was shot
down.

> Another obvious option is sysfs, where we might imagine an optional
> "profiles" directory, perhaps under vfio-dev.  Attributes of
> "available" and "current" could allow discovery and selection of a
> profile similar to mdev types.

IMHO it is a far too complex problem for sysfs.

Jason
Alex Williamson Oct. 18, 2023, 6:29 p.m. UTC | #6
On Wed, 18 Oct 2023 13:33:33 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Oct 17, 2023 at 02:24:48PM -0600, Alex Williamson wrote:
> > On Tue, 17 Oct 2023 16:42:17 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:  
> > > +static int virtiovf_pci_probe(struct pci_dev *pdev,
> > > +			      const struct pci_device_id *id)
> > > +{
> > > +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> > > +	struct virtiovf_pci_core_device *virtvdev;
> > > +	int ret;
> > > +
> > > +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> > > +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
> > > +		ops = &virtiovf_acc_vfio_pci_tran_ops;  
> > 
> > This is still an issue for me, it's a very narrow use case where we
> > have a modern device and want to enable legacy support.  Implementing an
> > IO BAR and mangling the device ID seems like it should be an opt-in,
> > not standard behavior for any compatible device.  Users should
> > generally expect that the device they see in the host is the device
> > they see in the guest.  They might even rely on that principle.  
> 
> I think this should be configured when the VF is provisioned. If the
> user does not want legacy IO bar support then the VFIO VF function
> should not advertise the capability, and they won't get driver
> support.
> 
> I think that is a very reasonable way to approach this - it is how we
> approached similar problems for mlx5. The provisioning interface is
> what "profiles" the VF, regardless of if VFIO is driving it or not.

It seems like a huge assumption that every device is going to allow
this degree of specification in provisioning VFs.  mlx5 is a vendor
specific driver, it can make such assumptions in design philosophy.

> > We can't use the argument that users wanting the default device should
> > use vfio-pci rather than virtio-vfio-pci because we've already defined
> > the algorithm by which libvirt should choose a variant driver for a
> > device.  libvirt will choose this driver for all virtio-net devices.  
> 
> Well, we can if the use case is niche. I think profiling a virtio VF
> to support legacy IO bar emulation and then not wanting to use it is
> a niche case.
> 
> The same argument is going come with live migration. This same driver
> will still bind and enable live migration if the virtio function is
> profiled to support it. If you don't want that in your system then
> don't profile the VF for migration support.

What in the virtio or SR-IOV spec requires a vendor to make this
configurable?

> > This driver effectively has the option to expose two different profiles
> > for the device, native or transitional.  We've discussed profile
> > support for variant drivers previously as an equivalent functionality
> > to mdev types, but the only use case for this currently is out-of-tree.
> > I think this might be the opportunity to define how device profiles are
> > exposed and selected in a variant driver.  
> 
> Honestly, I've been trying to keep this out of VFIO...
> 
> The function is profiled when it is created, by whatever created
> it. As in the other thread we have a vast amount of variation in what
> is required to provision the function in the first place. "Legacy IO
> BAR emulation support" is just one thing. virtio-net needs to be
> hooked up to real network and get a MAC, virtio-blk needs to be hooked
> up to real storage and get a media. At a minimum. This is big and
> complicated.
> 
> It may not even be the x86 running VFIO that is doing this
> provisioning, the PCI function may come pre-provisioned from a DPU.
> 
> It feels better to keep that all in one place, in whatever external
> thing is preparing the function before giving it to VFIO. VFIO is
> concerned with operating a prepared function.
> 
> When we get to SIOV it should not be VFIO that is
> provisioning/creating functions. The owning driver should be doing
> this and routing the function to VFIO (eg with an aux device or
> otherwise)
> 
> This gets back to the qemu thread on the grace patch where we need to
> ask how does the libvirt world see this, given there is no good way to
> generically handle all scenarios without a userspace driver to operate
> elements.

So nothing here is really "all in one place", it may be in the
provisioning of the VF, outside of the scope of the host OS, it might
be a collection of scripts or operators with device or interface
specific tooling to configure the device.  Sometimes this configuration
will be before the device is probed by the vfio-pci variant driver,
sometimes in between probing and opening the device.

I don't see why it becomes out of scope if the variant driver itself
provides some means for selecting a device profile.  We have evidence
both from mdev vGPUs and here (imo) that we can expect to see that
behavior, so why wouldn't we want to attempt some basic shared
interface for variant drivers to implement for selecting such a profile
rather than add to this hodgepodge 

> > Jason had previously suggested a devlink interface for this, but I
> > understand that path had been shot down by devlink developers.    
> 
> I think we go some things support but supporting all things was shot
> down.
> 
> > Another obvious option is sysfs, where we might imagine an optional
> > "profiles" directory, perhaps under vfio-dev.  Attributes of
> > "available" and "current" could allow discovery and selection of a
> > profile similar to mdev types.  
> 
> IMHO it is a far too complex problem for sysfs.

Isn't it then just like devlink, not a silver bullet, but useful for
some configuration?  AIUI, devlink shot down a means to list available
profiles for a device and a means to select one of those profiles.
There are a variety of attributes in sysfs which perform this sort of
behavior.  Specifying a specific profile in sysfs can be difficult, and
I'm not proposing sysfs profile support as a mandatory feature, but I'm
also not a fan of the vendor specific sysfs approach that out of tree
drivers have taken.

The mdev type interface is certainly not perfect, but from it we've
been able to develop mdevctl to allow persistent and complex
configurations of mdev devices.  I'd like to see the ability to do
something like that with variant drivers that offer multiple profiles
without always depending on vendor specific interfaces.  Thanks,

Alex
Jason Gunthorpe Oct. 18, 2023, 7:28 p.m. UTC | #7
On Wed, Oct 18, 2023 at 12:29:25PM -0600, Alex Williamson wrote:

> > I think this should be configured when the VF is provisioned. If the
> > user does not want legacy IO bar support then the VFIO VF function
> > should not advertise the capability, and they won't get driver
> > support.
> > 
> > I think that is a very reasonable way to approach this - it is how we
> > approached similar problems for mlx5. The provisioning interface is
> > what "profiles" the VF, regardless of if VFIO is driving it or not.
> 
> It seems like a huge assumption that every device is going to allow
> this degree of specification in provisioning VFs.  mlx5 is a vendor
> specific driver, it can make such assumptions in design philosophy.

I don't think it is a huge assumption.  Some degree of configuration
is already mandatory just to get basic functionality, and it isn't
like virtio can really be a full fixed HW implementation on the
control plane.

So the assumption is that some device SW that already must exist, and
already must be configurable just gains 1 more bit of
configuration. It does not seem like a big assumption to me at all.

Regardless, if we set an architecture/philosophy from the kernel side
vendors will align to it.

> > The same argument is going come with live migration. This same driver
> > will still bind and enable live migration if the virtio function is
> > profiled to support it. If you don't want that in your system then
> > don't profile the VF for migration support.
> 
> What in the virtio or SR-IOV spec requires a vendor to make this
> configurable?

The same part that describes how to make live migration work :)

> So nothing here is really "all in one place", it may be in the
> provisioning of the VF, outside of the scope of the host OS, it might
> be a collection of scripts or operators with device or interface
> specific tooling to configure the device.  Sometimes this configuration
> will be before the device is probed by the vfio-pci variant driver,
> sometimes in between probing and opening the device.

We don't have any in tree examples of between probing and opening -
I'd like to keep it that way..

> I don't see why it becomes out of scope if the variant driver itself
> provides some means for selecting a device profile.  We have evidence
> both from mdev vGPUs and here (imo) that we can expect to see that
> behavior, so why wouldn't we want to attempt some basic shared
> interface for variant drivers to implement for selecting such a profile
> rather than add to this hodgepodge

The GPU profiling approach is an artifact of the mdev sysfs. I do not
expect to actually do this in tree.. The function should be profiled
before it reaches VFIO, not after. This is often necessary anyhow
because a function can be bound to kernel driver in almost all cases
too.

Consistently following this approach prevents future problems where we
end up with different ways to profile/provision functions depending on
what driver is attached (vfio/in-kernel). That would be a mess.

> > > Another obvious option is sysfs, where we might imagine an optional
> > > "profiles" directory, perhaps under vfio-dev.  Attributes of
> > > "available" and "current" could allow discovery and selection of a
> > > profile similar to mdev types.  
> > 
> > IMHO it is a far too complex problem for sysfs.
> 
> Isn't it then just like devlink, not a silver bullet, but useful for
> some configuration? 

Yes, but that accepts the architecture that configuration and
provisioning should happen on the VFIO side at all, which I think is
not a good direction.

> AIUI, devlink shot down a means to list available
> profiles for a device and a means to select one of those profiles.

And other things, yes.

> There are a variety of attributes in sysfs which perform this sort of
> behavior.  Specifying a specific profile in sysfs can be difficult, and
> I'm not proposing sysfs profile support as a mandatory feature, but I'm
> also not a fan of the vendor specific sysfs approach that out of tree
> drivers have taken.

It is my belief we are going to have to build some good general
infrastructure to support SIOV. The action to spawn, provision and
activate a SIOV function should be a generic infrastructure of some
kind. We have already been through a precursor to all this with mlx5's
devlink infrastructure for SFs (which are basically SIOV functions),
so we have a pretty deep experience now.

mdev mushed all those steps into VFIO, but it belongs in different
layers. SIOV devices are not going to be exclusively consumed by VFIO.

If we have such a layer then it would be possible to also configure
VFIO "through the back door" of the provisioning layer in the kernel.

I think that is the closest we can get to some kind of generic API
here. The trouble is that it will not actually be generic because
provisioning is not generic or standardized. It doesn't eliminate the
need for having a user space driver component that actually
understands exactly what to do in order to fully provision something.

I don't know what to say about that from a libvirt perspective. Like
how does that world imagine provisioning network and storage
functions? All I know is at the openshift level it is done with
operators (aka user space drivers).

> The mdev type interface is certainly not perfect, but from it we've
> been able to develop mdevctl to allow persistent and complex
> configurations of mdev devices.  I'd like to see the ability to do
> something like that with variant drivers that offer multiple profiles
> without always depending on vendor specific interfaces.

I think profiles are too narrow an abstraction to be that broadly
useful beyond simple device types. Given the device variety we already
have I don't know if there is an alternative to a user space driver to
manage provisioning. Indeed that is how we see our actual deployments
already.

IOW I'm worried we invest a lot of effort in VFIO profiling for little
return.

Jason
Alex Williamson Oct. 24, 2023, 7:57 p.m. UTC | #8
On Tue, 17 Oct 2023 16:42:17 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
> 
> Background, from the virtio spec [1].
> --------------------------------------------------------------------
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
> 
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> --------------------------------------------------------------------
> 
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
> 
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF which its driver in the host is that new
> driver.
> 
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
> 
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
> 
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
> 
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
> 
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
> 
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
> 
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR which its properties were supplied by
> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
> probing/init flow.
> 
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
> 
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
> 
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  MAINTAINERS                      |   7 +
>  drivers/vfio/pci/Kconfig         |   2 +
>  drivers/vfio/pci/Makefile        |   2 +
>  drivers/vfio/pci/virtio/Kconfig  |  15 +
>  drivers/vfio/pci/virtio/Makefile |   4 +
>  drivers/vfio/pci/virtio/main.c   | 577 +++++++++++++++++++++++++++++++
>  6 files changed, 607 insertions(+)
>  create mode 100644 drivers/vfio/pci/virtio/Kconfig
>  create mode 100644 drivers/vfio/pci/virtio/Makefile
>  create mode 100644 drivers/vfio/pci/virtio/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7a7bd8bd80e9..680a70063775 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22620,6 +22620,13 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/pci/mlx5/
>  
> +VFIO VIRTIO PCI DRIVER
> +M:	Yishai Hadas <yishaih@nvidia.com>
> +L:	kvm@vger.kernel.org
> +L:	virtualization@lists.linux-foundation.org
> +S:	Maintained
> +F:	drivers/vfio/pci/virtio
> +
>  VFIO PCI DEVICE SPECIFIC DRIVERS
>  R:	Jason Gunthorpe <jgg@nvidia.com>
>  R:	Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>  
>  source "drivers/vfio/pci/pds/Kconfig"
>  
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
>  endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>  
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> new file mode 100644
> index 000000000000..89eddce8b1bd
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config VIRTIO_VFIO_PCI
> +        tristate "VFIO support for VIRTIO PCI devices"
> +        depends on VIRTIO_PCI
> +        select VFIO_PCI_CORE
> +        help
> +          This provides support for exposing VIRTIO VF devices using the VFIO
> +          framework that can work with a legacy virtio driver in the guest.
> +          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> +          not indicate I/O Space.
> +          As of that this driver emulated I/O BAR in software to let a VF be
> +          seen as a transitional device in the guest and let it work with
> +          a legacy driver.

This description is a little bit subtle to the hard requirements on the
device.  Reading this, one might think that this should work for any
SR-IOV VF virtio device, when in reality it only support virtio-net
currently and places a number of additional requirements on the device
(ex. legacy access and MSI-X support).

> +
> +          If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> new file mode 100644
> index 000000000000..2039b39fb723
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/Makefile
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> +virtio-vfio-pci-y := main.o
> +
> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> new file mode 100644
> index 000000000000..3fef4b21f7e6
> --- /dev/null
> +++ b/drivers/vfio/pci/virtio/main.c
> @@ -0,0 +1,577 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/vfio_pci_core.h>
> +#include <linux/virtio_pci.h>
> +#include <linux/virtio_net.h>
> +#include <linux/virtio_pci_admin.h>
> +
> +struct virtiovf_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	u8 bar0_virtual_buf_size;
> +	u8 *bar0_virtual_buf;
> +	/* synchronize access to the virtual buf */
> +	struct mutex bar_mutex;
> +	void __iomem *notify_addr;
> +	u32 notify_offset;
> +	u8 notify_bar;

Push the above u8 to the end of the structure for better packing.

> +	u16 pci_cmd;
> +	u16 msix_ctrl;
> +};
> +
> +static int
> +virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
> +			     loff_t pos, char __user *buf,
> +			     size_t count, bool read)
> +{
> +	bool msix_enabled = virtvdev->msix_ctrl & PCI_MSIX_FLAGS_ENABLE;
> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> +	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> +	u16 opcode;
> +	int ret;
> +
> +	mutex_lock(&virtvdev->bar_mutex);
> +	if (read) {
> +		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> +		ret = virtio_pci_admin_legacy_io_read(pdev, opcode, pos, count,
> +						      bar0_buf + pos);
> +		if (ret)
> +			goto out;
> +		if (copy_to_user(buf, bar0_buf + pos, count))
> +			ret = -EFAULT;
> +		goto out;
> +	}

TBH, I think the symmetry of read vs write would be more apparent if
this were an else branch.

> +
> +	if (copy_from_user(bar0_buf + pos, buf, count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> +	ret = virtio_pci_admin_legacy_io_write(pdev, opcode, pos, count,
> +					       bar0_buf + pos);
> +out:
> +	mutex_unlock(&virtvdev->bar_mutex);
> +	return ret;
> +}
> +
> +static int
> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> +			    loff_t pos, char __user *buf,
> +			    size_t count, bool read)
> +{
> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> +	u16 queue_notify;
> +	int ret;
> +
> +	if (pos + count > virtvdev->bar0_virtual_buf_size)
> +		return -EINVAL;
> +
> +	switch (pos) {
> +	case VIRTIO_PCI_QUEUE_NOTIFY:
> +		if (count != sizeof(queue_notify))
> +			return -EINVAL;
> +		if (read) {
> +			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> +						virtvdev->notify_addr);
> +			if (ret)
> +				return ret;
> +			if (copy_to_user(buf, &queue_notify,
> +					 sizeof(queue_notify)))
> +				return -EFAULT;
> +			break;
> +		}

Same.

> +
> +		if (copy_from_user(&queue_notify, buf, count))
> +			return -EFAULT;
> +
> +		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> +					 virtvdev->notify_addr);
> +		break;
> +	default:
> +		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
> +						   read);
> +	}
> +
> +	return ret ? ret : count;
> +}
> +
> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> +				  loff_t range2_start, size_t count2,
> +				  loff_t *start_offset,
> +				  size_t *intersect_count,
> +				  size_t *register_offset)
> +{
> +	if (range1_start <= range2_start &&
> +	    range1_start + count1 > range2_start) {
> +		*start_offset = range2_start - range1_start;
> +		*intersect_count = min_t(size_t, count2,
> +					 range1_start + count1 - range2_start);
> +		if (register_offset)
> +			*register_offset = 0;
> +		return true;
> +	}
> +
> +	if (range1_start > range2_start &&
> +	    range1_start < range2_start + count2) {
> +		*start_offset = range1_start;
> +		*intersect_count = min_t(size_t, count1,
> +					 range2_start + count2 - range1_start);
> +		if (register_offset)
> +			*register_offset = range1_start - range2_start;
> +		return true;
> +	}

Seems like we're missing a case, and some documentation.

The first test requires range1 to fully enclose range2 and provides the
offset of range2 within range1 and the length of the intersection.

The second test requires range1 to start from a non-zero offset within
range2 and returns the absolute offset of range1 and the length of the
intersection.

The register offset is then non-zero offset of range1 into range2.  So
does the caller use the zero value in the previous test to know range2
exists within range1?

We miss the cases where range1_start is <= range2_start and range1
terminates within range2.  I suppose we'll see below how this is used,
but it seems asymmetric and incomplete.

> +
> +	return false;
> +}
> +
> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> +					char __user *buf, size_t count,
> +					loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	size_t register_offset;
> +	loff_t copy_offset;
> +	size_t copy_count;
> +	__le32 val32;
> +	__le16 val16;
> +	u8 val8;
> +	int ret;
> +
> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> +				  &copy_offset, &copy_count, NULL)) {

If a user does 'setpci -s x:00.0 2.b' (range1 <= range2, but terminates
within range2) they'll not enter this branch and see 41 rather than 00.

If a user does 'setpci -s x:00.0 3.b' (range1 > range2, range 1
contained within range 2), the above function returns a copy_offset of
range1_start (ie. 3).  But that offset is applied to the buffer, which
is out of bounds.  The function needs to have returned an offset of 1
and it should have applied to the val16 address.

I don't think this works like it's intended.


> +		val16 = cpu_to_le16(0x1000);

Please #define this somewhere rather than hiding a magic value here.

> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
> +			return -EFAULT;
> +	}
> +
> +	if ((virtvdev->pci_cmd & PCI_COMMAND_IO) &&
> +	    range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> +				  &copy_offset, &copy_count, &register_offset)) {
> +		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
> +				   copy_count))
> +			return -EFAULT;
> +		val16 |= cpu_to_le16(PCI_COMMAND_IO);
> +		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
> +				 copy_count))
> +			return -EFAULT;
> +	}
> +
> +	if (range_intersect_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> +				  &copy_offset, &copy_count, NULL)) {
> +		/* Transional needs to have revision 0 */
> +		val8 = 0;
> +		if (copy_to_user(buf + copy_offset, &val8, copy_count))
> +			return -EFAULT;
> +	}
> +
> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> +				  &copy_offset, &copy_count, NULL)) {
> +		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);

I'd still like to see the remainder of the BAR follow the semantics
vfio-pci does.  I think this requires a __le32 bar0 field on the
virtvdev struct to store writes and the read here would mask the lower
bits up to the BAR size and OR in the IO indicator bit.


> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
> +			return -EFAULT;
> +	}
> +
> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> +				  &copy_offset, &copy_count, NULL)) {
> +		/*
> +		 * Transitional devices use the PCI subsystem device id as
> +		 * virtio device id, same as legacy driver always did.

Where did we require the subsystem vendor ID to be 0x1af4?  This
subsystem device ID really only makes since given that subsystem
vendor ID, right?  Otherwise I don't see that non-transitional devices,
such as the VF, have a hard requirement per the spec for the subsystem
vendor ID.

Do we want to make this only probe the correct subsystem vendor ID or do
we want to emulate the subsystem vendor ID as well?  I don't see this is
correct without one of those options.

> +		 */
> +		val16 = cpu_to_le16(VIRTIO_ID_NET);
> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
> +			return -EFAULT;
> +	}
> +
> +	return count;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> +		       size_t count, loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> +		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> +
> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> +		return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +
> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> +	if (ret) {
> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> +	pm_runtime_put(&pdev->dev);
> +	return ret;
> +}
> +
> +static ssize_t
> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	int ret;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> +		size_t register_offset;
> +		loff_t copy_offset;
> +		size_t copy_count;
> +
> +		if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(virtvdev->pci_cmd),
> +					  &copy_offset, &copy_count,
> +					  &register_offset)) {
> +			if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
> +					   buf + copy_offset,
> +					   copy_count))
> +				return -EFAULT;
> +		}
> +
> +		if (range_intersect_range(pos, count, pdev->msix_cap + PCI_MSIX_FLAGS,
> +					  sizeof(virtvdev->msix_ctrl),
> +					  &copy_offset, &copy_count,
> +					  &register_offset)) {
> +			if (copy_from_user((void *)&virtvdev->msix_ctrl + register_offset,
> +					   buf + copy_offset,
> +					   copy_count))
> +				return -EFAULT;
> +		}

MSI-X is setup via ioctl, so you're relying on a userspace that writes
through the control register bit even though it doesn't do anything.
Why not use vfio_pci_core_device.irq_type to track if MSI-X mode is
enabled?

> +	}
> +
> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> +		return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +
> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> +	if (ret) {
> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> +		return -EIO;
> +	}
> +
> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> +	pm_runtime_put(&pdev->dev);
> +	return ret;
> +}
> +
> +static int
> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> +				   unsigned int cmd, unsigned long arg)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> +	void __user *uarg = (void __user *)arg;
> +	struct vfio_region_info info = {};
> +
> +	if (copy_from_user(&info, uarg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;
> +
> +	switch (info.index) {
> +	case VFIO_PCI_BAR0_REGION_INDEX:
> +		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +		info.size = virtvdev->bar0_virtual_buf_size;
> +		info.flags = VFIO_REGION_INFO_FLAG_READ |
> +			     VFIO_REGION_INFO_FLAG_WRITE;
> +		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> +	default:
> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +	}
> +}
> +
> +static long
> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> +			     unsigned long arg)
> +{
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> +	default:
> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +	}
> +}
> +
> +static int
> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> +{
> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> +	int ret;
> +
> +	/*
> +	 * Setup the BAR where the 'notify' exists to be used by vfio as well
> +	 * This will let us mmap it only once and use it when needed.
> +	 */
> +	ret = vfio_pci_core_setup_barmap(core_device,
> +					 virtvdev->notify_bar);
> +	if (ret)
> +		return ret;
> +
> +	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> +			virtvdev->notify_offset;
> +	return 0;
> +}
> +
> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	if (virtvdev->bar0_virtual_buf) {
> +		/*
> +		 * Upon close_device() the vfio_pci_core_disable() is called
> +		 * and will close all the previous mmaps, so it seems that the
> +		 * valid life cycle for the 'notify' addr is per open/close.
> +		 */
> +		ret = virtiovf_set_notify_addr(virtvdev);
> +		if (ret) {
> +			vfio_pci_core_disable(vdev);
> +			return ret;
> +		}
> +	}
> +
> +	vfio_pci_core_finish_enable(vdev);
> +	return 0;
> +}
> +
> +static int virtiovf_get_device_config_size(unsigned short device)
> +{
> +	/* Network card */
> +	return offsetofend(struct virtio_net_config, status);
> +}
> +
> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> +{
> +	u64 offset;
> +	int ret;
> +	u8 bar;
> +
> +	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
> +				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> +				&bar, &offset);
> +	if (ret)
> +		return ret;
> +
> +	virtvdev->notify_bar = bar;
> +	virtvdev->notify_offset = offset;
> +	return 0;
> +}
> +
> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +	struct pci_dev *pdev;
> +	int ret;
> +
> +	ret = vfio_pci_core_init_dev(core_vdev);
> +	if (ret)
> +		return ret;
> +
> +	pdev = virtvdev->core_device.pdev;
> +	ret = virtiovf_read_notify_info(virtvdev);
> +	if (ret)
> +		return ret;
> +
> +	/* Being ready with a buffer that supports MSIX */
> +	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
> +				virtiovf_get_device_config_size(pdev->device);
> +	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> +					     GFP_KERNEL);
> +	if (!virtvdev->bar0_virtual_buf)
> +		return -ENOMEM;
> +	mutex_init(&virtvdev->bar_mutex);
> +	return 0;
> +}
> +
> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> +
> +	kfree(virtvdev->bar0_virtual_buf);
> +	vfio_pci_core_release_dev(core_vdev);
> +}
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> +	.name = "virtio-transitional-vfio-pci",
> +	.init = virtiovf_pci_init_device,
> +	.release = virtiovf_pci_core_release_dev,
> +	.open_device = virtiovf_pci_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = virtiovf_vfio_pci_core_ioctl,
> +	.read = virtiovf_pci_core_read,
> +	.write = virtiovf_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> +	.name = "virtio-acc-vfio-pci",
> +	.init = vfio_pci_core_init_dev,
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = virtiovf_pci_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = vfio_pci_core_ioctl,
> +	.device_feature = vfio_pci_core_ioctl_feature,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = vfio_pci_core_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> +{
> +	struct resource *res = pdev->resource;
> +
> +	return res->flags ? true : false;
> +}
> +
> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> +	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> +
> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> +{
> +	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> +	u8 *buf;
> +	int ret;
> +
> +	buf = kzalloc(buf_size, GFP_KERNEL);
> +	if (!buf)
> +		return false;
> +
> +	ret = virtio_pci_admin_list_query(pdev, buf, buf_size);
> +	if (ret)
> +		goto end;
> +
> +	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> +		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> +		ret = -EOPNOTSUPP;
> +		goto end;
> +	}
> +
> +	/* Confirm the used commands */
> +	memset(buf, 0, buf_size);
> +	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> +	ret = virtio_pci_admin_list_use(pdev, buf, buf_size);
> +end:
> +	kfree(buf);
> +	return ret ? false : true;
> +}
> +
> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> +			      const struct pci_device_id *id)
> +{
> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> +	struct virtiovf_pci_core_device *virtvdev;
> +	int ret;
> +
> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)


All but the last test here are fairly evident requirements of the
driver.  Why do we require a device that supports MSI-X?

Thanks,
Alex


> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
> +
> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> +				     &pdev->dev, ops);
> +	if (IS_ERR(virtvdev))
> +		return PTR_ERR(virtvdev);
> +
> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
> +	if (ret)
> +		goto out;
> +	return 0;
> +out:
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> +{
> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> +
> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
> +	vfio_put_device(&virtvdev->core_device.vdev);
> +}
> +
> +static const struct pci_device_id virtiovf_pci_table[] = {
> +	/* Only virtio-net is supported/tested so far */
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> +
> +static struct pci_driver virtiovf_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = virtiovf_pci_table,
> +	.probe = virtiovf_pci_probe,
> +	.remove = virtiovf_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(virtiovf_pci_driver);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
Yishai Hadas Oct. 25, 2023, 2:35 p.m. UTC | #9
On 24/10/2023 22:57, Alex Williamson wrote:
> On Tue, 17 Oct 2023 16:42:17 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> Introduce a vfio driver over virtio devices to support the legacy
>> interface functionality for VFs.
>>
>> Background, from the virtio spec [1].
>> --------------------------------------------------------------------
>> In some systems, there is a need to support a virtio legacy driver with
>> a device that does not directly support the legacy interface. In such
>> scenarios, a group owner device can provide the legacy interface
>> functionality for the group member devices. The driver of the owner
>> device can then access the legacy interface of a member device on behalf
>> of the legacy member device driver.
>>
>> For example, with the SR-IOV group type, group members (VFs) can not
>> present the legacy interface in an I/O BAR in BAR0 as expected by the
>> legacy pci driver. If the legacy driver is running inside a virtual
>> machine, the hypervisor executing the virtual machine can present a
>> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
>> legacy driver accesses to this I/O BAR and forwards them to the group
>> owner device (PF) using group administration commands.
>> --------------------------------------------------------------------
>>
>> Specifically, this driver adds support for a virtio-net VF to be exposed
>> as a transitional device to a guest driver and allows the legacy IO BAR
>> functionality on top.
>>
>> This allows a VM which uses a legacy virtio-net driver in the guest to
>> work transparently over a VF which its driver in the host is that new
>> driver.
>>
>> The driver can be extended easily to support some other types of virtio
>> devices (e.g virtio-blk), by adding in a few places the specific type
>> properties as was done for virtio-net.
>>
>> For now, only the virtio-net use case was tested and as such we introduce
>> the support only for such a device.
>>
>> Practically,
>> Upon probing a VF for a virtio-net device, in case its PF supports
>> legacy access over the virtio admin commands and the VF doesn't have BAR
>> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
>> transitional device with I/O BAR in BAR 0.
>>
>> The existence of the simulated I/O bar is reported later on by
>> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
>> exposes itself as a transitional device by overwriting some properties
>> upon reading its config space.
>>
>> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
>> guest may use it via read/write calls according to the virtio
>> specification.
>>
>> Any read/write towards the control parts of the BAR will be captured by
>> the new driver and will be translated into admin commands towards the
>> device.
>>
>> Any data path read/write access (i.e. virtio driver notifications) will
>> be forwarded to the physical BAR which its properties were supplied by
>> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
>> probing/init flow.
>>
>> With that code in place a legacy driver in the guest has the look and
>> feel as if having a transitional device with legacy support for both its
>> control and data path flows.
>>
>> [1]
>> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   MAINTAINERS                      |   7 +
>>   drivers/vfio/pci/Kconfig         |   2 +
>>   drivers/vfio/pci/Makefile        |   2 +
>>   drivers/vfio/pci/virtio/Kconfig  |  15 +
>>   drivers/vfio/pci/virtio/Makefile |   4 +
>>   drivers/vfio/pci/virtio/main.c   | 577 +++++++++++++++++++++++++++++++
>>   6 files changed, 607 insertions(+)
>>   create mode 100644 drivers/vfio/pci/virtio/Kconfig
>>   create mode 100644 drivers/vfio/pci/virtio/Makefile
>>   create mode 100644 drivers/vfio/pci/virtio/main.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7a7bd8bd80e9..680a70063775 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -22620,6 +22620,13 @@ L:	kvm@vger.kernel.org
>>   S:	Maintained
>>   F:	drivers/vfio/pci/mlx5/
>>   
>> +VFIO VIRTIO PCI DRIVER
>> +M:	Yishai Hadas <yishaih@nvidia.com>
>> +L:	kvm@vger.kernel.org
>> +L:	virtualization@lists.linux-foundation.org
>> +S:	Maintained
>> +F:	drivers/vfio/pci/virtio
>> +
>>   VFIO PCI DEVICE SPECIFIC DRIVERS
>>   R:	Jason Gunthorpe <jgg@nvidia.com>
>>   R:	Yishai Hadas <yishaih@nvidia.com>
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 8125e5f37832..18c397df566d 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>>   
>>   source "drivers/vfio/pci/pds/Kconfig"
>>   
>> +source "drivers/vfio/pci/virtio/Kconfig"
>> +
>>   endmenu
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 45167be462d8..046139a4eca5 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>>   obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>>   
>>   obj-$(CONFIG_PDS_VFIO_PCI) += pds/
>> +
>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
>> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
>> new file mode 100644
>> index 000000000000..89eddce8b1bd
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/Kconfig
>> @@ -0,0 +1,15 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +config VIRTIO_VFIO_PCI
>> +        tristate "VFIO support for VIRTIO PCI devices"
>> +        depends on VIRTIO_PCI
>> +        select VFIO_PCI_CORE
>> +        help
>> +          This provides support for exposing VIRTIO VF devices using the VFIO
>> +          framework that can work with a legacy virtio driver in the guest.
>> +          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
>> +          not indicate I/O Space.
>> +          As of that this driver emulated I/O BAR in software to let a VF be
>> +          seen as a transitional device in the guest and let it work with
>> +          a legacy driver.
> This description is a little bit subtle to the hard requirements on the
> device.  Reading this, one might think that this should work for any
> SR-IOV VF virtio device, when in reality it only support virtio-net
> currently and places a number of additional requirements on the device
> (ex. legacy access and MSI-X support).

Sure, will change to refer only to virtio-net devices which are capable 
for 'legacy access'.

No need to refer to MSI-X, please see below.

>
>> +
>> +          If you don't know what to do here, say N.
>> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
>> new file mode 100644
>> index 000000000000..2039b39fb723
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/Makefile
>> @@ -0,0 +1,4 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
>> +virtio-vfio-pci-y := main.o
>> +
>> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
>> new file mode 100644
>> index 000000000000..3fef4b21f7e6
>> --- /dev/null
>> +++ b/drivers/vfio/pci/virtio/main.c
>> @@ -0,0 +1,577 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include <linux/device.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/pci.h>
>> +#include <linux/pm_runtime.h>
>> +#include <linux/types.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/vfio_pci_core.h>
>> +#include <linux/virtio_pci.h>
>> +#include <linux/virtio_net.h>
>> +#include <linux/virtio_pci_admin.h>
>> +
>> +struct virtiovf_pci_core_device {
>> +	struct vfio_pci_core_device core_device;
>> +	u8 bar0_virtual_buf_size;
>> +	u8 *bar0_virtual_buf;
>> +	/* synchronize access to the virtual buf */
>> +	struct mutex bar_mutex;
>> +	void __iomem *notify_addr;
>> +	u32 notify_offset;
>> +	u8 notify_bar;
> Push the above u8 to the end of the structure for better packing.
OK
>> +	u16 pci_cmd;
>> +	u16 msix_ctrl;
>> +};
>> +
>> +static int
>> +virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
>> +			     loff_t pos, char __user *buf,
>> +			     size_t count, bool read)
>> +{
>> +	bool msix_enabled = virtvdev->msix_ctrl & PCI_MSIX_FLAGS_ENABLE;
>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>> +	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
>> +	u16 opcode;
>> +	int ret;
>> +
>> +	mutex_lock(&virtvdev->bar_mutex);
>> +	if (read) {
>> +		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
>> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
>> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
>> +		ret = virtio_pci_admin_legacy_io_read(pdev, opcode, pos, count,
>> +						      bar0_buf + pos);
>> +		if (ret)
>> +			goto out;
>> +		if (copy_to_user(buf, bar0_buf + pos, count))
>> +			ret = -EFAULT;
>> +		goto out;
>> +	}
> TBH, I think the symmetry of read vs write would be more apparent if
> this were an else branch.
OK, will do.
>> +
>> +	if (copy_from_user(bar0_buf + pos, buf, count)) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
>> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
>> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
>> +	ret = virtio_pci_admin_legacy_io_write(pdev, opcode, pos, count,
>> +					       bar0_buf + pos);
>> +out:
>> +	mutex_unlock(&virtvdev->bar_mutex);
>> +	return ret;
>> +}
>> +
>> +static int
>> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
>> +			    loff_t pos, char __user *buf,
>> +			    size_t count, bool read)
>> +{
>> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>> +	u16 queue_notify;
>> +	int ret;
>> +
>> +	if (pos + count > virtvdev->bar0_virtual_buf_size)
>> +		return -EINVAL;
>> +
>> +	switch (pos) {
>> +	case VIRTIO_PCI_QUEUE_NOTIFY:
>> +		if (count != sizeof(queue_notify))
>> +			return -EINVAL;
>> +		if (read) {
>> +			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
>> +						virtvdev->notify_addr);
>> +			if (ret)
>> +				return ret;
>> +			if (copy_to_user(buf, &queue_notify,
>> +					 sizeof(queue_notify)))
>> +				return -EFAULT;
>> +			break;
>> +		}
> Same.
OK
>> +
>> +		if (copy_from_user(&queue_notify, buf, count))
>> +			return -EFAULT;
>> +
>> +		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
>> +					 virtvdev->notify_addr);
>> +		break;
>> +	default:
>> +		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
>> +						   read);
>> +	}
>> +
>> +	return ret ? ret : count;
>> +}
>> +
>> +static bool range_intersect_range(loff_t range1_start, size_t count1,
>> +				  loff_t range2_start, size_t count2,
>> +				  loff_t *start_offset,
>> +				  size_t *intersect_count,
>> +				  size_t *register_offset)
>> +{
>> +	if (range1_start <= range2_start &&
>> +	    range1_start + count1 > range2_start) {
>> +		*start_offset = range2_start - range1_start;
>> +		*intersect_count = min_t(size_t, count2,
>> +					 range1_start + count1 - range2_start);
>> +		if (register_offset)
>> +			*register_offset = 0;
>> +		return true;
>> +	}
>> +
>> +	if (range1_start > range2_start &&
>> +	    range1_start < range2_start + count2) {
>> +		*start_offset = range1_start;
>> +		*intersect_count = min_t(size_t, count1,
>> +					 range2_start + count2 - range1_start);
>> +		if (register_offset)
>> +			*register_offset = range1_start - range2_start;
>> +		return true;
>> +	}
> Seems like we're missing a case, and some documentation.
>
> The first test requires range1 to fully enclose range2 and provides the
> offset of range2 within range1 and the length of the intersection.
>
> The second test requires range1 to start from a non-zero offset within
> range2 and returns the absolute offset of range1 and the length of the
> intersection.
>
> The register offset is then non-zero offset of range1 into range2.  So
> does the caller use the zero value in the previous test to know range2
> exists within range1?
>
> We miss the cases where range1_start is <= range2_start and range1
> terminates within range2.

The first test should cover this case as well of the case of fully 
enclosing.

It checks whether range1_start + count1 > range2_start which can 
terminates also within range2.

Isn't it ?

I may add some documentation for that function as part of V2 as you asked.

>   I suppose we'll see below how this is used,
> but it seems asymmetric and incomplete.
>
>> +
>> +	return false;
>> +}
>> +
>> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
>> +					char __user *buf, size_t count,
>> +					loff_t *ppos)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	size_t register_offset;
>> +	loff_t copy_offset;
>> +	size_t copy_count;
>> +	__le32 val32;
>> +	__le16 val16;
>> +	u8 val8;
>> +	int ret;
>> +
>> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
>> +				  &copy_offset, &copy_count, NULL)) {
> If a user does 'setpci -s x:00.0 2.b' (range1 <= range2, but terminates
> within range2) they'll not enter this branch and see 41 rather than 00.
>
> If a user does 'setpci -s x:00.0 3.b' (range1 > range2, range 1
> contained within range 2), the above function returns a copy_offset of
> range1_start (ie. 3).  But that offset is applied to the buffer, which
> is out of bounds.  The function needs to have returned an offset of 1
> and it should have applied to the val16 address.
>
> I don't think this works like it's intended.

Is that because of the missing case ?
Please see my note above.

>
>
>> +		val16 = cpu_to_le16(0x1000);
> Please #define this somewhere rather than hiding a magic value here.
Sure, will just replace to VIRTIO_TRANS_ID_NET.
>> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	if ((virtvdev->pci_cmd & PCI_COMMAND_IO) &&
>> +	    range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
>> +				  &copy_offset, &copy_count, &register_offset)) {
>> +		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
>> +				   copy_count))
>> +			return -EFAULT;
>> +		val16 |= cpu_to_le16(PCI_COMMAND_IO);
>> +		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
>> +				 copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	if (range_intersect_range(pos, count, PCI_REVISION_ID, sizeof(val8),
>> +				  &copy_offset, &copy_count, NULL)) {
>> +		/* Transional needs to have revision 0 */
>> +		val8 = 0;
>> +		if (copy_to_user(buf + copy_offset, &val8, copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
>> +				  &copy_offset, &copy_count, NULL)) {
>> +		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
> I'd still like to see the remainder of the BAR follow the semantics
> vfio-pci does.  I think this requires a __le32 bar0 field on the
> virtvdev struct to store writes and the read here would mask the lower
> bits up to the BAR size and OR in the IO indicator bit.

OK, will do.

>
>
>> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
>> +				  &copy_offset, &copy_count, NULL)) {
>> +		/*
>> +		 * Transitional devices use the PCI subsystem device id as
>> +		 * virtio device id, same as legacy driver always did.
> Where did we require the subsystem vendor ID to be 0x1af4?  This
> subsystem device ID really only makes since given that subsystem
> vendor ID, right?  Otherwise I don't see that non-transitional devices,
> such as the VF, have a hard requirement per the spec for the subsystem
> vendor ID.
>
> Do we want to make this only probe the correct subsystem vendor ID or do
> we want to emulate the subsystem vendor ID as well?  I don't see this is
> correct without one of those options.

Looking in the 1.x spec we can see the below.

Legacy Interfaces: A Note on PCI Device Discovery

"Transitional devices MUST have the PCI Subsystem
Device ID matching the Virtio Device ID, as indicated in section 5 ...
This is to match legacy drivers."

However, there is no need to enforce Subsystem Vendor ID.

This is what we followed here.

Makes sense ?

>> +		 */
>> +		val16 = cpu_to_le16(VIRTIO_ID_NET);
>> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
>> +			return -EFAULT;
>> +	}
>> +
>> +	return count;
>> +}
>> +
>> +static ssize_t
>> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>> +		       size_t count, loff_t *ppos)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	int ret;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
>> +		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
>> +
>> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
>> +		return vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +
>> +	ret = pm_runtime_resume_and_get(&pdev->dev);
>> +	if (ret) {
>> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
>> +				     ret);
>> +		return -EIO;
>> +	}
>> +
>> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
>> +	pm_runtime_put(&pdev->dev);
>> +	return ret;
>> +}
>> +
>> +static ssize_t
>> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
>> +			size_t count, loff_t *ppos)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	int ret;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
>> +		size_t register_offset;
>> +		loff_t copy_offset;
>> +		size_t copy_count;
>> +
>> +		if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(virtvdev->pci_cmd),
>> +					  &copy_offset, &copy_count,
>> +					  &register_offset)) {
>> +			if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
>> +					   buf + copy_offset,
>> +					   copy_count))
>> +				return -EFAULT;
>> +		}
>> +
>> +		if (range_intersect_range(pos, count, pdev->msix_cap + PCI_MSIX_FLAGS,
>> +					  sizeof(virtvdev->msix_ctrl),
>> +					  &copy_offset, &copy_count,
>> +					  &register_offset)) {
>> +			if (copy_from_user((void *)&virtvdev->msix_ctrl + register_offset,
>> +					   buf + copy_offset,
>> +					   copy_count))
>> +				return -EFAULT;
>> +		}
> MSI-X is setup via ioctl, so you're relying on a userspace that writes
> through the control register bit even though it doesn't do anything.
> Why not use vfio_pci_core_device.irq_type to track if MSI-X mode is
> enabled?
OK, may switch to your suggestion post of testing it.
>
>> +	}
>> +
>> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
>> +		return vfio_pci_core_write(core_vdev, buf, count, ppos);
>> +
>> +	ret = pm_runtime_resume_and_get(&pdev->dev);
>> +	if (ret) {
>> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
>> +		return -EIO;
>> +	}
>> +
>> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
>> +	pm_runtime_put(&pdev->dev);
>> +	return ret;
>> +}
>> +
>> +static int
>> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
>> +				   unsigned int cmd, unsigned long arg)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
>> +	void __user *uarg = (void __user *)arg;
>> +	struct vfio_region_info info = {};
>> +
>> +	if (copy_from_user(&info, uarg, minsz))
>> +		return -EFAULT;
>> +
>> +	if (info.argsz < minsz)
>> +		return -EINVAL;
>> +
>> +	switch (info.index) {
>> +	case VFIO_PCI_BAR0_REGION_INDEX:
>> +		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>> +		info.size = virtvdev->bar0_virtual_buf_size;
>> +		info.flags = VFIO_REGION_INFO_FLAG_READ |
>> +			     VFIO_REGION_INFO_FLAG_WRITE;
>> +		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
>> +	default:
>> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>> +	}
>> +}
>> +
>> +static long
>> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> +			     unsigned long arg)
>> +{
>> +	switch (cmd) {
>> +	case VFIO_DEVICE_GET_REGION_INFO:
>> +		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
>> +	default:
>> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>> +	}
>> +}
>> +
>> +static int
>> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
>> +{
>> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>> +	int ret;
>> +
>> +	/*
>> +	 * Setup the BAR where the 'notify' exists to be used by vfio as well
>> +	 * This will let us mmap it only once and use it when needed.
>> +	 */
>> +	ret = vfio_pci_core_setup_barmap(core_device,
>> +					 virtvdev->notify_bar);
>> +	if (ret)
>> +		return ret;
>> +
>> +	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
>> +			virtvdev->notify_offset;
>> +	return 0;
>> +}
>> +
>> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
>> +	int ret;
>> +
>> +	ret = vfio_pci_core_enable(vdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (virtvdev->bar0_virtual_buf) {
>> +		/*
>> +		 * Upon close_device() the vfio_pci_core_disable() is called
>> +		 * and will close all the previous mmaps, so it seems that the
>> +		 * valid life cycle for the 'notify' addr is per open/close.
>> +		 */
>> +		ret = virtiovf_set_notify_addr(virtvdev);
>> +		if (ret) {
>> +			vfio_pci_core_disable(vdev);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	vfio_pci_core_finish_enable(vdev);
>> +	return 0;
>> +}
>> +
>> +static int virtiovf_get_device_config_size(unsigned short device)
>> +{
>> +	/* Network card */
>> +	return offsetofend(struct virtio_net_config, status);
>> +}
>> +
>> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
>> +{
>> +	u64 offset;
>> +	int ret;
>> +	u8 bar;
>> +
>> +	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
>> +				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
>> +				&bar, &offset);
>> +	if (ret)
>> +		return ret;
>> +
>> +	virtvdev->notify_bar = bar;
>> +	virtvdev->notify_offset = offset;
>> +	return 0;
>> +}
>> +
>> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +	struct pci_dev *pdev;
>> +	int ret;
>> +
>> +	ret = vfio_pci_core_init_dev(core_vdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	pdev = virtvdev->core_device.pdev;
>> +	ret = virtiovf_read_notify_info(virtvdev);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* Being ready with a buffer that supports MSIX */
>> +	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
>> +				virtiovf_get_device_config_size(pdev->device);
>> +	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
>> +					     GFP_KERNEL);
>> +	if (!virtvdev->bar0_virtual_buf)
>> +		return -ENOMEM;
>> +	mutex_init(&virtvdev->bar_mutex);
>> +	return 0;
>> +}
>> +
>> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>> +
>> +	kfree(virtvdev->bar0_virtual_buf);
>> +	vfio_pci_core_release_dev(core_vdev);
>> +}
>> +
>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
>> +	.name = "virtio-transitional-vfio-pci",
>> +	.init = virtiovf_pci_init_device,
>> +	.release = virtiovf_pci_core_release_dev,
>> +	.open_device = virtiovf_pci_open_device,
>> +	.close_device = vfio_pci_core_close_device,
>> +	.ioctl = virtiovf_vfio_pci_core_ioctl,
>> +	.read = virtiovf_pci_core_read,
>> +	.write = virtiovf_pci_core_write,
>> +	.mmap = vfio_pci_core_mmap,
>> +	.request = vfio_pci_core_request,
>> +	.match = vfio_pci_core_match,
>> +	.bind_iommufd = vfio_iommufd_physical_bind,
>> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
>> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
>> +};
>> +
>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
>> +	.name = "virtio-acc-vfio-pci",
>> +	.init = vfio_pci_core_init_dev,
>> +	.release = vfio_pci_core_release_dev,
>> +	.open_device = virtiovf_pci_open_device,
>> +	.close_device = vfio_pci_core_close_device,
>> +	.ioctl = vfio_pci_core_ioctl,
>> +	.device_feature = vfio_pci_core_ioctl_feature,
>> +	.read = vfio_pci_core_read,
>> +	.write = vfio_pci_core_write,
>> +	.mmap = vfio_pci_core_mmap,
>> +	.request = vfio_pci_core_request,
>> +	.match = vfio_pci_core_match,
>> +	.bind_iommufd = vfio_iommufd_physical_bind,
>> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
>> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
>> +};
>> +
>> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>> +{
>> +	struct resource *res = pdev->resource;
>> +
>> +	return res->flags ? true : false;
>> +}
>> +
>> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
>> +	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
>> +
>> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
>> +{
>> +	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
>> +	u8 *buf;
>> +	int ret;
>> +
>> +	buf = kzalloc(buf_size, GFP_KERNEL);
>> +	if (!buf)
>> +		return false;
>> +
>> +	ret = virtio_pci_admin_list_query(pdev, buf, buf_size);
>> +	if (ret)
>> +		goto end;
>> +
>> +	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
>> +		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
>> +		ret = -EOPNOTSUPP;
>> +		goto end;
>> +	}
>> +
>> +	/* Confirm the used commands */
>> +	memset(buf, 0, buf_size);
>> +	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
>> +	ret = virtio_pci_admin_list_use(pdev, buf, buf_size);
>> +end:
>> +	kfree(buf);
>> +	return ret ? false : true;
>> +}
>> +
>> +static int virtiovf_pci_probe(struct pci_dev *pdev,
>> +			      const struct pci_device_id *id)
>> +{
>> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
>> +	struct virtiovf_pci_core_device *virtvdev;
>> +	int ret;
>> +
>> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
>> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
>
> All but the last test here are fairly evident requirements of the
> driver.  Why do we require a device that supports MSI-X?

As now we check at run time to decide whether MSI-X is enabled/disabled 
to pick-up the correct op code, no need for that any more.

Will drop this MSI-X check from V2.

Thanks,
Yishai

>
> Thanks,
> Alex
>
>
>> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
>> +
>> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>> +				     &pdev->dev, ops);
>> +	if (IS_ERR(virtvdev))
>> +		return PTR_ERR(virtvdev);
>> +
>> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
>> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
>> +	if (ret)
>> +		goto out;
>> +	return 0;
>> +out:
>> +	vfio_put_device(&virtvdev->core_device.vdev);
>> +	return ret;
>> +}
>> +
>> +static void virtiovf_pci_remove(struct pci_dev *pdev)
>> +{
>> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>> +
>> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
>> +	vfio_put_device(&virtvdev->core_device.vdev);
>> +}
>> +
>> +static const struct pci_device_id virtiovf_pci_table[] = {
>> +	/* Only virtio-net is supported/tested so far */
>> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
>> +	{}
>> +};
>> +
>> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
>> +
>> +static struct pci_driver virtiovf_pci_driver = {
>> +	.name = KBUILD_MODNAME,
>> +	.id_table = virtiovf_pci_table,
>> +	.probe = virtiovf_pci_probe,
>> +	.remove = virtiovf_pci_remove,
>> +	.err_handler = &vfio_pci_core_err_handlers,
>> +	.driver_managed_dma = true,
>> +};
>> +
>> +module_pci_driver(virtiovf_pci_driver);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
>> +MODULE_DESCRIPTION(
>> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
Michael S. Tsirkin Oct. 25, 2023, 4:28 p.m. UTC | #10
On Wed, Oct 25, 2023 at 05:35:51PM +0300, Yishai Hadas wrote:
> > Do we want to make this only probe the correct subsystem vendor ID or do
> > we want to emulate the subsystem vendor ID as well?  I don't see this is
> > correct without one of those options.
> 
> Looking in the 1.x spec we can see the below.
> 
> Legacy Interfaces: A Note on PCI Device Discovery
> 
> "Transitional devices MUST have the PCI Subsystem
> Device ID matching the Virtio Device ID, as indicated in section 5 ...
> This is to match legacy drivers."
> 
> However, there is no need to enforce Subsystem Vendor ID.
> 
> This is what we followed here.
> 
> Makes sense ?

Won't work for legacy windows drivers.
Alex Williamson Oct. 25, 2023, 7:13 p.m. UTC | #11
On Wed, 25 Oct 2023 17:35:51 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 24/10/2023 22:57, Alex Williamson wrote:
> > On Tue, 17 Oct 2023 16:42:17 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> Introduce a vfio driver over virtio devices to support the legacy
> >> interface functionality for VFs.
> >>
> >> Background, from the virtio spec [1].
> >> --------------------------------------------------------------------
> >> In some systems, there is a need to support a virtio legacy driver with
> >> a device that does not directly support the legacy interface. In such
> >> scenarios, a group owner device can provide the legacy interface
> >> functionality for the group member devices. The driver of the owner
> >> device can then access the legacy interface of a member device on behalf
> >> of the legacy member device driver.
> >>
> >> For example, with the SR-IOV group type, group members (VFs) can not
> >> present the legacy interface in an I/O BAR in BAR0 as expected by the
> >> legacy pci driver. If the legacy driver is running inside a virtual
> >> machine, the hypervisor executing the virtual machine can present a
> >> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> >> legacy driver accesses to this I/O BAR and forwards them to the group
> >> owner device (PF) using group administration commands.
> >> --------------------------------------------------------------------
> >>
> >> Specifically, this driver adds support for a virtio-net VF to be exposed
> >> as a transitional device to a guest driver and allows the legacy IO BAR
> >> functionality on top.
> >>
> >> This allows a VM which uses a legacy virtio-net driver in the guest to
> >> work transparently over a VF which its driver in the host is that new
> >> driver.
> >>
> >> The driver can be extended easily to support some other types of virtio
> >> devices (e.g virtio-blk), by adding in a few places the specific type
> >> properties as was done for virtio-net.
> >>
> >> For now, only the virtio-net use case was tested and as such we introduce
> >> the support only for such a device.
> >>
> >> Practically,
> >> Upon probing a VF for a virtio-net device, in case its PF supports
> >> legacy access over the virtio admin commands and the VF doesn't have BAR
> >> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> >> transitional device with I/O BAR in BAR 0.
> >>
> >> The existence of the simulated I/O bar is reported later on by
> >> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> >> exposes itself as a transitional device by overwriting some properties
> >> upon reading its config space.
> >>
> >> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> >> guest may use it via read/write calls according to the virtio
> >> specification.
> >>
> >> Any read/write towards the control parts of the BAR will be captured by
> >> the new driver and will be translated into admin commands towards the
> >> device.
> >>
> >> Any data path read/write access (i.e. virtio driver notifications) will
> >> be forwarded to the physical BAR which its properties were supplied by
> >> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
> >> probing/init flow.
> >>
> >> With that code in place a legacy driver in the guest has the look and
> >> feel as if having a transitional device with legacy support for both its
> >> control and data path flows.
> >>
> >> [1]
> >> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
> >>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> ---
> >>   MAINTAINERS                      |   7 +
> >>   drivers/vfio/pci/Kconfig         |   2 +
> >>   drivers/vfio/pci/Makefile        |   2 +
> >>   drivers/vfio/pci/virtio/Kconfig  |  15 +
> >>   drivers/vfio/pci/virtio/Makefile |   4 +
> >>   drivers/vfio/pci/virtio/main.c   | 577 +++++++++++++++++++++++++++++++
> >>   6 files changed, 607 insertions(+)
> >>   create mode 100644 drivers/vfio/pci/virtio/Kconfig
> >>   create mode 100644 drivers/vfio/pci/virtio/Makefile
> >>   create mode 100644 drivers/vfio/pci/virtio/main.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 7a7bd8bd80e9..680a70063775 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -22620,6 +22620,13 @@ L:	kvm@vger.kernel.org
> >>   S:	Maintained
> >>   F:	drivers/vfio/pci/mlx5/
> >>   
> >> +VFIO VIRTIO PCI DRIVER
> >> +M:	Yishai Hadas <yishaih@nvidia.com>
> >> +L:	kvm@vger.kernel.org
> >> +L:	virtualization@lists.linux-foundation.org
> >> +S:	Maintained
> >> +F:	drivers/vfio/pci/virtio
> >> +
> >>   VFIO PCI DEVICE SPECIFIC DRIVERS
> >>   R:	Jason Gunthorpe <jgg@nvidia.com>
> >>   R:	Yishai Hadas <yishaih@nvidia.com>
> >> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >> index 8125e5f37832..18c397df566d 100644
> >> --- a/drivers/vfio/pci/Kconfig
> >> +++ b/drivers/vfio/pci/Kconfig
> >> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
> >>   
> >>   source "drivers/vfio/pci/pds/Kconfig"
> >>   
> >> +source "drivers/vfio/pci/virtio/Kconfig"
> >> +
> >>   endmenu
> >> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> >> index 45167be462d8..046139a4eca5 100644
> >> --- a/drivers/vfio/pci/Makefile
> >> +++ b/drivers/vfio/pci/Makefile
> >> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
> >>   obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> >>   
> >>   obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> >> +
> >> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> >> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
> >> new file mode 100644
> >> index 000000000000..89eddce8b1bd
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/virtio/Kconfig
> >> @@ -0,0 +1,15 @@
> >> +# SPDX-License-Identifier: GPL-2.0-only
> >> +config VIRTIO_VFIO_PCI
> >> +        tristate "VFIO support for VIRTIO PCI devices"
> >> +        depends on VIRTIO_PCI
> >> +        select VFIO_PCI_CORE
> >> +        help
> >> +          This provides support for exposing VIRTIO VF devices using the VFIO
> >> +          framework that can work with a legacy virtio driver in the guest.
> >> +          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
> >> +          not indicate I/O Space.
> >> +          As of that this driver emulated I/O BAR in software to let a VF be
> >> +          seen as a transitional device in the guest and let it work with
> >> +          a legacy driver.  
> > This description is a little bit subtle to the hard requirements on the
> > device.  Reading this, one might think that this should work for any
> > SR-IOV VF virtio device, when in reality it only support virtio-net
> > currently and places a number of additional requirements on the device
> > (ex. legacy access and MSI-X support).  
> 
> Sure, will change to refer only to virtio-net devices which are capable 
> for 'legacy access'.
> 
> No need to refer to MSI-X, please see below.
> 
> >  
> >> +
> >> +          If you don't know what to do here, say N.
> >> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
> >> new file mode 100644
> >> index 000000000000..2039b39fb723
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/virtio/Makefile
> >> @@ -0,0 +1,4 @@
> >> +# SPDX-License-Identifier: GPL-2.0-only
> >> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
> >> +virtio-vfio-pci-y := main.o
> >> +
> >> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
> >> new file mode 100644
> >> index 000000000000..3fef4b21f7e6
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/virtio/main.c
> >> @@ -0,0 +1,577 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#include <linux/device.h>
> >> +#include <linux/module.h>
> >> +#include <linux/mutex.h>
> >> +#include <linux/pci.h>
> >> +#include <linux/pm_runtime.h>
> >> +#include <linux/types.h>
> >> +#include <linux/uaccess.h>
> >> +#include <linux/vfio.h>
> >> +#include <linux/vfio_pci_core.h>
> >> +#include <linux/virtio_pci.h>
> >> +#include <linux/virtio_net.h>
> >> +#include <linux/virtio_pci_admin.h>
> >> +
> >> +struct virtiovf_pci_core_device {
> >> +	struct vfio_pci_core_device core_device;
> >> +	u8 bar0_virtual_buf_size;
> >> +	u8 *bar0_virtual_buf;
> >> +	/* synchronize access to the virtual buf */
> >> +	struct mutex bar_mutex;
> >> +	void __iomem *notify_addr;
> >> +	u32 notify_offset;
> >> +	u8 notify_bar;  
> > Push the above u8 to the end of the structure for better packing.  
> OK
> >> +	u16 pci_cmd;
> >> +	u16 msix_ctrl;
> >> +};
> >> +
> >> +static int
> >> +virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
> >> +			     loff_t pos, char __user *buf,
> >> +			     size_t count, bool read)
> >> +{
> >> +	bool msix_enabled = virtvdev->msix_ctrl & PCI_MSIX_FLAGS_ENABLE;
> >> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> >> +	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
> >> +	u16 opcode;
> >> +	int ret;
> >> +
> >> +	mutex_lock(&virtvdev->bar_mutex);
> >> +	if (read) {
> >> +		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
> >> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
> >> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
> >> +		ret = virtio_pci_admin_legacy_io_read(pdev, opcode, pos, count,
> >> +						      bar0_buf + pos);
> >> +		if (ret)
> >> +			goto out;
> >> +		if (copy_to_user(buf, bar0_buf + pos, count))
> >> +			ret = -EFAULT;
> >> +		goto out;
> >> +	}  
> > TBH, I think the symmetry of read vs write would be more apparent if
> > this were an else branch.  
> OK, will do.
> >> +
> >> +	if (copy_from_user(bar0_buf + pos, buf, count)) {
> >> +		ret = -EFAULT;
> >> +		goto out;
> >> +	}
> >> +
> >> +	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
> >> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
> >> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
> >> +	ret = virtio_pci_admin_legacy_io_write(pdev, opcode, pos, count,
> >> +					       bar0_buf + pos);
> >> +out:
> >> +	mutex_unlock(&virtvdev->bar_mutex);
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
> >> +			    loff_t pos, char __user *buf,
> >> +			    size_t count, bool read)
> >> +{
> >> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> >> +	u16 queue_notify;
> >> +	int ret;
> >> +
> >> +	if (pos + count > virtvdev->bar0_virtual_buf_size)
> >> +		return -EINVAL;
> >> +
> >> +	switch (pos) {
> >> +	case VIRTIO_PCI_QUEUE_NOTIFY:
> >> +		if (count != sizeof(queue_notify))
> >> +			return -EINVAL;
> >> +		if (read) {
> >> +			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
> >> +						virtvdev->notify_addr);
> >> +			if (ret)
> >> +				return ret;
> >> +			if (copy_to_user(buf, &queue_notify,
> >> +					 sizeof(queue_notify)))
> >> +				return -EFAULT;
> >> +			break;
> >> +		}  
> > Same.  
> OK
> >> +
> >> +		if (copy_from_user(&queue_notify, buf, count))
> >> +			return -EFAULT;
> >> +
> >> +		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
> >> +					 virtvdev->notify_addr);
> >> +		break;
> >> +	default:
> >> +		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
> >> +						   read);
> >> +	}
> >> +
> >> +	return ret ? ret : count;
> >> +}
> >> +
> >> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> >> +				  loff_t range2_start, size_t count2,
> >> +				  loff_t *start_offset,
> >> +				  size_t *intersect_count,
> >> +				  size_t *register_offset)
> >> +{
> >> +	if (range1_start <= range2_start &&
> >> +	    range1_start + count1 > range2_start) {
> >> +		*start_offset = range2_start - range1_start;
> >> +		*intersect_count = min_t(size_t, count2,
> >> +					 range1_start + count1 - range2_start);
> >> +		if (register_offset)
> >> +			*register_offset = 0;
> >> +		return true;
> >> +	}
> >> +
> >> +	if (range1_start > range2_start &&
> >> +	    range1_start < range2_start + count2) {
> >> +		*start_offset = range1_start;
> >> +		*intersect_count = min_t(size_t, count1,
> >> +					 range2_start + count2 - range1_start);
> >> +		if (register_offset)
> >> +			*register_offset = range1_start - range2_start;
> >> +		return true;
> >> +	}  
> > Seems like we're missing a case, and some documentation.
> >
> > The first test requires range1 to fully enclose range2 and provides the
> > offset of range2 within range1 and the length of the intersection.
> >
> > The second test requires range1 to start from a non-zero offset within
> > range2 and returns the absolute offset of range1 and the length of the
> > intersection.
> >
> > The register offset is then non-zero offset of range1 into range2.  So
> > does the caller use the zero value in the previous test to know range2
> > exists within range1?
> >
> > We miss the cases where range1_start is <= range2_start and range1
> > terminates within range2.  
> 
> The first test should cover this case as well of the case of fully 
> enclosing.
> 
> It checks whether range1_start + count1 > range2_start which can 
> terminates also within range2.
> 
> Isn't it ?

Hmm, maybe I read it wrong.  Let me try again...

The first test covers the cases where range1 starts at or below range2
and range1 extends into or through range2.  start_offset describes the
offset into range1 that range2 begins.  The intersect_count is the
extent of the intersection and it's not clear what register_offset
describes since it's zero.

The second test covers the cases where range1 starts within range2.
start_offset is the start of range1, which doesn't seem consistent with
the previous branch usage.  The intersect_count does look consistent
with the previous branch.  register_offset is then the offset of range1
into range2

So I had some things wrong, but I'm still having trouble with a
consistent definition of start_offset and register_offset.


> I may add some documentation for that function as part of V2 as you asked.
> 
> >   I suppose we'll see below how this is used,
> > but it seems asymmetric and incomplete.
> >  
> >> +
> >> +	return false;
> >> +}
> >> +
> >> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
> >> +					char __user *buf, size_t count,
> >> +					loff_t *ppos)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +	size_t register_offset;
> >> +	loff_t copy_offset;
> >> +	size_t copy_count;
> >> +	__le32 val32;
> >> +	__le16 val16;
> >> +	u8 val8;
> >> +	int ret;
> >> +
> >> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
> >> +				  &copy_offset, &copy_count, NULL)) {  
> > If a user does 'setpci -s x:00.0 2.b' (range1 <= range2, but terminates
> > within range2) they'll not enter this branch and see 41 rather than 00.

Yes, this does take the first branch per my second look, so copy_offset
is zero, copy_count is 1.  I think the copy_to_user() works correctly

> > If a user does 'setpci -s x:00.0 3.b' (range1 > range2, range 1
> > contained within range 2), the above function returns a copy_offset of
> > range1_start (ie. 3).  But that offset is applied to the buffer, which
> > is out of bounds.  The function needs to have returned an offset of 1
> > and it should have applied to the val16 address.
> >
> > I don't think this works like it's intended.  
> 
> Is that because of the missing case ?
> Please see my note above.

No, I think my original evaluation of this second case still holds,
copy_offset is wrong.  I suspect what you're trying to do with
start_offset and register_offset is specify the output buffer offset,
ie. relative to range1 or buf, or the input offset, ie. range2 or our
local val variable.  But start_offset is incorrectly calculated in the
second branch above (should always be zero) and the caller didn't ask
for the register offset here, which is seems it always should.

> >> +		val16 = cpu_to_le16(0x1000);  
> > Please #define this somewhere rather than hiding a magic value here.  
> Sure, will just replace to VIRTIO_TRANS_ID_NET.
> >> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	if ((virtvdev->pci_cmd & PCI_COMMAND_IO) &&
> >> +	    range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> >> +				  &copy_offset, &copy_count, &register_offset)) {
> >> +		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
> >> +				   copy_count))
> >> +			return -EFAULT;
> >> +		val16 |= cpu_to_le16(PCI_COMMAND_IO);
> >> +		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
> >> +				 copy_count))
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	if (range_intersect_range(pos, count, PCI_REVISION_ID, sizeof(val8),
> >> +				  &copy_offset, &copy_count, NULL)) {
> >> +		/* Transional needs to have revision 0 */
> >> +		val8 = 0;
> >> +		if (copy_to_user(buf + copy_offset, &val8, copy_count))
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
> >> +				  &copy_offset, &copy_count, NULL)) {
> >> +		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);  
> > I'd still like to see the remainder of the BAR follow the semantics
> > vfio-pci does.  I think this requires a __le32 bar0 field on the
> > virtvdev struct to store writes and the read here would mask the lower
> > bits up to the BAR size and OR in the IO indicator bit.  
> 
> OK, will do.
> 
> >
> >  
> >> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> >> +				  &copy_offset, &copy_count, NULL)) {
> >> +		/*
> >> +		 * Transitional devices use the PCI subsystem device id as
> >> +		 * virtio device id, same as legacy driver always did.  
> > Where did we require the subsystem vendor ID to be 0x1af4?  This
> > subsystem device ID really only makes since given that subsystem
> > vendor ID, right?  Otherwise I don't see that non-transitional devices,
> > such as the VF, have a hard requirement per the spec for the subsystem
> > vendor ID.
> >
> > Do we want to make this only probe the correct subsystem vendor ID or do
> > we want to emulate the subsystem vendor ID as well?  I don't see this is
> > correct without one of those options.  
> 
> Looking in the 1.x spec we can see the below.
> 
> Legacy Interfaces: A Note on PCI Device Discovery
> 
> "Transitional devices MUST have the PCI Subsystem
> Device ID matching the Virtio Device ID, as indicated in section 5 ...
> This is to match legacy drivers."
> 
> However, there is no need to enforce Subsystem Vendor ID.
> 
> This is what we followed here.
> 
> Makes sense ?

So do I understand correctly that virtio dictates the subsystem device
ID for all subsystem vendor IDs that implement a legacy virtio
interface?  Ok, but this device didn't actually implement a legacy
virtio interface.  The device itself is not tranistional, we're imposing
an emulated transitional interface onto it.  So did the subsystem vendor
agree to have their subsystem device ID managed by the virtio committee
or might we create conflicts?  I imagine we know we don't have a
conflict if we also virtualize the subsystem vendor ID.


BTW, it would be a lot easier for all of the config space emulation here
if we could make use of the existing field virtualization in
vfio-pci-core.  In fact you'll see in vfio_config_init() that
PCI_DEVICE_ID is already virtualized for VFs, so it would be enough to
simply do the following to report the desired device ID:

	*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(0x1000);

It appears everything in this function could be handled similarly by
vfio-pci-core if the right fields in the perm_bits.virt and .write
bits could be manipulated and vconfig modified appropriately.  I'd look
for a way that a variant driver could provide an alternate set of
permissions structures for various capabilities.  Thanks,

Alex


> >> +		 */
> >> +		val16 = cpu_to_le16(VIRTIO_ID_NET);
> >> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
> >> +			return -EFAULT;
> >> +	}
> >> +
> >> +	return count;
> >> +}
> >> +
> >> +static ssize_t
> >> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> >> +		       size_t count, loff_t *ppos)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> >> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> >> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +	int ret;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> >> +		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
> >> +
> >> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> >> +		return vfio_pci_core_read(core_vdev, buf, count, ppos);
> >> +
> >> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> >> +	if (ret) {
> >> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
> >> +				     ret);
> >> +		return -EIO;
> >> +	}
> >> +
> >> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
> >> +	pm_runtime_put(&pdev->dev);
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t
> >> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> >> +			size_t count, loff_t *ppos)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
> >> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> >> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +	int ret;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
> >> +		size_t register_offset;
> >> +		loff_t copy_offset;
> >> +		size_t copy_count;
> >> +
> >> +		if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(virtvdev->pci_cmd),
> >> +					  &copy_offset, &copy_count,
> >> +					  &register_offset)) {
> >> +			if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
> >> +					   buf + copy_offset,
> >> +					   copy_count))
> >> +				return -EFAULT;
> >> +		}
> >> +
> >> +		if (range_intersect_range(pos, count, pdev->msix_cap + PCI_MSIX_FLAGS,
> >> +					  sizeof(virtvdev->msix_ctrl),
> >> +					  &copy_offset, &copy_count,
> >> +					  &register_offset)) {
> >> +			if (copy_from_user((void *)&virtvdev->msix_ctrl + register_offset,
> >> +					   buf + copy_offset,
> >> +					   copy_count))
> >> +				return -EFAULT;
> >> +		}  
> > MSI-X is setup via ioctl, so you're relying on a userspace that writes
> > through the control register bit even though it doesn't do anything.
> > Why not use vfio_pci_core_device.irq_type to track if MSI-X mode is
> > enabled?  
> OK, may switch to your suggestion post of testing it.
> >  
> >> +	}
> >> +
> >> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
> >> +		return vfio_pci_core_write(core_vdev, buf, count, ppos);
> >> +
> >> +	ret = pm_runtime_resume_and_get(&pdev->dev);
> >> +	if (ret) {
> >> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
> >> +		return -EIO;
> >> +	}
> >> +
> >> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
> >> +	pm_runtime_put(&pdev->dev);
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> >> +				   unsigned int cmd, unsigned long arg)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> >> +	void __user *uarg = (void __user *)arg;
> >> +	struct vfio_region_info info = {};
> >> +
> >> +	if (copy_from_user(&info, uarg, minsz))
> >> +		return -EFAULT;
> >> +
> >> +	if (info.argsz < minsz)
> >> +		return -EINVAL;
> >> +
> >> +	switch (info.index) {
> >> +	case VFIO_PCI_BAR0_REGION_INDEX:
> >> +		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> >> +		info.size = virtvdev->bar0_virtual_buf_size;
> >> +		info.flags = VFIO_REGION_INFO_FLAG_READ |
> >> +			     VFIO_REGION_INFO_FLAG_WRITE;
> >> +		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> >> +	default:
> >> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> >> +	}
> >> +}
> >> +
> >> +static long
> >> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >> +			     unsigned long arg)
> >> +{
> >> +	switch (cmd) {
> >> +	case VFIO_DEVICE_GET_REGION_INFO:
> >> +		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
> >> +	default:
> >> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> >> +	}
> >> +}
> >> +
> >> +static int
> >> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
> >> +{
> >> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
> >> +	int ret;
> >> +
> >> +	/*
> >> +	 * Setup the BAR where the 'notify' exists to be used by vfio as well
> >> +	 * This will let us mmap it only once and use it when needed.
> >> +	 */
> >> +	ret = vfio_pci_core_setup_barmap(core_device,
> >> +					 virtvdev->notify_bar);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
> >> +			virtvdev->notify_offset;
> >> +	return 0;
> >> +}
> >> +
> >> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
> >> +	int ret;
> >> +
> >> +	ret = vfio_pci_core_enable(vdev);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	if (virtvdev->bar0_virtual_buf) {
> >> +		/*
> >> +		 * Upon close_device() the vfio_pci_core_disable() is called
> >> +		 * and will close all the previous mmaps, so it seems that the
> >> +		 * valid life cycle for the 'notify' addr is per open/close.
> >> +		 */
> >> +		ret = virtiovf_set_notify_addr(virtvdev);
> >> +		if (ret) {
> >> +			vfio_pci_core_disable(vdev);
> >> +			return ret;
> >> +		}
> >> +	}
> >> +
> >> +	vfio_pci_core_finish_enable(vdev);
> >> +	return 0;
> >> +}
> >> +
> >> +static int virtiovf_get_device_config_size(unsigned short device)
> >> +{
> >> +	/* Network card */
> >> +	return offsetofend(struct virtio_net_config, status);
> >> +}
> >> +
> >> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
> >> +{
> >> +	u64 offset;
> >> +	int ret;
> >> +	u8 bar;
> >> +
> >> +	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
> >> +				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
> >> +				&bar, &offset);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	virtvdev->notify_bar = bar;
> >> +	virtvdev->notify_offset = offset;
> >> +	return 0;
> >> +}
> >> +
> >> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +	struct pci_dev *pdev;
> >> +	int ret;
> >> +
> >> +	ret = vfio_pci_core_init_dev(core_vdev);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	pdev = virtvdev->core_device.pdev;
> >> +	ret = virtiovf_read_notify_info(virtvdev);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	/* Being ready with a buffer that supports MSIX */
> >> +	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
> >> +				virtiovf_get_device_config_size(pdev->device);
> >> +	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
> >> +					     GFP_KERNEL);
> >> +	if (!virtvdev->bar0_virtual_buf)
> >> +		return -ENOMEM;
> >> +	mutex_init(&virtvdev->bar_mutex);
> >> +	return 0;
> >> +}
> >> +
> >> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = container_of(
> >> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
> >> +
> >> +	kfree(virtvdev->bar0_virtual_buf);
> >> +	vfio_pci_core_release_dev(core_vdev);
> >> +}
> >> +
> >> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
> >> +	.name = "virtio-transitional-vfio-pci",
> >> +	.init = virtiovf_pci_init_device,
> >> +	.release = virtiovf_pci_core_release_dev,
> >> +	.open_device = virtiovf_pci_open_device,
> >> +	.close_device = vfio_pci_core_close_device,
> >> +	.ioctl = virtiovf_vfio_pci_core_ioctl,
> >> +	.read = virtiovf_pci_core_read,
> >> +	.write = virtiovf_pci_core_write,
> >> +	.mmap = vfio_pci_core_mmap,
> >> +	.request = vfio_pci_core_request,
> >> +	.match = vfio_pci_core_match,
> >> +	.bind_iommufd = vfio_iommufd_physical_bind,
> >> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> >> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> >> +};
> >> +
> >> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
> >> +	.name = "virtio-acc-vfio-pci",
> >> +	.init = vfio_pci_core_init_dev,
> >> +	.release = vfio_pci_core_release_dev,
> >> +	.open_device = virtiovf_pci_open_device,
> >> +	.close_device = vfio_pci_core_close_device,
> >> +	.ioctl = vfio_pci_core_ioctl,
> >> +	.device_feature = vfio_pci_core_ioctl_feature,
> >> +	.read = vfio_pci_core_read,
> >> +	.write = vfio_pci_core_write,
> >> +	.mmap = vfio_pci_core_mmap,
> >> +	.request = vfio_pci_core_request,
> >> +	.match = vfio_pci_core_match,
> >> +	.bind_iommufd = vfio_iommufd_physical_bind,
> >> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> >> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> >> +};
> >> +
> >> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
> >> +{
> >> +	struct resource *res = pdev->resource;
> >> +
> >> +	return res->flags ? true : false;
> >> +}
> >> +
> >> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
> >> +	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
> >> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
> >> +
> >> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
> >> +{
> >> +	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
> >> +	u8 *buf;
> >> +	int ret;
> >> +
> >> +	buf = kzalloc(buf_size, GFP_KERNEL);
> >> +	if (!buf)
> >> +		return false;
> >> +
> >> +	ret = virtio_pci_admin_list_query(pdev, buf, buf_size);
> >> +	if (ret)
> >> +		goto end;
> >> +
> >> +	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
> >> +		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
> >> +		ret = -EOPNOTSUPP;
> >> +		goto end;
> >> +	}
> >> +
> >> +	/* Confirm the used commands */
> >> +	memset(buf, 0, buf_size);
> >> +	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
> >> +	ret = virtio_pci_admin_list_use(pdev, buf, buf_size);
> >> +end:
> >> +	kfree(buf);
> >> +	return ret ? false : true;
> >> +}
> >> +
> >> +static int virtiovf_pci_probe(struct pci_dev *pdev,
> >> +			      const struct pci_device_id *id)
> >> +{
> >> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
> >> +	struct virtiovf_pci_core_device *virtvdev;
> >> +	int ret;
> >> +
> >> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
> >> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)  
> >
> > All but the last test here are fairly evident requirements of the
> > driver.  Why do we require a device that supports MSI-X?  
> 
> As now we check at run time to decide whether MSI-X is enabled/disabled 
> to pick-up the correct op code, no need for that any more.
> 
> Will drop this MSI-X check from V2.
> 
> Thanks,
> Yishai
> 
> >
> > Thanks,
> > Alex
> >
> >  
> >> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
> >> +
> >> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
> >> +				     &pdev->dev, ops);
> >> +	if (IS_ERR(virtvdev))
> >> +		return PTR_ERR(virtvdev);
> >> +
> >> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
> >> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
> >> +	if (ret)
> >> +		goto out;
> >> +	return 0;
> >> +out:
> >> +	vfio_put_device(&virtvdev->core_device.vdev);
> >> +	return ret;
> >> +}
> >> +
> >> +static void virtiovf_pci_remove(struct pci_dev *pdev)
> >> +{
> >> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
> >> +
> >> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
> >> +	vfio_put_device(&virtvdev->core_device.vdev);
> >> +}
> >> +
> >> +static const struct pci_device_id virtiovf_pci_table[] = {
> >> +	/* Only virtio-net is supported/tested so far */
> >> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
> >> +	{}
> >> +};
> >> +
> >> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
> >> +
> >> +static struct pci_driver virtiovf_pci_driver = {
> >> +	.name = KBUILD_MODNAME,
> >> +	.id_table = virtiovf_pci_table,
> >> +	.probe = virtiovf_pci_probe,
> >> +	.remove = virtiovf_pci_remove,
> >> +	.err_handler = &vfio_pci_core_err_handlers,
> >> +	.driver_managed_dma = true,
> >> +};
> >> +
> >> +module_pci_driver(virtiovf_pci_driver);
> >> +
> >> +MODULE_LICENSE("GPL");
> >> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
> >> +MODULE_DESCRIPTION(
> >> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");  
> 
>
Yishai Hadas Oct. 26, 2023, 12:08 p.m. UTC | #12
On 25/10/2023 22:13, Alex Williamson wrote:
> On Wed, 25 Oct 2023 17:35:51 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> On 24/10/2023 22:57, Alex Williamson wrote:
>>> On Tue, 17 Oct 2023 16:42:17 +0300
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>   
>>>> Introduce a vfio driver over virtio devices to support the legacy
>>>> interface functionality for VFs.
>>>>
>>>> Background, from the virtio spec [1].
>>>> --------------------------------------------------------------------
>>>> In some systems, there is a need to support a virtio legacy driver with
>>>> a device that does not directly support the legacy interface. In such
>>>> scenarios, a group owner device can provide the legacy interface
>>>> functionality for the group member devices. The driver of the owner
>>>> device can then access the legacy interface of a member device on behalf
>>>> of the legacy member device driver.
>>>>
>>>> For example, with the SR-IOV group type, group members (VFs) can not
>>>> present the legacy interface in an I/O BAR in BAR0 as expected by the
>>>> legacy pci driver. If the legacy driver is running inside a virtual
>>>> machine, the hypervisor executing the virtual machine can present a
>>>> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
>>>> legacy driver accesses to this I/O BAR and forwards them to the group
>>>> owner device (PF) using group administration commands.
>>>> --------------------------------------------------------------------
>>>>
>>>> Specifically, this driver adds support for a virtio-net VF to be exposed
>>>> as a transitional device to a guest driver and allows the legacy IO BAR
>>>> functionality on top.
>>>>
>>>> This allows a VM which uses a legacy virtio-net driver in the guest to
>>>> work transparently over a VF which its driver in the host is that new
>>>> driver.
>>>>
>>>> The driver can be extended easily to support some other types of virtio
>>>> devices (e.g virtio-blk), by adding in a few places the specific type
>>>> properties as was done for virtio-net.
>>>>
>>>> For now, only the virtio-net use case was tested and as such we introduce
>>>> the support only for such a device.
>>>>
>>>> Practically,
>>>> Upon probing a VF for a virtio-net device, in case its PF supports
>>>> legacy access over the virtio admin commands and the VF doesn't have BAR
>>>> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
>>>> transitional device with I/O BAR in BAR 0.
>>>>
>>>> The existence of the simulated I/O bar is reported later on by
>>>> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
>>>> exposes itself as a transitional device by overwriting some properties
>>>> upon reading its config space.
>>>>
>>>> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
>>>> guest may use it via read/write calls according to the virtio
>>>> specification.
>>>>
>>>> Any read/write towards the control parts of the BAR will be captured by
>>>> the new driver and will be translated into admin commands towards the
>>>> device.
>>>>
>>>> Any data path read/write access (i.e. virtio driver notifications) will
>>>> be forwarded to the physical BAR which its properties were supplied by
>>>> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
>>>> probing/init flow.
>>>>
>>>> With that code in place a legacy driver in the guest has the look and
>>>> feel as if having a transitional device with legacy support for both its
>>>> control and data path flows.
>>>>
>>>> [1]
>>>> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
>>>>
>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>> ---
>>>>    MAINTAINERS                      |   7 +
>>>>    drivers/vfio/pci/Kconfig         |   2 +
>>>>    drivers/vfio/pci/Makefile        |   2 +
>>>>    drivers/vfio/pci/virtio/Kconfig  |  15 +
>>>>    drivers/vfio/pci/virtio/Makefile |   4 +
>>>>    drivers/vfio/pci/virtio/main.c   | 577 +++++++++++++++++++++++++++++++
>>>>    6 files changed, 607 insertions(+)
>>>>    create mode 100644 drivers/vfio/pci/virtio/Kconfig
>>>>    create mode 100644 drivers/vfio/pci/virtio/Makefile
>>>>    create mode 100644 drivers/vfio/pci/virtio/main.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 7a7bd8bd80e9..680a70063775 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -22620,6 +22620,13 @@ L:	kvm@vger.kernel.org
>>>>    S:	Maintained
>>>>    F:	drivers/vfio/pci/mlx5/
>>>>    
>>>> +VFIO VIRTIO PCI DRIVER
>>>> +M:	Yishai Hadas <yishaih@nvidia.com>
>>>> +L:	kvm@vger.kernel.org
>>>> +L:	virtualization@lists.linux-foundation.org
>>>> +S:	Maintained
>>>> +F:	drivers/vfio/pci/virtio
>>>> +
>>>>    VFIO PCI DEVICE SPECIFIC DRIVERS
>>>>    R:	Jason Gunthorpe <jgg@nvidia.com>
>>>>    R:	Yishai Hadas <yishaih@nvidia.com>
>>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>>> index 8125e5f37832..18c397df566d 100644
>>>> --- a/drivers/vfio/pci/Kconfig
>>>> +++ b/drivers/vfio/pci/Kconfig
>>>> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>>>>    
>>>>    source "drivers/vfio/pci/pds/Kconfig"
>>>>    
>>>> +source "drivers/vfio/pci/virtio/Kconfig"
>>>> +
>>>>    endmenu
>>>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>>>> index 45167be462d8..046139a4eca5 100644
>>>> --- a/drivers/vfio/pci/Makefile
>>>> +++ b/drivers/vfio/pci/Makefile
>>>> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>>>>    obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>>>>    
>>>>    obj-$(CONFIG_PDS_VFIO_PCI) += pds/
>>>> +
>>>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
>>>> diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
>>>> new file mode 100644
>>>> index 000000000000..89eddce8b1bd
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/pci/virtio/Kconfig
>>>> @@ -0,0 +1,15 @@
>>>> +# SPDX-License-Identifier: GPL-2.0-only
>>>> +config VIRTIO_VFIO_PCI
>>>> +        tristate "VFIO support for VIRTIO PCI devices"
>>>> +        depends on VIRTIO_PCI
>>>> +        select VFIO_PCI_CORE
>>>> +        help
>>>> +          This provides support for exposing VIRTIO VF devices using the VFIO
>>>> +          framework that can work with a legacy virtio driver in the guest.
>>>> +          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
>>>> +          not indicate I/O Space.
>>>> +          As of that this driver emulated I/O BAR in software to let a VF be
>>>> +          seen as a transitional device in the guest and let it work with
>>>> +          a legacy driver.
>>> This description is a little bit subtle to the hard requirements on the
>>> device.  Reading this, one might think that this should work for any
>>> SR-IOV VF virtio device, when in reality it only support virtio-net
>>> currently and places a number of additional requirements on the device
>>> (ex. legacy access and MSI-X support).
>> Sure, will change to refer only to virtio-net devices which are capable
>> for 'legacy access'.
>>
>> No need to refer to MSI-X, please see below.
>>
>>>   
>>>> +
>>>> +          If you don't know what to do here, say N.
>>>> diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
>>>> new file mode 100644
>>>> index 000000000000..2039b39fb723
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/pci/virtio/Makefile
>>>> @@ -0,0 +1,4 @@
>>>> +# SPDX-License-Identifier: GPL-2.0-only
>>>> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
>>>> +virtio-vfio-pci-y := main.o
>>>> +
>>>> diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
>>>> new file mode 100644
>>>> index 000000000000..3fef4b21f7e6
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/pci/virtio/main.c
>>>> @@ -0,0 +1,577 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>>>> + */
>>>> +
>>>> +#include <linux/device.h>
>>>> +#include <linux/module.h>
>>>> +#include <linux/mutex.h>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/pm_runtime.h>
>>>> +#include <linux/types.h>
>>>> +#include <linux/uaccess.h>
>>>> +#include <linux/vfio.h>
>>>> +#include <linux/vfio_pci_core.h>
>>>> +#include <linux/virtio_pci.h>
>>>> +#include <linux/virtio_net.h>
>>>> +#include <linux/virtio_pci_admin.h>
>>>> +
>>>> +struct virtiovf_pci_core_device {
>>>> +	struct vfio_pci_core_device core_device;
>>>> +	u8 bar0_virtual_buf_size;
>>>> +	u8 *bar0_virtual_buf;
>>>> +	/* synchronize access to the virtual buf */
>>>> +	struct mutex bar_mutex;
>>>> +	void __iomem *notify_addr;
>>>> +	u32 notify_offset;
>>>> +	u8 notify_bar;
>>> Push the above u8 to the end of the structure for better packing.
>> OK
>>>> +	u16 pci_cmd;
>>>> +	u16 msix_ctrl;
>>>> +};
>>>> +
>>>> +static int
>>>> +virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
>>>> +			     loff_t pos, char __user *buf,
>>>> +			     size_t count, bool read)
>>>> +{
>>>> +	bool msix_enabled = virtvdev->msix_ctrl & PCI_MSIX_FLAGS_ENABLE;
>>>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>>>> +	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
>>>> +	u16 opcode;
>>>> +	int ret;
>>>> +
>>>> +	mutex_lock(&virtvdev->bar_mutex);
>>>> +	if (read) {
>>>> +		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
>>>> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
>>>> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
>>>> +		ret = virtio_pci_admin_legacy_io_read(pdev, opcode, pos, count,
>>>> +						      bar0_buf + pos);
>>>> +		if (ret)
>>>> +			goto out;
>>>> +		if (copy_to_user(buf, bar0_buf + pos, count))
>>>> +			ret = -EFAULT;
>>>> +		goto out;
>>>> +	}
>>> TBH, I think the symmetry of read vs write would be more apparent if
>>> this were an else branch.
>> OK, will do.
>>>> +
>>>> +	if (copy_from_user(bar0_buf + pos, buf, count)) {
>>>> +		ret = -EFAULT;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
>>>> +			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
>>>> +			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
>>>> +	ret = virtio_pci_admin_legacy_io_write(pdev, opcode, pos, count,
>>>> +					       bar0_buf + pos);
>>>> +out:
>>>> +	mutex_unlock(&virtvdev->bar_mutex);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int
>>>> +translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
>>>> +			    loff_t pos, char __user *buf,
>>>> +			    size_t count, bool read)
>>>> +{
>>>> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>>>> +	u16 queue_notify;
>>>> +	int ret;
>>>> +
>>>> +	if (pos + count > virtvdev->bar0_virtual_buf_size)
>>>> +		return -EINVAL;
>>>> +
>>>> +	switch (pos) {
>>>> +	case VIRTIO_PCI_QUEUE_NOTIFY:
>>>> +		if (count != sizeof(queue_notify))
>>>> +			return -EINVAL;
>>>> +		if (read) {
>>>> +			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
>>>> +						virtvdev->notify_addr);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +			if (copy_to_user(buf, &queue_notify,
>>>> +					 sizeof(queue_notify)))
>>>> +				return -EFAULT;
>>>> +			break;
>>>> +		}
>>> Same.
>> OK
>>>> +
>>>> +		if (copy_from_user(&queue_notify, buf, count))
>>>> +			return -EFAULT;
>>>> +
>>>> +		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
>>>> +					 virtvdev->notify_addr);
>>>> +		break;
>>>> +	default:
>>>> +		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
>>>> +						   read);
>>>> +	}
>>>> +
>>>> +	return ret ? ret : count;
>>>> +}
>>>> +
>>>> +static bool range_intersect_range(loff_t range1_start, size_t count1,
>>>> +				  loff_t range2_start, size_t count2,
>>>> +				  loff_t *start_offset,
>>>> +				  size_t *intersect_count,
>>>> +				  size_t *register_offset)
>>>> +{
>>>> +	if (range1_start <= range2_start &&
>>>> +	    range1_start + count1 > range2_start) {
>>>> +		*start_offset = range2_start - range1_start;
>>>> +		*intersect_count = min_t(size_t, count2,
>>>> +					 range1_start + count1 - range2_start);
>>>> +		if (register_offset)
>>>> +			*register_offset = 0;
>>>> +		return true;
>>>> +	}
>>>> +
>>>> +	if (range1_start > range2_start &&
>>>> +	    range1_start < range2_start + count2) {
>>>> +		*start_offset = range1_start;
>>>> +		*intersect_count = min_t(size_t, count1,
>>>> +					 range2_start + count2 - range1_start);
>>>> +		if (register_offset)
>>>> +			*register_offset = range1_start - range2_start;
>>>> +		return true;
>>>> +	}
>>> Seems like we're missing a case, and some documentation.
>>>
>>> The first test requires range1 to fully enclose range2 and provides the
>>> offset of range2 within range1 and the length of the intersection.
>>>
>>> The second test requires range1 to start from a non-zero offset within
>>> range2 and returns the absolute offset of range1 and the length of the
>>> intersection.
>>>
>>> The register offset is then non-zero offset of range1 into range2.  So
>>> does the caller use the zero value in the previous test to know range2
>>> exists within range1?
>>>
>>> We miss the cases where range1_start is <= range2_start and range1
>>> terminates within range2.
>> The first test should cover this case as well of the case of fully
>> enclosing.
>>
>> It checks whether range1_start + count1 > range2_start which can
>> terminates also within range2.
>>
>> Isn't it ?
> Hmm, maybe I read it wrong.  Let me try again...
>
> The first test covers the cases where range1 starts at or below range2
> and range1 extends into or through range2.  start_offset describes the
> offset into range1 that range2 begins.  The intersect_count is the
> extent of the intersection and it's not clear what register_offset
> describes since it's zero.
>
> The second test covers the cases where range1 starts within range2.
> start_offset is the start of range1, which doesn't seem consistent with
> the previous branch usage.

Right, start_offest needs to be 0 in that second branch.


>   The intersect_count does look consistent
> with the previous branch.  register_offset is then the offset of range1
> into range2
>
> So I had some things wrong, but I'm still having trouble with a
> consistent definition of start_offset and register_offset.
>
>
>> I may add some documentation for that function as part of V2 as you asked.
>>
>>>    I suppose we'll see below how this is used,
>>> but it seems asymmetric and incomplete.
>>>   
>>>> +
>>>> +	return false;
>>>> +}
>>>> +
>>>> +static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
>>>> +					char __user *buf, size_t count,
>>>> +					loff_t *ppos)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>>>> +	size_t register_offset;
>>>> +	loff_t copy_offset;
>>>> +	size_t copy_count;
>>>> +	__le32 val32;
>>>> +	__le16 val16;
>>>> +	u8 val8;
>>>> +	int ret;
>>>> +
>>>> +	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
>>>> +				  &copy_offset, &copy_count, NULL)) {
>>> If a user does 'setpci -s x:00.0 2.b' (range1 <= range2, but terminates
>>> within range2) they'll not enter this branch and see 41 rather than 00.
> Yes, this does take the first branch per my second look, so copy_offset
> is zero, copy_count is 1.  I think the copy_to_user() works correctly
>
>>> If a user does 'setpci -s x:00.0 3.b' (range1 > range2, range 1
>>> contained within range 2), the above function returns a copy_offset of
>>> range1_start (ie. 3).  But that offset is applied to the buffer, which
>>> is out of bounds.  The function needs to have returned an offset of 1
>>> and it should have applied to the val16 address.
>>>
>>> I don't think this works like it's intended.
>> Is that because of the missing case ?
>> Please see my note above.
> No, I think my original evaluation of this second case still holds,
> copy_offset is wrong.  I suspect what you're trying to do with
> start_offset and register_offset is specify the output buffer offset,
> ie. relative to range1 or buf, or the input offset, ie. range2 or our
> local val variable.  But start_offset is incorrectly calculated in the
> second branch above (should always be zero) and the caller didn't ask
> for the register offset here, which is seems it always should.

OK,  I got what you said.

Yes, start_offset should always be 0 in the second branch, and the 
caller may need always to ask for the register_offest and use it.

As current QEMU code doesn't read partial register/field we didn't get 
to the second branch and to the above issues.

I will fix as part of V2, it should be a simple change.

>
>>>> +		val16 = cpu_to_le16(0x1000);
>>> Please #define this somewhere rather than hiding a magic value here.
>> Sure, will just replace to VIRTIO_TRANS_ID_NET.
>>>> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
>>>> +			return -EFAULT;
>>>> +	}
>>>> +
>>>> +	if ((virtvdev->pci_cmd & PCI_COMMAND_IO) &&
>>>> +	    range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
>>>> +				  &copy_offset, &copy_count, &register_offset)) {
>>>> +		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
>>>> +				   copy_count))
>>>> +			return -EFAULT;
>>>> +		val16 |= cpu_to_le16(PCI_COMMAND_IO);
>>>> +		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
>>>> +				 copy_count))
>>>> +			return -EFAULT;
>>>> +	}
>>>> +
>>>> +	if (range_intersect_range(pos, count, PCI_REVISION_ID, sizeof(val8),
>>>> +				  &copy_offset, &copy_count, NULL)) {
>>>> +		/* Transional needs to have revision 0 */
>>>> +		val8 = 0;
>>>> +		if (copy_to_user(buf + copy_offset, &val8, copy_count))
>>>> +			return -EFAULT;
>>>> +	}
>>>> +
>>>> +	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
>>>> +				  &copy_offset, &copy_count, NULL)) {
>>>> +		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
>>> I'd still like to see the remainder of the BAR follow the semantics
>>> vfio-pci does.  I think this requires a __le32 bar0 field on the
>>> virtvdev struct to store writes and the read here would mask the lower
>>> bits up to the BAR size and OR in the IO indicator bit.
>> OK, will do.
>>
>>>   
>>>> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
>>>> +			return -EFAULT;
>>>> +	}
>>>> +
>>>> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
>>>> +				  &copy_offset, &copy_count, NULL)) {
>>>> +		/*
>>>> +		 * Transitional devices use the PCI subsystem device id as
>>>> +		 * virtio device id, same as legacy driver always did.
>>> Where did we require the subsystem vendor ID to be 0x1af4?  This
>>> subsystem device ID really only makes since given that subsystem
>>> vendor ID, right?  Otherwise I don't see that non-transitional devices,
>>> such as the VF, have a hard requirement per the spec for the subsystem
>>> vendor ID.
>>>
>>> Do we want to make this only probe the correct subsystem vendor ID or do
>>> we want to emulate the subsystem vendor ID as well?  I don't see this is
>>> correct without one of those options.
>> Looking in the 1.x spec we can see the below.
>>
>> Legacy Interfaces: A Note on PCI Device Discovery
>>
>> "Transitional devices MUST have the PCI Subsystem
>> Device ID matching the Virtio Device ID, as indicated in section 5 ...
>> This is to match legacy drivers."
>>
>> However, there is no need to enforce Subsystem Vendor ID.
>>
>> This is what we followed here.
>>
>> Makes sense ?
> So do I understand correctly that virtio dictates the subsystem device
> ID for all subsystem vendor IDs that implement a legacy virtio
> interface?  Ok, but this device didn't actually implement a legacy
> virtio interface.  The device itself is not tranistional, we're imposing
> an emulated transitional interface onto it.  So did the subsystem vendor
> agree to have their subsystem device ID managed by the virtio committee
> or might we create conflicts?  I imagine we know we don't have a
> conflict if we also virtualize the subsystem vendor ID.
>
The non transitional net device in the virtio spec defined as the below 
tuple.
T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.

And transitional net device in the virtio spec for a vendor FOO is 
defined as:
T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1

This driver is converting T_A to T_B, which both are defined by the 
virtio spec.
Hence, it does not conflict for the subsystem vendor, it is fine.
> BTW, it would be a lot easier for all of the config space emulation here
> if we could make use of the existing field virtualization in
> vfio-pci-core.  In fact you'll see in vfio_config_init() that
> PCI_DEVICE_ID is already virtualized for VFs, so it would be enough to
> simply do the following to report the desired device ID:
>
> 	*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(0x1000);

I would prefer keeping things simple and have one place/flow that 
handles all the fields as we have now as part of the driver.

In any case, I'll further look at that option for managing the DEVICE_ID 
towards V2.

> It appears everything in this function could be handled similarly by
> vfio-pci-core if the right fields in the perm_bits.virt and .write
> bits could be manipulated and vconfig modified appropriately.  I'd look
> for a way that a variant driver could provide an alternate set of
> permissions structures for various capabilities.  Thanks,

OK

However, let's not block V2 and the series acceptance as of that.

It can always be some future refactoring as part of other series that 
will bring the infra-structure that is needed for that.

Yishai

>
> Alex
>
>
>>>> +		 */
>>>> +		val16 = cpu_to_le16(VIRTIO_ID_NET);
>>>> +		if (copy_to_user(buf + copy_offset, &val16, copy_count))
>>>> +			return -EFAULT;
>>>> +	}
>>>> +
>>>> +	return count;
>>>> +}
>>>> +
>>>> +static ssize_t
>>>> +virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>>>> +		       size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>>>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>>>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>>>> +	int ret;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
>>>> +		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
>>>> +
>>>> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
>>>> +		return vfio_pci_core_read(core_vdev, buf, count, ppos);
>>>> +
>>>> +	ret = pm_runtime_resume_and_get(&pdev->dev);
>>>> +	if (ret) {
>>>> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
>>>> +				     ret);
>>>> +		return -EIO;
>>>> +	}
>>>> +
>>>> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
>>>> +	pm_runtime_put(&pdev->dev);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static ssize_t
>>>> +virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
>>>> +			size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	struct pci_dev *pdev = virtvdev->core_device.pdev;
>>>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>>>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>>>> +	int ret;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
>>>> +		size_t register_offset;
>>>> +		loff_t copy_offset;
>>>> +		size_t copy_count;
>>>> +
>>>> +		if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(virtvdev->pci_cmd),
>>>> +					  &copy_offset, &copy_count,
>>>> +					  &register_offset)) {
>>>> +			if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
>>>> +					   buf + copy_offset,
>>>> +					   copy_count))
>>>> +				return -EFAULT;
>>>> +		}
>>>> +
>>>> +		if (range_intersect_range(pos, count, pdev->msix_cap + PCI_MSIX_FLAGS,
>>>> +					  sizeof(virtvdev->msix_ctrl),
>>>> +					  &copy_offset, &copy_count,
>>>> +					  &register_offset)) {
>>>> +			if (copy_from_user((void *)&virtvdev->msix_ctrl + register_offset,
>>>> +					   buf + copy_offset,
>>>> +					   copy_count))
>>>> +				return -EFAULT;
>>>> +		}
>>> MSI-X is setup via ioctl, so you're relying on a userspace that writes
>>> through the control register bit even though it doesn't do anything.
>>> Why not use vfio_pci_core_device.irq_type to track if MSI-X mode is
>>> enabled?
>> OK, may switch to your suggestion post of testing it.
>>>   
>>>> +	}
>>>> +
>>>> +	if (index != VFIO_PCI_BAR0_REGION_INDEX)
>>>> +		return vfio_pci_core_write(core_vdev, buf, count, ppos);
>>>> +
>>>> +	ret = pm_runtime_resume_and_get(&pdev->dev);
>>>> +	if (ret) {
>>>> +		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
>>>> +		return -EIO;
>>>> +	}
>>>> +
>>>> +	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
>>>> +	pm_runtime_put(&pdev->dev);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int
>>>> +virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
>>>> +				   unsigned int cmd, unsigned long arg)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
>>>> +	void __user *uarg = (void __user *)arg;
>>>> +	struct vfio_region_info info = {};
>>>> +
>>>> +	if (copy_from_user(&info, uarg, minsz))
>>>> +		return -EFAULT;
>>>> +
>>>> +	if (info.argsz < minsz)
>>>> +		return -EINVAL;
>>>> +
>>>> +	switch (info.index) {
>>>> +	case VFIO_PCI_BAR0_REGION_INDEX:
>>>> +		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>>>> +		info.size = virtvdev->bar0_virtual_buf_size;
>>>> +		info.flags = VFIO_REGION_INFO_FLAG_READ |
>>>> +			     VFIO_REGION_INFO_FLAG_WRITE;
>>>> +		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
>>>> +	default:
>>>> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>>>> +	}
>>>> +}
>>>> +
>>>> +static long
>>>> +virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>> +			     unsigned long arg)
>>>> +{
>>>> +	switch (cmd) {
>>>> +	case VFIO_DEVICE_GET_REGION_INFO:
>>>> +		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
>>>> +	default:
>>>> +		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
>>>> +	}
>>>> +}
>>>> +
>>>> +static int
>>>> +virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
>>>> +{
>>>> +	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
>>>> +	int ret;
>>>> +
>>>> +	/*
>>>> +	 * Setup the BAR where the 'notify' exists to be used by vfio as well
>>>> +	 * This will let us mmap it only once and use it when needed.
>>>> +	 */
>>>> +	ret = vfio_pci_core_setup_barmap(core_device,
>>>> +					 virtvdev->notify_bar);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
>>>> +			virtvdev->notify_offset;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
>>>> +	int ret;
>>>> +
>>>> +	ret = vfio_pci_core_enable(vdev);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	if (virtvdev->bar0_virtual_buf) {
>>>> +		/*
>>>> +		 * Upon close_device() the vfio_pci_core_disable() is called
>>>> +		 * and will close all the previous mmaps, so it seems that the
>>>> +		 * valid life cycle for the 'notify' addr is per open/close.
>>>> +		 */
>>>> +		ret = virtiovf_set_notify_addr(virtvdev);
>>>> +		if (ret) {
>>>> +			vfio_pci_core_disable(vdev);
>>>> +			return ret;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	vfio_pci_core_finish_enable(vdev);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int virtiovf_get_device_config_size(unsigned short device)
>>>> +{
>>>> +	/* Network card */
>>>> +	return offsetofend(struct virtio_net_config, status);
>>>> +}
>>>> +
>>>> +static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
>>>> +{
>>>> +	u64 offset;
>>>> +	int ret;
>>>> +	u8 bar;
>>>> +
>>>> +	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
>>>> +				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
>>>> +				&bar, &offset);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	virtvdev->notify_bar = bar;
>>>> +	virtvdev->notify_offset = offset;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +	struct pci_dev *pdev;
>>>> +	int ret;
>>>> +
>>>> +	ret = vfio_pci_core_init_dev(core_vdev);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	pdev = virtvdev->core_device.pdev;
>>>> +	ret = virtiovf_read_notify_info(virtvdev);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	/* Being ready with a buffer that supports MSIX */
>>>> +	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
>>>> +				virtiovf_get_device_config_size(pdev->device);
>>>> +	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
>>>> +					     GFP_KERNEL);
>>>> +	if (!virtvdev->bar0_virtual_buf)
>>>> +		return -ENOMEM;
>>>> +	mutex_init(&virtvdev->bar_mutex);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = container_of(
>>>> +		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
>>>> +
>>>> +	kfree(virtvdev->bar0_virtual_buf);
>>>> +	vfio_pci_core_release_dev(core_vdev);
>>>> +}
>>>> +
>>>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
>>>> +	.name = "virtio-transitional-vfio-pci",
>>>> +	.init = virtiovf_pci_init_device,
>>>> +	.release = virtiovf_pci_core_release_dev,
>>>> +	.open_device = virtiovf_pci_open_device,
>>>> +	.close_device = vfio_pci_core_close_device,
>>>> +	.ioctl = virtiovf_vfio_pci_core_ioctl,
>>>> +	.read = virtiovf_pci_core_read,
>>>> +	.write = virtiovf_pci_core_write,
>>>> +	.mmap = vfio_pci_core_mmap,
>>>> +	.request = vfio_pci_core_request,
>>>> +	.match = vfio_pci_core_match,
>>>> +	.bind_iommufd = vfio_iommufd_physical_bind,
>>>> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
>>>> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
>>>> +};
>>>> +
>>>> +static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
>>>> +	.name = "virtio-acc-vfio-pci",
>>>> +	.init = vfio_pci_core_init_dev,
>>>> +	.release = vfio_pci_core_release_dev,
>>>> +	.open_device = virtiovf_pci_open_device,
>>>> +	.close_device = vfio_pci_core_close_device,
>>>> +	.ioctl = vfio_pci_core_ioctl,
>>>> +	.device_feature = vfio_pci_core_ioctl_feature,
>>>> +	.read = vfio_pci_core_read,
>>>> +	.write = vfio_pci_core_write,
>>>> +	.mmap = vfio_pci_core_mmap,
>>>> +	.request = vfio_pci_core_request,
>>>> +	.match = vfio_pci_core_match,
>>>> +	.bind_iommufd = vfio_iommufd_physical_bind,
>>>> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
>>>> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
>>>> +};
>>>> +
>>>> +static bool virtiovf_bar0_exists(struct pci_dev *pdev)
>>>> +{
>>>> +	struct resource *res = pdev->resource;
>>>> +
>>>> +	return res->flags ? true : false;
>>>> +}
>>>> +
>>>> +#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
>>>> +	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
>>>> +	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
>>>> +
>>>> +static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
>>>> +{
>>>> +	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
>>>> +	u8 *buf;
>>>> +	int ret;
>>>> +
>>>> +	buf = kzalloc(buf_size, GFP_KERNEL);
>>>> +	if (!buf)
>>>> +		return false;
>>>> +
>>>> +	ret = virtio_pci_admin_list_query(pdev, buf, buf_size);
>>>> +	if (ret)
>>>> +		goto end;
>>>> +
>>>> +	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
>>>> +		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
>>>> +		ret = -EOPNOTSUPP;
>>>> +		goto end;
>>>> +	}
>>>> +
>>>> +	/* Confirm the used commands */
>>>> +	memset(buf, 0, buf_size);
>>>> +	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
>>>> +	ret = virtio_pci_admin_list_use(pdev, buf, buf_size);
>>>> +end:
>>>> +	kfree(buf);
>>>> +	return ret ? false : true;
>>>> +}
>>>> +
>>>> +static int virtiovf_pci_probe(struct pci_dev *pdev,
>>>> +			      const struct pci_device_id *id)
>>>> +{
>>>> +	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
>>>> +	struct virtiovf_pci_core_device *virtvdev;
>>>> +	int ret;
>>>> +
>>>> +	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
>>>> +	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
>>> All but the last test here are fairly evident requirements of the
>>> driver.  Why do we require a device that supports MSI-X?
>> As now we check at run time to decide whether MSI-X is enabled/disabled
>> to pick-up the correct op code, no need for that any more.
>>
>> Will drop this MSI-X check from V2.
>>
>> Thanks,
>> Yishai
>>
>>> Thanks,
>>> Alex
>>>
>>>   
>>>> +		ops = &virtiovf_acc_vfio_pci_tran_ops;
>>>> +
>>>> +	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
>>>> +				     &pdev->dev, ops);
>>>> +	if (IS_ERR(virtvdev))
>>>> +		return PTR_ERR(virtvdev);
>>>> +
>>>> +	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
>>>> +	ret = vfio_pci_core_register_device(&virtvdev->core_device);
>>>> +	if (ret)
>>>> +		goto out;
>>>> +	return 0;
>>>> +out:
>>>> +	vfio_put_device(&virtvdev->core_device.vdev);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static void virtiovf_pci_remove(struct pci_dev *pdev)
>>>> +{
>>>> +	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
>>>> +
>>>> +	vfio_pci_core_unregister_device(&virtvdev->core_device);
>>>> +	vfio_put_device(&virtvdev->core_device.vdev);
>>>> +}
>>>> +
>>>> +static const struct pci_device_id virtiovf_pci_table[] = {
>>>> +	/* Only virtio-net is supported/tested so far */
>>>> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
>>>> +	{}
>>>> +};
>>>> +
>>>> +MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
>>>> +
>>>> +static struct pci_driver virtiovf_pci_driver = {
>>>> +	.name = KBUILD_MODNAME,
>>>> +	.id_table = virtiovf_pci_table,
>>>> +	.probe = virtiovf_pci_probe,
>>>> +	.remove = virtiovf_pci_remove,
>>>> +	.err_handler = &vfio_pci_core_err_handlers,
>>>> +	.driver_managed_dma = true,
>>>> +};
>>>> +
>>>> +module_pci_driver(virtiovf_pci_driver);
>>>> +
>>>> +MODULE_LICENSE("GPL");
>>>> +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
>>>> +MODULE_DESCRIPTION(
>>>> +	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");
>>
Michael S. Tsirkin Oct. 26, 2023, 12:12 p.m. UTC | #13
On Thu, Oct 26, 2023 at 03:08:12PM +0300, Yishai Hadas wrote:
> > > Makes sense ?
> > So do I understand correctly that virtio dictates the subsystem device
> > ID for all subsystem vendor IDs that implement a legacy virtio
> > interface?  Ok, but this device didn't actually implement a legacy
> > virtio interface.  The device itself is not tranistional, we're imposing
> > an emulated transitional interface onto it.  So did the subsystem vendor
> > agree to have their subsystem device ID managed by the virtio committee
> > or might we create conflicts?  I imagine we know we don't have a
> > conflict if we also virtualize the subsystem vendor ID.
> > 
> The non transitional net device in the virtio spec defined as the below
> tuple.
> T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
> 
> And transitional net device in the virtio spec for a vendor FOO is defined
> as:
> T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
> 
> This driver is converting T_A to T_B, which both are defined by the virtio
> spec.
> Hence, it does not conflict for the subsystem vendor, it is fine.

You are talking about legacy guests, what 1.X spec says about them
is much less important than what guests actually do.
Check the INF of the open source windows drivers and linux code, at least.
Parav Pandit Oct. 26, 2023, 12:40 p.m. UTC | #14
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 26, 2023 5:42 PM
> 
> On Thu, Oct 26, 2023 at 03:08:12PM +0300, Yishai Hadas wrote:
> > > > Makes sense ?
> > > So do I understand correctly that virtio dictates the subsystem
> > > device ID for all subsystem vendor IDs that implement a legacy
> > > virtio interface?  Ok, but this device didn't actually implement a
> > > legacy virtio interface.  The device itself is not tranistional,
> > > we're imposing an emulated transitional interface onto it.  So did
> > > the subsystem vendor agree to have their subsystem device ID managed
> > > by the virtio committee or might we create conflicts?  I imagine we
> > > know we don't have a conflict if we also virtualize the subsystem vendor ID.
> > >
> > The non transitional net device in the virtio spec defined as the
> > below tuple.
> > T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
> >
> > And transitional net device in the virtio spec for a vendor FOO is
> > defined
> > as:
> > T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
> >
> > This driver is converting T_A to T_B, which both are defined by the
> > virtio spec.
> > Hence, it does not conflict for the subsystem vendor, it is fine.
> 
> You are talking about legacy guests, what 1.X spec says about them is much less
> important than what guests actually do.
> Check the INF of the open source windows drivers and linux code, at least.

Linux legacy guest has,

static struct pci_device_id virtio_pci_id_table[] = {
        { 0x1af4, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
        { 0 },
};
Followed by an open coded driver check for 0x1000 to 0x103f range.
Do you mean windows driver expects specific subsystem vendor id of 0x1af4?
Michael S. Tsirkin Oct. 26, 2023, 1:15 p.m. UTC | #15
On Thu, Oct 26, 2023 at 12:40:04PM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 26, 2023 5:42 PM
> > 
> > On Thu, Oct 26, 2023 at 03:08:12PM +0300, Yishai Hadas wrote:
> > > > > Makes sense ?
> > > > So do I understand correctly that virtio dictates the subsystem
> > > > device ID for all subsystem vendor IDs that implement a legacy
> > > > virtio interface?  Ok, but this device didn't actually implement a
> > > > legacy virtio interface.  The device itself is not tranistional,
> > > > we're imposing an emulated transitional interface onto it.  So did
> > > > the subsystem vendor agree to have their subsystem device ID managed
> > > > by the virtio committee or might we create conflicts?  I imagine we
> > > > know we don't have a conflict if we also virtualize the subsystem vendor ID.
> > > >
> > > The non transitional net device in the virtio spec defined as the
> > > below tuple.
> > > T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
> > >
> > > And transitional net device in the virtio spec for a vendor FOO is
> > > defined
> > > as:
> > > T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
> > >
> > > This driver is converting T_A to T_B, which both are defined by the
> > > virtio spec.
> > > Hence, it does not conflict for the subsystem vendor, it is fine.
> > 
> > You are talking about legacy guests, what 1.X spec says about them is much less
> > important than what guests actually do.
> > Check the INF of the open source windows drivers and linux code, at least.
> 
> Linux legacy guest has,
> 
> static struct pci_device_id virtio_pci_id_table[] = {
>         { 0x1af4, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
>         { 0 },
> };
> Followed by an open coded driver check for 0x1000 to 0x103f range.
> Do you mean windows driver expects specific subsystem vendor id of 0x1af4?

Look it up, it's open source.
Parav Pandit Oct. 26, 2023, 1:28 p.m. UTC | #16
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 26, 2023 6:45 PM

> > Followed by an open coded driver check for 0x1000 to 0x103f range.
> > Do you mean windows driver expects specific subsystem vendor id of 0x1af4?
> 
> Look it up, it's open source.

Those are not OS inbox drivers anyway.
:)
The current vfio driver is following the virtio spec based on legacy spec, 1.x spec following the transitional device sections.
There is no need to do something out of spec at this point.
Michael S. Tsirkin Oct. 26, 2023, 3:06 p.m. UTC | #17
On Thu, Oct 26, 2023 at 01:28:18PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 26, 2023 6:45 PM
> 
> > > Followed by an open coded driver check for 0x1000 to 0x103f range.
> > > Do you mean windows driver expects specific subsystem vendor id of 0x1af4?
> > 
> > Look it up, it's open source.
> 
> Those are not OS inbox drivers anyway.
> :)

Does not matter at all if guest has drivers installed.
Either you worry about legacy guests or not.


> The current vfio driver is following the virtio spec based on legacy spec, 1.x spec following the transitional device sections.
> There is no need to do something out of spec at this point.

legacy spec wasn't maintained properly, drivers diverged sometimes
significantly. what matters is installed base.
Parav Pandit Oct. 26, 2023, 3:09 p.m. UTC | #18
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 26, 2023 8:36 PM
> 
> On Thu, Oct 26, 2023 at 01:28:18PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 26, 2023 6:45 PM
> >
> > > > Followed by an open coded driver check for 0x1000 to 0x103f range.
> > > > Do you mean windows driver expects specific subsystem vendor id of
> 0x1af4?
> > >
> > > Look it up, it's open source.
> >
> > Those are not OS inbox drivers anyway.
> > :)
> 
> Does not matter at all if guest has drivers installed.
> Either you worry about legacy guests or not.
> 
So, Linux guests have inbox drivers, that we care about and they seems to be covered, right?

> 
> > The current vfio driver is following the virtio spec based on legacy spec, 1.x
> spec following the transitional device sections.
> > There is no need to do something out of spec at this point.
> 
> legacy spec wasn't maintained properly, drivers diverged sometimes
> significantly. what matters is installed base.

So if you know the subsystem vendor id that Windows expects, please share, so we can avoid playing puzzle game. :)
It anyway can be reported by the device itself.
Michael S. Tsirkin Oct. 26, 2023, 3:46 p.m. UTC | #19
On Thu, Oct 26, 2023 at 03:09:13PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 26, 2023 8:36 PM
> > 
> > On Thu, Oct 26, 2023 at 01:28:18PM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, October 26, 2023 6:45 PM
> > >
> > > > > Followed by an open coded driver check for 0x1000 to 0x103f range.
> > > > > Do you mean windows driver expects specific subsystem vendor id of
> > 0x1af4?
> > > >
> > > > Look it up, it's open source.
> > >
> > > Those are not OS inbox drivers anyway.
> > > :)
> > 
> > Does not matter at all if guest has drivers installed.
> > Either you worry about legacy guests or not.
> > 
> So, Linux guests have inbox drivers, that we care about and they seems to be covered, right?
> 
> > 
> > > The current vfio driver is following the virtio spec based on legacy spec, 1.x
> > spec following the transitional device sections.
> > > There is no need to do something out of spec at this point.
> > 
> > legacy spec wasn't maintained properly, drivers diverged sometimes
> > significantly. what matters is installed base.
> 
> So if you know the subsystem vendor id that Windows expects, please share, so we can avoid playing puzzle game. :)
> It anyway can be reported by the device itself.

I don't know myself offhand. I just know it's not so simple. Looking at the source
for network drivers I see:

%kvmnet6.DeviceDesc%    = kvmnet6.ndi, PCI\VEN_1AF4&DEV_1000&SUBSYS_0001_INX_SUBSYS_VENDOR_ID&REV_00, PCI\VEN_1AF4&DEV_1000


So the drivers will:
A. bind with high priority to subsystem vendor ID used when drivers where built.
   popular drivers built and distributed for free by Red Hat have 1AF4
B. bind with low priority to any subsystem device/vendor id as long as
   vendor is 1af4 and device is 1000


My conclusions:
- you probably need a way to tweak subsystem vendor id in software
- default should probably be 1AF4 not whatever actual device uses
Parav Pandit Oct. 26, 2023, 3:56 p.m. UTC | #20
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 26, 2023 9:16 PM

> On Thu, Oct 26, 2023 at 03:09:13PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 26, 2023 8:36 PM
> > >
> > > On Thu, Oct 26, 2023 at 01:28:18PM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, October 26, 2023 6:45 PM
> > > >
> > > > > > Followed by an open coded driver check for 0x1000 to 0x103f range.
> > > > > > Do you mean windows driver expects specific subsystem vendor
> > > > > > id of
> > > 0x1af4?
> > > > >
> > > > > Look it up, it's open source.
> > > >
> > > > Those are not OS inbox drivers anyway.
> > > > :)
> > >
> > > Does not matter at all if guest has drivers installed.
> > > Either you worry about legacy guests or not.
> > >
> > So, Linux guests have inbox drivers, that we care about and they seems to be
> covered, right?
> >
> > >
> > > > The current vfio driver is following the virtio spec based on
> > > > legacy spec, 1.x
> > > spec following the transitional device sections.
> > > > There is no need to do something out of spec at this point.
> > >
> > > legacy spec wasn't maintained properly, drivers diverged sometimes
> > > significantly. what matters is installed base.
> >
> > So if you know the subsystem vendor id that Windows expects, please
> > share, so we can avoid playing puzzle game. :) It anyway can be reported by
> the device itself.
> 
> I don't know myself offhand. I just know it's not so simple. Looking at the source
> for network drivers I see:
> 
> %kvmnet6.DeviceDesc%    = kvmnet6.ndi,
> PCI\VEN_1AF4&DEV_1000&SUBSYS_0001_INX_SUBSYS_VENDOR_ID&REV_00,
> PCI\VEN_1AF4&DEV_1000
> 
Yeah, I was checking the cryptic notation at https://github.com/virtio-win/kvm-guest-drivers-windows/tree/master

> 
> So the drivers will:
> A. bind with high priority to subsystem vendor ID used when drivers where built.
>    popular drivers built and distributed for free by Red Hat have 1AF4 B. bind
> with low priority to any subsystem device/vendor id as long as
>    vendor is 1af4 and device is 1000
> 
> 
> My conclusions:
> - you probably need a way to tweak subsystem vendor id in software
> - default should probably be 1AF4 not whatever actual device uses
Ok.
It is not mandatory but, if you and Alex are ok to also tweak the subsystem vendor id, it seems fine to me. It does not hurt anything.
Alex Williamson Oct. 26, 2023, 5:55 p.m. UTC | #21
On Thu, 26 Oct 2023 15:08:12 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 25/10/2023 22:13, Alex Williamson wrote:
> > On Wed, 25 Oct 2023 17:35:51 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> On 24/10/2023 22:57, Alex Williamson wrote:  
> >>> On Tue, 17 Oct 2023 16:42:17 +0300
> >>> Yishai Hadas <yishaih@nvidia.com> wrote:
   
> >>>> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
> >>>> +			return -EFAULT;
> >>>> +	}
> >>>> +
> >>>> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> >>>> +				  &copy_offset, &copy_count, NULL)) {
> >>>> +		/*
> >>>> +		 * Transitional devices use the PCI subsystem device id as
> >>>> +		 * virtio device id, same as legacy driver always did.  
> >>> Where did we require the subsystem vendor ID to be 0x1af4?  This
> >>> subsystem device ID really only makes since given that subsystem
> >>> vendor ID, right?  Otherwise I don't see that non-transitional devices,
> >>> such as the VF, have a hard requirement per the spec for the subsystem
> >>> vendor ID.
> >>>
> >>> Do we want to make this only probe the correct subsystem vendor ID or do
> >>> we want to emulate the subsystem vendor ID as well?  I don't see this is
> >>> correct without one of those options.  
> >> Looking in the 1.x spec we can see the below.
> >>
> >> Legacy Interfaces: A Note on PCI Device Discovery
> >>
> >> "Transitional devices MUST have the PCI Subsystem
> >> Device ID matching the Virtio Device ID, as indicated in section 5 ...
> >> This is to match legacy drivers."
> >>
> >> However, there is no need to enforce Subsystem Vendor ID.
> >>
> >> This is what we followed here.
> >>
> >> Makes sense ?  
> > So do I understand correctly that virtio dictates the subsystem device
> > ID for all subsystem vendor IDs that implement a legacy virtio
> > interface?  Ok, but this device didn't actually implement a legacy
> > virtio interface.  The device itself is not tranistional, we're imposing
> > an emulated transitional interface onto it.  So did the subsystem vendor
> > agree to have their subsystem device ID managed by the virtio committee
> > or might we create conflicts?  I imagine we know we don't have a
> > conflict if we also virtualize the subsystem vendor ID.
> >  
> The non transitional net device in the virtio spec defined as the below 
> tuple.
> T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
> 
> And transitional net device in the virtio spec for a vendor FOO is 
> defined as:
> T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
> 
> This driver is converting T_A to T_B, which both are defined by the 
> virtio spec.
> Hence, it does not conflict for the subsystem vendor, it is fine.

Surprising to me that the virtio spec dictates subsystem device ID in
all cases.  The further discussion in this thread seems to indicate we
need to virtualize subsystem vendor ID for broader driver compatibility
anyway.

> > BTW, it would be a lot easier for all of the config space emulation here
> > if we could make use of the existing field virtualization in
> > vfio-pci-core.  In fact you'll see in vfio_config_init() that
> > PCI_DEVICE_ID is already virtualized for VFs, so it would be enough to
> > simply do the following to report the desired device ID:
> >
> > 	*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(0x1000);  
> 
> I would prefer keeping things simple and have one place/flow that 
> handles all the fields as we have now as part of the driver.

That's the same argument I'd make for re-using the core code, we don't
need multiple implementations handling merging physical and virtual
bits within config space.

> In any case, I'll further look at that option for managing the DEVICE_ID 
> towards V2.
> 
> > It appears everything in this function could be handled similarly by
> > vfio-pci-core if the right fields in the perm_bits.virt and .write
> > bits could be manipulated and vconfig modified appropriately.  I'd look
> > for a way that a variant driver could provide an alternate set of
> > permissions structures for various capabilities.  Thanks,  
> 
> OK
> 
> However, let's not block V2 and the series acceptance as of that.
> 
> It can always be some future refactoring as part of other series that 
> will bring the infra-structure that is needed for that.

We're already on the verge of the v6.7 merge window, so this looks like
v6.8 material anyway.  We have time.  Thanks,

Alex
Michael S. Tsirkin Oct. 26, 2023, 7:49 p.m. UTC | #22
On Thu, Oct 26, 2023 at 11:55:39AM -0600, Alex Williamson wrote:
> On Thu, 26 Oct 2023 15:08:12 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > On 25/10/2023 22:13, Alex Williamson wrote:
> > > On Wed, 25 Oct 2023 17:35:51 +0300
> > > Yishai Hadas <yishaih@nvidia.com> wrote:
> > >  
> > >> On 24/10/2023 22:57, Alex Williamson wrote:  
> > >>> On Tue, 17 Oct 2023 16:42:17 +0300
> > >>> Yishai Hadas <yishaih@nvidia.com> wrote:
>    
> > >>>> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
> > >>>> +			return -EFAULT;
> > >>>> +	}
> > >>>> +
> > >>>> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
> > >>>> +				  &copy_offset, &copy_count, NULL)) {
> > >>>> +		/*
> > >>>> +		 * Transitional devices use the PCI subsystem device id as
> > >>>> +		 * virtio device id, same as legacy driver always did.  
> > >>> Where did we require the subsystem vendor ID to be 0x1af4?  This
> > >>> subsystem device ID really only makes since given that subsystem
> > >>> vendor ID, right?  Otherwise I don't see that non-transitional devices,
> > >>> such as the VF, have a hard requirement per the spec for the subsystem
> > >>> vendor ID.
> > >>>
> > >>> Do we want to make this only probe the correct subsystem vendor ID or do
> > >>> we want to emulate the subsystem vendor ID as well?  I don't see this is
> > >>> correct without one of those options.  
> > >> Looking in the 1.x spec we can see the below.
> > >>
> > >> Legacy Interfaces: A Note on PCI Device Discovery
> > >>
> > >> "Transitional devices MUST have the PCI Subsystem
> > >> Device ID matching the Virtio Device ID, as indicated in section 5 ...
> > >> This is to match legacy drivers."
> > >>
> > >> However, there is no need to enforce Subsystem Vendor ID.
> > >>
> > >> This is what we followed here.
> > >>
> > >> Makes sense ?  
> > > So do I understand correctly that virtio dictates the subsystem device
> > > ID for all subsystem vendor IDs that implement a legacy virtio
> > > interface?  Ok, but this device didn't actually implement a legacy
> > > virtio interface.  The device itself is not tranistional, we're imposing
> > > an emulated transitional interface onto it.  So did the subsystem vendor
> > > agree to have their subsystem device ID managed by the virtio committee
> > > or might we create conflicts?  I imagine we know we don't have a
> > > conflict if we also virtualize the subsystem vendor ID.
> > >  
> > The non transitional net device in the virtio spec defined as the below 
> > tuple.
> > T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
> > 
> > And transitional net device in the virtio spec for a vendor FOO is 
> > defined as:
> > T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
> > 
> > This driver is converting T_A to T_B, which both are defined by the 
> > virtio spec.
> > Hence, it does not conflict for the subsystem vendor, it is fine.
> 
> Surprising to me that the virtio spec dictates subsystem device ID in
> all cases.

Modern virtio spec doesn't. Legacy spec did.
Yishai Hadas Oct. 29, 2023, 4:13 p.m. UTC | #23
On 26/10/2023 20:55, Alex Williamson wrote:
> On Thu, 26 Oct 2023 15:08:12 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> On 25/10/2023 22:13, Alex Williamson wrote:
>>> On Wed, 25 Oct 2023 17:35:51 +0300
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>   
>>>> On 24/10/2023 22:57, Alex Williamson wrote:
>>>>> On Tue, 17 Oct 2023 16:42:17 +0300
>>>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>     
>>>>>> +		if (copy_to_user(buf + copy_offset, &val32, copy_count))
>>>>>> +			return -EFAULT;
>>>>>> +	}
>>>>>> +
>>>>>> +	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
>>>>>> +				  &copy_offset, &copy_count, NULL)) {
>>>>>> +		/*
>>>>>> +		 * Transitional devices use the PCI subsystem device id as
>>>>>> +		 * virtio device id, same as legacy driver always did.
>>>>> Where did we require the subsystem vendor ID to be 0x1af4?  This
>>>>> subsystem device ID really only makes since given that subsystem
>>>>> vendor ID, right?  Otherwise I don't see that non-transitional devices,
>>>>> such as the VF, have a hard requirement per the spec for the subsystem
>>>>> vendor ID.
>>>>>
>>>>> Do we want to make this only probe the correct subsystem vendor ID or do
>>>>> we want to emulate the subsystem vendor ID as well?  I don't see this is
>>>>> correct without one of those options.
>>>> Looking in the 1.x spec we can see the below.
>>>>
>>>> Legacy Interfaces: A Note on PCI Device Discovery
>>>>
>>>> "Transitional devices MUST have the PCI Subsystem
>>>> Device ID matching the Virtio Device ID, as indicated in section 5 ...
>>>> This is to match legacy drivers."
>>>>
>>>> However, there is no need to enforce Subsystem Vendor ID.
>>>>
>>>> This is what we followed here.
>>>>
>>>> Makes sense ?
>>> So do I understand correctly that virtio dictates the subsystem device
>>> ID for all subsystem vendor IDs that implement a legacy virtio
>>> interface?  Ok, but this device didn't actually implement a legacy
>>> virtio interface.  The device itself is not tranistional, we're imposing
>>> an emulated transitional interface onto it.  So did the subsystem vendor
>>> agree to have their subsystem device ID managed by the virtio committee
>>> or might we create conflicts?  I imagine we know we don't have a
>>> conflict if we also virtualize the subsystem vendor ID.
>>>   
>> The non transitional net device in the virtio spec defined as the below
>> tuple.
>> T_A: VID=0x1AF4, DID=0x1040, Subsys_VID=FOO, Subsys_DID=0x40.
>>
>> And transitional net device in the virtio spec for a vendor FOO is
>> defined as:
>> T_B: VID=0x1AF4,DID=0x1000,Subsys_VID=FOO, subsys_DID=0x1
>>
>> This driver is converting T_A to T_B, which both are defined by the
>> virtio spec.
>> Hence, it does not conflict for the subsystem vendor, it is fine.
> Surprising to me that the virtio spec dictates subsystem device ID in
> all cases.  The further discussion in this thread seems to indicate we
> need to virtualize subsystem vendor ID for broader driver compatibility
> anyway.
>
>>> BTW, it would be a lot easier for all of the config space emulation here
>>> if we could make use of the existing field virtualization in
>>> vfio-pci-core.  In fact you'll see in vfio_config_init() that
>>> PCI_DEVICE_ID is already virtualized for VFs, so it would be enough to
>>> simply do the following to report the desired device ID:
>>>
>>> 	*(__le16 *)&vconfig[PCI_DEVICE_ID] = cpu_to_le16(0x1000);
>> I would prefer keeping things simple and have one place/flow that
>> handles all the fields as we have now as part of the driver.
> That's the same argument I'd make for re-using the core code, we don't
> need multiple implementations handling merging physical and virtual
> bits within config space.
>
>> In any case, I'll further look at that option for managing the DEVICE_ID
>> towards V2.
>>
>>> It appears everything in this function could be handled similarly by
>>> vfio-pci-core if the right fields in the perm_bits.virt and .write
>>> bits could be manipulated and vconfig modified appropriately.  I'd look
>>> for a way that a variant driver could provide an alternate set of
>>> permissions structures for various capabilities.  Thanks,
>> OK
>>
>> However, let's not block V2 and the series acceptance as of that.
>>
>> It can always be some future refactoring as part of other series that
>> will bring the infra-structure that is needed for that.
> We're already on the verge of the v6.7 merge window, so this looks like
> v6.8 material anyway.  We have time.  Thanks,

OK

I sent V2 having all the other notes handled to share and get feedback 
from both you and Michael.

Let's continue from there to see what is needed towards v6.8.

Thanks,
Yishai

>
> Alex
>
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 7a7bd8bd80e9..680a70063775 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22620,6 +22620,13 @@  L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/pci/mlx5/
 
+VFIO VIRTIO PCI DRIVER
+M:	Yishai Hadas <yishaih@nvidia.com>
+L:	kvm@vger.kernel.org
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/vfio/pci/virtio
+
 VFIO PCI DEVICE SPECIFIC DRIVERS
 R:	Jason Gunthorpe <jgg@nvidia.com>
 R:	Yishai Hadas <yishaih@nvidia.com>
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 8125e5f37832..18c397df566d 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,4 +65,6 @@  source "drivers/vfio/pci/hisilicon/Kconfig"
 
 source "drivers/vfio/pci/pds/Kconfig"
 
+source "drivers/vfio/pci/virtio/Kconfig"
+
 endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 45167be462d8..046139a4eca5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -13,3 +13,5 @@  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
 
 obj-$(CONFIG_PDS_VFIO_PCI) += pds/
+
+obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
diff --git a/drivers/vfio/pci/virtio/Kconfig b/drivers/vfio/pci/virtio/Kconfig
new file mode 100644
index 000000000000..89eddce8b1bd
--- /dev/null
+++ b/drivers/vfio/pci/virtio/Kconfig
@@ -0,0 +1,15 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+config VIRTIO_VFIO_PCI
+        tristate "VFIO support for VIRTIO PCI devices"
+        depends on VIRTIO_PCI
+        select VFIO_PCI_CORE
+        help
+          This provides support for exposing VIRTIO VF devices using the VFIO
+          framework that can work with a legacy virtio driver in the guest.
+          Based on PCIe spec, VFs do not support I/O Space; thus, VF BARs shall
+          not indicate I/O Space.
+          As of that this driver emulated I/O BAR in software to let a VF be
+          seen as a transitional device in the guest and let it work with
+          a legacy driver.
+
+          If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/virtio/Makefile b/drivers/vfio/pci/virtio/Makefile
new file mode 100644
index 000000000000..2039b39fb723
--- /dev/null
+++ b/drivers/vfio/pci/virtio/Makefile
@@ -0,0 +1,4 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio-vfio-pci.o
+virtio-vfio-pci-y := main.o
+
diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
new file mode 100644
index 000000000000..3fef4b21f7e6
--- /dev/null
+++ b/drivers/vfio/pci/virtio/main.c
@@ -0,0 +1,577 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/virtio_pci.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_pci_admin.h>
+
+struct virtiovf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	u8 bar0_virtual_buf_size;
+	u8 *bar0_virtual_buf;
+	/* synchronize access to the virtual buf */
+	struct mutex bar_mutex;
+	void __iomem *notify_addr;
+	u32 notify_offset;
+	u8 notify_bar;
+	u16 pci_cmd;
+	u16 msix_ctrl;
+};
+
+static int
+virtiovf_issue_legacy_rw_cmd(struct virtiovf_pci_core_device *virtvdev,
+			     loff_t pos, char __user *buf,
+			     size_t count, bool read)
+{
+	bool msix_enabled = virtvdev->msix_ctrl & PCI_MSIX_FLAGS_ENABLE;
+	struct pci_dev *pdev = virtvdev->core_device.pdev;
+	u8 *bar0_buf = virtvdev->bar0_virtual_buf;
+	u16 opcode;
+	int ret;
+
+	mutex_lock(&virtvdev->bar_mutex);
+	if (read) {
+		opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
+			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ :
+			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ;
+		ret = virtio_pci_admin_legacy_io_read(pdev, opcode, pos, count,
+						      bar0_buf + pos);
+		if (ret)
+			goto out;
+		if (copy_to_user(buf, bar0_buf + pos, count))
+			ret = -EFAULT;
+		goto out;
+	}
+
+	if (copy_from_user(bar0_buf + pos, buf, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	opcode = (pos < VIRTIO_PCI_CONFIG_OFF(msix_enabled)) ?
+			VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE :
+			VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE;
+	ret = virtio_pci_admin_legacy_io_write(pdev, opcode, pos, count,
+					       bar0_buf + pos);
+out:
+	mutex_unlock(&virtvdev->bar_mutex);
+	return ret;
+}
+
+static int
+translate_io_bar_to_mem_bar(struct virtiovf_pci_core_device *virtvdev,
+			    loff_t pos, char __user *buf,
+			    size_t count, bool read)
+{
+	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
+	u16 queue_notify;
+	int ret;
+
+	if (pos + count > virtvdev->bar0_virtual_buf_size)
+		return -EINVAL;
+
+	switch (pos) {
+	case VIRTIO_PCI_QUEUE_NOTIFY:
+		if (count != sizeof(queue_notify))
+			return -EINVAL;
+		if (read) {
+			ret = vfio_pci_ioread16(core_device, true, &queue_notify,
+						virtvdev->notify_addr);
+			if (ret)
+				return ret;
+			if (copy_to_user(buf, &queue_notify,
+					 sizeof(queue_notify)))
+				return -EFAULT;
+			break;
+		}
+
+		if (copy_from_user(&queue_notify, buf, count))
+			return -EFAULT;
+
+		ret = vfio_pci_iowrite16(core_device, true, queue_notify,
+					 virtvdev->notify_addr);
+		break;
+	default:
+		ret = virtiovf_issue_legacy_rw_cmd(virtvdev, pos, buf, count,
+						   read);
+	}
+
+	return ret ? ret : count;
+}
+
+static bool range_intersect_range(loff_t range1_start, size_t count1,
+				  loff_t range2_start, size_t count2,
+				  loff_t *start_offset,
+				  size_t *intersect_count,
+				  size_t *register_offset)
+{
+	if (range1_start <= range2_start &&
+	    range1_start + count1 > range2_start) {
+		*start_offset = range2_start - range1_start;
+		*intersect_count = min_t(size_t, count2,
+					 range1_start + count1 - range2_start);
+		if (register_offset)
+			*register_offset = 0;
+		return true;
+	}
+
+	if (range1_start > range2_start &&
+	    range1_start < range2_start + count2) {
+		*start_offset = range1_start;
+		*intersect_count = min_t(size_t, count1,
+					 range2_start + count2 - range1_start);
+		if (register_offset)
+			*register_offset = range1_start - range2_start;
+		return true;
+	}
+
+	return false;
+}
+
+static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
+					char __user *buf, size_t count,
+					loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	size_t register_offset;
+	loff_t copy_offset;
+	size_t copy_count;
+	__le32 val32;
+	__le16 val16;
+	u8 val8;
+	int ret;
+
+	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
+	if (ret < 0)
+		return ret;
+
+	if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
+				  &copy_offset, &copy_count, NULL)) {
+		val16 = cpu_to_le16(0x1000);
+		if (copy_to_user(buf + copy_offset, &val16, copy_count))
+			return -EFAULT;
+	}
+
+	if ((virtvdev->pci_cmd & PCI_COMMAND_IO) &&
+	    range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
+				  &copy_offset, &copy_count, &register_offset)) {
+		if (copy_from_user((void *)&val16 + register_offset, buf + copy_offset,
+				   copy_count))
+			return -EFAULT;
+		val16 |= cpu_to_le16(PCI_COMMAND_IO);
+		if (copy_to_user(buf + copy_offset, (void *)&val16 + register_offset,
+				 copy_count))
+			return -EFAULT;
+	}
+
+	if (range_intersect_range(pos, count, PCI_REVISION_ID, sizeof(val8),
+				  &copy_offset, &copy_count, NULL)) {
+		/* Transional needs to have revision 0 */
+		val8 = 0;
+		if (copy_to_user(buf + copy_offset, &val8, copy_count))
+			return -EFAULT;
+	}
+
+	if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_0, sizeof(val32),
+				  &copy_offset, &copy_count, NULL)) {
+		val32 = cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
+		if (copy_to_user(buf + copy_offset, &val32, copy_count))
+			return -EFAULT;
+	}
+
+	if (range_intersect_range(pos, count, PCI_SUBSYSTEM_ID, sizeof(val16),
+				  &copy_offset, &copy_count, NULL)) {
+		/*
+		 * Transitional devices use the PCI subsystem device id as
+		 * virtio device id, same as legacy driver always did.
+		 */
+		val16 = cpu_to_le16(VIRTIO_ID_NET);
+		if (copy_to_user(buf + copy_offset, &val16, copy_count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+static ssize_t
+virtiovf_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
+		       size_t count, loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev = virtvdev->core_device.pdev;
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int ret;
+
+	if (!count)
+		return 0;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+		return virtiovf_pci_read_config(core_vdev, buf, count, ppos);
+
+	if (index != VFIO_PCI_BAR0_REGION_INDEX)
+		return vfio_pci_core_read(core_vdev, buf, count, ppos);
+
+	ret = pm_runtime_resume_and_get(&pdev->dev);
+	if (ret) {
+		pci_info_ratelimited(pdev, "runtime resume failed %d\n",
+				     ret);
+		return -EIO;
+	}
+
+	ret = translate_io_bar_to_mem_bar(virtvdev, pos, buf, count, true);
+	pm_runtime_put(&pdev->dev);
+	return ret;
+}
+
+static ssize_t
+virtiovf_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev = virtvdev->core_device.pdev;
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	int ret;
+
+	if (!count)
+		return 0;
+
+	if (index == VFIO_PCI_CONFIG_REGION_INDEX) {
+		size_t register_offset;
+		loff_t copy_offset;
+		size_t copy_count;
+
+		if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(virtvdev->pci_cmd),
+					  &copy_offset, &copy_count,
+					  &register_offset)) {
+			if (copy_from_user((void *)&virtvdev->pci_cmd + register_offset,
+					   buf + copy_offset,
+					   copy_count))
+				return -EFAULT;
+		}
+
+		if (range_intersect_range(pos, count, pdev->msix_cap + PCI_MSIX_FLAGS,
+					  sizeof(virtvdev->msix_ctrl),
+					  &copy_offset, &copy_count,
+					  &register_offset)) {
+			if (copy_from_user((void *)&virtvdev->msix_ctrl + register_offset,
+					   buf + copy_offset,
+					   copy_count))
+				return -EFAULT;
+		}
+	}
+
+	if (index != VFIO_PCI_BAR0_REGION_INDEX)
+		return vfio_pci_core_write(core_vdev, buf, count, ppos);
+
+	ret = pm_runtime_resume_and_get(&pdev->dev);
+	if (ret) {
+		pci_info_ratelimited(pdev, "runtime resume failed %d\n", ret);
+		return -EIO;
+	}
+
+	ret = translate_io_bar_to_mem_bar(virtvdev, pos, (char __user *)buf, count, false);
+	pm_runtime_put(&pdev->dev);
+	return ret;
+}
+
+static int
+virtiovf_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
+				   unsigned int cmd, unsigned long arg)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	void __user *uarg = (void __user *)arg;
+	struct vfio_region_info info = {};
+
+	if (copy_from_user(&info, uarg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	switch (info.index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+		info.size = virtvdev->bar0_virtual_buf_size;
+		info.flags = VFIO_REGION_INFO_FLAG_READ |
+			     VFIO_REGION_INFO_FLAG_WRITE;
+		return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
+	default:
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+	}
+}
+
+static long
+virtiovf_vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+			     unsigned long arg)
+{
+	switch (cmd) {
+	case VFIO_DEVICE_GET_REGION_INFO:
+		return virtiovf_pci_ioctl_get_region_info(core_vdev, cmd, arg);
+	default:
+		return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+	}
+}
+
+static int
+virtiovf_set_notify_addr(struct virtiovf_pci_core_device *virtvdev)
+{
+	struct vfio_pci_core_device *core_device = &virtvdev->core_device;
+	int ret;
+
+	/*
+	 * Setup the BAR where the 'notify' exists to be used by vfio as well
+	 * This will let us mmap it only once and use it when needed.
+	 */
+	ret = vfio_pci_core_setup_barmap(core_device,
+					 virtvdev->notify_bar);
+	if (ret)
+		return ret;
+
+	virtvdev->notify_addr = core_device->barmap[virtvdev->notify_bar] +
+			virtvdev->notify_offset;
+	return 0;
+}
+
+static int virtiovf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &virtvdev->core_device;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	if (virtvdev->bar0_virtual_buf) {
+		/*
+		 * Upon close_device() the vfio_pci_core_disable() is called
+		 * and will close all the previous mmaps, so it seems that the
+		 * valid life cycle for the 'notify' addr is per open/close.
+		 */
+		ret = virtiovf_set_notify_addr(virtvdev);
+		if (ret) {
+			vfio_pci_core_disable(vdev);
+			return ret;
+		}
+	}
+
+	vfio_pci_core_finish_enable(vdev);
+	return 0;
+}
+
+static int virtiovf_get_device_config_size(unsigned short device)
+{
+	/* Network card */
+	return offsetofend(struct virtio_net_config, status);
+}
+
+static int virtiovf_read_notify_info(struct virtiovf_pci_core_device *virtvdev)
+{
+	u64 offset;
+	int ret;
+	u8 bar;
+
+	ret = virtio_pci_admin_legacy_io_notify_info(virtvdev->core_device.pdev,
+				VIRTIO_ADMIN_CMD_NOTIFY_INFO_FLAGS_OWNER_MEM,
+				&bar, &offset);
+	if (ret)
+		return ret;
+
+	virtvdev->notify_bar = bar;
+	virtvdev->notify_offset = offset;
+	return 0;
+}
+
+static int virtiovf_pci_init_device(struct vfio_device *core_vdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev;
+	int ret;
+
+	ret = vfio_pci_core_init_dev(core_vdev);
+	if (ret)
+		return ret;
+
+	pdev = virtvdev->core_device.pdev;
+	ret = virtiovf_read_notify_info(virtvdev);
+	if (ret)
+		return ret;
+
+	/* Being ready with a buffer that supports MSIX */
+	virtvdev->bar0_virtual_buf_size = VIRTIO_PCI_CONFIG_OFF(true) +
+				virtiovf_get_device_config_size(pdev->device);
+	virtvdev->bar0_virtual_buf = kzalloc(virtvdev->bar0_virtual_buf_size,
+					     GFP_KERNEL);
+	if (!virtvdev->bar0_virtual_buf)
+		return -ENOMEM;
+	mutex_init(&virtvdev->bar_mutex);
+	return 0;
+}
+
+static void virtiovf_pci_core_release_dev(struct vfio_device *core_vdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = container_of(
+		core_vdev, struct virtiovf_pci_core_device, core_device.vdev);
+
+	kfree(virtvdev->bar0_virtual_buf);
+	vfio_pci_core_release_dev(core_vdev);
+}
+
+static const struct vfio_device_ops virtiovf_acc_vfio_pci_tran_ops = {
+	.name = "virtio-transitional-vfio-pci",
+	.init = virtiovf_pci_init_device,
+	.release = virtiovf_pci_core_release_dev,
+	.open_device = virtiovf_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = virtiovf_vfio_pci_core_ioctl,
+	.read = virtiovf_pci_core_read,
+	.write = virtiovf_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+static const struct vfio_device_ops virtiovf_acc_vfio_pci_ops = {
+	.name = "virtio-acc-vfio-pci",
+	.init = vfio_pci_core_init_dev,
+	.release = vfio_pci_core_release_dev,
+	.open_device = virtiovf_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+static bool virtiovf_bar0_exists(struct pci_dev *pdev)
+{
+	struct resource *res = pdev->resource;
+
+	return res->flags ? true : false;
+}
+
+#define VIRTIOVF_USE_ADMIN_CMD_BITMAP \
+	(BIT_ULL(VIRTIO_ADMIN_CMD_LIST_QUERY) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LIST_USE) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_READ) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_READ) | \
+	 BIT_ULL(VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO))
+
+static bool virtiovf_support_legacy_access(struct pci_dev *pdev)
+{
+	int buf_size = DIV_ROUND_UP(VIRTIO_ADMIN_MAX_CMD_OPCODE, 64) * 8;
+	u8 *buf;
+	int ret;
+
+	buf = kzalloc(buf_size, GFP_KERNEL);
+	if (!buf)
+		return false;
+
+	ret = virtio_pci_admin_list_query(pdev, buf, buf_size);
+	if (ret)
+		goto end;
+
+	if ((le64_to_cpup((__le64 *)buf) & VIRTIOVF_USE_ADMIN_CMD_BITMAP) !=
+		VIRTIOVF_USE_ADMIN_CMD_BITMAP) {
+		ret = -EOPNOTSUPP;
+		goto end;
+	}
+
+	/* Confirm the used commands */
+	memset(buf, 0, buf_size);
+	*(__le64 *)buf = cpu_to_le64(VIRTIOVF_USE_ADMIN_CMD_BITMAP);
+	ret = virtio_pci_admin_list_use(pdev, buf, buf_size);
+end:
+	kfree(buf);
+	return ret ? false : true;
+}
+
+static int virtiovf_pci_probe(struct pci_dev *pdev,
+			      const struct pci_device_id *id)
+{
+	const struct vfio_device_ops *ops = &virtiovf_acc_vfio_pci_ops;
+	struct virtiovf_pci_core_device *virtvdev;
+	int ret;
+
+	if (pdev->is_virtfn && virtiovf_support_legacy_access(pdev) &&
+	    !virtiovf_bar0_exists(pdev) && pdev->msix_cap)
+		ops = &virtiovf_acc_vfio_pci_tran_ops;
+
+	virtvdev = vfio_alloc_device(virtiovf_pci_core_device, core_device.vdev,
+				     &pdev->dev, ops);
+	if (IS_ERR(virtvdev))
+		return PTR_ERR(virtvdev);
+
+	dev_set_drvdata(&pdev->dev, &virtvdev->core_device);
+	ret = vfio_pci_core_register_device(&virtvdev->core_device);
+	if (ret)
+		goto out;
+	return 0;
+out:
+	vfio_put_device(&virtvdev->core_device.vdev);
+	return ret;
+}
+
+static void virtiovf_pci_remove(struct pci_dev *pdev)
+{
+	struct virtiovf_pci_core_device *virtvdev = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(&virtvdev->core_device);
+	vfio_put_device(&virtvdev->core_device.vdev);
+}
+
+static const struct pci_device_id virtiovf_pci_table[] = {
+	/* Only virtio-net is supported/tested so far */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, virtiovf_pci_table);
+
+static struct pci_driver virtiovf_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = virtiovf_pci_table,
+	.probe = virtiovf_pci_probe,
+	.remove = virtiovf_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(virtiovf_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>");
+MODULE_DESCRIPTION(
+	"VIRTIO VFIO PCI - User Level meta-driver for VIRTIO device family");