mbox series

[v2,0/4] vfio-pci support pasid attach/detach

Message ID 20240412082121.33382-1-yi.l.liu@intel.com (mailing list archive)
Headers show
Series vfio-pci support pasid attach/detach | expand

Message

Yi Liu April 12, 2024, 8:21 a.m. UTC
This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).

This series first adds the helpers for pasid attach in vfio core and then
adds the device cdev ioctls for pasid attach/detach, finally exposing the
device PASID capability to the user. It depends on the iommufd pasid
attach/detach series [1].

A userspace VMM is supposed to get the details of the device's PASID capability
and assemble a virtual PASID capability in a proper offset in the virtual PCI
configuration space. While it is still an open on how to get the available
offsets. Devices may have hidden bits that are not in the PCI cap chain. For
now, there are two options to get the available offsets.[2]

- Report the available offsets via ioctl. This requires device-specific logic
  to provide available offsets. e.g., vfio-pci variant driver. Or may the device
  provide the available offset by DVSEC.
- Store the available offsets in a static table in userspace VMM. VMM gets the
  empty offsets from this table.

Since there was no clear answer to it, I have not made much change on this in
this version. Wish we could have more discussions on this.

The completed code can be found at [3], tested with a hacky Qemu branch [4].

[1] https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/kvm/b3e07591-8ebc-4924-85fe-29a46fc73d78@intel.com/
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[4] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-test-pasid

Change log:

v2:
 - Use IDA to track if PASID is attached or not in VFIO. (Jason)
 - Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex)
 - Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong)
 - Minor tweaks in comments (Kevin)

v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/
 - Report PASID capability via VFIO_DEVICE_FEATURE (Alex)

rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.com/

Regards,
        Yi Liu

Yi Liu (4):
  ida: Add ida_get_lowest()
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

 drivers/vfio/device_cdev.c       | 51 +++++++++++++++++++++++
 drivers/vfio/iommufd.c           | 60 +++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c      |  2 +
 drivers/vfio/pci/vfio_pci_core.c | 50 +++++++++++++++++++++++
 drivers/vfio/vfio.h              |  4 ++
 drivers/vfio/vfio_main.c         |  8 ++++
 include/linux/idr.h              |  1 +
 include/linux/vfio.h             | 11 +++++
 include/uapi/linux/vfio.h        | 69 ++++++++++++++++++++++++++++++++
 lib/idr.c                        | 67 +++++++++++++++++++++++++++++++
 10 files changed, 323 insertions(+)

Comments

Tian, Kevin April 16, 2024, 8:38 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> A userspace VMM is supposed to get the details of the device's PASID
> capability
> and assemble a virtual PASID capability in a proper offset in the virtual PCI
> configuration space. While it is still an open on how to get the available
> offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> now, there are two options to get the available offsets.[2]
> 
> - Report the available offsets via ioctl. This requires device-specific logic
>   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
>   provide the available offset by DVSEC.
> - Store the available offsets in a static table in userspace VMM. VMM gets the
>   empty offsets from this table.
> 

I'm not a fan of requesting a variant driver for every PASID-capable
VF just for the purpose of reporting a free range in the PCI config space.

It's easier to do that quirk in userspace.

But I like Alex's original comment that at least for PF there is no reason
to hide the offset. there could be a flag+field to communicate it. or 
if there will be a new variant VF driver for other purposes e.g. migration
it can certainly fill the field too.
Jason Gunthorpe April 16, 2024, 5:50 p.m. UTC | #2
On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, April 12, 2024 4:21 PM
> > 
> > A userspace VMM is supposed to get the details of the device's PASID
> > capability
> > and assemble a virtual PASID capability in a proper offset in the virtual PCI
> > configuration space. While it is still an open on how to get the available
> > offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> > now, there are two options to get the available offsets.[2]
> > 
> > - Report the available offsets via ioctl. This requires device-specific logic
> >   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
> >   provide the available offset by DVSEC.
> > - Store the available offsets in a static table in userspace VMM. VMM gets the
> >   empty offsets from this table.
> > 
> 
> I'm not a fan of requesting a variant driver for every PASID-capable
> VF just for the purpose of reporting a free range in the PCI config space.
> 
> It's easier to do that quirk in userspace.
> 
> But I like Alex's original comment that at least for PF there is no reason
> to hide the offset. there could be a flag+field to communicate it. or 
> if there will be a new variant VF driver for other purposes e.g. migration
> it can certainly fill the field too.

Yes, since this has been such a sticking point can we get a clean
series that just enables it for PF and then come with a solution for
VF?

Jason
Tian, Kevin April 17, 2024, 7:16 a.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 17, 2024 1:50 AM
> 
> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, April 12, 2024 4:21 PM
> > >
> > > A userspace VMM is supposed to get the details of the device's PASID
> > > capability
> > > and assemble a virtual PASID capability in a proper offset in the virtual
> PCI
> > > configuration space. While it is still an open on how to get the available
> > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> For
> > > now, there are two options to get the available offsets.[2]
> > >
> > > - Report the available offsets via ioctl. This requires device-specific logic
> > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> device
> > >   provide the available offset by DVSEC.
> > > - Store the available offsets in a static table in userspace VMM. VMM gets
> the
> > >   empty offsets from this table.
> > >
> >
> > I'm not a fan of requesting a variant driver for every PASID-capable
> > VF just for the purpose of reporting a free range in the PCI config space.
> >
> > It's easier to do that quirk in userspace.
> >
> > But I like Alex's original comment that at least for PF there is no reason
> > to hide the offset. there could be a flag+field to communicate it. or
> > if there will be a new variant VF driver for other purposes e.g. migration
> > it can certainly fill the field too.
> 
> Yes, since this has been such a sticking point can we get a clean
> series that just enables it for PF and then come with a solution for
> VF?
> 

sure but we at least need to reach consensus on a minimal required
uapi covering both PF/VF to move forward so the user doesn't need
to touch different contracts for PF vs. VF.
Jason Gunthorpe April 17, 2024, 12:20 p.m. UTC | #4
On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 17, 2024 1:50 AM
> > 
> > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Friday, April 12, 2024 4:21 PM
> > > >
> > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > capability
> > > > and assemble a virtual PASID capability in a proper offset in the virtual
> > PCI
> > > > configuration space. While it is still an open on how to get the available
> > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> > For
> > > > now, there are two options to get the available offsets.[2]
> > > >
> > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > device
> > > >   provide the available offset by DVSEC.
> > > > - Store the available offsets in a static table in userspace VMM. VMM gets
> > the
> > > >   empty offsets from this table.
> > > >
> > >
> > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > VF just for the purpose of reporting a free range in the PCI config space.
> > >
> > > It's easier to do that quirk in userspace.
> > >
> > > But I like Alex's original comment that at least for PF there is no reason
> > > to hide the offset. there could be a flag+field to communicate it. or
> > > if there will be a new variant VF driver for other purposes e.g. migration
> > > it can certainly fill the field too.
> > 
> > Yes, since this has been such a sticking point can we get a clean
> > series that just enables it for PF and then come with a solution for
> > VF?
> > 
> 
> sure but we at least need to reach consensus on a minimal required
> uapi covering both PF/VF to move forward so the user doesn't need
> to touch different contracts for PF vs. VF.

Do we? The situation where the VMM needs to wholly make a up a PASID
capability seems completely new and seperate from just using an
existing PASID capability as in the PF case.

If it needs to make another system call or otherwise to do that then
that seems fine to do incrementally??

Jason
Alex Williamson April 17, 2024, 11:02 p.m. UTC | #5
On Wed, 17 Apr 2024 09:20:51 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > 
> > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > >
> > > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > > capability
> > > > > and assemble a virtual PASID capability in a proper offset in the virtual  
> > > PCI  
> > > > > configuration space. While it is still an open on how to get the available
> > > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.  
> > > For  
> > > > > now, there are two options to get the available offsets.[2]
> > > > >
> > > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> > > device  
> > > > >   provide the available offset by DVSEC.
> > > > > - Store the available offsets in a static table in userspace VMM. VMM gets  
> > > the  
> > > > >   empty offsets from this table.
> > > > >  
> > > >
> > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > VF just for the purpose of reporting a free range in the PCI config space.
> > > >
> > > > It's easier to do that quirk in userspace.
> > > >
> > > > But I like Alex's original comment that at least for PF there is no reason
> > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > if there will be a new variant VF driver for other purposes e.g. migration
> > > > it can certainly fill the field too.  
> > > 
> > > Yes, since this has been such a sticking point can we get a clean
> > > series that just enables it for PF and then come with a solution for
> > > VF?
> > >   
> > 
> > sure but we at least need to reach consensus on a minimal required
> > uapi covering both PF/VF to move forward so the user doesn't need
> > to touch different contracts for PF vs. VF.  
> 
> Do we? The situation where the VMM needs to wholly make a up a PASID
> capability seems completely new and seperate from just using an
> existing PASID capability as in the PF case.

But we don't actually expose the PASID capability on the PF and as
argued in path 4/ we can't because it would break existing userspace.

> If it needs to make another system call or otherwise to do that then
> that seems fine to do incrementally??

With PASID currently hidden, VF and PF support really seem too similar
not to handle them both at the same time.  What's missing is a
mechanism to describe unused config space where userspace can implement
an emulated PASID capability.  Defining a DVSEC capability to handle
this seemed to be the preferred solution from the LPC discussion.  PF
and VF devices that implement a TBD DVSEC capability could avoid
needing a variant driver.  Until then we need to be careful not to
undermine a future world with that preferred solution by introducing
side-band mechanisms which only work for variant drivers and PF
devices.  Thanks,

Alex
Tian, Kevin April 18, 2024, 12:06 a.m. UTC | #6
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 7:02 AM
> 
> On Wed, 17 Apr 2024 09:20:51 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > >
> > > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > > >
> > > > > > A userspace VMM is supposed to get the details of the device's
> PASID
> > > > > > capability
> > > > > > and assemble a virtual PASID capability in a proper offset in the
> virtual
> > > > PCI
> > > > > > configuration space. While it is still an open on how to get the
> available
> > > > > > offsets. Devices may have hidden bits that are not in the PCI cap
> chain.
> > > > For
> > > > > > now, there are two options to get the available offsets.[2]
> > > > > >
> > > > > > - Report the available offsets via ioctl. This requires device-specific
> logic
> > > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > > > device
> > > > > >   provide the available offset by DVSEC.
> > > > > > - Store the available offsets in a static table in userspace VMM.
> VMM gets
> > > > the
> > > > > >   empty offsets from this table.
> > > > > >
> > > > >
> > > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > > VF just for the purpose of reporting a free range in the PCI config
> space.
> > > > >
> > > > > It's easier to do that quirk in userspace.
> > > > >
> > > > > But I like Alex's original comment that at least for PF there is no
> reason
> > > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > > if there will be a new variant VF driver for other purposes e.g.
> migration
> > > > > it can certainly fill the field too.
> > > >
> > > > Yes, since this has been such a sticking point can we get a clean
> > > > series that just enables it for PF and then come with a solution for
> > > > VF?
> > > >
> > >
> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.
> >
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing userspace.

Come back to this statement.

Does 'break' means that legacy Qemu will crash due to a guest write
to the read-only PASID capability, or just a conceptually functional
break i.e. non-faithful emulation due to writes being dropped?

If the latter it's probably not a bad idea to allow exposing the PASID
capability on the PF as a sane guest shouldn't enable the PASID
capability w/o seeing vIOMMU supporting PASID. And there is no
status bit defined in the PASID capability to check back so even
if an insane guest wants to blindly enable PASID it will naturally
write and done. The only niche case is that the enable bits are
defined as RW so ideally reading back those bits should get the
latest written value. But probably this can be tolerated?

With that then should we consider exposing the PASID capability
in PCI config space as the first option? For PF it's simple as how
other caps are exposed. For VF a variant driver can also fake the
PASID capability or emulate a DVSEC capability for unused space
(to motivate the physical implementation so no variant driver is
required in the future)

This allows a staging approach as Jason envisioned.
Yi Liu April 18, 2024, 9:03 a.m. UTC | #7
On 2024/4/18 08:06, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 7:02 AM
>>
>> On Wed, 17 Apr 2024 09:20:51 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Sent: Wednesday, April 17, 2024 1:50 AM
>>>>>
>>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Sent: Friday, April 12, 2024 4:21 PM
>>>>>>>
>>>>>>> A userspace VMM is supposed to get the details of the device's
>> PASID
>>>>>>> capability
>>>>>>> and assemble a virtual PASID capability in a proper offset in the
>> virtual
>>>>> PCI
>>>>>>> configuration space. While it is still an open on how to get the
>> available
>>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap
>> chain.
>>>>> For
>>>>>>> now, there are two options to get the available offsets.[2]
>>>>>>>
>>>>>>> - Report the available offsets via ioctl. This requires device-specific
>> logic
>>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the
>>>>> device
>>>>>>>    provide the available offset by DVSEC.
>>>>>>> - Store the available offsets in a static table in userspace VMM.
>> VMM gets
>>>>> the
>>>>>>>    empty offsets from this table.
>>>>>>>
>>>>>>
>>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
>>>>>> VF just for the purpose of reporting a free range in the PCI config
>> space.
>>>>>>
>>>>>> It's easier to do that quirk in userspace.
>>>>>>
>>>>>> But I like Alex's original comment that at least for PF there is no
>> reason
>>>>>> to hide the offset. there could be a flag+field to communicate it. or
>>>>>> if there will be a new variant VF driver for other purposes e.g.
>> migration
>>>>>> it can certainly fill the field too.
>>>>>
>>>>> Yes, since this has been such a sticking point can we get a clean
>>>>> series that just enables it for PF and then come with a solution for
>>>>> VF?
>>>>>
>>>>
>>>> sure but we at least need to reach consensus on a minimal required
>>>> uapi covering both PF/VF to move forward so the user doesn't need
>>>> to touch different contracts for PF vs. VF.
>>>
>>> Do we? The situation where the VMM needs to wholly make a up a PASID
>>> capability seems completely new and seperate from just using an
>>> existing PASID capability as in the PF case.
>>
>> But we don't actually expose the PASID capability on the PF and as
>> argued in path 4/ we can't because it would break existing userspace.
> > Come back to this statement.
> 
> Does 'break' means that legacy Qemu will crash due to a guest write
> to the read-only PASID capability, or just a conceptually functional
> break i.e. non-faithful emulation due to writes being dropped?
> 
> If the latter it's probably not a bad idea to allow exposing the PASID
> capability on the PF as a sane guest shouldn't enable the PASID
> capability w/o seeing vIOMMU supporting PASID. And there is no
> status bit defined in the PASID capability to check back so even
> if an insane guest wants to blindly enable PASID it will naturally
> write and done. The only niche case is that the enable bits are
> defined as RW so ideally reading back those bits should get the
> latest written value. But probably this can be tolerated?
> 
> With that then should we consider exposing the PASID capability
> in PCI config space as the first option? For PF it's simple as how
> other caps are exposed. For VF a variant driver can also fake the
> PASID capability or emulate a DVSEC capability for unused space
> (to motivate the physical implementation so no variant driver is
> required in the future)

If kernel exposes pasid cap for PF same as other caps, and in the meantime
the variant driver chooses to emulate a DVSEC cap, then userspace follows
the below steps to expose pasid cap to VM.

1) Check if a pasid cap is already present in the virtual config space
    read from kernel. If no, but user wants pasid, then goto step 2).
2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
    pasid cap. If yes, goto step 3).
3) Userspace gets an available offset via reading the DVSEC cap.
4) Userspace assembles a pasid cap and inserts it to the vconfig space.

For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
This is a bit different from what we planned at the beginning. But sounds
doable if we want to pursue the staging direction.
Alex Williamson April 18, 2024, 8:37 p.m. UTC | #8
On Thu, 18 Apr 2024 17:03:15 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/18 08:06, Tian, Kevin wrote:
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Thursday, April 18, 2024 7:02 AM
> >>
> >> On Wed, 17 Apr 2024 09:20:51 -0300
> >> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>  
> >>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:  
> >>>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>>> Sent: Wednesday, April 17, 2024 1:50 AM
> >>>>>
> >>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> >>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>>> Sent: Friday, April 12, 2024 4:21 PM
> >>>>>>>
> >>>>>>> A userspace VMM is supposed to get the details of the device's  
> >> PASID  
> >>>>>>> capability
> >>>>>>> and assemble a virtual PASID capability in a proper offset in the  
> >> virtual  
> >>>>> PCI  
> >>>>>>> configuration space. While it is still an open on how to get the  
> >> available  
> >>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap  
> >> chain.  
> >>>>> For  
> >>>>>>> now, there are two options to get the available offsets.[2]
> >>>>>>>
> >>>>>>> - Report the available offsets via ioctl. This requires device-specific  
> >> logic  
> >>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> >>>>> device  
> >>>>>>>    provide the available offset by DVSEC.
> >>>>>>> - Store the available offsets in a static table in userspace VMM.  
> >> VMM gets  
> >>>>> the  
> >>>>>>>    empty offsets from this table.
> >>>>>>>  
> >>>>>>
> >>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
> >>>>>> VF just for the purpose of reporting a free range in the PCI config  
> >> space.  
> >>>>>>
> >>>>>> It's easier to do that quirk in userspace.
> >>>>>>
> >>>>>> But I like Alex's original comment that at least for PF there is no  
> >> reason  
> >>>>>> to hide the offset. there could be a flag+field to communicate it. or
> >>>>>> if there will be a new variant VF driver for other purposes e.g.  
> >> migration  
> >>>>>> it can certainly fill the field too.  
> >>>>>
> >>>>> Yes, since this has been such a sticking point can we get a clean
> >>>>> series that just enables it for PF and then come with a solution for
> >>>>> VF?
> >>>>>  
> >>>>
> >>>> sure but we at least need to reach consensus on a minimal required
> >>>> uapi covering both PF/VF to move forward so the user doesn't need
> >>>> to touch different contracts for PF vs. VF.  
> >>>
> >>> Do we? The situation where the VMM needs to wholly make a up a PASID
> >>> capability seems completely new and seperate from just using an
> >>> existing PASID capability as in the PF case.  
> >>
> >> But we don't actually expose the PASID capability on the PF and as
> >> argued in path 4/ we can't because it would break existing userspace.
> > > Come back to this statement.  
> > 
> > Does 'break' means that legacy Qemu will crash due to a guest write
> > to the read-only PASID capability, or just a conceptually functional
> > break i.e. non-faithful emulation due to writes being dropped?

I expect more the latter.

> > If the latter it's probably not a bad idea to allow exposing the PASID
> > capability on the PF as a sane guest shouldn't enable the PASID
> > capability w/o seeing vIOMMU supporting PASID. And there is no
> > status bit defined in the PASID capability to check back so even
> > if an insane guest wants to blindly enable PASID it will naturally
> > write and done. The only niche case is that the enable bits are
> > defined as RW so ideally reading back those bits should get the
> > latest written value. But probably this can be tolerated?

Some degree of inconsistency is likely tolerated, the guest is unlikely
to check that a RW bit was set or cleared.  How would we virtualize the
control registers for a VF and are they similarly virtualized for a PF
or would we allow the guest to manipulate the physical PASID control
registers?

> > With that then should we consider exposing the PASID capability
> > in PCI config space as the first option? For PF it's simple as how
> > other caps are exposed. For VF a variant driver can also fake the
> > PASID capability or emulate a DVSEC capability for unused space
> > (to motivate the physical implementation so no variant driver is
> > required in the future)  
> 
> If kernel exposes pasid cap for PF same as other caps, and in the meantime
> the variant driver chooses to emulate a DVSEC cap, then userspace follows
> the below steps to expose pasid cap to VM.

If we have a variant driver, why wouldn't it expose an emulated PASID
capability rather than a DVSEC if we're choosing to expose PASID for
PFs?

> 
> 1) Check if a pasid cap is already present in the virtual config space
>     read from kernel. If no, but user wants pasid, then goto step 2).
> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>     pasid cap. If yes, goto step 3).

Why do we need the vfio feature interface if a physical or virtual PASID
capability on the device exposes the same info?

> 3) Userspace gets an available offset via reading the DVSEC cap.

What's the scenario where we'd have a VF wanting to expose PASID
support which doesn't also have a variant driver that could implement a
virtual PASID?

> 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> 
> For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> This is a bit different from what we planned at the beginning. But sounds
> doable if we want to pursue the staging direction.

Seems like if we decide that we can just expose the PASID capability
for a PF then we should just have any VF variant drivers also implement
a virtual PASID capability.  In this case DVSEC would only be used to
provide information for a purely userspace emulation of PASID (in which
case it also wouldn't necessarily need the vfio feature because it
might implicitly know the PASID capabilities of the device).  Thanks,

Alex
Tian, Kevin April 19, 2024, 5:52 a.m. UTC | #9
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 19, 2024 4:38 AM
> 
> On Thu, 18 Apr 2024 17:03:15 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > On 2024/4/18 08:06, Tian, Kevin wrote:
> > >> From: Alex Williamson <alex.williamson@redhat.com>
> > >> Sent: Thursday, April 18, 2024 7:02 AM
> > >>
> > >> But we don't actually expose the PASID capability on the PF and as
> > >> argued in path 4/ we can't because it would break existing userspace.
> > > > Come back to this statement.
> > >
> > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > to the read-only PASID capability, or just a conceptually functional
> > > break i.e. non-faithful emulation due to writes being dropped?
> 
> I expect more the latter.
> 
> > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > capability on the PF as a sane guest shouldn't enable the PASID
> > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > status bit defined in the PASID capability to check back so even
> > > if an insane guest wants to blindly enable PASID it will naturally
> > > write and done. The only niche case is that the enable bits are
> > > defined as RW so ideally reading back those bits should get the
> > > latest written value. But probably this can be tolerated?
> 
> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

it's shared so the guest shouldn't be allowed to touch the physical
register.

Even for PF this is virtualized as the physical control is toggled by
the iommu driver today. We discussed before whether there is a
value moving the control to device driver but the conclusion is no.

> 
> > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> >
> > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > This is a bit different from what we planned at the beginning. But sounds
> > doable if we want to pursue the staging direction.
> 
> Seems like if we decide that we can just expose the PASID capability
> for a PF then we should just have any VF variant drivers also implement
> a virtual PASID capability.  In this case DVSEC would only be used to

I'm leaning toward this direction now.

> provide information for a purely userspace emulation of PASID (in which
> case it also wouldn't necessarily need the vfio feature because it
> might implicitly know the PASID capabilities of the device).  Thanks,
> 

that's a good point. Then no new contract is required.

and allowing variant driver to implement a virtual PASID capability
seems also make a room for making a shared variant driver to host
a table of virtual capabilities (both offset and content) for VFs, just
as discussed in patch4 having a shared driver to host a table for DVSEC?

Along this route probably most vendors will just extend the table in
the shared driver, leading to decreased value on DVSEC and question
on its necessity...

then it's back to the quirk-in-kernel approach... but if simple enough
probably not a bad idea to pursue? 
Jason Gunthorpe April 19, 2024, 1:34 p.m. UTC | #10
On Wed, Apr 17, 2024 at 05:02:16PM -0600, Alex Williamson wrote:

> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.  
> > 
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing
> userspace.

Are we sure about that argument? Exposing the PF's PASID cap to qemu
shouldn't break qemu at all - it will just pass it over into a vPCI
function?

Ultimately the guest will observe a vPCI device with a PASID cap. This
is not really so different from plugging in a new PCI device with the
PASID cap into an old HW system that doesn't support PASID.

So does a guest break? I'd expect no - the viommu's created by qemu
should not advertise PASID support already. So the guest can't use
PASID - just like an old HW system.

Is there a known concrete thing that breaks there?

> > If it needs to make another system call or otherwise to do that then
> > that seems fine to do incrementally??
> 
> With PASID currently hidden, VF and PF support really seem too similar
> not to handle them both at the same time.  What's missing is a
> mechanism to describe unused config space where userspace can implement
> an emulated PASID capability. 

Yes, sure the unused config space is a great idea. I thought we were
talking about doing the fixup in userspace, but a kernel solution is
good too.

But also if we are set on the dvsec, in kernel or user, then we can
move ahead with the core PASID enablement immediately?

Jason
Jason Gunthorpe April 19, 2024, 1:59 p.m. UTC | #11
On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:

> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

No, the OS owns the physical PASID control. If the platform IOMMU
knows how to parse PASID then PASID support is turned on and left on
at boot time.

There should be no guest visible difference to not supporting global
PASID disable, and we can't even implement it for VFs anyhow.

Same sort of argument for ATS/etc

> > If kernel exposes pasid cap for PF same as other caps, and in the meantime
> > the variant driver chooses to emulate a DVSEC cap, then userspace follows
> > the below steps to expose pasid cap to VM.
> 
> If we have a variant driver, why wouldn't it expose an emulated PASID
> capability rather than a DVSEC if we're choosing to expose PASID for
> PFs?

Indeed, also an option. Supplying the DVSEC is probably simpler and
addresses other synthesized capability blocks in future. VMM is a
better place to build various synthetic blocks in general, IMHO.

New VMM's could parse the PF PASID cap and add it to its list of "free
space"

We may also be overdoing it here..

Maybe if the VMM wants to enable PASID we should flip the logic and
the VMM should assume that unused config space is safe to use. Only
devices that violate that rule need to join an ID list and provide a
DVSEC/free space list/etc.

I'm guessing that list will be pretty small and hopefully will not
grow. It is easy and better for future devices to wrap their hidden
registers in a private DVSEC.

Then I'd suggest just writing the special list in a text file and
leaving it in the VMM side.. Users can adjust the text file right away
if they have old and troublesome devices and all VMMs can share it.

> > 1) Check if a pasid cap is already present in the virtual config space
> >     read from kernel. If no, but user wants pasid, then goto step 2).
> > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> >     pasid cap. If yes, goto step 3).
> 
> Why do we need the vfio feature interface if a physical or virtual PASID
> capability on the device exposes the same info?

Still need to check if the platform, os, iommu, etc are all OK with
enabling PASID support before the viommu advertises it.

Jason
Alex Williamson April 19, 2024, 4:35 p.m. UTC | #12
On Fri, 19 Apr 2024 05:52:01 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 19, 2024 4:38 AM
> > 
> > On Thu, 18 Apr 2024 17:03:15 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > On 2024/4/18 08:06, Tian, Kevin wrote:  
> > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > >>
> > > >> But we don't actually expose the PASID capability on the PF and as
> > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > Come back to this statement.  
> > > >
> > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > to the read-only PASID capability, or just a conceptually functional
> > > > break i.e. non-faithful emulation due to writes being dropped?  
> > 
> > I expect more the latter.
> >   
> > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > status bit defined in the PASID capability to check back so even
> > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > write and done. The only niche case is that the enable bits are
> > > > defined as RW so ideally reading back those bits should get the
> > > > latest written value. But probably this can be tolerated?  
> > 
> > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > to check that a RW bit was set or cleared.  How would we virtualize the
> > control registers for a VF and are they similarly virtualized for a PF
> > or would we allow the guest to manipulate the physical PASID control
> > registers?  
> 
> it's shared so the guest shouldn't be allowed to touch the physical
> register.
> 
> Even for PF this is virtualized as the physical control is toggled by
> the iommu driver today. We discussed before whether there is a
> value moving the control to device driver but the conclusion is no.

So in both cases we virtualize the PASID bits in the vfio variant
driver in order to maintain spec compliant behavior of the register
(ie. the control bits are RW with no underlying hardware effect and
capability bits only reflect the features enabled by the host in the
control register)?

> > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > >
> > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > This is a bit different from what we planned at the beginning. But sounds
> > > doable if we want to pursue the staging direction.  
> > 
> > Seems like if we decide that we can just expose the PASID capability
> > for a PF then we should just have any VF variant drivers also implement
> > a virtual PASID capability.  In this case DVSEC would only be used to  
> 
> I'm leaning toward this direction now.
> 
> > provide information for a purely userspace emulation of PASID (in which
> > case it also wouldn't necessarily need the vfio feature because it
> > might implicitly know the PASID capabilities of the device).  Thanks,
> >   
> 
> that's a good point. Then no new contract is required.
> 
> and allowing variant driver to implement a virtual PASID capability
> seems also make a room for making a shared variant driver to host
> a table of virtual capabilities (both offset and content) for VFs, just
> as discussed in patch4 having a shared driver to host a table for DVSEC?

Yes, vfio-pci-core would support virtualizing the PF PASID capability
mapped 1:1 at the physical PASID location.  We should architect that
support to be easily reused for a driver provided offset for the VF use
case and then we'd need to decide if a lookup table to associate an
offset to a VF vendor:device warrants a variant driver (which could be
shared by multiple devices) or if we'd accept that into vfio-pci-core.

> Along this route probably most vendors will just extend the table in
> the shared driver, leading to decreased value on DVSEC and question
> on its necessity...
> 
> then it's back to the quirk-in-kernel approach... but if simple enough
> probably not a bad idea to pursue? 
Tian, Kevin April 23, 2024, 7:43 a.m. UTC | #13
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 20, 2024 12:36 AM
> 
> On Fri, 19 Apr 2024 05:52:01 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 19, 2024 4:38 AM
> > >
> > > On Thu, 18 Apr 2024 17:03:15 +0800
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > On 2024/4/18 08:06, Tian, Kevin wrote:
> > > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > > >>
> > > > >> But we don't actually expose the PASID capability on the PF and as
> > > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > > Come back to this statement.
> > > > >
> > > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > > to the read-only PASID capability, or just a conceptually functional
> > > > > break i.e. non-faithful emulation due to writes being dropped?
> > >
> > > I expect more the latter.
> > >
> > > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > > status bit defined in the PASID capability to check back so even
> > > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > > write and done. The only niche case is that the enable bits are
> > > > > defined as RW so ideally reading back those bits should get the
> > > > > latest written value. But probably this can be tolerated?
> > >
> > > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > > to check that a RW bit was set or cleared.  How would we virtualize the
> > > control registers for a VF and are they similarly virtualized for a PF
> > > or would we allow the guest to manipulate the physical PASID control
> > > registers?
> >
> > it's shared so the guest shouldn't be allowed to touch the physical
> > register.
> >
> > Even for PF this is virtualized as the physical control is toggled by
> > the iommu driver today. We discussed before whether there is a
> > value moving the control to device driver but the conclusion is no.
> 
> So in both cases we virtualize the PASID bits in the vfio variant
> driver in order to maintain spec compliant behavior of the register
> (ie. the control bits are RW with no underlying hardware effect and
> capability bits only reflect the features enabled by the host in the
> control register)?

yes

> 
> > > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > > >
> > > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > > This is a bit different from what we planned at the beginning. But
> sounds
> > > > doable if we want to pursue the staging direction.
> > >
> > > Seems like if we decide that we can just expose the PASID capability
> > > for a PF then we should just have any VF variant drivers also implement
> > > a virtual PASID capability.  In this case DVSEC would only be used to
> >
> > I'm leaning toward this direction now.
> >
> > > provide information for a purely userspace emulation of PASID (in which
> > > case it also wouldn't necessarily need the vfio feature because it
> > > might implicitly know the PASID capabilities of the device).  Thanks,
> > >
> >
> > that's a good point. Then no new contract is required.
> >
> > and allowing variant driver to implement a virtual PASID capability
> > seems also make a room for making a shared variant driver to host
> > a table of virtual capabilities (both offset and content) for VFs, just
> > as discussed in patch4 having a shared driver to host a table for DVSEC?
> 
> Yes, vfio-pci-core would support virtualizing the PF PASID capability
> mapped 1:1 at the physical PASID location.  We should architect that
> support to be easily reused for a driver provided offset for the VF use
> case and then we'd need to decide if a lookup table to associate an
> offset to a VF vendor:device warrants a variant driver (which could be
> shared by multiple devices) or if we'd accept that into vfio-pci-core.

yes

> 
> > Along this route probably most vendors will just extend the table in
> > the shared driver, leading to decreased value on DVSEC and question
> > on its necessity...
> >
> > then it's back to the quirk-in-kernel approach... but if simple enough
> > probably not a bad idea to pursue? 
Yi Liu April 23, 2024, 7:58 a.m. UTC | #14
On 2024/4/19 21:59, Jason Gunthorpe wrote:
> On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:
> 
>> Some degree of inconsistency is likely tolerated, the guest is unlikely
>> to check that a RW bit was set or cleared.  How would we virtualize the
>> control registers for a VF and are they similarly virtualized for a PF
>> or would we allow the guest to manipulate the physical PASID control
>> registers?
> 
> No, the OS owns the physical PASID control. If the platform IOMMU
> knows how to parse PASID then PASID support is turned on and left on
> at boot time.

I think you mean host os. right?

> There should be no guest visible difference to not supporting global
> PASID disable, and we can't even implement it for VFs anyhow.
> 
> Same sort of argument for ATS/etc
> 
>>> If kernel exposes pasid cap for PF same as other caps, and in the meantime
>>> the variant driver chooses to emulate a DVSEC cap, then userspace follows
>>> the below steps to expose pasid cap to VM.
>>
>> If we have a variant driver, why wouldn't it expose an emulated PASID
>> capability rather than a DVSEC if we're choosing to expose PASID for
>> PFs?
> 
> Indeed, also an option. Supplying the DVSEC is probably simpler and
> addresses other synthesized capability blocks in future. VMM is a
> better place to build various synthetic blocks in general, IMHO.
> 
> New VMM's could parse the PF PASID cap and add it to its list of "free
> space"
> 
> We may also be overdoing it here..
> 
> Maybe if the VMM wants to enable PASID we should flip the logic and
> the VMM should assume that unused config space is safe to use. Only
> devices that violate that rule need to join an ID list and provide a
> DVSEC/free space list/etc.

So, if the kernel decides to hide a specific physical capability, the
space of this capability would be considered as free to use as well.
is it?

> I'm guessing that list will be pretty small and hopefully will not
> grow.

any channel to collect this kind of info? :)

> It is easy and better for future devices to wrap their hidden
> registers in a private DVSEC.

hmmm, do you mean include the registers a DVSEC hence userspace can
work out the free space by iterating the cap chain? or still mean
indicating the free spaces by DVSEC? I guess the prior one.

> Then I'd suggest just writing the special list in a text file and
> leaving it in the VMM side.. Users can adjust the text file right away
> if they have old and troublesome devices and all VMMs can share it.

So for the existing devices that have both pasid cap and hidden registers,
userspace should add them in the special list, and work out the free space
by referring the file. While for the devices that only have pasid cap, or
have the hidden register in a DVSEC, userspace finds a free space by
iterating the cap chain. This seems to be general for today and future.

>>> 1) Check if a pasid cap is already present in the virtual config space
>>>      read from kernel. If no, but user wants pasid, then goto step 2).
>>> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>>>      pasid cap. If yes, goto step 3).
>>
>> Why do we need the vfio feature interface if a physical or virtual PASID
>> capability on the device exposes the same info?
> 
> Still need to check if the platform, os, iommu, etc are all OK with
> enabling PASID support before the viommu advertises it.

This means we don't expose physical or virtual PASID cap, is it? Otherwise,
host kernel could check if pasid is enabled before exposing the PASID cap.
Jason Gunthorpe April 23, 2024, 12:01 p.m. UTC | #15
On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> I'm not sure how userspace can fully handle this w/o certain assistance
> from the kernel.
> 
> So I kind of agree that emulated PASID capability is probably the only
> contract which the kernel should provide:
>   - mapped 1:1 at the physical location, or
>   - constructed at an offset according to DVSEC, or
>   - constructed at an offset according to a look-up table
> 
> The VMM always scans the vfio pci config space to expose vPASID.
> 
> Then the remaining open is what VMM could do when a VF supports
> PASID but unfortunately it's not reported by vfio. W/o the capability
> of inspecting the PASID state of PF, probably the only feasible option
> is to maintain a look-up table in VMM itself and assumes the kernel
> always enables the PASID cap on PF.

I'm still not sure I like doing this in the kernel - we need to do the
same sort of thing for ATS too, right?

It feels simpler if the indicates if PASID and ATS can be supported
and userspace builds the capability blocks.

There are migration considerations too - the blocks need to be
migrated over and end up in the same place as well..

Jason
Jason Gunthorpe April 23, 2024, 12:05 p.m. UTC | #16
On Tue, Apr 23, 2024 at 03:58:17PM +0800, Yi Liu wrote:
> > > > 1) Check if a pasid cap is already present in the virtual config space
> > > >      read from kernel. If no, but user wants pasid, then goto step 2).
> > > > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> > > >      pasid cap. If yes, goto step 3).
> > > 
> > > Why do we need the vfio feature interface if a physical or virtual PASID
> > > capability on the device exposes the same info?
> > 
> > Still need to check if the platform, os, iommu, etc are all OK with
> > enabling PASID support before the viommu advertises it.
> 
> This means we don't expose physical or virtual PASID cap, is it?

Yeah keep hiding both still works. Some kind of test to see if PASID
and ATS are supportable on the device. Probably via
IOMMUFD_CMD_GET_HW_INFO

If they are supported then the VMM will find empty space and generate
the missing caps. The VMM will have means to fix the location during
migration.

Jason
Tian, Kevin April 23, 2024, 11:47 p.m. UTC | #17
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:02 PM
> 
> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > I'm not sure how userspace can fully handle this w/o certain assistance
> > from the kernel.
> >
> > So I kind of agree that emulated PASID capability is probably the only
> > contract which the kernel should provide:
> >   - mapped 1:1 at the physical location, or
> >   - constructed at an offset according to DVSEC, or
> >   - constructed at an offset according to a look-up table
> >
> > The VMM always scans the vfio pci config space to expose vPASID.
> >
> > Then the remaining open is what VMM could do when a VF supports
> > PASID but unfortunately it's not reported by vfio. W/o the capability
> > of inspecting the PASID state of PF, probably the only feasible option
> > is to maintain a look-up table in VMM itself and assumes the kernel
> > always enables the PASID cap on PF.
> 
> I'm still not sure I like doing this in the kernel - we need to do the
> same sort of thing for ATS too, right?

VF is allowed to implement ATS.

PRI has the same problem as PASID.

> 
> It feels simpler if the indicates if PASID and ATS can be supported
> and userspace builds the capability blocks.

this routes back to Alex's original question about using different
interfaces (a device feature vs. PCI PASID cap) for VF and PF.

Are we OK with that divergence?

> 
> There are migration considerations too - the blocks need to be
> migrated over and end up in the same place as well..
> 

Can you elaborate what is the problem with the kernel emulating
the PASID cap in this consideration?

Does it talk about a case where the devices between src/dest are
different versions (but backward compatible) with different unused
space layout and the kernel approach may pick up different offsets
while the VMM can guarantee the same offset?
Jason Gunthorpe April 24, 2024, 12:12 a.m. UTC | #18
On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 23, 2024 8:02 PM
> > 
> > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > from the kernel.
> > >
> > > So I kind of agree that emulated PASID capability is probably the only
> > > contract which the kernel should provide:
> > >   - mapped 1:1 at the physical location, or
> > >   - constructed at an offset according to DVSEC, or
> > >   - constructed at an offset according to a look-up table
> > >
> > > The VMM always scans the vfio pci config space to expose vPASID.
> > >
> > > Then the remaining open is what VMM could do when a VF supports
> > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > of inspecting the PASID state of PF, probably the only feasible option
> > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > always enables the PASID cap on PF.
> > 
> > I'm still not sure I like doing this in the kernel - we need to do the
> > same sort of thing for ATS too, right?
> 
> VF is allowed to implement ATS.
> 
> PRI has the same problem as PASID.

I'm surprised by this, I would have guessed ATS would be the device
global one, PRI not being per-VF seems problematic??? How do you
disable PRI generation to get a clean shutdown?

> > It feels simpler if the indicates if PASID and ATS can be supported
> > and userspace builds the capability blocks.
> 
> this routes back to Alex's original question about using different
> interfaces (a device feature vs. PCI PASID cap) for VF and PF.

I'm not sure it is different interfaces..

The only reason to pass the PF's PASID cap is to give free space to
the VMM. If we are saying that gaps are free space (excluding a list
of bad devices) then we don't acutally need to do that anymore.

VMM will always create a synthetic PASID cap and kernel will always
suppress a real one.

An iommufd query will indicate if the vIOMMU can support vPASID on
that device.

Same for all the troublesome non-physical caps.

> > There are migration considerations too - the blocks need to be
> > migrated over and end up in the same place as well..
> 
> Can you elaborate what is the problem with the kernel emulating
> the PASID cap in this consideration?

If the kernel changes the algorithm, say it wants to do PASID, PRI,
something_new then it might change the layout

We can't just have the kernel decide without also providing a way for
userspace to say what the right layout actually is. :\

> Does it talk about a case where the devices between src/dest are
> different versions (but backward compatible) with different unused
> space layout and the kernel approach may pick up different offsets
> while the VMM can guarantee the same offset?

That is also a concern where the PCI cap layout may change a bit but
they are still migration compatible, but my bigger worry is that the
kernel just lays out the fake caps in a different way because the
kernel changes.

At least if the VMM is doing this then the VMM can include the
information in its migration scheme and use it to recreate the PCI
layout withotu having to create a bunch of uAPI to do so.

Jason
Tian, Kevin April 24, 2024, 2:57 a.m. UTC | #19
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.
> > >
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?
> >
> > VF is allowed to implement ATS.
> >
> > PRI has the same problem as PASID.
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?

Here is what the PCIe spec says:

  For SR-IOV devices, a single Page Request Interface is permitted for
  the PF and is shared between the PF and its associated VFs, in which
  case the PF implements this capability and its VFs must not.

I'll let Baolu chime in for the potential impact to his PRI cleanup
effort, e.g. whether disabling PRI generation is mandatory if the
IOMMU side is already put in a mode auto-responding error to
new PRI request instead of reporting to sw.

But I do see another problem for shared capabilities between PF/VFs.

Now those shared capabilities are enabled/disabled when the PF is
attached to/detached from a domain, w/o counting the shared usage
from VFs.

Looks we have a gap here.
Tian, Kevin April 24, 2024, 5:19 a.m. UTC | #20
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.
> >
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.
> 
> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.

oh you suggest that there won't even be a 1:1 map for PF!

kind of continue with the device_feature method as this series does.
and it could include all VMM-emulated capabilities which are not
enumerated properly from vfio pci config space.

this interface only reports the availability/features of a capability
but never includes information about offset.

If a device implements DVSEC it will be exposed to the VMM.

Then suppose the VMM will introduce a new cmd parameter to
turn on the emulation of the PASID capability. It's default off so
legacy usages can still work.

Once the parameter is on then the VMM will emulate the PASID
capability by:

  - Locating a free range according to DVSEC, or,
  - Locating a free range from gaps between PCI caps,

If a device is found not working properly then add a fixed offset for 
this device.
 
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..
> >
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\
> 

emm that's a good point.
Baolu Lu April 24, 2024, 12:29 p.m. UTC | #21
On 2024/4/24 10:57, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, April 24, 2024 8:12 AM
>>
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
> 
> Here is what the PCIe spec says:
> 
>    For SR-IOV devices, a single Page Request Interface is permitted for
>    the PF and is shared between the PF and its associated VFs, in which
>    case the PF implements this capability and its VFs must not.
> 
> I'll let Baolu chime in for the potential impact to his PRI cleanup
> effort, e.g. whether disabling PRI generation is mandatory if the
> IOMMU side is already put in a mode auto-responding error to
> new PRI request instead of reporting to sw.

The PRI cleanup steps are defined like this:

  * - Disable new PRI reception: Turn off PRI generation in the IOMMU 
hardware
  *   and flush any hardware page request queues. This should be done before
  *   calling into this helper.
  * - Acknowledge all outstanding PRQs to the device: Respond to all 
outstanding
  *   page requests with IOMMU_PAGE_RESP_INVALID, indicating the device 
should
  *   not retry. This helper function handles this.
  * - Disable PRI on the device: After calling this helper, the caller could
  *   then disable PRI on the device.

Disabling PRI on the device is the last step and optional because the
IOMMU is required to support a PRI blocking state and has already been
put in that state at the first step.

For the VF case, it probably is a no-op except for maintaining a
reference count. Once PRI is disabled on all PFs and VFs, it can then be
physically disabled on the PF.

> 
> But I do see another problem for shared capabilities between PF/VFs.
> 
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.
> 
> Looks we have a gap here.

Yes, there's a gap at least for the Intel IOMMU driver. I'll soon fix
this by moving the handling of ATS out of the driver, especially from
the default domain attach/detach paths.

Best regards,
baolu
Jason Gunthorpe April 24, 2024, 2:04 p.m. UTC | #22
On Wed, Apr 24, 2024 at 02:57:38AM +0000, Tian, Kevin wrote:
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.

Urh..

This looks broken in the kernel right now?

If you attach a SVA PASID to a VF then the iommu driver will call 
pci_enable_pri(vf) which will always fail:

	/*
	 * VFs must not implement the PRI Capability.  If their PF
	 * implements PRI, it is shared by the VFs, so if the PF PRI is
	 * enabled, it is also enabled for the VF.
	 */
	if (pdev->is_virtfn) {
		if (pci_physfn(pdev)->pri_enabled)
			return 0;
		return -EINVAL;
	}

More to fix :(

Jason
Jason Gunthorpe April 24, 2024, 2:15 p.m. UTC | #23
On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 24, 2024 8:12 AM
> > 
> > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > >
> > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > and userspace builds the capability blocks.
> > >
> > > this routes back to Alex's original question about using different
> > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> > 
> > I'm not sure it is different interfaces..
> > 
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> > 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> 
> oh you suggest that there won't even be a 1:1 map for PF!

Right. No real need..

> kind of continue with the device_feature method as this series does.
> and it could include all VMM-emulated capabilities which are not
> enumerated properly from vfio pci config space.

1) VFIO creates the iommufd idev
2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
   etc is supported
3) VMM locates empty space in the config space
4) VMM figures out where and what cap blocks to create (considering
   migration needs/etc)
5) VMM synthesizes the blocks and ties emulation to other iommufd things

This works generically for any synthetic vPCI function including a
non-vfio-pci one.

Most likely due to migration needs the exact layout of the PCI config
space should be configured to the VMM, including the location of any
blocks copied from physical and any blocks synthezied. This is the
only way to be sure the config space is actually 100% consistent.

For non migration cases to make it automatic we can check the free
space via gaps. The broken devices that have problems with this can
either be told to use the explicit approach above,the VMM could
consult some text file, or vPASID/etc can be left disabled. IMHO the
support of PASID is so rare today this is probably fine.

Vendors should be *strongly encouraged* to wrap their special used
config space areas in DVSEC and not hide them in free space.

We may also want a DVSEC to indicate free space - but if vendors are
going to change their devices I'd rather them change to mark the used
space with DVSEC then mark the free space :)

Jason
Alex Williamson April 24, 2024, 6:24 p.m. UTC | #24
On Tue, 23 Apr 2024 21:12:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > 
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.  
> > > 
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?  
> > 
> > VF is allowed to implement ATS.
> > 
> > PRI has the same problem as PASID.  
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?
> 
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.  
> > 
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.

Are we saying that now??  That's new.

> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..  
> > 
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?  
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\

The capability layout is only relevant to migration, right?  A variant
driver that supports migration is a prerequisite and would also be
responsible for exposing the PASID capability.  This isn't as disjoint
as it's being portrayed.

> > Does it talk about a case where the devices between src/dest are
> > different versions (but backward compatible) with different unused
> > space layout and the kernel approach may pick up different offsets
> > while the VMM can guarantee the same offset?  
> 
> That is also a concern where the PCI cap layout may change a bit but
> they are still migration compatible, but my bigger worry is that the
> kernel just lays out the fake caps in a different way because the
> kernel changes.

Outside of migration, what does it matter if the cap layout is
different?  A driver should never hard code the address for a
capability.
 
> At least if the VMM is doing this then the VMM can include the
> information in its migration scheme and use it to recreate the PCI
> layout withotu having to create a bunch of uAPI to do so.

We're again back to migration compatibility, where again the capability
layout would be governed by the migration support in the in-kernel
variant driver.  Once migration is involved the location of a PASID
shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

Regardless, the VMM ultimately has the authority what the guest
sees in config space.  The VMM is not bound to expose the PASID at the
offset provided by the kernel, or bound to expose it at all.  The
kernel exposed PASID can simply provide an available location and set
of enabled capabilities.  Thanks,

Alex
Jason Gunthorpe April 24, 2024, 6:36 p.m. UTC | #25
On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.

I suggested it a few times

> 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> > 
> > An iommufd query will indicate if the vIOMMU can support vPASID on
> > that device.
> > 
> > Same for all the troublesome non-physical caps.
> > 
> > > > There are migration considerations too - the blocks need to be
> > > > migrated over and end up in the same place as well..  
> > > 
> > > Can you elaborate what is the problem with the kernel emulating
> > > the PASID cap in this consideration?  
> > 
> > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > something_new then it might change the layout
> > 
> > We can't just have the kernel decide without also providing a way for
> > userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  

Yes, proabbly

> A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.

I guess..  But also not quite. We still have the problem that kernel
migration driver V1 could legitimately create a different config space
that migration driver V2

And now you are saying that the migration driver has to parse the
migration stream and readjust its own layout

And every driver needs to do this?

We can, it is a quite big bit of infrastructure I think, but sure..

I fear the VMM still has to be involved somehow because it still has
to know if the source VMM has removed any kernel created caps.

> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.

Yes, talking about migration here - migration is the hardest case it
seems.
 
> > At least if the VMM is doing this then the VMM can include the
> > information in its migration scheme and use it to recreate the PCI
> > layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

I wasn't going in this direction. I was thinking to make the VMM
create the config space layout that is approriate and hold it stable
as a migration ABI.

I think in practice many VMMs are going to do this anyhow unless we
put full support for config space synthesis, stable versions, and
version selection in the kernel directly. I was thinking to avoid
doing that.
 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities. 

And if the VMM is going to ignore the kernel layout then why do so
much work in the kernel to create it?

I think we need to decide, either only the VMM or only the kernel
should do this.

Jason
Alex Williamson April 24, 2024, 6:38 p.m. UTC | #26
On Wed, 24 Apr 2024 11:15:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > 
> > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > >
> > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > and userspace builds the capability blocks.  
> > > >
> > > > this routes back to Alex's original question about using different
> > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > 
> > > I'm not sure it is different interfaces..
> > > 
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.
> > > 
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.  
> > 
> > oh you suggest that there won't even be a 1:1 map for PF!  
> 
> Right. No real need..
> 
> > kind of continue with the device_feature method as this series does.
> > and it could include all VMM-emulated capabilities which are not
> > enumerated properly from vfio pci config space.  
> 
> 1) VFIO creates the iommufd idev
> 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
>    etc is supported
> 3) VMM locates empty space in the config space
> 4) VMM figures out where and what cap blocks to create (considering
>    migration needs/etc)
> 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> 
> This works generically for any synthetic vPCI function including a
> non-vfio-pci one.

Maybe this is the actual value in implementing this in the VMM, one
implementation can support multiple device interfaces.

> Most likely due to migration needs the exact layout of the PCI config
> space should be configured to the VMM, including the location of any
> blocks copied from physical and any blocks synthezied. This is the
> only way to be sure the config space is actually 100% consistent.

Where is this concern about config space arbitrarily changing coming
from?  It's possible, yes, but vfio-pci-core or a variant driver are
going to have some sort of reasoning for exposing a capability at a
given offset.  A variant driver is necessary for supporting migration,
that variant driver should be aware that capability offsets are part of
the migration contract, and QEMU will enforce identical config space
unless we introduce exceptions.

> For non migration cases to make it automatic we can check the free
> space via gaps. The broken devices that have problems with this can
> either be told to use the explicit approach above,the VMM could
> consult some text file, or vPASID/etc can be left disabled. IMHO the
> support of PASID is so rare today this is probably fine.
> 
> Vendors should be *strongly encouraged* to wrap their special used
> config space areas in DVSEC and not hide them in free space.

As in my previous reply, this is a new approach that I had thought we
weren't comfortable making, and I'm still not very comfortable with.
The fact is random registers exist outside of capabilities today and
creating a general policy that we're going to deal with that as issues
arise from a generic "find a gap" algorithm is concerning.

> We may also want a DVSEC to indicate free space - but if vendors are
> going to change their devices I'd rather them change to mark the used
> space with DVSEC then mark the free space :)

Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
great.  Thanks,

Alex
Jason Gunthorpe April 24, 2024, 6:45 p.m. UTC | #27
On Wed, Apr 24, 2024 at 12:38:51PM -0600, Alex Williamson wrote:
> On Wed, 24 Apr 2024 11:15:25 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > > 
> > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > > >
> > > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > > and userspace builds the capability blocks.  
> > > > >
> > > > > this routes back to Alex's original question about using different
> > > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > > 
> > > > I'm not sure it is different interfaces..
> > > > 
> > > > The only reason to pass the PF's PASID cap is to give free space to
> > > > the VMM. If we are saying that gaps are free space (excluding a list
> > > > of bad devices) then we don't acutally need to do that anymore.
> > > > 
> > > > VMM will always create a synthetic PASID cap and kernel will always
> > > > suppress a real one.  
> > > 
> > > oh you suggest that there won't even be a 1:1 map for PF!  
> > 
> > Right. No real need..
> > 
> > > kind of continue with the device_feature method as this series does.
> > > and it could include all VMM-emulated capabilities which are not
> > > enumerated properly from vfio pci config space.  
> > 
> > 1) VFIO creates the iommufd idev
> > 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
> >    etc is supported
> > 3) VMM locates empty space in the config space
> > 4) VMM figures out where and what cap blocks to create (considering
> >    migration needs/etc)
> > 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> > 
> > This works generically for any synthetic vPCI function including a
> > non-vfio-pci one.
> 
> Maybe this is the actual value in implementing this in the VMM, one
> implementation can support multiple device interfaces.
> 
> > Most likely due to migration needs the exact layout of the PCI config
> > space should be configured to the VMM, including the location of any
> > blocks copied from physical and any blocks synthezied. This is the
> > only way to be sure the config space is actually 100% consistent.
> 
> Where is this concern about config space arbitrarily changing coming
> from?

It is important for migration.

Today with the drivers we have the devices have to take care of their
own config space layout only in HW. We are not expecting migration
drivers to worry about any SW created config space. SW config space is
a new thing.

> > We may also want a DVSEC to indicate free space - but if vendors are
> > going to change their devices I'd rather them change to mark the used
> > space with DVSEC then mark the free space :)
> 
> Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
> great

We are applying this going forward to PASID, and PRI, which barely
exist on devices at all today. We don't need to worry about those
10+yr old devices that don't support PASID/PRI in the first place.

I think this makes the problem much smaller.

Jason
Alex Williamson April 24, 2024, 8:13 p.m. UTC | #28
On Wed, 24 Apr 2024 15:36:26 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.  
> 
> I suggested it a few times
> 
> >   
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.
> > > 
> > > An iommufd query will indicate if the vIOMMU can support vPASID on
> > > that device.
> > > 
> > > Same for all the troublesome non-physical caps.
> > >   
> > > > > There are migration considerations too - the blocks need to be
> > > > > migrated over and end up in the same place as well..    
> > > > 
> > > > Can you elaborate what is the problem with the kernel emulating
> > > > the PASID cap in this consideration?    
> > > 
> > > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > > something_new then it might change the layout
> > > 
> > > We can't just have the kernel decide without also providing a way for
> > > userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?    
> 
> Yes, proabbly
> 
> > A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.  
> 
> I guess..  But also not quite. We still have the problem that kernel
> migration driver V1 could legitimately create a different config space
> that migration driver V2
> 
> And now you are saying that the migration driver has to parse the
> migration stream and readjust its own layout
> 
> And every driver needs to do this?
> 
> We can, it is a quite big bit of infrastructure I think, but sure..
> 
> I fear the VMM still has to be involved somehow because it still has
> to know if the source VMM has removed any kernel created caps.

This is kind of an absurd example to portray as a ubiquitous problem.
Typically the config space layout is a reflection of hardware whether
the device supports migration or not.  If a driver were to insert a
virtual capability, then yes it would want to be consistent about it if
it also cares about migration.  If the driver needs to change the
location of a virtual capability, problems will arise, but that's also
not something that every driver needs to do.

Also, how exactly does emulating the capability in the VMM solve this
problem?  Currently QEMU migration simply applies state to an identical
VM on the target.  QEMU doesn't modify the target VM to conform to the
data stream.  So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.

> > Outside of migration, what does it matter if the cap layout is
> > different?  A driver should never hard code the address for a
> > capability.  
> 
> Yes, talking about migration here - migration is the hardest case it
> seems.
>  
> > > At least if the VMM is doing this then the VMM can include the
> > > information in its migration scheme and use it to recreate the PCI
> > > layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.  
> 
> I wasn't going in this direction. I was thinking to make the VMM
> create the config space layout that is approriate and hold it stable
> as a migration ABI.
> 
> I think in practice many VMMs are going to do this anyhow unless we
> put full support for config space synthesis, stable versions, and
> version selection in the kernel directly. I was thinking to avoid
> doing that.

Currently QEMU replies on determinism that a given command line results
in an identical machine configuration and identical devices.  State of
that target VM is then populated, not defined by, the migration stream.

> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.   
> 
> And if the VMM is going to ignore the kernel layout then why do so
> much work in the kernel to create it?

Ok, let's not ignore it ;)

> I think we need to decide, either only the VMM or only the kernel
> should do this.

What are you actually proposing?  Are you suggesting that if a device
supports the Power Management capability it will be virtualized at
offset 0x60, if the device supports the MSI capability it will be
virtualized at 0x68,... if a device supports PASID it will be
virtualized at offset 0x300, etc...?

That's not only impractical because we can't layout all the capabilities
within the available space, but also because we will run into masking
hidden registers and devices where the driver hard codes a capability
offset.

If the VMM implements the "find a gap" solution then it's just as
subject to config space changes in hardware or provided by the variant
driver.

If not either of those, are we hard coding a device specific config
space map into the VMM or providing one on the command line?  I thought
we were using vfio-pci variant drivers and a defined vfio migration API
in order to prevent modifying the VMM for every device we want to
support migration.  Also I'd wonder if the driver itself shouldn't be
configured to provide a compatible type.  Thanks,

Alex
Yi Liu April 25, 2024, 9:26 a.m. UTC | #29
On 2024/4/25 02:24, Alex Williamson wrote:
> On Tue, 23 Apr 2024 21:12:21 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
>>
>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>> and userspace builds the capability blocks.
>>>
>>> this routes back to Alex's original question about using different
>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>
>> I'm not sure it is different interfaces..
>>
>> The only reason to pass the PF's PASID cap is to give free space to
>> the VMM. If we are saying that gaps are free space (excluding a list
>> of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.
> 
>> VMM will always create a synthetic PASID cap and kernel will always
>> suppress a real one.
>>
>> An iommufd query will indicate if the vIOMMU can support vPASID on
>> that device.
>>
>> Same for all the troublesome non-physical caps.
>>
>>>> There are migration considerations too - the blocks need to be
>>>> migrated over and end up in the same place as well..
>>>
>>> Can you elaborate what is the problem with the kernel emulating
>>> the PASID cap in this consideration?
>>
>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>> something_new then it might change the layout
>>
>> We can't just have the kernel decide without also providing a way for
>> userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.
> 
>>> Does it talk about a case where the devices between src/dest are
>>> different versions (but backward compatible) with different unused
>>> space layout and the kernel approach may pick up different offsets
>>> while the VMM can guarantee the same offset?
>>
>> That is also a concern where the PCI cap layout may change a bit but
>> they are still migration compatible, but my bigger worry is that the
>> kernel just lays out the fake caps in a different way because the
>> kernel changes.
> 
> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.
>   

But it may store the offset of capability to make next cap access more
convenient. I noticted struct pci_dev stores the offset of PRI and PASID
cap. So if the layout of config space changed between src and dst, it may
result in problem in guest when guest driver uses the offsets to access
PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
msix). So there is already a problem even put aside the PRI and PASID cap.

#ifdef CONFIG_PCI_PRI
	u16		pri_cap;	/* PRI Capability offset */
	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
	unsigned int	pasid_required:1; /* PRG Response PASID Required */
#endif
#ifdef CONFIG_PCI_PASID
	u16		pasid_cap;	/* PASID Capability offset */
	u16		pasid_features;
#endif
#ifdef CONFIG_PCI_P2PDMA
	struct pci_p2pdma __rcu *p2pdma;
#endif
#ifdef CONFIG_PCI_DOE
	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
#endif
	u16		acs_cap;	/* ACS Capability offset */

https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350

>> At least if the VMM is doing this then the VMM can include the
>> information in its migration scheme and use it to recreate the PCI
>> layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities.  Thanks,
> 
> Alex
>
Alex Williamson April 25, 2024, 12:58 p.m. UTC | #30
On Thu, 25 Apr 2024 17:26:54 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/25 02:24, Alex Williamson wrote:
> > On Tue, 23 Apr 2024 21:12:21 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Tuesday, April 23, 2024 8:02 PM
> >>>>
> >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> >>>>> I'm not sure how userspace can fully handle this w/o certain assistance
> >>>>> from the kernel.
> >>>>>
> >>>>> So I kind of agree that emulated PASID capability is probably the only
> >>>>> contract which the kernel should provide:
> >>>>>    - mapped 1:1 at the physical location, or
> >>>>>    - constructed at an offset according to DVSEC, or
> >>>>>    - constructed at an offset according to a look-up table
> >>>>>
> >>>>> The VMM always scans the vfio pci config space to expose vPASID.
> >>>>>
> >>>>> Then the remaining open is what VMM could do when a VF supports
> >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
> >>>>> of inspecting the PASID state of PF, probably the only feasible option
> >>>>> is to maintain a look-up table in VMM itself and assumes the kernel
> >>>>> always enables the PASID cap on PF.  
> >>>>
> >>>> I'm still not sure I like doing this in the kernel - we need to do the
> >>>> same sort of thing for ATS too, right?  
> >>>
> >>> VF is allowed to implement ATS.
> >>>
> >>> PRI has the same problem as PASID.  
> >>
> >> I'm surprised by this, I would have guessed ATS would be the device
> >> global one, PRI not being per-VF seems problematic??? How do you
> >> disable PRI generation to get a clean shutdown?
> >>  
> >>>> It feels simpler if the indicates if PASID and ATS can be supported
> >>>> and userspace builds the capability blocks.  
> >>>
> >>> this routes back to Alex's original question about using different
> >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> >>
> >> I'm not sure it is different interfaces..
> >>
> >> The only reason to pass the PF's PASID cap is to give free space to
> >> the VMM. If we are saying that gaps are free space (excluding a list
> >> of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.
> >   
> >> VMM will always create a synthetic PASID cap and kernel will always
> >> suppress a real one.
> >>
> >> An iommufd query will indicate if the vIOMMU can support vPASID on
> >> that device.
> >>
> >> Same for all the troublesome non-physical caps.
> >>  
> >>>> There are migration considerations too - the blocks need to be
> >>>> migrated over and end up in the same place as well..  
> >>>
> >>> Can you elaborate what is the problem with the kernel emulating
> >>> the PASID cap in this consideration?  
> >>
> >> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> >> something_new then it might change the layout
> >>
> >> We can't just have the kernel decide without also providing a way for
> >> userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?  A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.
> >   
> >>> Does it talk about a case where the devices between src/dest are
> >>> different versions (but backward compatible) with different unused
> >>> space layout and the kernel approach may pick up different offsets
> >>> while the VMM can guarantee the same offset?  
> >>
> >> That is also a concern where the PCI cap layout may change a bit but
> >> they are still migration compatible, but my bigger worry is that the
> >> kernel just lays out the fake caps in a different way because the
> >> kernel changes.  
> > 
> > Outside of migration, what does it matter if the cap layout is
    ^^^^^^^^^^^^^^^^^^^^
> > different?  A driver should never hard code the address for a
> > capability.
> >     
> 
> But it may store the offset of capability to make next cap access more
> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
> cap. So if the layout of config space changed between src and dst, it may
> result in problem in guest when guest driver uses the offsets to access
> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
> msix). So there is already a problem even put aside the PRI and PASID cap.

Yes, I had noted "outside of migration" above.  Config space must be
consistent to a running VM.  But the possibility of config space
changing like this only exists in the case where the driver supports
migration, so I think we're inventing an unrealistic concern that a
driver that supports migration would arbitrarily modify the config space
layout in order to make an argument for VMM managed layout.  Thanks,

Alex

> #ifdef CONFIG_PCI_PRI
> 	u16		pri_cap;	/* PRI Capability offset */
> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
> #endif
> #ifdef CONFIG_PCI_PASID
> 	u16		pasid_cap;	/* PASID Capability offset */
> 	u16		pasid_features;
> #endif
> #ifdef CONFIG_PCI_P2PDMA
> 	struct pci_p2pdma __rcu *p2pdma;
> #endif
> #ifdef CONFIG_PCI_DOE
> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> #endif
> 	u16		acs_cap;	/* ACS Capability offset */
> 
> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
> 
> >> At least if the VMM is doing this then the VMM can include the
> >> information in its migration scheme and use it to recreate the PCI
> >> layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> > 
> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.  Thanks,
> > 
> > Alex
> >   
>
Yi Liu April 26, 2024, 9:01 a.m. UTC | #31
On 2024/4/25 20:58, Alex Williamson wrote:
> On Thu, 25 Apr 2024 17:26:54 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/4/25 02:24, Alex Williamson wrote:
>>> On Tue, 23 Apr 2024 21:12:21 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>    
>>>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>>>
>>>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>>>> from the kernel.
>>>>>>>
>>>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>>>> contract which the kernel should provide:
>>>>>>>     - mapped 1:1 at the physical location, or
>>>>>>>     - constructed at an offset according to DVSEC, or
>>>>>>>     - constructed at an offset according to a look-up table
>>>>>>>
>>>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>>>
>>>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>>>> always enables the PASID cap on PF.
>>>>>>
>>>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>>>> same sort of thing for ATS too, right?
>>>>>
>>>>> VF is allowed to implement ATS.
>>>>>
>>>>> PRI has the same problem as PASID.
>>>>
>>>> I'm surprised by this, I would have guessed ATS would be the device
>>>> global one, PRI not being per-VF seems problematic??? How do you
>>>> disable PRI generation to get a clean shutdown?
>>>>   
>>>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>>>> and userspace builds the capability blocks.
>>>>>
>>>>> this routes back to Alex's original question about using different
>>>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>>>
>>>> I'm not sure it is different interfaces..
>>>>
>>>> The only reason to pass the PF's PASID cap is to give free space to
>>>> the VMM. If we are saying that gaps are free space (excluding a list
>>>> of bad devices) then we don't acutally need to do that anymore.
>>>
>>> Are we saying that now??  That's new.
>>>    
>>>> VMM will always create a synthetic PASID cap and kernel will always
>>>> suppress a real one.
>>>>
>>>> An iommufd query will indicate if the vIOMMU can support vPASID on
>>>> that device.
>>>>
>>>> Same for all the troublesome non-physical caps.
>>>>   
>>>>>> There are migration considerations too - the blocks need to be
>>>>>> migrated over and end up in the same place as well..
>>>>>
>>>>> Can you elaborate what is the problem with the kernel emulating
>>>>> the PASID cap in this consideration?
>>>>
>>>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>>>> something_new then it might change the layout
>>>>
>>>> We can't just have the kernel decide without also providing a way for
>>>> userspace to say what the right layout actually is. :\
>>>
>>> The capability layout is only relevant to migration, right?  A variant
>>> driver that supports migration is a prerequisite and would also be
>>> responsible for exposing the PASID capability.  This isn't as disjoint
>>> as it's being portrayed.
>>>    
>>>>> Does it talk about a case where the devices between src/dest are
>>>>> different versions (but backward compatible) with different unused
>>>>> space layout and the kernel approach may pick up different offsets
>>>>> while the VMM can guarantee the same offset?
>>>>
>>>> That is also a concern where the PCI cap layout may change a bit but
>>>> they are still migration compatible, but my bigger worry is that the
>>>> kernel just lays out the fake caps in a different way because the
>>>> kernel changes.
>>>
>>> Outside of migration, what does it matter if the cap layout is
>      ^^^^^^^^^^^^^^^^^^^^
>>> different?  A driver should never hard code the address for a
>>> capability.
>>>      
>>
>> But it may store the offset of capability to make next cap access more
>> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
>> cap. So if the layout of config space changed between src and dst, it may
>> result in problem in guest when guest driver uses the offsets to access
>> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
>> msix). So there is already a problem even put aside the PRI and PASID cap.
> 
> Yes, I had noted "outside of migration" above.  Config space must be
> consistent to a running VM.  But the possibility of config space
> changing like this only exists in the case where the driver supports
> migration, so I think we're inventing an unrealistic concern that a
> driver that supports migration would arbitrarily modify the config space
> layout in order to make an argument for VMM managed layout.  Thanks,

I was considering a case in which the src VM has a v1 hw device, while the
dst VM has a v2 hw device. Due to whatever reasons, the config space
layouts are different between v1 and v2 hw. Since the current vfio copies
the physical layout (except for the hidden caps) to VMM, the layout between
the src and dst VM would be different. This would result in problem since 
the cap offsets are stale. It seems to be a problem you mentioned in
another email of this thread[1]. :)

[1] "So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.
" 
https://lore.kernel.org/linux-iommu/20240424141349.376bdbf9.alex.williamson@redhat.com/


> Alex
> 
>> #ifdef CONFIG_PCI_PRI
>> 	u16		pri_cap;	/* PRI Capability offset */
>> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
>> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
>> #endif
>> #ifdef CONFIG_PCI_PASID
>> 	u16		pasid_cap;	/* PASID Capability offset */
>> 	u16		pasid_features;
>> #endif
>> #ifdef CONFIG_PCI_P2PDMA
>> 	struct pci_p2pdma __rcu *p2pdma;
>> #endif
>> #ifdef CONFIG_PCI_DOE
>> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
>> #endif
>> 	u16		acs_cap;	/* ACS Capability offset */
>>
>> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
>>
>>>> At least if the VMM is doing this then the VMM can include the
>>>> information in its migration scheme and use it to recreate the PCI
>>>> layout withotu having to create a bunch of uAPI to do so.
>>>
>>> We're again back to migration compatibility, where again the capability
>>> layout would be governed by the migration support in the in-kernel
>>> variant driver.  Once migration is involved the location of a PASID
>>> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
>>>
>>> Regardless, the VMM ultimately has the authority what the guest
>>> sees in config space.  The VMM is not bound to expose the PASID at the
>>> offset provided by the kernel, or bound to expose it at all.  The
>>> kernel exposed PASID can simply provide an available location and set
>>> of enabled capabilities.  Thanks,
>>>
>>> Alex
>>>    
>>
>
Jason Gunthorpe April 26, 2024, 2:11 p.m. UTC | #32
On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:

> This is kind of an absurd example to portray as a ubiquitous problem.
> Typically the config space layout is a reflection of hardware whether
> the device supports migration or not.

Er, all our HW has FW constructed config space. It changes with FW
upgrades. We change it during the life of the product. This has to be
considered..

> If a driver were to insert a
> virtual capability, then yes it would want to be consistent about it if
> it also cares about migration.  If the driver needs to change the
> location of a virtual capability, problems will arise, but that's also
> not something that every driver needs to do.

Well, mlx5 has to cope with this. It supports so many devices with so
many config space layouts :( I don't know if we can just hard wire an
offset to stick in a PASID cap and expect that to work...

> Also, how exactly does emulating the capability in the VMM solve this
> problem?  Currently QEMU migration simply applies state to an identical
> VM on the target.  QEMU doesn't modify the target VM to conform to the
> data stream.  So in either case, the problem might be more along the
> lines of how to make a V1 device from a V2 driver, which is more the
> device type/flavor/persona problem.

Yes, it doesn't solve anything, it just puts the responsibility for
something that is very complicated in userspace where there are more
options to configure and customize it to the environment.

> Currently QEMU replies on determinism that a given command line results
> in an identical machine configuration and identical devices.  State of
> that target VM is then populated, not defined by, the migration stream.

But that won't be true if the kernel is making decisions. The config
space layout depends now on the kernel driver version too.

> > I think we need to decide, either only the VMM or only the kernel
> > should do this.
> 
> What are you actually proposing?

Okay, what I'm thinking about is a text file that describes the vPCI
function configuration space to create. The community will standardize
this and VMMs will have to implement to get PASID/etc. Maybe the
community will provide a BSD licensed library to do this job.

The text file allows the operator to specify exactly the configuration
space the VFIO function should have. It would not be derived
automatically from physical. AFAIK qemu does not have this capability
currently.

This reflects my observation and discussions around the live migration
standardization. I belive we are fast reaching a point where this is
required.

Consider standards based migration between wildly different
devices. The devices will not standardize their physical config space,
but an operator could generate a consistent vPCI config space that
works with all the devices in their fleet.

Consider the usual working model of the large operators - they define
instance types with some regularity. But an instance type is fixed in
concrete once it is specified, things like the vPCI config space are
fixed.

Running Instance A on newer hardware with a changed physical config
space should continue to present Instance A's vPCI config layout
regardless. Ie Instance A might not support PASID but Instance B can
run on newer HW that does. The config space layout depends on the
requested Instance Type, not the physical layout.

The auto-configuration of the config layout from physical is a nice
feature and is excellent for development/small scale, but it shouldn't
be the only way to work.

So - if we accept that text file configuration should be something the
VMM supports then let's reconsider how to solve the PASID problem.

I'd say the way to solve it should be via a text file specifying a
full config space layout that includes the PASID cap. From the VMM
perspective this works fine, and it ports to every VMM directly via
processing the text file.

The autoconfiguration use case can be done by making a tool build the
text file by deriving it from physical, much like today. The single
instance of that tool could have device specific knowledge to avoid
quirks. This way the smarts can still be shared by all the VMMs
without going into the kernel. Special devices with hidden config
space could get special quirks or special reference text files into
the tool repo.

Serious operators doing production SRIOV/etc would negotiate the text
file with the HW vendors when they define their Instance Type. Ideally
these reference text files would be contributed to the tool repo
above. I think there would be some nice idea to define fully open
source Instance Types that include VFIO devices too.

Is it too much of a fantasy?

Jason
Alex Williamson April 26, 2024, 8:13 p.m. UTC | #33
On Fri, 26 Apr 2024 11:11:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.  
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

So as I understand it, the concern is that you have firmware that
supports migration, but it also openly hostile to the fundamental
aspects of exposing a stable device ABI in support of migration.

> > If a driver were to insert a
> > virtual capability, then yes it would want to be consistent about it if
> > it also cares about migration.  If the driver needs to change the
> > location of a virtual capability, problems will arise, but that's also
> > not something that every driver needs to do.  
> 
> Well, mlx5 has to cope with this. It supports so many devices with so
> many config space layouts :( I don't know if we can just hard wire an
> offset to stick in a PASID cap and expect that to work...
> 
> > Also, how exactly does emulating the capability in the VMM solve this
> > problem?  Currently QEMU migration simply applies state to an identical
> > VM on the target.  QEMU doesn't modify the target VM to conform to the
> > data stream.  So in either case, the problem might be more along the
> > lines of how to make a V1 device from a V2 driver, which is more the
> > device type/flavor/persona problem.  
> 
> Yes, it doesn't solve anything, it just puts the responsibility for
> something that is very complicated in userspace where there are more
> options to configure and customize it to the environment.
> 
> > Currently QEMU replies on determinism that a given command line results
> > in an identical machine configuration and identical devices.  State of
> > that target VM is then populated, not defined by, the migration stream.  
> 
> But that won't be true if the kernel is making decisions. The config
> space layout depends now on the kernel driver version too.

But in the cases where we support migration there's a device specific
variant driver that supports that migration.  It's the job of that
variant driver to not only export and import the device state, but also
to provide a consistent ABI to the user, which includes the config
space layout.  I don't understand why we'd say the device programming
ABI itself falls within the purview of the device/variant driver, but
PCI config space is defined by device specific code at a higher level.

> > > I think we need to decide, either only the VMM or only the kernel
> > > should do this.  
> > 
> > What are you actually proposing?  
> 
> Okay, what I'm thinking about is a text file that describes the vPCI
> function configuration space to create. The community will standardize
> this and VMMs will have to implement to get PASID/etc. Maybe the
> community will provide a BSD licensed library to do this job.
> 
> The text file allows the operator to specify exactly the configuration
> space the VFIO function should have. It would not be derived
> automatically from physical. AFAIK qemu does not have this capability
> currently.
> 
> This reflects my observation and discussions around the live migration
> standardization. I belive we are fast reaching a point where this is
> required.
> 
> Consider standards based migration between wildly different
> devices. The devices will not standardize their physical config space,
> but an operator could generate a consistent vPCI config space that
> works with all the devices in their fleet.
> 
> Consider the usual working model of the large operators - they define
> instance types with some regularity. But an instance type is fixed in
> concrete once it is specified, things like the vPCI config space are
> fixed.
> 
> Running Instance A on newer hardware with a changed physical config
> space should continue to present Instance A's vPCI config layout
> regardless. Ie Instance A might not support PASID but Instance B can
> run on newer HW that does. The config space layout depends on the
> requested Instance Type, not the physical layout.
> 
> The auto-configuration of the config layout from physical is a nice
> feature and is excellent for development/small scale, but it shouldn't
> be the only way to work.
> 
> So - if we accept that text file configuration should be something the
> VMM supports then let's reconsider how to solve the PASID problem.
> 
> I'd say the way to solve it should be via a text file specifying a
> full config space layout that includes the PASID cap. From the VMM
> perspective this works fine, and it ports to every VMM directly via
> processing the text file.
> 
> The autoconfiguration use case can be done by making a tool build the
> text file by deriving it from physical, much like today. The single
> instance of that tool could have device specific knowledge to avoid
> quirks. This way the smarts can still be shared by all the VMMs
> without going into the kernel. Special devices with hidden config
> space could get special quirks or special reference text files into
> the tool repo.
> 
> Serious operators doing production SRIOV/etc would negotiate the text
> file with the HW vendors when they define their Instance Type. Ideally
> these reference text files would be contributed to the tool repo
> above. I think there would be some nice idea to define fully open
> source Instance Types that include VFIO devices too.

Regarding "if we accept that text file configuration should be
something the VMM supports", I'm not on board with this yet, so
applying it to PASID discussion seems premature.

We've developed variant drivers specifically to host the device specific
aspects of migration support.  The requirement of a consistent config
space layout is a problem that only exists relative to migration.  This
is an issue that I would have considered the responsibility of the
variant driver, which would likely expect a consistent interface from
the hardware/firmware.  Why does hostile firmware suddenly make it the
VMM's problem to provide a consistent ABI to the config space of the
device rather than the variant driver?

Obviously config maps are something that a VMM could do, but it also
seems to impose a non-trivial burden that every VMM requires an
implementation of a config space map and integration for each device
rather than simply expecting the exposed config space of the device to
be part of the migration ABI.  Also this solution specifically only
addresses config space compatibility without considering the more
generic issue that a variant driver can expose different device
personas.  A versioned persona and config space virtualization in the
variant driver is a much more flexible solution.  Thanks,

Alex
Christoph Hellwig April 27, 2024, 5:05 a.m. UTC | #34
On Fri, Apr 26, 2024 at 11:11:17AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

FYI, the same is true for just about any storage device I know.  Maybe
not all of the config space, but definitely parts of it.
Tian, Kevin April 28, 2024, 6:19 a.m. UTC | #35
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 27, 2024 4:14 AM
> 
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> >
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.
> >
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.
> 
> > > If a driver were to insert a
> > > virtual capability, then yes it would want to be consistent about it if
> > > it also cares about migration.  If the driver needs to change the
> > > location of a virtual capability, problems will arise, but that's also
> > > not something that every driver needs to do.
> >
> > Well, mlx5 has to cope with this. It supports so many devices with so
> > many config space layouts :( I don't know if we can just hard wire an
> > offset to stick in a PASID cap and expect that to work...

Are those config space layout differences usually also coming with
mmio-side interface change? If yes there are more to handle for
running V1 instance on V2 device and it'd make sense to manage
everything about compatibility in one place.

If we pursue the direction deciding the vconfig layout in VMM, does
it imply that anything related to mmio layout would also be put in
VMM too?

e.g. it's not unusual to see a device mmio layout as:

	REG_BASE:  base addr of a register block in a BAR
	REG_FEAT1_OFF: feature1 regs offset in the register block
	REG_FEAT2_OFF: feature2 regs offset in the register block
	...

Driver accesses registers according to those read-only offsets.

A FW upgrade may lead to change of offsets but functions stay the
same. An instance created on an old version can be migrated to 
the new version as long as accesses to old offsets are trapped
and routed to the new offsets.

Do we envision things like above in the variant driver or in VMM?

> > >
> > > What are you actually proposing?
> >
> > Okay, what I'm thinking about is a text file that describes the vPCI
> > function configuration space to create. The community will standardize
> > this and VMMs will have to implement to get PASID/etc. Maybe the
> > community will provide a BSD licensed library to do this job.
> >
> > The text file allows the operator to specify exactly the configuration
> > space the VFIO function should have. It would not be derived
> > automatically from physical. AFAIK qemu does not have this capability
> > currently.
> >
> > This reflects my observation and discussions around the live migration
> > standardization. I belive we are fast reaching a point where this is
> > required.
> >
> > Consider standards based migration between wildly different
> > devices. The devices will not standardize their physical config space,
> > but an operator could generate a consistent vPCI config space that
> > works with all the devices in their fleet.

It's hard to believe that 'wildly different' devices only have difference
in the layout of vPCI config space. 

> >
> > Consider the usual working model of the large operators - they define
> > instance types with some regularity. But an instance type is fixed in
> > concrete once it is specified, things like the vPCI config space are
> > fixed.
> >
> > Running Instance A on newer hardware with a changed physical config
> > space should continue to present Instance A's vPCI config layout
> > regardless. Ie Instance A might not support PASID but Instance B can
> > run on newer HW that does. The config space layout depends on the
> > requested Instance Type, not the physical layout.
> >
> > The auto-configuration of the config layout from physical is a nice
> > feature and is excellent for development/small scale, but it shouldn't
> > be the only way to work.
> >
> > So - if we accept that text file configuration should be something the
> > VMM supports then let's reconsider how to solve the PASID problem.
> >
> > I'd say the way to solve it should be via a text file specifying a
> > full config space layout that includes the PASID cap. From the VMM
> > perspective this works fine, and it ports to every VMM directly via
> > processing the text file.
> >
> > The autoconfiguration use case can be done by making a tool build the
> > text file by deriving it from physical, much like today. The single
> > instance of that tool could have device specific knowledge to avoid
> > quirks. This way the smarts can still be shared by all the VMMs
> > without going into the kernel. Special devices with hidden config
> > space could get special quirks or special reference text files into
> > the tool repo.
> >
> > Serious operators doing production SRIOV/etc would negotiate the text
> > file with the HW vendors when they define their Instance Type. Ideally
> > these reference text files would be contributed to the tool repo
> > above. I think there would be some nice idea to define fully open
> > source Instance Types that include VFIO devices too.
> 
> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.
> 
> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  This
> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?
> 
> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  Also this solution specifically only
> addresses config space compatibility without considering the more
> generic issue that a variant driver can expose different device
> personas.  A versioned persona and config space virtualization in the
> variant driver is a much more flexible solution.  Thanks,
> 

and looks this community lacks of a clear criteria on what burden
should be put in the kernel vs. in the VMM.

e.g. in earlier nvgrace-gpu discussion a major open was whether
the PCI bar emulation should be done by the variant driver or
by the VMM (with variant driver providing a device feature).

It ends up to be in the variant driver with one major argument
that doing so avoids the burden in various VMMs.

But now seems the 'text-file' proposal heads the opposite direction?

btw while this discussion may continue some time, I wonder whether
this vPASID reporting open can be handled separately from the
pasid attach/detach series so we can move the ball and merge
something already in agreement. anyway it's just a read-only cap so
won't affect how VFIO/IOMMUFD handles the pasid related requests.

Thanks
Kevin
Yi Liu April 29, 2024, 7:43 a.m. UTC | #36
On 2024/4/28 14:19, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Saturday, April 27, 2024 4:14 AM
>>
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem.
>>>> Typically the config space layout is a reflection of hardware whether
>>>> the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW
>>> upgrades. We change it during the life of the product. This has to be
>>> considered..
>>
>> So as I understand it, the concern is that you have firmware that
>> supports migration, but it also openly hostile to the fundamental
>> aspects of exposing a stable device ABI in support of migration.
>>
>>>> If a driver were to insert a
>>>> virtual capability, then yes it would want to be consistent about it if
>>>> it also cares about migration.  If the driver needs to change the
>>>> location of a virtual capability, problems will arise, but that's also
>>>> not something that every driver needs to do.
>>>
>>> Well, mlx5 has to cope with this. It supports so many devices with so
>>> many config space layouts :( I don't know if we can just hard wire an
>>> offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.
> 
> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?
> 
> e.g. it's not unusual to see a device mmio layout as:
> 
> 	REG_BASE:  base addr of a register block in a BAR
> 	REG_FEAT1_OFF: feature1 regs offset in the register block
> 	REG_FEAT2_OFF: feature2 regs offset in the register block
> 	...

Should we also consider the possibility of vBAR address difference between
the src and dst? It should be impossible that the guest driver mmaps vBAR
again if it has already been loaded.

> Driver accesses registers according to those read-only offsets.
> 
> A FW upgrade may lead to change of offsets but functions stay the
> same. An instance created on an old version can be migrated to
> the new version as long as accesses to old offsets are trapped
> and routed to the new offsets.
> 
> Do we envision things like above in the variant driver or in VMM?
> 
>>>>
>>>> What are you actually proposing?
>>>
>>> Okay, what I'm thinking about is a text file that describes the vPCI
>>> function configuration space to create. The community will standardize
>>> this and VMMs will have to implement to get PASID/etc. Maybe the
>>> community will provide a BSD licensed library to do this job.
>>>
>>> The text file allows the operator to specify exactly the configuration
>>> space the VFIO function should have. It would not be derived
>>> automatically from physical. AFAIK qemu does not have this capability
>>> currently.
>>>
>>> This reflects my observation and discussions around the live migration
>>> standardization. I belive we are fast reaching a point where this is
>>> required.
>>>
>>> Consider standards based migration between wildly different
>>> devices. The devices will not standardize their physical config space,
>>> but an operator could generate a consistent vPCI config space that
>>> works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space.
> 
>>>
>>> Consider the usual working model of the large operators - they define
>>> instance types with some regularity. But an instance type is fixed in
>>> concrete once it is specified, things like the vPCI config space are
>>> fixed.
>>>
>>> Running Instance A on newer hardware with a changed physical config
>>> space should continue to present Instance A's vPCI config layout
>>> regardless. Ie Instance A might not support PASID but Instance B can
>>> run on newer HW that does. The config space layout depends on the
>>> requested Instance Type, not the physical layout.
>>>
>>> The auto-configuration of the config layout from physical is a nice
>>> feature and is excellent for development/small scale, but it shouldn't
>>> be the only way to work.
>>>
>>> So - if we accept that text file configuration should be something the
>>> VMM supports then let's reconsider how to solve the PASID problem.
>>>
>>> I'd say the way to solve it should be via a text file specifying a
>>> full config space layout that includes the PASID cap. From the VMM
>>> perspective this works fine, and it ports to every VMM directly via
>>> processing the text file.
>>>
>>> The autoconfiguration use case can be done by making a tool build the
>>> text file by deriving it from physical, much like today. The single
>>> instance of that tool could have device specific knowledge to avoid
>>> quirks. This way the smarts can still be shared by all the VMMs
>>> without going into the kernel. Special devices with hidden config
>>> space could get special quirks or special reference text files into
>>> the tool repo.
>>>
>>> Serious operators doing production SRIOV/etc would negotiate the text
>>> file with the HW vendors when they define their Instance Type. Ideally
>>> these reference text files would be contributed to the tool repo
>>> above. I think there would be some nice idea to define fully open
>>> source Instance Types that include VFIO devices too.
>>
>> Regarding "if we accept that text file configuration should be
>> something the VMM supports", I'm not on board with this yet, so
>> applying it to PASID discussion seems premature.
>>
>> We've developed variant drivers specifically to host the device specific
>> aspects of migration support.  The requirement of a consistent config
>> space layout is a problem that only exists relative to migration.  This
>> is an issue that I would have considered the responsibility of the
>> variant driver, which would likely expect a consistent interface from
>> the hardware/firmware.  Why does hostile firmware suddenly make it the
>> VMM's problem to provide a consistent ABI to the config space of the
>> device rather than the variant driver?
>>
>> Obviously config maps are something that a VMM could do, but it also
>> seems to impose a non-trivial burden that every VMM requires an
>> implementation of a config space map and integration for each device
>> rather than simply expecting the exposed config space of the device to
>> be part of the migration ABI.  Also this solution specifically only
>> addresses config space compatibility without considering the more
>> generic issue that a variant driver can expose different device
>> personas.  A versioned persona and config space virtualization in the
>> variant driver is a much more flexible solution.  Thanks,
>>
> 
> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.
> 
> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).
> 
> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.
> 
> But now seems the 'text-file' proposal heads the opposite direction?
> 
> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.
> 
> Thanks
> Kevin
>
Jason Gunthorpe April 29, 2024, 5:15 p.m. UTC | #37
On Sun, Apr 28, 2024 at 06:19:29AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 27, 2024 4:14 AM
> > 
> > On Fri, 26 Apr 2024 11:11:17 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > >
> > > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > > Typically the config space layout is a reflection of hardware whether
> > > > the device supports migration or not.
> > >
> > > Er, all our HW has FW constructed config space. It changes with FW
> > > upgrades. We change it during the life of the product. This has to be
> > > considered..
> > 
> > So as I understand it, the concern is that you have firmware that
> > supports migration, but it also openly hostile to the fundamental
> > aspects of exposing a stable device ABI in support of migration.
> > 
> > > > If a driver were to insert a
> > > > virtual capability, then yes it would want to be consistent about it if
> > > > it also cares about migration.  If the driver needs to change the
> > > > location of a virtual capability, problems will arise, but that's also
> > > > not something that every driver needs to do.
> > >
> > > Well, mlx5 has to cope with this. It supports so many devices with so
> > > many config space layouts :( I don't know if we can just hard wire an
> > > offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? 

Not really

> If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.

It's complicated :|

The config space layout can't change once the device is discovered by
the OS and I have a feeling it can't differ between VFs in a SRIOV.

So even if we did say that a device HW/FW will change it's personality
on the fly, I'm not sure we actually can and still conform to the
PCI specs?

> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?

I'd guess no, MMIO doesn't have the OS and standards entanglements. If
the FW can reprogram the MMIO layout then when it loads the migration
state, or provisions the device, it should put the MMIO to the right
configuration.

> Do we envision things like above in the variant driver or in VMM?

In the device itself. I think it would be an extreme case for someone
to make a device that had a different MMIO layout that could be
backwards compatible to an old layout and NOT provide a way for the
device HW to go back to the old layout directly. If so that device
would have to mediate MMIO in the VMM or kernel. Depends on the
complexity I suppose where to do it.

> > > Consider standards based migration between wildly different
> > > devices. The devices will not standardize their physical config space,
> > > but an operator could generate a consistent vPCI config space that
> > > works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space. 

Well, I mean a device from vendor A and another device from vendor
B. They don't have anything in common except implementing the
standard.

The standard perscribes the MMIO layout. Both devices have a way to
accept a migration stream defined by a standard. The MMIO will be
adjusted as the standard requires after loading the migration stream.

Config space primiarily remains a problem in this imagined world.

My best view is that it needs to be solved by config space synthesis
in the hypervisor and not via HW/FW on the physical device.

> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.

Right, we've always discussed this. I've fairly consistently wanted to
push things to userspace, principally for security.

> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).

Sort of, this was actually also about config space synthesis. If we
had this kind of scheme then perhaps the argument would be different.

> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.

Yes, that has always been the argument that has won out.

> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.

Yes, this makes sense to me. Ultimately we all agree the config space
will have to be synthesized.

Jason
Jason Gunthorpe April 29, 2024, 5:44 p.m. UTC | #38
On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > 
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.  
> > 
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.

Well, that makes it sound rude, but yes that is part of it.

mlx5 is tremendously FW defined. The FW can only cope with migration
in some limited cases today. Making that compatability bigger is
ongoing work.

Config space is one of the areas that has not been addressed.
Currently things are such that the FW won't support migration in
combinations that have different physical config space - so it is not
a problem.

But, in principle, it is an issue. AFAIK, the only complete solution
is for the hypervisor to fully synthesize a stable config space.

So, if we keep this in the kernel then I'd imagine the kernel will
need to grow some shared infrastructure to fully synthezise the config
space - not text file based, but basically the same as what I
described for the VMM.

> > But that won't be true if the kernel is making decisions. The config
> > space layout depends now on the kernel driver version too.
> 
> But in the cases where we support migration there's a device specific
> variant driver that supports that migration.  It's the job of that
> variant driver to not only export and import the device state, but also
> to provide a consistent ABI to the user, which includes the config
> space layout. 

Yes, we could do that, but I'm not sure how it will work in all cases.

> I don't understand why we'd say the device programming ABI itself
> falls within the purview of the device/variant driver, but PCI
> config space is defined by device specific code at a higher level.

The "device programming ABI" doesn't contain any *policy*. The layout
of the config space is like 50% policy. Especially when we start to
talk about standards defined migration. The standards will set the
"device programming ABI" and maybe even specify the migration
stream. They will not, and arguably can not, specify the config space.

Config space layout is substantially policy of the instance type. Even
little things like the vendor IDs can be meaningfully replaced in VMs.

> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.

Sure, I'm just explaining a way this could all fit together.

> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  

Well, I wouldn't go quite so far. Arguably even non-migritable
instance types may want to adjust thier config space. Eg if I'm using
a DPU and I get a NVMe/Virtio PCI function I may want to scrub out
details from the config space to make it more general. Even without
migration.

This already happens today in places like VDPA which completely
replace the underlying config space in some cases.

I see it as a difference from a world of highly constrained "instance
types" and a more ad hoc world.

> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?

It is not "hostile firmware"! It accepting that a significant aspect
of the config layout is actually policy.

Plus the standards limitations that mean we can't change the config
space on the fly make it pretty much impossible for the device to
acutally do anything to help here. Software must fix the config space.

> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  

Well, the flip is true to, it is alot of burden on every variant
device driver implement and on the kernel in general to manage config
space policy on behalf of the VMM.

My point is if the VMM is already going to be forced to manage config
space policy for other good reasons, are we sure we want to put a
bunch of stuff in the kernel that sometimes won't be used?

> Also this solution specifically only addresses config space
> compatibility without considering the more generic issue that a
> variant driver can expose different device personas.  A versioned
> persona and config space virtualization in the variant driver is a
> much more flexible solution.

It is addressed, the different personas would have their own text file
maps. The target VMM would have to load the right map. Shared common
code across all the variant drivers.

Jason
Yi Liu July 18, 2024, 1:02 p.m. UTC | #39
Hi Alex, Jason,

We didn't get any conclusion on the vPASID cap last time. Shall we
have more alignment on it? +Vasant as well.

On 2024/4/30 01:44, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem.
>>>> Typically the config space layout is a reflection of hardware whether
>>>> the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW
>>> upgrades. We change it during the life of the product. This has to be
>>> considered..
>>
>> So as I understand it, the concern is that you have firmware that
>> supports migration, but it also openly hostile to the fundamental
>> aspects of exposing a stable device ABI in support of migration.
> 
> Well, that makes it sound rude, but yes that is part of it.
> 
> mlx5 is tremendously FW defined. The FW can only cope with migration
> in some limited cases today. Making that compatability bigger is
> ongoing work.
> 
> Config space is one of the areas that has not been addressed.
> Currently things are such that the FW won't support migration in
> combinations that have different physical config space - so it is not
> a problem.
> 
> But, in principle, it is an issue. AFAIK, the only complete solution
> is for the hypervisor to fully synthesize a stable config space.
> 
> So, if we keep this in the kernel then I'd imagine the kernel will
> need to grow some shared infrastructure to fully synthezise the config
> space - not text file based, but basically the same as what I
> described for the VMM.
> 
>>> But that won't be true if the kernel is making decisions. The config
>>> space layout depends now on the kernel driver version too.
>>
>> But in the cases where we support migration there's a device specific
>> variant driver that supports that migration.  It's the job of that
>> variant driver to not only export and import the device state, but also
>> to provide a consistent ABI to the user, which includes the config
>> space layout.
> 
> Yes, we could do that, but I'm not sure how it will work in all cases.
> 
>> I don't understand why we'd say the device programming ABI itself
>> falls within the purview of the device/variant driver, but PCI
>> config space is defined by device specific code at a higher level.
> 
> The "device programming ABI" doesn't contain any *policy*. The layout
> of the config space is like 50% policy. Especially when we start to
> talk about standards defined migration. The standards will set the
> "device programming ABI" and maybe even specify the migration
> stream. They will not, and arguably can not, specify the config space.
> 
> Config space layout is substantially policy of the instance type. Even
> little things like the vendor IDs can be meaningfully replaced in VMs.
> 
>> Regarding "if we accept that text file configuration should be
>> something the VMM supports", I'm not on board with this yet, so
>> applying it to PASID discussion seems premature.
> 
> Sure, I'm just explaining a way this could all fit together.
> 
>> We've developed variant drivers specifically to host the device specific
>> aspects of migration support.  The requirement of a consistent config
>> space layout is a problem that only exists relative to migration.
> 
> Well, I wouldn't go quite so far. Arguably even non-migritable
> instance types may want to adjust thier config space. Eg if I'm using
> a DPU and I get a NVMe/Virtio PCI function I may want to scrub out
> details from the config space to make it more general. Even without
> migration.
> 
> This already happens today in places like VDPA which completely
> replace the underlying config space in some cases.
> 
> I see it as a difference from a world of highly constrained "instance
> types" and a more ad hoc world.
> 
>> is an issue that I would have considered the responsibility of the
>> variant driver, which would likely expect a consistent interface from
>> the hardware/firmware.  Why does hostile firmware suddenly make it the
>> VMM's problem to provide a consistent ABI to the config space of the
>> device rather than the variant driver?
> 
> It is not "hostile firmware"! It accepting that a significant aspect
> of the config layout is actually policy.
> 
> Plus the standards limitations that mean we can't change the config
> space on the fly make it pretty much impossible for the device to
> acutally do anything to help here. Software must fix the config space.
> 
>> Obviously config maps are something that a VMM could do, but it also
>> seems to impose a non-trivial burden that every VMM requires an
>> implementation of a config space map and integration for each device
>> rather than simply expecting the exposed config space of the device to
>> be part of the migration ABI.
> 
> Well, the flip is true to, it is alot of burden on every variant
> device driver implement and on the kernel in general to manage config
> space policy on behalf of the VMM.
> 
> My point is if the VMM is already going to be forced to manage config
> space policy for other good reasons, are we sure we want to put a
> bunch of stuff in the kernel that sometimes won't be used?
> 
>> Also this solution specifically only addresses config space
>> compatibility without considering the more generic issue that a
>> variant driver can expose different device personas.  A versioned
>> persona and config space virtualization in the variant driver is a
>> much more flexible solution.
> 
> It is addressed, the different personas would have their own text file
> maps. The target VMM would have to load the right map. Shared common
> code across all the variant drivers.
> 
> Jason
Tian, Kevin July 24, 2024, 2:26 a.m. UTC | #40
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 30, 2024 1:45 AM
> 
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > Regarding "if we accept that text file configuration should be
> > something the VMM supports", I'm not on board with this yet, so
> > applying it to PASID discussion seems premature.
> 
> Sure, I'm just explaining a way this could all fit together.
> 

Thinking more along this direction.

I'm not sure how long it will take to standardize such text files and
share them across VMMs. We may need a way to move in steps in
Qemu to unblock the kernel development toward that end goal, e.g.
first accepting a pasid option plus user-specified offset (if offset
unspecified then auto-pick one in cap holes). Later when the text
file is ready then such per-cap options can be deprecated.

This simple way won't fix the migration issue, but at least it's on
par with physical caps (i.e. fail the migration if offset mismatched
between dest/src) and both will be fixed when the text file model
is ready.

Then look at what uAPI is required to report the vPASID cap.

In earlier discussion it's leaning toward extending GET_HW_INFO
in iommufd given both iommu/pci support are required to get
PASID working and iommu driver will not report such support until
pasid has been enabled in both iommu/pci. With that there is no
need to further report PASID in vfio-pci.

But there may be other caps which are shared between VF and
PF while having nothing to do with the iommu. e.g. the Device
Serial Number extended cap (permitted but not recommended
in VF). If there is a need to report such cap on VF which doesn't
implement it to userspace, a vfio uAPI (device_feature or a new
one dedicated to synthetical vcap) appears to be inevitable.

So I wonder whether we leave this part untouched until a real
demand comes or use vpasid to formalize that uAPI to be forward
looking. If in the end such uAPI will exist then it's a bit weird to
have PASID escaped (especially when vfio-pci already reports
PRI/ATS  which have iommu dependency too in vconfig).

In concept the Qemu logic will be clearer if any PCI caps (real
or synthesized) is always conveyed via vfio-pci while iommufd is
for identifying a viommu cap.

Thanks
Kevin
Alex Williamson July 30, 2024, 5:35 p.m. UTC | #41
On Wed, 24 Jul 2024 02:26:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 30, 2024 1:45 AM
> > 
> > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:  
> > > Regarding "if we accept that text file configuration should be
> > > something the VMM supports", I'm not on board with this yet, so
> > > applying it to PASID discussion seems premature.  
> > 
> > Sure, I'm just explaining a way this could all fit together.
> >   
> 
> Thinking more along this direction.
> 
> I'm not sure how long it will take to standardize such text files and
> share them across VMMs. We may need a way to move in steps in
> Qemu to unblock the kernel development toward that end goal, e.g.
> first accepting a pasid option plus user-specified offset (if offset
> unspecified then auto-pick one in cap holes). Later when the text
> file is ready then such per-cap options can be deprecated.

Planned obsolescence is a hard sell.

> This simple way won't fix the migration issue, but at least it's on
> par with physical caps (i.e. fail the migration if offset mismatched
> between dest/src) and both will be fixed when the text file model
> is ready.
> 
> Then look at what uAPI is required to report the vPASID cap.
> 
> In earlier discussion it's leaning toward extending GET_HW_INFO
> in iommufd given both iommu/pci support are required to get
> PASID working and iommu driver will not report such support until
> pasid has been enabled in both iommu/pci. With that there is no
> need to further report PASID in vfio-pci.
> 
> But there may be other caps which are shared between VF and
> PF while having nothing to do with the iommu. e.g. the Device
> Serial Number extended cap (permitted but not recommended
> in VF). If there is a need to report such cap on VF which doesn't
> implement it to userspace, a vfio uAPI (device_feature or a new
> one dedicated to synthetical vcap) appears to be inevitable.
> 
> So I wonder whether we leave this part untouched until a real
> demand comes or use vpasid to formalize that uAPI to be forward
> looking. If in the end such uAPI will exist then it's a bit weird to
> have PASID escaped (especially when vfio-pci already reports
> PRI/ATS  which have iommu dependency too in vconfig).
> 
> In concept the Qemu logic will be clearer if any PCI caps (real
> or synthesized) is always conveyed via vfio-pci while iommufd is
> for identifying a viommu cap.

There are so many moving pieces here and the discussion trailed off a
long time ago.  I have trouble keeping all the relevant considerations
in my head, so let me try to enumerate them, please correct/add.

 - The PASID capability cannot be implemented on VFs per the PCIe spec.
   All VFs share the PF PASID configuration.  This also implies that
   the VF PASID capability is essentially emulated since the VF driver
   cannot manipulate the PF PASID directly.

 - VFIO does not currently expose the PASID capability for PFs, nor
   does anything construct a vPASID capability for VFs.

 - The PASID capability is only useful in combination with a vIOMMU
   with PASID support, which does not yet exist in QEMU.

 - Some devices are known to place registers in configuration space,
   outside of the capability chains, which historically makes it
   difficult to place a purely virtual capability without potentially
   masking such hidden registers.  Current virtual capabilities are
   placed at vendor defined fixed locations to avoid conflicts.

 - There is some expectation that otherwise compatible devices may
   not present identical capability chains, for example devices running
   different firmware or devices from different vendors implementing a
   standard register ABI (virtio) where capability chain layout is not
   standardized.

 - There have been arguments that the layout of device capabilities is
   a policy choice, where both the kernel and libvirt traditionally try
   to avoid making policy decisions.

 - Seamless live migration of devices requires that configuration space
   remains at least consistent, if not identical for much of it.
   Capability offsets cannot change during live migration.  This leads
   to the text file reference above, which is essentially just the
   notion that if the VMM defines the capability layout in config
   space, it would need to do so via a static reference, independent of
   the layout of the physical device and we might want to share that
   among multiple VMMs.

 - For a vfio-pci device to support live migration it must be enabled
   to do so by a vfio-pci variant driver.

 - We've discussed in the community and seem to have a consensus that a
   DVSEC (Designated Vendor Specific Extended Capability) could be
   defined to describe unused configuration space.  Such a DVSEC could
   be implemented natively by the device or supplied by a vfio-pci
   variant driver.  There is currently no definition of such a DVSEC.

So what are we trying to accomplish here.  PASID is the first
non-device specific virtual capability that we'd like to insert into
the VM view of the capability chain.  It won't be the last.

 - Do we push the policy of defining the capability offset to the user?

 - Do we do some hand waving that devices supporting PASID shouldn't
   have hidden registers and therefore the VMM can simply find a gap?

 - Do we ask the hardware vendor or variant driver to insert a DVSEC to
   identify available config space?

 - Do we handle this as just another device quirk, where we maintain a
   table of supported devices and vPASID offset for each?

 - Do we consider this an inflection point where the VMM entirely takes
   over the layout of the capability spaces to impose a stable
   migration layout?  On what basis do we apply that inflection?

 - Also, do we require the same policy for both standard and extended
   capability chains?

I understand the desire to make some progress, but QEMU relies on
integration with management tools, so a temporary option for a user to
specify a PASID offset in isolation sounds like a non-starter to me.

This might be a better sell if the user interface allowed fully
defining the capability chain layout from the command line and this
interface would continue to exist and supersede how the VMM might
otherwise define the capability chain when used.  A fully user defined
layout would be complicated though, so I think there would still be a
desire for QEMU to consume or define a consistent policy itself.

Even if QEMU defines the layout for a device, there may be multiple
versions of that device.  For example, maybe we just add PASID now, but
at some point we decide that we do want to replicate the PF serial
number capability.  At that point we have versions of the device which
would need to be tied to versions of the machine and maybe also
selected via a profile switch on the device command line.

If we want to simplify this, maybe we do just look at whether the
vIOMMU is configured for PASID support and if the device supports it,
then we just look for a gap and add the capability.  If we end up with
different results between source and target for migration, then
migration will fail.  Possibly we end up with a quirk table to override
the default placement of specific capabilities on specific devices.
That might evolve into a lookup for where we place all capabilities,
which essentially turns into the "file" where the VMM defines the entire
layout for some devices.

This is already TL;DR, so I'll end with that before I further drowned
the possibility of discussion.  Thanks,

Alex
Tian, Kevin July 31, 2024, 5:15 a.m. UTC | #42
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, July 31, 2024 1:35 AM
> 
> On Wed, 24 Jul 2024 02:26:20 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 30, 2024 1:45 AM
> > >
> > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > > > Regarding "if we accept that text file configuration should be
> > > > something the VMM supports", I'm not on board with this yet, so
> > > > applying it to PASID discussion seems premature.
> > >
> > > Sure, I'm just explaining a way this could all fit together.
> > >
> >
> > Thinking more along this direction.
> >
> > I'm not sure how long it will take to standardize such text files and
> > share them across VMMs. We may need a way to move in steps in
> > Qemu to unblock the kernel development toward that end goal, e.g.
> > first accepting a pasid option plus user-specified offset (if offset
> > unspecified then auto-pick one in cap holes). Later when the text
> > file is ready then such per-cap options can be deprecated.
> 
> Planned obsolescence is a hard sell.
> 
> > This simple way won't fix the migration issue, but at least it's on
> > par with physical caps (i.e. fail the migration if offset mismatched
> > between dest/src) and both will be fixed when the text file model
> > is ready.
> >
> > Then look at what uAPI is required to report the vPASID cap.
> >
> > In earlier discussion it's leaning toward extending GET_HW_INFO
> > in iommufd given both iommu/pci support are required to get
> > PASID working and iommu driver will not report such support until
> > pasid has been enabled in both iommu/pci. With that there is no
> > need to further report PASID in vfio-pci.
> >
> > But there may be other caps which are shared between VF and
> > PF while having nothing to do with the iommu. e.g. the Device
> > Serial Number extended cap (permitted but not recommended
> > in VF). If there is a need to report such cap on VF which doesn't
> > implement it to userspace, a vfio uAPI (device_feature or a new
> > one dedicated to synthetical vcap) appears to be inevitable.
> >
> > So I wonder whether we leave this part untouched until a real
> > demand comes or use vpasid to formalize that uAPI to be forward
> > looking. If in the end such uAPI will exist then it's a bit weird to
> > have PASID escaped (especially when vfio-pci already reports
> > PRI/ATS  which have iommu dependency too in vconfig).
> >
> > In concept the Qemu logic will be clearer if any PCI caps (real
> > or synthesized) is always conveyed via vfio-pci while iommufd is
> > for identifying a viommu cap.
> 
> There are so many moving pieces here and the discussion trailed off a
> long time ago.  I have trouble keeping all the relevant considerations
> in my head, so let me try to enumerate them, please correct/add.

Thanks for the summary!

> 
>  - The PASID capability cannot be implemented on VFs per the PCIe spec.
>    All VFs share the PF PASID configuration.  This also implies that
>    the VF PASID capability is essentially emulated since the VF driver
>    cannot manipulate the PF PASID directly.
> 
>  - VFIO does not currently expose the PASID capability for PFs, nor
>    does anything construct a vPASID capability for VFs.
> 
>  - The PASID capability is only useful in combination with a vIOMMU
>    with PASID support, which does not yet exist in QEMU.
> 
>  - Some devices are known to place registers in configuration space,
>    outside of the capability chains, which historically makes it
>    difficult to place a purely virtual capability without potentially
>    masking such hidden registers.  Current virtual capabilities are
>    placed at vendor defined fixed locations to avoid conflicts.
> 
>  - There is some expectation that otherwise compatible devices may
>    not present identical capability chains, for example devices running
>    different firmware or devices from different vendors implementing a
>    standard register ABI (virtio) where capability chain layout is not
>    standardized.
> 
>  - There have been arguments that the layout of device capabilities is
>    a policy choice, where both the kernel and libvirt traditionally try
>    to avoid making policy decisions.
> 
>  - Seamless live migration of devices requires that configuration space
>    remains at least consistent, if not identical for much of it.

I didn't quite get it. I thought being consistent means fully identical
config space from guest p.o.v.

>    Capability offsets cannot change during live migration.  This leads
>    to the text file reference above, which is essentially just the
>    notion that if the VMM defines the capability layout in config
>    space, it would need to do so via a static reference, independent of
>    the layout of the physical device and we might want to share that
>    among multiple VMMs.
> 
>  - For a vfio-pci device to support live migration it must be enabled
>    to do so by a vfio-pci variant driver.
> 
>  - We've discussed in the community and seem to have a consensus that a
>    DVSEC (Designated Vendor Specific Extended Capability) could be
>    defined to describe unused configuration space.  Such a DVSEC could
>    be implemented natively by the device or supplied by a vfio-pci
>    variant driver.  There is currently no definition of such a DVSEC.

I'm not sure whether DVSEC is still that necessary if the direction is
to go userspace-defined layout. In a synthetic world the unused
physical space doesn't really matter.

So this consensus IMHO was better placed under the umbrella of
the other direction having the kernel define the layout.

> 
> So what are we trying to accomplish here.  PASID is the first
> non-device specific virtual capability that we'd like to insert into
> the VM view of the capability chain.  It won't be the last.
> 
>  - Do we push the policy of defining the capability offset to the user?

Looks yes as I didn't see a strong argument for the opposite way.

> 
>  - Do we do some hand waving that devices supporting PASID shouldn't
>    have hidden registers and therefore the VMM can simply find a gap?

I assume 'handwaving' doesn't mean any measure in code to actually
block those devices (as doing so likely requires certain denylist based on
device/vendor ID but then why not going a step further to also hard
code an offset?). It's more a try-and-fail model where vPASID is opted
in via a cmdline parameter then a device with hidden registers may
misbehave if the VMM happens to find a conflict gap. And the impact
is restricted only to a new setup where the user is interested in
PASID  to opt hence can afford diagnostics effort to figure out the restriction.

> 
>  - Do we ask the hardware vendor or variant driver to insert a DVSEC to
>    identify available config space?

As said I don't think it's necessary if leaving the policy to the user

> 
>  - Do we handle this as just another device quirk, where we maintain a
>    table of supported devices and vPASID offset for each?
> 
>  - Do we consider this an inflection point where the VMM entirely takes
>    over the layout of the capability spaces to impose a stable
>    migration layout?  On what basis do we apply that inflection?
> 
>  - Also, do we require the same policy for both standard and extended
>    capability chains?

suppose yes.

> 
> I understand the desire to make some progress, but QEMU relies on
> integration with management tools, so a temporary option for a user to
> specify a PASID offset in isolation sounds like a non-starter to me.
> 
> This might be a better sell if the user interface allowed fully
> defining the capability chain layout from the command line and this
> interface would continue to exist and supersede how the VMM might
> otherwise define the capability chain when used.  A fully user defined
> layout would be complicated though, so I think there would still be a
> desire for QEMU to consume or define a consistent policy itself.
> 
> Even if QEMU defines the layout for a device, there may be multiple
> versions of that device.  For example, maybe we just add PASID now, but
> at some point we decide that we do want to replicate the PF serial
> number capability.  At that point we have versions of the device which
> would need to be tied to versions of the machine and maybe also
> selected via a profile switch on the device command line.
> 
> If we want to simplify this, maybe we do just look at whether the
> vIOMMU is configured for PASID support and if the device supports it,

and this is related to the open which I raised in last mail - whether we
want to report the PASID support both in iommufd and vfio-pci uAPI.

My impression is yes as there may be requirement of exposing a virtual
capability which doesn't rely on the IOMMU.

> then we just look for a gap and add the capability.  If we end up with
> different results between source and target for migration, then
> migration will fail.  Possibly we end up with a quirk table to override
> the default placement of specific capabilities on specific devices.

emm how does a quirk table work with devices having volatile config
space layout cross FW versions? Can VMM assigned with a VF be able
to check the FW version of the PF?

> That might evolve into a lookup for where we place all capabilities,
> which essentially turns into the "file" where the VMM defines the entire
> layout for some devices.

Overall this sounds a feasible path to move forward - starting with
the VMM to find the gap automatically if a new PASID option is
opted in. Devices with hidden registers may fail. Devices with volatile
config space due to FW upgrade or cross vendors may fail to migrate.
Then evolving it to the file-based scheme, and there is time to discuss
any intermediate improvement (fixed quirks, cmdline offset, etc.) in
between.

> 
> This is already TL;DR, so I'll end with that before I further drowned
> the possibility of discussion.  Thanks,
> 
> Alex
Alex Williamson July 31, 2024, 5:04 p.m. UTC | #43
On Wed, 31 Jul 2024 05:15:25 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, July 31, 2024 1:35 AM
> > 
> > On Wed, 24 Jul 2024 02:26:20 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, April 30, 2024 1:45 AM
> > > >
> > > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:  
> > > > > Regarding "if we accept that text file configuration should be
> > > > > something the VMM supports", I'm not on board with this yet, so
> > > > > applying it to PASID discussion seems premature.  
> > > >
> > > > Sure, I'm just explaining a way this could all fit together.
> > > >  
> > >
> > > Thinking more along this direction.
> > >
> > > I'm not sure how long it will take to standardize such text files and
> > > share them across VMMs. We may need a way to move in steps in
> > > Qemu to unblock the kernel development toward that end goal, e.g.
> > > first accepting a pasid option plus user-specified offset (if offset
> > > unspecified then auto-pick one in cap holes). Later when the text
> > > file is ready then such per-cap options can be deprecated.  
> > 
> > Planned obsolescence is a hard sell.
> >   
> > > This simple way won't fix the migration issue, but at least it's on
> > > par with physical caps (i.e. fail the migration if offset mismatched
> > > between dest/src) and both will be fixed when the text file model
> > > is ready.
> > >
> > > Then look at what uAPI is required to report the vPASID cap.
> > >
> > > In earlier discussion it's leaning toward extending GET_HW_INFO
> > > in iommufd given both iommu/pci support are required to get
> > > PASID working and iommu driver will not report such support until
> > > pasid has been enabled in both iommu/pci. With that there is no
> > > need to further report PASID in vfio-pci.
> > >
> > > But there may be other caps which are shared between VF and
> > > PF while having nothing to do with the iommu. e.g. the Device
> > > Serial Number extended cap (permitted but not recommended
> > > in VF). If there is a need to report such cap on VF which doesn't
> > > implement it to userspace, a vfio uAPI (device_feature or a new
> > > one dedicated to synthetical vcap) appears to be inevitable.
> > >
> > > So I wonder whether we leave this part untouched until a real
> > > demand comes or use vpasid to formalize that uAPI to be forward
> > > looking. If in the end such uAPI will exist then it's a bit weird to
> > > have PASID escaped (especially when vfio-pci already reports
> > > PRI/ATS  which have iommu dependency too in vconfig).
> > >
> > > In concept the Qemu logic will be clearer if any PCI caps (real
> > > or synthesized) is always conveyed via vfio-pci while iommufd is
> > > for identifying a viommu cap.  
> > 
> > There are so many moving pieces here and the discussion trailed off a
> > long time ago.  I have trouble keeping all the relevant considerations
> > in my head, so let me try to enumerate them, please correct/add.  
> 
> Thanks for the summary!
> 
> > 
> >  - The PASID capability cannot be implemented on VFs per the PCIe spec.
> >    All VFs share the PF PASID configuration.  This also implies that
> >    the VF PASID capability is essentially emulated since the VF driver
> >    cannot manipulate the PF PASID directly.
> > 
> >  - VFIO does not currently expose the PASID capability for PFs, nor
> >    does anything construct a vPASID capability for VFs.
> > 
> >  - The PASID capability is only useful in combination with a vIOMMU
> >    with PASID support, which does not yet exist in QEMU.
> > 
> >  - Some devices are known to place registers in configuration space,
> >    outside of the capability chains, which historically makes it
> >    difficult to place a purely virtual capability without potentially
> >    masking such hidden registers.  Current virtual capabilities are
> >    placed at vendor defined fixed locations to avoid conflicts.
> > 
> >  - There is some expectation that otherwise compatible devices may
> >    not present identical capability chains, for example devices running
> >    different firmware or devices from different vendors implementing a
> >    standard register ABI (virtio) where capability chain layout is not
> >    standardized.
> > 
> >  - There have been arguments that the layout of device capabilities is
> >    a policy choice, where both the kernel and libvirt traditionally try
> >    to avoid making policy decisions.
> > 
> >  - Seamless live migration of devices requires that configuration space
> >    remains at least consistent, if not identical for much of it.  
> 
> I didn't quite get it. I thought being consistent means fully identical
> config space from guest p.o.v.

See for example:

https://gitlab.com/qemu-project/qemu/-/commit/187716feeba406b5a3879db66a7bafd687472a1f 

The layout of config space and most of the contents therein need to be
identical, but there are arguably elements that could be volatile which
only need to be consistent.

> >    Capability offsets cannot change during live migration.  This leads
> >    to the text file reference above, which is essentially just the
> >    notion that if the VMM defines the capability layout in config
> >    space, it would need to do so via a static reference, independent of
> >    the layout of the physical device and we might want to share that
> >    among multiple VMMs.
> > 
> >  - For a vfio-pci device to support live migration it must be enabled
> >    to do so by a vfio-pci variant driver.
> > 
> >  - We've discussed in the community and seem to have a consensus that a
> >    DVSEC (Designated Vendor Specific Extended Capability) could be
> >    defined to describe unused configuration space.  Such a DVSEC could
> >    be implemented natively by the device or supplied by a vfio-pci
> >    variant driver.  There is currently no definition of such a DVSEC.  
> 
> I'm not sure whether DVSEC is still that necessary if the direction is
> to go userspace-defined layout. In a synthetic world the unused
> physical space doesn't really matter.
> 
> So this consensus IMHO was better placed under the umbrella of
> the other direction having the kernel define the layout.

I agree that we don't seem to be headed in a direction that requires
this, but I just wanted to include that there was a roughly agreed upon
way for devices and variant drivers to annotate unused config space
ranges for higher levels.  If we head in a direction where the VMM
chooses an offset for the PASID capability, we need to keep track of
whether this DVSEC comes to fruition and how that affects the offset
that QEMU might choose.

> > So what are we trying to accomplish here.  PASID is the first
> > non-device specific virtual capability that we'd like to insert into
> > the VM view of the capability chain.  It won't be the last.
> > 
> >  - Do we push the policy of defining the capability offset to the user?  
> 
> Looks yes as I didn't see a strong argument for the opposite way.

It's a policy choice though, so where and how is it implemented?  It
works fine for those of us willing to edit xml or launch VMs by command
line, but libvirt isn't going to sign up to insert a policy choice for
a device.  If we get to even higher level tools, does anything that
wants to implement PASID support required a vendor operator driver to
make such policy choices (btw, I'm just throwing out the "operator"
term as if I know what it means, I don't).

> >  - Do we do some hand waving that devices supporting PASID shouldn't
> >    have hidden registers and therefore the VMM can simply find a gap?  
> 
> I assume 'handwaving' doesn't mean any measure in code to actually
> block those devices (as doing so likely requires certain denylist based on
> device/vendor ID but then why not going a step further to also hard
> code an offset?). It's more a try-and-fail model where vPASID is opted
> in via a cmdline parameter then a device with hidden registers may
> misbehave if the VMM happens to find a conflict gap. And the impact
> is restricted only to a new setup where the user is interested in
> PASID  to opt hence can afford diagnostics effort to figure out the restriction.

If you want to hard code an offset then we're effectively introducing a
device specific quirk to enable PASID support.  I thought we wanted
this to work generically for any device exposing PASID, therefore I was
thinking more of "find a gap" as the default strategy with quirks used
to augment the resulting offset where necessary.

I'd also be careful about command line parameters.  I think we require
one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
on the vfio-pci device, instead simply enabling support when both the
vIOMMU support is enabled and the device is detected to support it.
Each command line option requires support in the upper level tools to
enable it.

> >  - Do we ask the hardware vendor or variant driver to insert a DVSEC to
> >    identify available config space?  
> 
> As said I don't think it's necessary if leaving the policy to the user

Leaving the policy to the user is essentially just kicking the can down
the road and I don't know where that policy actually gets implemented
in a cloud, production environment.  I think it would align with QEMU
practices if the user could override a default policy on the command
line, but ultimately if we want to keep the policy decision out of the
kernel then the defaults probably need to be implemented in QEMU.

> >  - Do we handle this as just another device quirk, where we maintain a
> >    table of supported devices and vPASID offset for each?
> > 
> >  - Do we consider this an inflection point where the VMM entirely takes
> >    over the layout of the capability spaces to impose a stable
> >    migration layout?  On what basis do we apply that inflection?
> > 
> >  - Also, do we require the same policy for both standard and extended
> >    capability chains?  
> 
> suppose yes.
> 
> > 
> > I understand the desire to make some progress, but QEMU relies on
> > integration with management tools, so a temporary option for a user to
> > specify a PASID offset in isolation sounds like a non-starter to me.
> > 
> > This might be a better sell if the user interface allowed fully
> > defining the capability chain layout from the command line and this
> > interface would continue to exist and supersede how the VMM might
> > otherwise define the capability chain when used.  A fully user defined
> > layout would be complicated though, so I think there would still be a
> > desire for QEMU to consume or define a consistent policy itself.
> > 
> > Even if QEMU defines the layout for a device, there may be multiple
> > versions of that device.  For example, maybe we just add PASID now, but
> > at some point we decide that we do want to replicate the PF serial
> > number capability.  At that point we have versions of the device which
> > would need to be tied to versions of the machine and maybe also
> > selected via a profile switch on the device command line.
> > 
> > If we want to simplify this, maybe we do just look at whether the
> > vIOMMU is configured for PASID support and if the device supports it,  
> 
> and this is related to the open which I raised in last mail - whether we
> want to report the PASID support both in iommufd and vfio-pci uAPI.
> 
> My impression is yes as there may be requirement of exposing a virtual
> capability which doesn't rely on the IOMMU.

What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
agree that there will be capabilities related to the iommufd and
capabilities only related to the device, but I disagree that that
provides justification to report PASID via both uAPIs.  Are we also
going to ask iommufd to report that a device has an optional serial
number capability?  It clearly doesn't make sense for iommufd to be
involved with that, so why does it make sense for vfio-pci to be
involved in reporting something that is more iommufd specific?

> > then we just look for a gap and add the capability.  If we end up with
> > different results between source and target for migration, then
> > migration will fail.  Possibly we end up with a quirk table to override
> > the default placement of specific capabilities on specific devices.  
> 
> emm how does a quirk table work with devices having volatile config
> space layout cross FW versions? Can VMM assigned with a VF be able
> to check the FW version of the PF?

If the VMM can't find the same gap between source and destination then
a quirk could make sure that the PASID offset is consistent.  But also
if the VMM doesn't find the same gap then that suggests the config
space is already different and not only the offset of the PASID
capability will need to be fixed via a quirk, so then we're into
quirking the entire capability space for the device.

The VMM should not be assumed to have any additional privileges beyond
what we provide it through the vfio device and iommufd interface.
Testing anything about the PF would require access on the host that
won't work in more secure environments.  Therefore if we can't
consistently place the PASID for a device, we probably need to quirk it
based on the vendor/device IDs or sub-IDs or we need to rely on a
management implied policy such as a device profile option on the QEMU
command line or maybe different classes of the vfio-pci driver in QEMU.

> > That might evolve into a lookup for where we place all capabilities,
> > which essentially turns into the "file" where the VMM defines the entire
> > layout for some devices.  
> 
> Overall this sounds a feasible path to move forward - starting with
> the VMM to find the gap automatically if a new PASID option is
> opted in. Devices with hidden registers may fail. Devices with volatile
> config space due to FW upgrade or cross vendors may fail to migrate.
> Then evolving it to the file-based scheme, and there is time to discuss
> any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> between.

As above, let's be careful about introducing unnecessary command line
options, especially if we expect support for them in higher level
tools.  If we place the PASID somewhere that makes the device not work,
then disabling PASID on the vIOMMU should resolve that.  It won't be a
regression, it will only be an incompatibility with a new feature.
That incompatibility may require a quirk to resolve to have the PASID
placed somewhere else.  If the PASID is placed at different offsets
based on device firmware or vendor then the location of the PASID alone
isn't the only thing preventing migration and we'll need to introduce
code for the VMM to take ownership of the capability layout at that
point.  Thanks,

Alex
Tian, Kevin Aug. 1, 2024, 7:45 a.m. UTC | #44
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, August 1, 2024 1:05 AM
> 
> On Wed, 31 Jul 2024 05:15:25 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, July 31, 2024 1:35 AM
> > >
> > >  - Seamless live migration of devices requires that configuration space
> > >    remains at least consistent, if not identical for much of it.
> >
> > I didn't quite get it. I thought being consistent means fully identical
> > config space from guest p.o.v.
> 
> See for example:
> 
> https://gitlab.com/qemu-project/qemu/-
> /commit/187716feeba406b5a3879db66a7bafd687472a1f

Thanks!

> 
> The layout of config space and most of the contents therein need to be
> identical, but there are arguably elements that could be volatile which
> only need to be consistent.

hmm IMHO it's more that the guest doesn't care volatile content in that
field instead of the guest view being strictly consistent. Probably I
don't really understand the meaning of consistency in this context...

btw that fix claims:

  "
  Here consistency could mean that VSC format should be same on
  source and destination, however actual Vendor Specific Info may
  not be byte-to-byte identical.
  "

Does it apply to all devices supporting VSC? It's OK for NVDIA vGPU
but I'm not sure whether some vendor driver might be sensitive to
byte-to-byte consistency in VSC.

> > >
> > >  - We've discussed in the community and seem to have a consensus that a
> > >    DVSEC (Designated Vendor Specific Extended Capability) could be
> > >    defined to describe unused configuration space.  Such a DVSEC could
> > >    be implemented natively by the device or supplied by a vfio-pci
> > >    variant driver.  There is currently no definition of such a DVSEC.
> >
> > I'm not sure whether DVSEC is still that necessary if the direction is
> > to go userspace-defined layout. In a synthetic world the unused
> > physical space doesn't really matter.
> >
> > So this consensus IMHO was better placed under the umbrella of
> > the other direction having the kernel define the layout.
> 
> I agree that we don't seem to be headed in a direction that requires
> this, but I just wanted to include that there was a roughly agreed upon
> way for devices and variant drivers to annotate unused config space
> ranges for higher levels.  If we head in a direction where the VMM
> chooses an offset for the PASID capability, we need to keep track of
> whether this DVSEC comes to fruition and how that affects the offset
> that QEMU might choose.

Yes. We can keep this option in case there is a demand, especially
if the file-based synthesized scheme won't be built in one day and
we need a default policy in VMM.

> 
> > > So what are we trying to accomplish here.  PASID is the first
> > > non-device specific virtual capability that we'd like to insert into
> > > the VM view of the capability chain.  It won't be the last.
> > >
> > >  - Do we push the policy of defining the capability offset to the user?
> >
> > Looks yes as I didn't see a strong argument for the opposite way.
> 
> It's a policy choice though, so where and how is it implemented?  It
> works fine for those of us willing to edit xml or launch VMs by command
> line, but libvirt isn't going to sign up to insert a policy choice for
> a device.  If we get to even higher level tools, does anything that
> wants to implement PASID support required a vendor operator driver to
> make such policy choices (btw, I'm just throwing out the "operator"
> term as if I know what it means, I don't).

I had a rough feeling that there might be other usages requiring such
vendor plugin, e.g. provisioning VF/ADI may require vendor specific
configurations, but not really an expert in this area.

Overall I feel most of our discussions so far are about VMM-auto-
find-offset vs. file-based-policy-scheme which both belong to
user-defined policy, suggesting that we all agreed to drop the other
way having kernel define the offset (plus in-kernel quirks, etc.)?

Even the said DVSEC is to assist such user-defined direction.

> 
> > >  - Do we do some hand waving that devices supporting PASID shouldn't
> > >    have hidden registers and therefore the VMM can simply find a gap?
> >
> > I assume 'handwaving' doesn't mean any measure in code to actually
> > block those devices (as doing so likely requires certain denylist based on
> > device/vendor ID but then why not going a step further to also hard
> > code an offset?). It's more a try-and-fail model where vPASID is opted
> > in via a cmdline parameter then a device with hidden registers may
> > misbehave if the VMM happens to find a conflict gap. And the impact
> > is restricted only to a new setup where the user is interested in
> > PASID  to opt hence can afford diagnostics effort to figure out the
> restriction.
> 
> If you want to hard code an offset then we're effectively introducing a
> device specific quirk to enable PASID support.  I thought we wanted
> this to work generically for any device exposing PASID, therefore I was
> thinking more of "find a gap" as the default strategy with quirks used
> to augment the resulting offset where necessary.
> 
> I'd also be careful about command line parameters.  I think we require
> one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> on the vfio-pci device, instead simply enabling support when both the
> vIOMMU support is enabled and the device is detected to support it.
> Each command line option requires support in the upper level tools to
> enable it.

Make sense. btw will there be a requirement that the user wants to
disable PASID even if the device supports it, e.g. for testing purpose
or to workaround a HW errata disclosed after host driver claims the
support in an old kernel?

> > >
> > > I understand the desire to make some progress, but QEMU relies on
> > > integration with management tools, so a temporary option for a user to
> > > specify a PASID offset in isolation sounds like a non-starter to me.
> > >
> > > This might be a better sell if the user interface allowed fully
> > > defining the capability chain layout from the command line and this
> > > interface would continue to exist and supersede how the VMM might
> > > otherwise define the capability chain when used.  A fully user defined
> > > layout would be complicated though, so I think there would still be a
> > > desire for QEMU to consume or define a consistent policy itself.
> > >
> > > Even if QEMU defines the layout for a device, there may be multiple
> > > versions of that device.  For example, maybe we just add PASID now, but
> > > at some point we decide that we do want to replicate the PF serial
> > > number capability.  At that point we have versions of the device which
> > > would need to be tied to versions of the machine and maybe also
> > > selected via a profile switch on the device command line.
> > >
> > > If we want to simplify this, maybe we do just look at whether the
> > > vIOMMU is configured for PASID support and if the device supports it,
> >
> > and this is related to the open which I raised in last mail - whether we
> > want to report the PASID support both in iommufd and vfio-pci uAPI.
> >
> > My impression is yes as there may be requirement of exposing a virtual
> > capability which doesn't rely on the IOMMU.
> 
> What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
> agree that there will be capabilities related to the iommufd and
> capabilities only related to the device, but I disagree that that
> provides justification to report PASID via both uAPIs.  Are we also
> going to ask iommufd to report that a device has an optional serial
> number capability?  It clearly doesn't make sense for iommufd to be

Certainly no. My point was that vfio-pci/iommufd each reports its
own capability set. They may overlap but this fact just matches the
physical world.

> involved with that, so why does it make sense for vfio-pci to be
> involved in reporting something that is more iommufd specific?

It doesn't matter which one involves more. It's more akin to the
physical world.

btw vfio-pci already reports ATS/PRI which both rely on iommufd
in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a
good justification for why it's special.

I envision an extension to vfio device feature or a new vfio uAPI
for reporting virtual capabilities as augment to the ones filled in
vconfig space. 

> 
> > > then we just look for a gap and add the capability.  If we end up with
> > > different results between source and target for migration, then
> > > migration will fail.  Possibly we end up with a quirk table to override
> > > the default placement of specific capabilities on specific devices.
> >
> > emm how does a quirk table work with devices having volatile config
> > space layout cross FW versions? Can VMM assigned with a VF be able
> > to check the FW version of the PF?
> 
> If the VMM can't find the same gap between source and destination then
> a quirk could make sure that the PASID offset is consistent.  But also
> if the VMM doesn't find the same gap then that suggests the config
> space is already different and not only the offset of the PASID
> capability will need to be fixed via a quirk, so then we're into
> quirking the entire capability space for the device.

yes. So the quirk table is more for fixing the functional gap (i.e. not
overlap with a hidden register) instead of for migration. As long as
a device can function correctly with it, the virtual caps fall into the
same restriction as physical caps in migration i.e. upon inconsistent
layout between src/dest we'll need separate way to synthesize the
entire space.

> 
> The VMM should not be assumed to have any additional privileges beyond
> what we provide it through the vfio device and iommufd interface.
> Testing anything about the PF would require access on the host that
> won't work in more secure environments.  Therefore if we can't
> consistently place the PASID for a device, we probably need to quirk it
> based on the vendor/device IDs or sub-IDs or we need to rely on a
> management implied policy such as a device profile option on the QEMU
> command line or maybe different classes of the vfio-pci driver in QEMU.
> 
> > > That might evolve into a lookup for where we place all capabilities,
> > > which essentially turns into the "file" where the VMM defines the entire
> > > layout for some devices.
> >
> > Overall this sounds a feasible path to move forward - starting with
> > the VMM to find the gap automatically if a new PASID option is
> > opted in. Devices with hidden registers may fail. Devices with volatile
> > config space due to FW upgrade or cross vendors may fail to migrate.
> > Then evolving it to the file-based scheme, and there is time to discuss
> > any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> > between.
> 
> As above, let's be careful about introducing unnecessary command line
> options, especially if we expect support for them in higher level
> tools.  If we place the PASID somewhere that makes the device not work,
> then disabling PASID on the vIOMMU should resolve that.  It won't be a

vIOMMU is per-platform then it applies to all devices behind, including
those which don't have a problem with auto-selected offset. Not sure
whether one would want to continue enabling PASID for other devices
or should stop immediately to find a quirk for the problematic one and
then resume.

> regression, it will only be an incompatibility with a new feature.
> That incompatibility may require a quirk to resolve to have the PASID
> placed somewhere else.  If the PASID is placed at different offsets
> based on device firmware or vendor then the location of the PASID alone
> isn't the only thing preventing migration and we'll need to introduce
> code for the VMM to take ownership of the capability layout at that
> point.  Thanks,
> 

Yes, the migration issue might be solved in a separate track as it applies
to both physical and virtual capabilities.
Alex Williamson Aug. 2, 2024, 6:25 p.m. UTC | #45
On Thu, 1 Aug 2024 07:45:43 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, August 1, 2024 1:05 AM
> > 
> > On Wed, 31 Jul 2024 05:15:25 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, July 31, 2024 1:35 AM
> > > >
> > > >  - Seamless live migration of devices requires that configuration space
> > > >    remains at least consistent, if not identical for much of it.  
> > >
> > > I didn't quite get it. I thought being consistent means fully identical
> > > config space from guest p.o.v.  
> > 
> > See for example:
> > 
> > https://gitlab.com/qemu-project/qemu/-
> > /commit/187716feeba406b5a3879db66a7bafd687472a1f  
> 
> Thanks!
> 
> > 
> > The layout of config space and most of the contents therein need to be
> > identical, but there are arguably elements that could be volatile which
> > only need to be consistent.  
> 
> hmm IMHO it's more that the guest doesn't care volatile content in that
> field instead of the guest view being strictly consistent. Probably I
> don't really understand the meaning of consistency in this context...
> 
> btw that fix claims:
> 
>   "
>   Here consistency could mean that VSC format should be same on
>   source and destination, however actual Vendor Specific Info may
>   not be byte-to-byte identical.
>   "
> 
> Does it apply to all devices supporting VSC? It's OK for NVDIA vGPU
> but I'm not sure whether some vendor driver might be sensitive to
> byte-to-byte consistency in VSC.

It applies to all VSC capabilities.  The argument is that there is no
standard definition of the content here therefore QEMU has no authority
to impose a policy on the contents.
 
> > > >
> > > >  - We've discussed in the community and seem to have a consensus that a
> > > >    DVSEC (Designated Vendor Specific Extended Capability) could be
> > > >    defined to describe unused configuration space.  Such a DVSEC could
> > > >    be implemented natively by the device or supplied by a vfio-pci
> > > >    variant driver.  There is currently no definition of such a DVSEC.  
> > >
> > > I'm not sure whether DVSEC is still that necessary if the direction is
> > > to go userspace-defined layout. In a synthetic world the unused
> > > physical space doesn't really matter.
> > >
> > > So this consensus IMHO was better placed under the umbrella of
> > > the other direction having the kernel define the layout.  
> > 
> > I agree that we don't seem to be headed in a direction that requires
> > this, but I just wanted to include that there was a roughly agreed upon
> > way for devices and variant drivers to annotate unused config space
> > ranges for higher levels.  If we head in a direction where the VMM
> > chooses an offset for the PASID capability, we need to keep track of
> > whether this DVSEC comes to fruition and how that affects the offset
> > that QEMU might choose.  
> 
> Yes. We can keep this option in case there is a demand, especially
> if the file-based synthesized scheme won't be built in one day and
> we need a default policy in VMM.
> 
> >   
> > > > So what are we trying to accomplish here.  PASID is the first
> > > > non-device specific virtual capability that we'd like to insert into
> > > > the VM view of the capability chain.  It won't be the last.
> > > >
> > > >  - Do we push the policy of defining the capability offset to the user?  
> > >
> > > Looks yes as I didn't see a strong argument for the opposite way.  
> > 
> > It's a policy choice though, so where and how is it implemented?  It
> > works fine for those of us willing to edit xml or launch VMs by command
> > line, but libvirt isn't going to sign up to insert a policy choice for
> > a device.  If we get to even higher level tools, does anything that
> > wants to implement PASID support required a vendor operator driver to
> > make such policy choices (btw, I'm just throwing out the "operator"
> > term as if I know what it means, I don't).  
> 
> I had a rough feeling that there might be other usages requiring such
> vendor plugin, e.g. provisioning VF/ADI may require vendor specific
> configurations, but not really an expert in this area.
> 
> Overall I feel most of our discussions so far are about VMM-auto-
> find-offset vs. file-based-policy-scheme which both belong to
> user-defined policy, suggesting that we all agreed to drop the other
> way having kernel define the offset (plus in-kernel quirks, etc.)?
> 
> Even the said DVSEC is to assist such user-defined direction.

To me a "user defined policy" is placing an option on the command line
and requiring the user, or some higher level authority representing the
user, to provide the policy.  If it's done by the VMM then we're saying
QEMU owns the policy but it might be overridden by the user via a
command line argument or modifying the policy file consumed by QEMU.
 
> > > >  - Do we do some hand waving that devices supporting PASID shouldn't
> > > >    have hidden registers and therefore the VMM can simply find a gap?  
> > >
> > > I assume 'handwaving' doesn't mean any measure in code to actually
> > > block those devices (as doing so likely requires certain denylist based on
> > > device/vendor ID but then why not going a step further to also hard
> > > code an offset?). It's more a try-and-fail model where vPASID is opted
> > > in via a cmdline parameter then a device with hidden registers may
> > > misbehave if the VMM happens to find a conflict gap. And the impact
> > > is restricted only to a new setup where the user is interested in
> > > PASID  to opt hence can afford diagnostics effort to figure out the  
> > restriction.
> > 
> > If you want to hard code an offset then we're effectively introducing a
> > device specific quirk to enable PASID support.  I thought we wanted
> > this to work generically for any device exposing PASID, therefore I was
> > thinking more of "find a gap" as the default strategy with quirks used
> > to augment the resulting offset where necessary.
> > 
> > I'd also be careful about command line parameters.  I think we require
> > one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> > on the vfio-pci device, instead simply enabling support when both the
> > vIOMMU support is enabled and the device is detected to support it.
> > Each command line option requires support in the upper level tools to
> > enable it.  
> 
> Make sense. btw will there be a requirement that the user wants to
> disable PASID even if the device supports it, e.g. for testing purpose
> or to workaround a HW errata disclosed after host driver claims the
> support in an old kernel?

We can make use of "experimental options", ie. "x-" prefixed options,
in QEMU for this.  These are useful for debugging but will not be
considered for support by higher level tools nor do we consider them to
have a stable ABI.

> > > > I understand the desire to make some progress, but QEMU relies on
> > > > integration with management tools, so a temporary option for a user to
> > > > specify a PASID offset in isolation sounds like a non-starter to me.
> > > >
> > > > This might be a better sell if the user interface allowed fully
> > > > defining the capability chain layout from the command line and this
> > > > interface would continue to exist and supersede how the VMM might
> > > > otherwise define the capability chain when used.  A fully user defined
> > > > layout would be complicated though, so I think there would still be a
> > > > desire for QEMU to consume or define a consistent policy itself.
> > > >
> > > > Even if QEMU defines the layout for a device, there may be multiple
> > > > versions of that device.  For example, maybe we just add PASID now, but
> > > > at some point we decide that we do want to replicate the PF serial
> > > > number capability.  At that point we have versions of the device which
> > > > would need to be tied to versions of the machine and maybe also
> > > > selected via a profile switch on the device command line.
> > > >
> > > > If we want to simplify this, maybe we do just look at whether the
> > > > vIOMMU is configured for PASID support and if the device supports it,  
> > >
> > > and this is related to the open which I raised in last mail - whether we
> > > want to report the PASID support both in iommufd and vfio-pci uAPI.
> > >
> > > My impression is yes as there may be requirement of exposing a virtual
> > > capability which doesn't rely on the IOMMU.  
> > 
> > What's the purpose of reporting PASID via both iommufd and vfio-pci?  I
> > agree that there will be capabilities related to the iommufd and
> > capabilities only related to the device, but I disagree that that
> > provides justification to report PASID via both uAPIs.  Are we also
> > going to ask iommufd to report that a device has an optional serial
> > number capability?  It clearly doesn't make sense for iommufd to be  
> 
> Certainly no. My point was that vfio-pci/iommufd each reports its
> own capability set. They may overlap but this fact just matches the
> physical world.
> 
> > involved with that, so why does it make sense for vfio-pci to be
> > involved in reporting something that is more iommufd specific?  
> 
> It doesn't matter which one involves more. It's more akin to the
> physical world.
> 
> btw vfio-pci already reports ATS/PRI which both rely on iommufd
> in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a
> good justification for why it's special.
> 
> I envision an extension to vfio device feature or a new vfio uAPI
> for reporting virtual capabilities as augment to the ones filled in
> vconfig space. 

Should ATS and PRI be reported through vfio-pci or should we just turn
them off to be more like PASID?  Maybe the issue simply hasn't arisen
yet because we don't have vIOMMU support and with that support QEMU
might need to filter out those capabilities and look elsewhere.
Anyway, iommufd and vfio-pci should not duplicate each other here.

> > > > then we just look for a gap and add the capability.  If we end up with
> > > > different results between source and target for migration, then
> > > > migration will fail.  Possibly we end up with a quirk table to override
> > > > the default placement of specific capabilities on specific devices.  
> > >
> > > emm how does a quirk table work with devices having volatile config
> > > space layout cross FW versions? Can VMM assigned with a VF be able
> > > to check the FW version of the PF?  
> > 
> > If the VMM can't find the same gap between source and destination then
> > a quirk could make sure that the PASID offset is consistent.  But also
> > if the VMM doesn't find the same gap then that suggests the config
> > space is already different and not only the offset of the PASID
> > capability will need to be fixed via a quirk, so then we're into
> > quirking the entire capability space for the device.  
> 
> yes. So the quirk table is more for fixing the functional gap (i.e. not
> overlap with a hidden register) instead of for migration. As long as
> a device can function correctly with it, the virtual caps fall into the
> same restriction as physical caps in migration i.e. upon inconsistent
> layout between src/dest we'll need separate way to synthesize the
> entire space.

Yes.

> > The VMM should not be assumed to have any additional privileges beyond
> > what we provide it through the vfio device and iommufd interface.
> > Testing anything about the PF would require access on the host that
> > won't work in more secure environments.  Therefore if we can't
> > consistently place the PASID for a device, we probably need to quirk it
> > based on the vendor/device IDs or sub-IDs or we need to rely on a
> > management implied policy such as a device profile option on the QEMU
> > command line or maybe different classes of the vfio-pci driver in QEMU.
> >   
> > > > That might evolve into a lookup for where we place all capabilities,
> > > > which essentially turns into the "file" where the VMM defines the entire
> > > > layout for some devices.  
> > >
> > > Overall this sounds a feasible path to move forward - starting with
> > > the VMM to find the gap automatically if a new PASID option is
> > > opted in. Devices with hidden registers may fail. Devices with volatile
> > > config space due to FW upgrade or cross vendors may fail to migrate.
> > > Then evolving it to the file-based scheme, and there is time to discuss
> > > any intermediate improvement (fixed quirks, cmdline offset, etc.) in
> > > between.  
> > 
> > As above, let's be careful about introducing unnecessary command line
> > options, especially if we expect support for them in higher level
> > tools.  If we place the PASID somewhere that makes the device not work,
> > then disabling PASID on the vIOMMU should resolve that.  It won't be a  
> 
> vIOMMU is per-platform then it applies to all devices behind, including
> those which don't have a problem with auto-selected offset. Not sure
> whether one would want to continue enabling PASID for other devices
> or should stop immediately to find a quirk for the problematic one and
> then resume.

I'm not sure if this is a real issue, we're talking about a VM, not a
server.  If a user wants PASID support and it's incompatible with a
device, the device can be excluded from the VM or we can have an
experimental option on the vfio-pci device in QEMU as a workaround.  I
don't think this is something we need to plumb up through the tool
stack.  Thanks,

Alex
Tian, Kevin Aug. 5, 2024, 5:35 a.m. UTC | #46
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, August 3, 2024 2:25 AM
> 
> On Thu, 1 Aug 2024 07:45:43 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, August 1, 2024 1:05 AM
> > >
> > > On Wed, 31 Jul 2024 05:15:25 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Wednesday, July 31, 2024 1:35 AM
> > > > >
> > > > > So what are we trying to accomplish here.  PASID is the first
> > > > > non-device specific virtual capability that we'd like to insert into
> > > > > the VM view of the capability chain.  It won't be the last.
> > > > >
> > > > >  - Do we push the policy of defining the capability offset to the user?
> > > >
> > > > Looks yes as I didn't see a strong argument for the opposite way.
> > >
> > > It's a policy choice though, so where and how is it implemented?  It
> > > works fine for those of us willing to edit xml or launch VMs by command
> > > line, but libvirt isn't going to sign up to insert a policy choice for
> > > a device.  If we get to even higher level tools, does anything that
> > > wants to implement PASID support required a vendor operator driver to
> > > make such policy choices (btw, I'm just throwing out the "operator"
> > > term as if I know what it means, I don't).
> >
> > I had a rough feeling that there might be other usages requiring such
> > vendor plugin, e.g. provisioning VF/ADI may require vendor specific
> > configurations, but not really an expert in this area.
> >
> > Overall I feel most of our discussions so far are about VMM-auto-
> > find-offset vs. file-based-policy-scheme which both belong to
> > user-defined policy, suggesting that we all agreed to drop the other
> > way having kernel define the offset (plus in-kernel quirks, etc.)?
> >
> > Even the said DVSEC is to assist such user-defined direction.
> 
> To me a "user defined policy" is placing an option on the command line
> and requiring the user, or some higher level authority representing the
> user, to provide the policy.  If it's done by the VMM then we're saying
> QEMU owns the policy but it might be overridden by the user via a
> command line argument or modifying the policy file consumed by QEMU.

Okay, I see the difference here. In my reply it's clearer to say
"userspace" instead of "user". 
Jason Gunthorpe Aug. 6, 2024, 1:54 p.m. UTC | #47
On Wed, Jul 31, 2024 at 11:04:36AM -0600, Alex Williamson wrote:

> I'd also be careful about command line parameters.  I think we require
> one for the vIOMMU to enable PASID support, but I'd prefer to avoid one
> on the vfio-pci device, instead simply enabling support when both the
> vIOMMU support is enabled and the device is detected to support it.
> Each command line option requires support in the upper level tools to
> enable it.

FWIW AMD seems to have an issue here where if they don't intend to use
PASID then they get a bunch more performance if they leave it turned
off. 

So cases like no vIOMMU, or a vIOMMU with no PASID capability, should
automatically tell the kernel to disable PASID support for the
device. These decisions needs to be made when the HWPT's are first
allocated.

Jason
Jason Gunthorpe Aug. 6, 2024, 2:05 p.m. UTC | #48
On Fri, Aug 02, 2024 at 12:25:28PM -0600, Alex Williamson wrote:

> > I envision an extension to vfio device feature or a new vfio uAPI
> > for reporting virtual capabilities as augment to the ones filled in
> > vconfig space. 
> 
> Should ATS and PRI be reported through vfio-pci or should we just turn
> them off to be more like PASID?  Maybe the issue simply hasn't arisen
> yet because we don't have vIOMMU support and with that support QEMU
> might need to filter out those capabilities and look elsewhere.
> Anyway, iommufd and vfio-pci should not duplicate each other here.

ATS and PRI are strictly controlled only by the iommu driver, VFIO
must not change them dynamically.

Effectively they become set/unset based on what domains are attached
through iommufd.

PRI is turned on when any fault capable domain is attached, eg the
HWPT uses IOMMU_HWPT_FAULT_ID_VALID. I'm not sure I know the full
story here how this will work on the qemu side, but I would guess that
PRI changes in the virtual config space will result in different
nesting domains becoming attached?

ATS, at least for SMMUv3, is controlled by a bit in the vSTE from the
guest, the virtual PCI config space changes should not be reflected to
real HW. There are data integrity/security concerns here. ATS must
only be changed as part of a sequence that can flush the cache.

Jason
Jason Gunthorpe Aug. 6, 2024, 2:20 p.m. UTC | #49
On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:

> Okay. With that I edited my earlier reply a bit by removing the note
> of cmdline option, adding DVSEC possibility, and making it clear that
> the PASID option is in vIOMMU:
> 
> "
> Overall this sounds a feasible path to move forward - starting with
> the VMM to find the gap automatically if PASID is opted in vIOMMU. 
> Devices with hidden registers may fail. Devices with volatile
> config space due to FW upgrade or cross vendors may fail to migrate.
> Then evolving it to the file-based scheme, and there is time to discuss
> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> "
> 
> Jason, your thoughts?

This thread is big and I've read it quickly, but I could support the
above summary.

For migration, where we are today, it is completely up to the device
and it's FW to present a new config space that is stable across
migration. In practice this means the device needs to have pre-set the
PF to whatever config state it needs during FLR, and refuse to migrate
if things are not correct.

Moving away from that limitation toward a VMM created stable config
space is definitely a long term interest. I understand other VMM's are
already doing things like this.

It seems we need more time to bake on this long term direction. I
suggested a text file as something very general, but we can do other
things too.

What I see as the general industry direction is toward a very
perspective vPCI function, where you might have v1, v2, etc of a
device that present different config space/behavior/etc. I expect
standards bodies are going to define reference vPCI functions for
their standards along these lines. This would specify all the MMIO
memory layout and every bit of config space right down to each offset
and bit.

This is the sort of restriction that will be needed to allow more
generic live migration between different devices and different
FW. There is a pretty clear role here for the VMM to synthesize a
highly perscribed config space. How we get there and how the libvirt
stack should support this, I don't know.

Jason
Jason Gunthorpe Aug. 6, 2024, 2:30 p.m. UTC | #50
On Tue, Jul 30, 2024 at 11:35:17AM -0600, Alex Williamson wrote:

> Even if QEMU defines the layout for a device, there may be multiple
> versions of that device.  For example, maybe we just add PASID now, but
> at some point we decide that we do want to replicate the PF serial
> number capability.  At that point we have versions of the device which
> would need to be tied to versions of the machine and maybe also
> selected via a profile switch on the device command line.

This is the larger end state I see here.

When the admin asks to install a vPCI function into a VM the admin
would specify they want, say, 'virtio-net vPCI spec 1.0'.

The text of spec 1.0 would specify each and every bit of config space
and register layout. It would specify the content of the live
migration blobs, and an exact feature set/behaviors the device must
advertise.

Even if the device can do more, a combination of the VMM and the
variant PCI driver+device FW would make it do only what spec 1.0 says
it should do.

You can imagine there would be a range of these standards available,
and large sites have a high chance to have their own private forks.

This should be viewed as totally opposite the current VFIO behavior
that intends to reflect the exact device, as-is, into the VM. I think
we can reasonably have different approaches to each.

So, how we manage this as an ecosystem, I don't know. It sure would be
nice to not need kernel changes to push through a new virtual device
spec!

But I think your summary is good.

Thanks,
Jason
Yi Liu Aug. 14, 2024, 6:38 a.m. UTC | #51
On 2024/8/6 22:20, Jason Gunthorpe wrote:
> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> 
>> Okay. With that I edited my earlier reply a bit by removing the note
>> of cmdline option, adding DVSEC possibility, and making it clear that
>> the PASID option is in vIOMMU:
>>
>> "
>> Overall this sounds a feasible path to move forward - starting with
>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>> Devices with hidden registers may fail. Devices with volatile
>> config space due to FW upgrade or cross vendors may fail to migrate.
>> Then evolving it to the file-based scheme, and there is time to discuss
>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>> "
>>
>> Jason, your thoughts?
> 
> This thread is big and I've read it quickly, but I could support the
> above summary.
> 

thanks for the ideas. I think we still need a uapi to report if the device
supports PASID or not. Do we have agreement on where should this uapi be
defined? vfio or iommufd.

Besides, I've a question on how the userspace know the hidden registers
when trying to find a gap for the vPASID cap. It should only know the
standard pci caps.
Tian, Kevin Aug. 14, 2024, 7:38 a.m. UTC | #52
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, August 14, 2024 2:39 PM
> 
> On 2024/8/6 22:20, Jason Gunthorpe wrote:
> > On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> >
> >> Okay. With that I edited my earlier reply a bit by removing the note
> >> of cmdline option, adding DVSEC possibility, and making it clear that
> >> the PASID option is in vIOMMU:
> >>
> >> "
> >> Overall this sounds a feasible path to move forward - starting with
> >> the VMM to find the gap automatically if PASID is opted in vIOMMU.
> >> Devices with hidden registers may fail. Devices with volatile
> >> config space due to FW upgrade or cross vendors may fail to migrate.
> >> Then evolving it to the file-based scheme, and there is time to discuss
> >> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> >> "
> >>
> >> Jason, your thoughts?
> >
> > This thread is big and I've read it quickly, but I could support the
> > above summary.
> >
> 
> thanks for the ideas. I think we still need a uapi to report if the device
> supports PASID or not. Do we have agreement on where should this uapi be
> defined? vfio or iommufd.

IOMMUFD_CMD_GET_HW_INFO. 

> 
> Besides, I've a question on how the userspace know the hidden registers
> when trying to find a gap for the vPASID cap. It should only know the
> standard pci caps.
> 

for the initial implementation VMM doesn't know any hidden registers.
The user passes a new vIOMMU option to the VMM for exposing
the PASID capability in vIOMMU and in device, based on info in
IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
existing caps to put the vPASID cap. If a device with hidden registers
doesn’t work correctly then then a quirk may be added for it.
Yi Liu Aug. 14, 2024, 8:19 a.m. UTC | #53
On 2024/8/14 15:38, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Wednesday, August 14, 2024 2:39 PM
>>
>> On 2024/8/6 22:20, Jason Gunthorpe wrote:
>>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
>>>
>>>> Okay. With that I edited my earlier reply a bit by removing the note
>>>> of cmdline option, adding DVSEC possibility, and making it clear that
>>>> the PASID option is in vIOMMU:
>>>>
>>>> "
>>>> Overall this sounds a feasible path to move forward - starting with
>>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>>>> Devices with hidden registers may fail. Devices with volatile
>>>> config space due to FW upgrade or cross vendors may fail to migrate.
>>>> Then evolving it to the file-based scheme, and there is time to discuss
>>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>>>> "
>>>>
>>>> Jason, your thoughts?
>>>
>>> This thread is big and I've read it quickly, but I could support the
>>> above summary.
>>>
>>
>> thanks for the ideas. I think we still need a uapi to report if the device
>> supports PASID or not. Do we have agreement on where should this uapi be
>> defined? vfio or iommufd.
> 
> IOMMUFD_CMD_GET_HW_INFO.

I see. TBH. The existing GET_HW_INFO is for iommu hw info. Extending it to
report PASID supporting makes it cover device capability now. I may need to
add one more capability enum in struct iommu_hw_info just like the below
one. Perhaps name it as iommufd_device_capabilities. And only set the PASID 
capability in the new device capability when the device's or its PF's pasid 
cap is enabled. Does it look good?


/**
  * enum iommufd_hw_capabilities
  * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
  *                               If available, it means the following APIs
  *                               are supported:
  *
  *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
  *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
  *
  */
enum iommufd_hw_capabilities {
	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
};

>>
>> Besides, I've a question on how the userspace know the hidden registers
>> when trying to find a gap for the vPASID cap. It should only know the
>> standard pci caps.
>>
> 
> for the initial implementation VMM doesn't know any hidden registers.
> The user passes a new vIOMMU option to the VMM for exposing
> the PASID capability in vIOMMU and in device, based on info in
> IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
> existing caps to put the vPASID cap. If a device with hidden registers
> doesn’t work correctly then then a quirk may be added for it.

I see. But we cannot know it until a guest device driver failed. This is
acceptable. right? Perhaps this should be documented somewhere to let the
user/device vendor know before safely exposing vPASID cap on a device. :)
Tian, Kevin Aug. 14, 2024, 8:26 a.m. UTC | #54
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, August 14, 2024 4:19 PM
> 
> On 2024/8/14 15:38, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Wednesday, August 14, 2024 2:39 PM
> >>
> >> On 2024/8/6 22:20, Jason Gunthorpe wrote:
> >>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
> >>>
> >>>> Okay. With that I edited my earlier reply a bit by removing the note
> >>>> of cmdline option, adding DVSEC possibility, and making it clear that
> >>>> the PASID option is in vIOMMU:
> >>>>
> >>>> "
> >>>> Overall this sounds a feasible path to move forward - starting with
> >>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
> >>>> Devices with hidden registers may fail. Devices with volatile
> >>>> config space due to FW upgrade or cross vendors may fail to migrate.
> >>>> Then evolving it to the file-based scheme, and there is time to discuss
> >>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
> >>>> "
> >>>>
> >>>> Jason, your thoughts?
> >>>
> >>> This thread is big and I've read it quickly, but I could support the
> >>> above summary.
> >>>
> >>
> >> thanks for the ideas. I think we still need a uapi to report if the device
> >> supports PASID or not. Do we have agreement on where should this uapi
> be
> >> defined? vfio or iommufd.
> >
> > IOMMUFD_CMD_GET_HW_INFO.
> 
> I see. TBH. The existing GET_HW_INFO is for iommu hw info. Extending it to
> report PASID supporting makes it cover device capability now. I may need to
> add one more capability enum in struct iommu_hw_info just like the below
> one. Perhaps name it as iommufd_device_capabilities. And only set the
> PASID
> capability in the new device capability when the device's or its PF's pasid
> cap is enabled. Does it look good?

Just add a new bit in iommufd_hw_capabilities.

Anyway PASID is meaningful only when both device and IOMMU supports
it. Here we are reporting what IOMMU capabilities meaningful to a given
device so checking both does make sense. 

> 
> 
> /**
>   * enum iommufd_hw_capabilities
>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for
> dirty tracking
>   *                               If available, it means the following APIs
>   *                               are supported:
>   *
>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>   *
>   */
> enum iommufd_hw_capabilities {
> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
> };
> 
> >>
> >> Besides, I've a question on how the userspace know the hidden registers
> >> when trying to find a gap for the vPASID cap. It should only know the
> >> standard pci caps.
> >>
> >
> > for the initial implementation VMM doesn't know any hidden registers.
> > The user passes a new vIOMMU option to the VMM for exposing
> > the PASID capability in vIOMMU and in device, based on info in
> > IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between
> > existing caps to put the vPASID cap. If a device with hidden registers
> > doesn’t work correctly then then a quirk may be added for it.
> 
> I see. But we cannot know it until a guest device driver failed. This is
> acceptable. right? Perhaps this should be documented somewhere to let the
> user/device vendor know before safely exposing vPASID cap on a device. :)
> 

presumably in VMM.
Jason Gunthorpe Aug. 14, 2024, 2:40 p.m. UTC | #55
On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:

> /**
>  * enum iommufd_hw_capabilities
>  * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>  *                               If available, it means the following APIs
>  *                               are supported:
>  *
>  *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>  *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>  *
>  */
> enum iommufd_hw_capabilities {
> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
> };

I think it would be appropriate to add the flag here

Is it OK to rely on the PCI config space PASID enable? I see all the
drivers right now are turning on PASID support during probe if the
iommu supports it.

Jason
Yi Liu Aug. 15, 2024, 2:12 a.m. UTC | #56
On 2024/8/14 22:40, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
> 
>> /**
>>   * enum iommufd_hw_capabilities
>>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>   *                               If available, it means the following APIs
>>   *                               are supported:
>>   *
>>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>   *
>>   */
>> enum iommufd_hw_capabilities {
>> 	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>> };
> 
> I think it would be appropriate to add the flag here

ok.

> Is it OK to rely on the PCI config space PASID enable? I see all the
> drivers right now are turning on PASID support during probe if the
> iommu supports it.

Intel side is not ready yet as it enables pasid when the device is attached
to a non-blocking domain. I've chatted with Baolu, and he will kindly to
enable the pasid cap in the probe_device() op if both iommu and device has
this cap. After that, Intel side should be fine to rely on the PASID enable
bit in the PCI config space.

How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
Vasant Hegde Aug. 16, 2024, 8:29 a.m. UTC | #57
Yi,


On 8/15/2024 7:42 AM, Yi Liu wrote:
> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>
>>> /**
>>>   * enum iommufd_hw_capabilities
>>>   * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>   *                               If available, it means the following APIs
>>>   *                               are supported:
>>>   *
>>>   *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>   *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>   *
>>>   */
>>> enum iommufd_hw_capabilities {
>>>     IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>> };
>>
>> I think it would be appropriate to add the flag here
> 
> ok.
> 
>> Is it OK to rely on the PCI config space PASID enable? I see all the
>> drivers right now are turning on PASID support during probe if the
>> iommu supports it.
> 
> Intel side is not ready yet as it enables pasid when the device is attached
> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
> enable the pasid cap in the probe_device() op if both iommu and device has
> this cap. After that, Intel side should be fine to rely on the PASID enable
> bit in the PCI config space.
> 
> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?

AMD driver currently discovers capability in probe_device() and enabled it in
attach_dev() path.

-Vasant

>
Yi Liu Aug. 16, 2024, 11:52 a.m. UTC | #58
On 2024/8/16 16:29, Vasant Hegde wrote:
> Yi,
> 
> 
> On 8/15/2024 7:42 AM, Yi Liu wrote:
>> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>>
>>>> /**
>>>>    * enum iommufd_hw_capabilities
>>>>    * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>>    *                               If available, it means the following APIs
>>>>    *                               are supported:
>>>>    *
>>>>    *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>>    *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>>    *
>>>>    */
>>>> enum iommufd_hw_capabilities {
>>>>      IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>>> };
>>>
>>> I think it would be appropriate to add the flag here
>>
>> ok.
>>
>>> Is it OK to rely on the PCI config space PASID enable? I see all the
>>> drivers right now are turning on PASID support during probe if the
>>> iommu supports it.
>>
>> Intel side is not ready yet as it enables pasid when the device is attached
>> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
>> enable the pasid cap in the probe_device() op if both iommu and device has
>> this cap. After that, Intel side should be fine to rely on the PASID enable
>> bit in the PCI config space.
>>
>> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
> 
> AMD driver currently discovers capability in probe_device() and enabled it in
> attach_dev() path.

I see. So AMD side also has a gap. Is it easy to make it suit Jason's
suggestion in the above?
Vasant Hegde Aug. 16, 2024, 12:01 p.m. UTC | #59
Yi,


On 8/16/2024 5:22 PM, Yi Liu wrote:
> On 2024/8/16 16:29, Vasant Hegde wrote:
>> Yi,
>>
>>
>> On 8/15/2024 7:42 AM, Yi Liu wrote:
>>> On 2024/8/14 22:40, Jason Gunthorpe wrote:
>>>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote:
>>>>
>>>>> /**
>>>>>    * enum iommufd_hw_capabilities
>>>>>    * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
>>>>>    *                               If available, it means the following APIs
>>>>>    *                               are supported:
>>>>>    *
>>>>>    *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
>>>>>    *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
>>>>>    *
>>>>>    */
>>>>> enum iommufd_hw_capabilities {
>>>>>      IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
>>>>> };
>>>>
>>>> I think it would be appropriate to add the flag here
>>>
>>> ok.
>>>
>>>> Is it OK to rely on the PCI config space PASID enable? I see all the
>>>> drivers right now are turning on PASID support during probe if the
>>>> iommu supports it.
>>>
>>> Intel side is not ready yet as it enables pasid when the device is attached
>>> to a non-blocking domain. I've chatted with Baolu, and he will kindly to
>>> enable the pasid cap in the probe_device() op if both iommu and device has
>>> this cap. After that, Intel side should be fine to rely on the PASID enable
>>> bit in the PCI config space.
>>>
>>> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
>>
>> AMD driver currently discovers capability in probe_device() and enabled it in
>> attach_dev() path.
> 
> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
> suggestion in the above?

We can do that. We can enable ATS, PRI and PASID capability during probe time
and keep it enabled always.

-Vasant
Jason Gunthorpe Aug. 16, 2024, 12:52 p.m. UTC | #60
On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
> > I see. So AMD side also has a gap. Is it easy to make it suit Jason's
> > suggestion in the above?
> 
> We can do that. We can enable ATS, PRI and PASID capability during probe time
> and keep it enabled always.

I don't see a downside to enabling PASID at probe time, it exists to
handshake with the device if the root complex is able to understand
PASID TLPs.

SMMU3 calls 
 arm_smmu_probe_device
  arm_smmu_enable_pasid
   pci_enable_pasid

So it looks Ok

Jason
Baolu Lu Aug. 16, 2024, 1:14 p.m. UTC | #61
On 2024/8/16 20:52, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
>>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
>>> suggestion in the above?
>> We can do that. We can enable ATS, PRI and PASID capability during probe time
>> and keep it enabled always.
> I don't see a downside to enabling PASID at probe time, it exists to
> handshake with the device if the root complex is able to understand
> PASID TLPs.
> 
> SMMU3 calls
>   arm_smmu_probe_device
>    arm_smmu_enable_pasid
>     pci_enable_pasid
> 
> So it looks Ok

I made a patch for the Intel driver.

https://lore.kernel.org/linux-iommu/20240816104945.97160-1-baolu.lu@linux.intel.com/

Thanks,
baolu
Vasant Hegde Aug. 19, 2024, 8:21 a.m. UTC | #62
Hi Jason, Yi,


On 8/16/2024 6:22 PM, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
>>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
>>> suggestion in the above?
>>
>> We can do that. We can enable ATS, PRI and PASID capability during probe time
>> and keep it enabled always.
> 
> I don't see a downside to enabling PASID at probe time, it exists to
> handshake with the device if the root complex is able to understand
> PASID TLPs.

Ack. I will put it on my list. Will make change to AMD driver soon.

-Vasant
Yi Liu Sept. 9, 2024, 12:59 p.m. UTC | #63
On 2024/8/14 15:38, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Wednesday, August 14, 2024 2:39 PM
>>
>> On 2024/8/6 22:20, Jason Gunthorpe wrote:
>>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
>>>
>>>> Okay. With that I edited my earlier reply a bit by removing the note
>>>> of cmdline option, adding DVSEC possibility, and making it clear that
>>>> the PASID option is in vIOMMU:
>>>>
>>>> "
>>>> Overall this sounds a feasible path to move forward - starting with
>>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>>>> Devices with hidden registers may fail. Devices with volatile
>>>> config space due to FW upgrade or cross vendors may fail to migrate.
>>>> Then evolving it to the file-based scheme, and there is time to discuss
>>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>>>> "
>>>>
>>>> Jason, your thoughts?
>>>
>>> This thread is big and I've read it quickly, but I could support the
>>> above summary.
>>>
>>
>> thanks for the ideas. I think we still need a uapi to report if the device
>> supports PASID or not. Do we have agreement on where should this uapi be
>> defined? vfio or iommufd.
> 
> IOMMUFD_CMD_GET_HW_INFO.

Hi Kevin, Jason,

In order to synthesize the vPASID cap, the VMM should get to know the
capabilities like Privilege mode, Execute permission from the physical
device's config space. We have two choices as well. vfio or iommufd.

It appears to be better reporting the capabilities via vfio uapi (e.g.
VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
PASID capabilities to userspace. Please let me know your preference. :)
Jason Gunthorpe Sept. 9, 2024, 1:04 p.m. UTC | #64
On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:

> In order to synthesize the vPASID cap, the VMM should get to know the
> capabilities like Privilege mode, Execute permission from the physical
> device's config space. We have two choices as well. vfio or iommufd.
> 
> It appears to be better reporting the capabilities via vfio uapi (e.g.
> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> PASID capabilities to userspace. Please let me know your preference. :)

I don't think you'd need a new data_uptr, that doesn't quite make
sense

What struct data do you imagine needing?

Jason
Yi Liu Sept. 9, 2024, 1:29 p.m. UTC | #65
On 2024/9/9 21:04, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> 
>> In order to synthesize the vPASID cap, the VMM should get to know the
>> capabilities like Privilege mode, Execute permission from the physical
>> device's config space. We have two choices as well. vfio or iommufd.
>>
>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>> PASID capabilities to userspace. Please let me know your preference. :)
> 
> I don't think you'd need a new data_uptr, that doesn't quite make
> sense
> 
> What struct data do you imagine needing?

something like below.

struct iommufd_hw_info_pasid {
        __u16 capabilities;
#define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
#define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
        __u8 width;
        __u8 __reserved;
};
Jason Gunthorpe Sept. 9, 2024, 1:40 p.m. UTC | #66
On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
> On 2024/9/9 21:04, Jason Gunthorpe wrote:
> > On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> > 
> > > In order to synthesize the vPASID cap, the VMM should get to know the
> > > capabilities like Privilege mode, Execute permission from the physical
> > > device's config space. We have two choices as well. vfio or iommufd.
> > > 
> > > It appears to be better reporting the capabilities via vfio uapi (e.g.
> > > VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> > > add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> > > PASID capabilities to userspace. Please let me know your preference. :)
> > 
> > I don't think you'd need a new data_uptr, that doesn't quite make
> > sense
> > 
> > What struct data do you imagine needing?
> 
> something like below.
> 
> struct iommufd_hw_info_pasid {
>        __u16 capabilities;
> #define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
> #define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
>        __u8 width;
>        __u8 __reserved;
> };

I think you could just stick that in the top level GET_HW_INFO struct
if you want.

It does make a sense that an iommufd user would need to know that
information, especially width (but call it something better,
max_pasid_log2 or something) to successefully use the iommfd PASID
APIs anyhow.

Jason
Jason Gunthorpe Sept. 9, 2024, 1:59 p.m. UTC | #67
On Mon, Sep 09, 2024 at 10:02:02PM +0800, Yi Liu wrote:

> However, I have one scalability concern. Do we only have PASID and PRI that
> needs to be synthesized by userspace? For PRI, kernel would need to report
> Outstanding Page Request Capacity to userspace. If only PASID and PRI cap,
> it's fine. But if there are more such caps, then the struct iommu_hw_info
> may grow bigger and bigger.

I don't see a particular issue with this though?

As long as it is well defined and it is clear to userspace when what
parts are valid or not it is fine to get pretty big.

Jason
Yi Liu Sept. 9, 2024, 2:02 p.m. UTC | #68
On 2024/9/9 21:40, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
>> On 2024/9/9 21:04, Jason Gunthorpe wrote:
>>> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
>>>
>>>> In order to synthesize the vPASID cap, the VMM should get to know the
>>>> capabilities like Privilege mode, Execute permission from the physical
>>>> device's config space. We have two choices as well. vfio or iommufd.
>>>>
>>>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>>>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>>>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>>>> PASID capabilities to userspace. Please let me know your preference. :)
>>>
>>> I don't think you'd need a new data_uptr, that doesn't quite make
>>> sense
>>>
>>> What struct data do you imagine needing?
>>
>> something like below.
>>
>> struct iommufd_hw_info_pasid {
>>         __u16 capabilities;
>> #define IOMMUFD_PASID_CAP_EXEC     (1 << 0)
>> #define IOMMUFD_PASID_CAP_PRIV     (1 << 1)
>>         __u8 width;
>>         __u8 __reserved;
>> };
> 
> I think you could just stick that in the top level GET_HW_INFO struct
> if you want.
> 
> It does make a sense that an iommufd user would need to know that
> information, especially width (but call it something better,
> max_pasid_log2 or something) to successefully use the iommfd PASID
> APIs anyhow.

I see, we may define the IOMMUFD_PASID_CAP_EXEC and IOMMUFD_PASID_CAP_PRIV
in the struct iommu_hw_info::out_capabilities, and add a max_pasid_log2 to
the struct iommu_hw_info.

However, I have one scalability concern. Do we only have PASID and PRI that
needs to be synthesized by userspace? For PRI, kernel would need to report
Outstanding Page Request Capacity to userspace. If only PASID and PRI cap,
it's fine. But if there are more such caps, then the struct iommu_hw_info
may grow bigger and bigger.