Message ID | 20240412082121.33382-1-yi.l.liu@intel.com (mailing list archive)
Series | vfio-pci support pasid attach/detach
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
>
> A userspace VMM is supposed to get the details of the device's PASID
> capability and assemble a virtual PASID capability at a proper offset
> in the virtual PCI configuration space. It is still an open question
> how to get the available offsets: devices may have hidden bits that
> are not in the PCI capability chain. For now, there are two options to
> get the available offsets. [2]
>
> - Report the available offsets via ioctl. This requires device-specific
>   logic to provide available offsets, e.g. a vfio-pci variant driver.
>   Or the device may report the available offset via a DVSEC capability.
> - Store the available offsets in a static table in the userspace VMM.
>   The VMM gets the empty offsets from this table.

I'm not a fan of requiring a variant driver for every PASID-capable VF
just for the purpose of reporting a free range in the PCI config space.

It's easier to do that quirk in userspace.

But I like Alex's original comment that at least for the PF there is no
reason to hide the offset. There could be a flag+field to communicate
it. Or, if there will be a new variant VF driver for other purposes,
e.g. migration, it can certainly fill the field too.
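The virtual capability the cover letter describes is small: per the PCIe
spec, the PASID extended capability (ID 0x001B) is 8 bytes, a header
DWORD (cap ID in bits 15:0, version in 19:16, next-cap offset in 31:20)
followed by a 16-bit capability register and a 16-bit control register.
A minimal sketch of how a VMM might assemble one at a chosen free offset
and link it into the chain (helper names are hypothetical, not QEMU's
actual code):

```python
import struct

PASID_CAP_ID = 0x001B          # PCIe PASID Extended Capability ID
PASID_CAP_SIZE = 8             # header DWORD + capability/control registers

def ext_cap_header(cap_id, version, next_off):
    """PCIe extended capability header: ID [15:0], version [19:16], next [31:20]."""
    return (next_off << 20) | (version << 16) | cap_id

def insert_virtual_pasid_cap(vconfig, offset, prev_hdr_off, max_pasid_width):
    """Assemble a virtual PASID capability at 'offset' in the virtual config
    space (a bytearray) and link it into the extended capability chain by
    rewriting the previous capability's next-offset field."""
    # Capability register: Max PASID Width in bits [12:8] (Exec/Priv unset).
    cap_reg = (max_pasid_width & 0x1F) << 8
    ctrl_reg = 0                                  # PASID Enable starts clear
    hdr = ext_cap_header(PASID_CAP_ID, 1, 0)      # new tail of the chain
    struct.pack_into("<IHH", vconfig, offset, hdr, cap_reg, ctrl_reg)

    # Re-link: previous header keeps its ID/version, next now points at us.
    prev = struct.unpack_from("<I", vconfig, prev_hdr_off)[0]
    struct.pack_into("<I", vconfig, prev_hdr_off,
                     (prev & 0x000FFFFF) | (offset << 20))

# Example: 4 KB config space with one capability at 0x100 ending the chain.
vconfig = bytearray(4096)
struct.pack_into("<I", vconfig, 0x100, ext_cap_header(0x0010, 1, 0))  # SR-IOV
insert_virtual_pasid_cap(vconfig, 0x200, 0x100, max_pasid_width=20)
```

The hard part, as the thread discusses, is not the assembly but knowing
that 0x200 is actually free on a given device.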
On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, April 12, 2024 4:21 PM
> >
> > [...]
>
> I'm not a fan of requiring a variant driver for every PASID-capable
> VF just for the purpose of reporting a free range in the PCI config
> space.
>
> It's easier to do that quirk in userspace.
>
> But I like Alex's original comment that at least for the PF there is
> no reason to hide the offset. There could be a flag+field to
> communicate it. Or, if there will be a new variant VF driver for
> other purposes, e.g. migration, it can certainly fill the field too.

Yes, since this has been such a sticking point can we get a clean
series that just enables it for PF and then come with a solution for
VF?

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 17, 2024 1:50 AM
>
> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > [...]
> >
> > But I like Alex's original comment that at least for the PF there is
> > no reason to hide the offset. There could be a flag+field to
> > communicate it. Or, if there will be a new variant VF driver for
> > other purposes, e.g. migration, it can certainly fill the field too.
>
> Yes, since this has been such a sticking point can we get a clean
> series that just enables it for PF and then come with a solution for
> VF?

Sure, but we at least need to reach consensus on a minimal required
uapi covering both PF/VF to move forward, so the user doesn't need to
touch different contracts for PF vs. VF.
On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 17, 2024 1:50 AM
> >
> > [...]
> >
> > Yes, since this has been such a sticking point can we get a clean
> > series that just enables it for PF and then come with a solution
> > for VF?
>
> Sure, but we at least need to reach consensus on a minimal required
> uapi covering both PF/VF to move forward, so the user doesn't need
> to touch different contracts for PF vs. VF.

Do we? The situation where the VMM needs to wholly make up a PASID
capability seems completely new and separate from just using an
existing PASID capability as in the PF case.

If it needs to make another system call or otherwise to do that then
that seems fine to do incrementally?

Jason
On Wed, 17 Apr 2024 09:20:51 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > [...]
> >
> > sure but we at least need to reach consensus on a minimal required
> > uapi covering both PF/VF to move forward so the user doesn't need
> > to touch different contracts for PF vs. VF.
>
> Do we? The situation where the VMM needs to wholly make up a PASID
> capability seems completely new and separate from just using an
> existing PASID capability as in the PF case.

But we don't actually expose the PASID capability on the PF and, as
argued in patch 4, we can't because it would break existing userspace.

> If it needs to make another system call or otherwise to do that then
> that seems fine to do incrementally?

With PASID currently hidden, VF and PF support really seem too similar
not to handle them both at the same time. What's missing is a mechanism
to describe unused config space where userspace can implement an
emulated PASID capability.

Defining a DVSEC capability to handle this seemed to be the preferred
solution from the LPC discussion. PF and VF devices that implement a
TBD DVSEC capability could avoid needing a variant driver. Until then
we need to be careful not to undermine a future world with that
preferred solution by introducing side-band mechanisms which only work
for variant drivers and PF devices. Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 7:02 AM
>
> On Wed, 17 Apr 2024 09:20:51 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > [...]
> >
> > Do we? The situation where the VMM needs to wholly make up a PASID
> > capability seems completely new and separate from just using an
> > existing PASID capability as in the PF case.
>
> But we don't actually expose the PASID capability on the PF and, as
> argued in patch 4, we can't because it would break existing userspace.

Coming back to this statement: does 'break' mean that legacy QEMU will
crash due to a guest write to the read-only PASID capability, or just a
conceptually functional break, i.e. non-faithful emulation due to
writes being dropped?

If the latter, it's probably not a bad idea to allow exposing the PASID
capability on the PF, as a sane guest shouldn't enable the PASID
capability without seeing a vIOMMU supporting PASID. And there is no
status bit defined in the PASID capability to check back, so even if an
insane guest wants to blindly enable PASID it will naturally write and
be done. The only niche case is that the enable bits are defined as RW,
so ideally reading back those bits should get the latest written value.
But probably this can be tolerated?

With that, should we then consider exposing the PASID capability in PCI
config space as the first option? For the PF it's as simple as how
other caps are exposed. For a VF, a variant driver can also fake the
PASID capability or emulate a DVSEC capability for unused space (to
motivate the physical implementation so no variant driver is required
in the future).

This allows a staging approach as Jason envisioned.
On 2024/4/18 08:06, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 7:02 AM
>>
>> [...]
>>
>> But we don't actually expose the PASID capability on the PF and, as
>> argued in patch 4, we can't because it would break existing userspace.
>
> [...]
>
> With that, should we then consider exposing the PASID capability in
> PCI config space as the first option? For the PF it's as simple as how
> other caps are exposed. For a VF, a variant driver can also fake the
> PASID capability or emulate a DVSEC capability for unused space (to
> motivate the physical implementation so no variant driver is required
> in the future).

If the kernel exposes the PASID cap for the PF the same as other caps,
and in the meantime the variant driver chooses to emulate a DVSEC cap,
then userspace follows the below steps to expose a PASID cap to the VM:

1) Check if a PASID cap is already present in the virtual config space
   read from the kernel. If no, but the user wants PASID, then go to
   step 2).
2) Userspace invokes VFIO_DEVICE_FEATURE to check if the device
   supports the PASID cap. If yes, go to step 3).
3) Userspace gets an available offset via reading the DVSEC cap.
4) Userspace assembles a PASID cap and inserts it into the vconfig
   space.

For the PF, step 1) is enough. For a VF, it needs to go through all
four steps. This is a bit different from what we planned at the
beginning, but sounds doable if we want to pursue the staging direction.
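The four steps could look roughly like the following in VMM code, with
the kernel queries stubbed out: the VFIO_DEVICE_FEATURE check and the
DVSEC read are represented by plain parameters, and the chain
re-linking from the earlier sketch is omitted. A sketch of the flow,
not QEMU code:

```python
PCI_EXT_CAP_ID_PASID = 0x001B

def find_ext_cap(vconfig, cap_id):
    """Walk the extended capability chain starting at 0x100; return the
    offset of cap_id, or 0 if it is absent."""
    off = 0x100
    while off:
        hdr = int.from_bytes(vconfig[off:off + 4], "little")
        if hdr == 0:
            return 0
        if hdr & 0xFFFF == cap_id:
            return off
        off = hdr >> 20
    return 0

def setup_vpasid(vconfig, dev_supports_pasid, dvsec_free_offset):
    """Steps 1-4 from the message above. 'dev_supports_pasid' stands in
    for the VFIO_DEVICE_FEATURE query and 'dvsec_free_offset' for the
    offset read from the DVSEC capability."""
    off = find_ext_cap(vconfig, PCI_EXT_CAP_ID_PASID)
    if off:                      # 1) PF case: cap already exposed by kernel
        return off
    if not dev_supports_pasid:   # 2) feature check failed: no PASID for VM
        return 0
    off = dvsec_free_offset      # 3) free offset reported via DVSEC
    # 4) assemble a minimal PASID cap (header only; re-linking omitted)
    vconfig[off:off + 4] = (PCI_EXT_CAP_ID_PASID | (1 << 16)).to_bytes(4, "little")
    return off
```

For the PF case the function returns at step 1; the VF case runs all
four steps.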
On Thu, 18 Apr 2024 17:03:15 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/18 08:06, Tian, Kevin wrote:
> > [...]
> >
> > Does 'break' mean that legacy QEMU will crash due to a guest write
> > to the read-only PASID capability, or just a conceptually functional
> > break, i.e. non-faithful emulation due to writes being dropped?

I expect more the latter.

> > If the latter, it's probably not a bad idea to allow exposing the
> > PASID capability on the PF, as a sane guest shouldn't enable the
> > PASID capability without seeing a vIOMMU supporting PASID. [...]
> > The only niche case is that the enable bits are defined as RW, so
> > ideally reading back those bits should get the latest written value.
> > But probably this can be tolerated?

Some degree of inconsistency is likely tolerated; the guest is unlikely
to check that a RW bit was set or cleared. How would we virtualize the
control registers for a VF, and are they similarly virtualized for a
PF, or would we allow the guest to manipulate the physical PASID
control registers?

> > With that, should we then consider exposing the PASID capability in
> > PCI config space as the first option? For the PF it's as simple as
> > how other caps are exposed. For a VF, a variant driver can also fake
> > the PASID capability or emulate a DVSEC capability for unused space
> > (to motivate the physical implementation so no variant driver is
> > required in the future).
>
> If the kernel exposes the PASID cap for the PF the same as other caps,
> and in the meantime the variant driver chooses to emulate a DVSEC cap,
> then userspace follows the below steps to expose a PASID cap to the VM.

If we have a variant driver, why wouldn't it expose an emulated PASID
capability rather than a DVSEC, if we're choosing to expose PASID for
PFs?

> 1) Check if a PASID cap is already present in the virtual config space
>    read from the kernel. If no, but the user wants PASID, then go to
>    step 2).
> 2) Userspace invokes VFIO_DEVICE_FEATURE to check if the device
>    supports the PASID cap. If yes, go to step 3).

Why do we need the vfio feature interface if a physical or virtual
PASID capability on the device exposes the same info?

> 3) Userspace gets an available offset via reading the DVSEC cap.

What's the scenario where we'd have a VF wanting to expose PASID
support which doesn't also have a variant driver that could implement a
virtual PASID?

> 4) Userspace assembles a PASID cap and inserts it into the vconfig
>    space.
>
> For the PF, step 1) is enough. For a VF, it needs to go through all
> four steps. This is a bit different from what we planned at the
> beginning, but sounds doable if we want to pursue the staging
> direction.

Seems like if we decide that we can just expose the PASID capability
for a PF then we should just have any VF variant drivers also implement
a virtual PASID capability. In this case DVSEC would only be used to
provide information for a purely userspace emulation of PASID (in which
case it also wouldn't necessarily need the vfio feature because it
might implicitly know the PASID capabilities of the device). Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 19, 2024 4:38 AM
>
> [...]
>
> Some degree of inconsistency is likely tolerated; the guest is
> unlikely to check that a RW bit was set or cleared. How would we
> virtualize the control registers for a VF, and are they similarly
> virtualized for a PF, or would we allow the guest to manipulate the
> physical PASID control registers?

It's shared, so the guest shouldn't be allowed to touch the physical
register.

Even for the PF this is virtualized, as the physical control is toggled
by the iommu driver today. We discussed before whether there is value
in moving the control to the device driver, but the conclusion was no.

> > 4) Userspace assembles a PASID cap and inserts it into the vconfig
> >    space.
> >
> > For the PF, step 1) is enough. For a VF, it needs to go through all
> > four steps. This is a bit different from what we planned at the
> > beginning, but sounds doable if we want to pursue the staging
> > direction.
>
> Seems like if we decide that we can just expose the PASID capability
> for a PF then we should just have any VF variant drivers also
> implement a virtual PASID capability.

I'm leaning toward this direction now.

> In this case DVSEC would only be used to provide information for a
> purely userspace emulation of PASID (in which case it also wouldn't
> necessarily need the vfio feature because it might implicitly know
> the PASID capabilities of the device). Thanks,

That's a good point. Then no new contract is required.

And allowing the variant driver to implement a virtual PASID capability
also seems to make room for a shared variant driver hosting a table of
virtual capabilities (both offset and content) for VFs, just as patch 4
discussed having a shared driver host a table for DVSEC?
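The "table of virtual capabilities" in a shared variant driver (or the
VMM-side text file suggested later in the thread) amounts to a simple
lookup keyed by device identity. A sketch with made-up IDs, none of
which correspond to real hardware:

```python
# Hypothetical quirk table: (vendor, device) -> free config offset where
# a virtual PASID capability may be synthesized for that VF.
VPASID_OFFSET_TABLE = {
    (0x8086, 0x0000): 0x200,   # illustration only, not real device IDs
    (0x15b3, 0x0000): 0x300,
}

def vpasid_offset(vendor, device):
    """Return the known-free config offset for this VF, or None if the
    device isn't in the table (no virtual PASID can be placed)."""
    return VPASID_OFFSET_TABLE.get((vendor, device))
```

Whether such a table belongs in a shared kernel driver or a VMM text
file is exactly the open question in the following messages.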
On Wed, Apr 17, 2024 at 05:02:16PM -0600, Alex Williamson wrote:
> > Do we? The situation where the VMM needs to wholly make up a PASID
> > capability seems completely new and separate from just using an
> > existing PASID capability as in the PF case.
>
> But we don't actually expose the PASID capability on the PF and, as
> argued in patch 4, we can't because it would break existing userspace.

Are we sure about that argument? Exposing the PF's PASID cap to qemu
shouldn't break qemu at all - it will just pass it over into a vPCI
function?

Ultimately the guest will observe a vPCI device with a PASID cap. This
is not really so different from plugging a new PCI device with the
PASID cap into an old HW system that doesn't support PASID.

So does a guest break? I'd expect no - the viommus created by qemu
should not advertise PASID support already, so the guest can't use
PASID - just like an old HW system.

Is there a known concrete thing that breaks there?

> > If it needs to make another system call or otherwise to do that
> > then that seems fine to do incrementally?
>
> With PASID currently hidden, VF and PF support really seem too
> similar not to handle them both at the same time. What's missing is a
> mechanism to describe unused config space where userspace can
> implement an emulated PASID capability.

Yes, sure, the unused config space is a great idea. I thought we were
talking about doing the fixup in userspace, but a kernel solution is
good too.

But also, if we are set on the DVSEC, in kernel or user, then we can
move ahead with the core PASID enablement immediately?

Jason
On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:
> Some degree of inconsistency is likely tolerated; the guest is
> unlikely to check that a RW bit was set or cleared. How would we
> virtualize the control registers for a VF, and are they similarly
> virtualized for a PF, or would we allow the guest to manipulate the
> physical PASID control registers?

No, the OS owns the physical PASID control. If the platform IOMMU knows
how to parse PASID then PASID support is turned on and left on at boot
time.

There should be no guest-visible difference to not supporting global
PASID disable, and we can't even implement it for VFs anyhow. Same sort
of argument for ATS/etc.

> > If the kernel exposes the PASID cap for the PF the same as other
> > caps, and in the meantime the variant driver chooses to emulate a
> > DVSEC cap, then userspace follows the below steps to expose a PASID
> > cap to the VM.
>
> If we have a variant driver, why wouldn't it expose an emulated PASID
> capability rather than a DVSEC if we're choosing to expose PASID for
> PFs?

Indeed, also an option. Supplying the DVSEC is probably simpler and
addresses other synthesized capability blocks in the future. The VMM is
a better place to build various synthetic blocks in general, IMHO. New
VMMs could parse the PF PASID cap and add it to their list of "free
space".

We may also be overdoing it here. Maybe if the VMM wants to enable
PASID we should flip the logic: the VMM should assume that unused
config space is safe to use, and only devices that violate that rule
need to join an ID list and provide a DVSEC/free-space list/etc.

I'm guessing that list will be pretty small and hopefully will not
grow. It is easy and better for future devices to wrap their hidden
registers in a private DVSEC.

Then I'd suggest just writing the special list in a text file and
leaving it on the VMM side. Users can adjust the text file right away
if they have old and troublesome devices, and all VMMs can share it.

> > 1) Check if a PASID cap is already present in the virtual config
> >    space read from the kernel. If no, but the user wants PASID,
> >    then go to step 2).
> > 2) Userspace invokes VFIO_DEVICE_FEATURE to check if the device
> >    supports the PASID cap. If yes, go to step 3).
>
> Why do we need the vfio feature interface if a physical or virtual
> PASID capability on the device exposes the same info?

We still need to check that the platform, OS, iommu, etc. are all OK
with enabling PASID support before the viommu advertises it.

Jason
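The flipped logic above (assume unused extended config space is free
unless the device is on a quirk list) reduces, on the VMM side, to
walking the extended capability chain and taking the gaps. A sketch,
assuming the caller supplies each capability's size (sizes are
per-capability-ID and not discoverable from the header), and noting
that hidden registers outside the chain would not be detected, which is
exactly what the deny-list is for:

```python
def free_ext_config_ranges(vconfig, cap_sizes):
    """Walk the extended capability chain and return [start, end) ranges
    of extended config space (0x100-0xFFF) not covered by any capability.
    cap_sizes maps capability ID -> size in bytes; unknown IDs are
    conservatively counted as one header DWORD."""
    used = []
    off = 0x100
    while off:
        hdr = int.from_bytes(vconfig[off:off + 4], "little")
        if hdr == 0:
            break
        used.append((off, off + cap_sizes.get(hdr & 0xFFFF, 4)))
        off = hdr >> 20
    used.sort()
    free, cursor = [], 0x100
    for start, end in used:
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < 0x1000:
        free.append((cursor, 0x1000))
    return free
```

A VMM could pick any returned range large enough for the 8-byte PASID
capability, falling back to the quirk list only for devices known to
hide registers.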
On Fri, 19 Apr 2024 05:52:01 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Friday, April 19, 2024 4:38 AM > > > > On Thu, 18 Apr 2024 17:03:15 +0800 > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > On 2024/4/18 08:06, Tian, Kevin wrote: > > > >> From: Alex Williamson <alex.williamson@redhat.com> > > > >> Sent: Thursday, April 18, 2024 7:02 AM > > > >> > > > >> But we don't actually expose the PASID capability on the PF and as > > > >> argued in path 4/ we can't because it would break existing userspace. > > > > > Come back to this statement. > > > > > > > > Does 'break' means that legacy Qemu will crash due to a guest write > > > > to the read-only PASID capability, or just a conceptually functional > > > > break i.e. non-faithful emulation due to writes being dropped? > > > > I expect more the latter. > > > > > > If the latter it's probably not a bad idea to allow exposing the PASID > > > > capability on the PF as a sane guest shouldn't enable the PASID > > > > capability w/o seeing vIOMMU supporting PASID. And there is no > > > > status bit defined in the PASID capability to check back so even > > > > if an insane guest wants to blindly enable PASID it will naturally > > > > write and done. The only niche case is that the enable bits are > > > > defined as RW so ideally reading back those bits should get the > > > > latest written value. But probably this can be tolerated? > > > > Some degree of inconsistency is likely tolerated, the guest is unlikely > > to check that a RW bit was set or cleared. How would we virtualize the > > control registers for a VF and are they similarly virtualized for a PF > > or would we allow the guest to manipulate the physical PASID control > > registers? > > it's shared so the guest shouldn't be allowed to touch the physical > register. > > Even for PF this is virtualized as the physical control is toggled by > the iommu driver today. 
We discussed before whether there is a > value moving the control to device driver but the conclusion is no. So in both cases we virtualize the PASID bits in the vfio variant driver in order to maintain spec compliant behavior of the register (ie. the control bits are RW with no underlying hardware effect and capability bits only reflect the features enabled by the host in the control register)? > > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space. > > > > > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps. > > > This is a bit different from what we planned at the beginning. But sounds > > > doable if we want to pursue the staging direction. > > > > Seems like if we decide that we can just expose the PASID capability > > for a PF then we should just have any VF variant drivers also implement > > a virtual PASID capability. In this case DVSEC would only be used to > > I'm leaning toward this direction now. > > > provide information for a purely userspace emulation of PASID (in which > > case it also wouldn't necessarily need the vfio feature because it > > might implicitly know the PASID capabilities of the device). Thanks, > > > > that's a good point. Then no new contract is required. > > and allowing variant driver to implement a virtual PASID capability > seems also make a room for making a shared variant driver to host > a table of virtual capabilities (both offset and content) for VFs, just > as discussed in patch4 having a shared driver to host a table for DVSEC? Yes, vfio-pci-core would support virtualizing the PF PASID capability mapped 1:1 at the physical PASID location. We should architect that support to be easily reused for a driver provided offset for the VF use case and then we'd need to decide if a lookup table to associate an offset to a VF vendor:device warrants a variant driver (which could be shared by multiple devices) or if we'd accept that into vfio-pci-core. 
> Along this route probably most vendors will just extend the table in > the shared driver, leading to decreased value on DVSEC and question > on its necessity... > > then it's back to the quirk-in-kernel approach... but if simple enough > probably not a bad idea to pursue?
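The emulation scheme discussed above (control bits RW with no underlying hardware effect, physical control owned by the host iommu driver) can be sketched roughly as follows. All struct and function names here are illustrative, not actual vfio-pci-core code.

```c
#include <assert.h>
#include <stdint.h>

/* PASID Control Register bits (PCIe PASID Extended Capability) */
#define PASID_CTRL_ENABLE   (1 << 0)
#define PASID_CTRL_EXEC_EN  (1 << 1)
#define PASID_CTRL_PRIV_EN  (1 << 2)
#define PASID_CTRL_RW_MASK  (PASID_CTRL_ENABLE | PASID_CTRL_EXEC_EN | PASID_CTRL_PRIV_EN)

/* Hypothetical per-device emulation state: the guest-visible registers. */
struct vpasid_state {
	uint16_t vctrl;	/* virtual PASID Control Register, latched on write */
	uint16_t vcap;	/* virtual PASID Capability Register, read-only,
			 * reflects only features the host enabled */
};

/* Guest write: latch the RW bits so later reads see them, but never
 * touch the physical register - the host iommu driver owns that. */
static void vpasid_ctrl_write(struct vpasid_state *s, uint16_t val)
{
	s->vctrl = val & PASID_CTRL_RW_MASK;
}

/* Guest read: return the latched value, keeping the register
 * spec-compliant (RW bits read back as last written). */
static uint16_t vpasid_ctrl_read(const struct vpasid_state *s)
{
	return s->vctrl;
}
```

Reads after writes stay consistent from the guest's point of view while the physical PASID enable remains whatever the host iommu driver set at boot.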
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, April 20, 2024 12:36 AM > > On Fri, 19 Apr 2024 05:52:01 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Friday, April 19, 2024 4:38 AM > > > > > > On Thu, 18 Apr 2024 17:03:15 +0800 > > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > > > On 2024/4/18 08:06, Tian, Kevin wrote: > > > > >> From: Alex Williamson <alex.williamson@redhat.com> > > > > >> Sent: Thursday, April 18, 2024 7:02 AM > > > > >> > > > > >> But we don't actually expose the PASID capability on the PF and as > > > > >> argued in path 4/ we can't because it would break existing userspace. > > > > > > Come back to this statement. > > > > > > > > > > Does 'break' means that legacy Qemu will crash due to a guest write > > > > > to the read-only PASID capability, or just a conceptually functional > > > > > break i.e. non-faithful emulation due to writes being dropped? > > > > > > I expect more the latter. > > > > > > > > If the latter it's probably not a bad idea to allow exposing the PASID > > > > > capability on the PF as a sane guest shouldn't enable the PASID > > > > > capability w/o seeing vIOMMU supporting PASID. And there is no > > > > > status bit defined in the PASID capability to check back so even > > > > > if an insane guest wants to blindly enable PASID it will naturally > > > > > write and done. The only niche case is that the enable bits are > > > > > defined as RW so ideally reading back those bits should get the > > > > > latest written value. But probably this can be tolerated? > > > > > > Some degree of inconsistency is likely tolerated, the guest is unlikely > > > to check that a RW bit was set or cleared. How would we virtualize the > > > control registers for a VF and are they similarly virtualized for a PF > > > or would we allow the guest to manipulate the physical PASID control > > > registers? 
> > > > it's shared so the guest shouldn't be allowed to touch the physical > > register. > > > > Even for PF this is virtualized as the physical control is toggled by > > the iommu driver today. We discussed before whether there is a > > value moving the control to device driver but the conclusion is no. > > So in both cases we virtualize the PASID bits in the vfio variant > driver in order to maintain spec compliant behavior of the register > (ie. the control bits are RW with no underlying hardware effect and > capability bits only reflect the features enabled by the host in the > control register)? yes > > > > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space. > > > > > > > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps. > > > > This is a bit different from what we planned at the beginning. But > sounds > > > > doable if we want to pursue the staging direction. > > > > > > Seems like if we decide that we can just expose the PASID capability > > > for a PF then we should just have any VF variant drivers also implement > > > a virtual PASID capability. In this case DVSEC would only be used to > > > > I'm leaning toward this direction now. > > > > > provide information for a purely userspace emulation of PASID (in which > > > case it also wouldn't necessarily need the vfio feature because it > > > might implicitly know the PASID capabilities of the device). Thanks, > > > > > > > that's a good point. Then no new contract is required. > > > > and allowing variant driver to implement a virtual PASID capability > > seems also make a room for making a shared variant driver to host > > a table of virtual capabilities (both offset and content) for VFs, just > > as discussed in patch4 having a shared driver to host a table for DVSEC? > > Yes, vfio-pci-core would support virtualizing the PF PASID capability > mapped 1:1 at the physical PASID location. 
We should architect that > support to be easily reused for a driver provided offset for the VF use > case and then we'd need to decide if a lookup table to associate an > offset to a VF vendor:device warrants a variant driver (which could be > shared by multiple devices) or if we'd accept that into vfio-pci-core. yes > > > Along this route probably most vendors will just extend the table in > > the shared driver, leading to decreased value on DVSEC and question > > on its necessity... > > > > then it's back to the quirk-in-kernel approach... but if simple enough > > probably not a bad idea to pursue?
On 2024/4/19 21:59, Jason Gunthorpe wrote: > On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote: > >> Some degree of inconsistency is likely tolerated, the guest is unlikely >> to check that a RW bit was set or cleared. How would we virtualize the >> control registers for a VF and are they similarly virtualized for a PF >> or would we allow the guest to manipulate the physical PASID control >> registers? > > No, the OS owns the physical PASID control. If the platform IOMMU > knows how to parse PASID then PASID support is turned on and left on > at boot time. I think you mean the host OS, right? > There should be no guest visible difference to not supporting global > PASID disable, and we can't even implement it for VFs anyhow. > > Same sort of argument for ATS/etc > >>> If kernel exposes pasid cap for PF same as other caps, and in the meantime >>> the variant driver chooses to emulate a DVSEC cap, then userspace follows >>> the below steps to expose pasid cap to VM. >> >> If we have a variant driver, why wouldn't it expose an emulated PASID >> capability rather than a DVSEC if we're choosing to expose PASID for >> PFs? > > Indeed, also an option. Supplying the DVSEC is probably simpler and > addresses other synthesized capability blocks in future. VMM is a > better place to build various synthetic blocks in general, IMHO. > > New VMM's could parse the PF PASID cap and add it to its list of "free > space" > > We may also be overdoing it here.. > > Maybe if the VMM wants to enable PASID we should flip the logic and > the VMM should assume that unused config space is safe to use. Only > devices that violate that rule need to join an ID list and provide a > DVSEC/free space list/etc. So, if the kernel decides to hide a specific physical capability, would the space of this capability be considered free to use as well? > I'm guessing that list will be pretty small and hopefully will not > grow. Is there any channel to collect this kind of info?
:) > It is easy and better for future devices to wrap their hidden > registers in a private DVSEC. Hmmm, do you mean including the registers in a DVSEC, so that userspace can work out the free space by iterating the cap chain? Or do you still mean indicating the free space by DVSEC? I guess the former. > Then I'd suggest just writing the special list in a text file and > leaving it in the VMM side.. Users can adjust the text file right away > if they have old and troublesome devices and all VMMs can share it. So for the existing devices that have both a pasid cap and hidden registers, userspace should add them to the special list and work out the free space by referring to the file. While for devices that only have the pasid cap, or have the hidden registers in a DVSEC, userspace finds free space by iterating the cap chain. This seems general enough for today and the future. >>> 1) Check if a pasid cap is already present in the virtual config space >>> read from kernel. If no, but user wants pasid, then goto step 2). >>> 2) Userspace invokes VFIO_DEVICE_FEATURE to check if the device supports >>> pasid cap. If yes, goto step 3). >> >> Why do we need the vfio feature interface if a physical or virtual PASID >> capability on the device exposes the same info? > > Still need to check if the platform, os, iommu, etc are all OK with > enabling PASID support before the viommu advertises it. This means we don't expose a physical or virtual PASID cap, right? Otherwise, the host kernel could check if pasid is enabled before exposing the PASID cap.
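The gap-finding described above (iterating the cap chain to locate free space) might look roughly like this on the VMM side. A minimal sketch under simplifying assumptions: every extended capability is treated as a fixed cap_len bytes, whereas real code would size each capability by its ID, and would first consult the special list of devices with hidden registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_CFG_SPACE_EXP_SIZE	4096
#define PCI_EXT_CAP_BASE	0x100

/* Extended capability header: ID[15:0], version[19:16], next[31:20] */
static uint16_t ext_cap_next(uint32_t hdr)
{
	return (hdr >> 20) & 0xffc;
}

/* Walk the extended capability chain in 'cfg', mark the bytes each
 * capability claims, then return the first dword-aligned gap of at
 * least 'need' bytes, or 0 if none. */
static uint16_t find_free_ext_offset(const uint8_t *cfg, uint16_t cap_len,
				     uint16_t need)
{
	uint8_t used[PCI_CFG_SPACE_EXP_SIZE] = { 0 };
	uint16_t off = PCI_EXT_CAP_BASE;

	while (off && off + 4 <= PCI_CFG_SPACE_EXP_SIZE) {
		uint32_t hdr;

		memcpy(&hdr, cfg + off, 4);
		if (!(hdr & 0xffff))
			break;		/* empty header terminates the chain */
		memset(used + off, 1, cap_len);
		off = ext_cap_next(hdr);
	}

	for (off = PCI_EXT_CAP_BASE;
	     off + need <= PCI_CFG_SPACE_EXP_SIZE; off += 4) {
		uint16_t i;

		for (i = 0; i < need && !used[off + i]; i++)
			;
		if (i == need)
			return off;
	}
	return 0;
}
```

This is exactly the kind of logic that breaks on devices with hidden bits outside the chain, which is why those devices need the special list.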
On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > I'm not sure how userspace can fully handle this w/o certain assistance > from the kernel. > > So I kind of agree that emulated PASID capability is probably the only > contract which the kernel should provide: > - mapped 1:1 at the physical location, or > - constructed at an offset according to DVSEC, or > - constructed at an offset according to a look-up table > > The VMM always scans the vfio pci config space to expose vPASID. > > Then the remaining open is what VMM could do when a VF supports > PASID but unfortunately it's not reported by vfio. W/o the capability > of inspecting the PASID state of PF, probably the only feasible option > is to maintain a look-up table in VMM itself and assumes the kernel > always enables the PASID cap on PF. I'm still not sure I like doing this in the kernel - we need to do the same sort of thing for ATS too, right? It feels simpler if the kernel indicates whether PASID and ATS can be supported and userspace builds the capability blocks. There are migration considerations too - the blocks need to be migrated over and end up in the same place as well.. Jason
On Tue, Apr 23, 2024 at 03:58:17PM +0800, Yi Liu wrote: > > > > 1) Check if a pasid cap is already present in the virtual config space > > > > read from kernel. If no, but user wants pasid, then goto step 2). > > > > 2) Userspace invokes VFIO_DEVICE_FEATURE to check if the device supports > > > > pasid cap. If yes, goto step 3). > > > > > > Why do we need the vfio feature interface if a physical or virtual PASID > > > capability on the device exposes the same info? > > > > Still need to check if the platform, os, iommu, etc are all OK with > > enabling PASID support before the viommu advertises it. > > This means we don't expose physical or virtual PASID cap, is it? Yeah, keeping both hidden still works. Some kind of test to see if PASID and ATS are supportable on the device. Probably via IOMMUFD_CMD_GET_HW_INFO. If they are supported, then the VMM will find empty space and generate the missing caps. The VMM will have means to fix the location during migration. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, April 23, 2024 8:02 PM > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > > I'm not sure how userspace can fully handle this w/o certain assistance > > from the kernel. > > > > So I kind of agree that emulated PASID capability is probably the only > > contract which the kernel should provide: > > - mapped 1:1 at the physical location, or > > - constructed at an offset according to DVSEC, or > > - constructed at an offset according to a look-up table > > > > The VMM always scans the vfio pci config space to expose vPASID. > > > > Then the remaining open is what VMM could do when a VF supports > > PASID but unfortunately it's not reported by vfio. W/o the capability > > of inspecting the PASID state of PF, probably the only feasible option > > is to maintain a look-up table in VMM itself and assumes the kernel > > always enables the PASID cap on PF. > > I'm still not sure I like doing this in the kernel - we need to do the > same sort of thing for ATS too, right? VF is allowed to implement ATS. PRI has the same problem as PASID. > > It feels simpler if the indicates if PASID and ATS can be supported > and userspace builds the capability blocks. this routes back to Alex's original question about using different interfaces (a device feature vs. PCI PASID cap) for VF and PF. Are we OK with that divergence? > > There are migration considerations too - the blocks need to be > migrated over and end up in the same place as well.. > Can you elaborate what is the problem with the kernel emulating the PASID cap in this consideration? Does it talk about a case where the devices between src/dest are different versions (but backward compatible) with different unused space layout and the kernel approach may pick up different offsets while the VMM can guarantee the same offset?
On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > > > I'm not sure how userspace can fully handle this w/o certain assistance > > > from the kernel. > > > > > > So I kind of agree that emulated PASID capability is probably the only > > > contract which the kernel should provide: > > > - mapped 1:1 at the physical location, or > > > - constructed at an offset according to DVSEC, or > > > - constructed at an offset according to a look-up table > > > > > > The VMM always scans the vfio pci config space to expose vPASID. > > > > > > Then the remaining open is what VMM could do when a VF supports > > > PASID but unfortunately it's not reported by vfio. W/o the capability > > > of inspecting the PASID state of PF, probably the only feasible option > > > is to maintain a look-up table in VMM itself and assumes the kernel > > > always enables the PASID cap on PF. > > > > I'm still not sure I like doing this in the kernel - we need to do the > > same sort of thing for ATS too, right? > > VF is allowed to implement ATS. > > PRI has the same problem as PASID. I'm surprised by this, I would have guessed ATS would be the device global one, PRI not being per-VF seems problematic??? How do you disable PRI generation to get a clean shutdown? > > It feels simpler if the kernel indicates whether PASID and ATS can be supported > > and userspace builds the capability blocks. > > this routes back to Alex's original question about using different > interfaces (a device feature vs. PCI PASID cap) for VF and PF. I'm not sure it is different interfaces.. The only reason to pass the PF's PASID cap is to give free space to the VMM. If we are saying that gaps are free space (excluding a list of bad devices) then we don't actually need to do that anymore. 
VMM will always create a synthetic PASID cap and kernel will always suppress a real one. An iommufd query will indicate if the vIOMMU can support vPASID on that device. Same for all the troublesome non-physical caps. > > There are migration considerations too - the blocks need to be > > migrated over and end up in the same place as well.. > > Can you elaborate what is the problem with the kernel emulating > the PASID cap in this consideration? If the kernel changes the algorithm - say it wants to do PASID, PRI, something_new - then it might change the layout. We can't just have the kernel decide without also providing a way for userspace to say what the right layout actually is. :\ > Does it talk about a case where the devices between src/dest are > different versions (but backward compatible) with different unused > space layout and the kernel approach may pick up different offsets > while the VMM can guarantee the same offset? That is also a concern where the PCI cap layout may change a bit but they are still migration compatible, but my bigger worry is that the kernel just lays out the fake caps in a different way because the kernel changes. At least if the VMM is doing this then the VMM can include the information in its migration scheme and use it to recreate the PCI layout without having to create a bunch of uAPI to do so. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, April 24, 2024 8:12 AM > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > > > > I'm not sure how userspace can fully handle this w/o certain assistance > > > > from the kernel. > > > > > > > > So I kind of agree that emulated PASID capability is probably the only > > > > contract which the kernel should provide: > > > > - mapped 1:1 at the physical location, or > > > > - constructed at an offset according to DVSEC, or > > > > - constructed at an offset according to a look-up table > > > > > > > > The VMM always scans the vfio pci config space to expose vPASID. > > > > > > > > Then the remaining open is what VMM could do when a VF supports > > > > PASID but unfortunately it's not reported by vfio. W/o the capability > > > > of inspecting the PASID state of PF, probably the only feasible option > > > > is to maintain a look-up table in VMM itself and assumes the kernel > > > > always enables the PASID cap on PF. > > > > > > I'm still not sure I like doing this in the kernel - we need to do the > > > same sort of thing for ATS too, right? > > > > VF is allowed to implement ATS. > > > > PRI has the same problem as PASID. > > I'm surprised by this, I would have guessed ATS would be the device > global one, PRI not being per-VF seems problematic??? How do you > disable PRI generation to get a clean shutdown? Here is what the PCIe spec says: For SR-IOV devices, a single Page Request Interface is permitted for the PF and is shared between the PF and its associated VFs, in which case the PF implements this capability and its VFs must not. I'll let Baolu chime in for the potential impact to his PRI cleanup effort, e.g. 
whether disabling PRI generation is mandatory if the IOMMU side is already put in a mode that auto-responds with an error to new PRI requests instead of reporting them to software. But I do see another problem for shared capabilities between PFs/VFs. Now those shared capabilities are enabled/disabled when the PF is attached to/detached from a domain, without counting the shared usage from VFs. Looks like we have a gap here.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, April 24, 2024 8:12 AM > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > It feels simpler if the kernel indicates whether PASID and ATS can be supported > > > and userspace builds the capability blocks. > > > > this routes back to Alex's original question about using different > > interfaces (a device feature vs. PCI PASID cap) for VF and PF. > > I'm not sure it is different interfaces.. > > The only reason to pass the PF's PASID cap is to give free space to > the VMM. If we are saying that gaps are free space (excluding a list > of bad devices) then we don't actually need to do that anymore. > > VMM will always create a synthetic PASID cap and kernel will always > suppress a real one. Oh, you suggest that there won't even be a 1:1 map for the PF! That kind of continues with the device_feature method as this series does, and it could include all VMM-emulated capabilities which are not properly enumerated from the vfio pci config space. This interface only reports the availability/features of a capability but never includes information about the offset. If a device implements DVSEC it will be exposed to the VMM. Then suppose the VMM will introduce a new cmd parameter to turn on the emulation of the PASID capability. It defaults to off so legacy usages can still work. Once the parameter is on then the VMM will emulate the PASID capability by: - Locating a free range according to DVSEC, or - Locating a free range from gaps between PCI caps. If a device is found not to work properly, then add a fixed offset for this device. > > An iommufd query will indicate if the vIOMMU can support vPASID on > that device. > > Same for all the troublesome non-physical caps. > > > > There are migration considerations too - the blocks need to be > > > migrated over and end up in the same place as well.. 
> > > > Can you elaborate what is the problem with the kernel emulating > > the PASID cap in this consideration? > > If the kernel changes the algorithm, say it wants to do PASID, PRI, > something_new then it might change the layout > > We can't just have the kernel decide without also providing a way for > userspace to say what the right layout actually is. :\ > emm that's a good point.
On 2024/4/24 10:57, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Wednesday, April 24, 2024 8:12 AM >> >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: >>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>> Sent: Tuesday, April 23, 2024 8:02 PM >>>> >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: >>>>> I'm not sure how userspace can fully handle this w/o certain assistance >>>>> from the kernel. >>>>> >>>>> So I kind of agree that emulated PASID capability is probably the only >>>>> contract which the kernel should provide: >>>>> - mapped 1:1 at the physical location, or >>>>> - constructed at an offset according to DVSEC, or >>>>> - constructed at an offset according to a look-up table >>>>> >>>>> The VMM always scans the vfio pci config space to expose vPASID. >>>>> >>>>> Then the remaining open is what VMM could do when a VF supports >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability >>>>> of inspecting the PASID state of PF, probably the only feasible option >>>>> is to maintain a look-up table in VMM itself and assumes the kernel >>>>> always enables the PASID cap on PF. >>>> >>>> I'm still not sure I like doing this in the kernel - we need to do the >>>> same sort of thing for ATS too, right? >>> >>> VF is allowed to implement ATS. >>> >>> PRI has the same problem as PASID. >> >> I'm surprised by this, I would have guessed ATS would be the device >> global one, PRI not being per-VF seems problematic??? How do you >> disable PRI generation to get a clean shutdown? > > Here is what the PCIe spec says: > > For SR-IOV devices, a single Page Request Interface is permitted for > the PF and is shared between the PF and its associated VFs, in which > case the PF implements this capability and its VFs must not. > > I'll let Baolu chime in for the potential impact to his PRI cleanup > effort, e.g. 
whether disabling PRI generation is mandatory if the > IOMMU side is already put in a mode auto-responding error to > new PRI request instead of reporting to sw. The PRI cleanup steps are defined like this: * - Disable new PRI reception: Turn off PRI generation in the IOMMU hardware * and flush any hardware page request queues. This should be done before * calling into this helper. * - Acknowledge all outstanding PRQs to the device: Respond to all outstanding * page requests with IOMMU_PAGE_RESP_INVALID, indicating the device should * not retry. This helper function handles this. * - Disable PRI on the device: After calling this helper, the caller could * then disable PRI on the device. Disabling PRI on the device is the last step and optional because the IOMMU is required to support a PRI blocking state and has already been put in that state at the first step. For the VF case, it probably is a no-op except for maintaining a reference count. Once PRI is disabled on all PFs and VFs, it can then be physically disabled on the PF. > > But I do see another problem for shared capabilities between PF/VFs. > > Now those shared capabilities are enabled/disabled when the PF is > attached to/detached from a domain, w/o counting the shared usage > from VFs. > > Looks we have a gap here. Yes, there's a gap at least for the Intel IOMMU driver. I'll soon fix this by moving the handling of ATS out of the driver, especially from the default domain attach/detach paths. Best regards, baolu
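The gap Kevin points out (shared PF/VF capabilities toggled only on PF attach/detach) is essentially a missing reference count. A toy model of the fix, with illustrative names rather than actual PCI core code:

```c
#include <assert.h>
#include <stdbool.h>

/* The PRI capability lives on the PF and is shared with its VFs, so it
 * may only be physically disabled once no PF or VF user remains. */
struct pf_state {
	int pri_users;		/* outstanding PF + VF enable requests */
	bool pri_enabled;	/* physical PRI Enable bit on the PF */
};

static void pri_get(struct pf_state *pf)
{
	if (pf->pri_users++ == 0)
		pf->pri_enabled = true;		/* first user: enable on the PF */
}

static void pri_put(struct pf_state *pf)
{
	if (--pf->pri_users == 0)
		pf->pri_enabled = false;	/* last user gone: safe to disable */
}
```

A VF detach then only drops the count; the physical disable happens when the last of the PF and all its VFs lets go, matching the "no-op except for maintaining a reference count" description above.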
On Wed, Apr 24, 2024 at 02:57:38AM +0000, Tian, Kevin wrote: > Now those shared capabilities are enabled/disabled when the PF is > attached to/detached from a domain, w/o counting the shared usage > from VFs. Urh.. This looks broken in the kernel right now? If you attach a SVA PASID to a VF then the iommu driver will call pci_enable_pri(vf) which will always fail: /* * VFs must not implement the PRI Capability. If their PF * implements PRI, it is shared by the VFs, so if the PF PRI is * enabled, it is also enabled for the VF. */ if (pdev->is_virtfn) { if (pci_physfn(pdev)->pri_enabled) return 0; return -EINVAL; } More to fix :( Jason
On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, April 24, 2024 8:12 AM > > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > > > It feels simpler if the kernel indicates whether PASID and ATS can be supported > > > > and userspace builds the capability blocks. > > > > > > this routes back to Alex's original question about using different > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF. > > > > I'm not sure it is different interfaces.. > > > > The only reason to pass the PF's PASID cap is to give free space to > > the VMM. If we are saying that gaps are free space (excluding a list > > of bad devices) then we don't actually need to do that anymore. > > > > VMM will always create a synthetic PASID cap and kernel will always > > suppress a real one. > > oh you suggest that there won't even be a 1:1 map for PF! Right. No real need.. > kind of continue with the device_feature method as this series does. > and it could include all VMM-emulated capabilities which are not > enumerated properly from vfio pci config space. 1) VFIO creates the iommufd idev 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc. is supported 3) VMM locates empty space in the config space 4) VMM figures out where and what cap blocks to create (considering migration needs/etc) 5) VMM synthesizes the blocks and ties emulation to other iommufd things This works generically for any synthetic vPCI function including a non-vfio-pci one. Most likely, due to migration needs, the exact layout of the PCI config space should be configured into the VMM, including the location of any blocks copied from physical and any blocks synthesized. This is the only way to be sure the config space is actually 100% consistent. 
For non-migration cases, to make it automatic, we can check the free space via gaps. Broken devices that have problems with this can be handled by telling the VMM to use the explicit approach above, by having the VMM consult some text file, or by leaving vPASID/etc. disabled. IMHO PASID support is so rare today this is probably fine. Vendors should be *strongly encouraged* to wrap their specially used config space areas in DVSEC and not hide them in free space. We may also want a DVSEC to indicate free space - but if vendors are going to change their devices, I'd rather they change to mark the used space with DVSEC than mark the free space :) Jason
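Step 5 above (the VMM synthesizing the cap blocks) is mostly bit-packing against the PCIe-defined layout. A sketch of building an 8-byte PASID Extended Capability image to drop into a free offset; the helper name is made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PCI_EXT_CAP_ID_PASID	0x001B
#define PASID_CAP_EXEC		(1 << 1)	/* Execute Permission Supported */
#define PASID_CAP_PRIV		(1 << 2)	/* Privileged Mode Supported */

/* Build the capability image: header dword (ID, version 1, next
 * pointer), then the 16-bit Capability and Control registers.
 * 'next' is the offset of the next extended capability, 0 ends the
 * chain.  Max PASID Width lives in Capability Register bits 12:8. */
static void synth_pasid_cap(uint8_t buf[8], uint16_t next,
			    uint8_t max_pasid_width, bool exec, bool priv)
{
	uint32_t hdr = PCI_EXT_CAP_ID_PASID | (1u << 16) |
		       ((uint32_t)next << 20);
	uint16_t cap = ((max_pasid_width & 0x1f) << 8) |
		       (exec ? PASID_CAP_EXEC : 0) |
		       (priv ? PASID_CAP_PRIV : 0);
	uint16_t ctrl = 0;	/* all enable bits start clear */

	memcpy(buf + 0, &hdr, 4);
	memcpy(buf + 4, &cap, 2);
	memcpy(buf + 6, &ctrl, 2);
}
```

The capability bits a VMM would actually advertise should mirror what the host enabled, per the earlier discussion of keeping the virtual register spec-compliant.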
On Tue, 23 Apr 2024 21:12:21 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > > > > I'm not sure how userspace can fully handle this w/o certain assistance > > > > from the kernel. > > > > > > > > So I kind of agree that emulated PASID capability is probably the only > > > > contract which the kernel should provide: > > > > - mapped 1:1 at the physical location, or > > > > - constructed at an offset according to DVSEC, or > > > > - constructed at an offset according to a look-up table > > > > > > > > The VMM always scans the vfio pci config space to expose vPASID. > > > > > > > > Then the remaining open is what VMM could do when a VF supports > > > > PASID but unfortunately it's not reported by vfio. W/o the capability > > > > of inspecting the PASID state of PF, probably the only feasible option > > > > is to maintain a look-up table in VMM itself and assumes the kernel > > > > always enables the PASID cap on PF. > > > > > > I'm still not sure I like doing this in the kernel - we need to do the > > > same sort of thing for ATS too, right? > > > > VF is allowed to implement ATS. > > > > PRI has the same problem as PASID. > > I'm surprised by this, I would have guessed ATS would be the device > global one, PRI not being per-VF seems problematic??? How do you > disable PRI generation to get a clean shutdown? > > > > It feels simpler if the indicates if PASID and ATS can be supported > > > and userspace builds the capability blocks. > > > > this routes back to Alex's original question about using different > > interfaces (a device feature vs. PCI PASID cap) for VF and PF. > > I'm not sure it is different interfaces.. > > The only reason to pass the PF's PASID cap is to give free space to > the VMM. 
If we are saying that gaps are free space (excluding a list > of bad devices) then we don't acutally need to do that anymore. Are we saying that now?? That's new. > VMM will always create a synthetic PASID cap and kernel will always > suppress a real one. > > An iommufd query will indicate if the vIOMMU can support vPASID on > that device. > > Same for all the troublesome non-physical caps. > > > > There are migration considerations too - the blocks need to be > > > migrated over and end up in the same place as well.. > > > > Can you elaborate what is the problem with the kernel emulating > > the PASID cap in this consideration? > > If the kernel changes the algorithm, say it wants to do PASID, PRI, > something_new then it might change the layout > > We can't just have the kernel decide without also providing a way for > userspace to say what the right layout actually is. :\ The capability layout is only relevant to migration, right? A variant driver that supports migration is a prerequisite and would also be responsible for exposing the PASID capability. This isn't as disjoint as it's being portrayed. > > Does it talk about a case where the devices between src/dest are > > different versions (but backward compatible) with different unused > > space layout and the kernel approach may pick up different offsets > > while the VMM can guarantee the same offset? > > That is also a concern where the PCI cap layout may change a bit but > they are still migration compatible, but my bigger worry is that the > kernel just lays out the fake caps in a different way because the > kernel changes. Outside of migration, what does it matter if the cap layout is different? A driver should never hard code the address for a capability. > At least if the VMM is doing this then the VMM can include the > information in its migration scheme and use it to recreate the PCI > layout withotu having to create a bunch of uAPI to do so. 
We're again back to migration compatibility, where again the capability layout would be governed by the migration support in the in-kernel variant driver. Once migration is involved the location of a PASID shouldn't be arbitrary, whether it's provided by the kernel or the VMM. Regardless, the VMM ultimately has the authority over what the guest sees in config space. The VMM is not bound to expose the PASID at the offset provided by the kernel, or bound to expose it at all. The kernel-exposed PASID can simply provide an available location and set of enabled capabilities. Thanks, Alex
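To make the "gaps are free space" idea under debate concrete, a scan of the extended capability chain might look like the sketch below. This is illustrative only, not vfio or QEMU code: the per-ID length table (`ecap_len`) is an assumption, since the extended capability header carries no length field, which is exactly why hidden registers outside the chain are a risk for any such scan.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ECAP_START 0x100
#define CFG_SIZE   0x1000

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
	uint32_t v;

	memcpy(&v, cfg + off, 4);
	return v;
}

/* Assumed per-ID length table; real code needs spec knowledge for
 * every capability, and still cannot see vendor-hidden registers. */
static unsigned ecap_len(uint16_t id)
{
	switch (id) {
	case 0x0001: return 0x40;	/* AER */
	case 0x001B: return 0x08;	/* PASID */
	default:     return 0x04;	/* header only - optimistic guess */
	}
}

/* Walk the chain from 0x100, mark used bytes, return the first
 * DWORD-aligned hole of at least 'size' bytes, or 0 if none. */
unsigned find_free_ecap_offset(const uint8_t *cfg, unsigned size)
{
	uint8_t used[CFG_SIZE] = { 0 };
	unsigned off = ECAP_START;

	while (off >= ECAP_START && off < CFG_SIZE) {
		uint32_t hdr = rd32(cfg, off);
		uint16_t id = hdr & 0xFFFF;
		unsigned len;

		if (id == 0 || id == 0xFFFF)	/* empty or absent chain */
			break;
		len = ecap_len(id);
		if (off + len > CFG_SIZE)
			len = CFG_SIZE - off;
		memset(used + off, 1, len);
		off = (hdr >> 20) & 0xFFC;	/* next pointer, bits 31:20 */
	}

	for (off = ECAP_START; off + size <= CFG_SIZE; off += 4) {
		unsigned i;

		for (i = 0; i < size && !used[off + i]; i++)
			;
		if (i == size)
			return off;
	}
	return 0;
}
```

The sketch makes the thread's objection visible: `ecap_len()` is a quirk table by another name, and every device with registers outside the advertised chain needs an entry or an exclusion.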
On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote: > > The only reason to pass the PF's PASID cap is to give free space to > > the VMM. If we are saying that gaps are free space (excluding a list > > of bad devices) then we don't actually need to do that anymore. > > Are we saying that now?? That's new. I suggested it a few times > > > VMM will always create a synthetic PASID cap and kernel will always > > suppress a real one. > > > > An iommufd query will indicate if the vIOMMU can support vPASID on > > that device. > > > > Same for all the troublesome non-physical caps. > > > > > > There are migration considerations too - the blocks need to be > > > > migrated over and end up in the same place as well.. > > > > > > Can you elaborate what is the problem with the kernel emulating > > > the PASID cap in this consideration? > > > > If the kernel changes the algorithm, say it wants to do PASID, PRI, > > something_new then it might change the layout > > > > We can't just have the kernel decide without also providing a way for > > userspace to say what the right layout actually is. :\ > > The capability layout is only relevant to migration, right? Yes, probably > A variant > driver that supports migration is a prerequisite and would also be > responsible for exposing the PASID capability. This isn't as disjoint > as it's being portrayed. I guess.. But also not quite. We still have the problem that kernel migration driver V1 could legitimately create a different config space than migration driver V2. And now you are saying that the migration driver has to parse the migration stream and readjust its own layout. And every driver needs to do this? We can, it is quite a big bit of infrastructure I think, but sure.. I fear the VMM still has to be involved somehow because it still has to know if the source VMM has removed any kernel-created caps. > Outside of migration, what does it matter if the cap layout is
A driver should never hard code the address for a > capability. Yes, talking about migration here - migration is the hardest case it seems. > > At least if the VMM is doing this then the VMM can include the > > information in its migration scheme and use it to recreate the PCI > > layout without having to create a bunch of uAPI to do so. > > We're again back to migration compatibility, where again the capability > layout would be governed by the migration support in the in-kernel > variant driver. Once migration is involved the location of a PASID > shouldn't be arbitrary, whether it's provided by the kernel or the VMM. I wasn't going in this direction. I was thinking to make the VMM create the config space layout that is appropriate and hold it stable as a migration ABI. I think in practice many VMMs are going to do this anyhow unless we put full support for config space synthesis, stable versions, and version selection in the kernel directly. I was thinking to avoid doing that. > Regardless, the VMM ultimately has the authority over what the guest > sees in config space. The VMM is not bound to expose the PASID at the > offset provided by the kernel, or bound to expose it at all. The > kernel-exposed PASID can simply provide an available location and set > of enabled capabilities. And if the VMM is going to ignore the kernel layout then why do so much work in the kernel to create it? I think we need to decide: either only the VMM or only the kernel should do this. Jason
On Wed, 24 Apr 2024 11:15:25 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Wednesday, April 24, 2024 8:12 AM > > > > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > > > > > It feels simpler if the indicates if PASID and ATS can be supported > > > > > and userspace builds the capability blocks. > > > > > > > > this routes back to Alex's original question about using different > > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF. > > > > > > I'm not sure it is different interfaces.. > > > > > > The only reason to pass the PF's PASID cap is to give free space to > > > the VMM. If we are saying that gaps are free space (excluding a list > > > of bad devices) then we don't acutally need to do that anymore. > > > > > > VMM will always create a synthetic PASID cap and kernel will always > > > suppress a real one. > > > > oh you suggest that there won't even be a 1:1 map for PF! > > Right. No real need.. > > > kind of continue with the device_feature method as this series does. > > and it could include all VMM-emulated capabilities which are not > > enumerated properly from vfio pci config space. > > 1) VFIO creates the iommufd idev > 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc, > etc is supported > 3) VMM locates empty space in the config space > 4) VMM figures out where and what cap blocks to create (considering > migration needs/etc) > 5) VMM synthesizes the blocks and ties emulation to other iommufd things > > This works generically for any synthetic vPCI function including a > non-vfio-pci one. Maybe this is the actual value in implementing this in the VMM, one implementation can support multiple device interfaces. 
> Most likely due to migration needs the exact layout of the PCI config > space should be configured to the VMM, including the location of any > blocks copied from physical and any blocks synthesized. This is the > only way to be sure the config space is actually 100% consistent. Where is this concern about config space arbitrarily changing coming from? It's possible, yes, but vfio-pci-core or a variant driver are going to have some sort of reasoning for exposing a capability at a given offset. A variant driver is necessary for supporting migration; that variant driver should be aware that capability offsets are part of the migration contract, and QEMU will enforce identical config space unless we introduce exceptions. > For non-migration cases to make it automatic we can check the free > space via gaps. The broken devices that have problems with this can > either be told to use the explicit approach above, the VMM could > consult some text file, or vPASID/etc can be left disabled. IMHO the > support of PASID is so rare today this is probably fine. > > Vendors should be *strongly encouraged* to wrap their special used > config space areas in DVSEC and not hide them in free space. As in my previous reply, this is a new approach that I had thought we weren't comfortable with, and I'm still not very comfortable with it. The fact is random registers exist outside of capabilities today and creating a general policy that we're going to deal with that as issues arise from a generic "find a gap" algorithm is concerning. > We may also want a DVSEC to indicate free space - but if vendors are > going to change their devices I'd rather them change to mark the used > space with DVSEC than mark the free space :) Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be great. Thanks, Alex
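For reference, the synthesis step in the five-point flow quoted above (VMM builds the capability block at a free offset and ties it into the chain) could be sketched roughly as below. Hedged heavily: `synth_pasid_cap` and its offsets are invented for illustration; the register layout follows my reading of the PCIe PASID extended capability (ID 0x1B: a 4-byte header, then 16-bit PASID Capability and Control registers, Max PASID Width in bits 12:8), and the byte stores assume a little-endian host where real code would use explicit le16/le32 accessors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_EXT_CAP_ID_PASID 0x001B

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
	uint32_t v;

	memcpy(&v, cfg + off, 4);
	return v;
}

static void wr32(uint8_t *cfg, unsigned off, uint32_t v)
{
	memcpy(cfg + off, &v, 4);
}

/* Write a synthetic PASID cap at 'off' (terminating the chain there)
 * and patch the 'next' field of the capability at 'prev_off' so the
 * chain reaches it.  'max_pasid_width' lands in bits 12:8 of the
 * PASID Capability Register. */
void synth_pasid_cap(uint8_t *cfg, unsigned off, unsigned prev_off,
		     unsigned max_pasid_width)
{
	uint16_t pasid_cap, pasid_ctrl = 0;	/* PASID Enable starts clear */

	/* Extended cap header: ID in 15:0, version 19:16, next in 31:20. */
	wr32(cfg, off, PCI_EXT_CAP_ID_PASID | (1u << 16));

	pasid_cap = (uint16_t)((max_pasid_width & 0x1F) << 8);
	memcpy(cfg + off + 4, &pasid_cap, 2);
	memcpy(cfg + off + 6, &pasid_ctrl, 2);

	/* Splice into the chain by rewriting the previous next pointer. */
	wr32(cfg, prev_off,
	     (rd32(cfg, prev_off) & 0x000FFFFFu) | ((uint32_t)off << 20));
}
```

The part the sketch cannot answer is the one being argued here: where `off` comes from, and who guarantees it stays stable across migration.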
On Wed, Apr 24, 2024 at 12:38:51PM -0600, Alex Williamson wrote: > On Wed, 24 Apr 2024 11:15:25 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Wednesday, April 24, 2024 8:12 AM > > > > > > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > > Sent: Tuesday, April 23, 2024 8:02 PM > > > > > > > > > > > > It feels simpler if the indicates if PASID and ATS can be supported > > > > > > and userspace builds the capability blocks. > > > > > > > > > > this routes back to Alex's original question about using different > > > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF. > > > > > > > > I'm not sure it is different interfaces.. > > > > > > > > The only reason to pass the PF's PASID cap is to give free space to > > > > the VMM. If we are saying that gaps are free space (excluding a list > > > > of bad devices) then we don't acutally need to do that anymore. > > > > > > > > VMM will always create a synthetic PASID cap and kernel will always > > > > suppress a real one. > > > > > > oh you suggest that there won't even be a 1:1 map for PF! > > > > Right. No real need.. > > > > > kind of continue with the device_feature method as this series does. > > > and it could include all VMM-emulated capabilities which are not > > > enumerated properly from vfio pci config space. > > > > 1) VFIO creates the iommufd idev > > 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc, > > etc is supported > > 3) VMM locates empty space in the config space > > 4) VMM figures out where and what cap blocks to create (considering > > migration needs/etc) > > 5) VMM synthesizes the blocks and ties emulation to other iommufd things > > > > This works generically for any synthetic vPCI function including a > > non-vfio-pci one. 
> > Maybe this is the actual value in implementing this in the VMM, one > implementation can support multiple device interfaces. > > > Most likely due to migration needs the exact layout of the PCI config > > space should be configured to the VMM, including the location of any > > blocks copied from physical and any blocks synthezied. This is the > > only way to be sure the config space is actually 100% consistent. > > Where is this concern about config space arbitrarily changing coming > from? It is important for migration. Today with the drivers we have the devices have to take care of their own config space layout only in HW. We are not expecting migration drivers to worry about any SW created config space. SW config space is a new thing. > > We may also want a DVSEC to indicate free space - but if vendors are > > going to change their devices I'd rather them change to mark the used > > space with DVSEC then mark the free space :) > > Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be > great We are applying this going forward to PASID, and PRI, which barely exist on devices at all today. We don't need to worry about those 10+yr old devices that don't support PASID/PRI in the first place. I think this makes the problem much smaller. Jason
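The layout-consistency worry both sides keep returning to is at least mechanically checkable. A hypothetical lockstep walk of the source-recorded and target extended capability chains could look like this (names invented; QEMU's actual enforcement, as Alex describes elsewhere in the thread, compares the emulated config space as a whole rather than calling anything like this):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
	uint32_t v;

	memcpy(&v, cfg + off, 4);
	return v;
}

/* Return 1 iff both config spaces carry the same extended capability
 * IDs/versions at the same offsets, chained identically. */
int ecap_layout_matches(const uint8_t *src, const uint8_t *dst)
{
	unsigned off = 0x100;

	while (off >= 0x100 && off < 0x1000) {
		uint32_t hs = rd32(src, off), hd = rd32(dst, off);

		if ((hs & 0xFFFFF) != (hd & 0xFFFFF))	/* ID+version differ */
			return 0;
		if (((hs >> 20) & 0xFFC) != ((hd >> 20) & 0xFFC))
			return 0;			/* chains diverge */
		if ((hs & 0xFFFF) == 0)			/* both chains empty */
			return 1;
		off = (hs >> 20) & 0xFFC;
	}
	return 1;
}
```

A check like this can only detect a V1-vs-V2 layout drift at migration time; it cannot repair it, which is why the thread keeps circling back to who owns the layout in the first place.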
On Wed, 24 Apr 2024 15:36:26 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote: > > > The only reason to pass the PF's PASID cap is to give free space to > > > the VMM. If we are saying that gaps are free space (excluding a list > > > of bad devices) then we don't acutally need to do that anymore. > > > > Are we saying that now?? That's new. > > I suggested it a few times > > > > > > VMM will always create a synthetic PASID cap and kernel will always > > > suppress a real one. > > > > > > An iommufd query will indicate if the vIOMMU can support vPASID on > > > that device. > > > > > > Same for all the troublesome non-physical caps. > > > > > > > > There are migration considerations too - the blocks need to be > > > > > migrated over and end up in the same place as well.. > > > > > > > > Can you elaborate what is the problem with the kernel emulating > > > > the PASID cap in this consideration? > > > > > > If the kernel changes the algorithm, say it wants to do PASID, PRI, > > > something_new then it might change the layout > > > > > > We can't just have the kernel decide without also providing a way for > > > userspace to say what the right layout actually is. :\ > > > > The capability layout is only relevant to migration, right? > > Yes, proabbly > > > A variant > > driver that supports migration is a prerequisite and would also be > > responsible for exposing the PASID capability. This isn't as disjoint > > as it's being portrayed. > > I guess.. But also not quite. We still have the problem that kernel > migration driver V1 could legitimately create a different config space > that migration driver V2 > > And now you are saying that the migration driver has to parse the > migration stream and readjust its own layout > > And every driver needs to do this? > > We can, it is a quite big bit of infrastructure I think, but sure.. 
> > I fear the VMM still has to be involved somehow because it still has > to know if the source VMM has removed any kernel created caps. This is kind of an absurd example to portray as a ubiquitous problem. Typically the config space layout is a reflection of hardware whether the device supports migration or not. If a driver were to insert a virtual capability, then yes it would want to be consistent about it if it also cares about migration. If the driver needs to change the location of a virtual capability, problems will arise, but that's also not something that every driver needs to do. Also, how exactly does emulating the capability in the VMM solve this problem? Currently QEMU migration simply applies state to an identical VM on the target. QEMU doesn't modify the target VM to conform to the data stream. So in either case, the problem might be more along the lines of how to make a V1 device from a V2 driver, which is more the device type/flavor/persona problem. > > Outside of migration, what does it matter if the cap layout is > > different? A driver should never hard code the address for a > > capability. > > Yes, talking about migration here - migration is the hardest case it > seems. > > > > At least if the VMM is doing this then the VMM can include the > > > information in its migration scheme and use it to recreate the PCI > > > layout withotu having to create a bunch of uAPI to do so. > > > > We're again back to migration compatibility, where again the capability > > layout would be governed by the migration support in the in-kernel > > variant driver. Once migration is involved the location of a PASID > > shouldn't be arbitrary, whether it's provided by the kernel or the VMM. > > I wasn't going in this direction. I was thinking to make the VMM > create the config space layout that is approriate and hold it stable > as a migration ABI. 
> > I think in practice many VMMs are going to do this anyhow unless we > put full support for config space synthesis, stable versions, and > version selection in the kernel directly. I was thinking to avoid > doing that. Currently QEMU relies on determinism that a given command line results in an identical machine configuration and identical devices. State of that target VM is then populated by, not defined by, the migration stream. > > Regardless, the VMM ultimately has the authority over what the guest > > sees in config space. The VMM is not bound to expose the PASID at the > > offset provided by the kernel, or bound to expose it at all. The > > kernel-exposed PASID can simply provide an available location and set > > of enabled capabilities. > > And if the VMM is going to ignore the kernel layout then why do so > much work in the kernel to create it? Ok, let's not ignore it ;) > I think we need to decide, either only the VMM or only the kernel > should do this. What are you actually proposing? Are you suggesting that if a device supports the Power Management capability it will be virtualized at offset 0x60, if the device supports the MSI capability it will be virtualized at 0x68,... if a device supports PASID it will be virtualized at offset 0x300, etc...? That's not only impractical because we can't lay out all the capabilities within the available space, but also because we will run into masking hidden registers and devices where the driver hard codes a capability offset. If the VMM implements the "find a gap" solution then it's just as subject to config space changes in hardware or provided by the variant driver. If not either of those, are we hard coding a device-specific config space map into the VMM or providing one on the command line? I thought we were using vfio-pci variant drivers and a defined vfio migration API in order to prevent modifying the VMM for every device we want to support migration.
Also I'd wonder if the driver itself shouldn't be configured to provide a compatible type. Thanks, Alex
On 2024/4/25 02:24, Alex Williamson wrote: > On Tue, 23 Apr 2024 21:12:21 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: >>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>> Sent: Tuesday, April 23, 2024 8:02 PM >>>> >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: >>>>> I'm not sure how userspace can fully handle this w/o certain assistance >>>>> from the kernel. >>>>> >>>>> So I kind of agree that emulated PASID capability is probably the only >>>>> contract which the kernel should provide: >>>>> - mapped 1:1 at the physical location, or >>>>> - constructed at an offset according to DVSEC, or >>>>> - constructed at an offset according to a look-up table >>>>> >>>>> The VMM always scans the vfio pci config space to expose vPASID. >>>>> >>>>> Then the remaining open is what VMM could do when a VF supports >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability >>>>> of inspecting the PASID state of PF, probably the only feasible option >>>>> is to maintain a look-up table in VMM itself and assumes the kernel >>>>> always enables the PASID cap on PF. >>>> >>>> I'm still not sure I like doing this in the kernel - we need to do the >>>> same sort of thing for ATS too, right? >>> >>> VF is allowed to implement ATS. >>> >>> PRI has the same problem as PASID. >> >> I'm surprised by this, I would have guessed ATS would be the device >> global one, PRI not being per-VF seems problematic??? How do you >> disable PRI generation to get a clean shutdown? >> >>>> It feels simpler if the indicates if PASID and ATS can be supported >>>> and userspace builds the capability blocks. >>> >>> this routes back to Alex's original question about using different >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF. >> >> I'm not sure it is different interfaces.. >> >> The only reason to pass the PF's PASID cap is to give free space to >> the VMM. 
If we are saying that gaps are free space (excluding a list >> of bad devices) then we don't acutally need to do that anymore. > > Are we saying that now?? That's new. > >> VMM will always create a synthetic PASID cap and kernel will always >> suppress a real one. >> >> An iommufd query will indicate if the vIOMMU can support vPASID on >> that device. >> >> Same for all the troublesome non-physical caps. >> >>>> There are migration considerations too - the blocks need to be >>>> migrated over and end up in the same place as well.. >>> >>> Can you elaborate what is the problem with the kernel emulating >>> the PASID cap in this consideration? >> >> If the kernel changes the algorithm, say it wants to do PASID, PRI, >> something_new then it might change the layout >> >> We can't just have the kernel decide without also providing a way for >> userspace to say what the right layout actually is. :\ > > The capability layout is only relevant to migration, right? A variant > driver that supports migration is a prerequisite and would also be > responsible for exposing the PASID capability. This isn't as disjoint > as it's being portrayed. > >>> Does it talk about a case where the devices between src/dest are >>> different versions (but backward compatible) with different unused >>> space layout and the kernel approach may pick up different offsets >>> while the VMM can guarantee the same offset? >> >> That is also a concern where the PCI cap layout may change a bit but >> they are still migration compatible, but my bigger worry is that the >> kernel just lays out the fake caps in a different way because the >> kernel changes. > > Outside of migration, what does it matter if the cap layout is > different? A driver should never hard code the address for a > capability. > But it may store the offset of capability to make next cap access more convenient. I noticted struct pci_dev stores the offset of PRI and PASID cap. 
So if the layout of config space changed between src and dst, it may result in problems in the guest when the guest driver uses those offsets to access the PRI/PASID caps. I can see pci_dev stores offsets of other caps as well (acs, msi, msix). So there is already a problem even putting aside the PRI and PASID caps.

#ifdef CONFIG_PCI_PRI
	u16 pri_cap;		/* PRI Capability offset */
	u32 pri_reqs_alloc;	/* Number of PRI requests allocated */
	unsigned int pasid_required:1; /* PRG Response PASID Required */
#endif
#ifdef CONFIG_PCI_PASID
	u16 pasid_cap;		/* PASID Capability offset */
	u16 pasid_features;
#endif
#ifdef CONFIG_PCI_P2PDMA
	struct pci_p2pdma __rcu *p2pdma;
#endif
#ifdef CONFIG_PCI_DOE
	struct xarray doe_mbs;	/* Data Object Exchange mailboxes */
#endif
	u16 acs_cap;		/* ACS Capability offset */

https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350

>> At least if the VMM is doing this then the VMM can include the >> information in its migration scheme and use it to recreate the PCI >> layout without having to create a bunch of uAPI to do so. > > We're again back to migration compatibility, where again the capability > layout would be governed by the migration support in the in-kernel > variant driver. Once migration is involved the location of a PASID > shouldn't be arbitrary, whether it's provided by the kernel or the VMM. > > Regardless, the VMM ultimately has the authority over what the guest > sees in config space. The VMM is not bound to expose the PASID at the > offset provided by the kernel, or bound to expose it at all. The > kernel-exposed PASID can simply provide an available location and set > of enabled capabilities. Thanks, > > Alex >
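The concern above translates directly into code: a driver walks the chain once at probe time and caches the offset, as the quoted pci_dev fields do, so a layout shift across migration leaves the cached value pointing at the wrong registers. A minimal, hypothetical lookalike of pci_find_ext_capability() over a flat config-space array (not the kernel's implementation, just the shape of the problem):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd32(const uint8_t *cfg, unsigned off)
{
	uint32_t v;

	memcpy(&v, cfg + off, 4);
	return v;
}

/* Walk the extended capability chain for 'cap_id'; return its offset,
 * or 0 if absent.  Drivers call something like this once at probe and
 * cache the result for the lifetime of the device. */
uint16_t find_ext_cap(const uint8_t *cfg, uint16_t cap_id)
{
	unsigned off = 0x100;

	while (off >= 0x100 && off < 0x1000) {
		uint32_t hdr = rd32(cfg, off);

		if ((hdr & 0xFFFF) == cap_id)
			return (uint16_t)off;
		off = (hdr >> 20) & 0xFFC;
	}
	return 0;
}
```

After migration to a device whose PASID cap moved, the cached offset reads stale data even though a fresh walk would still find the capability, and nothing tells a running guest to re-walk.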
On Thu, 25 Apr 2024 17:26:54 +0800 Yi Liu <yi.l.liu@intel.com> wrote: > On 2024/4/25 02:24, Alex Williamson wrote: > > On Tue, 23 Apr 2024 21:12:21 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: > >>>> From: Jason Gunthorpe <jgg@nvidia.com> > >>>> Sent: Tuesday, April 23, 2024 8:02 PM > >>>> > >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: > >>>>> I'm not sure how userspace can fully handle this w/o certain assistance > >>>>> from the kernel. > >>>>> > >>>>> So I kind of agree that emulated PASID capability is probably the only > >>>>> contract which the kernel should provide: > >>>>> - mapped 1:1 at the physical location, or > >>>>> - constructed at an offset according to DVSEC, or > >>>>> - constructed at an offset according to a look-up table > >>>>> > >>>>> The VMM always scans the vfio pci config space to expose vPASID. > >>>>> > >>>>> Then the remaining open is what VMM could do when a VF supports > >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability > >>>>> of inspecting the PASID state of PF, probably the only feasible option > >>>>> is to maintain a look-up table in VMM itself and assumes the kernel > >>>>> always enables the PASID cap on PF. > >>>> > >>>> I'm still not sure I like doing this in the kernel - we need to do the > >>>> same sort of thing for ATS too, right? > >>> > >>> VF is allowed to implement ATS. > >>> > >>> PRI has the same problem as PASID. > >> > >> I'm surprised by this, I would have guessed ATS would be the device > >> global one, PRI not being per-VF seems problematic??? How do you > >> disable PRI generation to get a clean shutdown? > >> > >>>> It feels simpler if the indicates if PASID and ATS can be supported > >>>> and userspace builds the capability blocks. > >>> > >>> this routes back to Alex's original question about using different > >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF. 
> >> > >> I'm not sure it is different interfaces.. > >> > >> The only reason to pass the PF's PASID cap is to give free space to > >> the VMM. If we are saying that gaps are free space (excluding a list > >> of bad devices) then we don't acutally need to do that anymore. > > > > Are we saying that now?? That's new. > > > >> VMM will always create a synthetic PASID cap and kernel will always > >> suppress a real one. > >> > >> An iommufd query will indicate if the vIOMMU can support vPASID on > >> that device. > >> > >> Same for all the troublesome non-physical caps. > >> > >>>> There are migration considerations too - the blocks need to be > >>>> migrated over and end up in the same place as well.. > >>> > >>> Can you elaborate what is the problem with the kernel emulating > >>> the PASID cap in this consideration? > >> > >> If the kernel changes the algorithm, say it wants to do PASID, PRI, > >> something_new then it might change the layout > >> > >> We can't just have the kernel decide without also providing a way for > >> userspace to say what the right layout actually is. :\ > > > > The capability layout is only relevant to migration, right? A variant > > driver that supports migration is a prerequisite and would also be > > responsible for exposing the PASID capability. This isn't as disjoint > > as it's being portrayed. > > > >>> Does it talk about a case where the devices between src/dest are > >>> different versions (but backward compatible) with different unused > >>> space layout and the kernel approach may pick up different offsets > >>> while the VMM can guarantee the same offset? > >> > >> That is also a concern where the PCI cap layout may change a bit but > >> they are still migration compatible, but my bigger worry is that the > >> kernel just lays out the fake caps in a different way because the > >> kernel changes. > > > > Outside of migration, what does it matter if the cap layout is ^^^^^^^^^^^^^^^^^^^^ > > different? 
A driver should never hard code the address for a > > capability. > > > > But it may store the offset of capability to make next cap access more > convenient. I noticted struct pci_dev stores the offset of PRI and PASID > cap. So if the layout of config space changed between src and dst, it may > result in problem in guest when guest driver uses the offsets to access > PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi, > msix). So there is already a problem even put aside the PRI and PASID cap. Yes, I had noted "outside of migration" above. Config space must be consistent to a running VM. But the possibility of config space changing like this only exists in the case where the driver supports migration, so I think we're inventing an unrealistic concern that a driver that supports migration would arbitrarily modify the config space layout in order to make an argument for VMM managed layout. Thanks, Alex > #ifdef CONFIG_PCI_PRI > u16 pri_cap; /* PRI Capability offset */ > u32 pri_reqs_alloc; /* Number of PRI requests allocated */ > unsigned int pasid_required:1; /* PRG Response PASID Required */ > #endif > #ifdef CONFIG_PCI_PASID > u16 pasid_cap; /* PASID Capability offset */ > u16 pasid_features; > #endif > #ifdef CONFIG_PCI_P2PDMA > struct pci_p2pdma __rcu *p2pdma; > #endif > #ifdef CONFIG_PCI_DOE > struct xarray doe_mbs; /* Data Object Exchange mailboxes */ > #endif > u16 acs_cap; /* ACS Capability offset */ > > https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350 > > >> At least if the VMM is doing this then the VMM can include the > >> information in its migration scheme and use it to recreate the PCI > >> layout withotu having to create a bunch of uAPI to do so. > > > > We're again back to migration compatibility, where again the capability > > layout would be governed by the migration support in the in-kernel > > variant driver. 
Once migration is involved the location of a PASID > > shouldn't be arbitrary, whether it's provided by the kernel or the VMM. > > > > Regardless, the VMM ultimately has the authority what the guest > > sees in config space. The VMM is not bound to expose the PASID at the > > offset provided by the kernel, or bound to expose it at all. The > > kernel exposed PASID can simply provide an available location and set > > of enabled capabilities. Thanks, > > > > Alex > > >
On 2024/4/25 20:58, Alex Williamson wrote: > On Thu, 25 Apr 2024 17:26:54 +0800 > Yi Liu <yi.l.liu@intel.com> wrote: > >> On 2024/4/25 02:24, Alex Williamson wrote: >>> On Tue, 23 Apr 2024 21:12:21 -0300 >>> Jason Gunthorpe <jgg@nvidia.com> wrote: >>> >>>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote: >>>>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>>>> Sent: Tuesday, April 23, 2024 8:02 PM >>>>>> >>>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote: >>>>>>> I'm not sure how userspace can fully handle this w/o certain assistance >>>>>>> from the kernel. >>>>>>> >>>>>>> So I kind of agree that emulated PASID capability is probably the only >>>>>>> contract which the kernel should provide: >>>>>>> - mapped 1:1 at the physical location, or >>>>>>> - constructed at an offset according to DVSEC, or >>>>>>> - constructed at an offset according to a look-up table >>>>>>> >>>>>>> The VMM always scans the vfio pci config space to expose vPASID. >>>>>>> >>>>>>> Then the remaining open is what VMM could do when a VF supports >>>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability >>>>>>> of inspecting the PASID state of PF, probably the only feasible option >>>>>>> is to maintain a look-up table in VMM itself and assumes the kernel >>>>>>> always enables the PASID cap on PF. >>>>>> >>>>>> I'm still not sure I like doing this in the kernel - we need to do the >>>>>> same sort of thing for ATS too, right? >>>>> >>>>> VF is allowed to implement ATS. >>>>> >>>>> PRI has the same problem as PASID. >>>> >>>> I'm surprised by this, I would have guessed ATS would be the device >>>> global one, PRI not being per-VF seems problematic??? How do you >>>> disable PRI generation to get a clean shutdown? >>>> >>>>>> It feels simpler if the indicates if PASID and ATS can be supported >>>>>> and userspace builds the capability blocks. 
>>>>> >>>>> this routes back to Alex's original question about using different >>>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF. >>>> >>>> I'm not sure it is different interfaces.. >>>> >>>> The only reason to pass the PF's PASID cap is to give free space to >>>> the VMM. If we are saying that gaps are free space (excluding a list >>>> of bad devices) then we don't acutally need to do that anymore. >>> >>> Are we saying that now?? That's new. >>> >>>> VMM will always create a synthetic PASID cap and kernel will always >>>> suppress a real one. >>>> >>>> An iommufd query will indicate if the vIOMMU can support vPASID on >>>> that device. >>>> >>>> Same for all the troublesome non-physical caps. >>>> >>>>>> There are migration considerations too - the blocks need to be >>>>>> migrated over and end up in the same place as well.. >>>>> >>>>> Can you elaborate what is the problem with the kernel emulating >>>>> the PASID cap in this consideration? >>>> >>>> If the kernel changes the algorithm, say it wants to do PASID, PRI, >>>> something_new then it might change the layout >>>> >>>> We can't just have the kernel decide without also providing a way for >>>> userspace to say what the right layout actually is. :\ >>> >>> The capability layout is only relevant to migration, right? A variant >>> driver that supports migration is a prerequisite and would also be >>> responsible for exposing the PASID capability. This isn't as disjoint >>> as it's being portrayed. >>> >>>>> Does it talk about a case where the devices between src/dest are >>>>> different versions (but backward compatible) with different unused >>>>> space layout and the kernel approach may pick up different offsets >>>>> while the VMM can guarantee the same offset? 
>>>> That is also a concern where the PCI cap layout may change a bit but
>>>> they are still migration compatible, but my bigger worry is that the
>>>> kernel just lays out the fake caps in a different way because the
>>>> kernel changes.
>>>
>>> Outside of migration, what does it matter if the cap layout is
> ^^^^^^^^^^^^^^^^^^^^
>>> different?  A driver should never hard code the address for a
>>> capability.
>>>
>>
>> But it may store the offset of a capability to make the next cap access
>> more convenient. I noticed struct pci_dev stores the offsets of the PRI
>> and PASID caps. So if the layout of config space changed between src and
>> dst, it may result in a problem in the guest when the guest driver uses
>> the offsets to access the PRI/PASID caps. I can see pci_dev stores
>> offsets of other caps as well (acs, msi, msix). So there is already a
>> problem even putting aside the PRI and PASID caps.
>
> Yes, I had noted "outside of migration" above.  Config space must be
> consistent to a running VM.  But the possibility of config space
> changing like this only exists in the case where the driver supports
> migration, so I think we're inventing an unrealistic concern that a
> driver that supports migration would arbitrarily modify the config space
> layout in order to make an argument for VMM managed layout.  Thanks,

I was considering a case in which the src VM has a v1 hw device, while the
dst VM has a v2 hw device. Due to whatever reasons, the config space
layouts are different between v1 and v2 hw. Since the current vfio copies
the physical layout (except for the hidden caps) to the VMM, the layout
between the src and dst VM would be different. This would result in a
problem since the cap offsets are stale. It seems to be the problem you
mentioned in another email of this thread[1]. :)

[1] "So in either case, the problem might be more along the lines of how
to make a V1 device from a V2 driver, which is more the device
type/flavor/persona problem.
"
https://lore.kernel.org/linux-iommu/20240424141349.376bdbf9.alex.williamson@redhat.com/

> Alex
>
>> #ifdef CONFIG_PCI_PRI
>> 	u16		pri_cap;	/* PRI Capability offset */
>> 	u32		pri_reqs_alloc;	/* Number of PRI requests allocated */
>> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
>> #endif
>> #ifdef CONFIG_PCI_PASID
>> 	u16		pasid_cap;	/* PASID Capability offset */
>> 	u16		pasid_features;
>> #endif
>> #ifdef CONFIG_PCI_P2PDMA
>> 	struct pci_p2pdma __rcu *p2pdma;
>> #endif
>> #ifdef CONFIG_PCI_DOE
>> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
>> #endif
>> 	u16		acs_cap;	/* ACS Capability offset */
>>
>> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
>>
>>>> At least if the VMM is doing this then the VMM can include the
>>>> information in its migration scheme and use it to recreate the PCI
>>>> layout without having to create a bunch of uAPI to do so.
>>>
>>> We're again back to migration compatibility, where again the capability
>>> layout would be governed by the migration support in the in-kernel
>>> variant driver.  Once migration is involved the location of a PASID
>>> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
>>>
>>> Regardless, the VMM ultimately has the authority over what the guest
>>> sees in config space.  The VMM is not bound to expose the PASID at the
>>> offset provided by the kernel, or bound to expose it at all.  The
>>> kernel exposed PASID can simply provide an available location and set
>>> of enabled capabilities.  Thanks,
>>>
>>> Alex
>>>
>>
>
On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote: > This is kind of an absurd example to portray as a ubiquitous problem. > Typically the config space layout is a reflection of hardware whether > the device supports migration or not. Er, all our HW has FW constructed config space. It changes with FW upgrades. We change it during the life of the product. This has to be considered.. > If a driver were to insert a > virtual capability, then yes it would want to be consistent about it if > it also cares about migration. If the driver needs to change the > location of a virtual capability, problems will arise, but that's also > not something that every driver needs to do. Well, mlx5 has to cope with this. It supports so many devices with so many config space layouts :( I don't know if we can just hard wire an offset to stick in a PASID cap and expect that to work... > Also, how exactly does emulating the capability in the VMM solve this > problem? Currently QEMU migration simply applies state to an identical > VM on the target. QEMU doesn't modify the target VM to conform to the > data stream. So in either case, the problem might be more along the > lines of how to make a V1 device from a V2 driver, which is more the > device type/flavor/persona problem. Yes, it doesn't solve anything, it just puts the responsibility for something that is very complicated in userspace where there are more options to configure and customize it to the environment. > Currently QEMU replies on determinism that a given command line results > in an identical machine configuration and identical devices. State of > that target VM is then populated, not defined by, the migration stream. But that won't be true if the kernel is making decisions. The config space layout depends now on the kernel driver version too. > > I think we need to decide, either only the VMM or only the kernel > > should do this. > > What are you actually proposing? 
Okay, what I'm thinking about is a text file that describes the vPCI
function configuration space to create. The community will standardize
this and VMMs will have to implement it to get PASID/etc. Maybe the
community will provide a BSD licensed library to do this job.

The text file allows the operator to specify exactly the configuration
space the VFIO function should have. It would not be derived
automatically from physical. AFAIK qemu does not have this capability
currently.

This reflects my observation and discussions around the live migration
standardization. I believe we are fast reaching a point where this is
required.

Consider standards based migration between wildly different
devices. The devices will not standardize their physical config space,
but an operator could generate a consistent vPCI config space that
works with all the devices in their fleet.

Consider the usual working model of the large operators - they define
instance types with some regularity. But an instance type is fixed in
concrete once it is specified; things like the vPCI config space are
fixed.

Running Instance A on newer hardware with a changed physical config
space should continue to present Instance A's vPCI config layout
regardless. Ie Instance A might not support PASID but Instance B can
run on newer HW that does. The config space layout depends on the
requested Instance Type, not the physical layout.

The auto-configuration of the config layout from physical is a nice
feature and is excellent for development/small scale, but it shouldn't
be the only way to work.

So - if we accept that text file configuration should be something the
VMM supports then let's reconsider how to solve the PASID problem.

I'd say the way to solve it should be via a text file specifying a
full config space layout that includes the PASID cap. From the VMM
perspective this works fine, and it ports to every VMM directly via
processing the text file.
The autoconfiguration use case can be done by making a tool that builds
the text file by deriving it from physical, much like today. The single
instance of that tool could have device specific knowledge to avoid
quirks. This way the smarts can still be shared by all the VMMs
without going into the kernel. Special devices with hidden config
space could get special quirks or special reference text files in
the tool repo.

Serious operators doing production SRIOV/etc would negotiate the text
file with the HW vendors when they define their Instance Type. Ideally
these reference text files would be contributed to the tool repo
above. I think there could be some nice ideas around defining fully open
source Instance Types that include VFIO devices too.

Is it too much of a fantasy?

Jason
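No such file format exists today, so purely as a strawman to make the proposal above concrete, such a description might look like the fragment below (every key name and value here is invented for illustration; the PASID offset is fixed by the Instance Type, not read from the physical device):

```
# Strawman vPCI config-space description -- hypothetical format
[function]
vendor-id = 0x15b3          # part of the Instance Type definition
device-id = 0x101e

[ext-cap.pasid]
offset          = 0x200     # fixed for the life of the Instance Type
version         = 1
max-pasid-width = 20
exec-permission = false
priv-mode       = false
```

A VMM (or the shared library mentioned above) would synthesize the guest-visible config space purely from such a file, regardless of the physical device's layout.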
On Fri, 26 Apr 2024 11:11:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.
>
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

So as I understand it, the concern is that you have firmware that
supports migration, but is also openly hostile to the fundamental
aspects of exposing a stable device ABI in support of migration.

> > If a driver were to insert a
> > virtual capability, then yes it would want to be consistent about it if
> > it also cares about migration.  If the driver needs to change the
> > location of a virtual capability, problems will arise, but that's also
> > not something that every driver needs to do.
>
> Well, mlx5 has to cope with this. It supports so many devices with so
> many config space layouts :( I don't know if we can just hard wire an
> offset to stick in a PASID cap and expect that to work...
>
> > Also, how exactly does emulating the capability in the VMM solve this
> > problem?  Currently QEMU migration simply applies state to an identical
> > VM on the target.  QEMU doesn't modify the target VM to conform to the
> > data stream.  So in either case, the problem might be more along the
> > lines of how to make a V1 device from a V2 driver, which is more the
> > device type/flavor/persona problem.
>
> Yes, it doesn't solve anything, it just puts the responsibility for
> something that is very complicated in userspace where there are more
> options to configure and customize it to the environment.
>
> > Currently QEMU relies on determinism that a given command line results
> > in an identical machine configuration and identical devices.
State of > > that target VM is then populated, not defined by, the migration stream. > > But that won't be true if the kernel is making decisions. The config > space layout depends now on the kernel driver version too. But in the cases where we support migration there's a device specific variant driver that supports that migration. It's the job of that variant driver to not only export and import the device state, but also to provide a consistent ABI to the user, which includes the config space layout. I don't understand why we'd say the device programming ABI itself falls within the purview of the device/variant driver, but PCI config space is defined by device specific code at a higher level. > > > I think we need to decide, either only the VMM or only the kernel > > > should do this. > > > > What are you actually proposing? > > Okay, what I'm thinking about is a text file that describes the vPCI > function configuration space to create. The community will standardize > this and VMMs will have to implement to get PASID/etc. Maybe the > community will provide a BSD licensed library to do this job. > > The text file allows the operator to specify exactly the configuration > space the VFIO function should have. It would not be derived > automatically from physical. AFAIK qemu does not have this capability > currently. > > This reflects my observation and discussions around the live migration > standardization. I belive we are fast reaching a point where this is > required. > > Consider standards based migration between wildly different > devices. The devices will not standardize their physical config space, > but an operator could generate a consistent vPCI config space that > works with all the devices in their fleet. > > Consider the usual working model of the large operators - they define > instance types with some regularity. But an instance type is fixed in > concrete once it is specified, things like the vPCI config space are > fixed. 
> > Running Instance A on newer hardware with a changed physical config > space should continue to present Instance A's vPCI config layout > regardless. Ie Instance A might not support PASID but Instance B can > run on newer HW that does. The config space layout depends on the > requested Instance Type, not the physical layout. > > The auto-configuration of the config layout from physical is a nice > feature and is excellent for development/small scale, but it shouldn't > be the only way to work. > > So - if we accept that text file configuration should be something the > VMM supports then let's reconsider how to solve the PASID problem. > > I'd say the way to solve it should be via a text file specifying a > full config space layout that includes the PASID cap. From the VMM > perspective this works fine, and it ports to every VMM directly via > processing the text file. > > The autoconfiguration use case can be done by making a tool build the > text file by deriving it from physical, much like today. The single > instance of that tool could have device specific knowledge to avoid > quirks. This way the smarts can still be shared by all the VMMs > without going into the kernel. Special devices with hidden config > space could get special quirks or special reference text files into > the tool repo. > > Serious operators doing production SRIOV/etc would negotiate the text > file with the HW vendors when they define their Instance Type. Ideally > these reference text files would be contributed to the tool repo > above. I think there would be some nice idea to define fully open > source Instance Types that include VFIO devices too. Regarding "if we accept that text file configuration should be something the VMM supports", I'm not on board with this yet, so applying it to PASID discussion seems premature. We've developed variant drivers specifically to host the device specific aspects of migration support. 
The requirement of a consistent config space layout is a problem that only exists relative to migration. This is an issue that I would have considered the responsibility of the variant driver, which would likely expect a consistent interface from the hardware/firmware. Why does hostile firmware suddenly make it the VMM's problem to provide a consistent ABI to the config space of the device rather than the variant driver? Obviously config maps are something that a VMM could do, but it also seems to impose a non-trivial burden that every VMM requires an implementation of a config space map and integration for each device rather than simply expecting the exposed config space of the device to be part of the migration ABI. Also this solution specifically only addresses config space compatibility without considering the more generic issue that a variant driver can expose different device personas. A versioned persona and config space virtualization in the variant driver is a much more flexible solution. Thanks, Alex
On Fri, Apr 26, 2024 at 11:11:17AM -0300, Jason Gunthorpe wrote: > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote: > > > This is kind of an absurd example to portray as a ubiquitous problem. > > Typically the config space layout is a reflection of hardware whether > > the device supports migration or not. > > Er, all our HW has FW constructed config space. It changes with FW > upgrades. We change it during the life of the product. This has to be > considered.. FYI, the same is true for just about any storage device I know. Maybe not all of the config space, but definitely parts of it.
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, April 27, 2024 4:14 AM > > On Fri, 26 Apr 2024 11:11:17 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote: > > > > > This is kind of an absurd example to portray as a ubiquitous problem. > > > Typically the config space layout is a reflection of hardware whether > > > the device supports migration or not. > > > > Er, all our HW has FW constructed config space. It changes with FW > > upgrades. We change it during the life of the product. This has to be > > considered.. > > So as I understand it, the concern is that you have firmware that > supports migration, but it also openly hostile to the fundamental > aspects of exposing a stable device ABI in support of migration. > > > > If a driver were to insert a > > > virtual capability, then yes it would want to be consistent about it if > > > it also cares about migration. If the driver needs to change the > > > location of a virtual capability, problems will arise, but that's also > > > not something that every driver needs to do. > > > > Well, mlx5 has to cope with this. It supports so many devices with so > > many config space layouts :( I don't know if we can just hard wire an > > offset to stick in a PASID cap and expect that to work... Are those config space layout differences usually also coming with mmio-side interface change? If yes there are more to handle for running V1 instance on V2 device and it'd make sense to manage everything about compatibility in one place. If we pursue the direction deciding the vconfig layout in VMM, does it imply that anything related to mmio layout would also be put in VMM too? e.g. it's not unusual to see a device mmio layout as: REG_BASE: base addr of a register block in a BAR REG_FEAT1_OFF: feature1 regs offset in the register block REG_FEAT2_OFF: feature2 regs offset in the register block ... 
Driver accesses registers according to those read-only offsets. A FW upgrade may lead to change of offsets but functions stay the same. An instance created on an old version can be migrated to the new version as long as accesses to old offsets are trapped and routed to the new offsets. Do we envision things like above in the variant driver or in VMM? > > > > > > What are you actually proposing? > > > > Okay, what I'm thinking about is a text file that describes the vPCI > > function configuration space to create. The community will standardize > > this and VMMs will have to implement to get PASID/etc. Maybe the > > community will provide a BSD licensed library to do this job. > > > > The text file allows the operator to specify exactly the configuration > > space the VFIO function should have. It would not be derived > > automatically from physical. AFAIK qemu does not have this capability > > currently. > > > > This reflects my observation and discussions around the live migration > > standardization. I belive we are fast reaching a point where this is > > required. > > > > Consider standards based migration between wildly different > > devices. The devices will not standardize their physical config space, > > but an operator could generate a consistent vPCI config space that > > works with all the devices in their fleet. It's hard to believe that 'wildly different' devices only have difference in the layout of vPCI config space. > > > > Consider the usual working model of the large operators - they define > > instance types with some regularity. But an instance type is fixed in > > concrete once it is specified, things like the vPCI config space are > > fixed. > > > > Running Instance A on newer hardware with a changed physical config > > space should continue to present Instance A's vPCI config layout > > regardless. Ie Instance A might not support PASID but Instance B can > > run on newer HW that does. 
The config space layout depends on the > > requested Instance Type, not the physical layout. > > > > The auto-configuration of the config layout from physical is a nice > > feature and is excellent for development/small scale, but it shouldn't > > be the only way to work. > > > > So - if we accept that text file configuration should be something the > > VMM supports then let's reconsider how to solve the PASID problem. > > > > I'd say the way to solve it should be via a text file specifying a > > full config space layout that includes the PASID cap. From the VMM > > perspective this works fine, and it ports to every VMM directly via > > processing the text file. > > > > The autoconfiguration use case can be done by making a tool build the > > text file by deriving it from physical, much like today. The single > > instance of that tool could have device specific knowledge to avoid > > quirks. This way the smarts can still be shared by all the VMMs > > without going into the kernel. Special devices with hidden config > > space could get special quirks or special reference text files into > > the tool repo. > > > > Serious operators doing production SRIOV/etc would negotiate the text > > file with the HW vendors when they define their Instance Type. Ideally > > these reference text files would be contributed to the tool repo > > above. I think there would be some nice idea to define fully open > > source Instance Types that include VFIO devices too. > > Regarding "if we accept that text file configuration should be > something the VMM supports", I'm not on board with this yet, so > applying it to PASID discussion seems premature. > > We've developed variant drivers specifically to host the device specific > aspects of migration support. The requirement of a consistent config > space layout is a problem that only exists relative to migration. 
This
> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?
>
> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  Also this solution specifically only
> addresses config space compatibility without considering the more
> generic issue that a variant driver can expose different device
> personas.  A versioned persona and config space virtualization in the
> variant driver is a much more flexible solution.  Thanks,
>

And it looks like this community lacks a clear criterion on what burden
should be put in the kernel vs. in the VMM.

e.g. in the earlier nvgrace-gpu discussion a major open was whether
the PCI BAR emulation should be done by the variant driver or
by the VMM (with the variant driver providing a device feature).

It ended up being in the variant driver, with one major argument
that doing so avoids the burden in various VMMs.

But now the 'text-file' proposal seems to head in the opposite direction?

btw while this discussion may continue some time, I wonder whether
this vPASID reporting open can be handled separately from the
pasid attach/detach series so we can move the ball and merge
something already in agreement. anyway it's just a read-only cap so
won't affect how VFIO/IOMMUFD handles the pasid related requests.

Thanks
Kevin
On 2024/4/28 14:19, Tian, Kevin wrote: >> From: Alex Williamson <alex.williamson@redhat.com> >> Sent: Saturday, April 27, 2024 4:14 AM >> >> On Fri, 26 Apr 2024 11:11:17 -0300 >> Jason Gunthorpe <jgg@nvidia.com> wrote: >> >>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote: >>> >>>> This is kind of an absurd example to portray as a ubiquitous problem. >>>> Typically the config space layout is a reflection of hardware whether >>>> the device supports migration or not. >>> >>> Er, all our HW has FW constructed config space. It changes with FW >>> upgrades. We change it during the life of the product. This has to be >>> considered.. >> >> So as I understand it, the concern is that you have firmware that >> supports migration, but it also openly hostile to the fundamental >> aspects of exposing a stable device ABI in support of migration. >> >>>> If a driver were to insert a >>>> virtual capability, then yes it would want to be consistent about it if >>>> it also cares about migration. If the driver needs to change the >>>> location of a virtual capability, problems will arise, but that's also >>>> not something that every driver needs to do. >>> >>> Well, mlx5 has to cope with this. It supports so many devices with so >>> many config space layouts :( I don't know if we can just hard wire an >>> offset to stick in a PASID cap and expect that to work... > > Are those config space layout differences usually also coming with > mmio-side interface change? If yes there are more to handle for > running V1 instance on V2 device and it'd make sense to manage > everything about compatibility in one place. > > If we pursue the direction deciding the vconfig layout in VMM, does > it imply that anything related to mmio layout would also be put in > VMM too? > > e.g. 
it's not unusual to see a device mmio layout as: > > REG_BASE: base addr of a register block in a BAR > REG_FEAT1_OFF: feature1 regs offset in the register block > REG_FEAT2_OFF: feature2 regs offset in the register block > ... Should we also consider the possibility of vBAR address difference between the src and dst? It should be impossible that the guest driver mmaps vBAR again if it has already been loaded. > Driver accesses registers according to those read-only offsets. > > A FW upgrade may lead to change of offsets but functions stay the > same. An instance created on an old version can be migrated to > the new version as long as accesses to old offsets are trapped > and routed to the new offsets. > > Do we envision things like above in the variant driver or in VMM? > >>>> >>>> What are you actually proposing? >>> >>> Okay, what I'm thinking about is a text file that describes the vPCI >>> function configuration space to create. The community will standardize >>> this and VMMs will have to implement to get PASID/etc. Maybe the >>> community will provide a BSD licensed library to do this job. >>> >>> The text file allows the operator to specify exactly the configuration >>> space the VFIO function should have. It would not be derived >>> automatically from physical. AFAIK qemu does not have this capability >>> currently. >>> >>> This reflects my observation and discussions around the live migration >>> standardization. I belive we are fast reaching a point where this is >>> required. >>> >>> Consider standards based migration between wildly different >>> devices. The devices will not standardize their physical config space, >>> but an operator could generate a consistent vPCI config space that >>> works with all the devices in their fleet. > > It's hard to believe that 'wildly different' devices only have difference > in the layout of vPCI config space. 
> >>> >>> Consider the usual working model of the large operators - they define >>> instance types with some regularity. But an instance type is fixed in >>> concrete once it is specified, things like the vPCI config space are >>> fixed. >>> >>> Running Instance A on newer hardware with a changed physical config >>> space should continue to present Instance A's vPCI config layout >>> regardless. Ie Instance A might not support PASID but Instance B can >>> run on newer HW that does. The config space layout depends on the >>> requested Instance Type, not the physical layout. >>> >>> The auto-configuration of the config layout from physical is a nice >>> feature and is excellent for development/small scale, but it shouldn't >>> be the only way to work. >>> >>> So - if we accept that text file configuration should be something the >>> VMM supports then let's reconsider how to solve the PASID problem. >>> >>> I'd say the way to solve it should be via a text file specifying a >>> full config space layout that includes the PASID cap. From the VMM >>> perspective this works fine, and it ports to every VMM directly via >>> processing the text file. >>> >>> The autoconfiguration use case can be done by making a tool build the >>> text file by deriving it from physical, much like today. The single >>> instance of that tool could have device specific knowledge to avoid >>> quirks. This way the smarts can still be shared by all the VMMs >>> without going into the kernel. Special devices with hidden config >>> space could get special quirks or special reference text files into >>> the tool repo. >>> >>> Serious operators doing production SRIOV/etc would negotiate the text >>> file with the HW vendors when they define their Instance Type. Ideally >>> these reference text files would be contributed to the tool repo >>> above. I think there would be some nice idea to define fully open >>> source Instance Types that include VFIO devices too. 
>> >> Regarding "if we accept that text file configuration should be >> something the VMM supports", I'm not on board with this yet, so >> applying it to PASID discussion seems premature. >> >> We've developed variant drivers specifically to host the device specific >> aspects of migration support. The requirement of a consistent config >> space layout is a problem that only exists relative to migration. This >> is an issue that I would have considered the responsibility of the >> variant driver, which would likely expect a consistent interface from >> the hardware/firmware. Why does hostile firmware suddenly make it the >> VMM's problem to provide a consistent ABI to the config space of the >> device rather than the variant driver? >> >> Obviously config maps are something that a VMM could do, but it also >> seems to impose a non-trivial burden that every VMM requires an >> implementation of a config space map and integration for each device >> rather than simply expecting the exposed config space of the device to >> be part of the migration ABI. Also this solution specifically only >> addresses config space compatibility without considering the more >> generic issue that a variant driver can expose different device >> personas. A versioned persona and config space virtualization in the >> variant driver is a much more flexible solution. Thanks, >> > > and looks this community lacks of a clear criteria on what burden > should be put in the kernel vs. in the VMM. > > e.g. in earlier nvgrace-gpu discussion a major open was whether > the PCI bar emulation should be done by the variant driver or > by the VMM (with variant driver providing a device feature). > > It ends up to be in the variant driver with one major argument > that doing so avoids the burden in various VMMs. > > But now seems the 'text-file' proposal heads the opposite direction? 
> > btw while this discussion may continue some time, I wonder whether > this vPASID reporting open can be handled separately from the > pasid attach/detach series so we can move the ball and merge > something already in agreement. anyway it's just a read-only cap so > won't affect how VFIO/IOMMUFD handles the pasid related requests. > > Thanks > Kevin >
On Sun, Apr 28, 2024 at 06:19:29AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 27, 2024 4:14 AM
> >
> > On Fri, 26 Apr 2024 11:11:17 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > >
> > > > This is kind of an absurd example to portray as a ubiquitous problem. Typically the config space layout is a reflection of hardware whether the device supports migration or not.
> > >
> > > Er, all our HW has FW constructed config space. It changes with FW upgrades. We change it during the life of the product. This has to be considered..
> >
> > So as I understand it, the concern is that you have firmware that supports migration, but it is also openly hostile to the fundamental aspects of exposing a stable device ABI in support of migration.
> >
> > > > If a driver were to insert a virtual capability, then yes it would want to be consistent about it if it also cares about migration. If the driver needs to change the location of a virtual capability, problems will arise, but that's also not something that every driver needs to do.
> > >
> > > Well, mlx5 has to cope with this. It supports so many devices with so many config space layouts :( I don't know if we can just hard wire an offset to stick in a PASID cap and expect that to work...
>
> Are those config space layout differences usually also coming with mmio-side interface changes?

Not really

> If yes there are more to handle for running a V1 instance on a V2 device and it'd make sense to manage everything about compatibility in one place.

It's complicated :|

The config space layout can't change once the device is discovered by the OS, and I have a feeling it can't differ between VFs in SR-IOV. So even if we did say that a device HW/FW will change its personality on the fly, I'm not sure we actually can and still conform to the PCI specs?

> If we pursue the direction of deciding the vconfig layout in the VMM, does it imply that anything related to the mmio layout would also be put in the VMM too?

I'd guess no, MMIO doesn't have the OS and standards entanglements. If the FW can reprogram the MMIO layout then when it loads the migration state, or provisions the device, it should put the MMIO to the right configuration.

> Do we envision things like above in the variant driver or in VMM?

In the device itself. I think it would be an extreme case for someone to make a device that had a different MMIO layout that could be backwards compatible to an old layout and NOT provide a way for the device HW to go back to the old layout directly. If so that device would have to mediate MMIO in the VMM or kernel. Depends on the complexity, I suppose, where to do it.

> > > Consider standards based migration between wildly different devices. The devices will not standardize their physical config space, but an operator could generate a consistent vPCI config space that works with all the devices in their fleet.
>
> It's hard to believe that 'wildly different' devices only have differences in the layout of the vPCI config space.

Well, I mean a device from vendor A and another device from vendor B. They don't have anything in common except implementing the standard. The standard prescribes the MMIO layout. Both devices have a way to accept a migration stream defined by a standard. The MMIO will be adjusted as the standard requires after loading the migration stream. Config space primarily remains a problem in this imagined world.

My best view is that it needs to be solved by config space synthesis in the hypervisor and not via HW/FW on the physical device.

> and it looks like this community lacks a clear criteria on what burden should be put in the kernel vs. in the VMM.

Right, we've always discussed this. I've fairly consistently wanted to push things to userspace, principally for security.

> e.g. in the earlier nvgrace-gpu discussion a major open was whether the PCI bar emulation should be done by the variant driver or by the VMM (with the variant driver providing a device feature).

Sort of, this was actually also about config space synthesis. If we had this kind of scheme then perhaps the argument would be different.

> It ends up being in the variant driver with one major argument that doing so avoids the burden in various VMMs.

Yes, that has always been the argument that has won out.

> btw while this discussion may continue some time, I wonder whether this vPASID reporting open can be handled separately from the pasid attach/detach series so we can move the ball and merge something already in agreement. anyway it's just a read-only cap so won't affect how VFIO/IOMMUFD handles the pasid related requests.

Yes, this makes sense to me. Ultimately we all agree the config space will have to be synthesized.

Jason
On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> >
> > > This is kind of an absurd example to portray as a ubiquitous problem. Typically the config space layout is a reflection of hardware whether the device supports migration or not.
> >
> > Er, all our HW has FW constructed config space. It changes with FW upgrades. We change it during the life of the product. This has to be considered..
>
> So as I understand it, the concern is that you have firmware that supports migration, but it is also openly hostile to the fundamental aspects of exposing a stable device ABI in support of migration.

Well, that makes it sound rude, but yes that is part of it.

mlx5 is tremendously FW defined. The FW can only cope with migration in some limited cases today. Making that compatibility bigger is ongoing work.

Config space is one of the areas that has not been addressed. Currently things are such that the FW won't support migration in combinations that have different physical config space - so it is not a problem.

But, in principle, it is an issue. AFAIK, the only complete solution is for the hypervisor to fully synthesize a stable config space.

So, if we keep this in the kernel then I'd imagine the kernel will need to grow some shared infrastructure to fully synthesize the config space - not text file based, but basically the same as what I described for the VMM.

> > But that won't be true if the kernel is making decisions. The config space layout depends now on the kernel driver version too.
>
> But in the cases where we support migration there's a device specific variant driver that supports that migration. It's the job of that variant driver to not only export and import the device state, but also to provide a consistent ABI to the user, which includes the config space layout.

Yes, we could do that, but I'm not sure how it will work in all cases.

> I don't understand why we'd say the device programming ABI itself falls within the purview of the device/variant driver, but PCI config space is defined by device specific code at a higher level.

The "device programming ABI" doesn't contain any *policy*. The layout of the config space is like 50% policy. Especially when we start to talk about standards defined migration. The standards will set the "device programming ABI" and maybe even specify the migration stream. They will not, and arguably can not, specify the config space.

Config space layout is substantially policy of the instance type. Even little things like the vendor IDs can be meaningfully replaced in VMs.

> Regarding "if we accept that text file configuration should be something the VMM supports", I'm not on board with this yet, so applying it to PASID discussion seems premature.

Sure, I'm just explaining a way this could all fit together.

> We've developed variant drivers specifically to host the device specific aspects of migration support. The requirement of a consistent config space layout is a problem that only exists relative to migration.

Well, I wouldn't go quite so far. Arguably even non-migratable instance types may want to adjust their config space. Eg if I'm using a DPU and I get an NVMe/Virtio PCI function I may want to scrub out details from the config space to make it more general. Even without migration.

This already happens today in places like VDPA which completely replace the underlying config space in some cases.

I see it as a difference from a world of highly constrained "instance types" and a more ad hoc world.

> is an issue that I would have considered the responsibility of the variant driver, which would likely expect a consistent interface from the hardware/firmware. Why does hostile firmware suddenly make it the VMM's problem to provide a consistent ABI to the config space of the device rather than the variant driver?

It is not "hostile firmware"! It is accepting that a significant aspect of the config layout is actually policy.

Plus the standards limitations that mean we can't change the config space on the fly make it pretty much impossible for the device to actually do anything to help here. Software must fix the config space.

> Obviously config maps are something that a VMM could do, but it also seems to impose a non-trivial burden that every VMM requires an implementation of a config space map and integration for each device rather than simply expecting the exposed config space of the device to be part of the migration ABI.

Well, the flip is true too: it is a lot of burden on every variant device driver, and on the kernel in general, to manage config space policy on behalf of the VMM.

My point is: if the VMM is already going to be forced to manage config space policy for other good reasons, are we sure we want to put a bunch of stuff in the kernel that sometimes won't be used?

> Also this solution specifically only addresses config space compatibility without considering the more generic issue that a variant driver can expose different device personas. A versioned persona and config space virtualization in the variant driver is a much more flexible solution.

It is addressed; the different personas would have their own text file maps. The target VMM would have to load the right map. Shared common code across all the variant drivers.

Jason
Hi Alex, Jason,

We didn't get any conclusion on the vPASID cap last time. Shall we have more alignment on it? +Vasant as well.

On 2024/4/30 01:44, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem. Typically the config space layout is a reflection of hardware whether the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW upgrades. We change it during the life of the product. This has to be considered..
>>
>> So as I understand it, the concern is that you have firmware that supports migration, but it is also openly hostile to the fundamental aspects of exposing a stable device ABI in support of migration.
>
> Well, that makes it sound rude, but yes that is part of it.
>
> mlx5 is tremendously FW defined. The FW can only cope with migration in some limited cases today. Making that compatibility bigger is ongoing work.
>
> Config space is one of the areas that has not been addressed. Currently things are such that the FW won't support migration in combinations that have different physical config space - so it is not a problem.
>
> But, in principle, it is an issue. AFAIK, the only complete solution is for the hypervisor to fully synthesize a stable config space.
>
> So, if we keep this in the kernel then I'd imagine the kernel will need to grow some shared infrastructure to fully synthesize the config space - not text file based, but basically the same as what I described for the VMM.
>
>>> But that won't be true if the kernel is making decisions. The config space layout depends now on the kernel driver version too.
>>
>> But in the cases where we support migration there's a device specific variant driver that supports that migration. It's the job of that variant driver to not only export and import the device state, but also to provide a consistent ABI to the user, which includes the config space layout.
>
> Yes, we could do that, but I'm not sure how it will work in all cases.
>
>> I don't understand why we'd say the device programming ABI itself falls within the purview of the device/variant driver, but PCI config space is defined by device specific code at a higher level.
>
> The "device programming ABI" doesn't contain any *policy*. The layout of the config space is like 50% policy. Especially when we start to talk about standards defined migration. The standards will set the "device programming ABI" and maybe even specify the migration stream. They will not, and arguably can not, specify the config space.
>
> Config space layout is substantially policy of the instance type. Even little things like the vendor IDs can be meaningfully replaced in VMs.
>
>> Regarding "if we accept that text file configuration should be something the VMM supports", I'm not on board with this yet, so applying it to PASID discussion seems premature.
>
> Sure, I'm just explaining a way this could all fit together.
>
>> We've developed variant drivers specifically to host the device specific aspects of migration support. The requirement of a consistent config space layout is a problem that only exists relative to migration.
>
> Well, I wouldn't go quite so far. Arguably even non-migratable instance types may want to adjust their config space. Eg if I'm using a DPU and I get an NVMe/Virtio PCI function I may want to scrub out details from the config space to make it more general. Even without migration.
>
> This already happens today in places like VDPA which completely replace the underlying config space in some cases.
>
> I see it as a difference from a world of highly constrained "instance types" and a more ad hoc world.
>
>> is an issue that I would have considered the responsibility of the variant driver, which would likely expect a consistent interface from the hardware/firmware. Why does hostile firmware suddenly make it the VMM's problem to provide a consistent ABI to the config space of the device rather than the variant driver?
>
> It is not "hostile firmware"! It is accepting that a significant aspect of the config layout is actually policy.
>
> Plus the standards limitations that mean we can't change the config space on the fly make it pretty much impossible for the device to actually do anything to help here. Software must fix the config space.
>
>> Obviously config maps are something that a VMM could do, but it also seems to impose a non-trivial burden that every VMM requires an implementation of a config space map and integration for each device rather than simply expecting the exposed config space of the device to be part of the migration ABI.
>
> Well, the flip is true too: it is a lot of burden on every variant device driver, and on the kernel in general, to manage config space policy on behalf of the VMM.
>
> My point is: if the VMM is already going to be forced to manage config space policy for other good reasons, are we sure we want to put a bunch of stuff in the kernel that sometimes won't be used?
>
>> Also this solution specifically only addresses config space compatibility without considering the more generic issue that a variant driver can expose different device personas. A versioned persona and config space virtualization in the variant driver is a much more flexible solution.
>
> It is addressed; the different personas would have their own text file maps. The target VMM would have to load the right map. Shared common code across all the variant drivers.
>
> Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 30, 2024 1:45 AM
>
> On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > Regarding "if we accept that text file configuration should be something the VMM supports", I'm not on board with this yet, so applying it to PASID discussion seems premature.
>
> Sure, I'm just explaining a way this could all fit together.

Thinking more along this direction.

I'm not sure how long it will take to standardize such text files and share them across VMMs. We may need a way to move in steps in Qemu to unblock the kernel development toward that end goal, e.g. first accepting a pasid option plus a user-specified offset (if the offset is unspecified then auto-pick one in cap holes). Later when the text file is ready then such per-cap options can be deprecated.

This simple way won't fix the migration issue, but at least it's on par with physical caps (i.e. fail the migration if the offset mismatches between dest/src) and both will be fixed when the text file model is ready.

Then look at what uAPI is required to report the vPASID cap.

In earlier discussion it's leaning toward extending GET_HW_INFO in iommufd, given both iommu/pci support are required to get PASID working and the iommu driver will not report such support until pasid has been enabled in both iommu/pci. With that there is no need to further report PASID in vfio-pci.

But there may be other caps which are shared between VF and PF while having nothing to do with the iommu, e.g. the Device Serial Number extended cap (permitted but not recommended in VF). If there is a need to report such a cap on a VF which doesn't implement it to userspace, a vfio uAPI (device_feature or a new one dedicated to synthetic vcaps) appears to be inevitable.

So I wonder whether we leave this part untouched until a real demand comes, or use vpasid to formalize that uAPI to be forward looking. If in the end such a uAPI will exist then it's a bit weird to have PASID escape it (especially when vfio-pci already reports PRI/ATS, which have iommu dependencies too, in vconfig).

In concept the Qemu logic will be clearer if any PCI cap (real or synthesized) is always conveyed via vfio-pci while iommufd is for identifying a viommu cap.

Thanks
Kevin
On Wed, 24 Jul 2024 02:26:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 30, 2024 1:45 AM
> >
> > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> > > Regarding "if we accept that text file configuration should be something the VMM supports", I'm not on board with this yet, so applying it to PASID discussion seems premature.
> >
> > Sure, I'm just explaining a way this could all fit together.
>
> Thinking more along this direction.
>
> I'm not sure how long it will take to standardize such text files and share them across VMMs. We may need a way to move in steps in Qemu to unblock the kernel development toward that end goal, e.g. first accepting a pasid option plus a user-specified offset (if the offset is unspecified then auto-pick one in cap holes). Later when the text file is ready then such per-cap options can be deprecated.

Planned obsolescence is a hard sell.

> This simple way won't fix the migration issue, but at least it's on par with physical caps (i.e. fail the migration if the offset mismatches between dest/src) and both will be fixed when the text file model is ready.
>
> Then look at what uAPI is required to report the vPASID cap.
>
> In earlier discussion it's leaning toward extending GET_HW_INFO in iommufd, given both iommu/pci support are required to get PASID working and the iommu driver will not report such support until pasid has been enabled in both iommu/pci. With that there is no need to further report PASID in vfio-pci.
>
> But there may be other caps which are shared between VF and PF while having nothing to do with the iommu, e.g. the Device Serial Number extended cap (permitted but not recommended in VF). If there is a need to report such a cap on a VF which doesn't implement it to userspace, a vfio uAPI (device_feature or a new one dedicated to synthetic vcaps) appears to be inevitable.
>
> So I wonder whether we leave this part untouched until a real demand comes, or use vpasid to formalize that uAPI to be forward looking. If in the end such a uAPI will exist then it's a bit weird to have PASID escape it (especially when vfio-pci already reports PRI/ATS, which have iommu dependencies too, in vconfig).
>
> In concept the Qemu logic will be clearer if any PCI cap (real or synthesized) is always conveyed via vfio-pci while iommufd is for identifying a viommu cap.

There are so many moving pieces here and the discussion trailed off a long time ago. I have trouble keeping all the relevant considerations in my head, so let me try to enumerate them, please correct/add.

- The PASID capability cannot be implemented on VFs per the PCIe spec. All VFs share the PF PASID configuration. This also implies that the VF PASID capability is essentially emulated since the VF driver cannot manipulate the PF PASID directly.

- VFIO does not currently expose the PASID capability for PFs, nor does anything construct a vPASID capability for VFs.

- The PASID capability is only useful in combination with a vIOMMU with PASID support, which does not yet exist in QEMU.

- Some devices are known to place registers in configuration space, outside of the capability chains, which historically makes it difficult to place a purely virtual capability without potentially masking such hidden registers. Current virtual capabilities are placed at vendor defined fixed locations to avoid conflicts.

- There is some expectation that otherwise compatible devices may not present identical capability chains, for example devices running different firmware or devices from different vendors implementing a standard register ABI (virtio) where capability chain layout is not standardized.

- There have been arguments that the layout of device capabilities is a policy choice, where both the kernel and libvirt traditionally try to avoid making policy decisions.

- Seamless live migration of devices requires that configuration space remains at least consistent, if not identical for much of it. Capability offsets cannot change during live migration. This leads to the text file reference above, which is essentially just the notion that if the VMM defines the capability layout in config space, it would need to do so via a static reference, independent of the layout of the physical device, and we might want to share that among multiple VMMs.

- For a vfio-pci device to support live migration it must be enabled to do so by a vfio-pci variant driver.

- We've discussed in the community and seem to have a consensus that a DVSEC (Designated Vendor Specific Extended Capability) could be defined to describe unused configuration space. Such a DVSEC could be implemented natively by the device or supplied by a vfio-pci variant driver. There is currently no definition of such a DVSEC.

So what are we trying to accomplish here? PASID is the first non-device-specific virtual capability that we'd like to insert into the VM view of the capability chain. It won't be the last.

- Do we push the policy of defining the capability offset to the user?

- Do we do some hand waving that devices supporting PASID shouldn't have hidden registers and therefore the VMM can simply find a gap?

- Do we ask the hardware vendor or variant driver to insert a DVSEC to identify available config space?

- Do we handle this as just another device quirk, where we maintain a table of supported devices and the vPASID offset for each?

- Do we consider this an inflection point where the VMM entirely takes over the layout of the capability spaces to impose a stable migration layout? On what basis do we apply that inflection?

- Also, do we require the same policy for both standard and extended capability chains?

I understand the desire to make some progress, but QEMU relies on integration with management tools, so a temporary option for a user to specify a PASID offset in isolation sounds like a non-starter to me.

This might be a better sell if the user interface allowed fully defining the capability chain layout from the command line, and this interface would continue to exist and supersede how the VMM might otherwise define the capability chain when used. A fully user defined layout would be complicated though, so I think there would still be a desire for QEMU to consume or define a consistent policy itself.

Even if QEMU defines the layout for a device, there may be multiple versions of that device. For example, maybe we just add PASID now, but at some point we decide that we do want to replicate the PF serial number capability. At that point we have versions of the device which would need to be tied to versions of the machine and maybe also selected via a profile switch on the device command line.

If we want to simplify this, maybe we do just look at whether the vIOMMU is configured for PASID support and if the device supports it, then we just look for a gap and add the capability. If we end up with different results between source and target for migration, then migration will fail. Possibly we end up with a quirk table to override the default placement of specific capabilities on specific devices. That might evolve into a lookup for where we place all capabilities, which essentially turns into the "file" where the VMM defines the entire layout for some devices.

This is already TL;DR, so I'll end with that before I further drown the possibility of discussion. Thanks,

Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, July 31, 2024 1:35 AM > > On Wed, 24 Jul 2024 02:26:20 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, April 30, 2024 1:45 AM > > > > > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote: > > > > Regarding "if we accept that text file configuration should be > > > > something the VMM supports", I'm not on board with this yet, so > > > > applying it to PASID discussion seems premature. > > > > > > Sure, I'm just explaining a way this could all fit together. > > > > > > > Thinking more along this direction. > > > > I'm not sure how long it will take to standardize such text files and > > share them across VMMs. We may need a way to move in steps in > > Qemu to unblock the kernel development toward that end goal, e.g. > > first accepting a pasid option plus user-specified offset (if offset > > unspecified then auto-pick one in cap holes). Later when the text > > file is ready then such per-cap options can be deprecated. > > Planned obsolescence is a hard sell. > > > This simple way won't fix the migration issue, but at least it's on > > par with physical caps (i.e. fail the migration if offset mismatched > > between dest/src) and both will be fixed when the text file model > > is ready. > > > > Then look at what uAPI is required to report the vPASID cap. > > > > In earlier discussion it's leaning toward extending GET_HW_INFO > > in iommufd given both iommu/pci support are required to get > > PASID working and iommu driver will not report such support until > > pasid has been enabled in both iommu/pci. With that there is no > > need to further report PASID in vfio-pci. > > > > But there may be other caps which are shared between VF and > > PF while having nothing to do with the iommu. e.g. the Device > > Serial Number extended cap (permitted but not recommended > > in VF). 
If there is a need to report such cap on VF which doesn't > > implement it to userspace, a vfio uAPI (device_feature or a new > > one dedicated to synthetical vcap) appears to be inevitable. > > > > So I wonder whether we leave this part untouched until a real > > demand comes or use vpasid to formalize that uAPI to be forward > > looking. If in the end such uAPI will exist then it's a bit weird to > > have PASID escaped (especially when vfio-pci already reports > > PRI/ATS which have iommu dependency too in vconfig). > > > > In concept the Qemu logic will be clearer if any PCI caps (real > > or synthesized) is always conveyed via vfio-pci while iommufd is > > for identifying a viommu cap. > > There are so many moving pieces here and the discussion trailed off a > long time ago. I have trouble keeping all the relevant considerations > in my head, so let me try to enumerate them, please correct/add. Thanks for the summary! > > - The PASID capability cannot be implemented on VFs per the PCIe spec. > All VFs share the PF PASID configuration. This also implies that > the VF PASID capability is essentially emulated since the VF driver > cannot manipulate the PF PASID directly. > > - VFIO does not currently expose the PASID capability for PFs, nor > does anything construct a vPASID capability for VFs. > > - The PASID capability is only useful in combination with a vIOMMU > with PASID support, which does not yet exist in QEMU. > > - Some devices are known to place registers in configuration space, > outside of the capability chains, which historically makes it > difficult to place a purely virtual capability without potentially > masking such hidden registers. Current virtual capabilities are > placed at vendor defined fixed locations to avoid conflicts. 
> > - There is some expectation that otherwise compatible devices may > not present identical capability chains, for example devices running > different firmware or devices from different vendors implementing a > standard register ABI (virtio) where capability chain layout is not > standardized. > > - There have been arguments that the layout of device capabilities is > a policy choice, where both the kernel and libvirt traditionally try > to avoid making policy decisions. > > - Seamless live migration of devices requires that configuration space > remains at least consistent, if not identical for much of it. I didn't quite get it. I thought being consistent means fully identical config space from guest p.o.v. > Capability offsets cannot change during live migration. This leads > to the text file reference above, which is essentially just the > notion that if the VMM defines the capability layout in config > space, it would need to do so via a static reference, independent of > the layout of the physical device and we might want to share that > among multiple VMMs. > > - For a vfio-pci device to support live migration it must be enabled > to do so by a vfio-pci variant driver. > > - We've discussed in the community and seem to have a consensus that a > DVSEC (Designated Vendor Specific Extended Capability) could be > defined to describe unused configuration space. Such a DVSEC could > be implemented natively by the device or supplied by a vfio-pci > variant driver. There is currently no definition of such a DVSEC. I'm not sure whether DVSEC is still that necessary if the direction is to go userspace-defined layout. In a synthetic world the unused physical space doesn't really matter. So this consensus IMHO was better placed under the umbrella of the other direction having the kernel define the layout. > > So what are we trying to accomplish here. 
PASID is the first > non-device specific virtual capability that we'd like to insert into > the VM view of the capability chain. It won't be the last. > > - Do we push the policy of defining the capability offset to the user? Looks yes as I didn't see a strong argument for the opposite way. > > - Do we do some hand waving that devices supporting PASID shouldn't > have hidden registers and therefore the VMM can simply find a gap? I assume 'handwaving' doesn't mean any measure in code to actually block those devices (as doing so likely requires certain denylist based on device/vendor ID but then why not going a step further to also hard code an offset?). It's more a try-and-fail model where vPASID is opted in via a cmdline parameter then a device with hidden registers may misbehave if the VMM happens to find a conflict gap. And the impact is restricted only to a new setup where the user is interested in PASID to opt hence can afford diagnostics effort to figure out the restriction. > > - Do we ask the hardware vendor or variant driver to insert a DVSEC to > identify available config space? As said I don't think it's necessary if leaving the policy to the user > > - Do we handle this as just another device quirk, where we maintain a > table of supported devices and vPASID offset for each? > > - Do we consider this an inflection point where the VMM entirely takes > over the layout of the capability spaces to impose a stable > migration layout? On what basis do we apply that inflection? > > - Also, do we require the same policy for both standard and extended > capability chains? suppose yes. > > I understand the desire to make some progress, but QEMU relies on > integration with management tools, so a temporary option for a user to > specify a PASID offset in isolation sounds like a non-starter to me. 
> > This might be a better sell if the user interface allowed fully > defining the capability chain layout from the command line and this > interface would continue to exist and supersede how the VMM might > otherwise define the capability chain when used. A fully user defined > layout would be complicated though, so I think there would still be a > desire for QEMU to consume or define a consistent policy itself. > > Even if QEMU defines the layout for a device, there may be multiple > versions of that device. For example, maybe we just add PASID now, but > at some point we decide that we do want to replicate the PF serial > number capability. At that point we have versions of the device which > would need to be tied to versions of the machine and maybe also > selected via a profile switch on the device command line. > > If we want to simplify this, maybe we do just look at whether the > vIOMMU is configured for PASID support and if the device supports it, and this is related to the open which I raised in last mail - whether we want to report the PASID support both in iommufd and vfio-pci uAPI. My impression is yes as there may be requirement of exposing a virtual capability which doesn't rely on the IOMMU. > then we just look for a gap and add the capability. If we end up with > different results between source and target for migration, then > migration will fail. Possibly we end up with a quirk table to override > the default placement of specific capabilities on specific devices. emm how does a quirk table work with devices having volatile config space layout cross FW versions? Can VMM assigned with a VF be able to check the FW version of the PF? > That might evolve into a lookup for where we place all capabilities, > which essentially turns into the "file" where the VMM defines the entire > layout for some devices. Overall this sounds a feasible path to move forward - starting with the VMM to find the gap automatically if a new PASID option is opted in. 
Devices with hidden registers may fail. Devices with config space layouts that vary across FW versions or vendors may fail to migrate. Then we can evolve it to the file-based scheme, and there is time to discuss any intermediate improvements (fixed quirks, cmdline offset, etc.) in between. > > This is already TL;DR, so I'll end with that before I further drown > the possibility of discussion. Thanks, > > Alex
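The "find a gap" strategy debated above can be sketched roughly as follows. This is a hypothetical illustration, not QEMU's vconfig implementation: it walks the PCIe extended capability chain (headers at offset 0x100 and above), marks the bytes each capability occupies, and returns the first aligned hole large enough for the 8-byte PASID extended capability. The capability-size table is a partial assumption for illustration.

```python
# Hypothetical sketch of the "find a gap" placement strategy discussed
# above (not QEMU's implementation). It walks the PCIe extended
# capability chain, marks bytes claimed by each capability, and picks
# the first DWORD-aligned free range for the 8-byte PASID capability.
# The size table below is a partial, assumed lookup for illustration.

PCIE_CFG_SIZE = 0x1000
EXT_CAP_START = 0x100

EXT_CAP_SIZES = {          # illustrative sizes in bytes
    0x0003: 0x0C,          # Device Serial Number (header + 8-byte serial)
    0x000B: 0x0C,          # Vendor-Specific Extended Capability (minimum)
    0x001B: 0x08,          # PASID (header + capability + control registers)
}

def ext_cap_header(cfg, off):
    """Decode an extended capability header: (cap ID, next offset)."""
    hdr = int.from_bytes(cfg[off:off + 4], "little")
    return hdr & 0xFFFF, (hdr >> 20) & 0xFFC

def find_pasid_gap(cfg, size=8, align=4):
    """Return the first free, aligned offset able to hold a vPASID cap."""
    used = bytearray(PCIE_CFG_SIZE)
    off, seen = EXT_CAP_START, set()
    while off and off not in seen:     # guard against malformed chain loops
        seen.add(off)
        cap_id, nxt = ext_cap_header(cfg, off)
        length = EXT_CAP_SIZES.get(cap_id, 4)  # unknown caps: header only
        used[off:off + length] = b"\x01" * length
        off = nxt
    for off in range(EXT_CAP_START, PCIE_CFG_SIZE - size, align):
        if not any(used[off:off + size]):
            return off
    return None
```

The try-and-fail risk discussed in the thread is visible here: the scan only knows about capabilities that are actually linked into the chain, so a hidden register range outside the chain would look like a "free" gap.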
On Wed, 31 Jul 2024 05:15:25 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, July 31, 2024 1:35 AM > > > > On Wed, 24 Jul 2024 02:26:20 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Tuesday, April 30, 2024 1:45 AM > > > > > > > > On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote: > > > > > Regarding "if we accept that text file configuration should be > > > > > something the VMM supports", I'm not on board with this yet, so > > > > > applying it to PASID discussion seems premature. > > > > > > > > Sure, I'm just explaining a way this could all fit together. > > > > > > > > > > Thinking more along this direction. > > > > > > I'm not sure how long it will take to standardize such text files and > > > share them across VMMs. We may need a way to move in steps in > > > Qemu to unblock the kernel development toward that end goal, e.g. > > > first accepting a pasid option plus user-specified offset (if offset > > > unspecified then auto-pick one in cap holes). Later when the text > > > file is ready then such per-cap options can be deprecated. > > > > Planned obsolescence is a hard sell. > > > > > This simple way won't fix the migration issue, but at least it's on > > > par with physical caps (i.e. fail the migration if offset mismatched > > > between dest/src) and both will be fixed when the text file model > > > is ready. > > > > > > Then look at what uAPI is required to report the vPASID cap. > > > > > > In earlier discussion it's leaning toward extending GET_HW_INFO > > > in iommufd given both iommu/pci support are required to get > > > PASID working and iommu driver will not report such support until > > > pasid has been enabled in both iommu/pci. With that there is no > > > need to further report PASID in vfio-pci. 
> > > > > > But there may be other caps which are shared between VF and > > > PF while having nothing to do with the iommu. e.g. the Device > > > Serial Number extended cap (permitted but not recommended > > > in VF). If there is a need to report such cap on VF which doesn't > > > implement it to userspace, a vfio uAPI (device_feature or a new > > > one dedicated to synthetical vcap) appears to be inevitable. > > > > > > So I wonder whether we leave this part untouched until a real > > > demand comes or use vpasid to formalize that uAPI to be forward > > > looking. If in the end such uAPI will exist then it's a bit weird to > > > have PASID escaped (especially when vfio-pci already reports > > > PRI/ATS which have iommu dependency too in vconfig). > > > > > > In concept the Qemu logic will be clearer if any PCI caps (real > > > or synthesized) is always conveyed via vfio-pci while iommufd is > > > for identifying a viommu cap. > > > > There are so many moving pieces here and the discussion trailed off a > > long time ago. I have trouble keeping all the relevant considerations > > in my head, so let me try to enumerate them, please correct/add. > > Thanks for the summary! > > > > > - The PASID capability cannot be implemented on VFs per the PCIe spec. > > All VFs share the PF PASID configuration. This also implies that > > the VF PASID capability is essentially emulated since the VF driver > > cannot manipulate the PF PASID directly. > > > > - VFIO does not currently expose the PASID capability for PFs, nor > > does anything construct a vPASID capability for VFs. > > > > - The PASID capability is only useful in combination with a vIOMMU > > with PASID support, which does not yet exist in QEMU. > > > > - Some devices are known to place registers in configuration space, > > outside of the capability chains, which historically makes it > > difficult to place a purely virtual capability without potentially > > masking such hidden registers. 
Current virtual capabilities are > > placed at vendor defined fixed locations to avoid conflicts. > > > > - There is some expectation that otherwise compatible devices may > > not present identical capability chains, for example devices running > > different firmware or devices from different vendors implementing a > > standard register ABI (virtio) where capability chain layout is not > > standardized. > > > > - There have been arguments that the layout of device capabilities is > > a policy choice, where both the kernel and libvirt traditionally try > > to avoid making policy decisions. > > > > - Seamless live migration of devices requires that configuration space > > remains at least consistent, if not identical for much of it. > > I didn't quite get it. I thought being consistent means fully identical > config space from guest p.o.v. See for example: https://gitlab.com/qemu-project/qemu/-/commit/187716feeba406b5a3879db66a7bafd687472a1f The layout of config space and most of the contents therein need to be identical, but there are arguably elements that could be volatile which only need to be consistent. > > Capability offsets cannot change during live migration. This leads > > to the text file reference above, which is essentially just the > > notion that if the VMM defines the capability layout in config > > space, it would need to do so via a static reference, independent of > > the layout of the physical device and we might want to share that > > among multiple VMMs. > > > > - For a vfio-pci device to support live migration it must be enabled > > to do so by a vfio-pci variant driver. > > > > - We've discussed in the community and seem to have a consensus that a > > DVSEC (Designated Vendor Specific Extended Capability) could be > > defined to describe unused configuration space. Such a DVSEC could > > be implemented natively by the device or supplied by a vfio-pci > > variant driver. There is currently no definition of such a DVSEC. 
> > I'm not sure whether DVSEC is still that necessary if the direction is > to go userspace-defined layout. In a synthetic world the unused > physical space doesn't really matter. > > So this consensus IMHO was better placed under the umbrella of > the other direction having the kernel define the layout. I agree that we don't seem to be headed in a direction that requires this, but I just wanted to include that there was a roughly agreed upon way for devices and variant drivers to annotate unused config space ranges for higher levels. If we head in a direction where the VMM chooses an offset for the PASID capability, we need to keep track of whether this DVSEC comes to fruition and how that affects the offset that QEMU might choose. > > So what are we trying to accomplish here. PASID is the first > non-device specific virtual capability that we'd like to insert into > the VM view of the capability chain. It won't be the last. > > - Do we push the policy of defining the capability offset to the user? > > Looks yes as I didn't see a strong argument for the opposite way. It's a policy choice though, so where and how is it implemented? It works fine for those of us willing to edit xml or launch VMs by command line, but libvirt isn't going to sign up to insert a policy choice for a device. If we get to even higher level tools, does anything that wants to implement PASID support require a vendor operator driver to make such policy choices (btw, I'm just throwing out the "operator" term as if I know what it means, I don't). > > - Do we do some hand waving that devices supporting PASID shouldn't > have hidden registers and therefore the VMM can simply find a gap? > > I assume 'handwaving' doesn't mean any measure in code to actually > block those devices (as doing so likely requires certain denylist based on > device/vendor ID but then why not going a step further to also hard > code an offset?). 
It's more a try-and-fail model where vPASID is opted > in via a cmdline parameter then a device with hidden registers may > misbehave if the VMM happens to find a conflict gap. And the impact > is restricted only to a new setup where the user is interested in > PASID to opt hence can afford diagnostics effort to figure out the restriction. If you want to hard code an offset then we're effectively introducing a device specific quirk to enable PASID support. I thought we wanted this to work generically for any device exposing PASID, therefore I was thinking more of "find a gap" as the default strategy with quirks used to augment the resulting offset where necessary. I'd also be careful about command line parameters. I think we require one for the vIOMMU to enable PASID support, but I'd prefer to avoid one on the vfio-pci device, instead simply enabling support when both the vIOMMU support is enabled and the device is detected to support it. Each command line option requires support in the upper level tools to enable it. > > - Do we ask the hardware vendor or variant driver to insert a DVSEC to > > identify available config space? > > As said I don't think it's necessary if leaving the policy to the user Leaving the policy to the user is essentially just kicking the can down the road and I don't know where that policy actually gets implemented in a cloud, production environment. I think it would align with QEMU practices if the user could override a default policy on the command line, but ultimately if we want to keep the policy decision out of the kernel then the defaults probably need to be implemented in QEMU. > > - Do we handle this as just another device quirk, where we maintain a > > table of supported devices and vPASID offset for each? > > > > - Do we consider this an inflection point where the VMM entirely takes > > over the layout of the capability spaces to impose a stable > > migration layout? On what basis do we apply that inflection? 
> > > > - Also, do we require the same policy for both standard and extended > > capability chains? > > suppose yes. > > > > > I understand the desire to make some progress, but QEMU relies on > > integration with management tools, so a temporary option for a user to > > specify a PASID offset in isolation sounds like a non-starter to me. > > > > This might be a better sell if the user interface allowed fully > > defining the capability chain layout from the command line and this > > interface would continue to exist and supersede how the VMM might > > otherwise define the capability chain when used. A fully user defined > > layout would be complicated though, so I think there would still be a > > desire for QEMU to consume or define a consistent policy itself. > > > > Even if QEMU defines the layout for a device, there may be multiple > > versions of that device. For example, maybe we just add PASID now, but > > at some point we decide that we do want to replicate the PF serial > > number capability. At that point we have versions of the device which > > would need to be tied to versions of the machine and maybe also > > selected via a profile switch on the device command line. > > > > If we want to simplify this, maybe we do just look at whether the > > vIOMMU is configured for PASID support and if the device supports it, > > and this is related to the open which I raised in last mail - whether we > want to report the PASID support both in iommufd and vfio-pci uAPI. > > My impression is yes as there may be requirement of exposing a virtual > capability which doesn't rely on the IOMMU. What's the purpose of reporting PASID via both iommufd and vfio-pci? I agree that there will be capabilities related to the iommufd and capabilities only related to the device, but I disagree that that provides justification to report PASID via both uAPIs. Are we also going to ask iommufd to report that a device has an optional serial number capability? 
It clearly doesn't make sense for iommufd to be involved with that, so why does it make sense for vfio-pci to be involved in reporting something that is more iommufd specific? > > then we just look for a gap and add the capability. If we end up with > > different results between source and target for migration, then > > migration will fail. Possibly we end up with a quirk table to override > > the default placement of specific capabilities on specific devices. > > emm how does a quirk table work with devices having volatile config > space layout cross FW versions? Can VMM assigned with a VF be able > to check the FW version of the PF? If the VMM can't find the same gap between source and destination then a quirk could make sure that the PASID offset is consistent. But also if the VMM doesn't find the same gap then that suggests the config space is already different and not only the offset of the PASID capability will need to be fixed via a quirk, so then we're into quirking the entire capability space for the device. The VMM should not be assumed to have any additional privileges beyond what we provide it through the vfio device and iommufd interface. Testing anything about the PF would require access on the host that won't work in more secure environments. Therefore if we can't consistently place the PASID for a device, we probably need to quirk it based on the vendor/device IDs or sub-IDs or we need to rely on a management implied policy such as a device profile option on the QEMU command line or maybe different classes of the vfio-pci driver in QEMU. > > That might evolve into a lookup for where we place all capabilities, > > which essentially turns into the "file" where the VMM defines the entire > > layout for some devices. > > Overall this sounds a feasible path to move forward - starting with > the VMM to find the gap automatically if a new PASID option is > opted in. Devices with hidden registers may fail. 
Devices with volatile > config space due to FW upgrade or cross vendors may fail to migrate. > Then evolving it to the file-based scheme, and there is time to discuss > any intermediate improvement (fixed quirks, cmdline offset, etc.) in > between. As above, let's be careful about introducing unnecessary command line options, especially if we expect support for them in higher level tools. If we place the PASID somewhere that makes the device not work, then disabling PASID on the vIOMMU should resolve that. It won't be a regression, it will only be an incompatibility with a new feature. That incompatibility may require a quirk to resolve to have the PASID placed somewhere else. If the PASID is placed at different offsets based on device firmware or vendor then the location of the PASID alone isn't the only thing preventing migration and we'll need to introduce code for the VMM to take ownership of the capability layout at that point. Thanks, Alex
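Mechanically, once an offset has been chosen (whether by gap-finding or a quirk), synthesizing the vPASID capability amounts to linking one more node into the extended capability chain. A rough sketch, assuming the register layout from the PCIe spec (PASID extended capability ID 0x1B, capability register at +4, control register at +6); this is an illustration, not QEMU's actual vconfig code:

```python
# Hypothetical sketch of synthesizing a vPASID capability at a chosen
# offset: walk the extended capability chain to its tail, point the
# tail's "next" field at the new offset, then write the 8-byte PASID
# capability there. Register bit positions follow the PCIe spec's PASID
# extended capability; everything else is illustrative.

PASID_EXT_CAP_ID = 0x001B

def insert_vpasid(cfg, offset, max_pasid_width=20, exec_perm=False, priv=True):
    """Link a synthetic PASID extended capability into cfg at `offset`."""
    # Find the tail of the extended capability chain.
    off = 0x100
    while True:
        hdr = int.from_bytes(cfg[off:off + 4], "little")
        nxt = (hdr >> 20) & 0xFFC
        if nxt == 0:
            break
        off = nxt
    # Patch the tail's "next capability offset" field (bits 31:20).
    hdr = (hdr & 0x000FFFFF) | (offset << 20)
    cfg[off:off + 4] = hdr.to_bytes(4, "little")
    # PASID header: capability ID, version 1, next = 0 (new tail).
    pasid_hdr = PASID_EXT_CAP_ID | (1 << 16)
    cfg[offset:offset + 4] = pasid_hdr.to_bytes(4, "little")
    # Capability register: exec permission (bit 1), privileged mode
    # (bit 2), max PASID width (bits 12:8).
    capreg = (exec_perm << 1) | (priv << 2) | (max_pasid_width << 8)
    cfg[offset + 4:offset + 6] = capreg.to_bytes(2, "little")
    # Control register zeroed: PASID starts disabled.
    cfg[offset + 6:offset + 8] = (0).to_bytes(2, "little")
```

The control register is left zeroed so the guest must explicitly enable PASID, mirroring how the physical capability comes up.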
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Thursday, August 1, 2024 1:05 AM > > On Wed, 31 Jul 2024 05:15:25 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, July 31, 2024 1:35 AM > > > > > > - Seamless live migration of devices requires that configuration space > > > remains at least consistent, if not identical for much of it. > > > > I didn't quite get it. I thought being consistent means fully identical > > config space from guest p.o.v. > > See for example: > > https://gitlab.com/qemu-project/qemu/-/commit/187716feeba406b5a3879db66a7bafd687472a1f Thanks! > > The layout of config space and most of the contents therein need to be > identical, but there are arguably elements that could be volatile which > only need to be consistent. hmm IMHO it's more that the guest doesn't care about volatile content in that field rather than the guest view being strictly consistent. Probably I don't really understand the meaning of consistency in this context... btw that fix claims: " Here consistency could mean that VSC format should be same on source and destination, however actual Vendor Specific Info may not be byte-to-byte identical. " Does it apply to all devices supporting VSC? It's OK for NVIDIA vGPU but I'm not sure whether some vendor driver might be sensitive to byte-to-byte consistency in VSC. > > > > > > > > - We've discussed in the community and seem to have a consensus that a > > > > DVSEC (Designated Vendor Specific Extended Capability) could be > > > > defined to describe unused configuration space. Such a DVSEC could > > > > be implemented natively by the device or supplied by a vfio-pci > > > > variant driver. There is currently no definition of such a DVSEC. 
> > > > So this consensus IMHO was better placed under the umbrella of > > the other direction having the kernel define the layout. > > I agree that we don't seem to be headed in a direction that requires > this, but I just wanted to include that there was a roughly agreed upon > way for devices and variant drivers to annotate unused config space > ranges for higher levels. If we head in a direction where the VMM > chooses an offset for the PASID capability, we need to keep track of > whether this DVSEC comes to fruition and how that affects the offset > that QEMU might choose. Yes. We can keep this option in case there is a demand, especially if the file-based synthesized scheme won't be built in one day and we need a default policy in the VMM. > > > > So what are we trying to accomplish here. PASID is the first > > > non-device specific virtual capability that we'd like to insert into > > > the VM view of the capability chain. It won't be the last. > > > > > > - Do we push the policy of defining the capability offset to the user? > > > > Looks yes as I didn't see a strong argument for the opposite way. > > It's a policy choice though, so where and how is it implemented? It > works fine for those of us willing to edit xml or launch VMs by command > line, but libvirt isn't going to sign up to insert a policy choice for > a device. If we get to even higher level tools, does anything that > wants to implement PASID support require a vendor operator driver to > make such policy choices (btw, I'm just throwing out the "operator" > term as if I know what it means, I don't). I had a rough feeling that there might be other usages requiring such a vendor plugin, e.g. provisioning VF/ADI may require vendor specific configurations, but I'm not really an expert in this area. Overall I feel most of our discussions so far are about VMM-auto-find-offset vs. 
file-based-policy-scheme, which both belong to user-defined policy, suggesting that we all agreed to drop the other way of having the kernel define the offset (plus in-kernel quirks, etc.)? Even the said DVSEC is to assist such a user-defined direction. > > > > - Do we do some hand waving that devices supporting PASID shouldn't > > > have hidden registers and therefore the VMM can simply find a gap? > > > > > > I assume 'handwaving' doesn't mean any measure in code to actually > > > block those devices (as doing so likely requires certain denylist based on > > > device/vendor ID but then why not going a step further to also hard > > > code an offset?). It's more a try-and-fail model where vPASID is opted > > > in via a cmdline parameter then a device with hidden registers may > > > misbehave if the VMM happens to find a conflict gap. And the impact > > > is restricted only to a new setup where the user is interested in > > > PASID to opt hence can afford diagnostics effort to figure out the > > restriction. > > If you want to hard code an offset then we're effectively introducing a > device specific quirk to enable PASID support. I thought we wanted > this to work generically for any device exposing PASID, therefore I was > thinking more of "find a gap" as the default strategy with quirks used > to augment the resulting offset where necessary. > > I'd also be careful about command line parameters. I think we require > one for the vIOMMU to enable PASID support, but I'd prefer to avoid one > on the vfio-pci device, instead simply enabling support when both the > vIOMMU support is enabled and the device is detected to support it. > > Each command line option requires support in the upper level tools to > enable it. Makes sense. btw, will there be a requirement that the user wants to disable PASID even if the device supports it, e.g. for testing purposes or to work around a HW errata disclosed after the host driver claims the support in an old kernel? 
> > > > > > I understand the desire to make some progress, but QEMU relies on > > > integration with management tools, so a temporary option for a user to > > > specify a PASID offset in isolation sounds like a non-starter to me. > > > > > > This might be a better sell if the user interface allowed fully > > > defining the capability chain layout from the command line and this > > > interface would continue to exist and supersede how the VMM might > > > otherwise define the capability chain when used. A fully user defined > > > layout would be complicated though, so I think there would still be a > > > desire for QEMU to consume or define a consistent policy itself. > > > > > > Even if QEMU defines the layout for a device, there may be multiple > > > versions of that device. For example, maybe we just add PASID now, but > > > at some point we decide that we do want to replicate the PF serial > > > number capability. At that point we have versions of the device which > > > would need to be tied to versions of the machine and maybe also > > > selected via a profile switch on the device command line. > > > > > > If we want to simplify this, maybe we do just look at whether the > > > vIOMMU is configured for PASID support and if the device supports it, > > > > and this is related to the open which I raised in last mail - whether we > > want to report the PASID support both in iommufd and vfio-pci uAPI. > > > > My impression is yes as there may be requirement of exposing a virtual > > capability which doesn't rely on the IOMMU. > > What's the purpose of reporting PASID via both iommufd and vfio-pci? I > agree that there will be capabilities related to the iommufd and > capabilities only related to the device, but I disagree that that > provides justification to report PASID via both uAPIs. Are we also > going to ask iommufd to report that a device has an optional serial > number capability? It clearly doesn't make sense for iommufd to be Certainly no. 
My point was that vfio-pci/iommufd each reports its own capability set. They may overlap but this fact just matches the physical world. > involved with that, so why does it make sense for vfio-pci to be > involved in reporting something that is more iommufd specific? It doesn't matter which one involves more. It's more akin to the physical world. btw vfio-pci already reports ATS/PRI, which both rely on iommufd, in vconfig space. Throwing PASID alone to the iommufd uAPI lacks a good justification for why it's special. I envision an extension to the vfio device feature or a new vfio uAPI for reporting virtual capabilities as an augment to the ones filled in vconfig space. > > > then we just look for a gap and add the capability. If we end up with > > > different results between source and target for migration, then > > > migration will fail. Possibly we end up with a quirk table to override > > > the default placement of specific capabilities on specific devices. > > > > emm how does a quirk table work with devices having volatile config > > space layout cross FW versions? Can VMM assigned with a VF be able > > to check the FW version of the PF? > > If the VMM can't find the same gap between source and destination then > a quirk could make sure that the PASID offset is consistent. But also > if the VMM doesn't find the same gap then that suggests the config > space is already different and not only the offset of the PASID > capability will need to be fixed via a quirk, so then we're into > quirking the entire capability space for the device. yes. So the quirk table is more for fixing the functional gap (i.e., not overlapping with a hidden register) than for migration. As long as a device can function correctly with it, the virtual caps fall into the same restriction as physical caps in migration, i.e., upon an inconsistent layout between src/dest we'll need a separate way to synthesize the entire space. 
> > The VMM should not be assumed to have any additional privileges beyond > what we provide it through the vfio device and iommufd interface. > Testing anything about the PF would require access on the host that > won't work in more secure environments. Therefore if we can't > consistently place the PASID for a device, we probably need to quirk it > based on the vendor/device IDs or sub-IDs or we need to rely on a > management implied policy such as a device profile option on the QEMU > command line or maybe different classes of the vfio-pci driver in QEMU. > > > > That might evolve into a lookup for where we place all capabilities, > > > which essentially turns into the "file" where the VMM defines the entire > > > layout for some devices. > > > > Overall this sounds a feasible path to move forward - starting with > > the VMM to find the gap automatically if a new PASID option is > > opted in. Devices with hidden registers may fail. Devices with volatile > > config space due to FW upgrade or cross vendors may fail to migrate. > > Then evolving it to the file-based scheme, and there is time to discuss > > any intermediate improvement (fixed quirks, cmdline offset, etc.) in > > between. > > As above, let's be careful about introducing unnecessary command line > options, especially if we expect support for them in higher level > tools. If we place the PASID somewhere that makes the device not work, > then disabling PASID on the vIOMMU should resolve that. It won't be a The vIOMMU is per-platform, so disabling PASID there applies to all devices behind it, including those which don't have a problem with the auto-selected offset. Not sure whether one would want to continue enabling PASID for other devices or stop immediately to find a quirk for the problematic one and then resume. > regression, it will only be an incompatibility with a new feature. > That incompatibility may require a quirk to resolve to have the PASID > placed somewhere else. 
If the PASID is placed at different offsets > based on device firmware or vendor then the location of the PASID alone > isn't the only thing preventing migration and we'll need to introduce > code for the VMM to take ownership of the capability layout at that > point. Thanks, > Yes, the migration issue might be solved in a separate track as it applies to both physical and virtual capabilities.
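The quirk-table idea that recurs in this thread could look roughly like the sketch below: a per-device override keyed on PCI vendor/device ID that takes precedence over the default gap-finding result. All IDs and offsets here are made up for illustration; nothing like this exists in QEMU today.

```python
# Hypothetical sketch of the quirk table discussed in this thread: the
# VMM consults a per-device override first and only falls back to the
# automatic "find a gap" strategy when no quirk exists. Every ID and
# offset below is fictitious, for illustration only.

PASID_OFFSET_QUIRKS = {
    # (vendor_id, device_id): fixed vPASID offset in extended config space
    (0x8086, 0x1234): 0x200,   # imaginary device with hidden registers
    (0x15B3, 0xABCD): 0x300,   # imaginary device with FW-dependent layout
}

def vpasid_offset(vendor_id, device_id, find_gap):
    """Choose the virtual PASID capability offset for a device.

    `find_gap` is a callable implementing the default gap-finding
    strategy; a quirk entry always takes precedence over it, so the
    placement stays stable across hosts for known-problematic devices.
    """
    quirk = PASID_OFFSET_QUIRKS.get((vendor_id, device_id))
    return quirk if quirk is not None else find_gap()
```

A quirked device always gets its fixed offset regardless of what the gap finder would pick, which is what keeps the source and destination layouts consistent for migration of those devices.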
On Thu, 1 Aug 2024 07:45:43 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Thursday, August 1, 2024 1:05 AM > > > > On Wed, 31 Jul 2024 05:15:25 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Wednesday, July 31, 2024 1:35 AM > > > > > > > > - Seamless live migration of devices requires that configuration space > > > > remains at least consistent, if not identical for much of it. > > > > > > I didn't quite get it. I thought being consistent means fully identical > > > config space from guest p.o.v. > > > > See for example: > > > > https://gitlab.com/qemu-project/qemu/- > > /commit/187716feeba406b5a3879db66a7bafd687472a1f > > Thanks! > > > > > The layout of config space and most of the contents therein need to be > > identical, but there are arguably elements that could be volatile which > > only need to be consistent. > > hmm IMHO it's more that the guest doesn't care volatile content in that > field instead of the guest view being strictly consistent. Probably I > don't really understand the meaning of consistency in this context... > > btw that fix claims: > > " > Here consistency could mean that VSC format should be same on > source and destination, however actual Vendor Specific Info may > not be byte-to-byte identical. > " > > Does it apply to all devices supporting VSC? It's OK for NVDIA vGPU > but I'm not sure whether some vendor driver might be sensitive to > byte-to-byte consistency in VSC. It applies to all VSC capabilities. The argument is that there is no standard definition of the content here therefore QEMU has no authority to impose a policy on the contents. > > > > > > > > - We've discussed in the community and seem to have a consensus that a > > > > DVSEC (Designated Vendor Specific Extended Capability) could be > > > > defined to describe unused configuration space. 
Such a DVSEC could > > > > be implemented natively by the device or supplied by a vfio-pci > > > > variant driver. There is currently no definition of such a DVSEC. > > > > > > I'm not sure whether DVSEC is still that necessary if the direction is > > > to go userspace-defined layout. In a synthetic world the unused > > > physical space doesn't really matter. > > > > > > So this consensus IMHO was better placed under the umbrella of > > > the other direction having the kernel define the layout. > > > > I agree that we don't seem to be headed in a direction that requires > > this, but I just wanted to include that there was a roughly agreed upon > > way for devices and variant drivers to annotate unused config space > > ranges for higher levels. If we head in a direction where the VMM > > chooses an offset for the PASID capability, we need to keep track of > > whether this DVSEC comes to fruition and how that affects the offset > > that QEMU might choose. > > Yes. We can keep this option in case there is a demand, especially > if the file-based synthesized scheme won't be built in one day and > we need a default policy in VMM. > > > > > > > So what are we trying to accomplish here. PASID is the first > > > > non-device specific virtual capability that we'd like to insert into > > > > the VM view of the capability chain. It won't be the last. > > > > > > > > - Do we push the policy of defining the capability offset to the user? > > > > > > Looks yes as I didn't see a strong argument for the opposite way. > > > > It's a policy choice though, so where and how is it implemented? It > > works fine for those of us willing to edit xml or launch VMs by command > > line, but libvirt isn't going to sign up to insert a policy choice for > > a device. 
If we get to even higher level tools, does anything that > > wants to implement PASID support required a vendor operator driver to > > make such policy choices (btw, I'm just throwing out the "operator" > > term as if I know what it means, I don't). > > I had a rough feeling that there might be other usages requiring such > vendor plugin, e.g. provisioning VF/ADI may require vendor specific > configurations, but not really an expert in this area. > > Overall I feel most of our discussions so far are about VMM-auto- > find-offset vs. file-based-policy-scheme which both belong to > user-defined policy, suggesting that we all agreed to drop the other > way having kernel define the offset (plus in-kernel quirks, etc.)? > > Even the said DVSEC is to assist such user-defined direction. To me a "user defined policy" is placing an option on the command line and requiring the user, or some higher level authority representing the user, to provide the policy. If it's done by the VMM then we're saying QEMU owns the policy but it might be overridden by the user via a command line argument or modifying the policy file consumed by QEMU. > > > > - Do we do some hand waving that devices supporting PASID shouldn't > > > > have hidden registers and therefore the VMM can simply find a gap? > > > > > > I assume 'handwaving' doesn't mean any measure in code to actually > > > block those devices (as doing so likely requires certain denylist based on > > > device/vendor ID but then why not going a step further to also hard > > > code an offset?). It's more a try-and-fail model where vPASID is opted > > > in via a cmdline parameter then a device with hidden registers may > > > misbehave if the VMM happens to find a conflict gap. And the impact > > > is restricted only to a new setup where the user is interested in > > > PASID to opt hence can afford diagnostics effort to figure out the > > restriction. 
> > > > If you want to hard code an offset then we're effectively introducing a > > device specific quirk to enable PASID support. I thought we wanted > > this to work generically for any device exposing PASID, therefore I was > > thinking more of "find a gap" as the default strategy with quirks used > > to augment the resulting offset where necessary. > > > > I'd also be careful about command line parameters. I think we require > > one for the vIOMMU to enable PASID support, but I'd prefer to avoid one > > on the vfio-pci device, instead simply enabling support when both the > > vIOMMU support is enabled and the device is detected to support it. > > Each command line option requires support in the upper level tools to > > enable it. > > Make sense. btw will there be a requirement that the user wants to > disable PASID even if the device supports it, e.g. for testing purpose > or to workaround a HW errata disclosed after host driver claims the > support in an old kernel? We can make use of "experimental options", ie. "x-" prefixed options, in QEMU for this. These are useful for debugging but will not be considered for support by higher level tools nor do we consider them to have a stable ABI. > > > > I understand the desire to make some progress, but QEMU relies on > > > > integration with management tools, so a temporary option for a user to > > > > specify a PASID offset in isolation sounds like a non-starter to me. > > > > > > > > This might be a better sell if the user interface allowed fully > > > > defining the capability chain layout from the command line and this > > > > interface would continue to exist and supersede how the VMM might > > > > otherwise define the capability chain when used. A fully user defined > > > > layout would be complicated though, so I think there would still be a > > > > desire for QEMU to consume or define a consistent policy itself. 
> > > > > > > > Even if QEMU defines the layout for a device, there may be multiple > > > > versions of that device. For example, maybe we just add PASID now, but > > > > at some point we decide that we do want to replicate the PF serial > > > > number capability. At that point we have versions of the device which > > > > would need to be tied to versions of the machine and maybe also > > > > selected via a profile switch on the device command line. > > > > > > > > If we want to simplify this, maybe we do just look at whether the > > > > vIOMMU is configured for PASID support and if the device supports it, > > > > > > and this is related to the open which I raised in last mail - whether we > > > want to report the PASID support both in iommufd and vfio-pci uAPI. > > > > > > My impression is yes as there may be requirement of exposing a virtual > > > capability which doesn't rely on the IOMMU. > > > > What's the purpose of reporting PASID via both iommufd and vfio-pci? I > > agree that there will be capabilities related to the iommufd and > > capabilities only related to the device, but I disagree that that > > provides justification to report PASID via both uAPIs. Are we also > > going to ask iommufd to report that a device has an optional serial > > number capability? It clearly doesn't make sense for iommufd to be > > Certainly no. My point was that vfio-pci/iommufd each reports its > own capability set. They may overlap but this fact just matches the > physical world. > > > involved with that, so why does it make sense for vfio-pci to be > > involved in reporting something that is more iommufd specific? > > It doesn't matter which one involves more. It's more akin to the > physical world. > > btw vfio-pci already reports ATS/PRI which both rely on iommufd > in vconfig space. Throwing PASID alone to iommufd uAPI lacks of a > good justification for why it's special. 
> > I envision an extension to vfio device feature or a new vfio uAPI > for reporting virtual capabilities as augment to the ones filled in > vconfig space. Should ATS and PRI be reported through vfio-pci or should we just turn them off to be more like PASID? Maybe the issue simply hasn't arisen yet because we don't have vIOMMU support and with that support QEMU might need to filter out those capabilities and look elsewhere. Anyway, iommufd and vfio-pci should not duplicate each other here. > > > > then we just look for a gap and add the capability. If we end up with > > > > different results between source and target for migration, then > > > > migration will fail. Possibly we end up with a quirk table to override > > > > the default placement of specific capabilities on specific devices. > > > > > > emm how does a quirk table work with devices having volatile config > > > space layout cross FW versions? Can VMM assigned with a VF be able > > > to check the FW version of the PF? > > > > If the VMM can't find the same gap between source and destination then > > a quirk could make sure that the PASID offset is consistent. But also > > if the VMM doesn't find the same gap then that suggests the config > > space is already different and not only the offset of the PASID > > capability will need to be fixed via a quirk, so then we're into > > quirking the entire capability space for the device. > > yes. So the quirk table is more for fixing the functional gap (i.e. not > overlap with a hidden register) instead of for migration. As long as > a device can function correctly with it, the virtual caps fall into the > same restriction as physical caps in migration i.e. upon inconsistent > layout between src/dest we'll need separate way to synthesize the > entire space. Yes. > > The VMM should not be assumed to have any additional privileges beyond > > what we provide it through the vfio device and iommufd interface. 
> > Testing anything about the PF would require access on the host that > > won't work in more secure environments. Therefore if we can't > > consistently place the PASID for a device, we probably need to quirk it > > based on the vendor/device IDs or sub-IDs or we need to rely on a > > management implied policy such as a device profile option on the QEMU > > command line or maybe different classes of the vfio-pci driver in QEMU. > > > > > > That might evolve into a lookup for where we place all capabilities, > > > > which essentially turns into the "file" where the VMM defines the entire > > > > layout for some devices. > > > > > > Overall this sounds a feasible path to move forward - starting with > > > the VMM to find the gap automatically if a new PASID option is > > > opted in. Devices with hidden registers may fail. Devices with volatile > > > config space due to FW upgrade or cross vendors may fail to migrate. > > > Then evolving it to the file-based scheme, and there is time to discuss > > > any intermediate improvement (fixed quirks, cmdline offset, etc.) in > > > between. > > > > As above, let's be careful about introducing unnecessary command line > > options, especially if we expect support for them in higher level > > tools. If we place the PASID somewhere that makes the device not work, > > then disabling PASID on the vIOMMU should resolve that. It won't be a > > vIOMMU is per-platform then it applies to all devices behind, including > those which don't have a problem with auto-selected offset. Not sure > whether one would want to continue enabling PASID for other devices > or should stop immediately to find a quirk for the problematic one and > then resume. I'm not sure if this is a real issue, we're talking about a VM, not a server. If a user wants PASID support and it's incompatible with a device, the device can be excluded from the VM or we can have an experimental option on the vfio-pci device in QEMU as a workaround. 
I don't think this is something we need to plumb up through the tool stack. Thanks, Alex
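To make the "find a gap" strategy discussed above concrete, here is a minimal userspace-style sketch: walk the device's PCIe extended capability chain, mark the bytes each capability covers, and pick the first free DWORD-aligned hole for a synthetic PASID capability. This is an illustration only, not QEMU's actual implementation — `find_pasid_gap` and `toy_cap_size` are made-up names, and a real VMM would need a per-capability size table plus quirks for devices with hidden registers (which is exactly the failure mode debated in this thread).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_CFG_SPACE_EXP_SIZE 4096
#define PCI_EXT_CAP_START      0x100
#define PASID_CAP_SIZE         8   /* the PASID extended capability is 8 bytes */

/* PCIe extended capability header: bits [15:0] ID, [19:16] version,
 * [31:20] next capability offset (DWORD aligned). */
static uint16_t ext_cap_id(uint32_t hdr)   { return hdr & 0xffff; }
static uint16_t ext_cap_next(uint32_t hdr) { return (hdr >> 20) & 0xffc; }

/* Toy per-capability size lookup; a real VMM would key this off the
 * capability ID plus a quirk list for devices with hidden registers. */
static uint16_t toy_cap_size(uint16_t id)
{
    (void)id;
    return 0x40;
}

/* Walk the extended capability chain of a (little-endian) config space
 * snapshot, mark the bytes each capability covers, and return the first
 * DWORD-aligned hole big enough for a synthetic PASID capability.
 * Returns 0 if no hole exists. */
static uint16_t find_pasid_gap(const uint8_t *cfg,
                               uint16_t (*cap_size)(uint16_t id))
{
    uint8_t used[PCI_CFG_SPACE_EXP_SIZE] = { 0 };
    uint16_t off = PCI_EXT_CAP_START;

    while (off) {
        uint32_t hdr;
        memcpy(&hdr, cfg + off, sizeof(hdr));
        if (!hdr)
            break;
        memset(used + off, 1, cap_size(ext_cap_id(hdr)));
        off = ext_cap_next(hdr);
    }

    for (off = PCI_EXT_CAP_START;
         off + PASID_CAP_SIZE <= PCI_CFG_SPACE_EXP_SIZE; off += 4) {
        uint16_t i;
        for (i = 0; i < PASID_CAP_SIZE && !used[off + i]; i++)
            ;
        if (i == PASID_CAP_SIZE)
            return off;
    }
    return 0;
}
```

The migration concern raised above maps directly onto this sketch: if source and destination snapshots differ, the hole found differs, and the chosen offset must then be pinned by a quirk instead.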
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Saturday, August 3, 2024 2:25 AM > > On Thu, 1 Aug 2024 07:45:43 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Thursday, August 1, 2024 1:05 AM > > > > > > On Wed, 31 Jul 2024 05:15:25 +0000 > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Wednesday, July 31, 2024 1:35 AM > > > > > > > > > > So what are we trying to accomplish here. PASID is the first > > > > > non-device specific virtual capability that we'd like to insert into > > > > > the VM view of the capability chain. It won't be the last. > > > > > > > > > > - Do we push the policy of defining the capability offset to the user? > > > > > > > > Looks yes as I didn't see a strong argument for the opposite way. > > > > > > It's a policy choice though, so where and how is it implemented? It > > > works fine for those of us willing to edit xml or launch VMs by command > > > line, but libvirt isn't going to sign up to insert a policy choice for > > > a device. If we get to even higher level tools, does anything that > > > wants to implement PASID support required a vendor operator driver to > > > make such policy choices (btw, I'm just throwing out the "operator" > > > term as if I know what it means, I don't). > > > > I had a rough feeling that there might be other usages requiring such > > vendor plugin, e.g. provisioning VF/ADI may require vendor specific > > configurations, but not really an expert in this area. > > > > Overall I feel most of our discussions so far are about VMM-auto- > > find-offset vs. file-based-policy-scheme which both belong to > > user-defined policy, suggesting that we all agreed to drop the other > > way having kernel define the offset (plus in-kernel quirks, etc.)? > > > > Even the said DVSEC is to assist such user-defined direction. 
> > To me a "user defined policy" is placing an option on the command line > and requiring the user, or some higher level authority representing the > user, to provide the policy. If it's done by the VMM then we're saying > QEMU owns the policy but it might be overridden by the user via a > command line argument or modifying the policy file consumed by QEMU. Okay, I see the difference here. In my reply it's clearer to say "userspace" instead of "user".
On Wed, Jul 31, 2024 at 11:04:36AM -0600, Alex Williamson wrote: > I'd also be careful about command line parameters. I think we require > one for the vIOMMU to enable PASID support, but I'd prefer to avoid one > on the vfio-pci device, instead simply enabling support when both the > vIOMMU support is enabled and the device is detected to support it. > Each command line option requires support in the upper level tools to > enable it. FWIW AMD seems to have an issue here where if they don't intend to use PASID then they get a bunch more performance if they leave it turned off. So cases like no vIOMMU, or a vIOMMU with no PASID capability, should automatically tell the kernel to disable PASID support for the device. These decisions need to be made when the HWPTs are first allocated. Jason
On Fri, Aug 02, 2024 at 12:25:28PM -0600, Alex Williamson wrote: > > I envision an extension to vfio device feature or a new vfio uAPI > > for reporting virtual capabilities as augment to the ones filled in > > vconfig space. > > Should ATS and PRI be reported through vfio-pci or should we just turn > them off to be more like PASID? Maybe the issue simply hasn't arisen > yet because we don't have vIOMMU support and with that support QEMU > might need to filter out those capabilities and look elsewhere. > Anyway, iommufd and vfio-pci should not duplicate each other here. ATS and PRI are strictly controlled only by the iommu driver, VFIO must not change them dynamically. Effectively they become set/unset based on what domains are attached through iommufd. PRI is turned on when any fault capable domain is attached, eg the HWPT uses IOMMU_HWPT_FAULT_ID_VALID. I'm not sure I know the full story here how this will work on the qemu side, but I would guess that PRI changes in the virtual config space will result in different nesting domains becoming attached? ATS, at least for SMMUv3, is controlled by a bit in the vSTE from the guest, the virtual PCI config space changes should not be reflected to real HW. There are data integrity/security concerns here. ATS must only be changed as part of a sequence that can flush the cache. Jason
On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote: > Okay. With that I edited my earlier reply a bit by removing the note > of cmdline option, adding DVSEC possibility, and making it clear that > the PASID option is in vIOMMU: > > " > Overall this sounds a feasible path to move forward - starting with > the VMM to find the gap automatically if PASID is opted in vIOMMU. > Devices with hidden registers may fail. Devices with volatile > config space due to FW upgrade or cross vendors may fail to migrate. > Then evolving it to the file-based scheme, and there is time to discuss > any intermediate improvement (fixed quirks, DVSEC, etc.) in between. > " > > Jason, your thoughts? This thread is big and I've read it quickly, but I could support the above summary. For migration, where we are today, it is completely up to the device and its FW to present a new config space that is stable across migration. In practice this means the device needs to have pre-set the PF to whatever config state it needs during FLR, and refuse to migrate if things are not correct. Moving away from that limitation toward a VMM-created stable config space is definitely a long term interest. I understand other VMMs are already doing things like this. It seems we need more time to bake on this long term direction. I suggested a text file as something very general, but we can do other things too. What I see as the general industry direction is toward a very prescriptive vPCI function, where you might have v1, v2, etc. of a device that present different config space/behavior/etc. I expect standards bodies are going to define reference vPCI functions for their standards along these lines. This would specify all the MMIO memory layout and every bit of config space right down to each offset and bit. This is the sort of restriction that will be needed to allow more generic live migration between different devices and different FW. 
There is a pretty clear role here for the VMM to synthesize a highly prescribed config space. How we get there and how the libvirt stack should support this, I don't know. Jason
On Tue, Jul 30, 2024 at 11:35:17AM -0600, Alex Williamson wrote: > Even if QEMU defines the layout for a device, there may be multiple > versions of that device. For example, maybe we just add PASID now, but > at some point we decide that we do want to replicate the PF serial > number capability. At that point we have versions of the device which > would need to be tied to versions of the machine and maybe also > selected via a profile switch on the device command line. This is the larger end state I see here. When the admin asks to install a vPCI function into a VM the admin would specify they want, say, 'virtio-net vPCI spec 1.0'. The text of spec 1.0 would specify each and every bit of config space and register layout. It would specify the content of the live migration blobs, and an exact feature set/behaviors the device must advertise. Even if the device can do more, a combination of the VMM and the variant PCI driver+device FW would make it do only what spec 1.0 says it should do. You can imagine there would be a range of these standards available, and large sites have a high chance to have their own private forks. This should be viewed as totally opposite the current VFIO behavior that intends to reflect the exact device, as-is, into the VM. I think we can reasonably have different approaches to each. So, how we manage this as an ecosystem, I don't know. It sure would be nice to not need kernel changes to push through a new virtual device spec! But I think your summary is good. Thanks, Jason
On 2024/8/6 22:20, Jason Gunthorpe wrote: > On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote: > >> Okay. With that I edited my earlier reply a bit by removing the note >> of cmdline option, adding DVSEC possibility, and making it clear that >> the PASID option is in vIOMMU: >> >> " >> Overall this sounds a feasible path to move forward - starting with >> the VMM to find the gap automatically if PASID is opted in vIOMMU. >> Devices with hidden registers may fail. Devices with volatile >> config space due to FW upgrade or cross vendors may fail to migrate. >> Then evolving it to the file-based scheme, and there is time to discuss >> any intermediate improvement (fixed quirks, DVSEC, etc.) in between. >> " >> >> Jason, your thoughts? > > This thread is big and I've read it quickly, but I could support the > above summary. > thanks for the ideas. I think we still need a uapi to report if the device supports PASID or not. Do we have agreement on where should this uapi be defined? vfio or iommufd. Besides, I've a question on how the userspace know the hidden registers when trying to find a gap for the vPASID cap. It should only know the standard pci caps.
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Wednesday, August 14, 2024 2:39 PM > > On 2024/8/6 22:20, Jason Gunthorpe wrote: > > On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote: > > > >> Okay. With that I edited my earlier reply a bit by removing the note > >> of cmdline option, adding DVSEC possibility, and making it clear that > >> the PASID option is in vIOMMU: > >> > >> " > >> Overall this sounds a feasible path to move forward - starting with > >> the VMM to find the gap automatically if PASID is opted in vIOMMU. > >> Devices with hidden registers may fail. Devices with volatile > >> config space due to FW upgrade or cross vendors may fail to migrate. > >> Then evolving it to the file-based scheme, and there is time to discuss > >> any intermediate improvement (fixed quirks, DVSEC, etc.) in between. > >> " > >> > >> Jason, your thoughts? > > > > This thread is big and I've read it quickly, but I could support the > > above summary. > > > > thanks for the ideas. I think we still need a uapi to report if the device > supports PASID or not. Do we have agreement on where should this uapi be > defined? vfio or iommufd. IOMMUFD_CMD_GET_HW_INFO. > > Besides, I've a question on how the userspace know the hidden registers > when trying to find a gap for the vPASID cap. It should only know the > standard pci caps. > for the initial implementation VMM doesn't know any hidden registers. The user passes a new vIOMMU option to the VMM for exposing the PASID capability in vIOMMU and in device, based on info in IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between existing caps to put the vPASID cap. If a device with hidden registers doesn’t work correctly then a quirk may be added for it.
On 2024/8/14 15:38, Tian, Kevin wrote: >> From: Liu, Yi L <yi.l.liu@intel.com> >> Sent: Wednesday, August 14, 2024 2:39 PM >> >> On 2024/8/6 22:20, Jason Gunthorpe wrote: >>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote: >>> >>>> Okay. With that I edited my earlier reply a bit by removing the note >>>> of cmdline option, adding DVSEC possibility, and making it clear that >>>> the PASID option is in vIOMMU: >>>> >>>> " >>>> Overall this sounds a feasible path to move forward - starting with >>>> the VMM to find the gap automatically if PASID is opted in vIOMMU. >>>> Devices with hidden registers may fail. Devices with volatile >>>> config space due to FW upgrade or cross vendors may fail to migrate. >>>> Then evolving it to the file-based scheme, and there is time to discuss >>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between. >>>> " >>>> >>>> Jason, your thoughts? >>> >>> This thread is big and I've read it quickly, but I could support the >>> above summary. >>> >> >> thanks for the ideas. I think we still need a uapi to report if the device >> supports PASID or not. Do we have agreement on where should this uapi be >> defined? vfio or iommufd. > > IOMMUFD_CMD_GET_HW_INFO. I see. TBH, the existing GET_HW_INFO is for iommu hw info. Extending it to report PASID support makes it cover device capabilities now. I may need to add one more capability enum in struct iommu_hw_info just like the below one. Perhaps name it as iommufd_device_capabilities. And only set the PASID bit in the new device capability when the device's or its PF's pasid cap is enabled. Does it look good? 
/** * enum iommufd_hw_capabilities * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking * If available, it means the following APIs * are supported: * * IOMMU_HWPT_GET_DIRTY_BITMAP * IOMMU_HWPT_SET_DIRTY_TRACKING * */ enum iommufd_hw_capabilities { IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, }; >> >> Besides, I've a question on how the userspace know the hidden registers >> when trying to find a gap for the vPASID cap. It should only know the >> standard pci caps. >> > > for the initial implementation VMM doesn't know any hidden registers. > The user passes a new vIOMMU option to the VMM for exposing > the PASID capability in vIOMMU and in device, based on info in > IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between > existing caps to put the vPASID cap. If a device with hidden registers > doesn’t work correctly then a quirk may be added for it. I see. But we cannot know it until a guest device driver fails. This is acceptable, right? Perhaps this should be documented somewhere to let the user/device vendor know before safely exposing vPASID cap on a device. :)
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Wednesday, August 14, 2024 4:19 PM > > On 2024/8/14 15:38, Tian, Kevin wrote: > >> From: Liu, Yi L <yi.l.liu@intel.com> > >> Sent: Wednesday, August 14, 2024 2:39 PM > >> > >> On 2024/8/6 22:20, Jason Gunthorpe wrote: > >>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote: > >>> > >>>> Okay. With that I edited my earlier reply a bit by removing the note > >>>> of cmdline option, adding DVSEC possibility, and making it clear that > >>>> the PASID option is in vIOMMU: > >>>> > >>>> " > >>>> Overall this sounds a feasible path to move forward - starting with > >>>> the VMM to find the gap automatically if PASID is opted in vIOMMU. > >>>> Devices with hidden registers may fail. Devices with volatile > >>>> config space due to FW upgrade or cross vendors may fail to migrate. > >>>> Then evolving it to the file-based scheme, and there is time to discuss > >>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between. > >>>> " > >>>> > >>>> Jason, your thoughts? > >>> > >>> This thread is big and I've read it quickly, but I could support the > >>> above summary. > >>> > >> > >> thanks for the ideas. I think we still need a uapi to report if the device > >> supports PASID or not. Do we have agreement on where should this uapi > be > >> defined? vfio or iommufd. > > > > IOMMUFD_CMD_GET_HW_INFO. > > I see. TBH. The existing GET_HW_INFO is for iommu hw info. Extending it to > report PASID supporting makes it cover device capability now. I may need to > add one more capability enum in struct iommu_hw_info just like the below > one. Perhaps name it as iommufd_device_capabilities. And only set the > PASID > capability in the new device capability when the device's or its PF's pasid > cap is enabled. Does it look good? Just add a new bit in iommufd_hw_capabilities. Anyway PASID is meaningful only when both device and IOMMU supports it. 
Here we are reporting what IOMMU capabilities are meaningful to a given device so checking both does make sense. > > > /** > * enum iommufd_hw_capabilities > * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for > dirty tracking > * If available, it means the following APIs > * are supported: > * > * IOMMU_HWPT_GET_DIRTY_BITMAP > * IOMMU_HWPT_SET_DIRTY_TRACKING > * > */ > enum iommufd_hw_capabilities { > IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, > }; > > >> > >> Besides, I've a question on how the userspace know the hidden registers > >> when trying to find a gap for the vPASID cap. It should only know the > >> standard pci caps. > >> > > > > for the initial implementation VMM doesn't know any hidden registers. > > The user passes a new vIOMMU option to the VMM for exposing > > the PASID capability in vIOMMU and in device, based on info in > > IOMMUFD_CMD_GET_HW_INFO. The VMM identifies a hole between > > existing caps to put the vPASID cap. If a device with hidden registers > > doesn’t work correctly then a quirk may be added for it. > > I see. But we cannot know it until a guest device driver failed. This is > acceptable. right? Perhaps this should be documented somewhere to let the > user/device vendor know before safely exposing vPASID cap on a device. :) > presumably in VMM.
On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote: > /** > * enum iommufd_hw_capabilities > * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking > * If available, it means the following APIs > * are supported: > * > * IOMMU_HWPT_GET_DIRTY_BITMAP > * IOMMU_HWPT_SET_DIRTY_TRACKING > * > */ > enum iommufd_hw_capabilities { > IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, > }; I think it would be appropriate to add the flag here Is it OK to rely on the PCI config space PASID enable? I see all the drivers right now are turning on PASID support during probe if the iommu supports it. Jason
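A minimal sketch of the extension Jason suggests — a new bit in the existing iommufd_hw_capabilities enum rather than a separate device-capability enum. The flag name IOMMU_HW_CAP_PCI_PASID, its bit position, and the helper below are assumptions for illustration, not a settled uAPI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing capability word reported by IOMMUFD_CMD_GET_HW_INFO, with a
 * hypothetical new bit appended.  Per the discussion, the new bit would
 * be set only when both the IOMMU and the device (or its PF) have the
 * PASID capability enabled. */
enum iommufd_hw_capabilities {
    IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
    IOMMU_HW_CAP_PCI_PASID      = 1 << 1,  /* hypothetical */
};

/* Userspace-side helper: gate vPASID synthesis on the out_capabilities
 * word returned by IOMMUFD_CMD_GET_HW_INFO. */
static bool vmm_can_expose_vpasid(uint64_t out_capabilities)
{
    return (out_capabilities & IOMMU_HW_CAP_PCI_PASID) != 0;
}
```

Folding both checks into one bit is what makes "rely on the PCI config space PASID enable" matter below: the kernel can only report the combined state if drivers enable the device-side PASID capability consistently at probe time.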
On 2024/8/14 22:40, Jason Gunthorpe wrote: > On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote: > >> /** >> * enum iommufd_hw_capabilities >> * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking >> * If available, it means the following APIs >> * are supported: >> * >> * IOMMU_HWPT_GET_DIRTY_BITMAP >> * IOMMU_HWPT_SET_DIRTY_TRACKING >> * >> */ >> enum iommufd_hw_capabilities { >> IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, >> }; > > I think it would be appropriate to add the flag here ok. > Is it OK to rely on the PCI config space PASID enable? I see all the > drivers right now are turning on PASID support during probe if the > iommu supports it. Intel side is not ready yet as it enables pasid when the device is attached to a non-blocking domain. I've chatted with Baolu, and he will kindly enable the pasid cap in the probe_device() op if both iommu and device have this cap. After that, Intel side should be fine to rely on the PASID enable bit in the PCI config space. How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant?
Yi, On 8/15/2024 7:42 AM, Yi Liu wrote: > On 2024/8/14 22:40, Jason Gunthorpe wrote: >> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote: >> >>> /** >>> * enum iommufd_hw_capabilities >>> * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking >>> * If available, it means the following APIs >>> * are supported: >>> * >>> * IOMMU_HWPT_GET_DIRTY_BITMAP >>> * IOMMU_HWPT_SET_DIRTY_TRACKING >>> * >>> */ >>> enum iommufd_hw_capabilities { >>> IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, >>> }; >> >> I think it would be appropriate to add the flag here > > ok. > >> Is it OK to rely on the PCI config space PASID enable? I see all the >> drivers right now are turning on PASID support during probe if the >> iommu supports it. > > Intel side is not ready yet as it enables pasid when the device is attached > to a non-blocking domain. I've chatted with Baolu, and he will kindly to > enable the pasid cap in the probe_device() op if both iommu and device has > this cap. After that, Intel side should be fine to rely on the PASID enable > bit in the PCI config space. > > How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant? AMD driver currently discovers capability in probe_device() and enabled it in attach_dev() path. -Vasant >
On 2024/8/16 16:29, Vasant Hegde wrote: > Yi, > > > On 8/15/2024 7:42 AM, Yi Liu wrote: >> On 2024/8/14 22:40, Jason Gunthorpe wrote: >>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote: >>> >>>> /** >>>> * enum iommufd_hw_capabilities >>>> * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking >>>> * If available, it means the following APIs >>>> * are supported: >>>> * >>>> * IOMMU_HWPT_GET_DIRTY_BITMAP >>>> * IOMMU_HWPT_SET_DIRTY_TRACKING >>>> * >>>> */ >>>> enum iommufd_hw_capabilities { >>>> IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, >>>> }; >>> >>> I think it would be appropriate to add the flag here >> >> ok. >> >>> Is it OK to rely on the PCI config space PASID enable? I see all the >>> drivers right now are turning on PASID support during probe if the >>> iommu supports it. >> >> Intel side is not ready yet as it enables pasid when the device is attached >> to a non-blocking domain. I've chatted with Baolu, and he will kindly to >> enable the pasid cap in the probe_device() op if both iommu and device has >> this cap. After that, Intel side should be fine to rely on the PASID enable >> bit in the PCI config space. >> >> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant? > > AMD driver currently discovers capability in probe_device() and enabled it in > attach_dev() path. I see. So AMD side also has a gap. Is it easy to make it suit Jason's suggestion in the above?
Yi, On 8/16/2024 5:22 PM, Yi Liu wrote: > On 2024/8/16 16:29, Vasant Hegde wrote: >> Yi, >> >> >> On 8/15/2024 7:42 AM, Yi Liu wrote: >>> On 2024/8/14 22:40, Jason Gunthorpe wrote: >>>> On Wed, Aug 14, 2024 at 04:19:13PM +0800, Yi Liu wrote: >>>> >>>>> /** >>>>> * enum iommufd_hw_capabilities >>>>> * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking >>>>> * If available, it means the following APIs >>>>> * are supported: >>>>> * >>>>> * IOMMU_HWPT_GET_DIRTY_BITMAP >>>>> * IOMMU_HWPT_SET_DIRTY_TRACKING >>>>> * >>>>> */ >>>>> enum iommufd_hw_capabilities { >>>>> IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0, >>>>> }; >>>> >>>> I think it would be appropriate to add the flag here >>> >>> ok. >>> >>>> Is it OK to rely on the PCI config space PASID enable? I see all the >>>> drivers right now are turning on PASID support during probe if the >>>> iommu supports it. >>> >>> Intel side is not ready yet as it enables pasid when the device is attached >>> to a non-blocking domain. I've chatted with Baolu, and he will kindly to >>> enable the pasid cap in the probe_device() op if both iommu and device has >>> this cap. After that, Intel side should be fine to rely on the PASID enable >>> bit in the PCI config space. >>> >>> How about SMMU and AMD iommu side? @Jason, @Suravee, @Vasant? >> >> AMD driver currently discovers capability in probe_device() and enabled it in >> attach_dev() path. > > I see. So AMD side also has a gap. Is it easy to make it suit Jason's > suggestion in the above? We can do that. We can enable ATS, PRI and PASID capability during probe time and keep it enabled always. -Vasant
On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote: > > I see. So AMD side also has a gap. Is it easy to make it suit Jason's > > suggestion in the above? > > We can do that. We can enable ATS, PRI and PASID capability during probe time > and keep it enabled always. I don't see a downside to enabling PASID at probe time, it exists to handshake with the device if the root complex is able to understand PASID TLPs. SMMU3 calls arm_smmu_probe_device arm_smmu_enable_pasid pci_enable_pasid So it looks Ok Jason
On 2024/8/16 20:52, Jason Gunthorpe wrote: > On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote: >>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's >>> suggestion in the above? >> We can do that. We can enable ATS, PRI and PASID capability during probe time >> and keep it enabled always. > I don't see a downside to enabling PASID at probe time, it exists to > handshake with the device if the root complex is able to understand > PASID TLPs. > > SMMU3 calls > arm_smmu_probe_device > arm_smmu_enable_pasid > pci_enable_pasid > > So it looks Ok I made a patch for the Intel driver. https://lore.kernel.org/linux-iommu/20240816104945.97160-1-baolu.lu@linux.intel.com/ Thanks, baolu
Hi Jason, Yi,

On 8/16/2024 6:22 PM, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 05:31:31PM +0530, Vasant Hegde wrote:
>>> I see. So AMD side also has a gap. Is it easy to make it suit Jason's
>>> suggestion in the above?
>>
>> We can do that. We can enable ATS, PRI and PASID capability during probe time
>> and keep it enabled always.
>
> I don't see a downside to enabling PASID at probe time, it exists to
> handshake with the device if the root complex is able to understand
> PASID TLPs.

Ack. I will put it on my list and make the change to the AMD driver soon.

-Vasant
On 2024/8/14 15:38, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Wednesday, August 14, 2024 2:39 PM
>>
>> On 2024/8/6 22:20, Jason Gunthorpe wrote:
>>> On Mon, Aug 05, 2024 at 05:35:17AM +0000, Tian, Kevin wrote:
>>>
>>>> Okay. With that I edited my earlier reply a bit by removing the note
>>>> of cmdline option, adding DVSEC possibility, and making it clear that
>>>> the PASID option is in vIOMMU:
>>>>
>>>> "
>>>> Overall this sounds a feasible path to move forward - starting with
>>>> the VMM to find the gap automatically if PASID is opted in vIOMMU.
>>>> Devices with hidden registers may fail. Devices with volatile
>>>> config space due to FW upgrade or cross vendors may fail to migrate.
>>>> Then evolving it to the file-based scheme, and there is time to discuss
>>>> any intermediate improvement (fixed quirks, DVSEC, etc.) in between.
>>>> "
>>>>
>>>> Jason, your thoughts?
>>>
>>> This thread is big and I've read it quickly, but I could support the
>>> above summary.
>>>
>>
>> thanks for the ideas. I think we still need a uapi to report if the device
>> supports PASID or not. Do we have agreement on where should this uapi be
>> defined? vfio or iommufd.
>
> IOMMUFD_CMD_GET_HW_INFO.

Hi Kevin, Jason,

In order to synthesize the vPASID cap, the VMM should get to know the
capabilities like Privilege mode and Execute permission from the physical
device's config space. We have two choices here as well: vfio or iommufd.

It appears better to report the capabilities via a vfio uapi (e.g.
VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
add a pair of data_uptr/data_size fields in GET_HW_INFO to report the
PASID capabilities to userspace. Please let me know your preference. :)
On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> In order to synthesize the vPASID cap, the VMM should get to know the
> capabilities like Privilege mode, Execute permission from the physical
> device's config space. We have two choices as well. vfio or iommufd.
>
> It appears to be better reporting the capabilities via vfio uapi (e.g.
> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> PASID capabilities to userspace. Please let me know your preference. :)

I don't think you'd need a new data_uptr, that doesn't quite make
sense.

What struct data do you imagine needing?

Jason
On 2024/9/9 21:04, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
>
>> In order to synthesize the vPASID cap, the VMM should get to know the
>> capabilities like Privilege mode, Execute permission from the physical
>> device's config space. We have two choices as well. vfio or iommufd.
>>
>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>> PASID capabilities to userspace. Please let me know your preference. :)
>
> I don't think you'd need a new data_uptr, that doesn't quite make
> sense
>
> What struct data do you imagine needing?

Something like below.

struct iommufd_hw_info_pasid {
	__u16 capabilities;
#define IOMMUFD_PASID_CAP_EXEC	(1 << 0)
#define IOMMUFD_PASID_CAP_PRIV	(1 << 1)
	__u8 width;
	__u8 __reserved;
};
On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
> On 2024/9/9 21:04, Jason Gunthorpe wrote:
> > On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
> >
> > > In order to synthesize the vPASID cap, the VMM should get to know the
> > > capabilities like Privilege mode, Execute permission from the physical
> > > device's config space. We have two choices as well. vfio or iommufd.
> > >
> > > It appears to be better reporting the capabilities via vfio uapi (e.g.
> > > VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
> > > add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
> > > PASID capabilities to userspace. Please let me know your preference. :)
> >
> > I don't think you'd need a new data_uptr, that doesn't quite make
> > sense
> >
> > What struct data do you imagine needing?
>
> something like below.
>
> struct iommufd_hw_info_pasid {
> 	__u16 capabilities;
> #define IOMMUFD_PASID_CAP_EXEC	(1 << 0)
> #define IOMMUFD_PASID_CAP_PRIV	(1 << 1)
> 	__u8 width;
> 	__u8 __reserved;
> };

I think you could just stick that in the top-level GET_HW_INFO struct
if you want.

It does make sense that an iommufd user would need to know that
information, especially width (but call it something better,
max_pasid_log2 or something) to successfully use the iommufd PASID
APIs anyhow.

Jason
On Mon, Sep 09, 2024 at 10:02:02PM +0800, Yi Liu wrote:
> However, I have one scalability concern. Do we only have PASID and PRI that
> needs to be synthesized by userspace? For PRI, kernel would need to report
> Outstanding Page Request Capacity to userspace. If only PASID and PRI cap,
> it's fine. But if there are more such caps, then the struct iommu_hw_info
> may grow bigger and bigger.

I don't see a particular issue with this though? As long as it is well
defined, and it is clear to userspace which parts are valid and which
are not, it is fine to get pretty big.

Jason
On 2024/9/9 21:40, Jason Gunthorpe wrote:
> On Mon, Sep 09, 2024 at 09:29:09PM +0800, Yi Liu wrote:
>> On 2024/9/9 21:04, Jason Gunthorpe wrote:
>>> On Mon, Sep 09, 2024 at 08:59:32PM +0800, Yi Liu wrote:
>>>
>>>> In order to synthesize the vPASID cap, the VMM should get to know the
>>>> capabilities like Privilege mode, Execute permission from the physical
>>>> device's config space. We have two choices as well. vfio or iommufd.
>>>>
>>>> It appears to be better reporting the capabilities via vfio uapi (e.g.
>>>> VFIO_DEVICE_FEATURE). If we want to go through iommufd, then we need to
>>>> add a pair of data_uptr/data_size fields in the GET_HW_INFO to report the
>>>> PASID capabilities to userspace. Please let me know your preference. :)
>>>
>>> I don't think you'd need a new data_uptr, that doesn't quite make
>>> sense
>>>
>>> What struct data do you imagine needing?
>>
>> something like below.
>>
>> struct iommufd_hw_info_pasid {
>> 	__u16 capabilities;
>> #define IOMMUFD_PASID_CAP_EXEC	(1 << 0)
>> #define IOMMUFD_PASID_CAP_PRIV	(1 << 1)
>> 	__u8 width;
>> 	__u8 __reserved;
>> };
>
> I think you could just stick that in the top level GET_HW_INFO struct
> if you want.
>
> It does make a sense that an iommufd user would need to know that
> information, especially width (but call it something better,
> max_pasid_log2 or something) to successefully use the iommfd PASID
> APIs anyhow.

I see. We may define IOMMUFD_PASID_CAP_EXEC and IOMMUFD_PASID_CAP_PRIV
in struct iommu_hw_info::out_capabilities, and add a max_pasid_log2
field to struct iommu_hw_info.

However, I have one scalability concern. Do we only have PASID and PRI
that need to be synthesized by userspace? For PRI, the kernel would need
to report the Outstanding Page Request Capacity to userspace. If it is
only the PASID and PRI caps, that's fine. But if there are more such
caps, then struct iommu_hw_info may grow bigger and bigger.