Message ID: 159534667974.28840.2045034360240786644.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
Series: Add VFIO mediated device support and DEV-MSI support for the idxd driver
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
"RFC" to me means "I don't really think this is mergable, so I'm
throwing it out there." Which implies you know it needs more work
before others should review it as you are not comfortable with it :(
So, back-of-the-queue you go...
greg k-h
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
> IMS (now dev-msi):
> With recommendations from Jason/Thomas/Dan on making IMS more generic:
> Pass a non-pci generic device (struct device) for IMS management instead of mdev
> Remove all references to mdev and symbol_get/put
> Remove all references to IMS in common code and replace with dev-msi
> Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
> Create a generic dev-msi domain with and without interrupt remapping enabled.
> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis

I didn't dig into the details of irq handling to really check this, but the big picture of this is much more in line with what I would expect for this kind of ability.

> Link to previous discussions with Jason:
> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/
> The emulation part that can be moved to user space is very small due to the majority of the
> emulations being control bits and need to reside in the kernel. We can revisit the necessity of
> moving the small emulation part to userspace and required architectural changes at a later time.

The point here is that you already have a user space interface for these queues that already has kernel support to twiddle the control bits. Generally I'd expect extending that existing kernel code to do the small bit more needed for mapping the queue through to PCI emulation to be smaller than the 2kloc of new code here to put all the emulation and support framework in the kernel, and it exposes a lower attack surface of kernel code to the guest.

> The kernel can specify the requirements for these callback functions
> (e.g., the driver is not expected to block, or not expected to take
> a lock in the callback function).

I didn't notice any of this in the patch series? What is the calling context for the platform_msi_ops? I think I already mentioned that ideally we'd need blocking/sleeping. The big selling point is that IMS allows this data to move off-chip, which means accessing it is no longer just an atomic write to some on-chip memory.

These details should be documented in the comment on top of platform_msi_ops.

I'm actually a little confused how idxd_ims_irq_mask() manages this - I thought IRQ masking should be synchronous; shouldn't there at least be a flushing read to ensure that new MSIs are stopped and any in flight are flushed to the APIC?

Jason
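To make the new interface concrete, below is a minimal sketch of how a driver might consume the proposed dev-msi API. Only the names dev_msi_domain_alloc_irqs()/dev_msi_domain_free_irqs() come from the cover letter; the exact signatures, the platform_msi_ops layout, and the idxd_ims_* callback names are assumptions for illustration, not the posted code.

	/*
	 * Hypothetical consumer of the proposed dev-msi API. Signatures and
	 * the platform_msi_ops fields are assumed; only the dev_msi_domain_*
	 * names appear in the cover letter.
	 */
	static const struct platform_msi_ops idxd_ims_ops = {
		.irq_mask	= idxd_ims_irq_mask,	/* assumed callback names */
		.irq_unmask	= idxd_ims_irq_unmask,
		.write_msg	= idxd_ims_write_msg,	/* program addr/data into an IMS entry */
	};

	static int idxd_mdev_setup_ims(struct device *dev, unsigned int nvec)
	{
		int rc;

		/* allocate nvec vectors from the generic dev-msi irq domain */
		rc = dev_msi_domain_alloc_irqs(dev, nvec, &idxd_ims_ops);
		if (rc < 0)
			return rc;

		/* ... request_irq() on each allocated vector as usual ... */
		return 0;
	}

	static void idxd_mdev_teardown_ims(struct device *dev)
	{
		/* ... free_irq() each vector first ... */
		dev_msi_domain_free_irqs(dev);
	}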
On 7/21/2020 9:28 AM, Greg KH wrote:
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>> v2:
>
> "RFC" to me means "I don't really think this is mergeable, so I'm
> throwing it out there." Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(
>
> So, back-of-the-queue you go...
>
> greg k-h

Hi Greg! Yes, this absolutely needs more work! I think it's in pretty good shape, but it has reached the point where it needs the talented eyes of reviewers from outside of Intel. I was really hoping to get feedback from folks like Jason (Thanks Jason!!) and KVM and VFIO experts like Alex, Paolo, Eric, and Kirti.

I can understand that you are quite busy and cannot necessarily provide a detailed review at this phase. Would you prefer to be cc'd on code at this phase in the future? Or should we reserve putting you on the cc for times when we know it's ready for merge?
On 7/21/2020 9:45 AM, Jason Gunthorpe wrote: > On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote: >> v2: >> IMS (now dev-msi): >> With recommendations from Jason/Thomas/Dan on making IMS more generic: >> Pass a non-pci generic device(struct device) for IMS management instead of mdev >> Remove all references to mdev and symbol_get/put >> Remove all references to IMS in common code and replace with dev-msi >> remove dynamic allocation of platform-msi interrupts: no groups,no new msi list or list helpers >> Create a generic dev-msi domain with and without interrupt remapping enabled. >> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis > > I didn't dig into the details of irq handling to really check this, > but the big picture of this is much more in line with what I would > expect for this kind of ability. > >> Link to previous discussions with Jason: >> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/ >> The emulation part that can be moved to user space is very small due to the majority of the >> emulations being control bits and need to reside in the kernel. We can revisit the necessity of >> moving the small emulation part to userspace and required architectural changes at a later time. > > The point here is that you already have a user space interface for > these queues that already has kernel support to twiddle the control > bits. Generally I'd expect extending that existing kernel code to do > the small bit more needed for mapping the queue through to PCI > emulation to be smaller than the 2kloc of new code here to put all the > emulation and support framework in the kernel, and exposes a lower > attack surface of kernel code to the guest. > >> The kernel can specify the requirements for these callback functions >> (e.g., the driver is not expected to block, or not expected to take >> a lock in the callback function). > > I didn't notice any of this in the patch series? What is the calling > context for the platform_msi_ops ? I think I already mentioned that > ideally we'd need blocking/sleeping. The big selling point is that IMS > allows this data to move off-chip, which means accessing it is no > longer just an atomic write to some on-chip memory. > > These details should be documented in the comment on top of > platform_msi_ops > > I'm actually a little confused how idxd_ims_irq_mask() manages this - > I thought IRQ masking should be synchronous, shouldn't there at least be a > flushing read to ensure that new MSI's are stopped and any in flight > are flushed to the APIC? You are right Jason. It's missing a flushing read. > > Jason >
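Since the missing flushing read comes up again below, here is a rough sketch of the pattern being agreed on: set the mask bit, then read the same register back so the posted MMIO write (and any MSI writes already in flight behind it) are flushed before the mask returns. The register layout and names are assumptions modeled on MSI-X vector control, not the real idxd IMS format.

	#define EXAMPLE_IMS_CTRL_MASKBIT	0x1	/* assumed per-entry mask bit */

	/*
	 * Sketch of a synchronous IMS mask; 'entry_ctrl' points at an assumed
	 * per-entry control register in the device's interrupt message storage.
	 */
	static void example_ims_irq_mask(void __iomem *entry_ctrl)
	{
		u32 ctrl;

		ctrl = ioread32(entry_ctrl);
		iowrite32(ctrl | EXAMPLE_IMS_CTRL_MASKBIT, entry_ctrl);

		/*
		 * Flushing read: MMIO writes are posted, so read the register
		 * back to ensure the mask has taken effect and any in-flight
		 * MSI writes have been pushed ahead to the APIC before we
		 * return.
		 */
		ioread32(entry_ctrl);
	}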
On Tue, Jul 21, 2020 at 9:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> > v2:
>
> "RFC" to me means "I don't really think this is mergeable, so I'm
> throwing it out there." Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(

There's a full-blown Reviewed-by from me on the irq changes. The VFIO / mdev changes looked ok to me, but I did not feel comfortable with / did not have time to sign off on them. At the same time I did not see much to be gained by keeping those internal. So "RFC" in this case is a bit modest. It's more "internal reviewer said this looks like it is going in the right direction, but wants more community discussion on the approach."

> So, back-of-the-queue you go...

Let's consider this not-RFC in that context. The drivers/base/ pieces have my review for you; the rest are dmaengine and vfio subsystem concerns that could use some commentary.
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Wednesday, July 22, 2020 12:45 AM
>
> > Link to previous discussions with Jason:
> > https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/
> > The emulation part that can be moved to user space is very small due to the majority of the
> > emulations being control bits and need to reside in the kernel. We can revisit the necessity of
> > moving the small emulation part to userspace and required architectural changes at a later time.
>
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
>

We discussed in v1 why extending the user space interface is not a strong motivation at the current stage:
https://lore.kernel.org/lkml/20200513124016.GG19158@mellanox.com/

In a nutshell, applications don't require the raw WQ controllability that guest kernel drivers may expect. Extending the DSA user space interface to be another passthrough interface just for virtualization needs is less compelling than leveraging the established VFIO/mdev framework (with the major merit that existing user space VMMs just work w/o any change as long as they already support the VFIO uAPI).

And in this version we split the previous 2kloc mdev patch into three parts: [09] mdev framework and callbacks; [10] mmio/pci_cfg emulation; and [11] handling of control commands. Only patch [10] is purely about emulation (~500 LOC), while the other two parts are tightly coupled to physical resource management.

In the last review you said that you didn't hard-NAK this approach and would like to hear the opinion of virtualization folks. In this version we CCed the KVM mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud Hypervisor), etc. Let's see how they feel about this approach.

Thanks
Kevin
On 7/21/2020 11:00 AM, Dave Jiang wrote: > > > On 7/21/2020 9:45 AM, Jason Gunthorpe wrote: >> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote: >>> v2: >>> IMS (now dev-msi): >>> With recommendations from Jason/Thomas/Dan on making IMS more generic: >>> Pass a non-pci generic device(struct device) for IMS management >>> instead of mdev >>> Remove all references to mdev and symbol_get/put >>> Remove all references to IMS in common code and replace with dev-msi >>> remove dynamic allocation of platform-msi interrupts: no groups,no >>> new msi list or list helpers >>> Create a generic dev-msi domain with and without interrupt remapping >>> enabled. >>> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis >> >> I didn't dig into the details of irq handling to really check this, >> but the big picture of this is much more in line with what I would >> expect for this kind of ability. >> >>> Link to previous discussions with Jason: >>> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/ >>> >>> The emulation part that can be moved to user space is very small due >>> to the majority of the >>> emulations being control bits and need to reside in the kernel. We >>> can revisit the necessity of >>> moving the small emulation part to userspace and required >>> architectural changes at a later time. >> >> The point here is that you already have a user space interface for >> these queues that already has kernel support to twiddle the control >> bits. Generally I'd expect extending that existing kernel code to do >> the small bit more needed for mapping the queue through to PCI >> emulation to be smaller than the 2kloc of new code here to put all the >> emulation and support framework in the kernel, and exposes a lower >> attack surface of kernel code to the guest. >> >>> The kernel can specify the requirements for these callback functions >>> (e.g., the driver is not expected to block, or not expected to take >>> a lock in the callback function). >> >> I didn't notice any of this in the patch series? What is the calling >> context for the platform_msi_ops ? I think I already mentioned that >> ideally we'd need blocking/sleeping. The big selling point is that IMS >> allows this data to move off-chip, which means accessing it is no >> longer just an atomic write to some on-chip memory. >> >> These details should be documented in the comment on top of >> platform_msi_ops so the platform_msi_ops care called from the same context as the existing msi_ops for instance, we are not adding anything new. I think the above comment is a little misleading I will remove it next time around. Also, I thought even the current write to on-chip memory is not atomic.. could you let me know which piece of code you are referring to? Since the driver gets to write to the off chip memory, shouldn't it be the drivers responsibility to call it from a sleeping/blocking context? >> >> I'm actually a little confused how idxd_ims_irq_mask() manages this - >> I thought IRQ masking should be synchronous, shouldn't there at least >> be a >> flushing read to ensure that new MSI's are stopped and any in flight >> are flushed to the APIC? > > You are right Jason. It's missing a flushing read. > >> >> Jason >> > .
On Wed, Jul 22, 2020 at 10:31:28AM -0700, Dey, Megha wrote:
> > > I didn't notice any of this in the patch series? What is the calling
> > > context for the platform_msi_ops ? I think I already mentioned that
> > > ideally we'd need blocking/sleeping. The big selling point is that IMS
> > > allows this data to move off-chip, which means accessing it is no
> > > longer just an atomic write to some on-chip memory.
> > >
> > > These details should be documented in the comment on top of
> > > platform_msi_ops
>
> so the platform_msi_ops are called from the same context as the existing
> msi_ops for instance, we are not adding anything new. I think the above
> comment is a little misleading, I will remove it next time around.

If it is true that all calls are under driver control then I think it would be good to document that. I actually don't know off hand if mask/unmask are restricted like that.

As this is an op a driver has to implement, vs the arch, it probably needs a bit more hand-holding.

> Also, I thought even the current write to on-chip memory is not
> atomic..

The writel to the MSI-X table in MMIO memory is 'atomic'.

Jason
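The documentation being asked for might look like the kerneldoc sketch below, sitting on top of the ops structure. The structure layout is assumed from the cover letter, and the calling-context text is deliberately a placeholder, since pinning that down is exactly the open question in this subthread.

	#include <linux/msi.h>

	/**
	 * struct platform_msi_ops - device-managed (IMS-style) interrupt
	 *                           storage ops (layout assumed for illustration)
	 * @irq_mask:	mask an interrupt source. Calling context: <to be
	 *		documented, e.g. "called with the irq descriptor lock
	 *		held, must not sleep" or "only called from process
	 *		context under driver control, may sleep"> - this matters
	 *		because IMS state may live off-chip and masking may not
	 *		be a single posted MMIO write.
	 * @irq_unmask:	unmask an interrupt source; same context rules as
	 *		@irq_mask.
	 * @write_msg:	program address/data for a vector into device storage.
	 */
	struct platform_msi_ops {
		void (*irq_mask)(struct msi_desc *desc);
		void (*irq_unmask)(struct msi_desc *desc);
		void (*write_msg)(struct msi_desc *desc, struct msi_msg *msg);
	};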
On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote: > In a nutshell, applications don't require raw WQ controllability as guest > kernel drivers may expect. Extending DSA user space interface to be another > passthrough interface just for virtualization needs is less compelling than > leveraging established VFIO/mdev framework (with the major merit that > existing user space VMMs just work w/o any change as long as they already > support VFIO uAPI). Sure, but the above is how the cover letter should have summarized that discussion, not as "it is not much code difference" > In last review you said that you didn't hard nak this approach and would > like to hear opinion from virtualization guys. In this version we CCed KVM > mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud > hypervisor), etc. Let's see how they feel about this approach. Yes, the VFIO community should decide. If we are doing emulation tasks in the kernel now, then I can think of several nice semi-emulated mdevs to propose. This will not be some one off, but the start of a widely copied pattern. Jason
On Thu, 23 Jul 2020 21:19:30 -0300 Jason Gunthorpe <jgg@mellanox.com> wrote: > On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote: > > In a nutshell, applications don't require raw WQ controllability as guest > > kernel drivers may expect. Extending DSA user space interface to be another > > passthrough interface just for virtualization needs is less compelling than > > leveraging established VFIO/mdev framework (with the major merit that > > existing user space VMMs just work w/o any change as long as they already > > support VFIO uAPI). > > Sure, but the above is how the cover letter should have summarized > that discussion, not as "it is not much code difference" > > > In last review you said that you didn't hard nak this approach and would > > like to hear opinion from virtualization guys. In this version we CCed KVM > > mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud > > hypervisor), etc. Let's see how they feel about this approach. > > Yes, the VFIO community should decide. > > If we are doing emulation tasks in the kernel now, then I can think of > several nice semi-emulated mdevs to propose. > > This will not be some one off, but the start of a widely copied > pattern. And that's definitely a concern, there should be a reason for implementing device emulation in the kernel beyond an easy path to get a device exposed up through a virtualization stack. The entire idea of mdev is the mediation of access to a device to make it safe for a user and to fit within the vfio device API. Mediation, emulation, and virtualization can be hard to differentiate, and there is some degree of emulation required to fill out the device API, for vfio-pci itself included. So I struggle with a specific measure of where to draw the line, and also whose authority it is to draw that line. I don't think it's solely mine, that's something we need to decide as a community. If you see this as an abuse of the framework, then let's identify those specific issues and come up with a better approach. As we've discussed before, things like basic PCI config space emulation are acceptable overhead and low risk (imo) and some degree of register emulation is well within the territory of an mdev driver. Drivers are accepting some degree of increased attack surface by each addition of a uAPI and the complexity of those uAPIs, but it seems largely a decision for those drivers whether they're willing to take on that responsibility and burden. At some point, possibly in the near-ish future, we might have a vfio-user interface with userspace vfio-over-socket servers that might be able to consume existing uAPIs and offload some of this complexity and emulation to userspace while still providing an easy path to insert devices into the virtualization stack. Hopefully if/when that comes along, it would provide these sorts of drivers an opportunity to offload some of the current overhead out to userspace, but I'm not sure it's worth denying a mainline implementation now. Thanks, Alex
On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> If you see this as an abuse of the framework, then let's identify those
> specific issues and come up with a better approach. As we've discussed
> before, things like basic PCI config space emulation are acceptable
> overhead and low risk (imo) and some degree of register emulation is
> well within the territory of an mdev driver.

What troubles me is that idxd already has a direct userspace interface to its HW, and does userspace DMA. The purpose of this mdev is to provide a second direct userspace interface that is a little different and trivially plugs into the virtualization stack.

I don't think VFIO should be the only entry point to virtualization. If we say the universe of devices doing user space DMA must also implement a VFIO mdev to plug into virtualization then it will be a lot of mdevs.

I would prefer to see that the existing userspace interface have the extra needed bits for virtualization (eg by having appropriate internal kernel APIs to make this easy) and all the emulation to build the synthetic PCI device be done in userspace. Not only is it better for security, it keeps things to one device driver per device.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 7, 2020 8:20 PM
>
> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
>
> > If you see this as an abuse of the framework, then let's identify those
> > specific issues and come up with a better approach. As we've discussed
> > before, things like basic PCI config space emulation are acceptable
> > overhead and low risk (imo) and some degree of register emulation is
> > well within the territory of an mdev driver.
>
> What troubles me is that idxd already has a direct userspace interface
> to its HW, and does userspace DMA. The purpose of this mdev is to
> provide a second direct userspace interface that is a little different
> and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides) are two distinct usages IMO (at least in the idxd context), and this might be the main divergence between us, so let me put more words here. If we could reach consensus in this matter, which direction to go would be clearer.

First, a passthrough interface has some unique requirements which are not commonly observed in a userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside the guest:
  * PASID allocation/free (on some platforms);
  * bind/unbind guest mm/page table (nested translation);
  * invalidate IOMMU cache/iotlb for guest page table changes;
  * report page requests from device to guest;
  * forward page responses from guest to device;
- Configuring irqbypass for posted interrupts;
- ...

Second, a passthrough interface requires delegating raw controllability of the subdevice to the guest driver, while the same delegation might not be required for implementing a userspace DMA interface (especially for modern devices which support SVA). For example, idxd allows the following settings per wq (the guest driver may configure them in any combination):

- put in dedicated or shared mode;
- enable/disable SVA;
- associate a guest-provided PASID to an MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for the guest);
- collect error status;
- ...

We plan to support idxd userspace DMA with SVA. The driver just needs to prepare a wq with a predefined configuration (e.g. shared, SVA, etc.), bind the process mm to the IOMMU (non-nested) and then map the portal to userspace. The goal that userspace can do DMA to the associated wq doesn't change the fact that the wq is still *owned* and *controlled* by the kernel driver. However, as far as passthrough is concerned, the wq is considered 'owned' by the guest driver, thus we need an interface which can support low-level *controllability* from the guest driver. It is sort of a mess in uAPI when mixing the two together.

Based on the above two reasons, we see distinct requirements between userspace DMA and passthrough interfaces, at least in the idxd context (though other devices may have less distinction in-between). Therefore, we didn't see the value/necessity of reinventing the wheel that mdev already handles well in order to evolve a simple application-oriented userspace DMA interface into a complex guest-driver-oriented passthrough interface. The complexity of doing so would incur far more kernel-side changes than the portion of emulation code that you've been concerned about...

> I don't think VFIO should be the only entry point to
> virtualization. If we say the universe of devices doing user space DMA
> must also implement a VFIO mdev to plug into virtualization then it
> will be a lot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a case-by-case decision. If a userspace DMA interface can be easily adapted to be a passthrough one, it might be the choice. But for idxd, we see mdev as a much better fit here, given the big difference between what userspace DMA requires and what the guest driver requires in this hw.

> I would prefer to see that the existing userspace interface have the
> extra needed bits for virtualization (eg by having appropriate
> internal kernel APIs to make this easy) and all the emulation to build
> the synthetic PCI device be done in userspace.

In the end what decides the direction is the amount of changes that we have to put in the kernel, not whether we call it 'emulation'. For idxd, adding special passthrough requirements (guest SVA, dirty tracking, etc.) and raw controllability to the simple userspace DMA interface is for sure making the kernel more complex than reusing the mdev framework (plus some degree of emulation mockup behind). Not to mention the merit of uAPI compatibility with mdev...

Thanks
Kevin
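To illustrate the userspace-DMA-with-SVA flow described above (pre-configured shared wq, bind the process mm to the IOMMU non-nested, mmap the portal), here is a minimal kernel-side sketch. iommu_sva_bind_device() and iommu_sva_get_pasid() are existing IOMMU APIs of this era; struct example_wq and the surrounding helper are invented for illustration and are not the idxd driver's actual code.

	#include <linux/iommu.h>

	struct example_wq {
		struct device *dev;	/* parent PCI device */
		struct iommu_sva *sva;
		u32 pasid;
	};

	/* Called when a process opens a kernel-owned, shared, SVA-enabled wq. */
	static int example_wq_open_for_user_dma(struct example_wq *wq,
						struct mm_struct *mm)
	{
		struct iommu_sva *sva;

		/* Non-nested bind of the process address space; yields a PASID. */
		sva = iommu_sva_bind_device(wq->dev, mm, NULL);
		if (IS_ERR(sva))
			return PTR_ERR(sva);

		wq->sva = sva;
		wq->pasid = iommu_sva_get_pasid(sva);

		/*
		 * Userspace then mmap()s the wq portal and submits descriptors
		 * tagged with this PASID. The wq stays owned and controlled by
		 * the kernel driver, unlike the mdev passthrough case where
		 * raw wq control is delegated to the guest driver.
		 */
		return 0;
	}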
On Mon, 10 Aug 2020 07:32:24 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, August 7, 2020 8:20 PM > > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote: > > > > > If you see this as an abuse of the framework, then let's identify those > > > specific issues and come up with a better approach. As we've discussed > > > before, things like basic PCI config space emulation are acceptable > > > overhead and low risk (imo) and some degree of register emulation is > > > well within the territory of an mdev driver. > > > > What troubles me is that idxd already has a direct userspace interface > > to its HW, and does userspace DMA. The purpose of this mdev is to > > provide a second direct userspace interface that is a little different > > and trivially plugs into the virtualization stack. > > No. Userspace DMA and subdevice passthrough (what mdev provides) > are two distinct usages IMO (at least in idxd context). and this might > be the main divergence between us, thus let me put more words here. > If we could reach consensus in this matter, which direction to go > would be clearer. > > First, a passthrough interface requires some unique requirements > which are not commonly observed in an userspace DMA interface, e.g.: > > - Tracking DMA dirty pages for live migration; > - A set of interfaces for using SVA inside guest; > * PASID allocation/free (on some platforms); > * bind/unbind guest mm/page table (nested translation); > * invalidate IOMMU cache/iotlb for guest page table changes; > * report page request from device to guest; > * forward page response from guest to device; > - Configuring irqbypass for posted interrupt; > - ... > > Second, a passthrough interface requires delegating raw controllability > of subdevice to guest driver, while the same delegation might not be > required for implementing an userspace DMA interface (especially for > modern devices which support SVA). For example, idxd allows following > setting per wq (guest driver may configure them in any combination): > - put in dedicated or shared mode; > - enable/disable SVA; > - Associate guest-provided PASID to MSI/IMS entry; > - set threshold; > - allow/deny privileged access; > - allocate/free interrupt handle (enlightened for guest); > - collect error status; > - ... > > We plan to support idxd userspace DMA with SVA. The driver just needs > to prepare a wq with a predefined configuration (e.g. shared, SVA, > etc.), bind the process mm to IOMMU (non-nested) and then map > the portal to userspace. The goal that userspace can do DMA to > associated wq doesn't change the fact that the wq is still *owned* > and *controlled* by kernel driver. However as far as passthrough > is concerned, the wq is considered 'owned' by the guest driver thus > we need an interface which can support low-level *controllability* > from guest driver. It is sort of a mess in uAPI when mixing the > two together. > > Based on above two reasons, we see distinct requirements between > userspace DMA and passthrough interfaces, at least in idxd context > (though other devices may have less distinction in-between). Therefore, > we didn't see the value/necessity of reinventing the wheel that mdev > already handles well to evolve an simple application-oriented usespace > DMA interface to a complex guest-driver-oriented passthrough interface. 
> The complexity of doing so would incur far more kernel-side changes > than the portion of emulation code that you've been concerned about... > > > > > I don't think VFIO should be the only entry point to > > virtualization. If we say the universe of devices doing user space DMA > > must also implement a VFIO mdev to plug into virtualization then it > > will be alot of mdevs. > > Certainly VFIO will not be the only entry point. and This has to be a > case-by-case decision. If an userspace DMA interface can be easily > adapted to be a passthrough one, it might be the choice. But for idxd, > we see mdev a much better fit here, given the big difference between > what userspace DMA requires and what guest driver requires in this hw. > > > > > I would prefer to see that the existing userspace interface have the > > extra needed bits for virtualization (eg by having appropriate > > internal kernel APIs to make this easy) and all the emulation to build > > the synthetic PCI device be done in userspace. > > In the end what decides the direction is the amount of changes that > we have to put in kernel, not whether we call it 'emulation'. For idxd, > adding special passthrough requirements (guest SVA, dirty tracking, > etc.) and raw controllability to the simple userspace DMA interface > is for sure making kernel more complex than reusing the mdev > framework (plus some degree of emulation mockup behind). Not to > mention the merit of uAPI compatibility with mdev... I agree with a lot of this argument, exposing a device through a userspace interface versus allowing user access to a device through a userspace interface are different levels of abstraction and control. In an ideal world, perhaps we could compose one from the other, but I don't think the existence of one is proof that the other is redundant. That's not to say that mdev/vfio isn't ripe for abuse in this space, but I'm afraid the test for that abuse is probably much more subtle. I'll also remind folks that LPC is coming up in just a couple short weeks and this might be something we should discuss (virtually) in-person. uconf CfPs are currently open. </plug> Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, August 12, 2020 1:01 AM > > On Mon, 10 Aug 2020 07:32:24 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, August 7, 2020 8:20 PM > > > > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote: > > > > > > > If you see this as an abuse of the framework, then let's identify those > > > > specific issues and come up with a better approach. As we've discussed > > > > before, things like basic PCI config space emulation are acceptable > > > > overhead and low risk (imo) and some degree of register emulation is > > > > well within the territory of an mdev driver. > > > > > > What troubles me is that idxd already has a direct userspace interface > > > to its HW, and does userspace DMA. The purpose of this mdev is to > > > provide a second direct userspace interface that is a little different > > > and trivially plugs into the virtualization stack. > > > > No. Userspace DMA and subdevice passthrough (what mdev provides) > > are two distinct usages IMO (at least in idxd context). and this might > > be the main divergence between us, thus let me put more words here. > > If we could reach consensus in this matter, which direction to go > > would be clearer. > > > > First, a passthrough interface requires some unique requirements > > which are not commonly observed in an userspace DMA interface, e.g.: > > > > - Tracking DMA dirty pages for live migration; > > - A set of interfaces for using SVA inside guest; > > * PASID allocation/free (on some platforms); > > * bind/unbind guest mm/page table (nested translation); > > * invalidate IOMMU cache/iotlb for guest page table changes; > > * report page request from device to guest; > > * forward page response from guest to device; > > - Configuring irqbypass for posted interrupt; > > - ... > > > > Second, a passthrough interface requires delegating raw controllability > > of subdevice to guest driver, while the same delegation might not be > > required for implementing an userspace DMA interface (especially for > > modern devices which support SVA). For example, idxd allows following > > setting per wq (guest driver may configure them in any combination): > > - put in dedicated or shared mode; > > - enable/disable SVA; > > - Associate guest-provided PASID to MSI/IMS entry; > > - set threshold; > > - allow/deny privileged access; > > - allocate/free interrupt handle (enlightened for guest); > > - collect error status; > > - ... > > > > We plan to support idxd userspace DMA with SVA. The driver just needs > > to prepare a wq with a predefined configuration (e.g. shared, SVA, > > etc.), bind the process mm to IOMMU (non-nested) and then map > > the portal to userspace. The goal that userspace can do DMA to > > associated wq doesn't change the fact that the wq is still *owned* > > and *controlled* by kernel driver. However as far as passthrough > > is concerned, the wq is considered 'owned' by the guest driver thus > > we need an interface which can support low-level *controllability* > > from guest driver. It is sort of a mess in uAPI when mixing the > > two together. > > > > Based on above two reasons, we see distinct requirements between > > userspace DMA and passthrough interfaces, at least in idxd context > > (though other devices may have less distinction in-between). 
Therefore, > > we didn't see the value/necessity of reinventing the wheel that mdev > > already handles well to evolve an simple application-oriented usespace > > DMA interface to a complex guest-driver-oriented passthrough interface. > > The complexity of doing so would incur far more kernel-side changes > > than the portion of emulation code that you've been concerned about... > > > > > > > > I don't think VFIO should be the only entry point to > > > virtualization. If we say the universe of devices doing user space DMA > > > must also implement a VFIO mdev to plug into virtualization then it > > > will be alot of mdevs. > > > > Certainly VFIO will not be the only entry point. and This has to be a > > case-by-case decision. If an userspace DMA interface can be easily > > adapted to be a passthrough one, it might be the choice. But for idxd, > > we see mdev a much better fit here, given the big difference between > > what userspace DMA requires and what guest driver requires in this hw. > > > > > > > > I would prefer to see that the existing userspace interface have the > > > extra needed bits for virtualization (eg by having appropriate > > > internal kernel APIs to make this easy) and all the emulation to build > > > the synthetic PCI device be done in userspace. > > > > In the end what decides the direction is the amount of changes that > > we have to put in kernel, not whether we call it 'emulation'. For idxd, > > adding special passthrough requirements (guest SVA, dirty tracking, > > etc.) and raw controllability to the simple userspace DMA interface > > is for sure making kernel more complex than reusing the mdev > > framework (plus some degree of emulation mockup behind). Not to > > mention the merit of uAPI compatibility with mdev... > > I agree with a lot of this argument, exposing a device through a > userspace interface versus allowing user access to a device through a > userspace interface are different levels of abstraction and control. > In an ideal world, perhaps we could compose one from the other, but I > don't think the existence of one is proof that the other is redundant. > That's not to say that mdev/vfio isn't ripe for abuse in this space, > but I'm afraid the test for that abuse is probably much more subtle. > > I'll also remind folks that LPC is coming up in just a couple short > weeks and this might be something we should discuss (virtually) > in-person. uconf CfPs are currently open. </plug> Thanks, > Yes, LPC is a good place to reach consensus. btw I saw there is already one VFIO topic called "device assignment/sub-assignment". Do you think whether this can be covered under that topic, or makes more sense to be a new one? Thanks Kevin
On Wed, 12 Aug 2020 01:58:00 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, August 12, 2020 1:01 AM > > > > On Mon, 10 Aug 2020 07:32:24 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Friday, August 7, 2020 8:20 PM > > > > > > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote: > > > > > > > > > If you see this as an abuse of the framework, then let's identify those > > > > > specific issues and come up with a better approach. As we've discussed > > > > > before, things like basic PCI config space emulation are acceptable > > > > > overhead and low risk (imo) and some degree of register emulation is > > > > > well within the territory of an mdev driver. > > > > > > > > What troubles me is that idxd already has a direct userspace interface > > > > to its HW, and does userspace DMA. The purpose of this mdev is to > > > > provide a second direct userspace interface that is a little different > > > > and trivially plugs into the virtualization stack. > > > > > > No. Userspace DMA and subdevice passthrough (what mdev provides) > > > are two distinct usages IMO (at least in idxd context). and this might > > > be the main divergence between us, thus let me put more words here. > > > If we could reach consensus in this matter, which direction to go > > > would be clearer. > > > > > > First, a passthrough interface requires some unique requirements > > > which are not commonly observed in an userspace DMA interface, e.g.: > > > > > > - Tracking DMA dirty pages for live migration; > > > - A set of interfaces for using SVA inside guest; > > > * PASID allocation/free (on some platforms); > > > * bind/unbind guest mm/page table (nested translation); > > > * invalidate IOMMU cache/iotlb for guest page table changes; > > > * report page request from device to guest; > > > * forward page response from guest to device; > > > - Configuring irqbypass for posted interrupt; > > > - ... > > > > > > Second, a passthrough interface requires delegating raw controllability > > > of subdevice to guest driver, while the same delegation might not be > > > required for implementing an userspace DMA interface (especially for > > > modern devices which support SVA). For example, idxd allows following > > > setting per wq (guest driver may configure them in any combination): > > > - put in dedicated or shared mode; > > > - enable/disable SVA; > > > - Associate guest-provided PASID to MSI/IMS entry; > > > - set threshold; > > > - allow/deny privileged access; > > > - allocate/free interrupt handle (enlightened for guest); > > > - collect error status; > > > - ... > > > > > > We plan to support idxd userspace DMA with SVA. The driver just needs > > > to prepare a wq with a predefined configuration (e.g. shared, SVA, > > > etc.), bind the process mm to IOMMU (non-nested) and then map > > > the portal to userspace. The goal that userspace can do DMA to > > > associated wq doesn't change the fact that the wq is still *owned* > > > and *controlled* by kernel driver. However as far as passthrough > > > is concerned, the wq is considered 'owned' by the guest driver thus > > > we need an interface which can support low-level *controllability* > > > from guest driver. It is sort of a mess in uAPI when mixing the > > > two together. 
> > > > > > Based on above two reasons, we see distinct requirements between > > > userspace DMA and passthrough interfaces, at least in idxd context > > > (though other devices may have less distinction in-between). Therefore, > > > we didn't see the value/necessity of reinventing the wheel that mdev > > > already handles well to evolve an simple application-oriented usespace > > > DMA interface to a complex guest-driver-oriented passthrough interface. > > > The complexity of doing so would incur far more kernel-side changes > > > than the portion of emulation code that you've been concerned about... > > > > > > > > > > > I don't think VFIO should be the only entry point to > > > > virtualization. If we say the universe of devices doing user space DMA > > > > must also implement a VFIO mdev to plug into virtualization then it > > > > will be alot of mdevs. > > > > > > Certainly VFIO will not be the only entry point. and This has to be a > > > case-by-case decision. If an userspace DMA interface can be easily > > > adapted to be a passthrough one, it might be the choice. But for idxd, > > > we see mdev a much better fit here, given the big difference between > > > what userspace DMA requires and what guest driver requires in this hw. > > > > > > > > > > > I would prefer to see that the existing userspace interface have the > > > > extra needed bits for virtualization (eg by having appropriate > > > > internal kernel APIs to make this easy) and all the emulation to build > > > > the synthetic PCI device be done in userspace. > > > > > > In the end what decides the direction is the amount of changes that > > > we have to put in kernel, not whether we call it 'emulation'. For idxd, > > > adding special passthrough requirements (guest SVA, dirty tracking, > > > etc.) and raw controllability to the simple userspace DMA interface > > > is for sure making kernel more complex than reusing the mdev > > > framework (plus some degree of emulation mockup behind). Not to > > > mention the merit of uAPI compatibility with mdev... > > > > I agree with a lot of this argument, exposing a device through a > > userspace interface versus allowing user access to a device through a > > userspace interface are different levels of abstraction and control. > > In an ideal world, perhaps we could compose one from the other, but I > > don't think the existence of one is proof that the other is redundant. > > That's not to say that mdev/vfio isn't ripe for abuse in this space, > > but I'm afraid the test for that abuse is probably much more subtle. > > > > I'll also remind folks that LPC is coming up in just a couple short > > weeks and this might be something we should discuss (virtually) > > in-person. uconf CfPs are currently open. </plug> Thanks, > > > > Yes, LPC is a good place to reach consensus. btw I saw there is > already one VFIO topic called "device assignment/sub-assignment". > Do you think whether this can be covered under that topic, or > makes more sense to be a new one? All the things listed in the CFP are only potential topics to get ideas flowing, there is currently no proposal to talk about sub-assignment. I'd suggest submitting separate topics for each and if we run into time constraints we can ask that they might be combined together. Thanks, Alex
On 2020/8/10 下午3:32, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, August 7, 2020 8:20 PM >> >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote: >> >>> If you see this as an abuse of the framework, then let's identify those >>> specific issues and come up with a better approach. As we've discussed >>> before, things like basic PCI config space emulation are acceptable >>> overhead and low risk (imo) and some degree of register emulation is >>> well within the territory of an mdev driver. >> What troubles me is that idxd already has a direct userspace interface >> to its HW, and does userspace DMA. The purpose of this mdev is to >> provide a second direct userspace interface that is a little different >> and trivially plugs into the virtualization stack. > No. Userspace DMA and subdevice passthrough (what mdev provides) > are two distinct usages IMO (at least in idxd context). and this might > be the main divergence between us, thus let me put more words here. > If we could reach consensus in this matter, which direction to go > would be clearer. > > First, a passthrough interface requires some unique requirements > which are not commonly observed in an userspace DMA interface, e.g.: > > - Tracking DMA dirty pages for live migration; > - A set of interfaces for using SVA inside guest; > * PASID allocation/free (on some platforms); > * bind/unbind guest mm/page table (nested translation); > * invalidate IOMMU cache/iotlb for guest page table changes; > * report page request from device to guest; > * forward page response from guest to device; > - Configuring irqbypass for posted interrupt; > - ... > > Second, a passthrough interface requires delegating raw controllability > of subdevice to guest driver, while the same delegation might not be > required for implementing an userspace DMA interface (especially for > modern devices which support SVA). For example, idxd allows following > setting per wq (guest driver may configure them in any combination): > - put in dedicated or shared mode; > - enable/disable SVA; > - Associate guest-provided PASID to MSI/IMS entry; > - set threshold; > - allow/deny privileged access; > - allocate/free interrupt handle (enlightened for guest); > - collect error status; > - ... > > We plan to support idxd userspace DMA with SVA. The driver just needs > to prepare a wq with a predefined configuration (e.g. shared, SVA, > etc.), bind the process mm to IOMMU (non-nested) and then map > the portal to userspace. The goal that userspace can do DMA to > associated wq doesn't change the fact that the wq is still *owned* > and *controlled* by kernel driver. However as far as passthrough > is concerned, the wq is considered 'owned' by the guest driver thus > we need an interface which can support low-level *controllability* > from guest driver. It is sort of a mess in uAPI when mixing the > two together. So for userspace drivers like DPDK, it can use both of the two uAPIs? > > Based on above two reasons, we see distinct requirements between > userspace DMA and passthrough interfaces, at least in idxd context > (though other devices may have less distinction in-between). Therefore, > we didn't see the value/necessity of reinventing the wheel that mdev > already handles well to evolve an simple application-oriented usespace > DMA interface to a complex guest-driver-oriented passthrough interface. 
> The complexity of doing so would incur far more kernel-side changes > than the portion of emulation code that you've been concerned about... > >> I don't think VFIO should be the only entry point to >> virtualization. If we say the universe of devices doing user space DMA >> must also implement a VFIO mdev to plug into virtualization then it >> will be alot of mdevs. > Certainly VFIO will not be the only entry point. and This has to be a > case-by-case decision. The problem is that if we tie all controls via VFIO uAPI, the other subsystem like vDPA is likely to duplicate them. I wonder if there is a way to decouple the vSVA out of VFIO uAPI? > If an userspace DMA interface can be easily > adapted to be a passthrough one, it might be the choice. It's not that easy even for VFIO which requires a lot of new uAPIs and infrastructures(e.g mdev) to be invented. > But for idxd, > we see mdev a much better fit here, given the big difference between > what userspace DMA requires and what guest driver requires in this hw. A weak point for mdev is that it can't serve kernel subsystem other than VFIO. In this case, you need some other infrastructures (like [1]) to do this. (For idxd, you probably don't need this, but it's pretty common in the case of networking or storage device.) Thanks [1] https://patchwork.kernel.org/patch/11280547/ > >> I would prefer to see that the existing userspace interface have the >> extra needed bits for virtualization (eg by having appropriate >> internal kernel APIs to make this easy) and all the emulation to build >> the synthetic PCI device be done in userspace. > In the end what decides the direction is the amount of changes that > we have to put in kernel, not whether we call it 'emulation'. For idxd, > adding special passthrough requirements (guest SVA, dirty tracking, > etc.) and raw controllability to the simple userspace DMA interface > is for sure making kernel more complex than reusing the mdev > framework (plus some degree of emulation mockup behind). Not to > mention the merit of uAPI compatibility with mdev... > > Thanks > Kevin >
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, August 12, 2020 10:36 AM > On Wed, 12 Aug 2020 01:58:00 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > I'll also remind folks that LPC is coming up in just a couple short > > > weeks and this might be something we should discuss (virtually) > > > in-person. uconf CfPs are currently open. </plug> Thanks, > > > > > > > Yes, LPC is a good place to reach consensus. btw I saw there is > > already one VFIO topic called "device assignment/sub-assignment". > > Do you think whether this can be covered under that topic, or > > makes more sense to be a new one? > > All the things listed in the CFP are only potential topics to get ideas > flowing, there is currently no proposal to talk about sub-assignment. > I'd suggest submitting separate topics for each and if we run into time > constraints we can ask that they might be combined together. Thanks, > Done. -- title: Criteria of using VFIO mdev (vs. userspace DMA) Content: VFIO mdev provides a framework for subdevice assignment and reuses existing VFIO uAPI to handle common passthrough-related requirements. However, subdevice (e.g. ADI defined in Intel Scalable IOV) might not be a PCI endpoint (e.g. just a work queue), thus requires some degree of emulation/mediation in kernel to fit into VFIO device API. Then there is a concern on putting emulation in kernel and how to judge abuse of mdev framework by simply using it as an easy path to hook into virtualization stack. An associated open is about differentiating mdev from userspace DMA framework (such as uacce), and whether building passthrough features on top of userspace DMA framework is a better choice than using mdev. Thanks Kevin
> From: Jason Wang <jasowang@redhat.com> > Sent: Wednesday, August 12, 2020 11:28 AM > > > On 2020/8/10 下午3:32, Tian, Kevin wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, August 7, 2020 8:20 PM > >> > >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote: > >> > >>> If you see this as an abuse of the framework, then let's identify those > >>> specific issues and come up with a better approach. As we've discussed > >>> before, things like basic PCI config space emulation are acceptable > >>> overhead and low risk (imo) and some degree of register emulation is > >>> well within the territory of an mdev driver. > >> What troubles me is that idxd already has a direct userspace interface > >> to its HW, and does userspace DMA. The purpose of this mdev is to > >> provide a second direct userspace interface that is a little different > >> and trivially plugs into the virtualization stack. > > No. Userspace DMA and subdevice passthrough (what mdev provides) > > are two distinct usages IMO (at least in idxd context). and this might > > be the main divergence between us, thus let me put more words here. > > If we could reach consensus in this matter, which direction to go > > would be clearer. > > > > First, a passthrough interface requires some unique requirements > > which are not commonly observed in an userspace DMA interface, e.g.: > > > > - Tracking DMA dirty pages for live migration; > > - A set of interfaces for using SVA inside guest; > > * PASID allocation/free (on some platforms); > > * bind/unbind guest mm/page table (nested translation); > > * invalidate IOMMU cache/iotlb for guest page table changes; > > * report page request from device to guest; > > * forward page response from guest to device; > > - Configuring irqbypass for posted interrupt; > > - ... > > > > Second, a passthrough interface requires delegating raw controllability > > of subdevice to guest driver, while the same delegation might not be > > required for implementing an userspace DMA interface (especially for > > modern devices which support SVA). For example, idxd allows following > > setting per wq (guest driver may configure them in any combination): > > - put in dedicated or shared mode; > > - enable/disable SVA; > > - Associate guest-provided PASID to MSI/IMS entry; > > - set threshold; > > - allow/deny privileged access; > > - allocate/free interrupt handle (enlightened for guest); > > - collect error status; > > - ... > > > > We plan to support idxd userspace DMA with SVA. The driver just needs > > to prepare a wq with a predefined configuration (e.g. shared, SVA, > > etc.), bind the process mm to IOMMU (non-nested) and then map > > the portal to userspace. The goal that userspace can do DMA to > > associated wq doesn't change the fact that the wq is still *owned* > > and *controlled* by kernel driver. However as far as passthrough > > is concerned, the wq is considered 'owned' by the guest driver thus > > we need an interface which can support low-level *controllability* > > from guest driver. It is sort of a mess in uAPI when mixing the > > two together. > > > So for userspace drivers like DPDK, it can use both of the two uAPIs? yes. > > > > > > Based on above two reasons, we see distinct requirements between > > userspace DMA and passthrough interfaces, at least in idxd context > > (though other devices may have less distinction in-between). 
Therefore, > > we didn't see the value/necessity of reinventing the wheel that mdev > > already handles well to evolve an simple application-oriented usespace > > DMA interface to a complex guest-driver-oriented passthrough interface. > > The complexity of doing so would incur far more kernel-side changes > > than the portion of emulation code that you've been concerned about... > > > >> I don't think VFIO should be the only entry point to > >> virtualization. If we say the universe of devices doing user space DMA > >> must also implement a VFIO mdev to plug into virtualization then it > >> will be alot of mdevs. > > Certainly VFIO will not be the only entry point. and This has to be a > > case-by-case decision. > > > The problem is that if we tie all controls via VFIO uAPI, the other > subsystem like vDPA is likely to duplicate them. I wonder if there is a > way to decouple the vSVA out of VFIO uAPI? vSVA is a per-device (either pdev or mdev) feature thus naturally should be managed by its device driver (VFIO or vDPA). From this angle some duplication is inevitable given VFIO and vDPA are orthogonal passthrough frameworks. Within the kernel the majority of vSVA handling is done by IOMMU and IOASID modules thus most logic are shared. > > > > If an userspace DMA interface can be easily > > adapted to be a passthrough one, it might be the choice. > > > It's not that easy even for VFIO which requires a lot of new uAPIs and > infrastructures(e.g mdev) to be invented. > > > > But for idxd, > > we see mdev a much better fit here, given the big difference between > > what userspace DMA requires and what guest driver requires in this hw. > > > A weak point for mdev is that it can't serve kernel subsystem other than > VFIO. In this case, you need some other infrastructures (like [1]) to do > this. mdev is not exclusive from kernel usages. It's perfectly fine for a driver to reserve some work queues for host usages, while wrapping others into mdevs. Thanks Kevin > > (For idxd, you probably don't need this, but it's pretty common in the > case of networking or storage device.) > > Thanks > > [1] https://patchwork.kernel.org/patch/11280547/ > > > > > >> I would prefer to see that the existing userspace interface have the > >> extra needed bits for virtualization (eg by having appropriate > >> internal kernel APIs to make this easy) and all the emulation to build > >> the synthetic PCI device be done in userspace. > > In the end what decides the direction is the amount of changes that > > we have to put in kernel, not whether we call it 'emulation'. For idxd, > > adding special passthrough requirements (guest SVA, dirty tracking, > > etc.) and raw controllability to the simple userspace DMA interface > > is for sure making kernel more complex than reusing the mdev > > framework (plus some degree of emulation mockup behind). Not to > > mention the merit of uAPI compatibility with mdev... > > > > Thanks > > Kevin > >
On 2020/8/12 下午12:05, Tian, Kevin wrote: >> The problem is that if we tie all controls via VFIO uAPI, the other >> subsystem like vDPA is likely to duplicate them. I wonder if there is a >> way to decouple the vSVA out of VFIO uAPI? > vSVA is a per-device (either pdev or mdev) feature thus naturally should > be managed by its device driver (VFIO or vDPA). From this angle some > duplication is inevitable given VFIO and vDPA are orthogonal passthrough > frameworks. Within the kernel the majority of vSVA handling is done by > IOMMU and IOASID modules thus most logic are shared. So why not introduce vSVA uAPI at IOMMU or IOASID layer? > >>> If an userspace DMA interface can be easily >>> adapted to be a passthrough one, it might be the choice. >> It's not that easy even for VFIO which requires a lot of new uAPIs and >> infrastructures(e.g mdev) to be invented. >> >> >>> But for idxd, >>> we see mdev a much better fit here, given the big difference between >>> what userspace DMA requires and what guest driver requires in this hw. >> A weak point for mdev is that it can't serve kernel subsystem other than >> VFIO. In this case, you need some other infrastructures (like [1]) to do >> this. > mdev is not exclusive from kernel usages. It's perfectly fine for a driver > to reserve some work queues for host usages, while wrapping others > into mdevs. I meant you may want slices to be an independent device from the kernel point of view: E.g for ethernet devices, you may want 10K mdevs to be passed to guest. Similarly, you may want 10K net devices which is connected to the kernel networking subsystems. In this case it's not simply reserving queues but you need some other type of device abstraction. There could be some kind of duplication between this and mdev. Thanks > > Thanks > Kevin >
> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, August 13, 2020 12:34 PM
>
>
> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> >> The problem is that if we tie all controls via VFIO uAPI, the other
> >> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> >> way to decouple the vSVA out of VFIO uAPI?
> > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > be managed by its device driver (VFIO or vDPA). From this angle some
> > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > frameworks. Within the kernel the majority of vSVA handling is done by
> > IOMMU and IOASID modules thus most logic are shared.
>
>
> So why not introduce vSVA uAPI at IOMMU or IOASID layer?

One may ask a similar question why IOMMU doesn't expose map/unmap
as uAPI...

>
>
> >>> If an userspace DMA interface can be easily
> >>> adapted to be a passthrough one, it might be the choice.
> >> It's not that easy even for VFIO which requires a lot of new uAPIs and
> >> infrastructures(e.g mdev) to be invented.
> >>
> >>
> >>> But for idxd,
> >>> we see mdev a much better fit here, given the big difference between
> >>> what userspace DMA requires and what guest driver requires in this hw.
> >> A weak point for mdev is that it can't serve kernel subsystem other than
> >> VFIO. In this case, you need some other infrastructures (like [1]) to do
> >> this.
> > mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> > to reserve some work queues for host usages, while wrapping others
> > into mdevs.
>
>
> I meant you may want slices to be an independent device from the kernel
> point of view:
>
> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
>
> Similarly, you may want 10K net devices which is connected to the kernel
> networking subsystems.
>
> In this case it's not simply reserving queues but you need some other
> type of device abstraction. There could be some kind of duplication
> between this and mdev.
>

Yes, some abstraction required but isn't it what the driver should
care about instead of mdev framework itself? If the driver reports
the same set of resource to both mdev and networking, it needs to
make sure when the resource is claimed in one interface then it
should be marked in-use in another. e.g. each mdev includes an
available_instances attribute. The driver could report 10K available
instances initially and then update it to 5K when another 5K is used
for net devices later.

Mdev definitely has its usage limitations. Some may be improved
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and are
better discussed in other places, e.g. LPC...

Thanks
Kevin
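To make the accounting described above concrete, a minimal sketch of the
bookkeeping the parent driver would do follows; the structure and helper
names are invented for illustration and are not the actual idxd code. The
idea is simply that mdev instances and kernel consumers draw from one
shared pool, and the mdev type's available_instances attribute reports
whatever that pool says is left.

#include <linux/errno.h>
#include <linux/spinlock.h>

/* One pool of work queues, shared by the mdev path and kernel users. */
struct wq_pool {
	spinlock_t	lock;
	unsigned int	total;		/* all wqs exposed by the device  */
	unsigned int	used_mdev;	/* claimed as mdev instances      */
	unsigned int	used_kernel;	/* claimed by in-kernel consumers */
};

/* What the mdev type's available_instances attribute would report. */
static unsigned int wq_pool_available(struct wq_pool *pool)
{
	unsigned int avail;

	spin_lock(&pool->lock);
	avail = pool->total - pool->used_mdev - pool->used_kernel;
	spin_unlock(&pool->lock);
	return avail;
}

/* Called from both the mdev create path and the netdev/kernel path. */
static int wq_pool_claim(struct wq_pool *pool, bool for_mdev)
{
	int ret = 0;

	spin_lock(&pool->lock);
	if (pool->used_mdev + pool->used_kernel >= pool->total)
		ret = -ENOSPC;
	else if (for_mdev)
		pool->used_mdev++;
	else
		pool->used_kernel++;
	spin_unlock(&pool->lock);
	return ret;
}

With this, reporting 10K instances initially and 5K after the kernel side
claims 5K queues is just what wq_pool_available() returns at those two
points in time.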
On 2020/8/13 1:26 PM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, August 13, 2020 12:34 PM
>>
>>
>> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
>>>> The problem is that if we tie all controls via VFIO uAPI, the other
>>>> subsystem like vDPA is likely to duplicate them. I wonder if there is a
>>>> way to decouple the vSVA out of VFIO uAPI?
>>> vSVA is a per-device (either pdev or mdev) feature thus naturally should
>>> be managed by its device driver (VFIO or vDPA). From this angle some
>>> duplication is inevitable given VFIO and vDPA are orthogonal passthrough
>>> frameworks. Within the kernel the majority of vSVA handling is done by
>>> IOMMU and IOASID modules thus most logic are shared.
>>
>> So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> One may ask a similar question why IOMMU doesn't expose map/unmap
> as uAPI...


I think this is probably a good idea as well. If there's anything missed in
the infrastructure, we can invent. Besides vhost-vDPA, there are other
subsystems that relaying their uAPI to IOMMU API. Duplicating uAPIs is
usually a hint of the codes duplication. Simple map/unmap could be easy but
vSVA uAPI is much more complicated.


>
>>
>>>>> If an userspace DMA interface can be easily
>>>>> adapted to be a passthrough one, it might be the choice.
>>>> It's not that easy even for VFIO which requires a lot of new uAPIs and
>>>> infrastructures(e.g mdev) to be invented.
>>>>
>>>>
>>>>> But for idxd,
>>>>> we see mdev a much better fit here, given the big difference between
>>>>> what userspace DMA requires and what guest driver requires in this hw.
>>>> A weak point for mdev is that it can't serve kernel subsystem other than
>>>> VFIO. In this case, you need some other infrastructures (like [1]) to do
>>>> this.
>>> mdev is not exclusive from kernel usages. It's perfectly fine for a driver
>>> to reserve some work queues for host usages, while wrapping others
>>> into mdevs.
>>
>> I meant you may want slices to be an independent device from the kernel
>> point of view:
>>
>> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
>>
>> Similarly, you may want 10K net devices which is connected to the kernel
>> networking subsystems.
>>
>> In this case it's not simply reserving queues but you need some other
>> type of device abstraction. There could be some kind of duplication
>> between this and mdev.
>>
> Yes, some abstraction required but isn't it what the driver should
> care about instead of mdev framework itself?


With mdev you present a "PCI" device, but what kind of device does it
try to present to the kernel? If it's still PCI, there's duplication
with mdev; if it's something new, maybe we can switch to that API.


> If the driver reports
> the same set of resource to both mdev and networking, it needs to
> make sure when the resource is claimed in one interface then it
> should be marked in-use in another. e.g. each mdev includes an
> available_instances attribute. The driver could report 10K available
> instances initially and then update it to 5K when another 5K is used
> for net devices later.


Right but this probably means you need another management layer under
mdev.


>
> Mdev definitely has its usage limitations. Some may be improved
> in the future, some may not. But those are distracting from the
> original purpose of this thread (mdev vs. userspace DMA) and are
> better discussed in other places, e.g. LPC...


Ok.

Thanks


>
> Thanks
> Kevin
On Thu, Aug 13, 2020 at 02:01:58PM +0800, Jason Wang wrote:
>
> On 2020/8/13 1:26 PM, Tian, Kevin wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, August 13, 2020 12:34 PM
> > >
> > >
> > > On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> > > > > The problem is that if we tie all controls via VFIO uAPI, the other
> > > > > subsystem like vDPA is likely to duplicate them. I wonder if there is a
> > > > > way to decouple the vSVA out of VFIO uAPI?
> > > > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > > > be managed by its device driver (VFIO or vDPA). From this angle some
> > > > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > > > frameworks. Within the kernel the majority of vSVA handling is done by
> > > > IOMMU and IOASID modules thus most logic are shared.
> > >
> > > So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> > One may ask a similar question why IOMMU doesn't expose map/unmap
> > as uAPI...
>
>
> I think this is probably a good idea as well. If there's anything missed in
> the infrastructure, we can invent. Besides vhost-vDPA, there are other
> subsystems that relaying their uAPI to IOMMU API. Duplicating uAPIs is
> usually a hint of the codes duplication. Simple map/unmap could be easy but
> vSVA uAPI is much more complicated.

A way to create the vSVA objects unrelated to VFIO and then pass those
objects for device use into existing uAPIs, to me, makes a lot of
sense. You should not have to use the VFIO API just to get vSVA.

Or stated another way, the existing user driver should be able to get
a PASID from the vSVA components as well as just create a PASID from
the local mm_struct.

The same basic argument goes for all the points - the issue is really
the only uAPI we have for this stuff is under VFIO, and the better
solution is to disaggregate that uAPI, not to try and make everything
pretend to be a VFIO device.

Jason
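For reference, the in-kernel half of "create a PASID from the local
mm_struct" is roughly what iommu_sva_bind_device()/iommu_sva_get_pasid()
already provide. The sketch below shows how a user-DMA driver could use
that, with the caveat that the exact prototypes vary across kernel
versions, so treat it as an approximation rather than the literal API.

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/types.h>

/*
 * Approximate sketch: bind the calling process' address space to the
 * device and hand back the PASID so the user driver can program it into
 * work descriptors.  Exact prototypes differ between kernel versions.
 */
static int bind_current_mm_to_device(struct device *dev, u32 *pasid)
{
	struct iommu_sva *handle;

	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	*pasid = iommu_sva_get_pasid(handle);
	return 0;
}

The /dev/sva idea raised later in the thread would be the uAPI complement
to this: the same PASID object, but creatable and shareable from userspace
without going through VFIO.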
On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.
>
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'.

No, this is not right. The decision should be based on what will end
up more maintainable in the long run.

Yes it would be more code to dis-aggregate some of the things
currently only bundled as uAPI inside VFIO (eg your vSVA argument
above) but once it is disaggregated the maintainability of the whole
solution will be better overall, and more drivers will be able to use
this functionality.

Jason
> From: Jason Gunthorpe
> Sent: Friday, August 14, 2020 9:35 PM
>
> On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
>
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'.
>
> No, this is not right. The decision should be based on what will end
> up more maintainable in the long run.
>
> Yes it would be more code to dis-aggregate some of the things
> currently only bundled as uAPI inside VFIO (eg your vSVA argument
> above) but once it is disaggregated the maintainability of the whole
> solution will be better overall, and more drivers will be able to use
> this functionality.
>

Disaggregation is an orthogonal topic to the main divergence in
this thread, which is passthrough vs. userspace DMA. I gave a detailed
explanation about the difference between the two in my last reply.
The possibility of dis-aggregating something between passthrough
frameworks (e.g. VFIO and vDPA) is not the reason for growing every
userspace DMA framework to be a passthrough framework. Doing that
is instead hurting maintainability in general...

Thanks
Kevin
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 14, 2020 9:24 PM
>
> The same basic argument goes for all the points - the issue is really
> the only uAPI we have for this stuff is under VFIO, and the better
> solution is to disaggregate that uAPI, not to try and make everything
> pretend to be a VFIO device.
>

Nobody is proposing to make everything VFIO. There must be some
criteria, which can be brainstormed at LPC. But the opposite also
holds - the fact that we should not make everything VFIO doesn't
imply a prohibition on anyone using it. There is a clear difference
between passthrough and userspace DMA requirements in the idxd
context, and we see good reasons to use VFIO for our passthrough
requirements.

Thanks
Kevin
On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, August 14, 2020 9:35 PM
> >
> > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> >
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'.
> >
> > No, this is not right. The decision should be based on what will end
> > up more maintainable in the long run.
> >
> > Yes it would be more code to dis-aggregate some of the things
> > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > above) but once it is disaggregated the maintainability of the whole
> > solution will be better overall, and more drivers will be able to use
> > this functionality.
> >
>
> Disaggregation is an orthogonal topic to the main divergence in
> this thread, which is passthrough vs. userspace DMA. I gave a detailed
> explanation about the difference between the two in my last reply.

You said the first issue was related to SVA, which is understandable
because we have no SVA uAPIs outside VFIO.

Imagine if we had some /dev/sva that provided this API and user space
DMA drivers could simply accept an FD and work properly. It is not
such a big leap anymore, nor is it customized code in idxd.

The other pass through issue was IRQ, which last time I looked, was
fairly trivial to connect via interrupt remapping in the kernel, or
could be made extremely trivial.

The last seemed to be a concern that the current uapi for idxd was
lacking some idxd specific features, which seems like a quite weak
reason to use VFIO.

Jason
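To illustrate the "imagine if we had some /dev/sva" point, a purely
hypothetical uAPI could look like the sketch below; nothing like this
exists today and every name in it is invented. The point is only that
PASID/vSVA setup becomes an fd-based object that any driver interface
(VFIO, vDPA, or a plain char device like idxd's) could accept.

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical /dev/sva interface -- illustrative only, not real uAPI. */

struct sva_alloc_pasid {
	__u32 flags;		/* e.g. bind to current mm vs. guest tables */
	__u32 pasid;		/* out: allocated PASID                     */
};

struct sva_bind_device {
	__s32 device_fd;	/* fd of the device uAPI (idxd cdev, etc.)  */
	__u32 pasid;		/* PASID from SVA_ALLOC_PASID               */
};

#define SVA_ALLOC_PASID		_IOWR('S', 0, struct sva_alloc_pasid)
#define SVA_BIND_DEVICE		_IOW('S', 1, struct sva_bind_device)

Userspace would open /dev/sva, allocate a PASID tied to its own mm (or to
a guest's page tables), and then hand that PASID to whatever device fd it
already has open -- the "accept an FD and work properly" flow described
above.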
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 18, 2020 8:44 AM
>
> On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Friday, August 14, 2020 9:35 PM
> > >
> > > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > >
> > > > > I would prefer to see that the existing userspace interface have the
> > > > > extra needed bits for virtualization (eg by having appropriate
> > > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > > the synthetic PCI device be done in userspace.
> > > >
> > > > In the end what decides the direction is the amount of changes that
> > > > we have to put in kernel, not whether we call it 'emulation'.
> > >
> > > No, this is not right. The decision should be based on what will end
> > > up more maintainable in the long run.
> > >
> > > Yes it would be more code to dis-aggregate some of the things
> > > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > > above) but once it is disaggregated the maintainability of the whole
> > > solution will be better overall, and more drivers will be able to use
> > > this functionality.
> > >
> >
> > Disaggregation is an orthogonal topic to the main divergence in
> > this thread, which is passthrough vs. userspace DMA. I gave a detailed
> > explanation about the difference between the two in my last reply.
>
> You said the first issue was related to SVA, which is understandable
> because we have no SVA uAPIs outside VFIO.
>
> Imagine if we had some /dev/sva that provided this API and user space
> DMA drivers could simply accept an FD and work properly. It is not
> such a big leap anymore, nor is it customized code in idxd.
>
> The other pass through issue was IRQ, which last time I looked, was
> fairly trivial to connect via interrupt remapping in the kernel, or
> could be made extremely trivial.
>
> The last seemed to be a concern that the current uapi for idxd was
> lacking some idxd specific features, which seems like a quite weak
> reason to use VFIO.
>

The difference in my reply is not just about the implementation gap
of growing a userspace DMA framework to a passthrough framework.
My real point is about the different goals that each wants to achieve.
Userspace DMA is purely about allowing userspace to directly access
the portal and do DMA, but the wq configuration is always under kernel
driver's control. On the other hand, passthrough means delegating full
control of the wq to the guest and then associated mock-up (live migration,
vSVA, posted interrupt, etc.) for that to work. I really didn't see the
value of mixing them together when there is already a good candidate
to handle passthrough...

Thanks
Kevin
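As a rough illustration of the userspace DMA model being contrasted here
-- the kernel driver owns the wq configuration, and the application only
gets a submission portal -- the sketch below shows the flow from the
application side. The device node name and the empty descriptor are
placeholders (the real descriptor layout comes from the DSA
specification), and a shared wq would use ENQCMD with a PASID rather than
MOVDIR64B; build with -mmovdir64b.

#include <fcntl.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* wq char device created by the kernel idxd driver (name varies) */
	int fd = open("/dev/dsa/wq0.0", O_RDWR);
	if (fd < 0)
		return 1;

	/* userspace only gets the submission portal page, nothing else */
	void *portal = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
	if (portal == MAP_FAILED)
		return 1;

	/* 64-byte descriptor; fields elided here, see the DSA spec */
	uint8_t desc[64] __attribute__((aligned(64)));
	memset(desc, 0, sizeof(desc));

	/* dedicated wq: one 64-byte atomic write submits the descriptor */
	_movdir64b(portal, desc);

	munmap(portal, 4096);
	close(fd);
	return 0;
}

Everything about how that wq was created, sized, and bound stays inside
the kernel driver; the passthrough case differs precisely because the
guest needs to see and control that configuration as well.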
On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> The difference in my reply is not just about the implementation gap
> of growing a userspace DMA framework to a passthrough framework.
> My real point is about the different goals that each wants to achieve.
> Userspace DMA is purely about allowing userspace to directly access
> the portal and do DMA, but the wq configuration is always under kernel
> driver's control. On the other hand, passthrough means delegating full
> control of the wq to the guest and then associated mock-up (live migration,
> vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> value of mixing them together when there is already a good candidate
> to handle passthrough...

In Linux a 'VM' and virtualization has always been a normal system
process that uses a few extra kernel features. This has been more or
less the cornerstone of that design since the start.

In that view it doesn't make any sense to say that uAPI from idxd that
is useful for virtualization somehow doesn't belong as part of the
standard uAPI.

Especially when it is such a small detail like what APIs are used to
configure the wq.

For instance, what about suspend/resume of containers using idxd?
Wouldn't you want to have the same basic approach of controlling the
wq from userspace that virtualization uses?

Jason
On 18/08/20 13:50, Jason Gunthorpe wrote:
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?

The difference is that VFIO more or less standardizes the approach you
use for live migration. With another interface you'd have to come up
with something for every driver, and add support in CRIU for every
driver as well.

Paolo
On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> On 18/08/20 13:50, Jason Gunthorpe wrote:
> > For instance, what about suspend/resume of containers using idxd?
> > Wouldn't you want to have the same basic approach of controlling the
> > wq from userspace that virtualization uses?
>
> The difference is that VFIO more or less standardizes the approach you
> use for live migration. With another interface you'd have to come up
> with something for every driver, and add support in CRIU for every
> driver as well.

VFIO is very unsuitable for use as some general userspace. It is only 1:1
with a single process and just can't absorb what the existing idxd
userspace is doing.

So VFIO is already not a solution for normal userspace idxd where CRIU
becomes interesting. Not sure what you are trying to say?

My point was the opposite, if you want to enable CRIU for idxd then
you probably need all the same stuff as for qemu/VFIO except in the
normal idxd user API.

Jason
On 18/08/20 18:49, Jason Gunthorpe wrote:
> On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
>> On 18/08/20 13:50, Jason Gunthorpe wrote:
>>> For instance, what about suspend/resume of containers using idxd?
>>> Wouldn't you want to have the same basic approach of controlling the
>>> wq from userspace that virtualization uses?
>>
>> The difference is that VFIO more or less standardizes the approach you
>> use for live migration. With another interface you'd have to come up
>> with something for every driver, and add support in CRIU for every
>> driver as well.
>
> VFIO is very unsuitable for use as some general userspace. It is only 1:1
> with a single process and just can't absorb what the existing idxd
> userspace is doing.

The point of mdev is that it's not 1:1 anymore.

Paolo

> So VFIO is already not a solution for normal userspace idxd where CRIU
> becomes interesting. Not sure what you are trying to say?
>
> My point was the opposite, if you want to enable CRIU for idxd then
> you probably need all the same stuff as for qemu/VFIO except in the
> normal idxd user API.
>
> Jason
>
On Tue, Aug 18, 2020 at 07:05:16PM +0200, Paolo Bonzini wrote:
> On 18/08/20 18:49, Jason Gunthorpe wrote:
> > On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> >> On 18/08/20 13:50, Jason Gunthorpe wrote:
> >>> For instance, what about suspend/resume of containers using idxd?
> >>> Wouldn't you want to have the same basic approach of controlling the
> >>> wq from userspace that virtualization uses?
> >>
> >> The difference is that VFIO more or less standardizes the approach you
> >> use for live migration. With another interface you'd have to come up
> >> with something for every driver, and add support in CRIU for every
> >> driver as well.
> >
> > VFIO is very unsuitable for use as some general userspace. It is only 1:1
> > with a single process and just can't absorb what the existing idxd
> > userspace is doing.
>
> The point of mdev is that it's not 1:1 anymore.

The lifecycle model of mdev is not compatible with how something like
idxd works; it needs a multi-open cdev.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 18, 2020 7:50 PM
>
> On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework to a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration is always under kernel
> > driver's control. On the other hand, passthrough means delegating full
> > control of the wq to the guest and then associated mock-up (live migration,
> > vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> > value of mixing them together when there is already a good candidate
> > to handle passthrough...
>
> In Linux a 'VM' and virtualization has always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
>
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.

The point is that we already have a more standard uAPI (VFIO) which is
unified and vendor-agnostic to userspace. Creating an idxd-specific uAPI
to absorb similar requirements that VFIO already handles is not compelling
and instead causes more trouble for Qemu or other VMMs, as they need to
deal with every such driver uAPI even when Qemu itself has no interest in
the device detail (since the real user is inside the guest).

>
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
>
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
>

I'm not familiar with how container suspend/resume is done today. But
my gut feeling is that it's different from virtualization. For
virtualization, the whole wq is assigned to the guest, thus the uAPI must
provide a way to save the wq state (including its configuration) at
suspend, and then restore that state to what the guest expects at resume.
However, in the container case, which does userspace DMA, the wq is
managed by the host kernel and could be shared between multiple
containers. So the wq state is irrelevant to the container; the only
relevant state is the in-flight work, which needs a draining interface.
In this view I think the two have a major difference.

Thanks
Kevin