Message ID | 20230530053724.232765-1-baolu.lu@linux.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | IOMMUFD: Deliver IO page faults to user space | expand |
Hi Baolu, On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > This series implements the functionality of delivering IO page faults to > user space through the IOMMUFD framework. The use case is nested > translation, where modern IOMMU hardware supports two-stage translation > tables. The second-stage translation table is managed by the host VMM > while the first-stage translation table is owned by the user space. > Hence, any IO page fault that occurs on the first-stage page table > should be delivered to the user space and handled there. The user space > should respond the page fault handling result to the device top-down > through the IOMMUFD response uAPI. > > User space indicates its capablity of handling IO page faults by setting > a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD > will then setup its infrastructure for page fault delivery. Together > with the iopf-capable flag, user space should also provide an eventfd > where it will listen on any down-top page fault messages. > > On a successful return of the allocation of iopf-capable HWPT, a fault > fd will be returned. User space can open and read fault messages from it > once the eventfd is signaled. I think that, whether the guest has an IOPF capability or not, the host should always forward any stage-1 fault/error back to the guest. Yet, the implementation of this series builds with the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. And I have my doubt at the using the IOPF framework with that IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for its bottom half workqueue, because a page response could take a long cycle. But adding that flag feels like we don't really need the bottom half workqueue, i.e. losing the point of using the IOPF framework, IMHO. Combining the two facts above, I wonder if we really need to go through the IOPF framework; can't we just register a user fault handler in the iommufd directly upon a valid event_fd? Thanks Nicolin
On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > Hi folks, > > This series implements the functionality of delivering IO page faults to > user space through the IOMMUFD framework. The use case is nested > translation, where modern IOMMU hardware supports two-stage translation > tables. The second-stage translation table is managed by the host VMM > while the first-stage translation table is owned by the user space. > Hence, any IO page fault that occurs on the first-stage page table > should be delivered to the user space and handled there. The user space > should respond the page fault handling result to the device top-down > through the IOMMUFD response uAPI. > > User space indicates its capablity of handling IO page faults by setting > a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD > will then setup its infrastructure for page fault delivery. Together > with the iopf-capable flag, user space should also provide an eventfd > where it will listen on any down-top page fault messages. > > On a successful return of the allocation of iopf-capable HWPT, a fault > fd will be returned. User space can open and read fault messages from it > once the eventfd is signaled. This is a performance path so we really need to think about this more, polling on an eventfd and then reading a different fd is not a good design. What I would like is to have a design from the start that fits into io_uring, so we can have pre-posted 'recvs' in io_uring that just get completed at high speed when PRIs come in. This suggests that the PRI should be delivered via read() on a single FD and pollability on the single FD without any eventfd. > Besides the overall design, I'd like to hear comments about below > designs: > > - The IOMMUFD fault message format. It is very similar to that in > uapi/linux/iommu which has been discussed before and partially used by > the IOMMU SVA implementation. I'd like to get more comments on the > format when it comes to IOMMUFD. We have to have the same discussion as always, does a generic fault message format make any sense here? PRI seems more likely that it would but it needs a big carefull cross vendor check out. Jason
On 5/31/23 2:50 AM, Nicolin Chen wrote: > Hi Baolu, Hi Nicolin, > > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > >> This series implements the functionality of delivering IO page faults to >> user space through the IOMMUFD framework. The use case is nested >> translation, where modern IOMMU hardware supports two-stage translation >> tables. The second-stage translation table is managed by the host VMM >> while the first-stage translation table is owned by the user space. >> Hence, any IO page fault that occurs on the first-stage page table >> should be delivered to the user space and handled there. The user space >> should respond the page fault handling result to the device top-down >> through the IOMMUFD response uAPI. >> >> User space indicates its capablity of handling IO page faults by setting >> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD >> will then setup its infrastructure for page fault delivery. Together >> with the iopf-capable flag, user space should also provide an eventfd >> where it will listen on any down-top page fault messages. >> >> On a successful return of the allocation of iopf-capable HWPT, a fault >> fd will be returned. User space can open and read fault messages from it >> once the eventfd is signaled. > > I think that, whether the guest has an IOPF capability or not, > the host should always forward any stage-1 fault/error back to > the guest. Yet, the implementation of this series builds with > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. I agree with you that DMA unrecoverable faults on stage-1 hwpt should also be reported to user space. However, I have some concerns about how this will be implemented. In the shadow page table case, we don't report DMA unrecoverable faults. This could lead to confusion for users, as they may expect to receive DMA unrecoverable faults regardless of whether hardware nested translation is used. I would suggest that we report DMA unrecoverable faults in all cases, regardless of whether hardware nested translation is used. This would make it easier for users to understand the behavior of their systems. > > And I have my doubt at the using the IOPF framework with that > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for > its bottom half workqueue, because a page response could take > a long cycle. But adding that flag feels like we don't really > need the bottom half workqueue, i.e. losing the point of using > the IOPF framework, IMHO. > > Combining the two facts above, I wonder if we really need to > go through the IOPF framework; can't we just register a user > fault handler in the iommufd directly upon a valid event_fd? I agree with you that the existing IOPF framework is not ideal for IOMMUFD. The adding ASYNC flag conflicts with the IOPF workqueue. This could lead to performance issues. I can improve the IOPF framework to make it more friendly to IOMMUFD. One way to do this would be not use workqueue for the IOMMUFD case. Have I covered all your concerns? Best regards, baolu
On 5/31/23 8:33 AM, Jason Gunthorpe wrote: > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: >> Hi folks, >> >> This series implements the functionality of delivering IO page faults to >> user space through the IOMMUFD framework. The use case is nested >> translation, where modern IOMMU hardware supports two-stage translation >> tables. The second-stage translation table is managed by the host VMM >> while the first-stage translation table is owned by the user space. >> Hence, any IO page fault that occurs on the first-stage page table >> should be delivered to the user space and handled there. The user space >> should respond the page fault handling result to the device top-down >> through the IOMMUFD response uAPI. >> >> User space indicates its capablity of handling IO page faults by setting >> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD >> will then setup its infrastructure for page fault delivery. Together >> with the iopf-capable flag, user space should also provide an eventfd >> where it will listen on any down-top page fault messages. >> >> On a successful return of the allocation of iopf-capable HWPT, a fault >> fd will be returned. User space can open and read fault messages from it >> once the eventfd is signaled. > > This is a performance path so we really need to think about this more, > polling on an eventfd and then reading a different fd is not a good > design. > > What I would like is to have a design from the start that fits into > io_uring, so we can have pre-posted 'recvs' in io_uring that just get > completed at high speed when PRIs come in. > > This suggests that the PRI should be delivered via read() on a single > FD and pollability on the single FD without any eventfd. Good suggestion. I will head in this direction. >> Besides the overall design, I'd like to hear comments about below >> designs: >> >> - The IOMMUFD fault message format. It is very similar to that in >> uapi/linux/iommu which has been discussed before and partially used by >> the IOMMU SVA implementation. I'd like to get more comments on the >> format when it comes to IOMMUFD. > > We have to have the same discussion as always, does a generic fault > message format make any sense here? > > PRI seems more likely that it would but it needs a big carefull cross > vendor check out. Yeah, good point. As far as I can see, there are at least three types of IOPF hardware implementation. - PCI/PRI: Vendors might have their own additions. For example, VT-d 3.0 allows root-complex integrated endpoints to carry device specific private data in their page requests. This has been removed from the spec since v4.0. - DMA stalls. - Device-specific (non-PRI, not through IOMMU). Does IOMMUFD want to support the last case? Best regards, baolu
On Wed, May 31, 2023 at 10:10:15AM +0800, Baolu Lu wrote: > I agree with you that the existing IOPF framework is not ideal for > IOMMUFD. The adding ASYNC flag conflicts with the IOPF workqueue. > This could lead to performance issues. > > I can improve the IOPF framework to make it more friendly to IOMMUFD. > One way to do this would be not use workqueue for the IOMMUFD case. > > Have I covered all your concerns? Yea. My concern was mainly at the fault report for non-PRI cases. Though I am still on the fence about using IOPF framework, let's see first how the improved design would look like. Thanks Nic
Hi Baolu, On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > - The timeout value for the pending page fault messages. Ideally we > should determine the timeout value from the device configuration, but > I failed to find any statement in the PCI specification (version 6.x). > A default 100 milliseconds is selected in the implementation, but it > leave the room for grow the code for per-device setting. If it helps we had some discussions about this timeout [1]. It's useful to print out a warning for debugging, but I don't think completing the fault on timeout is correct, we should leave the fault pending. Given that the PCI spec does not indicate a timeout, the guest can wait as long as it wants to complete the fault (and 100ms may even be reasonable on an emulator, who knows how many layers and context switches the fault has to go through). Another outstanding issue was what to do for PASID stop. When the guest device driver stops using a PASID it issues a PASID stop request to the device (a device-specific mechanism). If the device is not using PRI stop markers it waits for pending PRs to complete and we're fine. Otherwise it sends a stop marker which is flushed to the PRI queue, but does not wait for pending PRs. Handling stop markers is annoying. If the device issues one, then the PRI queue contains stale faults, a stop marker, followed by valid faults for the next address space bound to this PASID. The next address space will get all the spurious faults because the fault handler doesn't know that there is a stop marker coming. Linux is probably alright with spurious faults, though maybe not in all cases, and other guests may not support them at all. We might need to revisit supporting stop markers: request that each device driver declares whether their device uses stop markers on unbind() ("This mechanism must indicate that a Stop Marker Message will be generated." says the spec, but doesn't say if the function always uses one or the other mechanism so it's per-unbind). Then we still have to synchronize unbind() with the fault handler to deal with the pending stop marker, which might have already gone through or be generated later. Currently we ignore all that and just flush the PRI queue, followed by the IOPF queue, to get rid of any stale fault before reassigning the PASID. A guest however would also need to first flush the HW PRI queue, but doesn't have a direct way to do that. If we want to support guests that don't deal with stop markers, the host needs to flush the PRI queue when a PASID is detached. I guess on Intel detaching the PASID goes through the host which can flush the host queue. On Arm we'll probably need to flush the queue when receiving a PASID cache invalidation, which the guest issues after clearing a PASID table entry. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/20180423153622.GC38106@ostrya.localdomain/ Also unregistration, not sure if relevant here https://lore.kernel.org/linux-iommu/20190605154553.0d00ad8d@jacob-builder/
On 6/16/23 7:32 PM, Jean-Philippe Brucker wrote: > Hi Baolu, Hi Jean, Thank you for the informational reply. > > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: >> - The timeout value for the pending page fault messages. Ideally we >> should determine the timeout value from the device configuration, but >> I failed to find any statement in the PCI specification (version 6.x). >> A default 100 milliseconds is selected in the implementation, but it >> leave the room for grow the code for per-device setting. > > If it helps we had some discussions about this timeout [1]. It's useful to > print out a warning for debugging, but I don't think completing the fault > on timeout is correct, we should leave the fault pending. Given that the > PCI spec does not indicate a timeout, the guest can wait as long as it > wants to complete the fault (and 100ms may even be reasonable on an > emulator, who knows how many layers and context switches the fault has to > go through). When I was designing this, I was also hesitant about whether to use a timer. Even worse, I didn't see any description of timeout in the PCI spec. I agree with you that a better approach might be to ensure that devices respect the number of in-flight PPRs that are allocated to them. We need to design a queue that is large enough to prevent device from flooding it with page requests. > > Another outstanding issue was what to do for PASID stop. When the guest > device driver stops using a PASID it issues a PASID stop request to the > device (a device-specific mechanism). If the device is not using PRI stop > markers it waits for pending PRs to complete and we're fine. Otherwise it > sends a stop marker which is flushed to the PRI queue, but does not wait > for pending PRs. > > Handling stop markers is annoying. If the device issues one, then the PRI > queue contains stale faults, a stop marker, followed by valid faults for > the next address space bound to this PASID. The next address space will > get all the spurious faults because the fault handler doesn't know that > there is a stop marker coming. Linux is probably alright with spurious > faults, though maybe not in all cases, and other guests may not support > them at all. > > We might need to revisit supporting stop markers: request that each device > driver declares whether their device uses stop markers on unbind() ("This > mechanism must indicate that a Stop Marker Message will be generated." > says the spec, but doesn't say if the function always uses one or the > other mechanism so it's per-unbind). Then we still have to synchronize > unbind() with the fault handler to deal with the pending stop marker, > which might have already gone through or be generated later. I don't quite follow here. Once a PASID is unbound from the device, the device driver should be free to release the PASID. The PASID could then be used for any other purpose. The device driver has no idea when the pending page requests are flushed after unbind(), so it cannot decide how long should the PASID be delayed for reuse. Therefore, I understand that a successful return from the unbind() function denotes that all pending page requests have been flushed and the PASID is viable for other use. > > Currently we ignore all that and just flush the PRI queue, followed by the > IOPF queue, to get rid of any stale fault before reassigning the PASID. A > guest however would also need to first flush the HW PRI queue, but doesn't > have a direct way to do that. If we want to support guests that don't deal > with stop markers, the host needs to flush the PRI queue when a PASID is > detached. I guess on Intel detaching the PASID goes through the host which > can flush the host queue. On Arm we'll probably need to flush the queue > when receiving a PASID cache invalidation, which the guest issues after > clearing a PASID table entry. The Intel VT-d driver follows below steps to drain pending page requests when a PASID is unbound from a device. - Tear down the device's pasid table entry for the stopped pasid. This ensures that ATS/PRI will stop putting more page requests for the pasid in VT-d PRQ. - Sync with the PRQ handling thread until all related page requests in PRQ have been delivered. - Flush the iopf queue with iopf_queue_flush_dev(). - Follow the steps defined in VT-d spec section 7.10 to drain all page requests and responses between VT-d and the endpoint device. > > Thanks, > Jean > > [1] https://lore.kernel.org/linux-iommu/20180423153622.GC38106@ostrya.localdomain/ > Also unregistration, not sure if relevant here > https://lore.kernel.org/linux-iommu/20190605154553.0d00ad8d@jacob-builder/ > Best regards, baolu
On Fri, Jun 16, 2023 at 12:32:32PM +0100, Jean-Philippe Brucker wrote: > We might need to revisit supporting stop markers: request that each device > driver declares whether their device uses stop markers on unbind() ("This > mechanism must indicate that a Stop Marker Message will be generated." > says the spec, but doesn't say if the function always uses one or the > other mechanism so it's per-unbind). Then we still have to synchronize > unbind() with the fault handler to deal with the pending stop marker, > which might have already gone through or be generated later. An explicit API to wait for the stop marker makes sense, with the expectation that well behaved devices will generate it and well behaved drivers will wait for it. Things like VFIO should have a way to barrier/drain the PRI queue after issuing FLR. ie the VMM processing FLR should also barrier the real HW queues and flush them to VM visibility. > with stop markers, the host needs to flush the PRI queue when a PASID is > detached. I guess on Intel detaching the PASID goes through the host which > can flush the host queue. On Arm we'll probably need to flush the queue > when receiving a PASID cache invalidation, which the guest issues after > clearing a PASID table entry. We are trying to get ARM to a point where invalidations don't need to be trapped. It would be good to not rely on that anyplace. Jason
On 5/31/23 8:33 AM, Jason Gunthorpe wrote: > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: >> Hi folks, >> >> This series implements the functionality of delivering IO page faults to >> user space through the IOMMUFD framework. The use case is nested >> translation, where modern IOMMU hardware supports two-stage translation >> tables. The second-stage translation table is managed by the host VMM >> while the first-stage translation table is owned by the user space. >> Hence, any IO page fault that occurs on the first-stage page table >> should be delivered to the user space and handled there. The user space >> should respond the page fault handling result to the device top-down >> through the IOMMUFD response uAPI. >> >> User space indicates its capablity of handling IO page faults by setting >> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD >> will then setup its infrastructure for page fault delivery. Together >> with the iopf-capable flag, user space should also provide an eventfd >> where it will listen on any down-top page fault messages. >> >> On a successful return of the allocation of iopf-capable HWPT, a fault >> fd will be returned. User space can open and read fault messages from it >> once the eventfd is signaled. > This is a performance path so we really need to think about this more, > polling on an eventfd and then reading a different fd is not a good > design. > > What I would like is to have a design from the start that fits into > io_uring, so we can have pre-posted 'recvs' in io_uring that just get > completed at high speed when PRIs come in. > > This suggests that the PRI should be delivered via read() on a single > FD and pollability on the single FD without any eventfd. I will remove the eventfd and provide a single FD for both read() and write(). The userspace reads the FD to retrieve the fault messages while writing the FD to respond the handling of the faults. The user space could leverage the io_uring for asynchronous I/O. A sample userspace design could look like this: [pseudo code for discussion only] struct io_uring ring; io_uring_setup(IOPF_ENTRIES, &ring); while (1) { struct io_uring_prep_read read; struct io_uring_cqe *cqe; read.fd = iopf_fd; read.buf = malloc(IOPF_SIZE); read.len = IOPF_SIZE; read.flags = 0; io_uring_prep_read(&ring, &read); io_uring_submit(&ring); // Wait for the read to complete while ((cqe = io_uring_get_cqe(&ring)) != NULL) { // Check if the read completed if (cqe->res < 0) break; if (page_fault_read_completion(cqe)) { // Get the fault data void *data = cqe->buf; size_t size = cqe->res; // Handle the page fault handle_page_fault(data); // Respond the fault struct io_uring_prep_write write; write.fd = iopf_fd; write.buf = malloc(IOPF_RESPONSE_SIZE); write.len = IOPF_RESPONSE_SIZE; write.flags = 0; io_uring_prep_write(&ring, &write); io_uring_submit(&ring); } // Reap the cqe io_uring_cqe_free(&ring, cqe); } } Did I understand you correctly? Best regards, baolu
On Fri, Jun 23, 2023 at 02:18:38PM +0800, Baolu Lu wrote: > struct io_uring ring; > > io_uring_setup(IOPF_ENTRIES, &ring); > > while (1) { > struct io_uring_prep_read read; > struct io_uring_cqe *cqe; > > read.fd = iopf_fd; > read.buf = malloc(IOPF_SIZE); > read.len = IOPF_SIZE; > read.flags = 0; > > io_uring_prep_read(&ring, &read); > io_uring_submit(&ring); > > // Wait for the read to complete > while ((cqe = io_uring_get_cqe(&ring)) != NULL) { > // Check if the read completed > if (cqe->res < 0) > break; > > if (page_fault_read_completion(cqe)) { > // Get the fault data > void *data = cqe->buf; > size_t size = cqe->res; > > // Handle the page fault > handle_page_fault(data); > > // Respond the fault > struct io_uring_prep_write write; > write.fd = iopf_fd; > write.buf = malloc(IOPF_RESPONSE_SIZE); > write.len = IOPF_RESPONSE_SIZE; > write.flags = 0; > > io_uring_prep_write(&ring, &write); > io_uring_submit(&ring); > } > > // Reap the cqe > io_uring_cqe_free(&ring, cqe); > } > } > > Did I understand you correctly? Yes, basically this is the right idea. There are more complex ways to use the iouring that would be faster still. And the kernel side can have support to speed it up as well. I'm wondering if we should be pushing invalidations on io_uring as well? Jason
On 2023/5/31 2:50, Nicolin Chen wrote: > Hi Baolu, > > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > >> This series implements the functionality of delivering IO page faults to >> user space through the IOMMUFD framework. The use case is nested >> translation, where modern IOMMU hardware supports two-stage translation >> tables. The second-stage translation table is managed by the host VMM >> while the first-stage translation table is owned by the user space. >> Hence, any IO page fault that occurs on the first-stage page table >> should be delivered to the user space and handled there. The user space >> should respond the page fault handling result to the device top-down >> through the IOMMUFD response uAPI. >> >> User space indicates its capablity of handling IO page faults by setting >> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD >> will then setup its infrastructure for page fault delivery. Together >> with the iopf-capable flag, user space should also provide an eventfd >> where it will listen on any down-top page fault messages. >> >> On a successful return of the allocation of iopf-capable HWPT, a fault >> fd will be returned. User space can open and read fault messages from it >> once the eventfd is signaled. > > I think that, whether the guest has an IOPF capability or not, > the host should always forward any stage-1 fault/error back to > the guest. Yet, the implementation of this series builds with > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. > > And I have my doubt at the using the IOPF framework with that > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for > its bottom half workqueue, because a page response could take > a long cycle. But adding that flag feels like we don't really > need the bottom half workqueue, i.e. losing the point of using > the IOPF framework, IMHO. > > Combining the two facts above, I wonder if we really need to > go through the IOPF framework; can't we just register a user > fault handler in the iommufd directly upon a valid event_fd? Agreed. We should avoid workqueue in sva iopf framework. Perhaps we could go ahead with below code? It will be registered to device with iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling path. Un-registering in the disable path of cause. static int io_pgfault_handler(struct iommu_fault *fault, void *cookie) { ioasid_t pasid = fault->prm.pasid; struct device *dev = cookie; struct iommu_domain *domain; if (fault->type != IOMMU_FAULT_PAGE_REQ) return -EOPNOTSUPP; if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0); else domain = iommu_get_domain_for_dev(dev); if (!domain || !domain->iopf_handler) return -ENODEV; if (domain->type == IOMMU_DOMAIN_SVA) return iommu_queue_iopf(fault, cookie); return domain->iopf_handler(fault, dev, domain->fault_data); } Best regards, baolu
On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote: > External email: Use caution opening links or attachments > > > On 2023/5/31 2:50, Nicolin Chen wrote: > > Hi Baolu, > > > > On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: > > > > > This series implements the functionality of delivering IO page faults to > > > user space through the IOMMUFD framework. The use case is nested > > > translation, where modern IOMMU hardware supports two-stage translation > > > tables. The second-stage translation table is managed by the host VMM > > > while the first-stage translation table is owned by the user space. > > > Hence, any IO page fault that occurs on the first-stage page table > > > should be delivered to the user space and handled there. The user space > > > should respond the page fault handling result to the device top-down > > > through the IOMMUFD response uAPI. > > > > > > User space indicates its capablity of handling IO page faults by setting > > > a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD > > > will then setup its infrastructure for page fault delivery. Together > > > with the iopf-capable flag, user space should also provide an eventfd > > > where it will listen on any down-top page fault messages. > > > > > > On a successful return of the allocation of iopf-capable HWPT, a fault > > > fd will be returned. User space can open and read fault messages from it > > > once the eventfd is signaled. > > > > I think that, whether the guest has an IOPF capability or not, > > the host should always forward any stage-1 fault/error back to > > the guest. Yet, the implementation of this series builds with > > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. > > > > And I have my doubt at the using the IOPF framework with that > > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for > > its bottom half workqueue, because a page response could take > > a long cycle. But adding that flag feels like we don't really > > need the bottom half workqueue, i.e. losing the point of using > > the IOPF framework, IMHO. > > > > Combining the two facts above, I wonder if we really need to > > go through the IOPF framework; can't we just register a user > > fault handler in the iommufd directly upon a valid event_fd? > > Agreed. We should avoid workqueue in sva iopf framework. Perhaps we > could go ahead with below code? It will be registered to device with > iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling > path. Un-registering in the disable path of cause. Well, for a virtualization use case, I still think it's should be registered in iommufd. Having a device without an IOPF/PRI capability, a guest OS should receive some faults too, if that device causes a translation failure. And for a vSVA use case, the IOMMU_DEV_FEAT_IOPF feature only gets enabled in the guest VM right? How could the host enable the IOMMU_DEV_FEAT_IOPF to trigger this handler? Thanks Nic > static int io_pgfault_handler(struct iommu_fault *fault, void *cookie) > { > ioasid_t pasid = fault->prm.pasid; > struct device *dev = cookie; > struct iommu_domain *domain; > > if (fault->type != IOMMU_FAULT_PAGE_REQ) > return -EOPNOTSUPP; > > if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) > domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0); > else > domain = iommu_get_domain_for_dev(dev); > > if (!domain || !domain->iopf_handler) > return -ENODEV; > > if (domain->type == IOMMU_DOMAIN_SVA) > return iommu_queue_iopf(fault, cookie); > > return domain->iopf_handler(fault, dev, domain->fault_data); > } > > Best regards, > baolu
On 6/26/23 3:21 AM, Nicolin Chen wrote: > On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote: >> External email: Use caution opening links or attachments >> >> >> On 2023/5/31 2:50, Nicolin Chen wrote: >>> Hi Baolu, >>> >>> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote: >>> >>>> This series implements the functionality of delivering IO page faults to >>>> user space through the IOMMUFD framework. The use case is nested >>>> translation, where modern IOMMU hardware supports two-stage translation >>>> tables. The second-stage translation table is managed by the host VMM >>>> while the first-stage translation table is owned by the user space. >>>> Hence, any IO page fault that occurs on the first-stage page table >>>> should be delivered to the user space and handled there. The user space >>>> should respond the page fault handling result to the device top-down >>>> through the IOMMUFD response uAPI. >>>> >>>> User space indicates its capablity of handling IO page faults by setting >>>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD >>>> will then setup its infrastructure for page fault delivery. Together >>>> with the iopf-capable flag, user space should also provide an eventfd >>>> where it will listen on any down-top page fault messages. >>>> >>>> On a successful return of the allocation of iopf-capable HWPT, a fault >>>> fd will be returned. User space can open and read fault messages from it >>>> once the eventfd is signaled. >>> I think that, whether the guest has an IOPF capability or not, >>> the host should always forward any stage-1 fault/error back to >>> the guest. Yet, the implementation of this series builds with >>> the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. >>> >>> And I have my doubt at the using the IOPF framework with that >>> IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for >>> its bottom half workqueue, because a page response could take >>> a long cycle. But adding that flag feels like we don't really >>> need the bottom half workqueue, i.e. losing the point of using >>> the IOPF framework, IMHO. >>> >>> Combining the two facts above, I wonder if we really need to >>> go through the IOPF framework; can't we just register a user >>> fault handler in the iommufd directly upon a valid event_fd? >> Agreed. We should avoid workqueue in sva iopf framework. Perhaps we >> could go ahead with below code? It will be registered to device with >> iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling >> path. Un-registering in the disable path of cause. > Well, for a virtualization use case, I still think it's should > be registered in iommufd. Emm.. you suggest iommufd calls iommu_register_device_fault_handler() to register its own page fault handler, right? I have a different opinion, iommu_register_device_fault_handler() is called to register a fault handler for a device. It should be called or initiated by a device driver. The iommufd only needs to install a per-domain io page fault handler. I am considering a use case on Intel platform. Perhaps it's similar on other platforms. An SIOV-capable device can support host SVA and assigning mediated devices to user space at the same time. Both host SVA and mediated devices require IOPF. So there will be multiple places where a page fault handler needs to be registered. > Having a device without an IOPF/PRI > capability, a guest OS should receive some faults too, if that > device causes a translation failure. Yes. DMA faults are also a consideration. But I would like to have it supported in a separated series. As I explained in the previous reply, we also need to consider the software nested translation case. > > And for a vSVA use case, the IOMMU_DEV_FEAT_IOPF feature only > gets enabled in the guest VM right? How could the host enable > the IOMMU_DEV_FEAT_IOPF to trigger this handler? As mentioned above, this should be initiated by the kernel device driver, vfio or possible mediated device driver. Best regards, baolu
On Mon, Jun 19, 2023 at 11:35:50AM +0800, Baolu Lu wrote: > > Another outstanding issue was what to do for PASID stop. When the guest > > device driver stops using a PASID it issues a PASID stop request to the > > device (a device-specific mechanism). If the device is not using PRI stop > > markers it waits for pending PRs to complete and we're fine. Otherwise it > > sends a stop marker which is flushed to the PRI queue, but does not wait > > for pending PRs. > > > > Handling stop markers is annoying. If the device issues one, then the PRI > > queue contains stale faults, a stop marker, followed by valid faults for > > the next address space bound to this PASID. The next address space will > > get all the spurious faults because the fault handler doesn't know that > > there is a stop marker coming. Linux is probably alright with spurious > > faults, though maybe not in all cases, and other guests may not support > > them at all. > > > > We might need to revisit supporting stop markers: request that each device > > driver declares whether their device uses stop markers on unbind() ("This > > mechanism must indicate that a Stop Marker Message will be generated." > > says the spec, but doesn't say if the function always uses one or the > > other mechanism so it's per-unbind). Then we still have to synchronize > > unbind() with the fault handler to deal with the pending stop marker, > > which might have already gone through or be generated later. > > I don't quite follow here. Once a PASID is unbound from the device, the > device driver should be free to release the PASID. The PASID could then > be used for any other purpose. The device driver has no idea when the > pending page requests are flushed after unbind(), so it cannot decide > how long should the PASID be delayed for reuse. Therefore, I understand > that a successful return from the unbind() function denotes that all > pending page requests have been flushed and the PASID is viable for > other use. Yes that's the contract for unbind() at the moment > > > > > Currently we ignore all that and just flush the PRI queue, followed by the > > IOPF queue, to get rid of any stale fault before reassigning the PASID. A > > guest however would also need to first flush the HW PRI queue, but doesn't > > have a direct way to do that. If we want to support guests that don't deal > > with stop markers, the host needs to flush the PRI queue when a PASID is > > detached. I guess on Intel detaching the PASID goes through the host which > > can flush the host queue. On Arm we'll probably need to flush the queue > > when receiving a PASID cache invalidation, which the guest issues after > > clearing a PASID table entry. > > The Intel VT-d driver follows below steps to drain pending page requests > when a PASID is unbound from a device. > > - Tear down the device's pasid table entry for the stopped pasid. > This ensures that ATS/PRI will stop putting more page requests for the > pasid in VT-d PRQ. Oh that's interesting, I didn't know about the implicit TLB invalidations on page requests for VT-d. For Arm SMMU, clearing the PASID table entry does cause ATS Translation Requests to return with Completer Abort, but does not affect PRI. The SMMU pushes page requests directly into the PRI queue without reading any table (unless the queue overflows). We're counting on the device driver to perform the PASID stop request before calling unbind(), described in PCIe 6.20.1 (Managing PASID Usage) and 10.4.1.2 (Managing PASID Usage on PRG Requests). This ensures that when unbind() is called, no more page request for the PASID is pushed into the PRI queue. But some may still be in the queue if the device uses stop markers. > - Sync with the PRQ handling thread until all related page requests in > PRQ have been delivered. This is what I'm concerned about. For VT-d this happens in the host which is in charge of modifying the PASID table. For SMMU, the guest writes the PASID table. It flushes its virtual PRI queue, but not the physical queue that is managed by the host. One synchronization point where the host could flush the physical PRI queue is the PASID config invalidation (CMD_CFGI_CD). As Jason pointed out the host may not be able to observe those if a command queue is assigned directly to the guest (a theoretical SMMU extension), though in that case the guest may also have direct access to a PRI queue (like the AMD vIOMMU extension) and be able to flush the queue directly. But we can just wait for PRI implementations and see what the drivers need. Maybe no device will implement stop markers. Thanks, Jean > - Flush the iopf queue with iopf_queue_flush_dev(). > - Follow the steps defined in VT-d spec section 7.10 to drain all page > requests and responses between VT-d and the endpoint device.
On Mon, Jun 26, 2023 at 11:10:22AM +0800, Baolu Lu wrote: > > > > I think that, whether the guest has an IOPF capability or not, > > > > the host should always forward any stage-1 fault/error back to > > > > the guest. Yet, the implementation of this series builds with > > > > the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV. > > > > > > > > And I have my doubt at the using the IOPF framework with that > > > > IOMMU_PAGE_RESP_ASYNC flag: using the IOPF framework is for > > > > its bottom half workqueue, because a page response could take > > > > a long cycle. But adding that flag feels like we don't really > > > > need the bottom half workqueue, i.e. losing the point of using > > > > the IOPF framework, IMHO. > > > > > > > > Combining the two facts above, I wonder if we really need to > > > > go through the IOPF framework; can't we just register a user > > > > fault handler in the iommufd directly upon a valid event_fd? > > > Agreed. We should avoid workqueue in sva iopf framework. Perhaps we > > > could go ahead with below code? It will be registered to device with > > > iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling > > > path. Un-registering in the disable path of cause. > > Well, for a virtualization use case, I still think it's should > > be registered in iommufd. > > Emm.. you suggest iommufd calls iommu_register_device_fault_handler() to > register its own page fault handler, right? > > I have a different opinion, iommu_register_device_fault_handler() is > called to register a fault handler for a device. It should be called > or initiated by a device driver. The iommufd only needs to install a > per-domain io page fault handler. > > I am considering a use case on Intel platform. Perhaps it's similar > on other platforms. An SIOV-capable device can support host SVA and > assigning mediated devices to user space at the same time. Both host > SVA and mediated devices require IOPF. So there will be multiple places > where a page fault handler needs to be registered. Okay, the narrative makes sense to me. I was more thinking of the nesting case. The iommu_register_device_fault_handler() is registered per device, as its name implies, while the handler probably should be slightly different by the attaching domain. It seems that your io_pgfault_handler() in the previous email can potentially handle this, i.e. a IOMMU_DOMAIN_NESTED could set domain->iopf_handler to forward DMA faults to user space. We just need to make sure this pathway would be unconditional at the handler registration and fault->type. > > Having a device without an IOPF/PRI > > capability, a guest OS should receive some faults too, if that > > device causes a translation failure. > > Yes. DMA faults are also a consideration. But I would like to have it > supported in a separated series. As I explained in the previous reply, > we also need to consider the software nested translation case. I see. Thanks Nic
On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote: > Agreed. We should avoid workqueue in sva iopf framework. Perhaps we > could go ahead with below code? It will be registered to device with > iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling > path. Un-registering in the disable path of cause. This maze needs to be undone as well. It makes no sense that all the drivers are calling iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); The driver should RX a PRI fault and deliver it to some core code function, this looks like a good start: > static int io_pgfault_handler(struct iommu_fault *fault, void *cookie) > { > ioasid_t pasid = fault->prm.pasid; > struct device *dev = cookie; > struct iommu_domain *domain; > > if (fault->type != IOMMU_FAULT_PAGE_REQ) > return -EOPNOTSUPP; > > if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) > domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0); > else > domain = iommu_get_domain_for_dev(dev); > > if (!domain || !domain->iopf_handler) > return -ENODEV; > > if (domain->type == IOMMU_DOMAIN_SVA) > return iommu_queue_iopf(fault, cookie); > > return domain->iopf_handler(fault, dev, domain->fault_data); Then we find the domain that owns the translation and invoke its domain->ops->iopf_handler() If the driver created a SVA domain then the op should point to some generic 'handle sva fault' function. There shouldn't be weird SVA stuff in the core code. The weird SVA stuff is really just a generic per-device workqueue dispatcher, so if we think that is valuable then it should be integrated into the iommu_domain (domain->ops->use_iopf_workqueue = true for instance). Then it could route the fault through the workqueue and still invoke domain->ops->iopf_handler. The word "SVA" should not appear in any of this. Not sure what iommu_register_device_fault_handler() has to do with all of this.. Setting up the dev_iommu stuff to allow for the workqueue should happen dynamically during domain attach, ideally in the core code before calling to the driver. Also, I can understand there is a need to turn on PRI support really early, and it can make sense to have some IOMMU_DEV_FEAT_IOPF/SVA to ask to turn it on.. But that should really only be needed if the HW cannot turn it on dynamically during domain attach of a PRI enabled domain. It needs cleaning up.. Jason
On 2023/6/27 2:33, Jason Gunthorpe wrote: > On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote: > >> Agreed. We should avoid workqueue in sva iopf framework. Perhaps we >> could go ahead with below code? It will be registered to device with >> iommu_register_device_fault_handler() in IOMMU_DEV_FEAT_IOPF enabling >> path. Un-registering in the disable path of cause. > > This maze needs to be undone as well. > > It makes no sense that all the drivers are calling > > iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); > > The driver should RX a PRI fault and deliver it to some core code > function, this looks like a good start: > >> static int io_pgfault_handler(struct iommu_fault *fault, void *cookie) >> { >> ioasid_t pasid = fault->prm.pasid; >> struct device *dev = cookie; >> struct iommu_domain *domain; >> >> if (fault->type != IOMMU_FAULT_PAGE_REQ) >> return -EOPNOTSUPP; >> >> if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) >> domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0); >> else >> domain = iommu_get_domain_for_dev(dev); >> >> if (!domain || !domain->iopf_handler) >> return -ENODEV; >> >> if (domain->type == IOMMU_DOMAIN_SVA) >> return iommu_queue_iopf(fault, cookie); >> >> return domain->iopf_handler(fault, dev, domain->fault_data); > > Then we find the domain that owns the translation and invoke its > domain->ops->iopf_handler() Agreed. The iommu_register_device_fault_handler() could only be called by the device drivers who want to handle the DMA faults and IO page faults by themselves in any special ways. By default, the faults should be dispatched to domain->iopf_handler in a generic core code. > > If the driver created a SVA domain then the op should point to some > generic 'handle sva fault' function. There shouldn't be weird SVA > stuff in the core code. > > The weird SVA stuff is really just a generic per-device workqueue > dispatcher, so if we think that is valuable then it should be > integrated into the iommu_domain (domain->ops->use_iopf_workqueue = > true for instance). Then it could route the fault through the > workqueue and still invoke domain->ops->iopf_handler. > > The word "SVA" should not appear in any of this. Yes. We should make it generic. The domain->use_iopf_workqueue flag denotes that the page faults of a fault group should be put together and then be handled and responded in a workqueue. Otherwise, the page fault is dispatched to domain->iopf_handler directly. > > Not sure what iommu_register_device_fault_handler() has to do with all > of this.. Setting up the dev_iommu stuff to allow for the workqueue > should happen dynamically during domain attach, ideally in the core > code before calling to the driver. There are two pointers under struct dev_iommu for fault handling. /** * struct dev_iommu - Collection of per-device IOMMU data * * @fault_param: IOMMU detected device fault reporting data * @iopf_param: I/O Page Fault queue and data [...] struct dev_iommu { struct mutex lock; struct iommu_fault_param *fault_param; struct iopf_device_param *iopf_param; My understanding is that @fault_param is a place holder for generic things, while @iopf_param is workqueue specific. Perhaps we could make @fault_param static and initialize it during iommu device_probe, as IOMMU fault is generic on every device managed by an IOMMU. @iopf_param could be allocated on demand. (perhaps renaming it to a more meaningful one?) It happens before a domain with use_iopf_workqueue flag set attaches to a device. iopf_param keeps alive until device_release. > > Also, I can understand there is a need to turn on PRI support really > early, and it can make sense to have some IOMMU_DEV_FEAT_IOPF/SVA to > ask to turn it on.. But that should really only be needed if the HW > cannot turn it on dynamically during domain attach of a PRI enabled > domain. > > It needs cleaning up.. Yes. I can put this and other cleanup things that we've discussed in a preparation series and send it out for review after the next rc1 is released. > > Jason Best regards, baolu
On Wed, Jun 28, 2023 at 10:00:56AM +0800, Baolu Lu wrote: > > If the driver created a SVA domain then the op should point to some > > generic 'handle sva fault' function. There shouldn't be weird SVA > > stuff in the core code. > > > > The weird SVA stuff is really just a generic per-device workqueue > > dispatcher, so if we think that is valuable then it should be > > integrated into the iommu_domain (domain->ops->use_iopf_workqueue = > > true for instance). Then it could route the fault through the > > workqueue and still invoke domain->ops->iopf_handler. > > > > The word "SVA" should not appear in any of this. > > Yes. We should make it generic. The domain->use_iopf_workqueue flag > denotes that the page faults of a fault group should be put together and > then be handled and responded in a workqueue. Otherwise, the page fault > is dispatched to domain->iopf_handler directly. It might be better to have iopf_handler and iopf_handler_work function pointers to distinguish to two cases. > > Not sure what iommu_register_device_fault_handler() has to do with all > > of this.. Setting up the dev_iommu stuff to allow for the workqueue > > should happen dynamically during domain attach, ideally in the core > > code before calling to the driver. > > There are two pointers under struct dev_iommu for fault handling. > > /** > * struct dev_iommu - Collection of per-device IOMMU data > * > * @fault_param: IOMMU detected device fault reporting data > * @iopf_param: I/O Page Fault queue and data > > [...] > > struct dev_iommu { > struct mutex lock; > struct iommu_fault_param *fault_param; > struct iopf_device_param *iopf_param; > > My understanding is that @fault_param is a place holder for generic > things, while @iopf_param is workqueue specific. Well, lets look struct iommu_fault_param { iommu_dev_fault_handler_t handler; void *data; These two make no sense now. handler is always iommu_queue_iopf. Given our domain centric design we want the function pointer in the domain, not in the device. So delete it. struct list_head faults; struct mutex lock; Queue of unhandled/unacked faults? Seems sort of reasonable > @iopf_param could be allocated on demand. (perhaps renaming it to a more > meaningful one?) It happens before a domain with use_iopf_workqueue flag > set attaches to a device. iopf_param keeps alive until device_release. Yes Do this for the iommu_fault_param as well, in fact, probably just put the two things together in one allocation and allocate if we attach a PRI using domain. I don't think we need to micro optimze further.. Jason
On 2023/6/28 20:49, Jason Gunthorpe wrote: > On Wed, Jun 28, 2023 at 10:00:56AM +0800, Baolu Lu wrote: >>> If the driver created a SVA domain then the op should point to some >>> generic 'handle sva fault' function. There shouldn't be weird SVA >>> stuff in the core code. >>> >>> The weird SVA stuff is really just a generic per-device workqueue >>> dispatcher, so if we think that is valuable then it should be >>> integrated into the iommu_domain (domain->ops->use_iopf_workqueue = >>> true for instance). Then it could route the fault through the >>> workqueue and still invoke domain->ops->iopf_handler. >>> >>> The word "SVA" should not appear in any of this. >> >> Yes. We should make it generic. The domain->use_iopf_workqueue flag >> denotes that the page faults of a fault group should be put together and >> then be handled and responded in a workqueue. Otherwise, the page fault >> is dispatched to domain->iopf_handler directly. > > It might be better to have iopf_handler and > iopf_handler_work function pointers to distinguish to two cases. Both are okay. Let's choose one when we have the code. > >>> Not sure what iommu_register_device_fault_handler() has to do with all >>> of this.. Setting up the dev_iommu stuff to allow for the workqueue >>> should happen dynamically during domain attach, ideally in the core >>> code before calling to the driver. >> >> There are two pointers under struct dev_iommu for fault handling. >> >> /** >> * struct dev_iommu - Collection of per-device IOMMU data >> * >> * @fault_param: IOMMU detected device fault reporting data >> * @iopf_param: I/O Page Fault queue and data >> >> [...] >> >> struct dev_iommu { >> struct mutex lock; >> struct iommu_fault_param *fault_param; >> struct iopf_device_param *iopf_param; >> >> My understanding is that @fault_param is a place holder for generic >> things, while @iopf_param is workqueue specific. > > Well, lets look > > struct iommu_fault_param { > iommu_dev_fault_handler_t handler; > void *data; > > These two make no sense now. handler is always iommu_queue_iopf. Given > our domain centric design we want the function pointer in the domain, > not in the device. So delete it. Agreed. > > struct list_head faults; > struct mutex lock; > > Queue of unhandled/unacked faults? Seems sort of reasonable It's the list of faults pending for response. >> @iopf_param could be allocated on demand. (perhaps renaming it to a more >> meaningful one?) It happens before a domain with use_iopf_workqueue flag >> set attaches to a device. iopf_param keeps alive until device_release. > > Yes > > Do this for the iommu_fault_param as well, in fact, probably just put > the two things together in one allocation and allocate if we attach a > PRI using domain. I don't think we need to micro optimze further.. Yeah, let me try this. Best regards, baolu