Message ID | 20170227195441.5170-23-jean-philippe.brucker@arm.com (mailing list archive) |
---|---|
State | New, archived |
> From: Jean-Philippe Brucker
> Sent: Tuesday, February 28, 2017 3:55 AM
> [...]
>
> API naming
> ==========
>
> I realize that "SVM" as a name isn't great because the svm namespace is
> already taken by AMD-V (Secure Virtual Machine) in arch/x86. Also, the
> name itself doesn't say much.
>
> I personally prefer "Unified Virtual Addressing" (UVA), adopted by CUDA,
> or rather Unified Virtual Address Space (UVAS). Another possibility is
> Unified Virtual Memory (UVM). Acronym UAS for Unified Address Space is
> already used by USB. Same for Shared Address Space (SAS), already in use
> in the kernel, but SVAS would work (although it doesn't look good).

'unified' doesn't quite match 'shared'. In some contexts it means
unifying device-local memory and system memory into one virtual address
space, while SVM is about sharing a CPU virtual address space with a
device.

What about Shared Virtual Addressing (SVA)?

Thanks
Kevin
On Mon, 2017-02-27 at 19:54 +0000, Jean-Philippe Brucker wrote:
> Add three functions to the IOMMU API. iommu_bind_task takes a device and a
> task as argument. If the IOMMU, the device and the bus support it, attach
> task to device and create a Process Address Space ID (PASID) unique to the
> device. DMA from the device can then use the PASID to read or write into
> the address space. iommu_unbind_task removes a bond created with
> iommu_bind_task. iommu_set_svm_ops allows a device driver to set some
> callbacks for specific SVM-related operations.
>
> Try to accommodate current implementations (AMD, Intel and ARM), by
> letting the IOMMU driver do all the work, but attempt by the same occasion
> to find intersections between implementations.
>
> * amd_iommu_v2 expects the device to allocate a PASID and pass it to the
>   IOMMU. The driver also provides separate functions to register callbacks
>   that handle failed PRI requests and invalidate PASIDs.
>
>	int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
>				 struct task_struct *task)
>	void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
>	int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
>					 amd_iommu_invalid_ppr_cb cb)
>	int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
>					    amd_iommu_invalidate_ctx cb)
>
> * intel-svm allocates a PASID, and requires the driver to pass
>   "svm_dev_ops", which currently contains a fault callback. It also
>   doesn't take a task as argument, but uses 'current'.
>
>	int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
>			      struct svm_dev_ops *ops)
>	int intel_svm_unbind_mm(struct device *dev, int pasid)
>
> * For arm-smmu-v3, PASID must be allocated by the SMMU driver since it
>   indexes contexts in an array handled by the SMMU device.

Right. The Intel version was designed with all of the above three in
mind.
It was discussed at the Kernel Summit and LPC on more than one
occasion as it took shape, and what I implemented for Intel basically
represents the consensus of what we thought it should look like.

I meant to convert the AMD driver to the same API, but don't have
access to test hardware. Note that the amdkfd code will need careful
attention here.

Intel slightly deviates from the "one PASID per process" vision too,
because it currently has a PASID allocator idr per IOMMU. That wants
making system-wide. And probably not Intel-specific.

Some other comments...

The callbacks and fault handlers could perhaps be deprecated. In an
ideal world nobody would ever use them — the device itself is supposed
to be able to communicate with its driver about the request that
failed; we don't need a dirty hook into the IOMMU code from when *it*
handles the fault.

In the Intel IOMMU fault reports, there are some additional bits in the
descriptor which are 'context private' bits. For built-in devices like
the graphics engine, this contains further information about precisely
which context was performing the failing access. But again I don't
think we should need it in an ideal world. It's a horrid thing to have
to feed through a generic IOMMU API.

One thing which might help us *avoid* needing it is the
SVM_FLAG_PRIVATE_PASID option, which asks for a *new* PASID. So a
single process can have more than one PASID. That's still OK on ARM,
isn't it? As long as they're all allocated from the same pool and we
never use a given PASID for more than one address space simultaneously
on different devices.

We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
address space. Yes, people use it.

> PASID invalidation
> ------------------
>
> Next, we need to let the IOMMU driver notify the device driver before it
> attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
> in more detail, so we'll simply propose the following interface for now.
>
> AMD has:
>
>	void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);
>
> We put the following in iommu_svm_ops:
>
>	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);

These can basically take for ever, right? You're asking the *device* to
tell you when it's finished using that PASID.

> Capability detection
> ====================
> ...
>
>	int iommu_svm_capable(struct device *dev, int flags);

We already had this for Intel. It basically goes through *all* the
enabling checks that it needs to for really setting up SVM, and that's
why it's actually the *same* call, but with a NULL pasid argument:

#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
Hi David,

Good to hear back from you!

On Fri, Mar 03, 2017 at 09:40:44AM +0000, David Woodhouse wrote:
>
> Intel slightly deviates from the "one PASID per process" vision too,
> because it currently has a PASID allocator idr per IOMMU. That wants
> making system-wide. And probably not Intel-specific.

The Intel version is getting a system-wide PASID allocator. We are
actively working on it and it should be out shortly.

>
> Some other comments...
>
> We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
> address space. Yes, people use it.

There are some gaps in the current implementation, especially related
to vmalloc. We have something internally; it will also be posted soon,
after some testing.

Cheers,
Ashok
Hi David,

On Fri, Mar 03, 2017 at 09:40:44AM +0000, David Woodhouse wrote:
> On Mon, 2017-02-27 at 19:54 +0000, Jean-Philippe Brucker wrote:
> > Add three functions to the IOMMU API. iommu_bind_task takes a device and a
> > task as argument. If the IOMMU, the device and the bus support it, attach
> > task to device and create a Process Address Space ID (PASID) unique to the
> > device. DMA from the device can then use the PASID to read or write into
> > the address space. iommu_unbind_task removes a bond created with
> > iommu_bind_task. iommu_set_svm_ops allows a device driver to set some
> > callbacks for specific SVM-related operations.
> >
> > Try to accommodate current implementations (AMD, Intel and ARM), by
> > letting the IOMMU driver do all the work, but attempt by the same occasion
> > to find intersections between implementations.
> >
> > * amd_iommu_v2 expects the device to allocate a PASID and pass it to the
> >   IOMMU. The driver also provides separate functions to register callbacks
> >   that handle failed PRI requests and invalidate PASIDs.
> >
> >	int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
> >				 struct task_struct *task)
> >	void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
> >	int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
> >					 amd_iommu_invalid_ppr_cb cb)
> >	int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
> >					    amd_iommu_invalidate_ctx cb)
> >
> > * intel-svm allocates a PASID, and requires the driver to pass
> >   "svm_dev_ops", which currently contains a fault callback. It also
> >   doesn't take a task as argument, but uses 'current'.
> >
> >	int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> >			      struct svm_dev_ops *ops)
> >	int intel_svm_unbind_mm(struct device *dev, int pasid)
> >
> > * For arm-smmu-v3, PASID must be allocated by the SMMU driver since it
> >   indexes contexts in an array handled by the SMMU device.
>
> Right. The Intel version was designed with all of the above three in
> mind.
> It was discussed at the Kernel Summit and LPC on more than one
> occasion as it took shape, and what I implemented for Intel basically
> represents the consensus of what we thought it should look like.

Good to know. I didn't intend to deviate much from the Intel version.

> I meant to convert the AMD driver to the same API, but don't have
> access to test hardware. Note that the amdkfd code will need careful
> attention here.
>
> Intel slightly deviates from the "one PASID per process" vision too,
> because it currently has a PASID allocator idr per IOMMU. That wants
> making system-wide. And probably not Intel-specific.

Yes, it would be nice to have a common PASID allocator. But I don't
think that a system-wide PASID space is workable for us. At the moment
systems might have a few identical devices all supporting 20 bits of
PASID. But consider the case where one odd device can only handle four
address spaces, and supports a maximum of two PASID bits. We'd quickly
run out of PASIDs to hand to such devices, even though we could easily
have one PASID space per endpoint (from a quick glance at the specs, I
assume that both Intel and AMD IOMMUs offer one PASID table per RID.)

For example, we could have three devices, all with different PASID
capabilities, attached to process B. Device X was already bound to
process A, so it gets PASID 2 for process B. Devices Y and Z get PASID 1
for process B:

	          .-- PASID 1 ----- process A
	device X -+-- PASID 2 ---.
	device Y ---- PASID 1 ---+- process B
	device Z ---- PASID 1 ---'

This is the model I implemented for ARM, as I wanted to keep the RFC
simple before getting a chance to discuss the API. This is the finest
granularity, but history shows that some systems cannot isolate all
devices/RIDs from each other, be it by accident (bug, missing ACS) or by
design (NTB). Arguably this hasn't come up with SVM yet, and it might
never do.
However, bringing IOMMU groups into the picture early seems preferable
to having to redesign everything five years from now. We would have
PASID spaces for each group, capped to the lowest common denominator in
the group, pasid_bits = min(pasid_bits of each device):

	                    .- PASID 1 ----- process A
	device W -+- group -+- PASID 2 ---.
	device X -'                       |
	device X --- group --- PASID 1 ---+- process B
	device Z --- group --- PASID 1 ---'

Finally, to get in line with the core API, we could make the
iommu_domain hold the PASID space instead of the group. The API would
then look like this:

	int iommu_bind(domain, task, &pasid, flags, priv)
	int iommu_unbind(domain, task, flags)

IOMMU drivers would then call into a common PASID allocator and return a
PASID unique to the domain.

We can accept any number of attach_group on the domain prior to a bind.
The IOMMU driver can enable PASIDs as late as the first bind, so we
don't know the domain capacity beforehand. After the first bind, the
number of PASIDs is fixed to pasid_bits = min(pasid_bits of each group)
for the domain, and we should refuse to attach a new group with fewer
PASIDs.

	device W -+- group --- domain --- PASID 1 ---.
	device X -'                                  |
	device X --- group -+- domain -+- PASID 1 ---+- process B
	device Z --- group -'          '- PASID 2 ----- process A

I could try that for the next version. If it doesn't seem acceptable for
the core API, I guess we can always keep the per-device interface and
have IOMMU drivers manage either one or multiple PASID spaces.

> Some other comments...
>
> The callbacks and fault handlers could perhaps be deprecated. In an
> ideal world nobody would ever use them — the device itself is supposed
> to be able to communicate with its driver about the request that
> failed; we don't need a dirty hook into the IOMMU code from when *it*
> handles the fault.

Agreed.

> In the Intel IOMMU fault reports, there are some additional bits in the
> descriptor which are 'context private' bits.
> For built-in devices like
> the graphics engine, this contains further information about precisely
> which context was performing the failing access. But again I don't
> think we should need it in an ideal world. It's a horrid thing to have
> to feed through a generic IOMMU API.
>
> One thing which might help us *avoid* needing it is the
> SVM_FLAG_PRIVATE_PASID option, which asks for a *new* PASID. So a
> single process can have more than one PASID. That's still OK on ARM,
> isn't it? As long as they're all allocated from the same pool and we
> never use a given PASID for more than one address space simultaneously
> on different devices.

Yes, multiple PASIDs per task for the same device should be fine, and I
wouldn't have any objection to adding the PRIVATE_PASID flag to the core
API.

> We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
> address space. Yes, people use it.

Argh, that seems dodgy :( I'm hoping we won't need this on ARM, since
drivers can always go through the DMA API for accessing kernel memory,
even for a domain that has SVM enabled.

> > PASID invalidation
> > ------------------
> >
> > Next, we need to let the IOMMU driver notify the device driver before it
> > attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
> > in more detail, so we'll simply propose the following interface for now.
> >
> > AMD has:
> >
> >	void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);
> >
> > We put the following in iommu_svm_ops:
> >
> >	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
>
> These can basically take for ever, right? You're asking the *device* to
> tell you when it's finished using that PASID.

Theoretically yes, this could take forever. PCI doesn't give any time
boundary for the stop PASID mechanism (and PCI isn't the whole story on
ARM). But the device driver could simply return 0 after a timeout,
telling us that it was unable to invalidate the PASID.
In which case the IOMMU driver has to decide whether it is ready to risk
having pending transactions for this PASID access the next address
space, or whether it should just throw the PASID away and never use it
again.

> > Capability detection
> > ====================
> > ...
> >
> >	int iommu_svm_capable(struct device *dev, int flags);
>
> We already had this for Intel. It basically goes through *all* the
> enabling checks that it needs to for really setting up SVM, and that's
> why it's actually the *same* call, but with a NULL pasid argument:
>
> #define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))

Right, Alex proposed a similar interface for VFIO, and I like the idea.
Perhaps we could add a flag IOMMU_BIND_PROBE to the core API instead of
relying on NULL pointers, to make it clearer?

Thanks,
Jean-Philippe
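The per-group/per-domain PASID space discussed above can be sketched as
a userspace model. This is not kernel code: `svm_domain`,
`domain_attach_group` and `domain_bind` are hypothetical names, and a
real implementation would use the kernel's IDA and proper locking. The
sketch only shows the proposed rules: capacity is capped to
min(pasid_bits) over the attached groups, frozen at the first bind, and
a group with fewer bits is refused afterwards.

```c
#include <assert.h>

/* Userspace model of a per-domain PASID space (hypothetical names). */
#define MAX_PASID_BITS	20

struct svm_domain {
	unsigned int pasid_bits;	/* capacity: (1 << pasid_bits) - 1 PASIDs */
	int bound;			/* set at first bind, freezing capacity */
	unsigned char used[(1 << MAX_PASID_BITS) / 8]; /* allocation bitmap */
};

/* Attaching a group caps the domain; refused once bound with fewer bits. */
static int domain_attach_group(struct svm_domain *d, unsigned int group_bits)
{
	if (d->bound && group_bits < d->pasid_bits)
		return -1;	/* would shrink a frozen PASID space */
	if (group_bits < d->pasid_bits)
		d->pasid_bits = group_bits;
	return 0;
}

/* Allocate the lowest free PASID in the domain; PASID 0 is reserved. */
static int domain_bind(struct svm_domain *d)
{
	unsigned int pasid, max = 1u << d->pasid_bits;

	d->bound = 1;
	for (pasid = 1; pasid < max; pasid++) {
		if (!(d->used[pasid / 8] & (1u << (pasid % 8)))) {
			d->used[pasid / 8] |= 1u << (pasid % 8);
			return (int)pasid;
		}
	}
	return -1;	/* PASID space exhausted */
}
```

With a 2-bit group attached, the domain hands out PASIDs 1 to 3 and then
runs dry, and a subsequent 1-bit group is rejected, matching the
"refuse to attach a new group with fewer PASIDs" rule above.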
On Fri, Mar 03, 2017 at 06:39:58PM +0000, Jean-Philippe Brucker wrote:
> Yes, it would be nice to have a common PASID allocator. But I don't
> think that a system-wide PASID space is workable for us. At the moment
> systems might have a few identical devices all supporting 20 bits of
> PASID. But consider the case where one odd device can only handle four
> address spaces, and supports a maximum of two PASID bits. We'd quickly
> run out of PASIDs to hand to such devices, even though we could easily
> have one PASID space per endpoint (from a quick glance at the specs, I
> assume that both Intel and AMD IOMMUs offer one PASID table per RID.)

But that shouldn't be a problem if we allocate PASIDs top-down (meaning
starting from the biggest value supported by a given device), right?

Then we can satisfy the devices with 16 or 20 bit PASIDs and still have
the 2-bit PASIDs free for the devices that need it.


	Joerg
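The top-down scheme suggested here can be sketched as a userspace model
(hypothetical names, a plain bitmap instead of the kernel's allocator):
each allocation starts from the highest PASID the requesting device
supports and searches downward, so wide devices stay out of the low
values that narrow devices depend on.

```c
#include <assert.h>

/* Userspace model of a system-wide, top-down PASID allocator. */
#define SYSTEM_PASID_BITS 20

static unsigned char pasid_used[(1 << SYSTEM_PASID_BITS) / 8];

/* Allocate for a device supporting dev_bits of PASID; PASID 0 is reserved. */
static int pasid_alloc_topdown(unsigned int dev_bits)
{
	unsigned int pasid;

	/* Search downward from the largest value this device can express. */
	for (pasid = (1u << dev_bits) - 1; pasid >= 1; pasid--) {
		if (!(pasid_used[pasid / 8] & (1u << (pasid % 8)))) {
			pasid_used[pasid / 8] |= 1u << (pasid % 8);
			return (int)pasid;
		}
	}
	return -1;	/* no free PASID within this device's range */
}
```

In this model 20-bit devices consume PASIDs from 0xfffff downward, so a
2-bit device still finds 1 to 3 free; the range simply exhausts after
three allocations, which is the limit the narrow device imposes anyway.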
Hi Jean-Philippe,

On Mon, Feb 27, 2017 at 07:54:33PM +0000, Jean-Philippe Brucker wrote:
> +extern int iommu_set_svm_ops(struct device *dev,
> +			     const struct iommu_svm_ops *svm_ops);
> +extern int iommu_bind_task(struct device *dev, struct task_struct *task,
> +			   int *pasid, int flags, void *priv);
> +
> +extern int iommu_unbind_task(struct device *dev, int pasid, int flags);

I really like that API. It is simpler than what the AMD driver currently
implements, but should work for it too (once we adapt the AMD-KFD driver
to it).

One issue I'd like to have discussed is whether we can make a global
PASID allocation (with a one-PASID-per-task model) workable with the
SMMU too.


	Joerg
Hi Joerg,

On 22/03/17 15:36, Joerg Roedel wrote:
> On Fri, Mar 03, 2017 at 06:39:58PM +0000, Jean-Philippe Brucker wrote:
>> Yes, it would be nice to have a common PASID allocator. But I don't
>> think that a system-wide PASID space is workable for us. At the moment
>> systems might have a few identical devices all supporting 20 bits of
>> PASID. But consider the case where one odd device can only handle four
>> address spaces, and supports a maximum of two PASID bits. We'd quickly
>> run out of PASIDs to hand to such devices, even though we could easily
>> have one PASID space per endpoint (from a quick glance at the specs, I
>> assume that both Intel and AMD IOMMUs offer one PASID table per RID.)
>
> But that shouldn't be a problem if we allocate PASIDs top-down (meaning
> starting from the biggest value supported by a given device), right?
>
> Then we can satisfy the devices with 16 or 20 bit PASIDs and still have
> the 2-bit PASIDs free for the devices that need it.

But if there are more than four devices that only support 2-bit PASIDs,
you still get starvation that you wouldn't get with a per-domain/device
PASID allocator. Admittedly I have no real-world example to back this
up, and we can probably expect vendors to implement a sane number of
PASID bits.

Unifying the API is certainly more important than imagining all the
twisted configurations possible, and a PASID allocator with per-task
top-down allocation seems to me like an acceptable compromise.

Thanks,
Jean-Philippe
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8ea14f41a979..26c5f6528c69 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1438,6 +1438,114 @@ void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_group);
 
+int iommu_set_svm_ops(struct device *dev, const struct iommu_svm_ops *svm_ops)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get_for_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->set_svm_ops) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	ret = ops->set_svm_ops(dev, svm_ops);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_set_svm_ops);
+
+/*
+ * iommu_bind_task - Share task address space with device
+ *
+ * @dev: device to bind
+ * @task: task to bind
+ * @pasid: valid address where the PASID is stored
+ * @flags: driver-specific flags
+ * @priv: private data to associate with the bond
+ *
+ * Create a bond between device and task, allowing the device to access
+ * the task address space using @pasid. Intel and ARM SMMU drivers
+ * allocate and return the PASID, while AMD requires the caller to
+ * allocate a PASID beforehand.
+ *
+ * iommu_unbind_task must be called with this PASID before the task exits.
+ */
+int iommu_bind_task(struct device *dev, struct task_struct *task, int *pasid,
+		    int flags, void *priv)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	if (!pasid)
+		return -EINVAL;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -ENODEV;
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->bind_task) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	if (!group->domain)
+		ret = -EINVAL;
+	else
+		ret = ops->bind_task(dev, task, pasid, flags, priv);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_bind_task);
+
+/*
+ * iommu_unbind_task - Remove a bond created with iommu_bind_task
+ *
+ * @dev: device bound to the task
+ * @pasid: identifier of the bond
+ * @flags: state of the PASID and driver-specific flags
+ */
+int iommu_unbind_task(struct device *dev, int pasid, int flags)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -ENODEV;
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->unbind_task) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	if (!group->domain)
+		ret = -EINVAL;
+	else
+		ret = ops->unbind_task(dev, pasid, flags);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_task);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
 	if (unlikely(domain->ops->iova_to_phys == NULL))
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6a6de187ddc0..9554f45d4305 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -145,6 +145,16 @@ struct iommu_resv_region {
 	int			type;
 };
 
+/*
+ * @handle_fault: report or handle a fault from the device (FIXME: imprecise)
+ * @invalidate_pasid: stop using a PASID.
+ */
+struct iommu_svm_ops {
+	int (*handle_fault)(struct device *dev, int pasid, u64 address,
+			    int prot, int status, void *priv);
+	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
+};
+
 #ifdef CONFIG_IOMMU_API
 
 /**
@@ -154,6 +164,9 @@ struct iommu_resv_region {
  * @domain_free: free iommu domain
  * @attach_dev: attach device to an iommu domain
  * @detach_dev: detach device from an iommu domain
+ * @set_svm_ops: set SVM callbacks for device
+ * @bind_task: attach a task address space to a device
+ * @unbind_task: detach a task address space from a device
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
  * @map_sg: map a scatter-gather list of physically contiguous memory chunks
@@ -183,6 +196,10 @@ struct iommu_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
 	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
+	int (*set_svm_ops)(struct device *dev, const struct iommu_svm_ops *ops);
+	int (*bind_task)(struct device *dev, struct task_struct *task,
+			 int *pasid, int flags, void *priv);
+	int (*unbind_task)(struct device *dev, int pasid, int flags);
 	int (*map)(struct iommu_domain *domain, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
@@ -403,6 +420,13 @@ void iommu_fwspec_free(struct device *dev);
 int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids);
 const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode);
 
+extern int iommu_set_svm_ops(struct device *dev,
+			     const struct iommu_svm_ops *svm_ops);
+extern int iommu_bind_task(struct device *dev, struct task_struct *task,
+			   int *pasid, int flags, void *priv);
+
+extern int iommu_unbind_task(struct device *dev, int pasid, int flags);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -663,6 +687,23 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline int iommu_set_svm_ops(struct device *dev,
+				    const struct iommu_svm_ops *svm_ops)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_bind_task(struct device *dev, struct task_struct *task,
+				  int *pasid, int flags, void *priv)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unbind_task(struct device *dev, int pasid, int flags)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
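For illustration, the call flow a device driver would follow with this
interface can be modelled in userspace. Everything here is a heavily
stubbed stand-in, not the kernel's types: the point is only the ordering
of iommu_set_svm_ops, iommu_bind_task and the invalidate_pasid callback
on unbind, and the way @priv is handed back to the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types modelling the proposed API; not the kernel's. */
struct device;

struct iommu_svm_ops {
	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
};

struct device {
	const struct iommu_svm_ops *svm_ops;	/* set via iommu_set_svm_ops */
	void *bond_priv;			/* @priv from iommu_bind_task */
	int next_pasid;
};

static int iommu_set_svm_ops(struct device *dev, const struct iommu_svm_ops *ops)
{
	dev->svm_ops = ops;
	return 0;
}

static int iommu_bind_task(struct device *dev, void *task, int *pasid,
			   int flags, void *priv)
{
	if (!pasid)
		return -22;	/* modelling -EINVAL: caller provides storage */
	*pasid = ++dev->next_pasid;	/* "IOMMU driver" allocates the PASID */
	dev->bond_priv = priv;
	return 0;
}

static int iommu_unbind_task(struct device *dev, int pasid, int flags)
{
	/* Ask the device driver to stop using the PASID before teardown. */
	if (dev->svm_ops && dev->svm_ops->invalidate_pasid)
		return dev->svm_ops->invalidate_pasid(dev, pasid, dev->bond_priv);
	return 0;
}

/* --- Hypothetical device driver --- */

static int invalidated;

static int mydev_invalidate_pasid(struct device *dev, int pasid, void *priv)
{
	invalidated = pasid;	/* quiesce queues tagged with this PASID */
	return 0;
}

static const struct iommu_svm_ops mydev_svm_ops = {
	.invalidate_pasid = mydev_invalidate_pasid,
};
```

A driver would register mydev_svm_ops once, bind a task to obtain a
PASID and program it into the device, and rely on the invalidate_pasid
callback firing during unbind before the address space goes away.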
Add three functions to the IOMMU API. iommu_bind_task takes a device and a
task as argument. If the IOMMU, the device and the bus support it, attach
the task to the device and create a Process Address Space ID (PASID)
unique to the device. DMA from the device can then use the PASID to read
or write into the address space. iommu_unbind_task removes a bond created
with iommu_bind_task. iommu_set_svm_ops allows a device driver to set some
callbacks for specific SVM-related operations.

Try to accommodate current implementations (AMD, Intel and ARM) by
letting the IOMMU driver do all the work, but attempt at the same time to
find intersections between implementations.

* amd_iommu_v2 expects the device to allocate a PASID and pass it to the
  IOMMU. The driver also provides separate functions to register callbacks
  that handle failed PRI requests and invalidate PASIDs.

	int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
				 struct task_struct *task)
	void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
	int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
					 amd_iommu_invalid_ppr_cb cb)
	int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
					    amd_iommu_invalidate_ctx cb)

* intel-svm allocates a PASID, and requires the driver to pass
  "svm_dev_ops", which currently contains a fault callback. It also
  doesn't take a task as argument, but uses 'current'.

	int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
			      struct svm_dev_ops *ops)
	int intel_svm_unbind_mm(struct device *dev, int pasid)

* For arm-smmu-v3, the PASID must be allocated by the SMMU driver since it
  indexes contexts in an array handled by the SMMU device.

Bind and unbind
===============

The following could suit existing implementations:

	int iommu_bind_task(struct device *dev, struct task_struct *task,
			    int *pasid, int flags, void *priv);

	int iommu_unbind_task(struct device *dev, int pasid, int flags);

This is similar to existing functions.

* @dev is an SVM-capable device.
  If it is not, bind fails.

* @task is a userspace task. It doesn't have to be current, but
  implementations can reject the call if they only support current.

* @pasid is a handle for the bond. It would be nice to have the IOMMU
  driver handle PASID allocation, for consistency. Otherwise, the
  requirement for drivers to allocate PASIDs might be advertised in a
  capability.

* @flags represents parameters of bind/unbind. We might want to reserve a
  few bits, maybe the bottom half, for the API, and give the rest to the
  driver.

* @priv will be passed to SVM callbacks targeting this bond.

SVM device callbacks
====================

Making svm_dev_ops (here iommu_svm_ops) a first-class citizen of struct
device would be a useful next step. Device drivers could set this
structure when they want to participate in SVM. For the moment,
iommu_set_svm_ops must be called. I'm not sure what to do when assigning
a device via VFIO. Should we remove the SVM ops when detaching from a
domain, or have the device driver remove them when detaching itself from
a device?

Fault handling
--------------

The first callback allows a driver to be notified when the IOMMU driver
cannot handle a fault.

amd_iommu_v2 has:

	int (*amd_iommu_invalid_ppr_cb)(struct pci_dev *pdev, int pasid,
					unsigned long address, u16 prot)

intel-svm has (called for all faults):

	void (*fault_cb)(struct device *dev, int pasid, u64 address,
			 u32 private, int rwxp, int response)

We put the following in iommu_svm_ops:

	int (*handle_fault)(struct device *dev, int pasid, u64 address,
			    int prot, int status, void *priv);

The IOMMU driver calls handle_mm_fault and sends the result back to the
device. If the fault cannot be handled, it gives the device driver a
chance to record the fault and maybe even fix it up.

@pasid, @address and @prot are copied from the page request. @status is
the return value of handle_mm_fault. @prot could use the format defined
in iommu.h (IOMMU_READ, IOMMU_WRITE, etc.)
@status could be a combination of VM_FAULT_* as returned by
handle_mm_fault, but this leaves out the case where we don't even reach
the fault handling part. We could instead define new status flags: one
for failure to locate the context associated with the PASID, one for
failure of the mm to handle the fault. We cannot piggy-back on the
existing IOMMU_FAULT_READ and WRITE in their current state, because
devices might request both read and write permissions at the same time.
They would need to be redefined as flags.

All callbacks have a @priv field. This is an opaque pointer set by the
device driver when binding. This way the device driver gets both a PASID
and its metadata in the callback, and we avoid duplicating PASID state
lookups in both the IOMMU driver and the device driver.

Another question is the location of the callback. The IOMMU driver could
notify the device driver either:

* before handle_mm_fault, to do some custom fault handling and perhaps
  bypass the IOMMU handler entirely,
* after handle_mm_fault, to notify the driver of an error (AMD),
* after handle_mm_fault, to notify the driver of any page request (Intel).

We might want to let the driver decide when binding a PASID, or offer two
callbacks: handle_fault and report_fault. I don't have a proposal for
this yet.

handle_fault returns the response that the IOMMU driver should send to
the device: either success, meaning that the page has been mapped (or is
likely to succeed later), or failure, meaning that the device shouldn't
bother retrying.

It would be nice to reconcile this with the iommu_fault_handler API,
which isn't widely used yet but is being considered for handling domain
faults from platform devices on the SMMUv2, using the stall model instead
of ATS/PRI. Yet another concern for ARM is that platform devices may
issue traffic over multiple stream IDs, for instance one stream ID per
channel in a DMA engine. handle_fault doesn't provide a way to pass those
stream IDs back to the driver.
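The flag-based encoding of @prot and @status suggested above might look
like the following sketch. All names and values here are made up for
illustration; nothing in this patch defines them:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical access flags: a bitmask, so a request can be R and W at once. */
#define SVM_FAULT_READ		(1 << 0)
#define SVM_FAULT_WRITE		(1 << 1)
#define SVM_FAULT_EXEC		(1 << 2)

/* Hypothetical @status values distinguishing the two failure cases. */
#define SVM_STATUS_NO_CONTEXT	1	/* no context bound to this PASID */
#define SVM_STATUS_MM_ERROR	2	/* handle_mm_fault failed */

/* Response the handler returns, mirroring a PRI-style response. */
enum fault_response { FAULT_RETRY = 0, FAULT_INVALID = 1 };

static int mydev_handle_fault(int pasid, unsigned long long address,
			      int prot, int status, void *priv)
{
	/* A fault on an unbound PASID cannot be fixed up: tell the device
	 * not to retry. */
	if (status == SVM_STATUS_NO_CONTEXT)
		return FAULT_INVALID;

	/* Driver-specific fixup for a combined read+write request could
	 * go here; this sketch just asks the device to retry. */
	if ((prot & (SVM_FAULT_READ | SVM_FAULT_WRITE)) ==
	    (SVM_FAULT_READ | SVM_FAULT_WRITE))
		return FAULT_RETRY;

	return FAULT_INVALID;
}
```

Because @prot is a bitmask rather than an enum, the combined
read-and-write page request mentioned above is representable without
ambiguity, which is exactly what the current IOMMU_FAULT_READ/WRITE
values cannot express.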
PASID invalidation
------------------

Next, we need to let the IOMMU driver notify the device driver before it
attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
in more detail, so we'll simply propose the following interface for now.

AMD has:

	void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);

We put the following in iommu_svm_ops:

	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);

Capability detection
====================

I didn't add any public function for detecting SVM capability yet. In my
opinion, a nice way to do it is to have the user query the state of the
device to know whether they can call bind/unbind. If the IOMMU supports
SVM, and the IOMMU driver was able to enable it successfully in the
device, then the user can call bind/unbind on the device.

In the VFIO patch later in this series, I implemented the PCIe detection
like this: if ATS, PRI and PASID are enabled (by the IOMMU driver), then
the device can do SVM. If for some reason the IOMMU is incompatible with
the device's SVM properties or with the MMU page tables, then it
shouldn't enable PRI or PASID.

For platform devices, the requirements are very blurry at the moment.
We'll probably add a device-tree property saying that a device and its
bus are SVM-capable.

The following interface could be added to the API:

	int iommu_svm_capable(struct device *dev, int flags);

This tells the device driver whether the IOMMU driver is capable of
binding a task to the device. @flags may contain specific SVM
capabilities (paging/pinned, executable, etc.) and the function could
return a subset of these flags. For PCI devices, everything is enabled
when this call is successful. For platform devices, the device driver
would have to enable SVM itself.

API naming
==========

I realize that "SVM" as a name isn't great because the svm namespace is
already taken by AMD-V (Secure Virtual Machine) in arch/x86. Also, the
name itself doesn't say much.
I personally prefer "Unified Virtual Addressing" (UVA), adopted by CUDA,
or rather Unified Virtual Address Space (UVAS). Another possibility is
Unified Virtual Memory (UVM). The acronym UAS, for Unified Address Space,
is already used by USB. The same goes for Shared Address Space (SAS),
already in use in the kernel, but SVAS would work (although it doesn't
look good).

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h |  41 +++++++++++++++++++
 2 files changed, 149 insertions(+)