
[RFC,22/30] iommu: Bind/unbind tasks to/from devices

Message ID 20170227195441.5170-23-jean-philippe.brucker@arm.com (mailing list archive)
State New, archived

Commit Message

Jean-Philippe Brucker Feb. 27, 2017, 7:54 p.m. UTC
Add three functions to the IOMMU API. iommu_bind_task takes a device and a
task as arguments. If the IOMMU, the device and the bus support it, it
attaches the task to the device and creates a Process Address Space ID
(PASID) unique to the device. DMA from the device can then use the PASID to
read or write into the address space. iommu_unbind_task removes a bond
created with iommu_bind_task. iommu_set_svm_ops allows a device driver to
set callbacks for specific SVM-related operations.

Try to accommodate current implementations (AMD, Intel and ARM) by
letting the IOMMU driver do all the work, while at the same time attempting
to find intersections between implementations.

* amd_iommu_v2 expects the device driver to allocate a PASID and pass it to
  the IOMMU. The driver also provides separate functions to register
  callbacks that handle failed PRI requests and invalidate PASIDs.

  int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
			   struct task_struct *task)
  void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
  int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
				   amd_iommu_invalid_ppr_cb cb)
  int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
				      amd_iommu_invalidate_ctx cb)

* intel-svm allocates a PASID, and requires the driver to pass
  "svm_dev_ops", which currently contains a fault callback. It also
  doesn't take a task as argument, but uses 'current'.

  int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
			struct svm_dev_ops *ops)
  int intel_svm_unbind_mm(struct device *dev, int pasid)

* For arm-smmu-v3, PASID must be allocated by the SMMU driver since it
  indexes contexts in an array handled by the SMMU device.


  Bind and unbind
  ===============

The following could suit existing implementations:

int iommu_bind_task(struct device *dev, struct task_struct *task,
		    int *pasid, int flags, void *priv);

int iommu_unbind_task(struct device *dev, int pasid, int flags);

This is similar to existing functions.
* @dev is an SVM-capable device. If it is not, bind fails.
* @task is a userspace task. It doesn't have to be current, but
  implementations can reject the call if they only support current.
* @pasid is a handle for the bond. It would be nice to have the IOMMU
  driver handle PASID allocation, for consistency. Otherwise, the
  requirement for drivers to allocate PASIDs might be advertised in a
  capability.
* @flags represents parameters of bind/unbind. We might want to reserve a
  few bits, maybe the bottom half, for the API, and give the rest to the
  driver.
* @priv will be passed to SVM callbacks targeting this bond.
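
As a rough illustration of the proposed calling convention, here is a
self-contained userspace mock. The kernel types are stand-ins, and the
iommu_bind_task/iommu_unbind_task bodies below just hand out a fixed PASID;
they are not the real implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types */
struct device { const char *name; };
struct task_struct { int pid; };

/* Mock: the real IOMMU driver would allocate a PASID unique to @dev and
 * store it through @pasid. */
static int iommu_bind_task(struct device *dev, struct task_struct *task,
			   int *pasid, int flags, void *priv)
{
	if (!dev || !task || !pasid)
		return -1;		/* -EINVAL in the kernel */
	*pasid = 1;			/* pretend PASID 1 is free for @dev */
	return 0;
}

static int iommu_unbind_task(struct device *dev, int pasid, int flags)
{
	return (dev && pasid > 0) ? 0 : -1;
}

/* What a device driver's attach path could look like: bind, program the
 * PASID into the device, and later unbind with the same handle. */
static int driver_attach_process(struct device *dev, struct task_struct *task)
{
	int pasid;
	int ret = iommu_bind_task(dev, task, &pasid, 0, NULL);

	if (ret)
		return ret;
	/* ... tag the device's DMA transactions with @pasid ... */
	return iommu_unbind_task(dev, pasid, 0);
}
```

The driver never interprets the PASID value itself; it only carries it
between bind, the device's DMA setup, and unbind.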


  SVM device callbacks
  ====================

Making svm_dev_ops (here iommu_svm_ops) a first-class citizen of struct
device would be a useful next step. Device drivers could set this
structure when they want to participate in SVM. For the moment,
iommu_set_svm_ops must be called. I'm not sure what to do when assigning a
device via VFIO. Should we remove the SVM ops when detaching from a
domain, or have the device driver remove them when detaching itself from a
device?

  Fault handling
  --------------

The first callback allows a driver to be notified when the IOMMU driver
cannot handle a fault.

amd_iommu_v2 has:

int (*amd_iommu_invalid_ppr_cb)(struct pci_dev *pdev, int pasid,
				unsigned long address, u16 prot)

intel-svm has (called for all faults):

void (*fault_cb)(struct device *dev, int pasid, u64 address, u32 private,
		 int rwxp, int response)

We put the following in iommu_svm_ops:

int (*handle_fault)(struct device *dev, int pasid, u64 address, int prot,
		    int status, void *priv);

The IOMMU driver calls handle_mm_fault and sends the result back to the
device. If the fault cannot be handled, the device driver gets a chance to
record the fault and maybe even fix it up. @pasid, @address and @prot are
copied from the page request. @status is the return value of
handle_mm_fault. @prot could use the format defined in iommu.h
(IOMMU_READ, IOMMU_WRITE, etc.) @status could be a combination of
VM_FAULT_* flags as returned by handle_mm_fault, but this leaves out the
case where we don't even reach the fault handling part. We could instead
define new status flags: one for failure to locate the context associated
with the PASID, and one for failure of the mm to handle the fault. We
cannot piggy-back on the existing IOMMU_FAULT_READ and IOMMU_FAULT_WRITE
in their current state, because devices might request both read and write
permissions at the same time; they would need to be redefined as flags.
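
Redefined as independent bits, a combined read+write request becomes a
simple OR. The encoding below is hypothetical, chosen only to illustrate
the point (the current kernel flags are not defined this way):

```c
#include <assert.h>

/* Hypothetical re-definition as independent permission bits */
#define IOMMU_FAULT_READ	(1 << 0)
#define IOMMU_FAULT_WRITE	(1 << 1)
#define IOMMU_FAULT_EXEC	(1 << 2)

/* A page request asking for read and write permission at once: */
static int rw_fault_prot(void)
{
	return IOMMU_FAULT_READ | IOMMU_FAULT_WRITE;
}
```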

All callbacks have a @priv argument. This is an opaque pointer set by the
device driver when binding. This way the device driver gets both a PASID
and its metadata in the callback, and we avoid duplicating PASID state
lookups in both the IOMMU driver and the device driver.

Another question is the location of the callback. The IOMMU driver could
notify the device driver either:

* before handle_mm_fault, to do some custom fault handling and perhaps
  bypass the IOMMU handler entirely,
* after handle_mm_fault, to notify the driver of an error (AMD), or
* after handle_mm_fault, to notify the driver of any page request (Intel).

We might want to let the driver decide when binding a PASID, or offer two
callbacks: handle_fault and report_fault. I don't have a proposal for this
yet.

handle_fault returns the response that the IOMMU driver should send to the
device. Either success, meaning that the page has been mapped (or it is
likely to succeed later), or failure, meaning that the device shouldn't
bother retrying.

It would be nice to reconcile this with the iommu_fault_handler API, which
isn't widely used yet but is being considered for handling domain faults
from platform devices on SMMUv2, using the stall model instead of ATS/PRI.
Yet another concern for ARM is that platform devices may issue traffic
over multiple stream IDs, for instance one stream ID per channel in a DMA
engine. handle_fault doesn't provide a way to pass those stream IDs back
to the driver.

  PASID invalidation
  ------------------

Next, we need to let the IOMMU driver notify the device driver before it
attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
in more detail, so we'll simply propose the following interface for now.

AMD has:

void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);

We put the following in iommu_svm_ops:

int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
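
Put together, a device driver would fill both callbacks once and register
them with iommu_set_svm_ops at probe time. The mock below stubs the kernel
types to show the shape of the structure; everything prefixed my_ is an
illustrative name, and u64 is replaced for userspace:

```c
#include <assert.h>
#include <stddef.h>

struct device { const char *name; };	/* stand-in for the kernel type */

/* Mirror of the proposed iommu_svm_ops */
struct iommu_svm_ops {
	int (*handle_fault)(struct device *dev, int pasid,
			    unsigned long long address, int prot, int status,
			    void *priv);
	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
};

/* Driver-private per-bond state, handed back through @priv */
struct my_bond {
	int faults;
	int invalidated;
};

static int my_handle_fault(struct device *dev, int pasid,
			   unsigned long long address, int prot, int status,
			   void *priv)
{
	struct my_bond *bond = priv;

	bond->faults++;		/* record the unhandled fault */
	return 0;		/* response the IOMMU should send the device */
}

static int my_invalidate_pasid(struct device *dev, int pasid, void *priv)
{
	struct my_bond *bond = priv;

	bond->invalidated = 1;	/* stop issuing DMA tagged with @pasid */
	return 0;
}

static const struct iommu_svm_ops my_svm_ops = {
	.handle_fault		= my_handle_fault,
	.invalidate_pasid	= my_invalidate_pasid,
};
```

Because @priv is the bond state the driver stored at bind time, neither
callback needs to look the PASID up again.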

  Capability detection
  ====================

I didn't add any public function for detecting SVM capability yet. In my
opinion, a nice way to do it is to have users query the state of the
device to know whether they can call bind/unbind. If the IOMMU supports
SVM, and the IOMMU driver was able to enable it successfully in the
device, then users can call bind/unbind on the device.

In the VFIO patch later in this series, I implemented the PCIe detection
like this: if ATS, PRI and PASID are enabled (by the IOMMU driver), then
the device can do SVM. If for some reason the IOMMU is incompatible with
the device's SVM properties or with the MMU page tables, then it shouldn't
enable PRI or PASID. For platform devices, the requirements are very
blurry at the moment. We'll probably add a device-tree property saying
that a device and its bus are SVM-capable. The following interface could
be added to the API:

int iommu_svm_capable(struct device *dev, int flags);

This tells the device driver whether the IOMMU driver is capable of
binding a task to the device. @flags may contain specific SVM capabilities
(paging/pinned, executable, etc.) and the function could return a subset
of these flags. For PCI devices, everything is enabled when this call
succeeds. For platform devices, the device driver would have to enable
SVM itself.
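
The flags-in/subset-out contract could look like the sketch below. Both the
flag names and the masking logic are made up for illustration; the real
function would inspect the device's state rather than take a capability
mask:

```c
#include <assert.h>

/* Hypothetical SVM capability flags */
#define IOMMU_SVM_PAGED		(1 << 0)	/* faultable, unpinned memory */
#define IOMMU_SVM_PINNED	(1 << 1)
#define IOMMU_SVM_EXEC		(1 << 2)

/* Mock of the proposed contract: the IOMMU driver masks the queried flags
 * with what it actually enabled for the device, so the caller gets back
 * the supported subset (0 meaning no SVM at all). */
static int iommu_svm_capable(int enabled_caps, int flags)
{
	return enabled_caps & flags;
}
```

A driver that needs executable mappings would query with IOMMU_SVM_EXEC set
and refuse to bind if the returned subset lacks it.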

  API naming
  ==========

I realize that "SVM" as a name isn't great because the svm namespace is
already taken by AMD-V (Secure Virtual Machine) in arch/x86. Also, the
name itself doesn't say much.

I personally prefer "Unified Virtual Addressing" (UVA), adopted by CUDA,
or rather Unified Virtual Address Space (UVAS). Another possibility is
Unified Virtual Memory (UVM). The acronym UAS, for Unified Address Space,
is already used by USB. The same goes for Shared Address Space (SAS),
already in use in the kernel, but SVAS would work (although it doesn't
look good).

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h |  41 +++++++++++++++++++
 2 files changed, 149 insertions(+)

Comments

Tian, Kevin March 2, 2017, 7:29 a.m. UTC | #1
> From: Jean-Philippe Brucker
> Sent: Tuesday, February 28, 2017 3:55 AM
> 
[...]
> 
>   API naming
>   ==========
> 
> I realize that "SVM" as a name isn't great because the svm namespace is
> already taken by AMD-V (Secure Virtual Machine) in arch/x86. Also, the
> name itself doesn't say much.
> 
> I personally prefer "Unified Virtual Addressing" (UVA), adopted by CUDA,
> or rather Unified Virtual Address Space (UVAS). Another possibility is
> Unified Virtual Memory (UVM). Acronym UAS for Unified Address Space is
> already used by USB. Same for Shared Address Space (SAS), already in use
> in the kernel, but SVAS would work (although it doesn't look good).
> 

'unified' does not exactly match 'shared'. In some contexts it means
unifying device local memory and system memory in one virtual address
space, while SVM is more about sharing a CPU virtual address space with a
device.

What about Shared Virtual Addressing (SVA)?

Thanks
Kevin
David Woodhouse March 3, 2017, 9:40 a.m. UTC | #2
On Mon, 2017-02-27 at 19:54 +0000, Jean-Philippe Brucker wrote:
> Add three functions to the IOMMU API. iommu_bind_task takes a device and a
> task as argument. If the IOMMU, the device and the bus support it, attach
> task to device and create a Process Address Space ID (PASID) unique to the
> device. DMA from the device can then use the PASID to read or write into
> the address space. iommu_unbind_task removes a bond created with
> iommu_bind_task. iommu_set_svm_ops allows a device driver to set some
> callbacks for specific SVM-related operations.
> 
> Try to accommodate current implementations (AMD, Intel and ARM), by
> letting the IOMMU driver do all the work, but attempt by the same occasion
> to find intersections between implementations.
> 
> * amd_iommu_v2 expects the device to allocate a PASID and pass it to the
>   IOMMU. The driver also provides separate functions to register callbacks
>   that handles failed PRI requests and invalidate PASIDs.
> 
>   int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
> 			   struct task_struct *task)
>   void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
>   int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
> 				   amd_iommu_invalid_ppr_cb cb)
>   int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
> 				      amd_iommu_invalidate_ctx cb)
> 
> * intel-svm allocates a PASID, and requires the driver to pass
>   "svm_dev_ops", which currently contains a fault callback. It also
>   doesn't take a task as argument, but uses 'current'.
> 
>   int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> 			struct svm_dev_ops *ops)
>   int intel_svm_unbind_mm(struct device *dev, int pasid)
> 
> * For arm-smmu-v3, PASID must be allocated by the SMMU driver since it
>   indexes contexts in an array handled by the SMMU device.

Right. The Intel version was designed with all of the above three in
mind. It was discussed at the Kernel Summit and LPC on more than one
occasion as it took shape, and what I implemented for Intel basically
represents the consensus of what we thought it should look like.

I meant to convert the AMD driver to the same API, but don't have
access to test hardware. Note that the amdkfd code will need careful
attention here.

Intel slightly deviates from the "one PASID per process" vision too,
because it currently has a PASID allocator idr per IOMMU. That wants
making system-wide. And probably not Intel-specific.

Some other comments...

The callbacks and fault handlers could perhaps be deprecated. In an
ideal world nobody would ever use them — the device itself is supposed
to be able to communicate with its driver about the request that
failed; we don't need a dirty hook into the IOMMU code from when *it*
handles the fault.

In the Intel IOMMU fault reports, there are some additional bits in the
descriptor which are 'context private' bits. For built-in devices like
the graphics engine, this contains further information about precisely
which context was performing the failing access. But again I don't
think we should need it in an ideal world. It's a horrid thing to have
to feed through a generic IOMMU API.

One thing which might help us *avoid* needing it is the
SVM_FLAG_PRIVATE_PASID option, which asks for a *new* PASID. So a
single process can have more than one PASID. That's still OK on ARM,
isn't it? As long as they're all allocated from the same pool and we
never use a given PASID for more than one address space simultaneously
on different devices.

We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
address space. Yes, people use it.



>   PASID invalidation
>   ------------------
> 
> Next, we need to let the IOMMU driver notify the device driver before it
> attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
> in more details, so we'll simply propose the following interface for now.
> 
> AMD has:
> 
> void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);
> 
> We put the following in iommu_svm_ops:
> 
> int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);

These can basically take for ever, right? You're asking the *device* to
tell you when it's finished using that PASID.

>   Capability detection
>   ====================
> ...
> 
> int iommu_svm_capable(struct device *dev, int flags);

We already had this for Intel. It basically goes through *all* the
enabling checks that it needs to for really setting up SVM, and that's
why it's actually the *same* call, but with a NULL pasid argument:

#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
Ashok Raj March 3, 2017, 5:05 p.m. UTC | #3
Hi David,

Good to hear back from you!


On Fri, Mar 03, 2017 at 09:40:44AM +0000, David Woodhouse wrote:
> 
> Intel slightly deviates from the "one PASID per process" vision too,
> because it currently has a PASID allocator idr per IOMMU. That wants
> making system-wide. And probably not Intel-specific.

The Intel version is getting a system-wide PASID allocator. We are
actively working on it and it should be out shortly.

> 
> Some other comments...
> 
> We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
> address space. Yes, people use it.

There are some gaps in the current implementation, especially related to
vmalloc. We have something internally; it will also be posted soon, after
some testing.

Cheers,
Ashok
Jean-Philippe Brucker March 3, 2017, 6:39 p.m. UTC | #4
Hi David,

On Fri, Mar 03, 2017 at 09:40:44AM +0000, David Woodhouse wrote:
> On Mon, 2017-02-27 at 19:54 +0000, Jean-Philippe Brucker wrote:
> > Add three functions to the IOMMU API. iommu_bind_task takes a device and a
> > task as argument. If the IOMMU, the device and the bus support it, attach
> > task to device and create a Process Address Space ID (PASID) unique to the
> > device. DMA from the device can then use the PASID to read or write into
> > the address space. iommu_unbind_task removes a bond created with
> > iommu_bind_task. iommu_set_svm_ops allows a device driver to set some
> > callbacks for specific SVM-related operations.
> > 
> > Try to accommodate current implementations (AMD, Intel and ARM), by
> > letting the IOMMU driver do all the work, but attempt by the same occasion
> > to find intersections between implementations.
> > 
> > * amd_iommu_v2 expects the device to allocate a PASID and pass it to the
> >   IOMMU. The driver also provides separate functions to register callbacks
> >   that handles failed PRI requests and invalidate PASIDs.
> > 
> >   int amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
> > 			   struct task_struct *task)
> >   void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid)
> >   int amd_iommu_set_invalid_ppr_cb(struct pci_dev *pdev,
> > 				   amd_iommu_invalid_ppr_cb cb)
> >   int amd_iommu_set_invalidate_ctx_cb(struct pci_dev *pdev,
> > 				      amd_iommu_invalidate_ctx cb)
> > 
> > * intel-svm allocates a PASID, and requires the driver to pass
> >   "svm_dev_ops", which currently contains a fault callback. It also
> >   doesn't take a task as argument, but uses 'current'.
> > 
> >   int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
> > 			struct svm_dev_ops *ops)
> >   int intel_svm_unbind_mm(struct device *dev, int pasid)
> > 
> > * For arm-smmu-v3, PASID must be allocated by the SMMU driver since it
> >   indexes contexts in an array handled by the SMMU device.
> 
> Right. The Intel version was designed with all of the above three in
> mind. It was discussed at the Kernel Summit and LPC on more than one
> occasion as it took shape, and what I implemented for Intel basicall
> represents the consensus of what we thought it should look like.

Good to know. I didn't intend to deviate much from the Intel version.

> I meant to convert the AMD driver to the same API, but don't have
> access to test hardware. Note that the amdkfd code will need careful
> attention here.
> 
> Intel slightly deviates from the "one PASID per process" vision too,
> because it currently has a PASID allocator idr per IOMMU. That wants
> making system-wide. And probably not Intel-specific.

Yes, it would be nice to have a common PASID allocator. But I don't
think that a system-wide PASID space is workable for us. At the moment
systems might have a few identical devices all supporting 20 bits of
PASID. But consider the case where one odd device can only handle four
address spaces, and supports a maximum of two PASID bits. We'd quickly
run out of PASIDs to hand to such devices, even though we could easily
have one PASID space per endpoint (from a quick glance at the specs, I
assume that both Intel and AMD IOMMUs offer one PASID table per RID.)

For example, we could have three devices, all with different PASID
capabilities, attached to process B. Device X was already bound to
process A, so it gets PASID 2 for process B. Devices Y and Z get PASID 1
for process B:

                        .-- PASID 1 ----- process A
              device X -+-- PASID 2 ---.
              device Y ---- PASID 1 ---+- process B
              device Z ---- PASID 1 ---'

This is the model I implemented for ARM, as I wanted to keep the RFC
simple before getting a chance to discuss the API. This is the finest
granularity, but history shows that some systems cannot isolate all
devices/RIDs from each other, be it by accident (bug, missing ACS) or by
design (NTB). Arguably this hasn't come up with SVM yet, and it may never
do so. However, bringing IOMMU groups into the picture early seems
preferable to having to redesign everything five years from now. We would
have one PASID space per group, capped to the lowest common denominator in
the group, pasid_bits = min(pasid_bits of each device):

                               .- PASID 1 ----- process A
           device W -+- group -+- PASID 2 ---.
           device X -'                       |
           device Y --- group --- PASID 1 ---+- process B
           device Z --- group --- PASID 1 ---'

Finally, to get in line with the core API, we could make the
iommu_domain hold the PASID space instead of the group. The API would
then look like this:

int iommu_bind(domain, task, &pasid, flags, priv)
int iommu_unbind(domain, task, flags)

IOMMU drivers would then call into a common PASID allocator and return a
PASID unique to the domain.

We can accept any number of attach_group on the domain prior to a bind.
The IOMMU driver can enable PASIDs as late as the first bind, so we
don't know the domain capacity. After the first bind, the number of
PASIDs is fixed to pasid_bits = min(pasid_bits of each group) for the
domain, and we should refuse to attach a new group with less PASIDs.

     device W -+- group --- domain --- PASID 1 ---.
     device X -'                                  |
     device Y --- group -+- domain -+- PASID 1 ---+- process B
     device Z --- group -'          '- PASID 2 ----- process A
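
The capacity rule described above is just a running minimum over the
attached groups. A self-contained sketch, with each group reduced to its
pasid_bits value:

```c
#include <assert.h>

/* The domain's PASID capacity is the smallest pasid_bits among attached
 * groups. Once fixed at the first bind, attaching a group whose devices
 * support fewer bits than this would be refused. */
static int domain_pasid_bits(const int *group_bits, int nr_groups)
{
	int i, bits = group_bits[0];

	for (i = 1; i < nr_groups; i++)
		if (group_bits[i] < bits)
			bits = group_bits[i];
	return bits;
}
```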

I could try that for next version. If it doesn't seem acceptable for the
core API, I guess we can always keep the per-device interface and have
IOMMU drivers manage either one or multiple PASID spaces.

> Some other comments...
> 
> The callbacks and fault handlers could perhaps be deprecated. In an
> ideal world nobody would ever use them — the device itself is supposed
> to be able to communicate with its driver about the request that
> failed; we don't need a dirty hook into the IOMMU code from when *it*
> handles the fault.

Agreed.

> In the Intel IOMMU fault reports, there are some additional bits in the
> descriptor which are 'context private' bits. For built-in devices like
> the graphics engine, this contains further information about precisely
> which context was performing the failing access. But again I don't
> think we should need it in an ideal world. It's a horrid thing to have
> to feed through a generic IOMMU API.
>
> One thing which might help us *avoid* needing it is the
> SVM_FLAG_PRIVATE_PASID option, which asks for a *new* PASID. So a
> single process can have more than one PASID. That's still OK on ARM,
> isn't it? As long as they're all allocated from the same pool and we
> never use a given PASID for more than one address space simultaneously
> on different devices.

Yes, multiple PASIDs per task for the same device should be fine, and
I wouldn't have any objection to adding the PRIVATE_PASID flag into the
core API.

> We also have SVM_FLAG_SUPERVISOR_MODE, which gives access to kernel
> address space. Yes, people use it.

Argh, that seems dodgy :( I'm hoping we won't need this on ARM, since
drivers can always go through the DMA API for accessing kernel memory,
even for a domain that has SVM enabled.

> >   PASID invalidation
> >   ------------------
> > 
> > Next, we need to let the IOMMU driver notify the device driver before it
> > attempts to unbind a PASID. Subsequent patches discuss PASID invalidation
> > in more details, so we'll simply propose the following interface for now.
> > 
> > AMD has:
> > 
> > void (*amd_iommu_invalidate_ctx)(struct pci_dev *pdev, int pasid);
> > 
> > We put the following in iommu_svm_ops:
> > 
> > int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
> 
> These can basically take for ever, right? You're asking the *device* to
> tell you when it's finished using that PASID.

Theoretically yes, this could take forever. PCI doesn't give any time
boundary for the stop PASID mechanism (and PCI isn't the whole story on
ARM). But the device driver could simply return 0 after a timeout,
telling us that it was unable to invalidate the PASID. In which case the
IOMMU has to decide whether it is ready to risk having pending
transactions for this PASID access the next address space, or if it
should just throw the PASID away and never use it again.

> >   Capability detection
> >   ====================
> > ...
> > 
> > int iommu_svm_capable(struct device *dev, int flags);
> 
> We already had this for Intel. It basically goes through *all* the
> enabling checks that it needs to for really setting up SVM, and that's
> why it's actually the *same* call, but with a NULL pasid argument:
> 
> #define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))

Right, Alex proposed a similar interface for VFIO, and I like the idea.
Perhaps we could add a flag IOMMU_BIND_PROBE for the core API instead of
relying on NULL pointers, to make it more clear?

Thanks,
Jean-Philippe
Joerg Roedel March 22, 2017, 3:36 p.m. UTC | #5
On Fri, Mar 03, 2017 at 06:39:58PM +0000, Jean-Philippe Brucker wrote:
> Yes, it would be nice to have a common PASID allocator. But I don't
> think that a system-wide PASID space is workable for us. At the moment
> systems might have a few identical devices all supporting 20 bits of
> PASID. But consider the case where one odd device can only handle four
> address spaces, and supports a maximum of two PASID bits. We'd quickly
> run out of PASIDs to hand to such devices, even though we could easily
> have one PASID space per endpoint (from a quick glance at the specs, I
> assume that both Intel and AMD IOMMUs offer one PASID table per RID.)

But that shouldn't be a problem if we allocate PASIDs top-down (meaning
starting from the biggest value supported by a given device), right?

Then we can satisfy the devices with 16 or 20 bit PASIDs and still have
the 2-bit PASIDs free for the devices that need them.
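
This top-down scheme can be sketched with a self-contained bitmap
allocator; each device asks for the highest free PASID it can address, so
large allocations stay clear of the low values (an illustration only, not
the kernel's actual idr):

```c
#include <assert.h>
#include <stdbool.h>

#define PASID_SPACE	(1 << 20)	/* a system-wide 20-bit PASID space */

static bool pasid_used[PASID_SPACE];

/* Allocate the highest free PASID the device can address (top-down).
 * A 2-bit device passes dev_max_pasid = 3; a 20-bit device passes
 * (1 << 20) - 1. PASID 0 is kept reserved.
 * Returns -1 when the device's reachable range is exhausted. */
static int pasid_alloc_topdown(int dev_max_pasid)
{
	int pasid;

	for (pasid = dev_max_pasid; pasid > 0; pasid--) {
		if (!pasid_used[pasid]) {
			pasid_used[pasid] = true;
			return pasid;
		}
	}
	return -1;
}
```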



	Joerg
Joerg Roedel March 22, 2017, 3:38 p.m. UTC | #6
Hi Jean-Philippe,

On Mon, Feb 27, 2017 at 07:54:33PM +0000, Jean-Philippe Brucker wrote:
> +extern int iommu_set_svm_ops(struct device *dev,
> +			     const struct iommu_svm_ops *svm_ops);
> +extern int iommu_bind_task(struct device *dev, struct task_struct *task,
> +			   int *pasid, int flags, void *priv);
> +
> +extern int iommu_unbind_task(struct device *dev, int pasid, int flags);

I really like that API, it is simpler than what the AMD driver
currently implements but should work for it too (once we adapt the
AMD-KFD driver to it).

One issue I'd like to have discussed is whether we can make global PASID
allocation (with a one-PASID-per-task model) workable with the SMMU too.



	Joerg
Jean-Philippe Brucker March 22, 2017, 6:30 p.m. UTC | #7
Hi Joerg,

On 22/03/17 15:36, Joerg Roedel wrote:
> On Fri, Mar 03, 2017 at 06:39:58PM +0000, Jean-Philippe Brucker wrote:
>> Yes, it would be nice to have a common PASID allocator. But I don't
>> think that a system-wide PASID space is workable for us. At the moment
>> systems might have a few identical devices all supporting 20 bits of
>> PASID. But consider the case where one odd device can only handle four
>> address spaces, and supports a maximum of two PASID bits. We'd quickly
>> run out of PASIDs to hand to such devices, even though we could easily
>> have one PASID space per endpoint (from a quick glance at the specs, I
>> assume that both Intel and AMD IOMMUs offer one PASID table per RID.)
> 
> But that shouldn't be a problem if we allocate PASIDs top-down (meaning
> starting from the biggest value supported by a given device), right?
> 
> Then we can satisfy the devices with 16 or 20 bit PASIDs and still have
> the 2-bit PASIDs free for the devices that need it.

But if there are more than four devices that only support 2-bit PASIDs,
you still get starvation that you wouldn't get with a per-domain/device
PASID allocator. Admittedly I have no real-world example to back this up;
we can probably expect vendors to always implement a sane number of PASID
bits. Unifying the API is certainly more important than imagining all the
twisted configurations possible, and a PASID allocator with per-task
top-down allocation seems to me like an acceptable compromise.

Thanks,
Jean-Philippe

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8ea14f41a979..26c5f6528c69 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1438,6 +1438,114 @@  void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_group);
 
+int iommu_set_svm_ops(struct device *dev, const struct iommu_svm_ops *svm_ops)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get_for_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->set_svm_ops) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	ret = ops->set_svm_ops(dev, svm_ops);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_set_svm_ops);
+
+/*
+ * iommu_bind_task - Share task address space with device
+ *
+ * @dev: device to bind
+ * @task: task to bind
+ * @pasid: valid address where the PASID is stored
+ * @flags: driver-specific flags
+ * @priv: private data to associate with the bond
+ *
+ * Create a bond between device and task, allowing the device to access the task
+ * address space using @pasid. Intel and ARM SMMU drivers allocate and return
+ * the PASID, while AMD requires the caller to allocate a PASID beforehand.
+ *
+ * iommu_unbind_task must be called with this PASID before the task exits.
+ */
+int iommu_bind_task(struct device *dev, struct task_struct *task, int *pasid,
+		    int flags, void *priv)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	if (!pasid)
+		return -EINVAL;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -ENODEV;
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->bind_task) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	if (!group->domain)
+		ret = -EINVAL;
+	else
+		ret = ops->bind_task(dev, task, pasid, flags, priv);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_bind_task);
+
+/*
+ * iommu_unbind_task - Remove a bond created with iommu_bind_task
+ *
+ * @dev: device bound to the task
+ * @pasid: identifier of the bond
+ * @flags: state of the PASID and driver-specific flags
+ */
+int iommu_unbind_task(struct device *dev, int pasid, int flags)
+{
+	const struct iommu_ops *ops;
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -ENODEV;
+
+	ops = dev->bus->iommu_ops;
+	if (!ops->unbind_task) {
+		iommu_group_put(group);
+		return -ENODEV;
+	}
+
+	mutex_lock(&group->mutex);
+	if (!group->domain)
+		ret = -EINVAL;
+	else
+		ret = ops->unbind_task(dev, pasid, flags);
+	mutex_unlock(&group->mutex);
+
+	iommu_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_task);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 {
 	if (unlikely(domain->ops->iova_to_phys == NULL))
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6a6de187ddc0..9554f45d4305 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -145,6 +145,16 @@  struct iommu_resv_region {
 	int			type;
 };
 
+/*
+ * @handle_fault: report or handle a fault from the device (FIXME: imprecise)
+ * @invalidate_pasid: stop using a PASID.
+ */
+struct iommu_svm_ops {
+	int (*handle_fault)(struct device *dev, int pasid, u64 address,
+			    int prot, int status, void *priv);
+	int (*invalidate_pasid)(struct device *dev, int pasid, void *priv);
+};
+
 #ifdef CONFIG_IOMMU_API
 
 /**
@@ -154,6 +164,9 @@  struct iommu_resv_region {
  * @domain_free: free iommu domain
  * @attach_dev: attach device to an iommu domain
  * @detach_dev: detach device from an iommu domain
+ * @set_svm_ops: set SVM callbacks for device
+ * @bind_task: attach a task address space to a device
+ * @unbind_task: detach a task address space from a device
  * @map: map a physically contiguous memory region to an iommu domain
  * @unmap: unmap a physically contiguous memory region from an iommu domain
  * @map_sg: map a scatter-gather list of physically contiguous memory chunks
@@ -183,6 +196,10 @@  struct iommu_ops {
 
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
 	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
+	int (*set_svm_ops)(struct device *dev, const struct iommu_svm_ops *ops);
+	int (*bind_task)(struct device *dev, struct task_struct *task,
+			 int *pasid, int flags, void *priv);
+	int (*unbind_task)(struct device *dev, int pasid, int flags);
 	int (*map)(struct iommu_domain *domain, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot);
 	size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
@@ -403,6 +420,13 @@  void iommu_fwspec_free(struct device *dev);
 int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids);
 const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode);
 
+extern int iommu_set_svm_ops(struct device *dev,
+			     const struct iommu_svm_ops *svm_ops);
+extern int iommu_bind_task(struct device *dev, struct task_struct *task,
+			   int *pasid, int flags, void *priv);
+
+extern int iommu_unbind_task(struct device *dev, int pasid, int flags);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -663,6 +687,23 @@  const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline int iommu_set_svm_ops(struct device *dev,
+				    const struct iommu_svm_ops *svm_ops)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_bind_task(struct device *dev, struct task_struct *task,
+				  int *pasid, int flags, void *priv)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unbind_task(struct device *dev, int pasid, int flags)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */