
[RFC] iommu: arm-smmu: stall support

Message ID 20170914194444.32551-1-robdclark@gmail.com (mailing list archive)
State New, archived

Commit Message

Rob Clark Sept. 14, 2017, 7:44 p.m. UTC
Adds a new domain property for iommu clients to opt-in to stalling
with asynchronous resume, and for the client to determine if the
iommu supports this.

Current motivation is that:

a) On 8x96/a530, if we don't enable CFCFG (or HUPCF) then non-
   faulting translations which are happening concurrently with
   one that faults, fail (or return garbage), which triggers all
   sorts of fun GPU crashes, which generally have no relation
   to the root fault.  (The CP can be far ahead in the cmdstream
   from the other parts of the GPU...)

b) I am working on a debugfs feature to dump submits/batches
   that cause GPU hangs, and I would like to also use this for
   faults.  But it needs to run in non-atomic context, so I
   need to toss things off to a workqueue, and then resume
   the iommu after it finishes.

c) (and ofc at some point in the future for SVM we'd like to
   be able to pin unpinned pages and things like that, in
   response to faults.)

TODO
 - For RFC I thought it would be easier to review the idea
   as a single patch, but it should be split into separate
   core and arm-smmu parts

 - I vaguely remember someone (Will?) mentioning that there
   could be cases with multiple masters sharing a single
   context bank, and somehow stalling might not work in that
   case?  (How does that even happen? arm-smmu assigns the
   context banks.  Maybe I'm mis-remembering the details.)
   I think that this probably shouldn't affect the API parts
   of this RFC, the iommu driver should already know about
   all the devices that might attach because of ->attach_dev()
   so it could fail in _set_attr()?

Signed-off-by: Rob Clark <robdclark@gmail.com>
---
 drivers/iommu/arm-smmu.c | 36 ++++++++++++++++++++++++++++++++----
 drivers/iommu/iommu.c    | 21 +++++++++++++++++++++
 include/linux/iommu.h    | 14 ++++++++++++++
 3 files changed, 67 insertions(+), 4 deletions(-)
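The flow the commit message describes — a client opts in to stalling, then resumes the faulting translation asynchronously — can be modeled in plain C. This is a userspace sketch with stand-in names (`model_*`), not the kernel interfaces from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the RFC's flow: the client opts in to stalling,
 * a fault leaves the domain stalled, and the client later resumes
 * (retry or terminate) from non-atomic context.  All names here are
 * illustrative stand-ins for the kernel API.
 */

struct iommu_domain;

struct iommu_ops {
	/* non-NULL only when the iommu driver supports stall/resume */
	void (*domain_resume)(struct iommu_domain *domain, bool terminate);
};

struct iommu_domain {
	const struct iommu_ops *ops;
	bool stall_enabled;	/* set via the proposed STALL attribute */
	bool stalled;		/* a faulting translation is being held */
};

/* Mirrors the patch's iommu_domain_resume(); kernel code WARN_ONs. */
void model_domain_resume(struct iommu_domain *domain, bool terminate)
{
	if (!domain->ops->domain_resume)
		return;
	domain->ops->domain_resume(domain, terminate);
}

static void model_smmu_resume(struct iommu_domain *domain, bool terminate)
{
	(void)terminate;	/* retry vs. terminate: both unstall */
	domain->stalled = false;
}

const struct iommu_ops model_smmu_ops = {
	.domain_resume = model_smmu_resume,
};

/* Stand-in for iommu_domain_set_attr(domain, STALL, &enable). */
int model_set_stall(struct iommu_domain *domain, bool enable)
{
	if (enable && !domain->ops->domain_resume)
		return -1;	/* driver can't stall: client must know */
	domain->stall_enabled = enable;
	return 0;
}
```

The point of the set-attr step is the second motivation in the cover letter: the client can detect up front whether the iommu driver supports stalling at all.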

Comments

Jean-Philippe Brucker Sept. 18, 2017, 11:13 a.m. UTC | #1
Hi Rob,

On 14/09/17 20:44, Rob Clark wrote:
> Adds a new domain property for iommu clients to opt-in to stalling
> with asynchronous resume, and for the client to determine if the
> iommu supports this.
> 
> Current motivation is that:
> 
> a) On 8x96/a530, if we don't enable CFCFG (or HUPCF) then non-
>    faulting translations which are happening concurrently with
>    one that faults, fail (or return garbage), which triggers all
>    sorts of fun GPU crashes, which generally have no relation
>    to the root fault.  (The CP can be far ahead in the cmdstream
>    from the other parts of the GPU...)

Would the GPU driver always enable stall for this implementation? Or only
enable it for specific domains?

Instead of enabling it at domain level, I wonder if it couldn't be left
entirely to the SMMU driver. I have a proposal (that I'll publish shortly)
for adding a "can-stall" attribute to device-tree nodes, telling the SMMU
driver that the device can withstand stalled transactions without locking
up the system.

The SMMU would then enable stall for this particular device without
needing approval from the device driver. I'm doing this for v3, which has
a more mature stall model, but I suppose we can do the same for v2 as well.

In any case, the firmware has to tell the OS that a device is capable of
stalling, because it is unlikely that many platform devices will
gracefully handle this mode.

> b) I am working on a debugfs feature to dump submits/batches
>    that cause GPU hangs, and I would like to also use this for
>    faults.  But it needs to run in non-atomic context, so I
>    need to toss things off to a workqueue, and then resume
>    the iommu after it finishes.

Are you relying on stalled transactions to freeze the GPU state and
allow for introspection? I suppose the debug code would always terminate
after recording the fault? I'm just trying to get a picture of all
possible users of a common fault API.

> c) (and ofc at some point in the future for SVM we'd like to
>    be able to pin unpinned pages and things like that, in
>    response to faults.)

For SVM there will be generic code calling into the mm code to pin pages
and resume the SMMU. We are working on consolidating this with other
IOMMUs at the moment and use generic code where possible. Ideally the GPU
driver shouldn't need to get involved.

That new API will be based on PASIDs/SSIDs, which doesn't exist in SMMUv2.
I do believe that we also need to consolidate the API for devices and
IOMMUs that support page faults but not PASIDs. We could use a common
fault workqueue in the IOMMU core.

It seems like your use-case (b) could fit in there. If the device driver
didn't bind to a process but instead registered a fault handler, then we
could ask it to do something with the fault. And since it's in a wq, the
call to device driver would be synchronous and we'd pass the return status
(retry/terminate) to the SMMU.

This is probably easier to handle than a separate "resume" callback,
especially with SMMUv3 stall and PRI, where faults are out of order and
contain a token identifying a fault.

> TODO
>  - For RFC I thought it would be easier to review the idea
>    as a single patch, but it should be split into separate
>    core and arm-smmu parts
> 
>  - I vaguely remember someone (Will?) mentioning that there
>    could be cases with multiple masters sharing a single
>    context bank, and somehow stalling might not work in that
>    case?  (How does that even happen? arm-smmu assigns the
>    context banks.  Maybe I'm mis-remembering the details.)
>    I think that this probably shouldn't affect the API parts
>    of this RFC, the iommu driver should already know about
>    all the devices that might attach because of ->attach_dev()
>    so it could fail in _set_attr()?

With VFIO, userspace can decide to put multiple devices in the same domain.
But attach_dev can fail if there are device incompatibilities, and then
VFIO will use a new domain. It might become relevant with vSVM, forwarding
recoverable faults from guest-assigned devices.

Thanks,
Jean
Rob Clark Sept. 18, 2017, 12:11 p.m. UTC | #2
On Mon, Sep 18, 2017 at 7:13 AM, Jean-Philippe Brucker
<jean-philippe.brucker@arm.com> wrote:
> Hi Rob,
>
> On 14/09/17 20:44, Rob Clark wrote:
>> Adds a new domain property for iommu clients to opt-in to stalling
>> with asynchronous resume, and for the client to determine if the
>> iommu supports this.
>>
>> Current motivation is that:
>>
>> a) On 8x96/a530, if we don't enable CFCFG (or HUPCF) then non-
>>    faulting translations which are happening concurrently with
>>    one that faults, fail (or return garbage), which triggers all
>>    sorts of fun GPU crashes, which generally have no relation
>>    to the root fault.  (The CP can be far ahead in the cmdstream
>>    from the other parts of the GPU...)
>
> Would the GPU driver always enable stall for this implementation? Or only
> enable it for specific domains?

I expect for all domains.  (Currently that is just a single domain,
but I expect that to change)

> Instead of enabling it at domain level, I wonder if it couldn't be left
> entirely to the SMMU driver. I have a proposal (that I'll publish shortly)
> for adding a "can-stall" attribute to device-tree nodes, telling the SMMU
> driver that the device can withstand stalled transactions without locking
> up the system.
>
> The SMMU would then enable stall for this particular device without
> needing approval from the device driver. I'm doing this for v3, which has
> a more mature stall model, but I suppose we can do the same for v2 as well.

The GPU driver does need to know if stalling is supported/enabled by
the iommu driver (since depending on SoC, drm/msm works with one of
three different iommu drivers currently), and to be in control of
resume.. I'm a bit sceptical about trying to abstract too much at the
iommu level.

For example when the gpu gets a fault, it tends to get 1000s of
faults.  On the first fault, I want to kick things off to a wq where I
can snapshot the cmdstream and gpu state.  But subsequent faults on
the same submit I ignore.
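That first-fault-only gating can be sketched like this (field and function names are hypothetical, not taken from drm/msm):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the dedup described above: only the first fault of a
 * given submit schedules the (slow) snapshot work; the storm of
 * follow-up faults from the same submit is ignored.
 */

struct gpu_fault_state {
	uint32_t faulted_submit;	/* seqno already being handled */
	bool snapshot_pending;
};

/* Returns true when the caller should queue the snapshot work. */
bool gpu_fault_should_snapshot(struct gpu_fault_state *st,
			       uint32_t submit_seqno)
{
	if (st->snapshot_pending && st->faulted_submit == submit_seqno)
		return false;		/* one of the 1000s of repeats */
	st->snapshot_pending = true;
	st->faulted_submit = submit_seqno;
	return true;			/* first fault: kick the wq */
}
```

This is why a per-fault decision at IRQ time matters to the driver: the gate has to run before anything is queued.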

Btw, apologies that I haven't sent the corresponding drm/msm patches
yet.  I haven't had a chance to clean up yet, but you can find
something rough here:

  https://github.com/freedreno/kernel-msm/commits/integration-linux-qcomlt-v4.13-rc3

> In any case, the firmware has to tell the OS that a device is capable of
> stalling, because it is unlikely that many platform devices will
> gracefully handle this mode.
>
>> b) I am working on a debugfs feature to dump submits/batches
>>    that cause GPU hangs, and I would like to also use this for
>>    faults.  But it needs to run in non-atomic context, so I
>>    need to toss things off to a workqueue, and then resume
>>    the iommu after it finishes.
>
> Are you relying on stalled transactions to freeze the GPU state and
> allow for introspection? I suppose the debug code would always terminate
> after recording the fault? I'm just trying to get a picture of all
> possible users of a common fault API.

yes, this is what I'm doing now.  For SVM, however, we'd retry the
transaction instead of terminating.

>> c) (and ofc at some point in the future for SVM we'd like to
>>    be able to pin unpinned pages and things like that, in
>>    response to faults.)
>
> For SVM there will be generic code calling into the mm code to pin pages
> and resume the SMMU. We are working on consolidating this with other
> IOMMUs at the moment and use generic code where possible. Ideally the GPU
> driver shouldn't need to get involved.
>
> That new API will be based on PASIDs/SSIDs, which doesn't exist in SMMUv2.
> I do believe that we also need to consolidate the API for devices and
> IOMMUs that support page faults but not PASIDs. We could use a common
> fault workqueue in the IOMMU core.

I've no idea what qcom's plans for future hw are, but I'm pretty sure we
are going to want to implement SVM on v2 iommu, without PASIDs/SSIDs.  However
on current hw, there is really only one userspace process active on
the gpu at a time, so we don't really need PASIDs/SSIDs.

> It seems like your use-case (b) could fit in there. If the device driver
> didn't bind to a process but instead registered a fault handler, then we
> could ask it to do something with the fault. And since it's in a wq, the
> call to device driver would be synchronous and we'd pass the return status
> (retry/terminate) to the SMMU.
>
> This is probably easier to handle than a separate "resume" callback,
> especially with SMMUv3 stall and PRI, where faults are out of order and
> contain a token identifying a fault.

IIRC Will or Robin mentioned wanting a token in earlier stall
discussion.. although not being familiar with v3 I wasn't quite sure
what the use was.

At any rate, adding a token to fault handler callback and
iommu_domain_resume() seems like something that could be added later,
when it is needed.

Anyways, I am interested to see what your proposal is.. although tend
to be a bit sceptical about trying to abstract too much.  (And if
there should be something more abstract, maybe it should be at the
dma-mapping layer instead?)

I don't suppose you or someone working on this from ARM side will be
at linaro connect in a couple weeks?  Jordan and myself will be there,
and it could be a good time to chat about this, and also per-process
pagetables and the gpu switching pagetables directly on v2 hw.

BR,
-R

>> TODO
>>  - For RFC I thought it would be easier to review the idea
>>    as a single patch, but it should be split into separate
>>    core and arm-smmu parts
>>
>>  - I vaguely remember someone (Will?) mentioning that there
>>    could be cases with multiple masters sharing a single
>>    context bank, and somehow stalling might not work in that
>>    case?  (How does that even happen? arm-smmu assigns the
>>    context banks.  Maybe I'm mis-remembering the details.)
>>    I think that this probably shouldn't affect the API parts
>>    of this RFC, the iommu driver should already know about
>>    all the devices that might attach because of ->attach_dev()
>>    so it could fail in _set_attr()?
>
> With VFIO, userspace can decide to put multiple devices in the same domain.
> But attach_dev can fail if there are device incompatibilities, and then
> VFIO will use a new domain. It might become relevant with vSVM, forwarding
> recoverable faults from guest-assigned devices.
>
> Thanks,
> Jean
Will Deacon Sept. 18, 2017, 5:33 p.m. UTC | #3
On Mon, Sep 18, 2017 at 08:11:47AM -0400, Rob Clark wrote:
> IIRC Will or Robin mentioned wanting a token in earlier stall
> discussion.. although not being familiar with v3 I wasn't quite sure
> what the use was.
> 
> At any rate, adding a token to fault handler callback and
> iommu_domain_resume() seems like something that could be added later,
> when it is needed.
> 
> Anyways, I am interested to see what your proposal is.. although tend
> to be a bit sceptical about trying to abstract too much.  (And if
> there should be something more abstract, maybe it should be at the
> dma-mapping layer instead?)

Whilst I'm also often sceptical about premature abstraction (which is why
I previously suggested just plugging into iommu_set_fault_handler and
managing any work queues yourself), in this case Jean-Philippe actually
has SVM up and running with multiple hardware platforms using stalls, so
I'm confident that if we have something that works for both of you, then
it's worth abstracting at this stage.

When we last had this discussion, you were the only person writing code
for this, but I think that things have moved on since then.

> I don't suppose you or someone working on this from ARM side will be
> at linaro connect in a couple weeks?  Jordan and myself will be there,
> and it could be a good time to chat about this, and also per-process
> pagetables and the gpu switching pagetables directly on v2 hw.

James Morse (CC'd) will be there from our group. He's not looked at this
at all, but it might be worth speaking to him anyway as this is often
better done face-to-face than over email.

Will
Joerg Roedel Sept. 19, 2017, 12:30 p.m. UTC | #4
Hi Rob,

thanks for the RFC patch. I have some comments about the interface to
the IOMMU-API below.

On Thu, Sep 14, 2017 at 03:44:33PM -0400, Rob Clark wrote:
> +/**
> + * iommu_domain_resume - Resume translations for a domain after a fault.
> + *
> + * This can be called at some point after the fault handler is called,
> + * allowing the user of the IOMMU to (for example) handle the fault
> + * from a task context.  It is illegal to call this if
> + * iommu_domain_set_attr(STALL) failed.
> + *
> + * @domain:    the domain to resume
> + * @terminate: if true, the translation that triggered the fault should
> + *    be terminated, else it should be retried.
> + */
> +void iommu_domain_resume(struct iommu_domain *domain, bool terminate)
> +{
> +	/* invalid to call if iommu_domain_set_attr(STALL) failed: */
> +	if (WARN_ON(!domain->ops->domain_resume))
> +		return;
> +	domain->ops->domain_resume(domain, terminate);
> +}
> +EXPORT_SYMBOL_GPL(iommu_domain_resume);

So this function is being called by the device driver owning the domain,
right?

I don't think that the resume call-back you added needs to be exposed
like this. It is better to do the page-fault handling completely in the
iommu-code, including calling the resume call-back and just let the
device-driver provide a per-domain call-back to let it handle the fault
and map in the required pages.

The interface could look like this:

	* New function iommu_domain_enable_stalls(domain) - When
	  this function returns the domain is in stall-handling mode. A
	  iommu_domain_disable_stalls() might make sense too, not sure
	  about that.

	* When stalls are enabled for a domain, report_iommu_fault()
	  queues the fault to a workqueue (so that its handler can
	  block) and in the workqueue you call ->resume() based on the
	  return value of the handler.
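The two bullets can be modeled like so. This is a userspace sketch: the workqueue is collapsed into a direct call, and all names are illustrative, not a proposed kernel API:

```c
#include <stdbool.h>

/*
 * Model of the interface sketched above: once stalls are enabled,
 * the fault report is bounced to (what would be) a workqueue, the
 * handler may sleep, and ->resume() runs with terminate chosen by
 * the handler's return value.
 */

enum fault_result { FAULT_RETRY, FAULT_TERMINATE };

struct model_domain {
	bool stalls_enabled;		/* "enable_stalls" was called */
	bool stalled;
	bool last_terminated;		/* what ->resume() was told */
	enum fault_result (*handler)(struct model_domain *d,
				     unsigned long iova);
};

/* The deferred work item: run the handler, then resume. */
static void model_fault_work(struct model_domain *d, unsigned long iova)
{
	enum fault_result r = d->handler(d, iova);

	d->last_terminated = (r == FAULT_TERMINATE);	/* ->resume() */
	d->stalled = false;
}

/* Stands in for report_iommu_fault() in this scheme. */
void model_report_fault(struct model_domain *d, unsigned long iova)
{
	if (!d->stalls_enabled) {
		d->handler(d, iova);	/* today's atomic-context path */
		return;
	}
	d->stalled = true;
	model_fault_work(d, iova);	/* kernel: queue_work() instead */
}

/* Trivial handler used for demonstration. */
enum fault_result model_terminate_handler(struct model_domain *d,
					  unsigned long iova)
{
	(void)d; (void)iova;
	return FAULT_TERMINATE;
}
```

The key property is that the device driver never sees a resume call: it only returns retry/terminate from its (sleepable) handler.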

As a side-note, as there has been discussion on this: For now it doesn't
make sense to merge this with the SVM page-fault handling efforts, as
this path is different enough (SVM will call handle_mm_fault() as the
handler, for example).


Regards,

	Joerg
Rob Clark Sept. 19, 2017, 2:23 p.m. UTC | #5
On Tue, Sep 19, 2017 at 8:30 AM, Joerg Roedel <joro@8bytes.org> wrote:
> Hi Rob,
>
> thanks for the RFC patch. I have some comments about the interface to
> the IOMMU-API below.
>
> On Thu, Sep 14, 2017 at 03:44:33PM -0400, Rob Clark wrote:
>> +/**
>> + * iommu_domain_resume - Resume translations for a domain after a fault.
>> + *
>> + * This can be called at some point after the fault handler is called,
>> + * allowing the user of the IOMMU to (for example) handle the fault
>> + * from a task context.  It is illegal to call this if
>> + * iommu_domain_set_attr(STALL) failed.
>> + *
>> + * @domain:    the domain to resume
>> + * @terminate: if true, the translation that triggered the fault should
>> + *    be terminated, else it should be retried.
>> + */
>> +void iommu_domain_resume(struct iommu_domain *domain, bool terminate)
>> +{
>> +     /* invalid to call if iommu_domain_set_attr(STALL) failed: */
>> +     if (WARN_ON(!domain->ops->domain_resume))
>> +             return;
>> +     domain->ops->domain_resume(domain, terminate);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_domain_resume);
>
> So this function is being called by the device driver owning the domain,
> right?

yes, this was my plan

> I don't think that the resume call-back you added needs to be exposed
> like this. It is better to do the page-fault handling completely in the
> iommu-code, including calling the resume call-back and just let the
> device-driver provide a per-domain call-back to let it handle the fault
> and map in the required pages.

I would like to decide in the IRQ whether or not to queue work,
because when we get a gpu fault, we tend to get 1000's of gpu faults
all at once (and I really only need to handle the first one).  I
suppose that could also be achieved by having a special return value
from the fault handler to say "call me again from a wq"..

Note that in the drm driver I already have a suitable wq to queue the
work, so it really doesn't buy me anything to have the iommu driver
toss things off to a wq for me.  Might be a different situation for
other drivers (but I guess mostly other drivers are using iommu API
indirectly via dma-mapping?)

> The interface could look like this:
>
>         * New function iommu_domain_enable_stalls(domain) - When
>           this function returns the domain is in stall-handling mode. A
>           iommu_domain_disable_stalls() might make sense too, not sure
>           about that.

I don't particularly see a use-case for disabling stalls, fwiw

BR,
-R

>         * When stalls are enabled for a domain, report_iommu_fault()
>           queues the fault to a workqueue (so that its handler can
>           block) and in the workqueue you call ->resume() based on the
>           return value of the handler.
>
> As a side-note, as there has been discussion on this: For now it doesn't
> make sense to merge this with the SVM page-fault handling efforts, as
> this path is different enough (SVM will call handle_mm_fault() as the
> handler, for example).
>
>
> Regards,
>
>         Joerg
>
Joerg Roedel Sept. 22, 2017, 9:02 a.m. UTC | #6
On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
> I would like to decide in the IRQ whether or not to queue work,
> because when we get a gpu fault, we tend to get 1000's of gpu faults
> all at once (and I really only need to handle the first one).  I
> suppose that could also be achieved by having a special return value
> from the fault handler to say "call me again from a wq"..
> 
> Note that in the drm driver I already have a suitable wq to queue the
> work, so it really doesn't buy me anything to have the iommu driver
> toss things off to a wq for me.  Might be a different situation for
> other drivers (but I guess mostly other drivers are using iommu API
> indirectly via dma-mapping?)

Okay, so since you are the only user for now, we don't need a
work-queue. But I still want the ->resume call-back to be hidden in the
iommu code and not be exposed to users.

We already have per-domain fault-handlers, so the best solution for now
is to call ->resume from report_iommu_fault() when the fault-handler
returns a special value.


Regards,

	Joerg
Jean-Philippe Brucker Sept. 22, 2017, 10:02 a.m. UTC | #7
On 22/09/17 10:02, Joerg Roedel wrote:
> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
>> I would like to decide in the IRQ whether or not to queue work,
>> because when we get a gpu fault, we tend to get 1000's of gpu faults
>> all at once (and I really only need to handle the first one).  I
>> suppose that could also be achieved by having a special return value
>> from the fault handler to say "call me again from a wq"..
>>
>> Note that in the drm driver I already have a suitable wq to queue the
>> work, so it really doesn't buy me anything to have the iommu driver
>> toss things off to a wq for me.  Might be a different situation for
>> other drivers (but I guess mostly other drivers are using iommu API
>> indirectly via dma-mapping?)
> 
> Okay, so since you are the only user for now, we don't need a
> work-queue. But I still want the ->resume call-back to be hidden in the
> iommu code and not be exposed to users.
> 
> We already have per-domain fault-handlers, so the best solution for now
> is to call ->resume from report_iommu_fault() when the fault-handler
> returns a special value.

The problem is that report_iommu_fault is called from IRQ context by the
SMMU driver, so the device driver callback cannot sleep.

So if the device driver needs to be able to sleep between fault report and
resume, as I understand Rob needs for writing debugfs, we can either:

* call report_iommu_fault from higher up, in a thread or workqueue.
* split the fault reporting as this patch proposes. The exact same
  mechanism is needed for the vSVM work by Intel: in order to inject fault
  into the guest, they would like to have an atomic notifier registered by
  VFIO for passing down the Page Request, and a new function in the IOMMU
  API to resume/complete the fault.

Thanks,
Jean
Rob Clark Sept. 22, 2017, 6:42 p.m. UTC | #8
On Fri, Sep 22, 2017 at 6:02 AM, Jean-Philippe Brucker
<jean-philippe.brucker@arm.com> wrote:
> On 22/09/17 10:02, Joerg Roedel wrote:
>> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
>>> I would like to decide in the IRQ whether or not to queue work,
>>> because when we get a gpu fault, we tend to get 1000's of gpu faults
>>> all at once (and I really only need to handle the first one).  I
>>> suppose that could also be achieved by having a special return value
>>> from the fault handler to say "call me again from a wq"..
>>>
>>> Note that in the drm driver I already have a suitable wq to queue the
>>> work, so it really doesn't buy me anything to have the iommu driver
>>> toss things off to a wq for me.  Might be a different situation for
>>> other drivers (but I guess mostly other drivers are using iommu API
>>> indirectly via dma-mapping?)
>>
>> Okay, so since you are the only user for now, we don't need a
>> work-queue. But I still want the ->resume call-back to be hidden in the
>> iommu code and not be exposed to users.
>>
>> We already have per-domain fault-handlers, so the best solution for now
>> is to call ->resume from report_iommu_fault() when the fault-handler
>> returns a special value.
>
> The problem is that report_iommu_fault is called from IRQ context by the
> SMMU driver, so the device driver callback cannot sleep.
>
> So if the device driver needs to be able to sleep between fault report and
> resume, as I understand Rob needs for writing debugfs, we can either:
>
> * call report_iommu_fault from higher up, in a thread or workqueue.
> * split the fault reporting as this patch proposes. The exact same
>   mechanism is needed for the vSVM work by Intel: in order to inject fault
>   into the guest, they would like to have an atomic notifier registered by
>   VFIO for passing down the Page Request, and a new function in the IOMMU
>   API to resume/complete the fault.
>

I'm in favour of splitting the reporting *somehow*.. the two
approaches that seemed sane are:

1) call fault handler from irq and having separate domain->resume()
called by the driver, potentially from a wq
2) or having two fault callbacks, first called before wq and then
based on returned value, optionally 2nd callback called from wq

The first seemed less intrusive to me, but I'm flexible.

BR,
-R
Joerg Roedel Sept. 27, 2017, 12:15 p.m. UTC | #9
Hi Rob, Jean,

On Fri, Sep 22, 2017 at 02:42:44PM -0400, Rob Clark wrote:
> I'm in favour of splitting the reporting *somehow*.. the two
> approaches that seemed sane are:
> 
> 1) call fault handler from irq and having separate domain->resume()
> called by the driver, potentially from a wq
> 2) or having two fault callbacks, first called before wq and then
> based on returned value, optionally 2nd callback called from wq
> 
> The first seemed less intrusive to me, but I'm flexible.

How about adding a flag to the fault-handler call-back that tells us
whether it wants to sleep or not. If it wants, we call it from a wq, if
not we call it directly like we do today in the
report_iommu_fault() function.

In any case we call iommu_ops->resume() when set on completion of the
fault-handler either from the workqueue or report_iommu_fault itself.
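That flag-based dispatch might look roughly like this — a userspace sketch with invented names, and with the workqueue again collapsed into a direct call:

```c
#include <stdbool.h>

/*
 * Sketch of the flag idea above: the handler declares at
 * registration whether it needs to sleep; the core then calls it
 * inline from the IRQ path or defers it to a workqueue, and the
 * resume happens on completion either way.
 */

typedef int (*model_fault_fn)(void *token, unsigned long iova);

struct model_handler {
	model_fault_fn fn;
	bool wants_sleep;	/* registration-time flag */
	void *token;
};

struct model_stats {
	int inline_calls;	/* handler ran in atomic context */
	int deferred_calls;	/* handler ran from the wq */
};

int model_dispatch(const struct model_handler *h, struct model_stats *st,
		   unsigned long iova)
{
	if (h->wants_sleep) {
		/* kernel: queue_work(); the wq item then calls the
		 * handler and ->resume() with its return value */
		st->deferred_calls++;
		return h->fn(h->token, iova);
	}
	/* called directly, like report_iommu_fault() does today */
	st->inline_calls++;
	return h->fn(h->token, iova);
}

int model_noop_handler(void *token, unsigned long iova)
{
	(void)token; (void)iova;
	return 0;
}
```

As Rob notes later in the thread, a purely registration-time flag can't express his case, where the atomic/deferred choice depends on the individual fault.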


Regards,

	Joerg
Jean-Philippe Brucker Sept. 27, 2017, 1:49 p.m. UTC | #10
Hi Joerg,

On 27/09/17 13:15, Joerg Roedel wrote:
> Hi Rob, Jean,
> 
> On Fri, Sep 22, 2017 at 02:42:44PM -0400, Rob Clark wrote:
>> I'm in favour of splitting the reporting *somehow*.. the two
>> approaches that seemed sane are:
>> 
>> 1) call fault handler from irq and having separate domain->resume()
>> called by the driver, potentially from a wq
>> 2) or having two fault callbacks, first called before wq and then
>> based on returned value, optionally 2nd callback called from wq
>> 
>> The first seemed less intrusive to me, but I'm flexible.
> 
> How about adding a flag to the fault-handler call-back that tells us
> whether it wants to sleep or not. If it wants, we call it from a wq, if
> not we call it directly like we do today in the
> report_iommu_fault() function.
> 
> In any case we call iommu_ops->resume() when set on completion of the
> fault-handler either from the workqueue or report_iommu_fault itself.

I like this approach. When the device driver registers a fault handler,
it also tells when it would like to be called (either in atomic context,
blocking context, or both).

Then the handler itself receives a flag that says which context it's
being called from. It returns a value telling the IOMMU how to proceed.
Depending on this value we either resume/abort immediately, or add the
fault to the workqueue if necessary.

How about using the following return values:

/**
 * enum iommu_fault_status - Return status of fault handlers, telling the IOMMU
 *      driver how to proceed with the fault.
 *
 * @IOMMU_FAULT_STATUS_NONE: Fault was not handled. Call the next handler, or
 *      terminate.
 * @IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent faults from
 *      this device if possible. This is "Response Failure" in PCI PRI.
 * @IOMMU_FAULT_STATUS_INVALID: Could not handle this fault, don't retry the
 *      access. This is "Invalid Request" in PCI PRI.
 * @IOMMU_FAULT_STATUS_HANDLED: Fault has been handled and the page tables
 *      populated, retry the access.
 * @IOMMU_FAULT_STATUS_IGNORE: Stop processing the fault, and do not send a
 *      reply to the device.
 *
 * For unrecoverable faults, the only valid status is IOMMU_FAULT_STATUS_NONE
 * For a recoverable fault, if no one handled the fault, treat as
 * IOMMU_FAULT_STATUS_INVALID.
 */
enum iommu_fault_status {
        IOMMU_FAULT_STATUS_NONE = 0,
        IOMMU_FAULT_STATUS_FAILURE,
        IOMMU_FAULT_STATUS_INVALID,
        IOMMU_FAULT_STATUS_HANDLED,
        IOMMU_FAULT_STATUS_IGNORE,
};
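As an illustration of how the core might act on these values — the response enum and dispatch below are invented for the sketch; only the status enum comes from the proposal above:

```c
#include <stdbool.h>

/* Status enum copied from the proposal above. */
enum iommu_fault_status {
	IOMMU_FAULT_STATUS_NONE = 0,
	IOMMU_FAULT_STATUS_FAILURE,
	IOMMU_FAULT_STATUS_INVALID,
	IOMMU_FAULT_STATUS_HANDLED,
	IOMMU_FAULT_STATUS_IGNORE,
};

/* Invented response codes for the sketch. */
enum model_response {
	MODEL_RESP_RETRY,	/* let the stalled access proceed */
	MODEL_RESP_INVALID,	/* PCI PRI "Invalid Request" */
	MODEL_RESP_FAILURE,	/* PCI PRI "Response Failure" */
	MODEL_RESP_NONE,	/* no reply sent to the device */
};

enum model_response model_status_to_response(enum iommu_fault_status s,
					     bool recoverable)
{
	switch (s) {
	case IOMMU_FAULT_STATUS_HANDLED:
		return MODEL_RESP_RETRY;	/* page tables populated */
	case IOMMU_FAULT_STATUS_FAILURE:
		return MODEL_RESP_FAILURE;
	case IOMMU_FAULT_STATUS_INVALID:
		return MODEL_RESP_INVALID;
	case IOMMU_FAULT_STATUS_IGNORE:
		return MODEL_RESP_NONE;
	case IOMMU_FAULT_STATUS_NONE:
	default:
		/* unhandled: recoverable faults terminate as INVALID,
		 * unrecoverable ones get no reply */
		return recoverable ? MODEL_RESP_INVALID : MODEL_RESP_NONE;
	}
}
```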

This would probably cover the two use-cases of reporting faults to
device drivers, and injecting them into the guest with VFIO, as well as
handling PPRs internally. I'm also working on providing more details
(pasid for instance) in the fault callback.

We could also use the fault handler for invalid PRI Page Requests
(currently specialized by amd_iommu_set_invalid_ppr_cb). It's just a
matter of adding a registration flag to iommu_set_fault_handler.

Thanks,
Jean
Joerg Roedel Sept. 27, 2017, 2:35 p.m. UTC | #11
Hi Jean,

On Wed, Sep 27, 2017 at 02:49:00PM +0100, Jean-Philippe Brucker wrote:
> I like this approach. When the device driver registers a fault handler,
> it also tells when it would like to be called (either in atomic context,
> blocking context, or both).

Is there a use-case for calling the same handler from both contexts?

> enum iommu_fault_status {
>         IOMMU_FAULT_STATUS_NONE = 0,
>         IOMMU_FAULT_STATUS_FAILURE,
>         IOMMU_FAULT_STATUS_INVALID,
>         IOMMU_FAULT_STATUS_HANDLED,
>         IOMMU_FAULT_STATUS_IGNORE,
> };


This all certainly makes sense for the PRI/PASID case, but I don't think
that it makes sense yet to extend the existing report_iommu_fault()
interface to also handle PASID/PPR faults.

The later needs a lot more parameters to successfully handle a fault. In
the AMD driver these are all in 'struct fault', the relevant members
are:

        u64 address;
        u16 devid;
        u16 pasid;
        u16 tag;
        u16 finish;
        u16 flags;

And passing all this through the existing interface which also handles
non-pasid faults is cumbersome. So I'd like to keep the PASID/PPR
interface separate from the old one for now.

Regards,

	Joerg
Rob Clark Sept. 27, 2017, 4:14 p.m. UTC | #12
On Wed, Sep 27, 2017 at 9:49 AM, Jean-Philippe Brucker
<jean-philippe.brucker@arm.com> wrote:
> Hi Joerg,
>
> On 27/09/17 13:15, Joerg Roedel wrote:
>> Hi Rob, Jean,
>>
>> On Fri, Sep 22, 2017 at 02:42:44PM -0400, Rob Clark wrote:
>>> I'm in favour if splitting the reporting *somehow*.. the two
>>> approaches that seemed sane are:
>>>
>>> 1) call fault handler from irq and having separate domain->resume()
>>> called by the driver, potentially from a wq
>>> 2) or having two fault callbacks, first called before wq and then
>>> based on returned value, optionally 2nd callback called from wq
>>>
>>> The first seemed less intrusive to me, but I'm flexible.
>>
>> How about adding a flag to the fault-handler call-back that tells us
>> whether it wants to sleep or not. If it wants, we call it from a wq, if
>> not we call it directly like we do today in the
>> report_iommu_fault() function.
>>
>> In any case we call iommu_ops->resume() when set on completion of the
>> fault-handler either from the workqueue or report_iommu_fault itself.
>
> I like this approach. When the device driver registers a fault handler,
> it also tells when it would like to be called (either in atomic context,
> blocking context, or both).

What I have in mind is still a case-by-case decision.  Ie. I'd decide
if it is the first fault from a particular submit (job), in which case
I'd want to schedule the wq, or if it is one of the 999 following
faults from the same submit (in which case, skip the wq).

So a static decision when registering the fault handler doesn't work.

BR,
-R


> Then the handler itself receives a flag that says which context it's
> being called from. It returns a value telling the IOMMU how to proceed.
> Depending on this value we either resume/abort immediately, or add the
> fault to the workqueue if necessary.
>
> How about using the following return values:
>
> /**
>  * enum iommu_fault_status - Return status of fault handlers, telling the IOMMU
>  *      driver how to proceed with the fault.
>  *
>  * @IOMMU_FAULT_STATUS_NONE: Fault was not handled. Call the next handler, or
>  *      terminate.
>  * @IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent faults from
>  *      this device if possible. This is "Response Failure" in PCI PRI.
>  * @IOMMU_FAULT_STATUS_INVALID: Could not handle this fault, don't retry the
>  *      access. This is "Invalid Request" in PCI PRI.
>  * @IOMMU_FAULT_STATUS_HANDLED: Fault has been handled and the page tables
>  *      populated, retry the access.
>  * @IOMMU_FAULT_STATUS_IGNORE: Stop processing the fault, and do not send a
>  *      reply to the device.
>  *
>  * For unrecoverable faults, the only valid status is IOMMU_FAULT_STATUS_NONE.
>  * For a recoverable fault, if no one handled the fault, treat as
>  * IOMMU_FAULT_STATUS_INVALID.
>  */
> enum iommu_fault_status {
>         IOMMU_FAULT_STATUS_NONE = 0,
>         IOMMU_FAULT_STATUS_FAILURE,
>         IOMMU_FAULT_STATUS_INVALID,
>         IOMMU_FAULT_STATUS_HANDLED,
>         IOMMU_FAULT_STATUS_IGNORE,
> };
>
> This would probably cover the two use-cases of reporting faults to
> device drivers, and injecting them into the guest with VFIO, as well as
> handling PPRs internally. I'm also working on providing more details
> (pasid for instance) in the fault callback.
>
> We could also use the fault handler for invalid PRI Page Requests
> (currently specialized by amd_iommu_set_invalid_ppr_cb). It's just a
> matter of adding a registration flag to iommu_set_fault_handler.
>
> Thanks,
> Jean
Rob Clark May 10, 2019, 6:23 p.m. UTC | #13
On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
<jean-philippe.brucker@arm.com> wrote:
>
> On 22/09/17 10:02, Joerg Roedel wrote:
> > On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
> >> I would like to decide in the IRQ whether or not to queue work or not,
> >> because when we get a gpu fault, we tend to get 1000's of gpu faults
> >> all at once (and I really only need to handle the first one).  I
> >> suppose that could also be achieved by having a special return value
> >> from the fault handler to say "call me again from a wq"..
> >>
> >> Note that in the drm driver I already have a suitable wq to queue the
> >> work, so it really doesn't buy me anything to have the iommu driver
> >> toss things off to a wq for me.  Might be a different situation for
> >> other drivers (but I guess mostly other drivers are using iommu API
> >> indirectly via dma-mapping?)
> >
> > Okay, so since you are the only user for now, we don't need a
> > work-queue. But I still want the ->resume call-back to be hidden in the
> > iommu code and not be exposed to users.
> >
> > We already have per-domain fault-handlers, so the best solution for now
> > is to call ->resume from report_iommu_fault() when the fault-handler
> > returns a special value.
>
> The problem is that report_iommu_fault is called from IRQ context by the
> SMMU driver, so the device driver callback cannot sleep.
>
> So if the device driver needs to be able to sleep between fault report and
> resume, as I understand Rob needs for writing debugfs, we can either:
>
> * call report_iommu_fault from higher up, in a thread or workqueue.
> * split the fault reporting as this patch proposes. The exact same
>   mechanism is needed for the vSVM work by Intel: in order to inject fault
>   into the guest, they would like to have an atomic notifier registered by
>   VFIO for passing down the Page Request, and a new function in the IOMMU
>   API to resume/complete the fault.
>

So I was thinking about this topic again.. I would still like to get
some sort of async resume so that I can wire up GPU cmdstream/state
logging on iommu fault (without locally resurrecting and rebasing this
patch and drm/msm side changes each time I need to debug iommu
faults)..

And I do still prefer the fault cb in irq (or not requiring it in
wq)..  but on thinking about it, the two ideas aren't entirely
conflicting, ie. add some flags either when we register handler[1], or
they could be handled thru domain_set_attr, like:

 _EXPLICIT_RESUME - iommu API user calls iommu_domain_resume(),
potentially from wq/thread after fault handler returns
 _HANDLER_SLEEPS  - iommu core handles the wq, and calls ops->resume()
internally

In both cases, from the iommu driver PoV it just implements
iommu_ops::resume().. in first case it is called via iommu user either
from the fault handler or at some point later (ie. wq or thread).

I don't particularly need the _HANDLER_SLEEPS case (unless I can't
convince anyone that iommu_domain_resume() called from outside iommu
core is a good idea).. so probably I wouldn't wire up the wq plumbing
for the _HANDLER_SLEEPS case unless someone really wanted me to.

Since there are more iommu drivers than places that register fault
handlers, I like the idea that in either case, from the driver PoV, it
is just implementing the resume callback.
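The dispatch the core would do on those two proposed flags could be sketched like this (hypothetical flag and enum names, plain C):

```c
/* Hypothetical registration flags mirroring the proposal above. */
enum fault_handler_flags {
	FAULT_EXPLICIT_RESUME	= 1 << 0, /* user calls iommu_domain_resume() */
	FAULT_HANDLER_SLEEPS	= 1 << 1, /* core queues wq, calls ops->resume() */
};

enum resume_path {
	RESUME_INLINE,		/* core resumes right after the atomic handler */
	RESUME_BY_USER,		/* user resumes later, e.g. from its own wq */
	RESUME_BY_CORE_WQ,	/* core's wq runs the handler, then resumes */
};

/* What the iommu core would do after a stalled fault, per flags. */
static enum resume_path resume_path_for(unsigned int flags)
{
	if (flags & FAULT_HANDLER_SLEEPS)
		return RESUME_BY_CORE_WQ;
	if (flags & FAULT_EXPLICIT_RESUME)
		return RESUME_BY_USER;
	return RESUME_INLINE;
}
```

Either way the iommu driver only sees ops->resume(); the flags just choose who calls it and from what context.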

[1] currently I only see a few places where fault handlers are
registered, so changing iommu_set_fault_handler() is really not much
churn

BR,
-R
Jean-Philippe Brucker May 13, 2019, 6:37 p.m. UTC | #14
Hi Rob,

On 10/05/2019 19:23, Rob Clark wrote:
> On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
> <jean-philippe.brucker@arm.com> wrote:
>>
>> On 22/09/17 10:02, Joerg Roedel wrote:
>>> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
>>>> I would like to decide in the IRQ whether or not to queue work or not,
>>>> because when we get a gpu fault, we tend to get 1000's of gpu faults
>>>> all at once (and I really only need to handle the first one).  I
>>>> suppose that could also be achieved by having a special return value
>>>> from the fault handler to say "call me again from a wq"..
>>>>
>>>> Note that in the drm driver I already have a suitable wq to queue the
>>>> work, so it really doesn't buy me anything to have the iommu driver
>>>> toss things off to a wq for me.  Might be a different situation for
>>>> other drivers (but I guess mostly other drivers are using iommu API
>>>> indirectly via dma-mapping?)
>>>
>>> Okay, so since you are the only user for now, we don't need a
>>> work-queue. But I still want the ->resume call-back to be hidden in the
>>> iommu code and not be exposed to users.
>>>
>>> We already have per-domain fault-handlers, so the best solution for now
>>> is to call ->resume from report_iommu_fault() when the fault-handler
>>> returns a special value.
>>
>> The problem is that report_iommu_fault is called from IRQ context by the
>> SMMU driver, so the device driver callback cannot sleep.
>>
>> So if the device driver needs to be able to sleep between fault report and
>> resume, as I understand Rob needs for writing debugfs, we can either:
>>
>> * call report_iommu_fault from higher up, in a thread or workqueue.
>> * split the fault reporting as this patch proposes. The exact same
>>   mechanism is needed for the vSVM work by Intel: in order to inject fault
>>   into the guest, they would like to have an atomic notifier registered by
>>   VFIO for passing down the Page Request, and a new function in the IOMMU
>>   API to resume/complete the fault.
>>
> 
> So I was thinking about this topic again.. I would still like to get
> some sort of async resume so that I can wire up GPU cmdstream/state
> logging on iommu fault (without locally resurrecting and rebasing this
> patch and drm/msm side changes each time I need to debug iommu
> faults)..

We've been working on the new fault reporting API with Jacob and Eric,
and I intend to send it out soon. It is supposed to be used for
reporting faults to guests via VFIO, handling page faults via mm, and
also reporting events directly to device drivers. Please let us know
what works and what doesn't in your case.

The most recent version of the patches is at
http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
(git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
list sometime next week; I'll add you on Cc.

In particular, see commits
	iommu: Introduce device fault data
	iommu: Introduce device fault report API
	iommu: Add recoverable fault reporting

The device driver calls iommu_register_device_fault_handler(dev, cb,
data). To report a fault, the SMMU driver calls
iommu_report_device_fault(dev, fault). This calls into the device driver
directly, there isn't any workqueue. If the fault is recoverable (the
SMMU driver sets type IOMMU_FAULT_PAGE_REQ rather than
IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
once it has dealt with the fault (after sleeping if it needs to). This
invokes the SMMU driver's resume callback.
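A userspace-C model of that round trip (toy stand-ins for iommu_report_device_fault() and iommu_page_response(); names and structure here are illustrative, not the kernel API):

```c
/* Toy model of the recoverable-fault flow described above. */
enum fault_type { FAULT_PAGE_REQ, FAULT_DMA_UNRECOV };

struct dev_model {
	int (*handler)(struct dev_model *dev, enum fault_type type);
	int resumed;	/* times the "SMMU resume" hook was invoked */
};

/* Stand-in for iommu_report_device_fault(): deliver to the driver. */
static int report_device_fault(struct dev_model *dev, enum fault_type type)
{
	return dev->handler(dev, type);
}

/* Stand-in for iommu_page_response(): called by the driver once it
 * has dealt with the fault (possibly after sleeping); this is what
 * ends up invoking the SMMU driver's resume callback. */
static void page_response(struct dev_model *dev)
{
	dev->resumed++;	/* stands in for the SMMU ->resume() hook */
}

/* A driver handler: accept recoverable faults, reject the rest. */
static int gpu_fault_handler(struct dev_model *dev, enum fault_type type)
{
	(void)dev;
	return type == FAULT_PAGE_REQ ? 0 : -1;
}

/* Run one fault through the whole flow; returns the resume count. */
static int demo_flow(enum fault_type type)
{
	struct dev_model dev = { .handler = gpu_fault_handler };

	if (report_device_fault(&dev, type) == 0)
		page_response(&dev);	/* driver responds, maybe much later */
	return dev.resumed;
}
```

The key property is the split: reporting and responding are separate calls, so the driver can sleep in between; only recoverable faults ever produce a response.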

At the moment we use mutexes, so iommu_report_device_fault() can only be
called from an IRQ thread, which is incompatible with the current SMMUv2
driver. Either we need to switch the SMMUv2 driver to an IRQ thread, or
rework the fault handler to be called from an IRQ handler. The reporting
also has to be per device rather than per domain, and I'm not sure if
the SMMUv2 driver can deal with this.

> 
> And I do still prefer the fault cb in irq (or not requiring it in
> wq)..  but on thinking about it, the two ideas aren't entirely
> conflicting, ie. add some flags either when we register handler[1], or
> they could be handled thru domain_set_attr, like:
> 
>  _EXPLICIT_RESUME - iommu API user calls iommu_domain_resume(),
> potentialy from wq/thread after fault handler returns
>  _HANDLER_SLEEPS  - iommu core handles the wq, and calls ops->resume()
> internally
> 
> In both cases, from the iommu driver PoV it just implements
> iommu_ops::resume().. in first case it is called via iommu user either
> from the fault handler or at some point later (ie. wq or thread).
> 
> I don't particularly need the _HANDLER_SLEEPS case (unless I can't
> convince anyone that iommu_domamin_resume() called from outside iommu
> core is a good idea).. so probably I wouldn't wire up the wq plumbing
> for the _HANDLER_SLEEPS case unless someone really wanted me to.
> 
> Since there are more iommu drivers, than places that register fault
> handlers, I like the idea that in either case, from the driver PoV, it
> is just implementing the resume callback.
> 
> [1] currently I only see a few places where fault handlers are
> registered, so changing iommu_set_fault_handler() is really not much
> churn

At the moment we're keeping the new fault reporting mechanism separate
from iommu_set_fault_handler()/report_iommu_fault(), to ease the transition.

Thanks,
Jean
Rob Clark May 14, 2019, 1:54 a.m. UTC | #15
On Mon, May 13, 2019 at 11:37 AM Jean-Philippe Brucker
<jean-philippe.brucker@arm.com> wrote:
>
> Hi Rob,
>
> On 10/05/2019 19:23, Rob Clark wrote:
> > On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
> > <jean-philippe.brucker@arm.com> wrote:
> >>
> >> On 22/09/17 10:02, Joerg Roedel wrote:
> >>> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
> >>>> I would like to decide in the IRQ whether or not to queue work or not,
> >>>> because when we get a gpu fault, we tend to get 1000's of gpu faults
> >>>> all at once (and I really only need to handle the first one).  I
> >>>> suppose that could also be achieved by having a special return value
> >>>> from the fault handler to say "call me again from a wq"..
> >>>>
> >>>> Note that in the drm driver I already have a suitable wq to queue the
> >>>> work, so it really doesn't buy me anything to have the iommu driver
> >>>> toss things off to a wq for me.  Might be a different situation for
> >>>> other drivers (but I guess mostly other drivers are using iommu API
> >>>> indirectly via dma-mapping?)
> >>>
> >>> Okay, so since you are the only user for now, we don't need a
> >>> work-queue. But I still want the ->resume call-back to be hidden in the
> >>> iommu code and not be exposed to users.
> >>>
> >>> We already have per-domain fault-handlers, so the best solution for now
> >>> is to call ->resume from report_iommu_fault() when the fault-handler
> >>> returns a special value.
> >>
> >> The problem is that report_iommu_fault is called from IRQ context by the
> >> SMMU driver, so the device driver callback cannot sleep.
> >>
> >> So if the device driver needs to be able to sleep between fault report and
> >> resume, as I understand Rob needs for writing debugfs, we can either:
> >>
> >> * call report_iommu_fault from higher up, in a thread or workqueue.
> >> * split the fault reporting as this patch proposes. The exact same
> >>   mechanism is needed for the vSVM work by Intel: in order to inject fault
> >>   into the guest, they would like to have an atomic notifier registered by
> >>   VFIO for passing down the Page Request, and a new function in the IOMMU
> >>   API to resume/complete the fault.
> >>
> >
> > So I was thinking about this topic again.. I would still like to get
> > some sort of async resume so that I can wire up GPU cmdstream/state
> > logging on iommu fault (without locally resurrecting and rebasing this
> > patch and drm/msm side changes each time I need to debug iommu
> > faults)..
>
> We've been working on the new fault reporting API with Jacob and Eric,
> and I intend to send it out soon. It is supposed to be used for
> reporting faults to guests via VFIO, handling page faults via mm, and
> also reporting events directly to device drivers. Please let us know
> what works and what doesn't in your case
>
> The most recent version of the patches is at
> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
> (git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
> list sometimes next week, I'll add you on Cc.
>
> In particular, see commits
>         iommu: Introduce device fault data
>         iommu: Introduce device fault report API
>         iommu: Add recoverable fault reporting
>
> The device driver calls iommu_register_device_fault_handler(dev, cb,
> data). To report a fault, the SMMU driver calls
> iommu_report_device_fault(dev, fault). This calls into the device driver
> directly, there isn't any workqueue. If the fault is recoverable (the
> SMMU driver set type IOMMU_FAULT_PAGE_REQ rather than
> IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
> once it has dealt with the fault (after sleeping if it needs to). This
> invokes the SMMU driver's resume callback.

Ok, this sounds at a high level similar to my earlier RFC, in that
resume is split (and that was the main thing I was interested in).
And it does solve one thing I was struggling with, namely that when
the domain is created it doesn't know which iommu device it will be
attached to (given that at least the original arm-smmu.c driver cannot
support stall in all cases)..

For GPU translation faults, I also don't really need to know if the
faulting translation is stalled until the callback (I mainly want to
skip snapshotting GPU state when the translation is not stalled,
because in that case the data we snapshot is unlikely to be related to
the fault).

> At the moment we use mutexes, so iommu_report_device_fault() can only be
> called from an IRQ thread, which is incompatible with the current SMMUv2
> driver. Either we need to switch the SMMUv2 driver to an IRQ thread, or
> rework the fault handler to be called from an IRQ handler. The reporting
> also has to be per device rather than per domain, and I'm not sure if
> the SMMUv2 driver can deal with this.

I'll take a closer look at the branch and try to formulate some plan
to add v2 support for this.

For my cases, the GPU always has its own iommu device, while display
and other blocks share an apps_smmu.. although this sort of
functionality isn't really required outside of the GPU.. but I'll have
to think a bit about how we can support both cases in the single v2
driver.

BR,
-R

>
> >
> > And I do still prefer the fault cb in irq (or not requiring it in
> > wq)..  but on thinking about it, the two ideas aren't entirely
> > conflicting, ie. add some flags either when we register handler[1], or
> > they could be handled thru domain_set_attr, like:
> >
> >  _EXPLICIT_RESUME - iommu API user calls iommu_domain_resume(),
> > potentialy from wq/thread after fault handler returns
> >  _HANDLER_SLEEPS  - iommu core handles the wq, and calls ops->resume()
> > internally
> >
> > In both cases, from the iommu driver PoV it just implements
> > iommu_ops::resume().. in first case it is called via iommu user either
> > from the fault handler or at some point later (ie. wq or thread).
> >
> > I don't particularly need the _HANDLER_SLEEPS case (unless I can't
> > convince anyone that iommu_domamin_resume() called from outside iommu
> > core is a good idea).. so probably I wouldn't wire up the wq plumbing
> > for the _HANDLER_SLEEPS case unless someone really wanted me to.
> >
> > Since there are more iommu drivers, than places that register fault
> > handlers, I like the idea that in either case, from the driver PoV, it
> > is just implementing the resume callback.
> >
> > [1] currently I only see a few places where fault handlers are
> > registered, so changing iommu_set_fault_handler() is really not much
> > churn
>
> At the moment we're keeping the new fault reporting mechanism separate
> from iommu_set_fault_handler()/report_iommu_fault(), to ease the transition.
>
> Thanks,
> Jean
Robin Murphy May 14, 2019, 10:24 a.m. UTC | #16
On 14/05/2019 02:54, Rob Clark wrote:
> On Mon, May 13, 2019 at 11:37 AM Jean-Philippe Brucker
> <jean-philippe.brucker@arm.com> wrote:
>>
>> Hi Rob,
>>
>> On 10/05/2019 19:23, Rob Clark wrote:
>>> On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
>>> <jean-philippe.brucker@arm.com> wrote:
>>>>
>>>> On 22/09/17 10:02, Joerg Roedel wrote:
>>>>> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
>>>>>> I would like to decide in the IRQ whether or not to queue work or not,
>>>>>> because when we get a gpu fault, we tend to get 1000's of gpu faults
>>>>>> all at once (and I really only need to handle the first one).  I
>>>>>> suppose that could also be achieved by having a special return value
>>>>>> from the fault handler to say "call me again from a wq"..
>>>>>>
>>>>>> Note that in the drm driver I already have a suitable wq to queue the
>>>>>> work, so it really doesn't buy me anything to have the iommu driver
>>>>>> toss things off to a wq for me.  Might be a different situation for
>>>>>> other drivers (but I guess mostly other drivers are using iommu API
>>>>>> indirectly via dma-mapping?)
>>>>>
>>>>> Okay, so since you are the only user for now, we don't need a
>>>>> work-queue. But I still want the ->resume call-back to be hidden in the
>>>>> iommu code and not be exposed to users.
>>>>>
>>>>> We already have per-domain fault-handlers, so the best solution for now
>>>>> is to call ->resume from report_iommu_fault() when the fault-handler
>>>>> returns a special value.
>>>>
>>>> The problem is that report_iommu_fault is called from IRQ context by the
>>>> SMMU driver, so the device driver callback cannot sleep.
>>>>
>>>> So if the device driver needs to be able to sleep between fault report and
>>>> resume, as I understand Rob needs for writing debugfs, we can either:
>>>>
>>>> * call report_iommu_fault from higher up, in a thread or workqueue.
>>>> * split the fault reporting as this patch proposes. The exact same
>>>>    mechanism is needed for the vSVM work by Intel: in order to inject fault
>>>>    into the guest, they would like to have an atomic notifier registered by
>>>>    VFIO for passing down the Page Request, and a new function in the IOMMU
>>>>    API to resume/complete the fault.
>>>>
>>>
>>> So I was thinking about this topic again.. I would still like to get
>>> some sort of async resume so that I can wire up GPU cmdstream/state
>>> logging on iommu fault (without locally resurrecting and rebasing this
>>> patch and drm/msm side changes each time I need to debug iommu
>>> faults)..
>>
>> We've been working on the new fault reporting API with Jacob and Eric,
>> and I intend to send it out soon. It is supposed to be used for
>> reporting faults to guests via VFIO, handling page faults via mm, and
>> also reporting events directly to device drivers. Please let us know
>> what works and what doesn't in your case
>>
>> The most recent version of the patches is at
>> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
>> (git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
>> list sometimes next week, I'll add you on Cc.
>>
>> In particular, see commits
>>          iommu: Introduce device fault data
>>          iommu: Introduce device fault report API
>>          iommu: Add recoverable fault reporting
>>
>> The device driver calls iommu_register_device_fault_handler(dev, cb,
>> data). To report a fault, the SMMU driver calls
>> iommu_report_device_fault(dev, fault). This calls into the device driver
>> directly, there isn't any workqueue. If the fault is recoverable (the
>> SMMU driver set type IOMMU_FAULT_PAGE_REQ rather than
>> IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
>> once it has dealt with the fault (after sleeping if it needs to). This
>> invokes the SMMU driver's resume callback.
> 
> Ok, this sounds at a high level similar to my earlier RFC, in that
> resume is split (and that was the main thing I was interested in).
> And it does solve one thing I was struggling with, namely that when
> the domain is created it doesn't know which iommu device it will be
> attached to (given that at least the original arm-smmu.c driver cannot
> support stall in all cases)..
> 
> For GPU translation faults, I also don't really need to know if the
> faulting translation is stalled until the callback (I mainly want to
> not bother to snapshot GPU state if it is not stalled, because in that
> case the data we snapshot is unlikely to be related to the fault if
> the translation is not stalled).
> 
>> At the moment we use mutexes, so iommu_report_device_fault() can only be
>> called from an IRQ thread, which is incompatible with the current SMMUv2
>> driver. Either we need to switch the SMMUv2 driver to an IRQ thread, or
>> rework the fault handler to be called from an IRQ handler. The reporting
>> also has to be per device rather than per domain, and I'm not sure if
>> the SMMUv2 driver can deal with this.
> 
> I'll take a closer look at the branch and try to formulate some plan
> to add v2 support for this.

What's fun is that we should be able to identify a stream ID for most 
context faults *except* translation faults...

We've considered threaded IRQs before, and IIRC the problem with doing 
it at the architectural level is that in some cases the fault interrupt 
can only be deasserted by actually resuming/terminating the stalled 
transaction.

> For my cases, the GPU always has it's own iommu device, while display
> and other blocks share an apps_smmu.. although this sort of
> functionality isn't really required outside of the GPU.. but I'll have
> to think a bit about how we can support both cases in the single v2
> driver.

With the above said, I am in the middle of a big refactoring[1] to allow 
everyone's imp-def stuff to coexist nicely, so ultimately if qcom 
implementations can guarantee the appropriate hardware behaviour then 
they can have their own interrupt handlers to accommodate this.

Robin.

[1] 
http://linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/smmu-impl 
- note that this is very, very WIP right now
Rob Clark May 14, 2019, 5:17 p.m. UTC | #17
On Tue, May 14, 2019 at 3:24 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 14/05/2019 02:54, Rob Clark wrote:
> > On Mon, May 13, 2019 at 11:37 AM Jean-Philippe Brucker
> > <jean-philippe.brucker@arm.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 10/05/2019 19:23, Rob Clark wrote:
> >>> On Fri, Sep 22, 2017 at 2:58 AM Jean-Philippe Brucker
> >>> <jean-philippe.brucker@arm.com> wrote:
> >>>>
> >>>> On 22/09/17 10:02, Joerg Roedel wrote:
> >>>>> On Tue, Sep 19, 2017 at 10:23:43AM -0400, Rob Clark wrote:
> >>>>>> I would like to decide in the IRQ whether or not to queue work or not,
> >>>>>> because when we get a gpu fault, we tend to get 1000's of gpu faults
> >>>>>> all at once (and I really only need to handle the first one).  I
> >>>>>> suppose that could also be achieved by having a special return value
> >>>>>> from the fault handler to say "call me again from a wq"..
> >>>>>>
> >>>>>> Note that in the drm driver I already have a suitable wq to queue the
> >>>>>> work, so it really doesn't buy me anything to have the iommu driver
> >>>>>> toss things off to a wq for me.  Might be a different situation for
> >>>>>> other drivers (but I guess mostly other drivers are using iommu API
> >>>>>> indirectly via dma-mapping?)
> >>>>>
> >>>>> Okay, so since you are the only user for now, we don't need a
> >>>>> work-queue. But I still want the ->resume call-back to be hidden in the
> >>>>> iommu code and not be exposed to users.
> >>>>>
> >>>>> We already have per-domain fault-handlers, so the best solution for now
> >>>>> is to call ->resume from report_iommu_fault() when the fault-handler
> >>>>> returns a special value.
> >>>>
> >>>> The problem is that report_iommu_fault is called from IRQ context by the
> >>>> SMMU driver, so the device driver callback cannot sleep.
> >>>>
> >>>> So if the device driver needs to be able to sleep between fault report and
> >>>> resume, as I understand Rob needs for writing debugfs, we can either:
> >>>>
> >>>> * call report_iommu_fault from higher up, in a thread or workqueue.
> >>>> * split the fault reporting as this patch proposes. The exact same
> >>>>    mechanism is needed for the vSVM work by Intel: in order to inject fault
> >>>>    into the guest, they would like to have an atomic notifier registered by
> >>>>    VFIO for passing down the Page Request, and a new function in the IOMMU
> >>>>    API to resume/complete the fault.
> >>>>
> >>>
> >>> So I was thinking about this topic again.. I would still like to get
> >>> some sort of async resume so that I can wire up GPU cmdstream/state
> >>> logging on iommu fault (without locally resurrecting and rebasing this
> >>> patch and drm/msm side changes each time I need to debug iommu
> >>> faults)..
> >>
> >> We've been working on the new fault reporting API with Jacob and Eric,
> >> and I intend to send it out soon. It is supposed to be used for
> >> reporting faults to guests via VFIO, handling page faults via mm, and
> >> also reporting events directly to device drivers. Please let us know
> >> what works and what doesn't in your case
> >>
> >> The most recent version of the patches is at
> >> http://www.linux-arm.org/git?p=linux-jpb.git;a=shortlog;h=refs/heads/sva/api
> >> (git://www.linux-arm.org/linux-jpb.git branch sva/api). Hopefully on the
> >> list sometimes next week, I'll add you on Cc.
> >>
> >> In particular, see commits
> >>          iommu: Introduce device fault data
> >>          iommu: Introduce device fault report API
> >>          iommu: Add recoverable fault reporting
> >>
> >> The device driver calls iommu_register_device_fault_handler(dev, cb,
> >> data). To report a fault, the SMMU driver calls
> >> iommu_report_device_fault(dev, fault). This calls into the device driver
> >> directly, there isn't any workqueue. If the fault is recoverable (the
> >> SMMU driver set type IOMMU_FAULT_PAGE_REQ rather than
> >> IOMMU_FAULT_DMA_UNRECOV), the device driver calls iommu_page_response()
> >> once it has dealt with the fault (after sleeping if it needs to). This
> >> invokes the SMMU driver's resume callback.
> >
> > Ok, this sounds at a high level similar to my earlier RFC, in that
> > resume is split (and that was the main thing I was interested in).
> > And it does solve one thing I was struggling with, namely that when
> > the domain is created it doesn't know which iommu device it will be
> > attached to (given that at least the original arm-smmu.c driver cannot
> > support stall in all cases)..
> >
> > For GPU translation faults, I also don't really need to know if the
> > faulting translation is stalled until the callback (I mainly want to
> > not bother to snapshot GPU state if it is not stalled, because in that
> > case the data we snapshot is unlikely to be related to the fault if
> > the translation is not stalled).
> >
> >> At the moment we use mutexes, so iommu_report_device_fault() can only be
> >> called from an IRQ thread, which is incompatible with the current SMMUv2
> >> driver. Either we need to switch the SMMUv2 driver to an IRQ thread, or
> >> rework the fault handler to be called from an IRQ handler. The reporting
> >> also has to be per device rather than per domain, and I'm not sure if
> >> the SMMUv2 driver can deal with this.
> >
> > I'll take a closer look at the branch and try to formulate some plan
> > to add v2 support for this.
>
> What's fun is that we should be able to identify a stream ID for most
> context faults *except* translation faults...
>
> We've considered threaded IRQs before, and IIRC the problem with doing
> it at the architectural level is that in some cases the fault interrupt
> can only be deasserted by actually resuming/terminating the stalled
> transaction.
>
> > For my cases, the GPU always has it's own iommu device, while display
> > and other blocks share an apps_smmu.. although this sort of
> > functionality isn't really required outside of the GPU.. but I'll have
> > to think a bit about how we can support both cases in the single v2
> > driver.
>
> With the above said, I am in the middle of a big refactoring[1] to allow
> everyone's imp-def stuff to coexist nicely, so ultimately if qcom
> implementations can guarantee the appropriate hardware behaviour then
> they can have their own interrupt handlers to accommodate this.
>

Ok, maybe I'll hold off a bit and work on other things, to avoid feet stomping..

I don't suppose you have thoughts about split pagetables, which is one
of the things we want for implementing per-context pagetables?  I
suppose that is less of an impl thing and more an architecture thing,
but maybe no one on other implementations wants this?

BR,
-R


> Robin.
>
> [1]
> http://linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/smmu-impl
> - note that this is very, very WIP right now
Patch

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index fe8e7fd61282..50131985a1e7 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -239,6 +239,7 @@  struct arm_smmu_domain {
 	struct io_pgtable_ops		*pgtbl_ops;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
+	bool				stall;
 	struct mutex			init_mutex; /* Protects smmu pointer */
 	spinlock_t			cb_lock; /* Serialises ATS1* ops */
 	struct iommu_domain		domain;
@@ -544,6 +545,24 @@  static const struct iommu_gather_ops arm_smmu_s2_tlb_ops_v1 = {
 	.tlb_sync	= arm_smmu_tlb_sync_vmid,
 };
 
+static void arm_smmu_domain_resume(struct iommu_domain *domain, bool terminate)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	void __iomem *cb_base;
+	unsigned val;
+
+	cb_base = ARM_SMMU_CB(smmu, cfg->cbndx);
+
+	if (terminate)
+		val = RESUME_TERMINATE;
+	else
+		val = RESUME_RETRY;
+
+	writel_relaxed(val, cb_base + ARM_SMMU_CB_RESUME);
+}
+
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
 	u32 fsr, fsynr;
@@ -563,11 +582,14 @@  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
 	iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
 
-	dev_err_ratelimited(smmu->dev,
-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
-			    fsr, iova, fsynr, cfg->cbndx);
-
 	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
+
+	if (!report_iommu_fault(domain, smmu->dev, iova, 0)) {
+		dev_err_ratelimited(smmu->dev,
+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
+				    fsr, iova, fsynr, cfg->cbndx);
+	}
+
 	return IRQ_HANDLED;
 }
 
@@ -698,6 +720,8 @@  static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
 
 	/* SCTLR */
 	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_AFE | SCTLR_TRE | SCTLR_M;
+	if (smmu_domain->stall)
+		reg |= SCTLR_CFCFG;    /* stall on fault */
 	if (stage1)
 		reg |= SCTLR_S1_ASIDPNE;
 #ifdef __BIG_ENDIAN
@@ -1524,6 +1548,9 @@  static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 			smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
 
 		break;
+	case DOMAIN_ATTR_STALL:
+		smmu_domain->stall = *(bool *)data;
+		break;
 	default:
 		ret = -ENODEV;
 	}
@@ -1587,6 +1614,7 @@  static struct iommu_ops arm_smmu_ops = {
 	.device_group		= arm_smmu_device_group,
 	.domain_get_attr	= arm_smmu_domain_get_attr,
 	.domain_set_attr	= arm_smmu_domain_set_attr,
+	.domain_resume		= arm_smmu_domain_resume,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= arm_smmu_put_resv_regions,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3f6ea160afed..49eecfb7abd7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1788,6 +1788,27 @@  int iommu_domain_set_attr(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
 
+/**
+ * iommu_domain_resume() - Resume translations for a domain after a fault.
+ * @domain:    the domain to resume
+ * @terminate: if true, the translation that triggered the fault is
+ *    terminated; otherwise it is retried.
+ *
+ * This can be called at some point after the fault handler runs,
+ * allowing the user of the IOMMU to (for example) handle the fault
+ * from task context.  It is illegal to call this if
+ * iommu_domain_set_attr(STALL) failed.
+ */
+void iommu_domain_resume(struct iommu_domain *domain, bool terminate)
+{
+	/* invalid to call if iommu_domain_set_attr(STALL) failed: */
+	if (WARN_ON(!domain->ops->domain_resume))
+		return;
+	domain->ops->domain_resume(domain, terminate);
+}
+EXPORT_SYMBOL_GPL(iommu_domain_resume);
+
 void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 {
 	const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2cb54adc4a33..2154fe2591a0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -124,6 +124,17 @@  enum iommu_attr {
 	DOMAIN_ATTR_FSL_PAMU_ENABLE,
 	DOMAIN_ATTR_FSL_PAMUV1,
 	DOMAIN_ATTR_NESTING,	/* two stages of translation */
+	/*
+	 * If DOMAIN_ATTR_STALL is enabled, the domain stalls faulting
+	 * translations, and the user of the domain must call
+	 * iommu_domain_resume() at some point (either from the fault
+	 * handler, or asynchronously after the fault handler returns,
+	 * for example from a workqueue) to resume translation.
+	 *
+	 * The attribute value is a bool, and should be set before
+	 * attaching the domain.
+	 */
+	DOMAIN_ATTR_STALL,
 	DOMAIN_ATTR_MAX,
 };
 
@@ -208,6 +219,8 @@  struct iommu_ops {
 	int (*domain_set_attr)(struct iommu_domain *domain,
 			       enum iommu_attr attr, void *data);
 
+	void (*domain_resume)(struct iommu_domain *domain, bool terminate);
+
 	/* Request/Free a list of reserved regions for a device */
 	void (*get_resv_regions)(struct device *dev, struct list_head *list);
 	void (*put_resv_regions)(struct device *dev, struct list_head *list);
@@ -333,6 +346,7 @@  extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
 extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
+extern void iommu_domain_resume(struct iommu_domain *domain, bool terminate);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,