diff mbox

[v11,3/3] vt-d: fix vt-d Device-TLB flush timeout issue

Message ID 1464771922-7794-4-git-send-email-quan.xu@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Quan Xu June 1, 2016, 9:05 a.m. UTC
From: Quan Xu <quan.xu@intel.com>

If Device-TLB flush timed out, we hide the target ATS device
immediately and crash the domain owning this ATS device. If
impacted domain is hardware domain, just throw out a warning.

By hiding the device, we make sure it can't be assigned to any
domain any longer (see device_assigned).

v11:
   1. Pass down struct pci_dev *, instead of SBDF inside struct
      pci_ats_dev.
   2. change invalidate_sync() back as I add a specific function
      - dev_invalidate_sync() for device IOTLB.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/pci.c         |  6 +--
 xen/drivers/passthrough/vtd/extern.h  |  5 +-
 xen/drivers/passthrough/vtd/qinval.c  | 86 +++++++++++++++++++++++++++++------
 xen/drivers/passthrough/vtd/x86/ats.c |  7 +--
 xen/include/xen/pci.h                 |  1 +
 5 files changed, 82 insertions(+), 23 deletions(-)

Comments

Jan Beulich June 2, 2016, 11:07 a.m. UTC | #1
>>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/extern.h
> +++ b/xen/drivers/passthrough/vtd/extern.h
> @@ -21,6 +21,7 @@
>  #define _VTD_EXTERN_H_
>  
>  #include "dmar.h"
> +#include "../ats.h"

Why? You don't de-reference struct pci_ats_dev * in this file, so
all you'd need is a forward declaration. But then this is not in line
with your v11 change description above, so I wonder whether
you actually sent a stale patch. After all I thought the v10
discussion (see
http://lists.xenproject.org/archives/html/xen-devel/2016-05/msg02208.html
) had made clear that this passing down, besides reducing the
number of arguments of some function, would also be meant to
eliminate ...

> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         struct pci_ats_dev *ats_dev)
> +{
> +    struct domain *d = NULL;
> +    struct pci_dev *pdev;
> +
> +    if ( test_bit(did, iommu->domid_bitmap) )
> +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +
> +    /*
> +     * In case the domain has been freed or the IOMMU domid bitmap is
> +     * not valid, the device no longer belongs to this domain.
> +     */
> +    if ( d == NULL )
> +        return;
> +
> +    pcidevs_lock();
> +
> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == ats_dev->seg) &&
> +             (pdev->bus == ats_dev->bus) &&
> +             (pdev->devfn == ats_dev->devfn) )
> +        {
> +            ASSERT(pdev->domain);
> +            list_del(&pdev->domain_list);
> +            pdev->domain = NULL;
> +            pci_hide_existing_device(pdev);
> +            break;
> +        }
> +    }
> +
> +    pcidevs_unlock();

... this loop (and locking). (Of course such a change may better be
done in another preparatory patch.)

> +    if ( !is_hardware_domain(d) )
> +        domain_crash(d);
> +    else
> +        printk(XENLOG_WARNING VTDPREFIX
> +               " dom%d: ATS device %04x:%02x:%02x.%u flush failed\n",
> +               d->domain_id, ats_dev->seg, ats_dev->bus,
> +               PCI_SLOT(ats_dev->devfn), PCI_FUNC(ats_dev->devfn));

Please use the same logic for logging and crashing as you do in
the other series, so that at least on average a resulting DomU
crash will be accompanied with some indication of the reason
beyond just the source file name and line number.

> +static int __must_check dev_invalidate_sync(struct iommu *iommu, u16 did,
> +                                            struct pci_ats_dev *ats_dev)
> +{
> +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc = 0;
> +
> +    if ( qi_ctrl->qinval_maddr )
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
> +
> +        if ( rc == -ETIMEDOUT )
> +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
> +    }
> +
> +    return rc;
> +}

I've never really understood why invalidate_sync() returns success
when it didn't do anything. Now that you copy this same behavior
here, I really need to ask you to explain that.

Jan
Quan Xu June 16, 2016, 8:42 a.m. UTC | #2
(* I will CC arm/amd maintainer after this vt-d specific discussion, and then send out my proposal...)

On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/extern.h
> > +++ b/xen/drivers/passthrough/vtd/extern.h
> > @@ -21,6 +21,7 @@
> >  #define _VTD_EXTERN_H_
> >
> >  #include "dmar.h"
> > +#include "../ats.h"
> 
> Why? You don't de-reference struct pci_ats_dev * in this file, so all you'd need
> is a forward declaration. But then this is not in line with your v11 change
> description above, so I wonder whether you actually sent a stale patch.

Sorry, this patch is really strange to me.

> After all I thought the v10 discussion (see
> http://lists.xenproject.org/archives/html/xen-devel/2016-
> 05/msg02208.html
> ) had made clear that this passing down,

Sure, what you said is very clear. But I read these code again, I found a  pci_get_pdev_by_domain()
Can also get *pdev without below loop.. (also hardware domain calls pci_get_pdev_by_domain() to get pdev.)

To be honest,  now I don't like to add a struct pci_dev * inside struct pci_ats_dev, as I need to change
' struct pci_ats_dev *pdev' to ' struct pci_ats_dev * pci_ats_dev ' in some functions as well.


> besides reducing the number of
> arguments of some function, would also be meant to eliminate ...
> 
> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         struct pci_ats_dev *ats_dev)
> > +{
> > +    struct domain *d = NULL;
> > +    struct pci_dev *pdev;
> > +
> > +    if ( test_bit(did, iommu->domid_bitmap) )
> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +
> > +    /*
> > +     * In case the domain has been freed or the IOMMU domid bitmap is
> > +     * not valid, the device no longer belongs to this domain.
> > +     */
> > +    if ( d == NULL )
> > +        return;
> > +
> > +    pcidevs_lock();
> > +
> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == ats_dev->seg) &&
> > +             (pdev->bus == ats_dev->bus) &&
> > +             (pdev->devfn == ats_dev->devfn) )
> > +        {
> > +            ASSERT(pdev->domain);
> > +            list_del(&pdev->domain_list);
> > +            pdev->domain = NULL;
> > +            pci_hide_existing_device(pdev);
> > +            break;
> > +        }
> > +    }
> > +
> > +    pcidevs_unlock();
> 
> ... this loop (and locking). (Of course such a change may better be done in
> another preparatory patch.)
> 

To eliminate the locking?  I am afraid the locking is still a must here even without the loop, also referring  to device_assigned()..


> 
> > +static int __must_check dev_invalidate_sync(struct iommu *iommu, u16
> did,
> > +                                            struct pci_ats_dev
> > +*ats_dev) {
> > +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc = 0;
> > +
> > +    if ( qi_ctrl->qinval_maddr )
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
> > +
> > +        if ( rc == -ETIMEDOUT )
> > +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
> > +    }
> > +
> > +    return rc;
> > +}
> 
> I've never really understood why invalidate_sync() returns success when it
> didn't do anything. Now that you copy this same behavior here, I really need
> to ask you to explain that.
> 

It is acceptable to me, returning success when it didn't do anything -- this is worth reflection and criticism:(..
It is better:
+    if ( qi_ctrl->qinval_maddr )
+        ...
+    else
+        rc = -ENOENT;

A question:
I find the page related to qi_ctrl->qinval_maddr is not freed at all. IMO,
In disable_qinval (), we need to do:
     - free the page related to qi_ctrl->qinval_maddr.
     - qi_ctrl->qinval_maddr = 0;

Right?

Quan
Jan Beulich June 16, 2016, 9:04 a.m. UTC | #3
>>> On 16.06.16 at 10:42, <quan.xu@intel.com> wrote:
> On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
>> > --- a/xen/drivers/passthrough/vtd/extern.h
>> > +++ b/xen/drivers/passthrough/vtd/extern.h
>> > @@ -21,6 +21,7 @@
>> >  #define _VTD_EXTERN_H_
>> >
>> >  #include "dmar.h"
>> > +#include "../ats.h"
>> 
>> Why? You don't de-reference struct pci_ats_dev * in this file, so all you'd need
>> is a forward declaration. But then this is not in line with your v11 change
>> description above, so I wonder whether you actually sent a stale patch.
> 
> Sorry, this patch is really strange to me.

Well, it's yours, so you ought to be able to explain it.

>> After all I thought the v10 discussion (see
>> http://lists.xenproject.org/archives/html/xen-devel/2016- 
>> 05/msg02208.html
>> ) had made clear that this passing down,
> 
> Sure, what you said is very clear. But I read these code again, I found a  
> pci_get_pdev_by_domain()
> Can also get *pdev without below loop.. (also hardware domain calls 
> pci_get_pdev_by_domain() to get pdev.)

Which would not eliminate the loop - pci_get_pdev_by_domain()
does have a loop itself.

> To be honest,  now I don't like to add a struct pci_dev * inside struct 
> pci_ats_dev, as I need to change
> ' struct pci_ats_dev *pdev' to ' struct pci_ats_dev * pci_ats_dev ' in some 
> functions as well.

I don't see why variables would need renaming. If you need a
variable of type struct pci_dev * in a function which already has
a variable named pdev, simply name the new variable differently
(e.g. pci_dev).

>> besides reducing the number of
>> arguments of some function, would also be meant to eliminate ...
>> 
>> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> > +                                         struct pci_ats_dev *ats_dev)
>> > +{
>> > +    struct domain *d = NULL;
>> > +    struct pci_dev *pdev;
>> > +
>> > +    if ( test_bit(did, iommu->domid_bitmap) )
>> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> > +
>> > +    /*
>> > +     * In case the domain has been freed or the IOMMU domid bitmap is
>> > +     * not valid, the device no longer belongs to this domain.
>> > +     */
>> > +    if ( d == NULL )
>> > +        return;
>> > +
>> > +    pcidevs_lock();
>> > +
>> > +    for_each_pdev(d, pdev)
>> > +    {
>> > +        if ( (pdev->seg == ats_dev->seg) &&
>> > +             (pdev->bus == ats_dev->bus) &&
>> > +             (pdev->devfn == ats_dev->devfn) )
>> > +        {
>> > +            ASSERT(pdev->domain);
>> > +            list_del(&pdev->domain_list);
>> > +            pdev->domain = NULL;
>> > +            pci_hide_existing_device(pdev);
>> > +            break;
>> > +        }
>> > +    }
>> > +
>> > +    pcidevs_unlock();
>> 
>> ... this loop (and locking). (Of course such a change may better be done in
>> another preparatory patch.)
>> 
> 
> To eliminate the locking?  I am afraid the locking is still a must here even 
> without the loop, also referring  to device_assigned()..

If the entire loop gets eliminated, what would be left is

    pcidevs_lock();
    pcidevs_unlock();

which I don't think does any good.

>> > +static int __must_check dev_invalidate_sync(struct iommu *iommu, u16
>> did,
>> > +                                            struct pci_ats_dev
>> > +*ats_dev) {
>> > +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
>> > +    int rc = 0;
>> > +
>> > +    if ( qi_ctrl->qinval_maddr )
>> > +    {
>> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
>> > +
>> > +        if ( rc == -ETIMEDOUT )
>> > +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
>> > +    }
>> > +
>> > +    return rc;
>> > +}
>> 
>> I've never really understood why invalidate_sync() returns success when it
>> didn't do anything. Now that you copy this same behavior here, I really need
>> to ask you to explain that.
>> 
> 
> It is acceptable to me, returning success when it didn't do anything -- this 
> is worth reflection and criticism:(..
> It is better:
> +    if ( qi_ctrl->qinval_maddr )
> +        ...
> +    else
> +        rc = -ENOENT;

Right. And perhaps a separate patch to make invalidate_sync()
do the same. Question is whether this really ought to be a
conditional, or whether instead this code is unreachable when
qinval is not in use, in which case these conditionals would imo
better be converted to ASSERT()s.

> A question:
> I find the page related to qi_ctrl->qinval_maddr is not freed at all. IMO,
> In disable_qinval (), we need to do:
>      - free the page related to qi_ctrl->qinval_maddr.
>      - qi_ctrl->qinval_maddr = 0;

Well, that's a correct observation, but not a big problem imo: If this
was a per-domain resource, it surely would need fixing. But if freeing
a couple of these pages (one per IOMMU) causes synchronization
headaches (e.g. because there may still be dangling references to
it), then I think freeing them is not a must. But if freeing them is safe
(like you seem to imply), then I'm certainly fine with you fixing this
(not that my opinion would matter much here, as I'm not the
maintainer of this code).

Jan
Quan Xu June 17, 2016, 6:08 a.m. UTC | #4
On June 16, 2016 5:05 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 16.06.16 at 10:42, <quan.xu@intel.com> wrote:
> > On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> >> > +                                         struct pci_ats_dev
> >> > +*ats_dev) {
> >> > +    struct domain *d = NULL;
> >> > +    struct pci_dev *pdev;
> >> > +
> >> > +    if ( test_bit(did, iommu->domid_bitmap) )
> >> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> > +
> >> > +    /*
> >> > +     * In case the domain has been freed or the IOMMU domid bitmap is
> >> > +     * not valid, the device no longer belongs to this domain.
> >> > +     */
> >> > +    if ( d == NULL )
> >> > +        return;
> >> > +
> >> > +    pcidevs_lock();
> >> > +
> >> > +    for_each_pdev(d, pdev)
> >> > +    {
> >> > +        if ( (pdev->seg == ats_dev->seg) &&
> >> > +             (pdev->bus == ats_dev->bus) &&
> >> > +             (pdev->devfn == ats_dev->devfn) )
> >> > +        {
> >> > +            ASSERT(pdev->domain);
> >> > +            list_del(&pdev->domain_list);
> >> > +            pdev->domain = NULL;
> >> > +            pci_hide_existing_device(pdev);
> >> > +            break;
> >> > +        }
> >> > +    }
> >> > +
> >> > +    pcidevs_unlock();
> >>
> >> ... this loop (and locking). (Of course such a change may better be
> >> done in another preparatory patch.)
> >>
> >
> > To eliminate the locking?  I am afraid the locking is still a must
> > here even without the loop, also referring  to device_assigned()..
> 
> If the entire loop gets eliminated, what would be left is
> 
>     pcidevs_lock();
>     pcidevs_unlock();
> 
> which I don't think does any good.
> 

Why? I can't follow it..


> >> > +static int __must_check dev_invalidate_sync(struct iommu *iommu,
> >> > +u16
> >> did,
> >> > +                                            struct pci_ats_dev
> >> > +*ats_dev) {
> >> > +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> >> > +    int rc = 0;
> >> > +
> >> > +    if ( qi_ctrl->qinval_maddr )
> >> > +    {
> >> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
> >> > +
> >> > +        if ( rc == -ETIMEDOUT )
> >> > +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
> >> > +    }
> >> > +
> >> > +    return rc;
> >> > +}
> >>
> >> I've never really understood why invalidate_sync() returns success
> >> when it didn't do anything. Now that you copy this same behavior
> >> here, I really need to ask you to explain that.
> >>
> >
> > It is acceptable to me, returning success when it didn't do anything
> > -- this is worth reflection and criticism:(..
> > It is better:
> > +    if ( qi_ctrl->qinval_maddr )
> > +        ...
> > +    else
> > +        rc = -ENOENT;
> 
> Right. And perhaps a separate patch to make invalidate_sync() do the same.

Agreed.

> Question is whether this really ought to be a conditional, or whether instead
> this code is unreachable when qinval is not in use, in which case these
> conditionals would imo better be converted to ASSERT()s.
> 

IMO, this should be a conditional.
As mentioned below, strictly speaking, this is a bug. We can't  ASSERT() based on a bug..
A coming patch may fix it..


> > A question:
> > I find the page related to qi_ctrl->qinval_maddr is not freed at all.
> > IMO, In disable_qinval (), we need to do:
> >      - free the page related to qi_ctrl->qinval_maddr.
> >      - qi_ctrl->qinval_maddr = 0;
> 
> Well, that's a correct observation, but not a big problem imo: If this was a per-
> domain resource, it surely would need fixing. But if freeing a couple of these
> pages (one per IOMMU) causes synchronization headaches (e.g. because
> there may still be dangling references to it), then I think freeing them is not a
> must. But if freeing them is safe (like you seem to imply), then I'm certainly
> fine with you fixing this (not that my opinion would matter much here, as I'm
> not the maintainer of this code).
> 

Agreed, thanks for your explanation. At least I will leave it as is in this patch set.

Quan
Jan Beulich June 17, 2016, 7 a.m. UTC | #5
>>> On 17.06.16 at 08:08, <quan.xu@intel.com> wrote:

> On June 16, 2016 5:05 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >>> On 16.06.16 at 10:42, <quan.xu@intel.com> wrote:
>> > On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
>> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> >> > +                                         struct pci_ats_dev
>> >> > +*ats_dev) {
>> >> > +    struct domain *d = NULL;
>> >> > +    struct pci_dev *pdev;
>> >> > +
>> >> > +    if ( test_bit(did, iommu->domid_bitmap) )
>> >> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >> > +
>> >> > +    /*
>> >> > +     * In case the domain has been freed or the IOMMU domid bitmap is
>> >> > +     * not valid, the device no longer belongs to this domain.
>> >> > +     */
>> >> > +    if ( d == NULL )
>> >> > +        return;
>> >> > +
>> >> > +    pcidevs_lock();
>> >> > +
>> >> > +    for_each_pdev(d, pdev)
>> >> > +    {
>> >> > +        if ( (pdev->seg == ats_dev->seg) &&
>> >> > +             (pdev->bus == ats_dev->bus) &&
>> >> > +             (pdev->devfn == ats_dev->devfn) )
>> >> > +        {
>> >> > +            ASSERT(pdev->domain);
>> >> > +            list_del(&pdev->domain_list);
>> >> > +            pdev->domain = NULL;
>> >> > +            pci_hide_existing_device(pdev);
>> >> > +            break;
>> >> > +        }
>> >> > +    }
>> >> > +
>> >> > +    pcidevs_unlock();
>> >>
>> >> ... this loop (and locking). (Of course such a change may better be
>> >> done in another preparatory patch.)
>> >>
>> >
>> > To eliminate the locking?  I am afraid the locking is still a must
>> > here even without the loop, also referring  to device_assigned()..
>> 
>> If the entire loop gets eliminated, what would be left is
>> 
>>     pcidevs_lock();
>>     pcidevs_unlock();
>> 
>> which I don't think does any good.
> 
> Why? I can't follow it..

I don't understand your question. Can you explain what use above
code sequence is, in your opinion? Or else - what does the "why"
refer to?

>> >> > +static int __must_check dev_invalidate_sync(struct iommu *iommu,
>> >> > +u16
>> >> did,
>> >> > +                                            struct pci_ats_dev
>> >> > +*ats_dev) {
>> >> > +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
>> >> > +    int rc = 0;
>> >> > +
>> >> > +    if ( qi_ctrl->qinval_maddr )
>> >> > +    {
>> >> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
>> >> > +
>> >> > +        if ( rc == -ETIMEDOUT )
>> >> > +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
>> >> > +    }
>> >> > +
>> >> > +    return rc;
>> >> > +}
>> >>
>> >> I've never really understood why invalidate_sync() returns success
>> >> when it didn't do anything. Now that you copy this same behavior
>> >> here, I really need to ask you to explain that.
>> >>
>> >
>> > It is acceptable to me, returning success when it didn't do anything
>> > -- this is worth reflection and criticism:(..
>> > It is better:
>> > +    if ( qi_ctrl->qinval_maddr )
>> > +        ...
>> > +    else
>> > +        rc = -ENOENT;
>> 
>> Right. And perhaps a separate patch to make invalidate_sync() do the same.
> 
> Agreed.
> 
>> Question is whether this really ought to be a conditional, or whether 
> instead
>> this code is unreachable when qinval is not in use, in which case these
>> conditionals would imo better be converted to ASSERT()s.
>> 
> 
> IMO, this should be a conditional.
> As mentioned below, strictly speaking, this is a bug. We can't  ASSERT() 
> based on a bug..
> A coming patch may fix it..

And again I don't understand: ASSERT()s are to verify assumed
state. If static code analysis resulted in understanding a function
is unreachable when qi_ctrl->qinval_maddr is zero (because
qinval ought to have got disabled if any of the table setup failed),
then adding ASSERT() would (a) document that and (b) allow to
know quickly if something broke that assumption.

But then again I may simply misunderstand your wording: "We
can't ASSERT() based on a bug" is really pretty unclear to me.

Jan
Quan Xu June 17, 2016, 8:15 a.m. UTC | #6
On June 17, 2016 3:01 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 17.06.16 at 08:08, <quan.xu@intel.com> wrote:
> 
> > On June 16, 2016 5:05 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> >>> On 16.06.16 at 10:42, <quan.xu@intel.com> wrote:
> >> > On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> >> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
> >> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16
> did,
> >> >> > +                                         struct pci_ats_dev
> >> >> > +*ats_dev) {
> >> >> > +    struct domain *d = NULL;
> >> >> > +    struct pci_dev *pdev;
> >> >> > +
> >> >> > +    if ( test_bit(did, iommu->domid_bitmap) )
> >> >> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> >> > +
> >> >> > +    /*
> >> >> > +     * In case the domain has been freed or the IOMMU domid bitmap
> is
> >> >> > +     * not valid, the device no longer belongs to this domain.
> >> >> > +     */
> >> >> > +    if ( d == NULL )
> >> >> > +        return;
> >> >> > +
> >> >> > +    pcidevs_lock();
> >> >> > +
> >> >> > +    for_each_pdev(d, pdev)
> >> >> > +    {
> >> >> > +        if ( (pdev->seg == ats_dev->seg) &&
> >> >> > +             (pdev->bus == ats_dev->bus) &&
> >> >> > +             (pdev->devfn == ats_dev->devfn) )
> >> >> > +        {
> >> >> > +            ASSERT(pdev->domain);
> >> >> > +            list_del(&pdev->domain_list);
> >> >> > +            pdev->domain = NULL;
> >> >> > +            pci_hide_existing_device(pdev);
> >> >> > +            break;
> >> >> > +        }
> >> >> > +    }
> >> >> > +
> >> >> > +    pcidevs_unlock();
> >> >>
> >> >> ... this loop (and locking). (Of course such a change may better
> >> >> be done in another preparatory patch.)
> >> >>
> >> >
> >> > To eliminate the locking?  I am afraid the locking is still a must
> >> > here even without the loop, also referring  to device_assigned()..
> >>
> >> If the entire loop gets eliminated, what would be left is
> >>
> >>     pcidevs_lock();
> >>     pcidevs_unlock();
> >>
> >> which I don't think does any good.
> >
> > Why? I can't follow it..
> 
> I don't understand your question. Can you explain what use above code
> sequence is, in your opinion? Or else - what does the "why"
> refer to?
> 

Ah, there may be a gap between us. without this loop,  these pdev operation should be still there, such as,


+    ASSERT(pdev->domain);
+    list_del(&pdev->domain_list);
+    pdev->domain = NULL;
+    pci_hide_existing_device(pdev);

So, the left is not only:
   pcidevs_lock();
   pcidevs_unlock();


> >> >> > +static int __must_check dev_invalidate_sync(struct iommu
> >> >> > +*iommu,
> >> >> > +u16
> >> >> did,
> >> >> > +                                            struct pci_ats_dev
> >> >> > +*ats_dev) {
> >> >> > +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> >> >> > +    int rc = 0;
> >> >> > +
> >> >> > +    if ( qi_ctrl->qinval_maddr )
> >> >> > +    {
> >> >> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
> >> >> > +
> >> >> > +        if ( rc == -ETIMEDOUT )
> >> >> > +            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
> >> >> > +    }
> >> >> > +
> >> >> > +    return rc;
> >> >> > +}
> >> >>
> >> >> I've never really understood why invalidate_sync() returns success
> >> >> when it didn't do anything. Now that you copy this same behavior
> >> >> here, I really need to ask you to explain that.
> >> >>
> >> >
> >> > It is acceptable to me, returning success when it didn't do
> >> > anything
> >> > -- this is worth reflection and criticism:(..
> >> > It is better:
> >> > +    if ( qi_ctrl->qinval_maddr )
> >> > +        ...
> >> > +    else
> >> > +        rc = -ENOENT;
> >>
> >> Right. And perhaps a separate patch to make invalidate_sync() do the
> same.
> >
> > Agreed.
> >
> >> Question is whether this really ought to be a conditional, or whether
> > instead
> >> this code is unreachable when qinval is not in use, in which case
> >> these conditionals would imo better be converted to ASSERT()s.
> >>
> >
> > IMO, this should be a conditional.
> > As mentioned below, strictly speaking, this is a bug. We can't
> > ASSERT() based on a bug..
> > A coming patch may fix it..
> 
> And again I don't understand: ASSERT()s are to verify assumed state. If static
> code analysis resulted in understanding a function is unreachable when
> qi_ctrl->qinval_maddr is zero (because qinval ought to have got disabled if any
> of the table setup failed), then adding ASSERT() would (a) document that and
> (b) allow to know quickly if something broke that assumption.
> 

You are right.  A separate patch does this.

> But then again I may simply misunderstand your wording: "We can't ASSERT()
> based on a bug" is really pretty unclear to me.
> 

I supposed the variable in ASSERT() is always true, but disable_qinval() needs to make
qi_ctrl->qinval_maddr  zero, but today it doesn't do this -- a bug.
With your explanation,  I got it now. Thanks.

Quan
Jan Beulich June 17, 2016, 8:40 a.m. UTC | #7
>>> On 17.06.16 at 10:15, <quan.xu@intel.com> wrote:
> On June 17, 2016 3:01 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >>> On 17.06.16 at 08:08, <quan.xu@intel.com> wrote:
>> 
>> > On June 16, 2016 5:05 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> >>> On 16.06.16 at 10:42, <quan.xu@intel.com> wrote:
>> >> > On June 02, 2016 7:07 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> >> >>> On 01.06.16 at 11:05, <quan.xu@intel.com> wrote:
>> >> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16
>> did,
>> >> >> > +                                         struct pci_ats_dev
>> >> >> > +*ats_dev) {
>> >> >> > +    struct domain *d = NULL;
>> >> >> > +    struct pci_dev *pdev;
>> >> >> > +
>> >> >> > +    if ( test_bit(did, iommu->domid_bitmap) )
>> >> >> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >> >> > +
>> >> >> > +    /*
>> >> >> > +     * In case the domain has been freed or the IOMMU domid bitmap
>> is
>> >> >> > +     * not valid, the device no longer belongs to this domain.
>> >> >> > +     */
>> >> >> > +    if ( d == NULL )
>> >> >> > +        return;
>> >> >> > +
>> >> >> > +    pcidevs_lock();
>> >> >> > +
>> >> >> > +    for_each_pdev(d, pdev)
>> >> >> > +    {
>> >> >> > +        if ( (pdev->seg == ats_dev->seg) &&
>> >> >> > +             (pdev->bus == ats_dev->bus) &&
>> >> >> > +             (pdev->devfn == ats_dev->devfn) )
>> >> >> > +        {
>> >> >> > +            ASSERT(pdev->domain);
>> >> >> > +            list_del(&pdev->domain_list);
>> >> >> > +            pdev->domain = NULL;
>> >> >> > +            pci_hide_existing_device(pdev);
>> >> >> > +            break;
>> >> >> > +        }
>> >> >> > +    }
>> >> >> > +
>> >> >> > +    pcidevs_unlock();
>> >> >>
>> >> >> ... this loop (and locking). (Of course such a change may better
>> >> >> be done in another preparatory patch.)
>> >> >>
>> >> >
>> >> > To eliminate the locking?  I am afraid the locking is still a must
>> >> > here even without the loop, also referring  to device_assigned()..
>> >>
>> >> If the entire loop gets eliminated, what would be left is
>> >>
>> >>     pcidevs_lock();
>> >>     pcidevs_unlock();
>> >>
>> >> which I don't think does any good.
>> >
>> > Why? I can't follow it..
>> 
>> I don't understand your question. Can you explain what use above code
>> sequence is, in your opinion? Or else - what does the "why"
>> refer to?
>> 
> 
> Ah, there may be a gap between us. without this loop,  these pdev operation 
> should be still there, such as,
> 
> 
> +    ASSERT(pdev->domain);
> +    list_del(&pdev->domain_list);
> +    pdev->domain = NULL;
> +    pci_hide_existing_device(pdev);
> 
> So, the left is not only:
>    pcidevs_lock();
>    pcidevs_unlock();

Oh, indeed. My bad.

Jan
Quan Xu June 22, 2016, 3:54 p.m. UTC | #8
On June 17, 2016 3:01 PM, Jan Beulich <JBeulich@suse.com> wrote:
> And again I don't understand: ASSERT()s are to verify assumed state. If static
> code analysis resulted in understanding a function is unreachable when
> qi_ctrl->qinval_maddr is zero (because qinval ought to have got disabled if any
> of the table setup failed), then adding ASSERT() would (a) document that and
> (b) allow to know quickly if something broke that assumption.

other than enable_qinval() -- yes, I need to convert conditionals of qi_ctrl->qinval_maddr into
 ASSERT()s..
But in enable_qinval(), I am still not quite sure whether I need to convert these conditionals of
 qi_ctrl->qinval_maddr into ASSERT()s or not.

Quan
Jan Beulich June 22, 2016, 4:18 p.m. UTC | #9
>>> On 22.06.16 at 17:54, <quan.xu@intel.com> wrote:
> On June 17, 2016 3:01 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> And again I don't understand: ASSERT()s are to verify assumed state. If 
> static
>> code analysis resulted in understanding a function is unreachable when
>> qi_ctrl->qinval_maddr is zero (because qinval ought to have got disabled if 
> any
>> of the table setup failed), then adding ASSERT() would (a) document that and
>> (b) allow to know quickly if something broke that assumption.
> 
> other than enable_qinval() -- yes, I need to convert conditionals of 
> qi_ctrl->qinval_maddr into
>  ASSERT()s..
> But in enable_qinval(), I am still not quite sure whether I need to convert 
> these conditionals of
>  qi_ctrl->qinval_maddr into ASSERT()s or not.

No, I don't think you want to so there - you'd bring the system down
in case of an actual initialization error. ASSERT()s should only be
used on conditions controlled entirely by the hypervisor.

Jan
Quan Xu June 23, 2016, 2:08 a.m. UTC | #10
On June 23, 2016 12:18 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 22.06.16 at 17:54, <quan.xu@intel.com> wrote:
> > On June 17, 2016 3:01 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> And again I don't understand: ASSERT()s are to verify assumed state.
> >> If
> > static
> >> code analysis resulted in understanding a function is unreachable
> >> when qi_ctrl->qinval_maddr is zero (because qinval ought to have got
> >> disabled if
> > any
> >> of the table setup failed), then adding ASSERT() would (a) document
> >> that and
> >> (b) allow to know quickly if something broke that assumption.
> >
> > other than enable_qinval() -- yes, I need to convert conditionals of
> > qi_ctrl->qinval_maddr into  ASSERT()s..
> > But in enable_qinval(), I am still not quite sure whether I need to
> > convert these conditionals of  qi_ctrl->qinval_maddr into ASSERT()s or
> > not.
> 
> No, I don't think you want to so there - you'd bring the system down in case
> of an actual initialization error. ASSERT()s should only be used on conditions
> controlled entirely by the hypervisor.
> 

Jan, thank you. Now I am clear.

Quan
diff mbox

Patch

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 98936f55c..843dc88 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -419,7 +419,7 @@  static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
     xfree(pdev);
 }
 
-static void _pci_hide_device(struct pci_dev *pdev)
+void pci_hide_existing_device(struct pci_dev *pdev)
 {
     if ( pdev->domain )
         return;
@@ -436,7 +436,7 @@  int __init pci_hide_device(int bus, int devfn)
     pdev = alloc_pdev(get_pseg(0), bus, devfn);
     if ( pdev )
     {
-        _pci_hide_device(pdev);
+        pci_hide_existing_device(pdev);
         rc = 0;
     }
     pcidevs_unlock();
@@ -466,7 +466,7 @@  int __init pci_ro_device(int seg, int bus, int devfn)
     }
 
     __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
-    _pci_hide_device(pdev);
+    pci_hide_existing_device(pdev);
 
     return 0;
 }
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 45357f2..94ca97a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -21,6 +21,7 @@ 
 #define _VTD_EXTERN_H_
 
 #include "dmar.h"
+#include "../ats.h"
 #include <xen/keyhandler.h>
 
 #define VTDPREFIX "[VT-D]"
@@ -60,8 +61,8 @@  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
                          u64 addr, unsigned int size_order, u64 type);
 
 int __must_check qinval_device_iotlb_sync(struct iommu *iommu,
-                                          u32 max_invs_pend,
-                                          u16 sid, u16 size, u64 addr);
+                                          struct pci_ats_dev *ats_dev,
+                                          u16 did, u16 size, u64 addr);
 
 unsigned int get_cache_line_size(void);
 void cacheline_flush(char *);
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 9116588..bea3e86 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -30,8 +30,7 @@ 
 
 #define IOMMU_QI_TIMEOUT MILLISECS(1)
 
-static int __must_check invalidate_sync(struct iommu *iommu,
-                                        bool_t flush_dev_iotlb);
+static int __must_check invalidate_sync(struct iommu *iommu);
 
 static void print_qi_regs(struct iommu *iommu)
 {
@@ -103,7 +102,7 @@  static int __must_check queue_invalidate_context_sync(struct iommu *iommu,
 
     unmap_vtd_domain_page(qinval_entries);
 
-    return invalidate_sync(iommu, 0);
+    return invalidate_sync(iommu);
 }
 
 static int __must_check queue_invalidate_iotlb_sync(struct iommu *iommu,
@@ -140,7 +139,7 @@  static int __must_check queue_invalidate_iotlb_sync(struct iommu *iommu,
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
-    return invalidate_sync(iommu, 0);
+    return invalidate_sync(iommu);
 }
 
 static int __must_check queue_invalidate_wait(struct iommu *iommu,
@@ -200,20 +199,81 @@  static int __must_check queue_invalidate_wait(struct iommu *iommu,
     return -EOPNOTSUPP;
 }
 
-static int __must_check invalidate_sync(struct iommu *iommu,
-                                        bool_t flush_dev_iotlb)
+static int __must_check invalidate_sync(struct iommu *iommu)
 {
     struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
 
     if ( qi_ctrl->qinval_maddr )
-        return queue_invalidate_wait(iommu, 0, 1, 1, flush_dev_iotlb);
+        return queue_invalidate_wait(iommu, 0, 1, 1, 0);
 
     return 0;
 }
 
+static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
+                                         struct pci_ats_dev *ats_dev)
+{
+    struct domain *d = NULL;
+    struct pci_dev *pdev;
+
+    if ( test_bit(did, iommu->domid_bitmap) )
+        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
+
+    /*
+     * In case the domain has been freed or the IOMMU domid bitmap is
+     * not valid, the device no longer belongs to this domain.
+     */
+    if ( d == NULL )
+        return;
+
+    pcidevs_lock();
+
+    for_each_pdev(d, pdev)
+    {
+        if ( (pdev->seg == ats_dev->seg) &&
+             (pdev->bus == ats_dev->bus) &&
+             (pdev->devfn == ats_dev->devfn) )
+        {
+            ASSERT(pdev->domain);
+            list_del(&pdev->domain_list);
+            pdev->domain = NULL;
+            pci_hide_existing_device(pdev);
+            break;
+        }
+    }
+
+    pcidevs_unlock();
+
+    if ( !is_hardware_domain(d) )
+        domain_crash(d);
+    else
+        printk(XENLOG_WARNING VTDPREFIX
+               " dom%d: ATS device %04x:%02x:%02x.%u flush failed\n",
+               d->domain_id, ats_dev->seg, ats_dev->bus,
+               PCI_SLOT(ats_dev->devfn), PCI_FUNC(ats_dev->devfn));
+
+    rcu_unlock_domain(d);
+}
+
+static int __must_check dev_invalidate_sync(struct iommu *iommu, u16 did,
+                                            struct pci_ats_dev *ats_dev)
+{
+    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc = 0;
+
+    if ( qi_ctrl->qinval_maddr )
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
+
+        if ( rc == -ETIMEDOUT )
+            dev_invalidate_iotlb_timeout(iommu, did, ats_dev);
+    }
+
+    return rc;
+}
+
 int qinval_device_iotlb_sync(struct iommu *iommu,
-                             u32 max_invs_pend,
-                             u16 sid, u16 size, u64 addr)
+                             struct pci_ats_dev *ats_dev,
+                             u16 did, u16 size, u64 addr)
 {
     unsigned long flags;
     unsigned int index;
@@ -229,9 +289,9 @@  int qinval_device_iotlb_sync(struct iommu *iommu,
 
     qinval_entry->q.dev_iotlb_inv_dsc.lo.type = TYPE_INVAL_DEVICE_IOTLB;
     qinval_entry->q.dev_iotlb_inv_dsc.lo.res_1 = 0;
-    qinval_entry->q.dev_iotlb_inv_dsc.lo.max_invs_pend = max_invs_pend;
+    qinval_entry->q.dev_iotlb_inv_dsc.lo.max_invs_pend = ats_dev->ats_queue_depth;
     qinval_entry->q.dev_iotlb_inv_dsc.lo.res_2 = 0;
-    qinval_entry->q.dev_iotlb_inv_dsc.lo.sid = sid;
+    qinval_entry->q.dev_iotlb_inv_dsc.lo.sid = PCI_BDF2(ats_dev->bus, ats_dev->devfn);
     qinval_entry->q.dev_iotlb_inv_dsc.lo.res_3 = 0;
 
     qinval_entry->q.dev_iotlb_inv_dsc.hi.size = size;
@@ -242,7 +302,7 @@  int qinval_device_iotlb_sync(struct iommu *iommu,
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
-    return invalidate_sync(iommu, 1);
+    return dev_invalidate_sync(iommu, did, ats_dev);
 }
 
 static int __must_check queue_invalidate_iec_sync(struct iommu *iommu,
@@ -273,7 +333,7 @@  static int __must_check queue_invalidate_iec_sync(struct iommu *iommu,
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
-    ret = invalidate_sync(iommu, 0);
+    ret = invalidate_sync(iommu);
 
     /*
      * reading vt-d architecture register will ensure
diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
index dfa4d30..ecae94c 100644
--- a/xen/drivers/passthrough/vtd/x86/ats.c
+++ b/xen/drivers/passthrough/vtd/x86/ats.c
@@ -116,7 +116,6 @@  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
 
     list_for_each_entry( pdev, &ats_devices, list )
     {
-        u16 sid = PCI_BDF2(pdev->bus, pdev->devfn);
         bool_t sbit;
         int rc = 0;
 
@@ -134,8 +133,7 @@  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
             /* invalidate all translations: sbit=1,bit_63=0,bit[62:12]=1 */
             sbit = 1;
             addr = (~0UL << PAGE_SHIFT_4K) & 0x7FFFFFFFFFFFFFFF;
-            rc = qinval_device_iotlb_sync(iommu, pdev->ats_queue_depth,
-                                          sid, sbit, addr);
+            rc = qinval_device_iotlb_sync(iommu, pdev, did, sbit, addr);
             break;
         case DMA_TLB_PSI_FLUSH:
             if ( !device_in_domain(iommu, pdev, did) )
@@ -154,8 +152,7 @@  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
                 addr |= (((u64)1 << (size_order - 1)) - 1) << PAGE_SHIFT_4K;
             }
 
-            rc = qinval_device_iotlb_sync(iommu, pdev->ats_queue_depth,
-                                          sid, sbit, addr);
+            rc = qinval_device_iotlb_sync(iommu, pdev, did, sbit, addr);
             break;
         default:
             dprintk(XENLOG_WARNING VTDPREFIX, "invalid vt-d flush type\n");
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 6ed29dd..e4940cd 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -118,6 +118,7 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
 int pci_ro_device(int seg, int bus, int devfn);
 int pci_hide_device(int bus, int devfn);
+void pci_hide_existing_device(struct pci_dev *pdev);
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
 struct pci_dev *pci_get_real_pdev(int seg, int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(