diff mbox series

[RFC] pci/ats: do not allow broken devices to be assigned

Message ID 20220224124344.86192-1-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show
Series [RFC] pci/ats: do not allow broken devices to be assigned | expand

Commit Message

Roger Pau Monné Feb. 24, 2022, 12:43 p.m. UTC
Introduce a new field to mark devices as broken: having it set
prevents the device from being assigned to domains. Use the field in
order to mark ATS devices that have failed a flush as broken, thus
preventing them to be assigned to any guest.

This allows the device IOMMU context entry to be cleaned up properly,
as calling _pci_hide_device will just change the ownership of the
device, but the IOMMU context entry of the device would be left as-is.
It would also leak a Domain ID, as removing the device from it's
previous owner will allow releasing the DID used by the device without
having cleaned up the context entry.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
RFC: I haven't tested the code path, as I have no ATS devices on the
box I'm currently testing on. In any case, ATS is not supported, and
removing the call to _pci_hide_device in iommu_dev_iotlb_flush_timeout
should allow to remove the dependency on recursive pcidevs lock.

TBD: it's unclear whether we still need the pcidevs_lock in
iommu_dev_iotlb_flush_timeout. The caller of
iommu_dev_iotlb_flush_timeout is already bogus as it iterates over a
list of pdevs without holding the pcidevs_lock.

TBD: getting rid of ATS altogether could also be an option, but it's
more work.
---
Cc: Oleksandr Andrushchenko <andr2000@gmail.com>
---
 xen/drivers/passthrough/pci.c        | 11 +++++++----
 xen/drivers/passthrough/vtd/qinval.c |  8 +++++++-
 xen/include/xen/pci.h                |  3 +++
 3 files changed, 17 insertions(+), 5 deletions(-)

Comments

Jan Beulich Feb. 24, 2022, 12:58 p.m. UTC | #1
On 24.02.2022 13:43, Roger Pau Monne wrote:
> Introduce a new field to mark devices as broken: having it set
> prevents the device from being assigned to domains. Use the field in
> order to mark ATS devices that have failed a flush as broken, thus
> preventing them to be assigned to any guest.
> 
> This allows the device IOMMU context entry to be cleaned up properly,
> as calling _pci_hide_device will just change the ownership of the
> device, but the IOMMU context entry of the device would be left as-is.
> It would also leak a Domain ID, as removing the device from it's
> previous owner will allow releasing the DID used by the device without
> having cleaned up the context entry.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> RFC: I haven't tested the code path, as I have no ATS devices on the
> box I'm currently testing on. In any case, ATS is not supported, and
> removing the call to _pci_hide_device in iommu_dev_iotlb_flush_timeout
> should allow to remove the dependency on recursive pcidevs lock.

No objection in principle. Whether this is the only dependency on
recursive pcidevs lock isn't really know though, is it?

> TBD: it's unclear whether we still need the pcidevs_lock in
> iommu_dev_iotlb_flush_timeout. The caller of
> iommu_dev_iotlb_flush_timeout is already bogus as it iterates over a
> list of pdevs without holding the pcidevs_lock.

Analysis of whether / where recursive uses are needed should imo
include cases where the lock ought to be held, but currently isn't
(like apparently this case).

> @@ -1487,6 +1487,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>                      pdev->domain == dom_io));
>  
> +    /* Do not allow broken devices to be assigned. */
> +    rc = -EBADF;
> +    if ( pdev->broken )
> +        goto done;

I think this wants exceptions for Dom0 and DomIO. An admin may be
able to fix things in Dom0, e.g. by updating device firmware.

Jan
Roger Pau Monné Feb. 24, 2022, 2:53 p.m. UTC | #2
On Thu, Feb 24, 2022 at 01:58:31PM +0100, Jan Beulich wrote:
> On 24.02.2022 13:43, Roger Pau Monne wrote:
> > Introduce a new field to mark devices as broken: having it set
> > prevents the device from being assigned to domains. Use the field in
> > order to mark ATS devices that have failed a flush as broken, thus
> > preventing them to be assigned to any guest.
> > 
> > This allows the device IOMMU context entry to be cleaned up properly,
> > as calling _pci_hide_device will just change the ownership of the
> > device, but the IOMMU context entry of the device would be left as-is.
> > It would also leak a Domain ID, as removing the device from it's
> > previous owner will allow releasing the DID used by the device without
> > having cleaned up the context entry.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > RFC: I haven't tested the code path, as I have no ATS devices on the
> > box I'm currently testing on. In any case, ATS is not supported, and
> > removing the call to _pci_hide_device in iommu_dev_iotlb_flush_timeout
> > should allow to remove the dependency on recursive pcidevs lock.
> 
> No objection in principle. Whether this is the only dependency on
> recursive pcidevs lock isn't really know though, is it?

Indeed. I didn't word this clearly. The recursive lock was introduced
for this specific use case. Whether we have gained more recursive
paths in the meantime I haven't assessed.

> > TBD: it's unclear whether we still need the pcidevs_lock in
> > iommu_dev_iotlb_flush_timeout. The caller of
> > iommu_dev_iotlb_flush_timeout is already bogus as it iterates over a
> > list of pdevs without holding the pcidevs_lock.
> 
> Analysis of whether / where recursive uses are needed should imo
> include cases where the lock ought to be held, but currently isn't
> (like apparently this case).

Well, I'm not opposed to that. I think just aiming to get the current
usages analyzed will already be hard.

> > @@ -1487,6 +1487,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> >      ASSERT(pdev && (pdev->domain == hardware_domain ||
> >                      pdev->domain == dom_io));
> >  
> > +    /* Do not allow broken devices to be assigned. */
> > +    rc = -EBADF;
> > +    if ( pdev->broken )
> > +        goto done;
> 
> I think this wants exceptions for Dom0 and DomIO. An admin may be
> able to fix things in Dom0, e.g. by updating device firmware.

Doesn't a device get assigned to DomIO (or Dom0 if not using quarantine
mode), and then when deassigned from DomIO gets assigned to Dom0?

So there's no usage of assign_device in the path that (re)assigns a
device used by a guest into Dom0?

Or would you rather imply that pdev->broken should get cleared at some
point (ie: when the device is assigned back to Dom0)?

Thanks, Roger.
Jan Beulich Feb. 24, 2022, 3:33 p.m. UTC | #3
On 24.02.2022 15:53, Roger Pau Monné wrote:
> On Thu, Feb 24, 2022 at 01:58:31PM +0100, Jan Beulich wrote:
>> On 24.02.2022 13:43, Roger Pau Monne wrote:
>>> TBD: it's unclear whether we still need the pcidevs_lock in
>>> iommu_dev_iotlb_flush_timeout. The caller of
>>> iommu_dev_iotlb_flush_timeout is already bogus as it iterates over a
>>> list of pdevs without holding the pcidevs_lock.
>>
>> Analysis of whether / where recursive uses are needed should imo
>> include cases where the lock ought to be held, but currently isn't
>> (like apparently this case).
> 
> Well, I'm not opposed to that. I think just aiming to get the current
> usages analyzed will already be hard.

While I fully agree, the decision to drop recursive locking would better
not put roadblocks in the way of adding locking where it is currently
missing.

>>> @@ -1487,6 +1487,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>      ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>                      pdev->domain == dom_io));
>>>  
>>> +    /* Do not allow broken devices to be assigned. */
>>> +    rc = -EBADF;
>>> +    if ( pdev->broken )
>>> +        goto done;
>>
>> I think this wants exceptions for Dom0 and DomIO. An admin may be
>> able to fix things in Dom0, e.g. by updating device firmware.
> 
> Doesn't a device get assigned to DomIO (or Dom0 if not using quarantine
> mode), and then when deassigned from DomIO gets assigned to Dom0?
> 
> So there's no usage of assign_device in the path that (re)assigns a
> device used by a guest into Dom0?

Well, this assumes all tool stacks behave like the xl one presently
does. Which I think is the wrong way round: Making a device assignable
should be "deassign from Dom0" (implicitly making it land in DomIO),
while removing a device from the set of assignable ones should be
"assign to Dom0" (implicitly deassigning from DomIO). That way (I
think I said so elsewhere earlier on) libxl would also avoid the need
to actually use DOMID_IO explicitly. It using DOMID_SELF like done in
various other places would seem more clen to me.

Paul - I think it was you who decided to make it work the way it
currently does. Do you have any insights into your thought process
back then which you could share?

> Or would you rather imply that pdev->broken should get cleared at some
> point (ie: when the device is assigned back to Dom0)?

I did ponder this for a while when writing the earlier reply. But I
decided against, for it being a functional change: _pci_hide_device()
currently is also sticky. But yes, in principle if a misbehaving
device _can_ be fixed (e.g. by updating its firmware), then I think
there needs to be a way to clear the "broken" flag.

Jan
diff mbox series

Patch

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 70b6684981..4b81c1c04a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -501,7 +501,7 @@  static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
     xfree(pdev);
 }
 
-static void _pci_hide_device(struct pci_dev *pdev)
+static void __init _pci_hide_device(struct pci_dev *pdev)
 {
     if ( pdev->domain )
         return;
@@ -1487,6 +1487,11 @@  static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     ASSERT(pdev && (pdev->domain == hardware_domain ||
                     pdev->domain == dom_io));
 
+    /* Do not allow broken devices to be assigned. */
+    rc = -EBADF;
+    if ( pdev->broken )
+        goto done;
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1585,9 +1590,7 @@  void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev)
         return;
     }
 
-    list_del(&pdev->domain_list);
-    pdev->domain = NULL;
-    _pci_hide_device(pdev);
+    pdev->broken = true;
 
     if ( !d->is_shutting_down && printk_ratelimit() )
         printk(XENLOG_ERR "dom%d: ATS device %pp flush failed\n",
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 9f291f47e5..4436c22c05 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -227,7 +227,7 @@  static int __must_check dev_invalidate_sync(struct vtd_iommu *iommu,
 
     ASSERT(iommu->qinval_maddr);
     rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
-    if ( rc == -ETIMEDOUT )
+    if ( rc == -ETIMEDOUT && !pdev->broken )
     {
         struct domain *d = rcu_lock_domain_by_id(did_to_domain_id(iommu, did));
 
@@ -241,6 +241,12 @@  static int __must_check dev_invalidate_sync(struct vtd_iommu *iommu,
         iommu_dev_iotlb_flush_timeout(d, pdev);
         rcu_unlock_domain(d);
     }
+    else if ( rc == -ETIMEDOUT )
+        /*
+         * The device is already marked as broken, ignore the error in order to
+         * allow deassign to succeed.
+         */
+        rc = 0;
 
     return rc;
 }
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index b6d7e454f8..00b44e8487 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -108,6 +108,9 @@  struct pci_dev {
     /* Device with errata, ignore the BARs. */
     bool ignore_bars;
 
+    /* Device misbehaving, prevent assigning it. */
+    bool broken;
+
     enum pdev_type {
         DEV_TYPE_PCI_UNKNOWN,
         DEV_TYPE_PCIe_ENDPOINT,