diff mbox series

[9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed

Message ID 1a7b974c-8dee-3422-28fb-4118fe145b4e@suse.com (mailing list archive)
State New
Headers show
Series IOMMU: XSA-373 follow-on | expand

Commit Message

Jan Beulich June 9, 2021, 9:30 a.m. UTC
Failure here could in principle mean the device may still be issuing DMA
requests, which would continue to be translated by the page tables the
device entry currently points at. With this we cannot allow the
subsequent cleanup step of freeing the page tables to occur, to prevent
use-after-free issues. We would need to accept, for the time being, that
in such a case the remaining domain resources will all be leaked, and
the domain will continue to exist as a zombie.

However, with flushes no longer timing out (and with proper timeout
detection for device I/O TLB flushing yet to be implemented), there's no
way anymore for failures to occur, except due to bugs elsewhere. Hence
the change here is merely a "just in case" one.

In order to continue the loop in spite of an error, we can't use
pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
the first place, instead of the cheaper list iteration.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
A first step beyond this could be to have the backing functions of
deassign_device() allow the caller to tell whether the failure was from
removing the device from the domain being cleaned up, or from re-setup
in wherever the device was supposed to get moved to. In the latter case
we could allow domain cleanup to continue. I wonder whether we could
simply make those functions return "success" anyway, overriding their
returning of an error when ->is_dying is set.

A next step then might be to figure whether there's any "emergency"
adjustment that could be done instead of the full-fledged (and failed)
de-assign, to allow at least recovering all the memory from the guest.

Comments

Paul Durrant June 24, 2021, 3:34 p.m. UTC | #1
On 09/06/2021 10:30, Jan Beulich wrote:
> Failure here could in principle mean the device may still be issuing DMA
> requests, which would continue to be translated by the page tables the
> device entry currently points at. With this we cannot allow the
> subsequent cleanup step of freeing the page tables to occur, to prevent
> use-after-free issues. We would need to accept, for the time being, that
> in such a case the remaining domain resources will all be leaked, and
> the domain will continue to exist as a zombie.
> 
> However, with flushes no longer timing out (and with proper timeout
> detection for device I/O TLB flushing yet to be implemented), there's no
> way anymore for failures to occur, except due to bugs elsewhere. Hence
> the change here is merely a "just in case" one.
> 
> In order to continue the loop in spite of an error, we can't use
> pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
> the first place, instead of the cheaper list iteration.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul@xen.org>

> ---
> A first step beyond this could be to have the backing functions of
> deassign_device() allow the caller to tell whether the failure was from
> removing the device from the domain being cleaned up, or from re-setup
> in wherever the device was supposed to get moved to. In the latter case
> we could allow domain cleanup to continue. I wonder whether we could
> simply make those functions return "success" anyway, overriding their
> returning of an error when ->is_dying is set.
> 
> A next step then might be to figure whether there's any "emergency"
> adjustment that could be done instead of the full-fledged (and failed)
> de-assign, to allow at least recovering all the memory from the guest.
> 
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -894,7 +894,7 @@ static int deassign_device(struct domain
>   
>   int pci_release_devices(struct domain *d)
>   {
> -    struct pci_dev *pdev;
> +    struct pci_dev *pdev, *tmp;
>       u8 bus, devfn;
>       int ret;
>   
> @@ -905,15 +905,15 @@ int pci_release_devices(struct domain *d
>           pcidevs_unlock();
>           return ret;
>       }
> -    while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
> +    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>       {
>           bus = pdev->bus;
>           devfn = pdev->devfn;
> -        deassign_device(d, pdev->seg, bus, devfn);
> +        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>       }
>       pcidevs_unlock();
>   
> -    return 0;
> +    return ret;
>   }
>   
>   #define PCI_CLASS_BRIDGE_HOST    0x0600
>
diff mbox series

Patch

--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -894,7 +894,7 @@  static int deassign_device(struct domain
 
 int pci_release_devices(struct domain *d)
 {
-    struct pci_dev *pdev;
+    struct pci_dev *pdev, *tmp;
     u8 bus, devfn;
     int ret;
 
@@ -905,15 +905,15 @@  int pci_release_devices(struct domain *d
         pcidevs_unlock();
         return ret;
     }
-    while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
+    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
     {
         bus = pdev->bus;
         devfn = pdev->devfn;
-        deassign_device(d, pdev->seg, bus, devfn);
+        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
     }
     pcidevs_unlock();
 
-    return 0;
+    return ret;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600