[for-4.15,v2,4/5] xen/iommu: x86: Don't leak the IOMMU page-tables

Message ID 20210209152816.15792-5-julien@xen.org (mailing list archive)
State Superseded
Series xen/iommu: Collection of bug fixes for IOMMU teardown

Commit Message

Julien Grall Feb. 9, 2021, 3:28 p.m. UTC
From: Julien Grall <jgrall@amazon.com>

The new IOMMU page-table allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient when
the domain is dying, because nothing prevents page-tables from being allocated.

iommu_alloc_pgtable() is now checking if the domain is dying before
adding the page to the list. We are relying on &hd->arch.pgtables.lock
to synchronize d->is_dying.

Take the opportunity to check in arch_iommu_domain_destroy() that all
the page tables have been freed.

Signed-off-by: Julien Grall <jgrall@amazon.com>

---

There is one more bug that will be solved in the next patch, as I felt
each fix needed its own long explanation.

Changes in v2:
    - Rework the approach
    - Move the patch earlier in the series
---
 xen/drivers/passthrough/x86/iommu.c | 36 ++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)
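
To make the race concrete, here is a sketch of the leak the patch addresses
(an illustrative interleaving inferred from the description above, not code
from the patch):

/*
 * CPU A (teardown of dying domain)      CPU B (e.g. in-flight operation
 *                                       still updating IOMMU mappings)
 * ----------------------------------    ----------------------------------
 * d->is_dying set
 * iommu_free_pgtables(d)
 *     drains hd->arch.pgtables.list
 *                                       iommu_alloc_pgtable(d)
 *                                           pg = alloc_domheap_page(...)
 *                                           page_list_add(pg, &...list)
 *
 * The page added by CPU B is never visited again: the list was already
 * drained, so the allocation is leaked when the domain is destroyed.
 */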

Comments

Paul Durrant Feb. 9, 2021, 8:33 p.m. UTC | #1
> -----Original Message-----
> From: Julien Grall <julien@xen.org>
> Sent: 09 February 2021 15:28
> To: xen-devel@lists.xenproject.org
> Cc: hongyxia@amazon.co.uk; iwj@xenproject.org; Julien Grall <jgrall@amazon.com>; Jan Beulich
> <jbeulich@suse.com>; Paul Durrant <paul@xen.org>
> Subject: [for-4.15][PATCH v2 4/5] xen/iommu: x86: Don't leak the IOMMU page-tables
> 
> From: Julien Grall <jgrall@amazon.com>
> 
> The new IOMMU page-table allocator will release the pages when
> relinquishing the domain resources. However, this is not sufficient when
> the domain is dying, because nothing prevents page-tables from being allocated.
> 
> iommu_alloc_pgtable() is now checking if the domain is dying before
> adding the page to the list. We are relying on &hd->arch.pgtables.lock
> to synchronize d->is_dying.
> 
> Take the opportunity to check in arch_iommu_domain_destroy() that all
> the page tables have been freed.
> 
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> 

Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich Feb. 10, 2021, 2:32 p.m. UTC | #2
On 09.02.2021 16:28, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> The new IOMMU page-table allocator will release the pages when
> relinquishing the domain resources. However, this is not sufficient when
> the domain is dying, because nothing prevents page-tables from being allocated.
> 
> iommu_alloc_pgtable() is now checking if the domain is dying before
> adding the page to the list. We are relying on &hd->arch.pgtables.lock
> to synchronize d->is_dying.

As said in reply to an earlier patch, I think suppressing
(really: ignoring) new mappings would be better. You could
utilize the same lock, but you'd need to duplicate the
checking in {amd,intel}_iommu_map_page().

I'm not entirely certain about the case of unmap requests:
It may be possible to also suppress/ignore them, but this
may require some further thought.

Apart from this, just in case we settle on your current
approach, a few spelling nits:

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
>  
>  void arch_iommu_domain_destroy(struct domain *d)
>  {
> +    /*
> +     * There should be not page-tables left allocated by the time the

... should be no ...

> +     * domain is destroyed. Note that arch_iommu_domain_destroy() is
> +     * called unconditionally, so pgtables may be unitialized.

uninitialized

> @@ -303,9 +317,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>      unmap_domain_page(p);
>  
>      spin_lock(&hd->arch.pgtables.lock);
> -    page_list_add(pg, &hd->arch.pgtables.list);
> +    /*
> +     * The IOMMU page-tables are freed when relinquishing the domain, but
> +     * nothing prevent allocation to happen afterwards. There is no valid

prevents

> +     * reasons to continue to update the IOMMU page-tables while the

reason

> +     * domain is dying.
> +     *
> +     * So prevent page-table allocation when the domain is dying.
> +     *
> +     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.

rely

Jan
Jan Beulich Feb. 10, 2021, 2:44 p.m. UTC | #3
On 09.02.2021 16:28, Julien Grall wrote:
> @@ -303,9 +317,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>      unmap_domain_page(p);
>  
>      spin_lock(&hd->arch.pgtables.lock);
> -    page_list_add(pg, &hd->arch.pgtables.list);
> +    /*
> +     * The IOMMU page-tables are freed when relinquishing the domain, but
> +     * nothing prevent allocation to happen afterwards. There is no valid
> +     * reasons to continue to update the IOMMU page-tables while the
> +     * domain is dying.
> +     *
> +     * So prevent page-table allocation when the domain is dying.
> +     *
> +     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
> +     */
> +    if ( likely(!d->is_dying) )
> +    {
> +        alive = true;
> +        page_list_add(pg, &hd->arch.pgtables.list);
> +    }
>      spin_unlock(&hd->arch.pgtables.lock);
>  
> +    if ( unlikely(!alive) )
> +    {
> +        free_domheap_page(pg);
> +        pg = NULL;
> +    }
> +
>      return pg;
>  }

There's a pretty clear downside to this approach compared to that
of ignoring maps (and perhaps also unmaps) for dying domains: The
caller here will (hopefully) recognize and propagate an error.

Additionally (considering the situation patch 5 fixes) ignoring
unmaps may provide quite a bit of a performance gain for domain
destruction - we don't need every individual page unmapped from
the page tables.

Jan
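
To illustrate the propagation point: with the approach taken by the patch,
iommu_alloc_pgtable() returns NULL for a dying domain, and a caller in the
map path has little choice but to turn that into an error for its own
callers. A hypothetical, heavily simplified fragment (not actual Xen code):

/* Sketch only: a page-table-growing helper on the map path. */
static int pt_grow(struct domain *d, struct page_info **out)
{
    struct page_info *pg = iommu_alloc_pgtable(d);

    if ( !pg )
        return -ENOMEM; /* surfaces at iommu_map() and its callers */

    *out = pg;
    return 0;
}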
Julien Grall Feb. 10, 2021, 3:04 p.m. UTC | #4
On 10/02/2021 14:32, Jan Beulich wrote:
> On 09.02.2021 16:28, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> The new IOMMU page-table allocator will release the pages when
>> relinquishing the domain resources. However, this is not sufficient when
>> the domain is dying, because nothing prevents page-tables from being allocated.
>>
>> iommu_alloc_pgtable() is now checking if the domain is dying before
>> adding the page to the list. We are relying on &hd->arch.pgtables.lock
>> to synchronize d->is_dying.
> 
> As said in reply to an earlier patch, I think suppressing
> (really: ignoring) new mappings would be better.

This is exactly what I suggested in v1 but you wrote:

"Ignoring requests there seems fragile to me. Paul - what are your
thoughts about bailing early from hvm_add_ioreq_gfn() when the
domain is dying?"

Are you now saying that the following snippet would be fine:

if ( d->is_dying )
   return 0;

> You could
> utilize the same lock, but you'd need to duplicate the
> checking in {amd,intel}_iommu_map_page().
> 
> I'm not entirely certain about the case of unmap requests:
> It may be possible to also suppress/ignore them, but this
> may require some further thought.

I think the unmap part is quite risky to tie to d->is_dying, because the
PCI devices may not be quiesced and may still be assigned to the domain.

> 
> Apart from this, just in case we settle on your current
> approach, a few spelling nits:
> 
>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -149,6 +149,13 @@ int arch_iommu_domain_init(struct domain *d)
>>   
>>   void arch_iommu_domain_destroy(struct domain *d)
>>   {
>> +    /*
>> +     * There should be not page-tables left allocated by the time the
> 
> ... should be no ...
> 
>> +     * domain is destroyed. Note that arch_iommu_domain_destroy() is
>> +     * called unconditionally, so pgtables may be unitialized.
> 
> uninitialized
> 
>> @@ -303,9 +317,29 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>       unmap_domain_page(p);
>>   
>>       spin_lock(&hd->arch.pgtables.lock);
>> -    page_list_add(pg, &hd->arch.pgtables.list);
>> +    /*
>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>> +     * nothing prevent allocation to happen afterwards. There is no valid
> 
> prevents
> 
>> +     * reasons to continue to update the IOMMU page-tables while the
> 
> reason
> 
>> +     * domain is dying.
>> +     *
>> +     * So prevent page-table allocation when the domain is dying.
>> +     *
>> +     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
> 
> rely
> 
> Jan
>
Jan Beulich Feb. 10, 2021, 4:12 p.m. UTC | #5
On 10.02.2021 16:04, Julien Grall wrote:
> On 10/02/2021 14:32, Jan Beulich wrote:
>> On 09.02.2021 16:28, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> The new IOMMU page-table allocator will release the pages when
>>> relinquishing the domain resources. However, this is not sufficient when
>>> the domain is dying, because nothing prevents page-tables from being allocated.
>>>
>>> iommu_alloc_pgtable() is now checking if the domain is dying before
>>> adding the page to the list. We are relying on &hd->arch.pgtables.lock
>>> to synchronize d->is_dying.
>>
>> As said in reply to an earlier patch, I think suppressing
>> (really: ignoring) new mappings would be better.
> 
> This is exactly what I suggested in v1 but you wrote:
> 
> "Ignoring requests there seems fragile to me. Paul - what are your
> thoughts about bailing early from hvm_add_ioreq_gfn() when the
> domain is dying?"

Was this on the thread of this patch? I didn't find such a
reply of mine. I need more context here because you name
hvm_add_ioreq_gfn() above, while I refer to iommu_map()
(and downwards the call stack).

> Are you now saying that the following snippet would be fine:
> 
> if ( d->is_dying )
>    return 0;

In {amd,intel}_iommu_map_page(), after the lock was acquired
and with it suitably released, yes. And if that's what you
suggested, then I'm sorry - I don't think I can see anything
fragile there anymore.

>> You could
>> utilize the same lock, but you'd need to duplicate the
>> checking in {amd,intel}_iommu_map_page().
>>
>> I'm not entirely certain about the case of unmap requests:
>> It may be possible to also suppress/ignore them, but this
>> may require some further thought.
> 
> I think the unmap part is quite risky to tie to d->is_dying, because the
> PCI devices may not be quiesced and may still be assigned to the domain.

Hmm, yes, good point. Of course upon first unmap with is_dying
observed set we could zap the root page table, but I don't
suppose that's something we want to do for 4.15.

Jan
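
A rough sketch of the alternative discussed here, i.e. ignoring map requests
for a dying domain from within the vendor code. The signature is abridged,
the body is elided, and the use of the per-domain mapping lock (rather than
hd->arch.pgtables.lock) is an assumption:

static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                             mfn_t mfn, unsigned int flags,
                                             unsigned int *flush_flags)
{
    struct domain_iommu *hd = dom_iommu(d);

    spin_lock(&hd->arch.mapping_lock);

    /*
     * The domain is dying: ignore the now-pointless mapping request
     * rather than failing it. Checking under the lock gives a consistent
     * view of d->is_dying versus the teardown path.
     */
    if ( d->is_dying )
    {
        spin_unlock(&hd->arch.mapping_lock);
        return 0;
    }

    /* ... normal mapping work, done under the lock ... */

    spin_unlock(&hd->arch.mapping_lock);
    return 0;
}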
Julien Grall Feb. 17, 2021, 11:49 a.m. UTC | #6
Hi Jan,

On 10/02/2021 16:12, Jan Beulich wrote:
> On 10.02.2021 16:04, Julien Grall wrote:
>> On 10/02/2021 14:32, Jan Beulich wrote:
>>> On 09.02.2021 16:28, Julien Grall wrote:
>>>> From: Julien Grall <jgrall@amazon.com>
>>>>
>>>> The new IOMMU page-table allocator will release the pages when
>>>> relinquishing the domain resources. However, this is not sufficient when
>>>> the domain is dying, because nothing prevents page-tables from being allocated.
>>>>
>>>> iommu_alloc_pgtable() is now checking if the domain is dying before
>>>> adding the page to the list. We are relying on &hd->arch.pgtables.lock
>>>> to synchronize d->is_dying.
>>>
>>> As said in reply to an earlier patch, I think suppressing
>>> (really: ignoring) new mappings would be better.
>>
>> This is exactly what I suggested in v1 but you wrote:
>>
>> "Ignoring requests there seems fragile to me. Paul - what are your
>> thoughts about bailing early from hvm_add_ioreq_gfn() when the
>> domain is dying?"
> 
> Was this on the thread of this patch? I didn't find such a
> reply of mine. I need more context here because you name
> hvm_add_ioreq_gfn() above, while I refer to iommu_map()
> (and downwards the call stack).

See [1].

> 
>> Are you now saying that the following snippet would be fine:
>>
>> if ( d->is_dying )
>>     return 0;
> 
> In {amd,intel}_iommu_map_page(), after the lock was acquired
> and with it suitably released, yes. And if that's what you
> suggested, then I'm sorry - I don't think I can see anything
> fragile there anymore.

Duplicating the check sounds good to me.

> 
>>> You could
>>> utilize the same lock, but you'd need to duplicate the
>>> checking in {amd,intel}_iommu_map_page().
>>>
>>> I'm not entirely certain about the case of unmap requests:
>>> It may be possible to also suppress/ignore them, but this
>>> may require some further thought.
>>
>> I think the unmap part is quite risky to tie to d->is_dying, because the
>> PCI devices may not be quiesced and may still be assigned to the domain.
> 
> Hmm, yes, good point. Of course upon first unmap with is_dying
> observed set we could zap the root page table, but I don't
> suppose that's something we want to do for 4.15.

We would still need to zap the root page table in the relinquish path. 
So I am not sure what benefits it would give us to zap the page tables 
on the first iommu_unmap() after the domain dies.

Cheers,

[1] 
https://lore.kernel.org/xen-devel/f21f1f61-5213-55a8-320c-43e5fe80100f@suse.com/
Jan Beulich Feb. 17, 2021, 12:57 p.m. UTC | #7
On 17.02.2021 12:49, Julien Grall wrote:
> On 10/02/2021 16:12, Jan Beulich wrote:
>> On 10.02.2021 16:04, Julien Grall wrote:
>>> Are you now saying that the following snippet would be fine:
>>>
>>> if ( d->is_dying )
>>>     return 0;
>>
>> In {amd,intel}_iommu_map_page(), after the lock was acquired
>> and with it suitably released, yes. And if that's what you
>> suggested, then I'm sorry - I don't think I can see anything
>> fragile there anymore.
> 
> Duplicating the check sounds good to me.

The checks in said functions are mandatory, and I didn't really
have any duplication in mind. But yes, iommu_map() could have
an early (but racy) check, if so wanted.

>>>> You could
>>>> utilize the same lock, but you'd need to duplicate the
>>>> checking in {amd,intel}_iommu_map_page().
>>>>
>>>> I'm not entirely certain about the case of unmap requests:
>>>> It may be possible to also suppress/ignore them, but this
>>>> may require some further thought.
>>>
>>> I think the unmap part is quite risky to tie to d->is_dying, because the
>>> PCI devices may not be quiesced and may still be assigned to the domain.
>>
>> Hmm, yes, good point. Of course upon first unmap with is_dying
>> observed set we could zap the root page table, but I don't
>> suppose that's something we want to do for 4.15.
> 
> We would still need to zap the root page table in the relinquish path. 
> So I am not sure what benefits it would give us to zap the page tables 
> on the first iommu_unmap() after the domain dies.

I guess we think of different aspects of the zapping - what I'm
suggesting here is for the effect of no translations remaining
active anymore. I'm not after freeing the memory at this point;
that'll have to happen on the relinquish path, as you say. The
problem with allowing individual unmaps to proceed (unlike your
plan for maps) is that these, too, can trigger allocations (when
a large page needs shattering).

Jan
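
The shattering point is worth spelling out: removing one small page from a
range covered by a superpage entry means replacing that entry with a freshly
allocated page-table, so even a pure unmap reaches the allocator. A sketch
with illustrative helper names (make_small_pte()/make_table_pte() are not
actual Xen functions) and simplified PTE handling:

/* Sketch only: shattering a superpage on the unmap path. */
static int shatter_superpage(struct domain *d, uint64_t *pte)
{
    /* The unmap path, too, ends up in the allocator... */
    struct page_info *pg = iommu_alloc_pgtable(d);
    uint64_t *table;
    unsigned int i;

    if ( !pg )
        return -ENOMEM; /* ... and can therefore fail for a dying domain. */

    /* Re-express the superpage as 512 next-level entries. */
    table = __map_domain_page(pg);
    for ( i = 0; i < 512; i++ )
        table[i] = make_small_pte(*pte, i);
    unmap_domain_page(table);

    /* Point the old slot at the new table. */
    *pte = make_table_pte(pg);
    return 0;
}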

Patch

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index cea1032b3d02..82d770107a47 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,13 @@  int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+    /*
+     * There should be not page-tables left allocated by the time the
+     * domain is destroyed. Note that arch_iommu_domain_destroy() is
+     * called unconditionally, so pgtables may be unitialized.
+     */
+    ASSERT(dom_iommu(d)->platform_ops == NULL ||
+           page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -267,6 +274,12 @@  int iommu_free_pgtables(struct domain *d)
     struct page_info *pg;
     unsigned int done = 0;
 
+    if ( !is_iommu_enabled(d) )
+        return 0;
+
+    /* After this barrier no new page allocations can occur. */
+    spin_barrier(&hd->arch.pgtables.lock);
+
     while ( (pg = page_list_remove_head(&hd->arch.pgtables.list)) )
     {
         free_domheap_page(pg);
@@ -284,6 +297,7 @@  struct page_info *iommu_alloc_pgtable(struct domain *d)
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
+    bool alive = false;
 
 #ifdef CONFIG_NUMA
     if ( hd->node != NUMA_NO_NODE )
@@ -303,9 +317,29 @@  struct page_info *iommu_alloc_pgtable(struct domain *d)
     unmap_domain_page(p);
 
     spin_lock(&hd->arch.pgtables.lock);
-    page_list_add(pg, &hd->arch.pgtables.list);
+    /*
+     * The IOMMU page-tables are freed when relinquishing the domain, but
+     * nothing prevent allocation to happen afterwards. There is no valid
+     * reasons to continue to update the IOMMU page-tables while the
+     * domain is dying.
+     *
+     * So prevent page-table allocation when the domain is dying.
+     *
+     * We relying on &hd->arch.pgtables.lock to synchronize d->is_dying.
+     */
+    if ( likely(!d->is_dying) )
+    {
+        alive = true;
+        page_list_add(pg, &hd->arch.pgtables.list);
+    }
     spin_unlock(&hd->arch.pgtables.lock);
 
+    if ( unlikely(!alive) )
+    {
+        free_domheap_page(pg);
+        pg = NULL;
+    }
+
     return pg;
 }
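
Taken together, the two hunks pair a flag check under a lock with a barrier
on the same lock. Why this closes the race (an illustrative interleaving,
not code from the patch):

/*
 * iommu_alloc_pgtable()                 iommu_free_pgtables()
 * ---------------------------------     ---------------------------------
 * spin_lock(&pgtables.lock);
 * if ( !d->is_dying )           (a)
 *     page_list_add(pg, &list);
 * spin_unlock(&pgtables.lock);
 *                                       spin_barrier(&pgtables.lock); (b)
 *                                       drain pgtables.list
 *
 * The barrier at (b) waits until no CPU holds the lock, so any allocator
 * that sampled !d->is_dying at (a) before the barrier has also completed
 * its page_list_add() and the drain sees the page; any allocator taking
 * the lock after (b) observes d->is_dying set and frees the page itself.
 */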