
[for-4.15,4/4] xen/iommu: x86: Don't leak the IOMMU page-tables

Message ID 20201222154338.9459-5-julien@xen.org
State Superseded
Series xen/iommu: Collection of bug fixes for IOMMU teardown

Commit Message

Julien Grall Dec. 22, 2020, 3:43 p.m. UTC
From: Julien Grall <jgrall@amazon.com>

The new IOMMU page-tables allocator will release the pages when
relinquishing the domain resources. However, this is not sufficient in two
cases:
    1) domain_relinquish_resources() is not called when the domain
    creation fails.
    2) There is nothing preventing page-table allocations when the
    domain is dying.

In both cases, this can be solved by freeing the page-tables again
during domain destruction. Although, this may result to an high
number of page-tables to free.

In the second case, it is pointless to allow page-table allocation when
the domain is going to die. iommu_alloc_pgtable() will now return an
error when it is called while the domain is dying.

Signed-off-by: Julien Grall <jgrall@amazon.com>
---
 xen/arch/x86/domain.c               |  2 +-
 xen/drivers/passthrough/x86/iommu.c | 32 +++++++++++++++++++++++++++--
 xen/include/asm-x86/iommu.h         |  2 +-
 3 files changed, 32 insertions(+), 4 deletions(-)

Comments

Julien Grall Dec. 22, 2020, 5:12 p.m. UTC | #1
Hi,

On 22/12/2020 15:43, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> The new IOMMU page-tables allocator will release the pages when
> relinquishing the domain resources. However, this is not sufficient in two
> cases:
>      1) domain_relinquish_resources() is not called when the domain
>      creation fails.
>      2) There is nothing preventing page-table allocations when the
>      domain is dying.
> 
> In both cases, this can be solved by freeing the page-tables again
> during domain destruction. Although, this may result to an high
> number of page-tables to free.
> 
> In the second case, it is pointless to allow page-table allocation when
> the domain is going to die. iommu_alloc_pgtable() will now return an
> error when it is called while the domain is dying.
> 
> Signed-off-by: Julien Grall <jgrall@amazon.com>

I forgot to mention this is fixing 15bc9a1ef51c "x86/iommu: add common 
page-table allocator". I will add a Fixes tag for the next version.

Cheers,
Jan Beulich Dec. 23, 2020, 2:34 p.m. UTC | #2
On 22.12.2020 16:43, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
> 
> The new IOMMU page-tables allocator will release the pages when
> relinquishing the domain resources. However, this is not sufficient in two
> cases:
>     1) domain_relinquish_resources() is not called when the domain
>     creation fails.

Could you remind me of what IOMMU page table insertions there
are during domain creation? No memory got allocated to the
domain at that point yet, so it would seem to me there simply
is nothing to map.

>     2) There is nothing preventing page-table allocations when the
>     domain is dying.
> 
> In both cases, this can be solved by freeing the page-tables again
> during domain destruction. Although, this may result to an high
> number of page-tables to free.

Since I've seen this before in this series, and despite me also
not being a native speaker, as a nit: I don't think it can
typically be other than "result in".

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -2290,7 +2290,7 @@ int domain_relinquish_resources(struct domain *d)
>  
>      PROGRESS(iommu_pagetables):
>  
> -        ret = iommu_free_pgtables(d);
> +        ret = iommu_free_pgtables(d, false);

I suppose you mean "true" here, but I also think the other
approach (checking for DOMDYING_dead, which you don't seem to
like very much) is better, if for no other reason than it
already being used elsewhere.
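
E.g. (sketch only, untested, relying on both the domain-create error
path and final destruction running with d->is_dying == DOMDYING_dead):

    if ( !(++done & 0xff) && d->is_dying != DOMDYING_dead &&
         general_preempt_check() )
    {
        spin_unlock(&hd->arch.pgtables.lock);
        return -ERESTART;
    }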

> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>          memflags = MEMF_node(hd->node);
>  #endif
>  
> +    /*
> +     * The IOMMU page-tables are freed when relinquishing the domain, but
> +     * nothing prevents allocations from happening afterwards. There is no valid
> +     * reason to continue updating the IOMMU page-tables while the
> +     * domain is dying.
> +     *
> +     * So prevent page-table allocation when the domain is dying. Note
> +     * this doesn't fully prevent the race because d->is_dying may not
> +     * yet be seen.
> +     */
> +    if ( d->is_dying )
> +        return NULL;
> +
>      pg = alloc_domheap_page(NULL, memflags);
>      if ( !pg )
>          return NULL;

As said in reply to an earlier patch - with a suitable
spin_barrier() you can place your check further down, along the
lines of

    spin_lock(&hd->arch.pgtables.lock);
    if ( likely(!d->is_dying) )
    {
        page_list_add(pg, &hd->arch.pgtables.list);
        p = NULL;
    }
    spin_unlock(&hd->arch.pgtables.lock);

    if ( p )
    {
        free_domheap_page(pg);
        pg = NULL;
    }

(albeit I'm relatively sure you won't like the re-purposing of
p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
so far.)
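
If we grew one, I'd expect it to simply mirror FREE_XENHEAP_PAGE(),
i.e. something like (untested sketch; the NULL check is there because,
unlike its Xen-heap counterpart, free_domheap_page() doesn't look
NULL-safe):

    #define FREE_DOMHEAP_PAGE(p) do { \
        if ( p )                      \
        {                             \
            free_domheap_page(p);     \
            (p) = NULL;               \
        }                             \
    } while ( false )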

Jan
Julien Grall Dec. 23, 2020, 4:07 p.m. UTC | #3
On 23/12/2020 14:34, Jan Beulich wrote:
> On 22.12.2020 16:43, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> The new IOMMU page-tables allocator will release the pages when
>> relinquishing the domain resources. However, this is not sufficient in two
>> cases:
>>      1) domain_relinquish_resources() is not called when the domain
>>      creation fails.
> 
> Could you remind me of what IOMMU page table insertions there
> are during domain creation? No memory got allocated to the
> domain at that point yet, so it would seem to me there simply
> is nothing to map.

The P2M is first modified in hvm_domain_initialise():

(XEN) Xen call trace:
(XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
(XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
(XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
(XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
(XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
(XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
(XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
(XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
(XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
(XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
(XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7
(XEN)    [<ffff82d04031eae7>] F arch_domain_create+0x478/0x4ff
(XEN)    [<ffff82d04020476e>] F domain_create+0x4f2/0x778
(XEN)    [<ffff82d04023b0d2>] F do_domctl+0xb1e/0x18b8
(XEN)    [<ffff82d040311dbf>] F pv_hypercall+0x2f0/0x55f
(XEN)    [<ffff82d040390432>] F lstar_enter+0x112/0x120

> 
>>      2) There is nothing preventing page-table allocations when the
>>      domain is dying.
>>
>> In both cases, this can be solved by freeing the page-tables again
>> during domain destruction. Although, this may result to an high
>> number of page-tables to free.
> 
> Since I've seen this before in this series, and despite me also
> not being a native speaker, as a nit: I don't think it can
> typically be other than "result in".

I think you are right.

> 
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -2290,7 +2290,7 @@ int domain_relinquish_resources(struct domain *d)
>>   
>>       PROGRESS(iommu_pagetables):
>>   
>> -        ret = iommu_free_pgtables(d);
>> +        ret = iommu_free_pgtables(d, false);
> 
> I suppose you mean "true" here, but I also think the other
> approach (checking for DOMDYING_dead, which you don't seem to
> like very much) is better, if for no other reason than it
> already being used elsewhere.

I think "don't like very much" is an understatement :). There seems to 
be more function using an extra parameter (such as hap_set_allocation() 
which was introduced before your DOMDYING_dead). So I only followed what 
they did.
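
For reference, that one signals preemption via an in/out flag rather
than looking at the domain state itself, roughly:

    /* xen/arch/x86/mm/hap/hap.c (from memory): */
    int hap_set_allocation(struct domain *d, unsigned int pages,
                           bool *preempted);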

> 
>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>           memflags = MEMF_node(hd->node);
>>   #endif
>>   
>> +    /*
>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>> +     * nothing prevents allocations from happening afterwards. There is no valid
>> +     * reason to continue updating the IOMMU page-tables while the
>> +     * domain is dying.
>> +     *
>> +     * So prevent page-table allocation when the domain is dying. Note
>> +     * this doesn't fully prevent the race because d->is_dying may not
>> +     * yet be seen.
>> +     */
>> +    if ( d->is_dying )
>> +        return NULL;
>> +
>>       pg = alloc_domheap_page(NULL, memflags);
>>       if ( !pg )
>>           return NULL;
> 
> As said in reply to an earlier patch - with a suitable
> spin_barrier() you can place your check further down, along the
> lines of
> 
>      spin_lock(&hd->arch.pgtables.lock);
>      if ( likely(!d->is_dying) )
>      {
>          page_list_add(pg, &hd->arch.pgtables.list);
>          p = NULL;
>      }
>      spin_unlock(&hd->arch.pgtables.lock);
> 
>      if ( p )
>      {
>          free_domheap_page(pg);
>          pg = NULL;
>      }
> 
> (albeit I'm relatively sure you won't like the re-purposing of
> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
> so far.)

In fact I don't mind the re-purposing of p. However, I dislike the 
allocation and then freeing when the domain is dying.

I think I prefer the small race introduced (the pages will still be 
freed) over this solution.

Note that Paul's IOMMU series will completely rework the function. So 
this is only temporary.

Cheers,
Jan Beulich Dec. 23, 2020, 4:15 p.m. UTC | #4
On 23.12.2020 17:07, Julien Grall wrote:
> On 23/12/2020 14:34, Jan Beulich wrote:
>> On 22.12.2020 16:43, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> The new IOMMU page-tables allocator will release the pages when
>>> relinquishing the domain resources. However, this is not sufficient in two
>>> cases:
>>>      1) domain_relinquish_resources() is not called when the domain
>>>      creation fails.
>>
>> Could you remind me of what IOMMU page table insertions there
>> are during domain creation? No memory got allocated to the
>> domain at that point yet, so it would seem to me there simply
>> is nothing to map.
> 
> The P2M is first modified in hvm_domain_initialise():
> 
> (XEN) Xen call trace:
> (XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
> (XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
> (XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
> (XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
> (XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
> (XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
> (XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
> (XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
> (XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
> (XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
> (XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7

Oh, the infamous APIC access page again.

>>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>>           memflags = MEMF_node(hd->node);
>>>   #endif
>>>   
>>> +    /*
>>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>>> +     * nothing prevents allocations from happening afterwards. There is no valid
>>> +     * reason to continue updating the IOMMU page-tables while the
>>> +     * domain is dying.
>>> +     *
>>> +     * So prevent page-table allocation when the domain is dying. Note
>>> +     * this doesn't fully prevent the race because d->is_dying may not
>>> +     * yet be seen.
>>> +     */
>>> +    if ( d->is_dying )
>>> +        return NULL;
>>> +
>>>       pg = alloc_domheap_page(NULL, memflags);
>>>       if ( !pg )
>>>           return NULL;
>>
>> As said in reply to an earlier patch - with a suitable
>> spin_barrier() you can place your check further down, along the
>> lines of
>>
>>      spin_lock(&hd->arch.pgtables.lock);
>>      if ( likely(!d->is_dying) )
>>      {
>>          page_list_add(pg, &hd->arch.pgtables.list);
>>          p = NULL;
>>      }
>>      spin_unlock(&hd->arch.pgtables.lock);
>>
>>      if ( p )
>>      {
>>          free_domheap_page(pg);
>>          pg = NULL;
>>      }
>>
>> (albeit I'm relatively sure you won't like the re-purposing of
>> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
>> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
>> so far.)
> 
> In fact I don't mind the re-purposing of p. However, I dislike the 
> allocation and then freeing when the domain is dying.
> 
> I think I prefer the small race introduced (the pages will still be 
> freed) over this solution.

The "will still be freed" is because of the 2nd round of freeing
you add in an earlier patch? I'd prefer to avoid the race to in
turn avoid that 2nd round of freeing.

Jan
Julien Grall Dec. 23, 2020, 4:19 p.m. UTC | #5
On 23/12/2020 16:15, Jan Beulich wrote:
> On 23.12.2020 17:07, Julien Grall wrote:
>> On 23/12/2020 14:34, Jan Beulich wrote:
>>> On 22.12.2020 16:43, Julien Grall wrote:
>>>> From: Julien Grall <jgrall@amazon.com>
>>>>
>>>> The new IOMMU page-tables allocator will release the pages when
>>>> relinquishing the domain resources. However, this is not sufficient in two
>>>> cases:
>>>>       1) domain_relinquish_resources() is not called when the domain
>>>>       creation fails.
>>>
>>> Could you remind me of what IOMMU page table insertions there
>>> are during domain creation? No memory got allocated to the
>>> domain at that point yet, so it would seem to me there simply
>>> is nothing to map.
>>
>> The P2M is first modified in hvm_domain_initialise():
>>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
>> (XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
>> (XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
>> (XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
>> (XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
>> (XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
>> (XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
>> (XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
>> (XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
>> (XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
>> (XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7
> 
> Oh, the infamous APIC access page again.
> 
>>>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>>>            memflags = MEMF_node(hd->node);
>>>>    #endif
>>>>    
>>>> +    /*
>>>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>>>> +     * nothing prevents allocations from happening afterwards. There is no valid
>>>> +     * reason to continue updating the IOMMU page-tables while the
>>>> +     * domain is dying.
>>>> +     *
>>>> +     * So prevent page-table allocation when the domain is dying. Note
>>>> +     * this doesn't fully prevent the race because d->is_dying may not
>>>> +     * yet be seen.
>>>> +     */
>>>> +    if ( d->is_dying )
>>>> +        return NULL;
>>>> +
>>>>        pg = alloc_domheap_page(NULL, memflags);
>>>>        if ( !pg )
>>>>            return NULL;
>>>
>>> As said in reply to an earlier patch - with a suitable
>>> spin_barrier() you can place your check further down, along the
>>> lines of
>>>
>>>       spin_lock(&hd->arch.pgtables.lock);
>>>       if ( likely(!d->is_dying) )
>>>       {
>>>           page_list_add(pg, &hd->arch.pgtables.list);
>>>           p = NULL;
>>>       }
>>>       spin_unlock(&hd->arch.pgtables.lock);
>>>
>>>       if ( p )
>>>       {
>>>           free_domheap_page(pg);
>>>           pg = NULL;
>>>       }
>>>
>>> (albeit I'm relatively sure you won't like the re-purposing of
>>> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
>>> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
>>> so far.)
>>
>> In fact I don't mind the re-purposing of p. However, I dislike the
>> allocation and then freeing when the domain is dying.
>>
>> I think I prefer the small race introduced (the pages will still be
>> freed) over this solution.
> 
> The "will still be freed" is because of the 2nd round of freeing
> you add in an earlier patch? I'd prefer to avoid the race to in
> turn avoid that 2nd round of freeing.

The "2nd round" of freeing is necessary at least for the domain creation 
failure case (it would be the 1st round).

If we can avoid IOMMU page-table allocation in the domain creation path, 
then yes I agree that we want to avoid that "2nd round". Otherwise, I 
think it is best to take advantage of it.

Cheers,
Jan Beulich Dec. 23, 2020, 4:35 p.m. UTC | #6
On 23.12.2020 17:19, Julien Grall wrote:
> On 23/12/2020 16:15, Jan Beulich wrote:
>> On 23.12.2020 17:07, Julien Grall wrote:
>>> I think I prefer the small race introduced (the pages will still be
>>> freed) over this solution.
>>
>> The "will still be freed" is because of the 2nd round of freeing
>> you add in an earlier patch? I'd prefer to avoid the race to in
>> turn avoid that 2nd round of freeing.
> 
> The "2nd round" of freeing is necessary at least for the domain creation 
> failure case (it would be the 1st round).
> 
> If we can avoid IOMMU page-table allocation in the domain creation path, 
> then yes I agree that we want to avoid that "2nd round". Otherwise, I 
> think it is best to take advantage of it.

Well, I'm not really certain either way here. If it's really just
VMX'es APIC access page that's the problem here, custom cleanup
of this "fallout" from VMX code would certainly be an option.
Furthermore I consider it wrong to insert this page in the IOMMU
page tables in the first place. This page overlaps with the MSI
special address range, and hence accesses will be dealt with by
interrupt remapping machinery (if enabled). If interrupt
remapping is disabled, this page should be no different for I/O
purposes than all other pages in this window - they shouldn't
be mapped at all.
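
(For context: the APIC access page lives at APIC_DEFAULT_PHYS_BASE,
0xfee00000, i.e. inside the 0xfee00000-0xfeefffff window targeted by
MSI address writes.)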

Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
should gain an is_special_page() check to prevent propagation to
the IOMMU for such pages (we can't do anything about them in
shared-page-table setups)? See also the (PV related) comment in
cleanup_page_mappings(), a few lines ahead of a respective use of
is_special_page(),
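
I.e. something along these lines in ept_set_entry() (names from
memory, so take it as a sketch only):

    /* Don't propagate mappings of special pages to the IOMMU. */
    if ( !mfn_eq(mfn, INVALID_MFN) && is_special_page(mfn_to_page(mfn)) )
        need_modify_vtd_table = 0;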

Jan
Julien Grall Jan. 14, 2021, 6:53 p.m. UTC | #7
Hi Jan,

On 23/12/2020 16:35, Jan Beulich wrote:
> On 23.12.2020 17:19, Julien Grall wrote:
>> On 23/12/2020 16:15, Jan Beulich wrote:
>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>> I think I prefer the small race introduced (the pages will still be
>>>> freed) over this solution.
>>>
>>> The "will still be freed" is because of the 2nd round of freeing
>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>> turn avoid that 2nd round of freeing.
>>
>> The "2nd round" of freeing is necessary at least for the domain creation
>> failure case (it would be the 1st round).
>>
>> If we can avoid IOMMU page-table allocation in the domain creation path,
>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>> think it is best to take advantage of it.
> 
> Well, I'm not really certain either way here. If it's really just
> VMX'es APIC access page that's the problem here, custom cleanup
> of this "fallout" from VMX code would certainly be an option.

From my testing, it looks like the VMX APIC page is the only entry 
added while the domain is created.

> Furthermore I consider it wrong to insert this page in the IOMMU
> page tables in the first place. This page overlaps with the MSI
> special address range, and hence accesses will be dealt with by
> interrupt remapping machinery (if enabled). If interrupt
> remapping is disabled, this page should be no different for I/O
> purposes than all other pages in this window - they shouldn't
> be mapped at all.

That's a fair point. I will have a look to see if I can avoid the IOMMU 
mapping for the VMX APIC page.

> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
> should gain an is_special_page() check to prevent propagation to
> the IOMMU for such pages (we can't do anything about them in
> shared-page-table setups)? See also the (PV related) comment in
> cleanup_page_mappings(), a few lines ahead of a respective use of
> is_special_page(),

There seems to be a mismatch between what the comment says and the code 
adding the page in the IOMMU for PV domains.

AFAICT, the IOMMU mapping will be added via _get_page_type():

         /* Special pages should not be accessible from devices. */
         struct domain *d = page_get_owner(page);

         if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
         {
             mfn_t mfn = page_to_mfn(page);

             if ( (x & PGT_type_mask) == PGT_writable_page )
                 rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
                                         1ul << PAGE_ORDER_4K);
             else
                 rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
                                       1ul << PAGE_ORDER_4K,
                                       IOMMUF_readable | IOMMUF_writable);
         }

In this snippet, "special page" is interpreted as a page with no owner. 
However, is_special_page() will return true when PGC_extra is set and the 
flag implies there is an owner (see assign_pages()).

So it looks to me like any pages with PGC_extra would end up having a 
mapping in the IOMMU page-tables if they are writable.

This statement also seems to apply to xenheap pages shared with a 
domain (e.g. grant-table).

Did I miss anything?

Cheers,
Jan Beulich Jan. 15, 2021, 11:24 a.m. UTC | #8
On 14.01.2021 19:53, Julien Grall wrote:
> On 23/12/2020 16:35, Jan Beulich wrote:
>> On 23.12.2020 17:19, Julien Grall wrote:
>>> On 23/12/2020 16:15, Jan Beulich wrote:
>>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>>> I think I prefer the small race introduced (the pages will still be
>>>>> freed) over this solution.
>>>>
>>>> The "will still be freed" is because of the 2nd round of freeing
>>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>>> turn avoid that 2nd round of freeing.
>>>
>>> The "2nd round" of freeing is necessary at least for the domain creation
>>> failure case (it would be the 1st round).
>>>
>>> If we can avoid IOMMU page-table allocation in the domain creation path,
>>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>>> think it is best to take advantage of it.
>>
>> Well, I'm not really certain either way here. If it's really just
>> VMX'es APIC access page that's the problem here, custom cleanup
>> of this "fallout" from VMX code would certainly be an option.
> 
>  From my testing, it looks like the VMX APIC page is the only entry 
> added while the domain is created.
> 
>> Furthermore I consider it wrong to insert this page in the IOMMU
>> page tables in the first place. This page overlaps with the MSI
>> special address range, and hence accesses will be dealt with by
>> interrupt remapping machinery (if enabled). If interrupt
>> remapping is disabled, this page should be no different for I/O
>> purposes than all other pages in this window - they shouldn't
>> be mapped at all.
> 
> That's a fair point. I will have a look to see if I can avoid the IOMMU 
> mapping for the VMX APIC page.
> 
>> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
>> should gain an is_special_page() check to prevent propagation to
>> the IOMMU for such pages (we can't do anything about them in
>> shared-page-table setups)? See also the (PV related) comment in
>> cleanup_page_mappings(), a few lines ahead of a respective use of
>> is_special_page(),
> 
> There seems to be a mismatch between what the comment says and the code 
> adding the page in the IOMMU for PV domains.
> 
> AFAICT, the IOMMU mapping will be added via _get_page_type():
> 
>          /* Special pages should not be accessible from devices. */
>          struct domain *d = page_get_owner(page);
> 
>          if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
>          {
>              mfn_t mfn = page_to_mfn(page);
> 
>              if ( (x & PGT_type_mask) == PGT_writable_page )
>                  rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
>                                          1ul << PAGE_ORDER_4K);
>              else
>                  rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
>                                        1ul << PAGE_ORDER_4K,
>                                        IOMMUF_readable | IOMMUF_writable);
>          }
> 
> In this snippet, "special page" is interpreted as a page with no owner. 
> However, is_special_page() will return true when PGC_extra is set and the 
> flag implies there is an owner (see assign_pages()).
> 
> So it looks to me like any pages with PGC_extra would end up having a 
> mapping in the IOMMU page-tables if they are writable.
> 
> This statement also seems to apply to xenheap pages shared with a 
> domain (e.g. grant-table).
> 
> Did I miss anything?

First let me point out that what you quote is not what I had
pointed you at - you refer to _get_page_type() when I suggested
to look at cleanup_page_mappings(). The important aspect for
special pages (here I mean ones having been subject to
share_xen_page_with_guest()) is that their type gets "pinned",
i.e. they'll never be subject to _get_page_type()'s
transitioning of types. As you likely will have noticed,
cleanup_page_mappings() also only gets called when the last
_general_ ref of a page got dropped, i.e. entirely unrelated
to type references.

If PGC_extra pages have a similar requirement, they may need
similar pinning of their types. (Maybe they do; didn't check.)
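
(Concretely, the pinning done by share_xen_page_with_guest() is,
from memory:

    /* One type ref that's never dropped - the type can't transition. */
    page->u.inuse.type_info =
        (flags == SHARE_ro ? PGT_none : PGT_writable_page) |
        PGT_validated | 1;

which is why _get_page_type() never sees these pages change type.)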

Jan
Julien Grall Jan. 15, 2021, 11:30 a.m. UTC | #9
Hi Jan,

On 15/01/2021 11:24, Jan Beulich wrote:
> On 14.01.2021 19:53, Julien Grall wrote:
>> On 23/12/2020 16:35, Jan Beulich wrote:
>>> On 23.12.2020 17:19, Julien Grall wrote:
>>>> On 23/12/2020 16:15, Jan Beulich wrote:
>>>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>>>> I think I prefer the small race introduced (the pages will still be
>>>>>> freed) over this solution.
>>>>>
>>>>> The "will still be freed" is because of the 2nd round of freeing
>>>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>>>> turn avoid that 2nd round of freeing.
>>>>
>>>> The "2nd round" of freeing is necessary at least for the domain creation
>>>> failure case (it would be the 1st round).
>>>>
>>>> If we can avoid IOMMU page-table allocation in the domain creation path,
>>>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>>>> think it is best to take advantage of it.
>>>
>>> Well, I'm not really certain either way here. If it's really just
>>> VMX'es APIC access page that's the problem here, custom cleanup
>>> of this "fallout" from VMX code would certainly be an option.
>>
>> From my testing, it looks like the VMX APIC page is the only entry
>> added while the domain is created.
>>
>>> Furthermore I consider it wrong to insert this page in the IOMMU
>>> page tables in the first place. This page overlaps with the MSI
>>> special address range, and hence accesses will be dealt with by
>>> interrupt remapping machinery (if enabled). If interrupt
>>> remapping is disabled, this page should be no different for I/O
>>> purposes than all other pages in this window - they shouldn't
>>> be mapped at all.
>>
>> That's a fair point. I will have a look to see if I can avoid the IOMMU
>> mapping for the VMX APIC page.
>>
>>> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
>>> should gain an is_special_page() check to prevent propagation to
>>> the IOMMU for such pages (we can't do anything about them in
>>> shared-page-table setups)? See also the (PV related) comment in
>>> cleanup_page_mappings(), a few lines ahead of a respective use of
>>> is_special_page(),
>>
>> There seems to be a mismatch between what the comment says and the code
>> adding the page in the IOMMU for PV domains.
>>
>> AFAICT, the IOMMU mapping will be added via _get_page_type():
>>
>>           /* Special pages should not be accessible from devices. */
>>           struct domain *d = page_get_owner(page);
>>
>>           if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
>>           {
>>               mfn_t mfn = page_to_mfn(page);
>>
>>               if ( (x & PGT_type_mask) == PGT_writable_page )
>>                   rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
>>                                           1ul << PAGE_ORDER_4K);
>>               else
>>                   rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
>>                                         1ul << PAGE_ORDER_4K,
>>                                         IOMMUF_readable | IOMMUF_writable);
>>           }
>>
>> In this snippet, "special page" is interpreted as a page with no owner.
>> However, is_special_page() will return true when PGC_extra is set and the
>> flag implies there is an owner (see assign_pages()).
>>
>> So it looks to me like any pages with PGC_extra would end up having a
>> mapping in the IOMMU page-tables if they are writable.
>>
>> This statement also seems to apply to xenheap pages shared with a
>> domain (e.g. grant-table).
>>
>> Did I miss anything?
> 
> First let me point out that what you quote is not what I had
> pointed you at - you refer to _get_page_type() when I suggested
> to look at cleanup_page_mappings(). 
I know, and that's why I pointed out the mismatch between the comments 
(in cleanup_page_mappings()) and the code adding the mapping in the 
IOMMU (_get_page_type()).

I only looked at _get_page_type() because I wanted to understand how the 
IOMMU mapping works for PV.

> The important aspect for
> special pages (here I mean ones having been subject to
> share_xen_page_with_guest()) is that their type gets "pinned",
> i.e. they'll never be subject to _get_page_type()'s
> transitioning of types. As you likely will have noticed,
> cleanup_page_mappings() also only gets called when the last
> _general_ ref of a page got dropped, i.e. entirely unrelated
> to type references.

Ah, that makes sense. I added some debugging in the code and couldn't 
really figure out why the transition wasn't happening.

> 
> If PGC_extra pages have a similar requirement, they may need
> similar pinning of their types. (Maybe they do; didn't check.)

I have only looked at the VMX APIC page so far. It effectively pins 
its type.

Thanks for the explanation!

Cheers,

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b9ba04633e18..1b7ee5c1a8cb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2290,7 +2290,7 @@  int domain_relinquish_resources(struct domain *d)
 
     PROGRESS(iommu_pagetables):
 
-        ret = iommu_free_pgtables(d);
+        ret = iommu_free_pgtables(d, false);
         if ( ret )
             return ret;
 
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 99a23177b3d2..4a083e4b8f11 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,21 @@  int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+    struct domain_iommu *hd = dom_iommu(d);
+    int rc;
+
+    /*
+     * The relinquish code will not be executed if the domain creation
+     * failed. To avoid any memory leak, we want to free any IOMMU
+     * page-tables that may have been allocated.
+     */
+    rc = iommu_free_pgtables(d, false);
+
+    /* Preemption was disabled, so the call should never fail. */
+    if ( rc )
+        ASSERT_UNREACHABLE();
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 }
 
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -261,7 +276,7 @@  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
         return;
 }
 
-int iommu_free_pgtables(struct domain *d)
+int iommu_free_pgtables(struct domain *d, bool preempt)
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct page_info *pg;
@@ -282,7 +297,7 @@  int iommu_free_pgtables(struct domain *d)
     {
         free_domheap_page(pg);
 
-        if ( !(++done & 0xff) && general_preempt_check() )
+        if ( !(++done & 0xff) && preempt && general_preempt_check() )
         {
             spin_unlock(&hd->arch.pgtables.lock);
             return -ERESTART;
@@ -305,6 +320,19 @@  struct page_info *iommu_alloc_pgtable(struct domain *d)
         memflags = MEMF_node(hd->node);
 #endif
 
+    /*
+     * The IOMMU page-tables are freed when relinquishing the domain, but
+     * nothing prevents allocations from happening afterwards. There is no valid
+     * reason to continue updating the IOMMU page-tables while the
+     * domain is dying.
+     *
+     * So prevent page-table allocation when the domain is dying. Note
+     * this doesn't fully prevent the race because d->is_dying may not
+     * yet be seen.
+     */
+    if ( d->is_dying )
+        return NULL;
+
     pg = alloc_domheap_page(NULL, memflags);
     if ( !pg )
         return NULL;
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 970eb06ffac5..874bb5bbfbde 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -135,7 +135,7 @@  int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
-int __must_check iommu_free_pgtables(struct domain *d);
+int __must_check iommu_free_pgtables(struct domain *d, bool preempt);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */