Message ID | 20201222154338.9459-5-julien@xen.org (mailing list archive) |
---|---|
State | Superseded |
Series | xen/iommu: Collection of bug fixes for IOMMU teardown |
Hi,

On 22/12/2020 15:43, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
>
> The new IOMMU page-tables allocator will release the pages when
> relinquish the domain resources. However, this is not sufficient in two
> cases:
>     1) domain_relinquish_resources() is not called when the domain
>     creation fails.
>     2) There is nothing preventing page-table allocations when the
>     domain is dying.
>
> In both cases, this can be solved by freeing the page-tables again
> when the domain destruction. Although, this may result to an high
> number of page-tables to free.
>
> In the second case, it is pointless to allow page-table allocation when
> the domain is going to die. iommu_alloc_pgtable() will now return an
> error when it is called while the domain is dying.
>
> Signed-off-by: Julien Grall <jgrall@amazon.com>

I forgot to mention this is fixing 15bc9a1ef51c "x86/iommu: add common
page-table allocator". I will add a Fixes tag for the next version.

Cheers,
On 22.12.2020 16:43, Julien Grall wrote:
> From: Julien Grall <jgrall@amazon.com>
>
> The new IOMMU page-tables allocator will release the pages when
> relinquish the domain resources. However, this is not sufficient in two
> cases:
>     1) domain_relinquish_resources() is not called when the domain
>     creation fails.

Could you remind me of what IOMMU page table insertions there
are during domain creation? No memory got allocated to the
domain at that point yet, so it would seem to me there simply
is nothing to map.

>     2) There is nothing preventing page-table allocations when the
>     domain is dying.
>
> In both cases, this can be solved by freeing the page-tables again
> when the domain destruction. Although, this may result to an high
> number of page-tables to free.

Since I've seen this before in this series, and despite me also
not being a native speaker, as a nit: I don't think it can
typically be other than "result in".

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -2290,7 +2290,7 @@ int domain_relinquish_resources(struct domain *d)
>
>      PROGRESS(iommu_pagetables):
>
> -    ret = iommu_free_pgtables(d);
> +    ret = iommu_free_pgtables(d, false);

I suppose you mean "true" here, but I also think the other
approach (checking for DOMDYING_dead, which you don't seem to
like very much) is better, if for no other reason than it
already being used elsewhere.

> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>      memflags = MEMF_node(hd->node);
>  #endif
>
> +    /*
> +     * The IOMMU page-tables are freed when relinquishing the domain, but
> +     * nothing prevent allocation to happen afterwards. There is no valid
> +     * reasons to continue to update the IOMMU page-tables while the
> +     * domain is dying.
> +     *
> +     * So prevent page-table allocation when the domain is dying. Note
> +     * this doesn't fully prevent the race because d->is_dying may not
> +     * yet be seen.
> +     */
> +    if ( d->is_dying )
> +        return NULL;
> +
>      pg = alloc_domheap_page(NULL, memflags);
>      if ( !pg )
>          return NULL;

As said in reply to an earlier patch - with a suitable
spin_barrier() you can place your check further down, along the
lines of

    spin_lock(&hd->arch.pgtables.lock);
    if ( likely(!d->is_dying) )
    {
        page_list_add(pg, &hd->arch.pgtables.list);
        p = NULL;
    }
    spin_unlock(&hd->arch.pgtables.lock);

    if ( p )
    {
        free_domheap_page(pg);
        pg = NULL;
    }

(albeit I'm relatively sure you won't like the re-purposing of
p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
so far.)

Jan
On 23/12/2020 14:34, Jan Beulich wrote:
> On 22.12.2020 16:43, Julien Grall wrote:
>> From: Julien Grall <jgrall@amazon.com>
>>
>> The new IOMMU page-tables allocator will release the pages when
>> relinquish the domain resources. However, this is not sufficient in two
>> cases:
>>     1) domain_relinquish_resources() is not called when the domain
>>     creation fails.
>
> Could you remind me of what IOMMU page table insertions there
> are during domain creation? No memory got allocated to the
> domain at that point yet, so it would seem to me there simply
> is nothing to map.

The P2M is first modified in hvm_domain_initialise():

(XEN) Xen call trace:
(XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
(XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
(XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
(XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
(XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
(XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
(XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
(XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
(XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
(XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
(XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7
(XEN)    [<ffff82d04031eae7>] F arch_domain_create+0x478/0x4ff
(XEN)    [<ffff82d04020476e>] F domain_create+0x4f2/0x778
(XEN)    [<ffff82d04023b0d2>] F do_domctl+0xb1e/0x18b8
(XEN)    [<ffff82d040311dbf>] F pv_hypercall+0x2f0/0x55f
(XEN)    [<ffff82d040390432>] F lstar_enter+0x112/0x120

>>     2) There is nothing preventing page-table allocations when the
>>     domain is dying.
>>
>> In both cases, this can be solved by freeing the page-tables again
>> when the domain destruction. Although, this may result to an high
>> number of page-tables to free.
>
> Since I've seen this before in this series, and despite me also
> not being a native speaker, as a nit: I don't think it can
> typically be other than "result in".

I think you are right.

>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -2290,7 +2290,7 @@ int domain_relinquish_resources(struct domain *d)
>>
>>      PROGRESS(iommu_pagetables):
>>
>> -    ret = iommu_free_pgtables(d);
>> +    ret = iommu_free_pgtables(d, false);
>
> I suppose you mean "true" here, but I also think the other
> approach (checking for DOMDYING_dead, which you don't seem to
> like very much) is better, if for no other reason than it
> already being used elsewhere.

I think "don't like very much" is an understatement :). There seems to
be more function using an extra parameter (such as hap_set_allocation()
which was introduced before your DOMDYING_dead). So I only followed what
they did.

>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>      memflags = MEMF_node(hd->node);
>>  #endif
>>
>> +    /*
>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>> +     * nothing prevent allocation to happen afterwards. There is no valid
>> +     * reasons to continue to update the IOMMU page-tables while the
>> +     * domain is dying.
>> +     *
>> +     * So prevent page-table allocation when the domain is dying. Note
>> +     * this doesn't fully prevent the race because d->is_dying may not
>> +     * yet be seen.
>> +     */
>> +    if ( d->is_dying )
>> +        return NULL;
>> +
>>      pg = alloc_domheap_page(NULL, memflags);
>>      if ( !pg )
>>          return NULL;
>
> As said in reply to an earlier patch - with a suitable
> spin_barrier() you can place your check further down, along the
> lines of
>
>     spin_lock(&hd->arch.pgtables.lock);
>     if ( likely(!d->is_dying) )
>     {
>         page_list_add(pg, &hd->arch.pgtables.list);
>         p = NULL;
>     }
>     spin_unlock(&hd->arch.pgtables.lock);
>
>     if ( p )
>     {
>         free_domheap_page(pg);
>         pg = NULL;
>     }
>
> (albeit I'm relatively sure you won't like the re-purposing of
> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
> so far.)

In fact I don't mind the re-purposing of p. However, I dislike the
allocation and then freeing when the domain is dying.

I think I prefer the small race introduced (the pages will still be
freed) over this solution.

Note that Paul's IOMMU series will completely rework the function. So
this is only temporary.

Cheers,
On 23.12.2020 17:07, Julien Grall wrote:
> On 23/12/2020 14:34, Jan Beulich wrote:
>> On 22.12.2020 16:43, Julien Grall wrote:
>>> From: Julien Grall <jgrall@amazon.com>
>>>
>>> The new IOMMU page-tables allocator will release the pages when
>>> relinquish the domain resources. However, this is not sufficient in two
>>> cases:
>>>     1) domain_relinquish_resources() is not called when the domain
>>>     creation fails.
>>
>> Could you remind me of what IOMMU page table insertions there
>> are during domain creation? No memory got allocated to the
>> domain at that point yet, so it would seem to me there simply
>> is nothing to map.
>
> The P2M is first modified in hvm_domain_initialise():
>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
> (XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
> (XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
> (XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
> (XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
> (XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
> (XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
> (XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
> (XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
> (XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
> (XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7

Oh, the infamous APIC access page again.

>>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>>      memflags = MEMF_node(hd->node);
>>>  #endif
>>>
>>> +    /*
>>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>>> +     * nothing prevent allocation to happen afterwards. There is no valid
>>> +     * reasons to continue to update the IOMMU page-tables while the
>>> +     * domain is dying.
>>> +     *
>>> +     * So prevent page-table allocation when the domain is dying. Note
>>> +     * this doesn't fully prevent the race because d->is_dying may not
>>> +     * yet be seen.
>>> +     */
>>> +    if ( d->is_dying )
>>> +        return NULL;
>>> +
>>>      pg = alloc_domheap_page(NULL, memflags);
>>>      if ( !pg )
>>>          return NULL;
>>
>> As said in reply to an earlier patch - with a suitable
>> spin_barrier() you can place your check further down, along the
>> lines of
>>
>>     spin_lock(&hd->arch.pgtables.lock);
>>     if ( likely(!d->is_dying) )
>>     {
>>         page_list_add(pg, &hd->arch.pgtables.list);
>>         p = NULL;
>>     }
>>     spin_unlock(&hd->arch.pgtables.lock);
>>
>>     if ( p )
>>     {
>>         free_domheap_page(pg);
>>         pg = NULL;
>>     }
>>
>> (albeit I'm relatively sure you won't like the re-purposing of
>> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
>> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
>> so far.)
>
> In fact I don't mind the re-purposing of p. However, I dislike the
> allocation and then freeing when the domain is dying.
>
> I think I prefer the small race introduced (the pages will still be
> freed) over this solution.

The "will still be freed" is because of the 2nd round of freeing
you add in an earlier patch? I'd prefer to avoid the race to in
turn avoid that 2nd round of freeing.

Jan
On 23/12/2020 16:15, Jan Beulich wrote:
> On 23.12.2020 17:07, Julien Grall wrote:
>> On 23/12/2020 14:34, Jan Beulich wrote:
>>> On 22.12.2020 16:43, Julien Grall wrote:
>>>> From: Julien Grall <jgrall@amazon.com>
>>>>
>>>> The new IOMMU page-tables allocator will release the pages when
>>>> relinquish the domain resources. However, this is not sufficient in two
>>>> cases:
>>>>     1) domain_relinquish_resources() is not called when the domain
>>>>     creation fails.
>>>
>>> Could you remind me of what IOMMU page table insertions there
>>> are during domain creation? No memory got allocated to the
>>> domain at that point yet, so it would seem to me there simply
>>> is nothing to map.
>>
>> The P2M is first modified in hvm_domain_initialise():
>>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d04026b9ec>] R iommu_alloc_pgtable+0x11/0x137
>> (XEN)    [<ffff82d04025f9f5>] F drivers/passthrough/vtd/iommu.c#addr_to_dma_page_maddr+0x146/0x1d8
>> (XEN)    [<ffff82d04025fcc5>] F drivers/passthrough/vtd/iommu.c#intel_iommu_map_page+0x6a/0x14b
>> (XEN)    [<ffff82d04026d949>] F iommu_map+0x6d/0x16f
>> (XEN)    [<ffff82d04026da71>] F iommu_legacy_map+0x26/0x63
>> (XEN)    [<ffff82d040301bdc>] F arch/x86/mm/p2m-ept.c#ept_set_entry+0x6b2/0x730
>> (XEN)    [<ffff82d0402f67e7>] F p2m_set_entry+0x91/0x128
>> (XEN)    [<ffff82d0402f6b5c>] F arch/x86/mm/p2m.c#set_typed_p2m_entry+0xfe/0x3f7
>> (XEN)    [<ffff82d0402f7f4c>] F set_mmio_p2m_entry+0x65/0x6e
>> (XEN)    [<ffff82d04029a080>] F arch/x86/hvm/vmx/vmx.c#vmx_domain_initialise+0xf6/0x137
>> (XEN)    [<ffff82d0402af421>] F hvm_domain_initialise+0x357/0x4c7
>
> Oh, the infamous APIC access page again.
>
>>>> @@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
>>>>      memflags = MEMF_node(hd->node);
>>>>  #endif
>>>>
>>>> +    /*
>>>> +     * The IOMMU page-tables are freed when relinquishing the domain, but
>>>> +     * nothing prevent allocation to happen afterwards. There is no valid
>>>> +     * reasons to continue to update the IOMMU page-tables while the
>>>> +     * domain is dying.
>>>> +     *
>>>> +     * So prevent page-table allocation when the domain is dying. Note
>>>> +     * this doesn't fully prevent the race because d->is_dying may not
>>>> +     * yet be seen.
>>>> +     */
>>>> +    if ( d->is_dying )
>>>> +        return NULL;
>>>> +
>>>>      pg = alloc_domheap_page(NULL, memflags);
>>>>      if ( !pg )
>>>>          return NULL;
>>>
>>> As said in reply to an earlier patch - with a suitable
>>> spin_barrier() you can place your check further down, along the
>>> lines of
>>>
>>>     spin_lock(&hd->arch.pgtables.lock);
>>>     if ( likely(!d->is_dying) )
>>>     {
>>>         page_list_add(pg, &hd->arch.pgtables.list);
>>>         p = NULL;
>>>     }
>>>     spin_unlock(&hd->arch.pgtables.lock);
>>>
>>>     if ( p )
>>>     {
>>>         free_domheap_page(pg);
>>>         pg = NULL;
>>>     }
>>>
>>> (albeit I'm relatively sure you won't like the re-purposing of
>>> p, but that's a minor detail). (FREE_DOMHEAP_PAGE() would be
>>> nice to use here, but we seem to only have FREE_XENHEAP_PAGE()
>>> so far.)
>>
>> In fact I don't mind the re-purposing of p. However, I dislike the
>> allocation and then freeing when the domain is dying.
>>
>> I think I prefer the small race introduced (the pages will still be
>> freed) over this solution.
>
> The "will still be freed" is because of the 2nd round of freeing
> you add in an earlier patch? I'd prefer to avoid the race to in
> turn avoid that 2nd round of freeing.

The "2nd round" of freeing is necessary at least for the domain creation
failure case (it would be the 1rst round).

If we can avoid IOMMU page-table allocation in the domain creation path,
then yes I agree that we want to avoid that "2nd round". Otherwise, I
think it is best to take advantage of it.

Cheers,
On 23.12.2020 17:19, Julien Grall wrote:
> On 23/12/2020 16:15, Jan Beulich wrote:
>> On 23.12.2020 17:07, Julien Grall wrote:
>>> I think I prefer the small race introduced (the pages will still be
>>> freed) over this solution.
>>
>> The "will still be freed" is because of the 2nd round of freeing
>> you add in an earlier patch? I'd prefer to avoid the race to in
>> turn avoid that 2nd round of freeing.
>
> The "2nd round" of freeing is necessary at least for the domain creation
> failure case (it would be the 1rst round).
>
> If we can avoid IOMMU page-table allocation in the domain creation path,
> then yes I agree that we want to avoid that "2nd round". Otherwise, I
> think it is best to take advantage of it.

Well, I'm not really certain either way here. If it's really just
VMX'es APIC access page that's the problem here, custom cleanup
of this "fallout" from VMX code would certainly be an option.

Furthermore I consider it wrong to insert this page in the IOMMU
page tables in the first place. This page overlaps with the MSI
special address range, and hence accesses will be dealt with by
interrupt remapping machinery (if enabled). If interrupt
remapping is disabled, this page should be no different for I/O
purposes than all other pages in this window - they shouldn't
be mapped at all.

Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
should gain an is_special_page() check to prevent propagation to
the IOMMU for such pages (we can't do anything about them in
shared-page-table setups)? See also the (PV related) comment in
cleanup_page_mappings(), a few lines ahead of a respective use of
is_special_page().

Jan
Hi Jan,

On 23/12/2020 16:35, Jan Beulich wrote:
> On 23.12.2020 17:19, Julien Grall wrote:
>> On 23/12/2020 16:15, Jan Beulich wrote:
>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>> I think I prefer the small race introduced (the pages will still be
>>>> freed) over this solution.
>>>
>>> The "will still be freed" is because of the 2nd round of freeing
>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>> turn avoid that 2nd round of freeing.
>>
>> The "2nd round" of freeing is necessary at least for the domain creation
>> failure case (it would be the 1rst round).
>>
>> If we can avoid IOMMU page-table allocation in the domain creation path,
>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>> think it is best to take advantage of it.
>
> Well, I'm not really certain either way here. If it's really just
> VMX'es APIC access page that's the problem here, custom cleanup
> of this "fallout" from VMX code would certainly be an option.

From my testing, it looks like the VMX APIC page is the only entry
added while the domain is created.

> Furthermore I consider it wrong to insert this page in the IOMMU
> page tables in the first place. This page overlaps with the MSI
> special address range, and hence accesses will be dealt with by
> interrupt remapping machinery (if enabled). If interrupt
> remapping is disabled, this page should be no different for I/O
> purposes than all other pages in this window - they shouldn't
> be mapped at all.

That's a fair point. I will have a look to see if I can avoid the IOMMU
mapping for the VMX APIC page.

> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
> should gain an is_special_page() check to prevent propagation to
> the IOMMU for such pages (we can't do anything about them in
> shared-page-table setups)? See also the (PV related) comment in
> cleanup_page_mappings(), a few lines ahead of a respective use of
> is_special_page(),

There seems to be a mismatch between what the comment says and the code
adding the page in the IOMMU for PV domain.

AFAICT, the IOMMU mapping will be added via _get_page_type():

    /* Special pages should not be accessible from devices. */
    struct domain *d = page_get_owner(page);

    if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
    {
        mfn_t mfn = page_to_mfn(page);

        if ( (x & PGT_type_mask) == PGT_writable_page )
            rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
                                    1ul << PAGE_ORDER_4K);
        else
            rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
                                  1ul << PAGE_ORDER_4K,
                                  IOMMUF_readable | IOMMUF_writable);
    }

In this snippet, "special page" is interpreted as a page with no owner.
However is_special_page() will return true when PGC_extra is set and the
flag implies there is an owner (see assign_pages()).

So it looks like to me, any pages with PGC_extra would end up to have a
mapping in the IOMMU pages-tables if they are writable.

This statement also seems to apply for xenheap pages shared with a
domain (e.g grant-table).

Did I miss anything?

Cheers,
On 14.01.2021 19:53, Julien Grall wrote:
> On 23/12/2020 16:35, Jan Beulich wrote:
>> On 23.12.2020 17:19, Julien Grall wrote:
>>> On 23/12/2020 16:15, Jan Beulich wrote:
>>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>>> I think I prefer the small race introduced (the pages will still be
>>>>> freed) over this solution.
>>>>
>>>> The "will still be freed" is because of the 2nd round of freeing
>>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>>> turn avoid that 2nd round of freeing.
>>>
>>> The "2nd round" of freeing is necessary at least for the domain creation
>>> failure case (it would be the 1rst round).
>>>
>>> If we can avoid IOMMU page-table allocation in the domain creation path,
>>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>>> think it is best to take advantage of it.
>>
>> Well, I'm not really certain either way here. If it's really just
>> VMX'es APIC access page that's the problem here, custom cleanup
>> of this "fallout" from VMX code would certainly be an option.
>
> From my testing, it looks like the VMX APIC page is the only entry
> added while the domain is created.
>
>> Furthermore I consider it wrong to insert this page in the IOMMU
>> page tables in the first place. This page overlaps with the MSI
>> special address range, and hence accesses will be dealt with by
>> interrupt remapping machinery (if enabled). If interrupt
>> remapping is disabled, this page should be no different for I/O
>> purposes than all other pages in this window - they shouldn't
>> be mapped at all.
>
> That's a fair point. I will have a look to see if I can avoid the IOMMU
> mapping for the VMX APIC page.
>
>> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
>> should gain an is_special_page() check to prevent propagation to
>> the IOMMU for such pages (we can't do anything about them in
>> shared-page-table setups)? See also the (PV related) comment in
>> cleanup_page_mappings(), a few lines ahead of a respective use of
>> is_special_page(),
>
> There seems to be a mismatch between what the comment says and the code
> adding the page in the IOMMU for PV domain.
>
> AFAICT, the IOMMU mapping will be added via _get_page_type():
>
>     /* Special pages should not be accessible from devices. */
>     struct domain *d = page_get_owner(page);
>
>     if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
>     {
>         mfn_t mfn = page_to_mfn(page);
>
>         if ( (x & PGT_type_mask) == PGT_writable_page )
>             rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
>                                     1ul << PAGE_ORDER_4K);
>         else
>             rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
>                                   1ul << PAGE_ORDER_4K,
>                                   IOMMUF_readable | IOMMUF_writable);
>     }
>
> In this snippet, "special page" is interpreted as a page with no owner.
> However is_special_page() will return true when PGC_extra is set and the
> flag implies there is an owner (see assign_pages()).
>
> So it looks like to me, any pages with PGC_extra would end up to have a
> mapping in the IOMMU pages-tables if they are writable.
>
> This statement also seems to apply for xenheap pages shared with a
> domain (e.g grant-table).
>
> Did I miss anything?

First let me point out that what you quote is not what I had
pointed you at - you refer to _get_page_type() when I suggested
to look at cleanup_page_mappings(). The important aspect for
special pages (here I mean ones having been subject to
share_xen_page_with_guest()) is that their type gets "pinned",
i.e. they'll never be subject to _get_page_type()'s
transitioning of types. As you likely will have noticed,
cleanup_page_mappings() also only gets called when the last
_general_ ref of a page got dropped, i.e. entirely unrelated
to type references.

If PGC_extra pages have a similar requirement, they may need
similar pinning of their types. (Maybe they do; didn't check.)

Jan
Hi Jan,

On 15/01/2021 11:24, Jan Beulich wrote:
> On 14.01.2021 19:53, Julien Grall wrote:
>> On 23/12/2020 16:35, Jan Beulich wrote:
>>> On 23.12.2020 17:19, Julien Grall wrote:
>>>> On 23/12/2020 16:15, Jan Beulich wrote:
>>>>> On 23.12.2020 17:07, Julien Grall wrote:
>>>>>> I think I prefer the small race introduced (the pages will still be
>>>>>> freed) over this solution.
>>>>>
>>>>> The "will still be freed" is because of the 2nd round of freeing
>>>>> you add in an earlier patch? I'd prefer to avoid the race to in
>>>>> turn avoid that 2nd round of freeing.
>>>>
>>>> The "2nd round" of freeing is necessary at least for the domain creation
>>>> failure case (it would be the 1rst round).
>>>>
>>>> If we can avoid IOMMU page-table allocation in the domain creation path,
>>>> then yes I agree that we want to avoid that "2nd round". Otherwise, I
>>>> think it is best to take advantage of it.
>>>
>>> Well, I'm not really certain either way here. If it's really just
>>> VMX'es APIC access page that's the problem here, custom cleanup
>>> of this "fallout" from VMX code would certainly be an option.
>>
>> From my testing, it looks like the VMX APIC page is the only entry
>> added while the domain is created.
>>
>>> Furthermore I consider it wrong to insert this page in the IOMMU
>>> page tables in the first place. This page overlaps with the MSI
>>> special address range, and hence accesses will be dealt with by
>>> interrupt remapping machinery (if enabled). If interrupt
>>> remapping is disabled, this page should be no different for I/O
>>> purposes than all other pages in this window - they shouldn't
>>> be mapped at all.
>>
>> That's a fair point. I will have a look to see if I can avoid the IOMMU
>> mapping for the VMX APIC page.
>>
>>> Perhaps, along the lines of epte_get_entry_emt(), ept_set_entry()
>>> should gain an is_special_page() check to prevent propagation to
>>> the IOMMU for such pages (we can't do anything about them in
>>> shared-page-table setups)? See also the (PV related) comment in
>>> cleanup_page_mappings(), a few lines ahead of a respective use of
>>> is_special_page(),
>>
>> There seems to be a mismatch between what the comment says and the code
>> adding the page in the IOMMU for PV domain.
>>
>> AFAICT, the IOMMU mapping will be added via _get_page_type():
>>
>>     /* Special pages should not be accessible from devices. */
>>     struct domain *d = page_get_owner(page);
>>
>>     if ( d && unlikely(need_iommu_pt_sync(d)) && is_pv_domain(d) )
>>     {
>>         mfn_t mfn = page_to_mfn(page);
>>
>>         if ( (x & PGT_type_mask) == PGT_writable_page )
>>             rc = iommu_legacy_unmap(d, _dfn(mfn_x(mfn)),
>>                                     1ul << PAGE_ORDER_4K);
>>         else
>>             rc = iommu_legacy_map(d, _dfn(mfn_x(mfn)), mfn,
>>                                   1ul << PAGE_ORDER_4K,
>>                                   IOMMUF_readable | IOMMUF_writable);
>>     }
>>
>> In this snippet, "special page" is interpreted as a page with no owner.
>> However is_special_page() will return true when PGC_extra is set and the
>> flag implies there is an owner (see assign_pages()).
>>
>> So it looks like to me, any pages with PGC_extra would end up to have a
>> mapping in the IOMMU pages-tables if they are writable.
>>
>> This statement also seems to apply for xenheap pages shared with a
>> domain (e.g grant-table).
>>
>> Did I miss anything?
>
> First let me point out that what you quote is not what I had
> pointed you at - you refer to _get_page_type() when I suggested
> to look at cleanup_page_mappings().

I know and that's why I pointed out the mismatch between the comments
(in cleanup_page_mappings()) and the code adding the mapping in the
IOMMU (_get_page_type()).

I only looked at _get_page_type() because I wanted to understand how the
IOMMU mapping works for PV.

> The important aspect for
> special pages (here I mean ones having been subject to
> share_xen_page_with_guest()) is that their type gets "pinned",
> i.e. they'll never be subject to _get_page_type()'s
> transitioning of types. As you likely will have noticed,
> cleanup_page_mappings() also only gets called when the last
> _general_ ref of a page got dropped, i.e. entirely unrelated
> to type references.

Ah, that makes sense. I added some debugging in the code and couldn't
really figure out why the transition wasn't happening.

> If PGC_extra pages have a similar requirement, they may need
> similar pinning of their types. (Maybe they do; didn't check.)

I have only looked at the VMX APIC page so far. It effectively pin the
types.

Thanks for the explanation!

Cheers,
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b9ba04633e18..1b7ee5c1a8cb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2290,7 +2290,7 @@ int domain_relinquish_resources(struct domain *d)
 
     PROGRESS(iommu_pagetables):
 
-        ret = iommu_free_pgtables(d);
+        ret = iommu_free_pgtables(d, false);
         if ( ret )
             return ret;
 
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 99a23177b3d2..4a083e4b8f11 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -149,6 +149,21 @@ int arch_iommu_domain_init(struct domain *d)
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
+    struct domain_iommu *hd = dom_iommu(d);
+    int rc;
+
+    /*
+     * The relinquish code will not be executed if the domain creation
+     * failed. To avoid any memory leak, we want to free any IOMMU
+     * page-tables that may have been allocated.
+     */
+    rc = iommu_free_pgtables(d, false);
+
+    /* The preemption was disabled, so the call should never fail. */
+    if ( rc )
+        ASSERT_UNREACHABLE();
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 }
 
 static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
@@ -261,7 +276,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
         return;
 }
 
-int iommu_free_pgtables(struct domain *d)
+int iommu_free_pgtables(struct domain *d, bool preempt)
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct page_info *pg;
@@ -282,7 +297,7 @@ int iommu_free_pgtables(struct domain *d)
     {
         free_domheap_page(pg);
 
-        if ( !(++done & 0xff) && general_preempt_check() )
+        if ( !(++done & 0xff) && preempt && general_preempt_check() )
         {
             spin_unlock(&hd->arch.pgtables.lock);
             return -ERESTART;
@@ -305,6 +320,19 @@ struct page_info *iommu_alloc_pgtable(struct domain *d)
     memflags = MEMF_node(hd->node);
 #endif
 
+    /*
+     * The IOMMU page-tables are freed when relinquishing the domain, but
+     * nothing prevent allocation to happen afterwards. There is no valid
+     * reasons to continue to update the IOMMU page-tables while the
+     * domain is dying.
+     *
+     * So prevent page-table allocation when the domain is dying. Note
+     * this doesn't fully prevent the race because d->is_dying may not
+     * yet be seen.
+     */
+    if ( d->is_dying )
+        return NULL;
+
     pg = alloc_domheap_page(NULL, memflags);
     if ( !pg )
         return NULL;
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 970eb06ffac5..874bb5bbfbde 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -135,7 +135,7 @@ int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
-int __must_check iommu_free_pgtables(struct domain *d);
+int __must_check iommu_free_pgtables(struct domain *d, bool preempt);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */