diff mbox series

[v2,07/18] IOMMU/x86: perform PV Dom0 mappings in batches

Message ID cc183d61-f4ed-1cf5-185b-0673c9927892@suse.com (mailing list archive)
State New, archived
Headers show
Series IOMMU: superpage support when not sharing pagetables | expand

Commit Message

Jan Beulich Sept. 24, 2021, 9:47 a.m. UTC
For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).

The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.

There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Subsequently p2m_add_identity_entry() may want to also gain an order
parameter, for arch_iommu_hwdom_init() to use. While this only affects
non-RAM regions, systems typically have 2-16Mb of reserved space
immediately below 4Gb, which hence could be mapped more efficiently.

The installing of zero-ref writable types has in fact shown (observed
while putting together the change) that despite the intention by the
XSA-288 changes (affecting DomU-s only) for Dom0 a number of
sufficiently ordinary pages (at the very least initrd and P2M ones as
well as pages that are part of the initial allocation but not part of
the initial mapping) still have been starting out as PGT_none, meaning
that they would have gained IOMMU mappings only the first time these
pages would get mapped writably.

I didn't think I need to address the bug mentioned in the description in
a separate (prereq) patch, but if others disagree I could certainly
break out that part (needing to first use iommu_legacy_unmap() then).

Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
They'll end up mapped via the later get_page_and_type().

As to the way these refs get installed: I've chosen to avoid the more
expensive {get,put}_page_and_type(), putting in place the intended type
directly. I guess I could be convinced to avoid this bypassing of the
actual logic; I merely think it's unnecessarily expensive.

Comments

Roger Pau Monné Dec. 2, 2021, 2:10 p.m. UTC | #1
On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> For large page mappings to be easily usable (i.e. in particular without
> un-shattering of smaller page mappings) and for mapping operations to
> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> In dom0_construct_pv() and its helpers (covering strict mode) this
> additionally requires establishing the type of those pages (albeit with
> zero type references).
> 
> The earlier establishing of PGT_writable_page | PGT_validated requires
> the existing places where this gets done (through get_page_and_type())
> to be updated: For pages which actually have a mapping, the type
> refcount needs to be 1.
> 
> There is actually a related bug that gets fixed here as a side effect:
> Typically the last L1 table would get marked as such only after
> get_page_and_type(..., PGT_writable_page). While this is fine as far as
> refcounting goes, the page did remain mapped in the IOMMU in this case
> (when "iommu=dom0-strict").
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Subsequently p2m_add_identity_entry() may want to also gain an order
> parameter, for arch_iommu_hwdom_init() to use. While this only affects
> non-RAM regions, systems typically have 2-16Mb of reserved space
> immediately below 4Gb, which hence could be mapped more efficiently.
> 
> The installing of zero-ref writable types has in fact shown (observed
> while putting together the change) that despite the intention by the
> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> sufficiently ordinary pages (at the very least initrd and P2M ones as
> well as pages that are part of the initial allocation but not part of
> the initial mapping) still have been starting out as PGT_none, meaning
> that they would have gained IOMMU mappings only the first time these
> pages would get mapped writably.
> 
> I didn't think I need to address the bug mentioned in the description in
> a separate (prereq) patch, but if others disagree I could certainly
> break out that part (needing to first use iommu_legacy_unmap() then).
> 
> Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
> They'll end up mapped via the later get_page_and_type().
> 
> As to the way these refs get installed: I've chosen to avoid the more
> expensive {get,put}_page_and_type(), putting in place the intended type
> directly. I guess I could be convinced to avoid this bypassing of the
> actual logic; I merely think it's unnecessarily expensive.
> 
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -106,11 +106,26 @@ static __init void mark_pv_pt_pages_rdon
>      unmap_domain_page(pl3e);
>  }
>  
> +/*
> + * For IOMMU mappings done while building Dom0 the type of the pages needs to
> + * match (for _get_page_type() to unmap upon type change). Set the pages to
> + * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
> + */
> +static void __init make_pages_writable(struct page_info *page, unsigned long nr)
> +{
> +    for ( ; nr--; ++page )
> +    {
> +        ASSERT(!page->u.inuse.type_info);
> +        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
> +    }
> +}
> +
>  static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
>                                      unsigned long v_start, unsigned long v_end,
>                                      unsigned long vphysmap_start,
>                                      unsigned long vphysmap_end,
> -                                    unsigned long nr_pages)
> +                                    unsigned long nr_pages,
> +                                    unsigned int *flush_flags)
>  {
>      struct page_info *page = NULL;
>      l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
> @@ -123,6 +138,8 @@ static __init void setup_pv_physmap(stru
>  
>      while ( vphysmap_start < vphysmap_end )
>      {
> +        int rc = 0;
> +
>          if ( domain_tot_pages(d) +
>               ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
>               3 > nr_pages )
> @@ -176,7 +193,22 @@ static __init void setup_pv_physmap(stru
>                                               L3_PAGETABLE_SHIFT - PAGE_SHIFT,
>                                               MEMF_no_scrub)) != NULL )
>              {
> -                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
> +                mfn_t mfn = page_to_mfn(page);
> +
> +                if ( need_iommu_pt_sync(d) )
> +                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
> +                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
> +                                   IOMMUF_readable | IOMMUF_writable,
> +                                   flush_flags);
> +                if ( !rc )
> +                    make_pages_writable(page,
> +                                        SUPERPAGE_PAGES * SUPERPAGE_PAGES);
> +                else
> +                    printk(XENLOG_ERR
> +                           "pre-mapping P2M 1G-MFN %lx into IOMMU failed: %d\n",
> +                           mfn_x(mfn), rc);
> +
> +                *pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
>                  vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
>                  continue;
>              }
> @@ -202,7 +234,20 @@ static __init void setup_pv_physmap(stru
>                                               L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>                                               MEMF_no_scrub)) != NULL )
>              {
> -                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
> +                mfn_t mfn = page_to_mfn(page);
> +
> +                if ( need_iommu_pt_sync(d) )
> +                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, SUPERPAGE_PAGES,
> +                                   IOMMUF_readable | IOMMUF_writable,
> +                                   flush_flags);
> +                if ( !rc )
> +                    make_pages_writable(page, SUPERPAGE_PAGES);
> +                else
> +                    printk(XENLOG_ERR
> +                           "pre-mapping P2M 2M-MFN %lx into IOMMU failed: %d\n",
> +                           mfn_x(mfn), rc);
> +
> +                *pl2e = l2e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
>                  vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
>                  continue;
>              }
> @@ -310,6 +355,7 @@ int __init dom0_construct_pv(struct doma
>      unsigned long initrd_pfn = -1, initrd_mfn = 0;
>      unsigned long count;
>      struct page_info *page = NULL;
> +    unsigned int flush_flags = 0;
>      start_info_t *si;
>      struct vcpu *v = d->vcpu[0];
>      void *image_base = bootstrap_map(image);
> @@ -572,6 +618,18 @@ int __init dom0_construct_pv(struct doma
>                      BUG();
>          }
>          initrd->mod_end = 0;
> +
> +        count = PFN_UP(initrd_len);
> +
> +        if ( need_iommu_pt_sync(d) )
> +            rc = iommu_map(d, _dfn(initrd_mfn), _mfn(initrd_mfn), count,
> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +        if ( !rc )
> +            make_pages_writable(mfn_to_page(_mfn(initrd_mfn)), count);
> +        else
> +            printk(XENLOG_ERR
> +                   "pre-mapping initrd (MFN %lx) into IOMMU failed: %d\n",
> +                   initrd_mfn, rc);
>      }
>  
>      printk("PHYSICAL MEMORY ARRANGEMENT:\n"
> @@ -605,6 +663,22 @@ int __init dom0_construct_pv(struct doma
>  
>      process_pending_softirqs();
>  
> +    /*
> +     * We map the full range here and then punch a hole for page tables via
> +     * iommu_unmap() further down, once they have got marked as such.
> +     */
> +    if ( need_iommu_pt_sync(d) )
> +        rc = iommu_map(d, _dfn(alloc_spfn), _mfn(alloc_spfn),
> +                       alloc_epfn - alloc_spfn,
> +                       IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +    if ( !rc )
> +        make_pages_writable(mfn_to_page(_mfn(alloc_spfn)),
> +                            alloc_epfn - alloc_spfn);
> +    else
> +        printk(XENLOG_ERR
> +               "pre-mapping MFNs [%lx,%lx) into IOMMU failed: %d\n",
> +               alloc_spfn, alloc_epfn, rc);
> +
>      mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
>      if ( vinitrd_start )
>          mpt_alloc -= PAGE_ALIGN(initrd_len);
> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
>          l1tab++;
>  
>          page = mfn_to_page(_mfn(mfn));
> -        if ( !page->u.inuse.type_info &&
> +        if ( (!page->u.inuse.type_info ||
> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&

Would it be clearer to get page for all pages that have a 0 count:
!(type_info & PGT_count_mask). Or would that interact badly with page
table pages?

>               !get_page_and_type(page, d, PGT_writable_page) )
>              BUG();
>      }
> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>      /* Pages that are part of page tables must be read only. */
>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>  
> +    /*
> +     * This needs to come after all potentially excess
> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> +     * few lines further up, where the effect of calling that function in an
> +     * earlier loop iteration may get overwritten by a later one.
> +     */
> +    if ( need_iommu_pt_sync(d) &&
> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> +                     &flush_flags) )
> +        BUG();

Wouldn't such unmap better happen as part of changing the types of the
pages that become part of the guest page tables?

>      /* Mask all upcalls... */
>      for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
>          shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
> @@ -793,7 +879,7 @@ int __init dom0_construct_pv(struct doma
>      {
>          pfn = pagetable_get_pfn(v->arch.guest_table);
>          setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
> -                         nr_pages);
> +                         nr_pages, &flush_flags);
>      }
>  
>      /* Write the phys->machine and machine->phys table entries. */
> @@ -824,7 +910,9 @@ int __init dom0_construct_pv(struct doma
>          if ( get_gpfn_from_mfn(mfn) >= count )
>          {
>              BUG_ON(compat);
> -            if ( !page->u.inuse.type_info &&
> +            if ( (!page->u.inuse.type_info ||
> +                  page->u.inuse.type_info == (PGT_writable_page |
> +                                              PGT_validated)) &&
>                   !get_page_and_type(page, d, PGT_writable_page) )
>                  BUG();
>  
> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>  #endif
>      while ( pfn < nr_pages )
>      {
> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
> +        count = domain_tot_pages(d);
> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>              panic("Not enough RAM for DOM0 reservation\n");
> +        mfn = mfn_x(page_to_mfn(page));
> +
> +        if ( need_iommu_pt_sync(d) )
> +        {
> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> +            if ( rc )
> +                printk(XENLOG_ERR
> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
> +                       mfn, pfn, rc);
> +        }
> +
>          while ( pfn < domain_tot_pages(d) )
>          {
> -            mfn = mfn_x(page_to_mfn(page));
> +            if ( !rc )
> +                make_pages_writable(page, 1);

There's quite a lot of repetition of the pattern: allocate, iommu_map,
set as writable. Would it be possible to abstract this into some
kind of helper?

I've realized some of the allocations use alloc_chunk while others use
alloc_domheap_pages, so it might require some work.

> +
>  #ifndef NDEBUG
>  #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
>  #endif
>              dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
>  #undef pfn
> -            page++; pfn++;
> +            page++; mfn++; pfn++;
>              if ( !(pfn & 0xfffff) )
>                  process_pending_softirqs();
>          }
>      }
>  
> +    /* Use while() to avoid compiler warning. */
> +    while ( iommu_iotlb_flush_all(d, flush_flags) )
> +        break;

Might be worth to print a message here in case of error?

> +
>      if ( initrd_len != 0 )
>      {
>          si->mod_start = vinitrd_start ?: initrd_pfn;
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -325,8 +325,8 @@ static unsigned int __hwdom_init hwdom_i
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
> -    unsigned long i, top, max_pfn;
> -    unsigned int flush_flags = 0;
> +    unsigned long i, top, max_pfn, start, count;
> +    unsigned int flush_flags = 0, start_perms = 0;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -357,9 +357,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>       * setting up potentially conflicting mappings here.
>       */
> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>  
> -    for ( ; i < top; i++ )
> +    for ( i = start, count = 0; i < top; )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>                                          perms & IOMMUF_writable ? p2m_access_rw
>                                                                  : p2m_access_r,
>                                          0);
> +        else if ( pfn != start + count || perms != start_perms )
> +        {
> +        commit:
> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> +                           start_perms, &flush_flags);
> +            SWAP(start, pfn);
> +            start_perms = perms;
> +            count = 1;
> +        }
>          else
> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           perms, &flush_flags);
> +        {
> +            ++count;
> +            rc = 0;
> +        }
>  
>          if ( rc )
>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);

Would be nice to print the count (or end pfn) in case it's a range.

While not something that you have to fix here, the logic here is
becoming overly complicated IMO. It might be easier to just put all
the ram and reserved regions (or everything < 4G) into a rangeset and
then punch holes on it for non guest mappable regions, and finally use
rangeset_consume_ranges to iterate and map those. That's likely faster
than having to iterate over all pfns on the system, and easier to
understand from a logic PoV.

Thanks, Roger.
Jan Beulich Dec. 3, 2021, 12:38 p.m. UTC | #2
On 02.12.2021 15:10, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
>> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
>>          l1tab++;
>>  
>>          page = mfn_to_page(_mfn(mfn));
>> -        if ( !page->u.inuse.type_info &&
>> +        if ( (!page->u.inuse.type_info ||
>> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
> 
> Would it be clearer to get page for all pages that have a 0 count:
> !(type_info & PGT_count_mask). Or would that interact badly with page
> table pages?

Indeed this wouldn't work with page tables (and I recall having learned
this the hard way): Prior to mark_pv_pt_pages_rdonly() they all have a
type refcount of zero. Even if it wasn't for this, I'd prefer to not
relax the condition here more than necessary.

>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>>      /* Pages that are part of page tables must be read only. */
>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>>  
>> +    /*
>> +     * This needs to come after all potentially excess
>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
>> +     * few lines further up, where the effect of calling that function in an
>> +     * earlier loop iteration may get overwritten by a later one.
>> +     */
>> +    if ( need_iommu_pt_sync(d) &&
>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
>> +                     &flush_flags) )
>> +        BUG();
> 
> Wouldn't such unmap better happen as part of changing the types of the
> pages that become part of the guest page tables?

Not sure - it's a single call here, but would be a call per page when
e.g. moved into mark_pv_pt_pages_rdonly().

>> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>>  #endif
>>      while ( pfn < nr_pages )
>>      {
>> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
>> +        count = domain_tot_pages(d);
>> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>>              panic("Not enough RAM for DOM0 reservation\n");
>> +        mfn = mfn_x(page_to_mfn(page));
>> +
>> +        if ( need_iommu_pt_sync(d) )
>> +        {
>> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
>> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_ERR
>> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
>> +                       mfn, pfn, rc);
>> +        }
>> +
>>          while ( pfn < domain_tot_pages(d) )
>>          {
>> -            mfn = mfn_x(page_to_mfn(page));
>> +            if ( !rc )
>> +                make_pages_writable(page, 1);
> 
> There's quite a lot of repetition of the pattern: allocate, iommu_map,
> set as writable. Would it be possible to abstract this into some
> kind of helper?
> 
> I've realized some of the allocations use alloc_chunk while others use
> alloc_domheap_pages, so it might require some work.

Right, I'd leave the allocation part aside for the moment. I had actually
considered to fold iommu_map() and make_pages_writable() into a common
helper (or really rename make_pages_writable() and fold iommu_map() into
there). What I lacked was a reasonable, not overly long name for such a
function. Plus - maybe minor - I wanted to avoid extra MFN <-> page
translations.

With a fair name suggested, I'd be happy to give this a try.

>>  #ifndef NDEBUG
>>  #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
>>  #endif
>>              dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
>>  #undef pfn
>> -            page++; pfn++;
>> +            page++; mfn++; pfn++;
>>              if ( !(pfn & 0xfffff) )
>>                  process_pending_softirqs();
>>          }
>>      }
>>  
>> +    /* Use while() to avoid compiler warning. */
>> +    while ( iommu_iotlb_flush_all(d, flush_flags) )
>> +        break;
> 
> Might be worth to print a message here in case of error?

Maybe. But then not just here. See e.g. arch_iommu_hwdom_init().

>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
>> +                           start_perms, &flush_flags);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
>> +        }
>>  
>>          if ( rc )
>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> 
> Would be nice to print the count (or end pfn) in case it's a range.

I can do so if you think it's worth further extra code. I can't use
"count" here in particular, as that was updated already (in context
above). The most reasonable change towards this would perhaps be to
duplicate the printk() into both the "if()" and the "else if()" bodies.

> While not something that you have to fix here, the logic here is
> becoming overly complicated IMO. It might be easier to just put all
> the ram and reserved regions (or everything < 4G) into a rangeset and
> then punch holes on it for non guest mappable regions, and finally use
> rangeset_consume_ranges to iterate and map those. That's likely faster
> than having to iterate over all pfns on the system, and easier to
> understand from a logic PoV.

Maybe; I didn't spend much time on figuring possible ways of
consolidating some of this.

Jan
Roger Pau Monné Dec. 10, 2021, 9:36 a.m. UTC | #3
On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
> On 02.12.2021 15:10, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> >> @@ -689,7 +763,8 @@ int __init dom0_construct_pv(struct doma
> >>          l1tab++;
> >>  
> >>          page = mfn_to_page(_mfn(mfn));
> >> -        if ( !page->u.inuse.type_info &&
> >> +        if ( (!page->u.inuse.type_info ||
> >> +              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
> > 
> > Would it be clearer to get page for all pages that have a 0 count:
> > !(type_info & PGT_count_mask). Or would that interact badly with page
> > table pages?
> 
> Indeed this wouldn't work with page tables (and I recall having learned
> this the hard way): Prior to mark_pv_pt_pages_rdonly() they all have a
> type refcount of zero. Even if it wasn't for this, I'd prefer to not
> relax the condition here more than necessary.

Right. Page tables will have some types set but a count of 0.

> >> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
> >>      /* Pages that are part of page tables must be read only. */
> >>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
> >>  
> >> +    /*
> >> +     * This needs to come after all potentially excess
> >> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> >> +     * few lines further up, where the effect of calling that function in an
> >> +     * earlier loop iteration may get overwritten by a later one.
> >> +     */
> >> +    if ( need_iommu_pt_sync(d) &&
> >> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> >> +                     &flush_flags) )
> >> +        BUG();
> > 
> > Wouldn't such unmap better happen as part of changing the types of the
> > pages that become part of the guest page tables?
> 
> Not sure - it's a single call here, but would be a call per page when
> e.g. moved into mark_pv_pt_pages_rdonly().

I see. So this would result in multiple calls when moved, plus all the
involved page shattering and aggregation logic. Overall it would be
less error prone, as the iommu unmap would happen next to the type
change, but I'm not going to insist if you think it's not worth it.
The page table structure pages shouldn't be that many anyway?

> >> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
> >>  #endif
> >>      while ( pfn < nr_pages )
> >>      {
> >> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
> >> +        count = domain_tot_pages(d);
> >> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
> >>              panic("Not enough RAM for DOM0 reservation\n");
> >> +        mfn = mfn_x(page_to_mfn(page));
> >> +
> >> +        if ( need_iommu_pt_sync(d) )
> >> +        {
> >> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
> >> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
> >> +            if ( rc )
> >> +                printk(XENLOG_ERR
> >> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
> >> +                       mfn, pfn, rc);
> >> +        }
> >> +
> >>          while ( pfn < domain_tot_pages(d) )
> >>          {
> >> -            mfn = mfn_x(page_to_mfn(page));
> >> +            if ( !rc )
> >> +                make_pages_writable(page, 1);
> > 
> > There's quite a lot of repetition of the pattern: allocate, iommu_map,
> > set as writable. Would it be possible to abstract this into some
> > kind of helper?
> > 
> > I've realized some of the allocations use alloc_chunk while others use
> > alloc_domheap_pages, so it might require some work.
> 
> Right, I'd leave the allocation part aside for the moment. I had actually
> considered to fold iommu_map() and make_pages_writable() into a common
> helper (or really rename make_pages_writable() and fold iommu_map() into
> there). What I lacked was a reasonable, not overly long name for such a
> function.

I'm not overly good at naming, but I think we need to somehow find a
way to place those together into a single helper.

I would be fine with naming this iommu_memory_{setup,add} or some
such. Marking the pages as writable is a result (or a requirement
might be a better way to express it?) of adding them to the IOMMU.
Would you be OK with one of those names?

> >> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>                                                                  : p2m_access_r,
> >>                                          0);
> >> +        else if ( pfn != start + count || perms != start_perms )
> >> +        {
> >> +        commit:
> >> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> >> +                           start_perms, &flush_flags);
> >> +            SWAP(start, pfn);
> >> +            start_perms = perms;
> >> +            count = 1;
> >> +        }
> >>          else
> >> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >> -                           perms, &flush_flags);
> >> +        {
> >> +            ++count;
> >> +            rc = 0;
> >> +        }
> >>  
> >>          if ( rc )
> >>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
> >>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> > 
> > Would be nice to print the count (or end pfn) in case it's a range.
> 
> I can do so if you think it's worth further extra code. I can't use
> "count" here in particular, as that was updated already (in context
> above). The most reasonable change towards this would perhaps be to
> duplicate the printk() into both the "if()" and the "else if()" bodies.

Maybe. The current message gives the impression that a single pfn has
been added and failed, but without printing the range that failed the
message will not be that helpful in diagnosing further issues that
might arise due to the mapping failure.

> > While not something that you have to fix here, the logic here is
> > becoming overly complicated IMO. It might be easier to just put all
> > the ram and reserved regions (or everything < 4G) into a rangeset and
> > then punch holes on it for non guest mappable regions, and finally use
> > rangeset_consume_ranges to iterate and map those. That's likely faster
> > than having to iterate over all pfns on the system, and easier to
> > understand from a logic PoV.
> 
> Maybe; I didn't spend much time on figuring possible ways of
> consolidating some of this.

I can give it a try after your code is merged. I think it's getting a
bit messy.

Thanks, Roger.
Jan Beulich Dec. 10, 2021, 11:41 a.m. UTC | #4
On 10.12.2021 10:36, Roger Pau Monné wrote:
> On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
>> On 02.12.2021 15:10, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
>>>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
>>>>      /* Pages that are part of page tables must be read only. */
>>>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>>>>  
>>>> +    /*
>>>> +     * This needs to come after all potentially excess
>>>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
>>>> +     * few lines further up, where the effect of calling that function in an
>>>> +     * earlier loop iteration may get overwritten by a later one.
>>>> +     */
>>>> +    if ( need_iommu_pt_sync(d) &&
>>>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
>>>> +                     &flush_flags) )
>>>> +        BUG();
>>>
>>> Wouldn't such unmap better happen as part of changing the types of the
>>> pages that become part of the guest page tables?
>>
>> Not sure - it's a single call here, but would be a call per page when
>> e.g. moved into mark_pv_pt_pages_rdonly().
> 
> I see. So this would result in multiple calls when moved, plus all the
> involved page shattering and aggregation logic. Overall it would be
> less error prone, as the iommu unmap would happen next to the type
> change, but I'm not going to insist if you think it's not worth it.
> The page table structure pages shouldn't be that many anyway?

Typically it wouldn't be that many, true. I'm not sure about "less
error prone", though: We'd have more problems if the range unmapped
here wasn't properly representing the set of page tables used.

>>>> @@ -840,22 +928,41 @@ int __init dom0_construct_pv(struct doma
>>>>  #endif
>>>>      while ( pfn < nr_pages )
>>>>      {
>>>> -        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
>>>> +        count = domain_tot_pages(d);
>>>> +        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
>>>>              panic("Not enough RAM for DOM0 reservation\n");
>>>> +        mfn = mfn_x(page_to_mfn(page));
>>>> +
>>>> +        if ( need_iommu_pt_sync(d) )
>>>> +        {
>>>> +            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
>>>> +                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
>>>> +            if ( rc )
>>>> +                printk(XENLOG_ERR
>>>> +                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
>>>> +                       mfn, pfn, rc);
>>>> +        }
>>>> +
>>>>          while ( pfn < domain_tot_pages(d) )
>>>>          {
>>>> -            mfn = mfn_x(page_to_mfn(page));
>>>> +            if ( !rc )
>>>> +                make_pages_writable(page, 1);
>>>
>>> There's quite a lot of repetition of the pattern: allocate, iommu_map,
>>> set as writable. Would it be possible to abstract this into some
>>> kind of helper?
>>>
>>> I've realized some of the allocations use alloc_chunk while others use
>>> alloc_domheap_pages, so it might require some work.
>>
>> Right, I'd leave the allocation part aside for the moment. I had actually
>> considered to fold iommu_map() and make_pages_writable() into a common
>> helper (or really rename make_pages_writable() and fold iommu_map() into
>> there). What I lacked was a reasonable, not overly long name for such a
>> function.
> 
> I'm not overly good at naming, but I think we need to somehow find a
> way to place those together into a single helper.
> 
> I would be fine with naming this iommu_memory_{setup,add} or some
> such. Marking the pages as writable is a result (or a requirement
> might be a better way to express it?) of adding them to the IOMMU.
> Would you be OK with one of those names?

I'll use the suggestion as a basis and see how it ends up looking /
feeling.

>>>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
>>>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>>>                                                                  : p2m_access_r,
>>>>                                          0);
>>>> +        else if ( pfn != start + count || perms != start_perms )
>>>> +        {
>>>> +        commit:
>>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
>>>> +                           start_perms, &flush_flags);
>>>> +            SWAP(start, pfn);
>>>> +            start_perms = perms;
>>>> +            count = 1;
>>>> +        }
>>>>          else
>>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>>>> -                           perms, &flush_flags);
>>>> +        {
>>>> +            ++count;
>>>> +            rc = 0;
>>>> +        }
>>>>  
>>>>          if ( rc )
>>>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
>>>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
>>>
>>> Would be nice to print the count (or end pfn) in case it's a range.
>>
>> I can do so if you think it's worth further extra code. I can't use
>> "count" here in particular, as that was updated already (in context
>> above). The most reasonable change towards this would perhaps be to
>> duplicate the printk() into both the "if()" and the "else if()" bodies.
> 
> Maybe. The current message gives the impression that a single pfn has
> been added and failed, but without printing the range that failed the
> message will not be that helpful in diagnosing further issues that
> might arise due to the mapping failure.

I guess I'll make the change then. I'm still not really convinced though,
as the presence of the message should be far more concerning than whether
it's a single page or a range. As middle ground, would

             printk(XENLOG_WARNING "%pd: identity %smapping of %lx... failed: %d\n",

be indicative enough of this perhaps not having been just a single page?
Otoh splitting (and moving) the message would allow to drop the separate
paging_mode_translate() check.

Jan
Roger Pau Monné Dec. 10, 2021, 12:35 p.m. UTC | #5
On Fri, Dec 10, 2021 at 12:41:31PM +0100, Jan Beulich wrote:
> On 10.12.2021 10:36, Roger Pau Monné wrote:
> > On Fri, Dec 03, 2021 at 01:38:48PM +0100, Jan Beulich wrote:
> >> On 02.12.2021 15:10, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:47:41AM +0200, Jan Beulich wrote:
> >>>> @@ -720,6 +795,17 @@ int __init dom0_construct_pv(struct doma
> >>>>      /* Pages that are part of page tables must be read only. */
> >>>>      mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
> >>>>  
> >>>> +    /*
> >>>> +     * This needs to come after all potentially excess
> >>>> +     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
> >>>> +     * few lines further up, where the effect of calling that function in an
> >>>> +     * earlier loop iteration may get overwritten by a later one.
> >>>> +     */
> >>>> +    if ( need_iommu_pt_sync(d) &&
> >>>> +         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
> >>>> +                     &flush_flags) )
> >>>> +        BUG();
> >>>
> >>> Wouldn't such unmap better happen as part of changing the types of the
> >>> pages that become part of the guest page tables?
> >>
> >> Not sure - it's a single call here, but would be a call per page when
> >> e.g. moved into mark_pv_pt_pages_rdonly().
> > 
> > I see. So this would result in multiple calls when moved, plus all the
> > involved page shattering and aggregation logic. Overall it would be
> > less error prone, as the iommu unmap would happen next to the type
> > change, but I'm not going to insist if you think it's not worth it.
> > The page table structure pages shouldn't be that many anyway?
> 
> Typically it wouldn't be that many, true. I'm not sure about "less
> error prone", though: We'd have more problems if the range unmapped
> here wasn't properly representing the set of page tables used.

I have to admit I'm biased regarding the PV dom0 building code because
I find it utterly hard to follow, so IMO pairing the unmap call with
the code that marks the pages as read-only seemed less error prone and
less likely to go out of sync with regards to future changes.

That said, if you still feel it's better to do it in a block here I
won't argue anymore.

> >>>> @@ -372,16 +372,30 @@ void __hwdom_init arch_iommu_hwdom_init(
> >>>>                                          perms & IOMMUF_writable ? p2m_access_rw
> >>>>                                                                  : p2m_access_r,
> >>>>                                          0);
> >>>> +        else if ( pfn != start + count || perms != start_perms )
> >>>> +        {
> >>>> +        commit:
> >>>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count,
> >>>> +                           start_perms, &flush_flags);
> >>>> +            SWAP(start, pfn);
> >>>> +            start_perms = perms;
> >>>> +            count = 1;
> >>>> +        }
> >>>>          else
> >>>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> >>>> -                           perms, &flush_flags);
> >>>> +        {
> >>>> +            ++count;
> >>>> +            rc = 0;
> >>>> +        }
> >>>>  
> >>>>          if ( rc )
> >>>>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
> >>>>                     d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
> >>>
> >>> Would be nice to print the count (or end pfn) in case it's a range.
> >>
> >> I can do so if you think it's worth further extra code. I can't use
> >> "count" here in particular, as that was updated already (in context
> >> above). The most reasonable change towards this would perhaps be to
> >> duplicate the printk() into both the "if()" and the "else if()" bodies.
> > 
> > Maybe. The current message gives the impression that a single pfn has
> > been added and failed, but without printing the range that failed the
> > message will not be that helpful in diagnosing further issues that
> > might arise due to the mapping failure.
> 
> I guess I'll make the change then. I'm still not really convinced though,
> as the presence of the message should be far more concerning than whether
> it's a single page or a range. As middle ground, would
> 
>              printk(XENLOG_WARNING "%pd: identity %smapping of %lx... failed: %d\n",
> 
> be indicative enough of this perhaps not having been just a single page?

Let's go with that last suggestion then.

I would like to attempt to simplify part of the logic here, at which
point it might be easier to print a unified message for both the
translated and non-translated guests.

Thanks, Roger.
diff mbox series

Patch

--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -106,11 +106,26 @@  static __init void mark_pv_pt_pages_rdon
     unmap_domain_page(pl3e);
 }
 
+/*
+ * For IOMMU mappings done while building Dom0 the type of the pages needs to
+ * match (for _get_page_type() to unmap upon type change). Set the pages to
+ * writable with no type ref. NB: This is benign when !need_iommu_pt_sync(d).
+ */
+static void __init make_pages_writable(struct page_info *page, unsigned long nr)
+{
+    for ( ; nr--; ++page )
+    {
+        ASSERT(!page->u.inuse.type_info);
+        page->u.inuse.type_info = PGT_writable_page | PGT_validated;
+    }
+}
+
 static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
                                     unsigned long v_start, unsigned long v_end,
                                     unsigned long vphysmap_start,
                                     unsigned long vphysmap_end,
-                                    unsigned long nr_pages)
+                                    unsigned long nr_pages,
+                                    unsigned int *flush_flags)
 {
     struct page_info *page = NULL;
     l4_pgentry_t *pl4e, *l4start = map_domain_page(_mfn(pgtbl_pfn));
@@ -123,6 +138,8 @@  static __init void setup_pv_physmap(stru
 
     while ( vphysmap_start < vphysmap_end )
     {
+        int rc = 0;
+
         if ( domain_tot_pages(d) +
              ((round_pgup(vphysmap_end) - vphysmap_start) >> PAGE_SHIFT) +
              3 > nr_pages )
@@ -176,7 +193,22 @@  static __init void setup_pv_physmap(stru
                                              L3_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl3e = l3e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn,
+                                   SUPERPAGE_PAGES * SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page,
+                                        SUPERPAGE_PAGES * SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 1G-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl3e = l3e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
             }
@@ -202,7 +234,20 @@  static __init void setup_pv_physmap(stru
                                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                                              MEMF_no_scrub)) != NULL )
             {
-                *pl2e = l2e_from_page(page, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
+                mfn_t mfn = page_to_mfn(page);
+
+                if ( need_iommu_pt_sync(d) )
+                    rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, SUPERPAGE_PAGES,
+                                   IOMMUF_readable | IOMMUF_writable,
+                                   flush_flags);
+                if ( !rc )
+                    make_pages_writable(page, SUPERPAGE_PAGES);
+                else
+                    printk(XENLOG_ERR
+                           "pre-mapping P2M 2M-MFN %lx into IOMMU failed: %d\n",
+                           mfn_x(mfn), rc);
+
+                *pl2e = l2e_from_mfn(mfn, L1_PROT|_PAGE_DIRTY|_PAGE_PSE);
                 vphysmap_start += 1UL << L2_PAGETABLE_SHIFT;
                 continue;
             }
@@ -310,6 +355,7 @@  int __init dom0_construct_pv(struct doma
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
     struct page_info *page = NULL;
+    unsigned int flush_flags = 0;
     start_info_t *si;
     struct vcpu *v = d->vcpu[0];
     void *image_base = bootstrap_map(image);
@@ -572,6 +618,18 @@  int __init dom0_construct_pv(struct doma
                     BUG();
         }
         initrd->mod_end = 0;
+
+        count = PFN_UP(initrd_len);
+
+        if ( need_iommu_pt_sync(d) )
+            rc = iommu_map(d, _dfn(initrd_mfn), _mfn(initrd_mfn), count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+        if ( !rc )
+            make_pages_writable(mfn_to_page(_mfn(initrd_mfn)), count);
+        else
+            printk(XENLOG_ERR
+                   "pre-mapping initrd (MFN %lx) into IOMMU failed: %d\n",
+                   initrd_mfn, rc);
     }
 
     printk("PHYSICAL MEMORY ARRANGEMENT:\n"
@@ -605,6 +663,22 @@  int __init dom0_construct_pv(struct doma
 
     process_pending_softirqs();
 
+    /*
+     * We map the full range here and then punch a hole for page tables via
+     * iommu_unmap() further down, once they have got marked as such.
+     */
+    if ( need_iommu_pt_sync(d) )
+        rc = iommu_map(d, _dfn(alloc_spfn), _mfn(alloc_spfn),
+                       alloc_epfn - alloc_spfn,
+                       IOMMUF_readable | IOMMUF_writable, &flush_flags);
+    if ( !rc )
+        make_pages_writable(mfn_to_page(_mfn(alloc_spfn)),
+                            alloc_epfn - alloc_spfn);
+    else
+        printk(XENLOG_ERR
+               "pre-mapping MFNs [%lx,%lx) into IOMMU failed: %d\n",
+               alloc_spfn, alloc_epfn, rc);
+
     mpt_alloc = (vpt_start - v_start) + pfn_to_paddr(alloc_spfn);
     if ( vinitrd_start )
         mpt_alloc -= PAGE_ALIGN(initrd_len);
@@ -689,7 +763,8 @@  int __init dom0_construct_pv(struct doma
         l1tab++;
 
         page = mfn_to_page(_mfn(mfn));
-        if ( !page->u.inuse.type_info &&
+        if ( (!page->u.inuse.type_info ||
+              page->u.inuse.type_info == (PGT_writable_page | PGT_validated)) &&
              !get_page_and_type(page, d, PGT_writable_page) )
             BUG();
     }
@@ -720,6 +795,17 @@  int __init dom0_construct_pv(struct doma
     /* Pages that are part of page tables must be read only. */
     mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
 
+    /*
+     * This needs to come after all potentially excess
+     * get_page_and_type(..., PGT_writable_page) invocations; see the loop a
+     * few lines further up, where the effect of calling that function in an
+     * earlier loop iteration may get overwritten by a later one.
+     */
+    if ( need_iommu_pt_sync(d) &&
+         iommu_unmap(d, _dfn(PFN_DOWN(mpt_alloc) - nr_pt_pages), nr_pt_pages,
+                     &flush_flags) )
+        BUG();
+
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
         shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
@@ -793,7 +879,7 @@  int __init dom0_construct_pv(struct doma
     {
         pfn = pagetable_get_pfn(v->arch.guest_table);
         setup_pv_physmap(d, pfn, v_start, v_end, vphysmap_start, vphysmap_end,
-                         nr_pages);
+                         nr_pages, &flush_flags);
     }
 
     /* Write the phys->machine and machine->phys table entries. */
@@ -824,7 +910,9 @@  int __init dom0_construct_pv(struct doma
         if ( get_gpfn_from_mfn(mfn) >= count )
         {
             BUG_ON(compat);
-            if ( !page->u.inuse.type_info &&
+            if ( (!page->u.inuse.type_info ||
+                  page->u.inuse.type_info == (PGT_writable_page |
+                                              PGT_validated)) &&
                  !get_page_and_type(page, d, PGT_writable_page) )
                 BUG();
 
@@ -840,22 +928,41 @@  int __init dom0_construct_pv(struct doma
 #endif
     while ( pfn < nr_pages )
     {
-        if ( (page = alloc_chunk(d, nr_pages - domain_tot_pages(d))) == NULL )
+        count = domain_tot_pages(d);
+        if ( (page = alloc_chunk(d, nr_pages - count)) == NULL )
             panic("Not enough RAM for DOM0 reservation\n");
+        mfn = mfn_x(page_to_mfn(page));
+
+        if ( need_iommu_pt_sync(d) )
+        {
+            rc = iommu_map(d, _dfn(mfn), _mfn(mfn), domain_tot_pages(d) - count,
+                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+            if ( rc )
+                printk(XENLOG_ERR
+                       "pre-mapping MFN %lx (PFN %lx) into IOMMU failed: %d\n",
+                       mfn, pfn, rc);
+        }
+
         while ( pfn < domain_tot_pages(d) )
         {
-            mfn = mfn_x(page_to_mfn(page));
+            if ( !rc )
+                make_pages_writable(page, 1);
+
 #ifndef NDEBUG
 #define pfn (nr_pages - 1 - (pfn - (alloc_epfn - alloc_spfn)))
 #endif
             dom0_update_physmap(compat, pfn, mfn, vphysmap_start);
 #undef pfn
-            page++; pfn++;
+            page++; mfn++; pfn++;
             if ( !(pfn & 0xfffff) )
                 process_pending_softirqs();
         }
     }
 
+    /* Use while() to avoid compiler warning. */
+    while ( iommu_iotlb_flush_all(d, flush_flags) )
+        break;
+
     if ( initrd_len != 0 )
     {
         si->mod_start = vinitrd_start ?: initrd_pfn;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -325,8 +325,8 @@  static unsigned int __hwdom_init hwdom_i
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn;
-    unsigned int flush_flags = 0;
+    unsigned long i, top, max_pfn, start, count;
+    unsigned int flush_flags = 0, start_perms = 0;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -357,9 +357,9 @@  void __hwdom_init arch_iommu_hwdom_init(
      * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
      * setting up potentially conflicting mappings here.
      */
-    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
+    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
 
-    for ( ; i < top; i++ )
+    for ( i = start, count = 0; i < top; )
     {
         unsigned long pfn = pdx_to_pfn(i);
         unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
@@ -372,16 +372,30 @@  void __hwdom_init arch_iommu_hwdom_init(
                                         perms & IOMMUF_writable ? p2m_access_rw
                                                                 : p2m_access_r,
                                         0);
+        else if ( pfn != start + count || perms != start_perms )
+        {
+        commit:
+            rc = iommu_map(d, _dfn(start), _mfn(start), count,
+                           start_perms, &flush_flags);
+            SWAP(start, pfn);
+            start_perms = perms;
+            count = 1;
+        }
         else
-            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           perms, &flush_flags);
+        {
+            ++count;
+            rc = 0;
+        }
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
                    d, !paging_mode_translate(d) ? "IOMMU " : "", pfn, rc);
 
-        if (!(i & 0xfffff))
+        if ( !(++i & 0xfffff) )
             process_pending_softirqs();
+
+        if ( i == top && count )
+            goto commit;
     }
 
     /* Use if to avoid compiler warning */