[v3] x86/p2m: use large pages for MMIO mappings

Message ID 569780BD02000078000C6A1E@prv-mh.provo.novell.com
State New, archived

Commit Message

Jan Beulich Jan. 14, 2016, 10:04 a.m. UTC
When mapping large BARs (e.g. the frame buffer of a graphics card) the
overhead of establishing such mappings using only 4k pages has,
particularly after the XSA-125 fix, become unacceptable. Alter the
XEN_DOMCTL_memory_mapping semantics once again, so that there's no
longer a fixed amount of guest frames that represents the upper limit
of what a single invocation can map. Instead bound execution time by
limiting the number of iterations (regardless of page size). The return
value can now be any of
- zero (success, everything done)
- positive (success, this many done, more to do: re-invoke)
- negative (error)
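
A minimal sketch of how a caller might consume the three-way return value described above. The names mock_map() and map_all() are hypothetical and not part of the patch, and the mock limits work per frame count, whereas the patch bounds the number of iterations regardless of page size:

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the hypercall.  It handles at most 64 frames
 * per call and returns 0 when the whole request was done, a positive
 * count when only that many frames were done, or a negative errno value.
 */
static long mock_map(unsigned long first_gfn, unsigned long nr)
{
    const unsigned long limit = 64;

    (void)first_gfn;
    if ( nr == 0 )
        return -EINVAL;
    return nr <= limit ? 0 : (long)limit;
}

/* Drive a request to completion; returns 0 or the first error seen. */
static long map_all(unsigned long first_gfn, unsigned long nr,
                    unsigned int *calls)
{
    unsigned long done = 0;

    *calls = 0;
    while ( done < nr )
    {
        long rc = mock_map(first_gfn + done, nr - done);

        ++*calls;
        if ( rc < 0 )
            return rc;      /* negative: error, give up */
        if ( rc == 0 )
            break;          /* zero: everything done */
        done += rc;         /* positive: re-invoke for the remainder */
    }

    return 0;
}
```

The caller advances by exactly the returned count, so it never needs to know what page sizes were used internally.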

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC dropped (due to lack of feedback), but the issues remain:
- ARM side unimplemented (and hence libxc for now made cope with both
  models), the main issue (besides my inability to test any change
  there) being the many internal uses of map_mmio_regions())
- error unmapping in map_mmio_regions() and error propagation to caller
  from unmap_mmio_regions() are not really satisfactory (for the latter
  a possible model might be to have the function - and hence the
  domctl - return the [non-zero] number of completed entries upon
  error, requiring the caller to re-invoke the hypercall to then obtain
  the actual error for the failed slot)
- iommu_{,un}map_page() interfaces don't support "order" (hence
  mmio_order() for now returns zero when !iommu_hap_pt_share, which in
  particular means the AMD side isn't being taken care of just yet, but
  note that this also has the intended effect of suppressing non-zero
  order mappings in the shadow mode case)
---
v3: Re-base on top of "x86/hvm: fold opt_hap_{2mb,1gb} into
    hap_capabilities". Extend description to spell out new return value
    meaning. Add a couple of code comments. Use PAGE_ORDER_4K instead
    of literal 0. Take into consideration r/o MMIO pages.
v2: Produce valid entries for large p2m_mmio_direct mappings in
    p2m_pt_set_entry(). Don't open code iommu_use_hap_pt() in
    mmio_order(). Update function comment of set_typed_p2m_entry() and
    clear_mmio_p2m_entry(). Use PRI_mfn. Add ASSERT()s to
    {,un}map_mmio_regions() to detect otherwise endless loops.
---
TBD: The p2m_pt_set_entry() change in v2 points out an apparent
     inconsistency with PoD handling: 2M mappings get valid entries
     created, while 4k mappings don't. It would seem to me that the
     4k case needs changing.

--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2206,7 +2206,7 @@ int xc_domain_memory_mapping(
 {
     DECLARE_DOMCTL;
     xc_dominfo_t info;
-    int ret = 0, err;
+    int ret = 0, rc;
     unsigned long done = 0, nr, max_batch_sz;
 
     if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
@@ -2231,19 +2231,24 @@ int xc_domain_memory_mapping(
         domctl.u.memory_mapping.nr_mfns = nr;
         domctl.u.memory_mapping.first_gfn = first_gfn + done;
         domctl.u.memory_mapping.first_mfn = first_mfn + done;
-        err = do_domctl(xch, &domctl);
-        if ( err && errno == E2BIG )
+        rc = do_domctl(xch, &domctl);
+        if ( rc < 0 && errno == E2BIG )
         {
             if ( max_batch_sz <= 1 )
                 break;
             max_batch_sz >>= 1;
             continue;
         }
+        if ( rc > 0 )
+        {
+            done += rc;
+            continue;
+        }
         /* Save the first error... */
         if ( !ret )
-            ret = err;
+            ret = rc;
         /* .. and ignore the rest of them when removing. */
-        if ( err && add_mapping != DPCI_REMOVE_MAPPING )
+        if ( rc && add_mapping != DPCI_REMOVE_MAPPING )
             break;
 
         done += nr;
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -436,7 +436,8 @@ static __init void pvh_add_mem_mapping(s
         else
             a = p2m_access_rw;
 
-        if ( (rc = set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i), a)) )
+        if ( (rc = set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i),
+                                      PAGE_ORDER_4K, a)) )
             panic("pvh_add_mem_mapping: gfn:%lx mfn:%lx i:%ld rc:%d\n",
                   gfn, mfn, i, rc);
         if ( !(i & 0xfffff) )
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2491,7 +2491,7 @@ static int vmx_alloc_vlapic_mapping(stru
     share_xen_page_with_guest(pg, d, XENSHARE_writable);
     d->arch.hvm_domain.vmx.apic_access_mfn = mfn;
     set_mmio_p2m_entry(d, paddr_to_pfn(APIC_DEFAULT_PHYS_BASE), _mfn(mfn),
-                       p2m_get_hostp2m(d)->default_access);
+                       PAGE_ORDER_4K, p2m_get_hostp2m(d)->default_access);
 
     return 0;
 }
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -899,48 +899,62 @@ void p2m_change_type_range(struct domain
     p2m_unlock(p2m);
 }
 
-/* Returns: 0 for success, -errno for failure */
+/*
+ * Returns:
+ *    0        for success
+ *    -errno   for failure
+ *    order+1  for caller to retry with order (guaranteed smaller than
+ *             the order value passed in)
+ */
 static int set_typed_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                               p2m_type_t gfn_p2mt, p2m_access_t access)
+                               unsigned int order, p2m_type_t gfn_p2mt,
+                               p2m_access_t access)
 {
     int rc = 0;
     p2m_access_t a;
     p2m_type_t ot;
     mfn_t omfn;
+    unsigned int cur_order = 0;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     if ( !paging_mode_translate(d) )
         return -EIO;
 
-    gfn_lock(p2m, gfn, 0);
-    omfn = p2m->get_entry(p2m, gfn, &ot, &a, 0, NULL, NULL);
+    gfn_lock(p2m, gfn, order);
+    omfn = p2m->get_entry(p2m, gfn, &ot, &a, 0, &cur_order, NULL);
+    if ( cur_order < order )
+    {
+        gfn_unlock(p2m, gfn, order);
+        return cur_order + 1;
+    }
     if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
     {
-        gfn_unlock(p2m, gfn, 0);
+        gfn_unlock(p2m, gfn, order);
         domain_crash(d);
         return -ENOENT;
     }
     else if ( p2m_is_ram(ot) )
     {
+        unsigned long i;
+
         ASSERT(mfn_valid(omfn));
-        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+        for ( i = 0; i < (1UL << order); ++i )
+            set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
     }
 
     P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn, mfn_x(mfn));
-    rc = p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, gfn_p2mt,
-                       access);
+    rc = p2m_set_entry(p2m, gfn, mfn, order, gfn_p2mt, access);
     if ( rc )
-        gdprintk(XENLOG_ERR,
-                 "p2m_set_entry failed! mfn=%08lx rc:%d\n",
-                 mfn_x(get_gfn_query_unlocked(p2m->domain, gfn, &ot)), rc);
+        gdprintk(XENLOG_ERR, "p2m_set_entry: %#lx:%u -> %d (0x%"PRI_mfn")\n",
+                 gfn, order, rc, mfn_x(mfn));
     else if ( p2m_is_pod(ot) )
     {
         pod_lock(p2m);
-        p2m->pod.entry_count--;
+        p2m->pod.entry_count -= 1UL << order;
         BUG_ON(p2m->pod.entry_count < 0);
         pod_unlock(p2m);
     }
-    gfn_unlock(p2m, gfn, 0);
+    gfn_unlock(p2m, gfn, order);
 
     return rc;
 }
@@ -949,14 +963,21 @@ static int set_typed_p2m_entry(struct do
 static int set_foreign_p2m_entry(struct domain *d, unsigned long gfn,
                                  mfn_t mfn)
 {
-    return set_typed_p2m_entry(d, gfn, mfn, p2m_map_foreign,
+    return set_typed_p2m_entry(d, gfn, mfn, PAGE_ORDER_4K, p2m_map_foreign,
                                p2m_get_hostp2m(d)->default_access);
 }
 
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                       p2m_access_t access)
+                       unsigned int order, p2m_access_t access)
 {
-    return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct, access);
+    if ( order &&
+         rangeset_overlaps_range(mmio_ro_ranges, mfn_x(mfn),
+                                 mfn_x(mfn) + (1UL << order) - 1) &&
+         !rangeset_contains_range(mmio_ro_ranges, mfn_x(mfn),
+                                  mfn_x(mfn) + (1UL << order) - 1) )
+        return order;
+
+    return set_typed_p2m_entry(d, gfn, mfn, order, p2m_mmio_direct, access);
 }
 
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
@@ -1009,20 +1030,33 @@ int set_identity_p2m_entry(struct domain
     return ret;
 }
 
-/* Returns: 0 for success, -errno for failure */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
+/*
+ * Returns:
+ *    0        for success
+ *    -errno   for failure
+ *    order+1  for caller to retry with order (guaranteed smaller than
+ *             the order value passed in)
+ */
+int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
+                         unsigned int order)
 {
     int rc = -EINVAL;
     mfn_t actual_mfn;
     p2m_access_t a;
     p2m_type_t t;
+    unsigned int cur_order = 0;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     if ( !paging_mode_translate(d) )
         return -EIO;
 
-    gfn_lock(p2m, gfn, 0);
-    actual_mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL, NULL);
+    gfn_lock(p2m, gfn, order);
+    actual_mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, &cur_order, NULL);
+    if ( cur_order < order )
+    {
+        rc = cur_order + 1;
+        goto out;
+    }
 
     /* Do not use mfn_valid() here as it will usually fail for MMIO pages. */
     if ( (INVALID_MFN == mfn_x(actual_mfn)) || (t != p2m_mmio_direct) )
@@ -1035,11 +1069,11 @@ int clear_mmio_p2m_entry(struct domain *
         gdprintk(XENLOG_WARNING,
                  "no mapping between mfn %08lx and gfn %08lx\n",
                  mfn_x(mfn), gfn);
-    rc = p2m_set_entry(p2m, gfn, _mfn(INVALID_MFN), PAGE_ORDER_4K, p2m_invalid,
+    rc = p2m_set_entry(p2m, gfn, _mfn(INVALID_MFN), order, p2m_invalid,
                        p2m->default_access);
 
  out:
-    gfn_unlock(p2m, gfn, 0);
+    gfn_unlock(p2m, gfn, order);
 
     return rc;
 }
@@ -2095,6 +2129,25 @@ void *map_domain_gfn(struct p2m_domain *
     return map_domain_page(*mfn);
 }
 
+static unsigned int mmio_order(const struct domain *d,
+                               unsigned long start_fn, unsigned long nr)
+{
+    if ( !need_iommu(d) || !iommu_use_hap_pt(d) ||
+         (start_fn & ((1UL << PAGE_ORDER_2M) - 1)) || !(nr >> PAGE_ORDER_2M) )
+        return 0;
+
+    if ( !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) && (nr >> PAGE_ORDER_1G) &&
+         hap_has_1gb )
+        return PAGE_ORDER_1G;
+
+    if ( hap_has_2mb )
+        return PAGE_ORDER_2M;
+
+    return 0;
+}
+
+#define MAP_MMIO_MAX_ITER 64 /* pretty arbitrary */
+
 int map_mmio_regions(struct domain *d,
                      unsigned long start_gfn,
                      unsigned long nr,
@@ -2102,22 +2155,48 @@ int map_mmio_regions(struct domain *d,
 {
     int ret = 0;
     unsigned long i;
+    unsigned int iter, order;
 
     if ( !paging_mode_translate(d) )
         return 0;
 
-    for ( i = 0; !ret && i < nr; i++ )
+    for ( iter = i = 0; i < nr && iter < MAP_MMIO_MAX_ITER;
+          i += 1UL << order, ++iter )
     {
-        ret = set_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i),
-                                 p2m_get_hostp2m(d)->default_access);
-        if ( ret )
+        /* OR'ing gfn and mfn values will return an order suitable to both. */
+        for ( order = mmio_order(d, (start_gfn + i) | (mfn + i), nr - i); ;
+              order = ret - 1 )
+        {
+            ret = set_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i), order,
+                                     p2m_get_hostp2m(d)->default_access);
+            if ( ret <= 0 )
+                break;
+            ASSERT(ret <= order);
+        }
+        if ( ret < 0 )
         {
-            unmap_mmio_regions(d, start_gfn, i, mfn);
+            for ( nr = i, iter = i = 0; i < nr ; i += 1UL << order, ++iter )
+            {
+                int rc;
+
+                WARN_ON(iter == MAP_MMIO_MAX_ITER);
+                for ( order = mmio_order(d, (start_gfn + i) | (mfn + i),
+                                         nr - i); ; order = rc - 1 )
+                {
+                    rc = clear_mmio_p2m_entry(d, start_gfn + i,
+                                              _mfn(mfn + i), order);
+                    if ( rc <= 0 )
+                        break;
+                    ASSERT(rc <= order);
+                }
+                if ( rc < 0 )
+                    order = 0;
+            }
             break;
         }
     }
 
-    return ret;
+    return ret < 0 ? ret : i == nr ? 0 : i;
 }
 
 int unmap_mmio_regions(struct domain *d,
@@ -2127,18 +2206,33 @@ int unmap_mmio_regions(struct domain *d,
 {
     int err = 0;
     unsigned long i;
+    unsigned int iter, order;
 
     if ( !paging_mode_translate(d) )
         return 0;
 
-    for ( i = 0; i < nr; i++ )
+    for ( iter = i = 0; i < nr && iter < MAP_MMIO_MAX_ITER;
+          i += 1UL << order, ++iter )
     {
-        int ret = clear_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i));
-        if ( ret )
+        int ret;
+
+        /* OR'ing gfn and mfn values will return an order suitable to both. */
+        for ( order = mmio_order(d, (start_gfn + i) | (mfn + i), nr - i); ;
+              order = ret - 1 )
+        {
+            ret = clear_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i), order);
+            if ( ret <= 0 )
+                break;
+            ASSERT(ret <= order);
+        }
+        if ( ret < 0 )
+        {
             err = ret;
+            order = 0;
+        }
     }
 
-    return err;
+    return err ?: i == nr ? 0 : i;
 }
 
 unsigned int p2m_find_altp2m_by_eptp(struct domain *d, uint64_t eptp)
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@ static void ept_p2m_type_to_flags(struct
             entry->r = entry->x = 1;
             entry->w = !rangeset_contains_singleton(mmio_ro_ranges,
                                                     entry->mfn);
+            ASSERT(entry->w || !is_epte_superpage(entry));
             entry->a = !!cpu_has_vmx_ept_ad;
             entry->d = entry->w && cpu_has_vmx_ept_ad;
             break;
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -72,7 +72,8 @@ static const unsigned long pgt[] = {
     PGT_l3_page_table
 };
 
-static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
+static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
+                                       unsigned int level)
 {
     unsigned long flags;
     /*
@@ -107,6 +108,8 @@ static unsigned long p2m_type_to_flags(p
     case p2m_mmio_direct:
         if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
             flags |= _PAGE_RW;
+        else
+            ASSERT(!level);
         return flags | P2M_BASE_FLAGS | _PAGE_PCD;
     }
 }
@@ -436,7 +449,7 @@ static int do_recalc(struct p2m_domain *
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
                               ? p2m_ram_logdirty : p2m_ram_rw;
             unsigned long mfn = l1e_get_pfn(e);
-            unsigned long flags = p2m_type_to_flags(p2mt, _mfn(mfn));
+            unsigned long flags = p2m_type_to_flags(p2mt, _mfn(mfn), level);
 
             if ( level )
             {
@@ -573,7 +576,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         l3e_content = mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt)
             ? l3e_from_pfn(mfn_x(mfn),
-                           p2m_type_to_flags(p2mt, mfn) | _PAGE_PSE)
+                           p2m_type_to_flags(p2mt, mfn, 2) | _PAGE_PSE)
             : l3e_empty();
         entry_content.l1 = l3e_content.l3;
 
@@ -609,7 +612,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
-                                             p2m_type_to_flags(p2mt, mfn));
+                                             p2m_type_to_flags(p2mt, mfn, 0));
         else
             entry_content = l1e_empty();
 
@@ -645,7 +648,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             l2e_content = l2e_from_pfn(mfn_x(mfn),
-                                       p2m_type_to_flags(p2mt, mfn) |
+                                       p2m_type_to_flags(p2mt, mfn, 1) |
                                        _PAGE_PSE);
         else
             l2e_content = l2e_empty();
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -1046,10 +1046,12 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
              (gfn + nr_mfns - 1) < gfn ) /* wrap? */
             break;
 
+#ifndef CONFIG_X86 /* XXX ARM!? */
         ret = -E2BIG;
         /* Must break hypercall up as this could take a while. */
         if ( nr_mfns > 64 )
             break;
+#endif
 
         ret = -EPERM;
         if ( !iomem_access_permitted(current->domain, mfn, mfn_end) ||
@@ -1067,7 +1069,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
                    d->domain_id, gfn, mfn, nr_mfns);
 
             ret = map_mmio_regions(d, gfn, nr_mfns, mfn);
-            if ( ret )
+            if ( ret < 0 )
                 printk(XENLOG_G_WARNING
                        "memory_map:fail: dom%d gfn=%lx mfn=%lx nr=%lx ret:%ld\n",
                        d->domain_id, gfn, mfn, nr_mfns, ret);
@@ -1079,7 +1081,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
                    d->domain_id, gfn, mfn, nr_mfns);
 
             ret = unmap_mmio_regions(d, gfn, nr_mfns, mfn);
-            if ( ret && is_hardware_domain(current->domain) )
+            if ( ret < 0 && is_hardware_domain(current->domain) )
                 printk(XENLOG_ERR
                        "memory_map: error %ld removing dom%d access to [%lx,%lx]\n",
                        ret, d->domain_id, mfn, mfn_end);
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -259,7 +259,7 @@ int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        clear_mmio_p2m_entry(d, gmfn, _mfn(mfn));
+        clear_mmio_p2m_entry(d, gmfn, _mfn(mfn), 0);
         put_gfn(d, gmfn);
         return 1;
     }
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -574,8 +574,9 @@ int p2m_is_logdirty_range(struct p2m_dom
 
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                       p2m_access_t access);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
+                       unsigned int order, p2m_access_t access);
+int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
+                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
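
The order selection that mmio_order() performs in the diff above can be illustrated standalone. This is a simplified sketch, not the patch's code: pick_order() is a hypothetical name, the capability checks are reduced to two flags, the IOMMU conditions are left out, and the order constants are assumed to match x86's 4k/2M/1G page orders (0, 9 and 18). The point is the trick named in the code comment: OR'ing the gfn and mfn yields a value whose low bits are clear only if both are aligned.

```c
#define ORDER_4K 0u
#define ORDER_2M 9u   /* assumed: 512 4k frames per 2M page */
#define ORDER_1G 18u  /* assumed: 512 * 512 4k frames per 1G page */

/*
 * Pick the largest page order usable for a mapping starting at gfn/mfn
 * with nr frames left, given which large page sizes are available.
 */
static unsigned int pick_order(unsigned long gfn, unsigned long mfn,
                               unsigned long nr, int has_2m, int has_1g)
{
    /* A low bit set in either value rules out that alignment for both. */
    unsigned long both = gfn | mfn;

    if ( has_1g && !(both & ((1UL << ORDER_1G) - 1)) && (nr >> ORDER_1G) )
        return ORDER_1G;

    if ( has_2m && !(both & ((1UL << ORDER_2M) - 1)) && (nr >> ORDER_2M) )
        return ORDER_2M;

    return ORDER_4K;
}
```

For example, a gfn of 0x200 and an mfn of 0x400 are both 2M-aligned, so 2M pages are usable as long as at least 512 frames remain; an mfn of 0x401 forces 4k pages.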

Comments

Ian Campbell Jan. 15, 2016, 10:09 a.m. UTC | #1
On Thu, 2016-01-14 at 03:04 -0700, Jan Beulich wrote:
> - ARM side unimplemented (and hence libxc for now made cope with both
>   models),

So, one model is the one described in the commit message:

> - zero (success, everything done)
> - positive (success, this many done, more to do: re-invoke)
> - negative (error)

What is the other one? I'd expect ARM to already implement a subset of this
(i.e. 0 or negative, perhaps with a subset of the possible errno values),
which I'd then expect libxc to just cope with without it constituting a
second model.

IOW I don't think there should be (or indeed is) any special casing of ARM
vs x86 here or one model vs another, just a case of one arch only using a
subset of the expressibility of the interface.

What have I missed?
Jan Beulich Jan. 15, 2016, 10:47 a.m. UTC | #2
>>> On 15.01.16 at 11:09, <ian.campbell@citrix.com> wrote:
> On Thu, 2016-01-14 at 03:04 -0700, Jan Beulich wrote:
>> - ARM side unimplemented (and hence libxc for now made cope with both
>>   models),
> 
> So, one model is the one described in the commit message:
> 
>> - zero (success, everything done)
>> - positive (success, this many done, more to do: re-invoke)
>> - negative (error)
> 
> What is the other one? I'd expect ARM to already implement a subset of this
> (i.e. 0 or negative, perhaps with a subset of the possible errno values), 
> which I'd then expect libxc to just cope with without it constituting a 
> second model.

Well, first of all ARM doesn't get switched away from the current
model (yet), i.e. returning -E2BIG out of do_domctl(). And then
the difference between what the patch implements and what the
non-commit message comment describes is how errors get handled:
The patch makes a negative error value returned upon error, with
the caller having no way to tell at what point the error occurred
(and with a best effort undo in the case of "map"). The described
alternative would return the number of succeeded entries unless
an error occurred on the very first MFN, without any attempt to
undo the part that was done successfully. I.e. it would leave it
to the caller to decide what to do, and whether/when to roll back.

Jan
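
The alternative error model described in the reply above might look roughly like the following. This is a sketch under stated assumptions, not code from the patch: map_one() and map_range_alt() are hypothetical names, and the failing slot is injected by the caller purely for illustration. The function reports progress when at least one slot succeeded and returns the error only when the very first slot fails, with no attempt to undo earlier work.

```c
#include <errno.h>

/* Hypothetical per-slot operation that fails with -EPERM at slot `bad`. */
static int map_one(unsigned long slot, unsigned long bad)
{
    return slot == bad ? -EPERM : 0;
}

/*
 * Returns 0 if all nr slots were done, the number of completed slots if
 * an error hit after some progress, or the error itself if the first
 * slot failed.
 */
static long map_range_alt(unsigned long start, unsigned long nr,
                          unsigned long bad)
{
    unsigned long i;

    for ( i = 0; i < nr; ++i )
    {
        int rc = map_one(start + i, bad);

        if ( rc < 0 )
            return i ? (long)i : rc;
    }

    return 0;
}
```

A caller that gets a positive count back re-invokes from the first slot not yet done; that second call then fails on its first slot and so surfaces the actual error, leaving the rollback decision to the caller.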
Ian Campbell Jan. 15, 2016, 1:57 p.m. UTC | #3
On Fri, 2016-01-15 at 03:47 -0700, Jan Beulich wrote:
> > > > On 15.01.16 at 11:09, <ian.campbell@citrix.com> wrote:
> > On Thu, 2016-01-14 at 03:04 -0700, Jan Beulich wrote:
> > > - ARM side unimplemented (and hence libxc for now made cope with both
> > >   models),
> > 
> > So, one model is the one described in the commit message:
> > 
> > > - zero (success, everything done)
> > > - positive (success, this many done, more to do: re-invoke)
> > > - negative (error)
> > 
> > What is the other one? I'd expect ARM to already implement a subset of
> > this
> > (i.e. 0 or negative, perhaps with a subset of the possible errno
> > values), 
> > which I'd then expect libxc to just cope with without it constituting a
> > second model.
> 
> Well, first of all ARM doesn't get switched away from the current
> model (yet), i.e. returning -E2BIG out of do_domctl().

Which AFAICT is a valid behaviour under the new model described in the
commit message specifically the "negative (error)" case.

I think the core of my objection/confusion here is describing this as two
different models when what is being introduced is a single API which can
fail either partially or entirely, with that being at the discretion of the
internals. In any case libxc needs to cope with the complete gamut of
behaviours of the interface.

IOW rather than describing a new API and referring obliquely to ARM only
supporting an old model I think this needs a complete description of the
interface covering the full possibilities of the API.

>  And then
> the difference between what the patch implements and what the
> non-commit message comment describes is how errors get handled:
> The patch makes a negative error value returned upon error, with
> the caller having no way to tell at what point the error occurred
> (and with a best effort undo in the case of "map"). The described
> alternative would return the number of succeeded entries unless
> an error occurred on the very first MFN, without any attempt to
> undo the part that was done successfully. I.e. it would leave it
> to the caller to decide what to do, and whether/when to roll back.

That's (probably, I don't quite follow all the details as written) fine,
but the interface should be described as a single API with the possible
failure cases each spelled out, i.e. not described as a split/contrast
between old vs. new or x86 vs. arm behaviour. The fact that x86 and arm
might currently exhibit different subsets of the possibilities offered by
the API is of at best secondary interest.

Ian.
Jan Beulich Jan. 15, 2016, 2:39 p.m. UTC | #4
>>> On 15.01.16 at 14:57, <ian.campbell@citrix.com> wrote:
> On Fri, 2016-01-15 at 03:47 -0700, Jan Beulich wrote:
>> > > > On 15.01.16 at 11:09, <ian.campbell@citrix.com> wrote:
>> > On Thu, 2016-01-14 at 03:04 -0700, Jan Beulich wrote:
>> > > - ARM side unimplemented (and hence libxc for now made cope with both
>> > >   models),
>> > 
>> > So, one model is the one described in the commit message:
>> > 
>> > > - zero (success, everything done)
>> > > - positive (success, this many done, more to do: re-invoke)
>> > > - negative (error)
>> > 
>> > What is the other one? I'd expect ARM to already implement a subset of
>> > this
>> > (i.e. 0 or negative, perhaps with a subset of the possible errno
>> > values), 
>> > which I'd then expect libxc to just cope with without it constituting a
>> > second model.
>> 
>> Well, first of all ARM doesn't get switched away from the current
>> model (yet), i.e. returning -E2BIG out of do_domctl().
> 
> Which AFAICT is a valid behaviour under the new model described in the
> commit message specifically the "negative (error)" case.
> 
> I think the core of my objection/confusion here is describing this as two
> different models when what is being introduced is a single API which can
> fail either partially or entirely, with that being at the discretion of the
> internals. In any case libxc needs to cope with the complete gamut of
> behaviours of the interface.
> 
> IOW rather than describing a new API and referring obliquely to ARM only
> supporting an old model I think this needs a complete description of the
> interface covering the full possibilities of the API.
> 
>>  And then
>> the difference between what the patch implements and what the
>> non-commit message comment describes is how errors get handled:
>> The patch makes a negative error value returned upon error, with
>> the caller having no way to tell at what point the error occurred
>> (and with a best effort undo in the case of "map"). The described
>> alternative would return the number of succeeded entries unless
>> an error occurred on the very first MFN, without any attempt to
>> undo the part that was done successfully. I.e. it would leave it
>> to the caller to decide what to do, and whether/when to roll back.
> 
> That's (probably, I don't quite follow all the details as written) fine,
> but the interface should be described as a single API with the possible
> failure cases each spelled out, i.e. not described as a split/contrast
> between old vs. new or x86 vs. arm behaviour. The fact that x86 and arm
> might currently exhibit different subsets of the possibilities offered by
> the API is of at best secondary interest.

I don't think I agree - there are two models. The meaning of
-E2BIG for the caller to retry with a smaller amount doesn't exist in
the new model anymore, and hence libxc wouldn't need to deal
with that case anymore if the ARM side got updated too. Whereas
positive return values don't exist in the present (prior to the patch)
model.

Jan
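
To make the distinction between the two models concrete, here is a sketch of a caller loop that copes with both, loosely modelled on the xc_domain_memory_mapping() hunk in the patch. Everything here is hypothetical scaffolding: mock_domctl() stands in for the hypercall, the enum selects which behaviour it emulates, and the 64-frame threshold is only an illustrative limit.

```c
#include <errno.h>

enum model { OLD_E2BIG, NEW_PARTIAL };

/* Hypothetical hypercall stand-in emulating either behaviour. */
static long mock_domctl(enum model m, unsigned long nr)
{
    if ( m == OLD_E2BIG )
    {
        if ( nr > 64 )
        {
            errno = E2BIG;      /* old model: caller must shrink the batch */
            return -1;
        }
        return 0;
    }

    return nr > 64 ? 64 : 0;    /* new model: report partial progress */
}

/* Returns 0 on success, -1 on error; *calls counts hypercall attempts. */
static int map_both(enum model m, unsigned long total, unsigned int *calls)
{
    unsigned long done = 0, batch = total;

    *calls = 0;
    while ( done < total )
    {
        unsigned long nr = total - done;
        long rc;

        if ( nr > batch )
            nr = batch;

        rc = mock_domctl(m, nr);
        ++*calls;

        if ( rc < 0 && errno == E2BIG )
        {
            if ( batch <= 1 )
                return -1;
            batch >>= 1;        /* old model: retry with half the batch */
            continue;
        }
        if ( rc < 0 )
            return -1;
        if ( rc > 0 )
        {
            done += rc;         /* new model: skip what was already done */
            continue;
        }
        done += nr;
    }

    return 0;
}
```

In this sketch the E2BIG branch is the only part that would become dead code once every architecture returns positive counts instead.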
Ian Campbell Jan. 15, 2016, 2:55 p.m. UTC | #5
On Fri, 2016-01-15 at 07:39 -0700, Jan Beulich wrote:
> > > > On 15.01.16 at 14:57, <ian.campbell@citrix.com> wrote:
> > On Fri, 2016-01-15 at 03:47 -0700, Jan Beulich wrote:
> > > > > > On 15.01.16 at 11:09, <ian.campbell@citrix.com> wrote:
> > > > On Thu, 2016-01-14 at 03:04 -0700, Jan Beulich wrote:
> > > > > - ARM side unimplemented (and hence libxc for now made cope with
> > > > > both
> > > > >   models),
> > > > 
> > > > So, one model is the one described in the commit message:
> > > > 
> > > > > - zero (success, everything done)
> > > > > - positive (success, this many done, more to do: re-invoke)
> > > > > - negative (error)
> > > > 
> > > > What is the other one? I'd expect ARM to already implement a subset
> > > > of
> > > > this
> > > > (i.e. 0 or negative, perhaps with a subset of the possible errno
> > > > values), 
> > > > which I'd then expect libxc to just cope with without it
> > > > constituting a
> > > > second model.
> > > 
> > > Well, first of all ARM doesn't get switched away from the current
> > > model (yet), i.e. returning -E2BIG out of do_domctl().
> > 
> > Which AFAICT is a valid behaviour under the new model described in the
> > commit message specifically the "negative (error)" case.
> > 
> > I think the core of my objection/confusion here is describing this as
> > two
> > different models when what is being introduced is a single API which
> > can
> > fail either partially or entirely, with that being at the discretion of
> > the
> > internals. In any case libxc needs to cope with the complete gamut of
> > behaviours of the interface.
> > 
> > IOW rather than describing a new API and referring obliquely to ARM
> > only
> > supporting an old model I think this needs a complete description of
> > the
> > interface covering the full possibilities of the API.
> > 
> > >  And then
> > > the difference between what the patch implements and what the
> > > non-commit message comment describes is how errors get handled:
> > > The patch makes a negative error value returned upon error, with
> > > the caller having no way to tell at what point the error occurred
> > > (and with a best effort undo in the case of "map"). The described
> > > alternative would return the number of succeeded entries unless
> > > an error occurred on the very first MFN, without any attempt to
> > > undo the part that was done successfully. I.e. it would leave it
> > > to the caller to decide what to do, and whether/when to roll back.
> > 
> > That's (probably, I don't quite follow all the details as written)
> > fine,
> > but the interface should be described as a single API with the possible
> > failure cases each spelled out, i.e. not described as a split/contrast
> > between old vs. new or x86 vs. arm behaviour. The fact that x86 and arm
> > might currently exhibit different subsets of the possibilities offered
> > by
> > the API is of at best secondary interest.
> 
> I don't think I agree - there are two models. The meaning of
> -E2BIG for the caller to retry with a smaller amount doesn't exist in
> the new model anymore, and hence libxc wouldn't need to deal
> with that case anymore if the ARM side got updated too.

If ARM still has this behaviour then it is still part of the interface
IMHO, whether or not x86 chooses to use this particular possibility.

>  Whereas
> positive return values don't exist in the present (prior to the patch)
> model.

If there were two models in the way you suggest then there would surely be
an ifdef somewhere in libxc. The fact that the two behaviours can coexist
means to me that they are two halves of the same model (irrespective of
arch code opting in to different halves, and irrespective of whether, once
ARM has been updated, there are fewer possible error cases and a follow-up
simplification to libxc becomes possible).
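
To illustrate what "coping with the complete gamut" could look like, here is
a minimal sketch of a caller loop that accepts every return value discussed
so far. It is not the libxc code: toy_map_all(), the toy_op_* stand-ins and
the convention of returning -errno directly (libxc checks errno instead) are
all assumptions made for this example.

```c
#include <errno.h>

/* Hypothetical stand-in for one invocation of the domctl. Returns 0 (all of
 * [first, first + nr) done), a positive count (that many done, re-invoke for
 * the rest) or a negative errno value. */
typedef long (*toy_op_fn)(unsigned long first, unsigned long nr);

/* Caller loop accepting the whole interface: -E2BIG (halve the batch and
 * retry), positive (partial progress), zero (batch done), other negative
 * values (hard error). */
static long toy_map_all(toy_op_fn op, unsigned long nr_total,
                        unsigned long batch)
{
    unsigned long done = 0;

    while (done < nr_total) {
        unsigned long nr = nr_total - done;
        long rc;

        if (nr > batch)
            nr = batch;
        rc = op(done, nr);
        if (rc == -E2BIG) {
            if (batch <= 1)
                return rc;      /* cannot shrink the batch any further */
            batch >>= 1;
            continue;
        }
        if (rc < 0)
            return rc;          /* any other error ends the loop */
        /* Positive: only rc entries were handled. Zero: all nr were. */
        done += rc > 0 ? (unsigned long)rc : nr;
    }
    return 0;
}

/* Toy back end with the pre-patch behaviour: rejects batches above 64. */
static long toy_op_old(unsigned long first, unsigned long nr)
{
    (void)first;
    return nr > 64 ? -E2BIG : 0;
}

/* Toy back end with the new behaviour: handles at most 16 entries per call
 * and reports partial progress (16 is an arbitrary number for this toy). */
static long toy_op_new(unsigned long first, unsigned long nr)
{
    (void)first;
    return nr > 16 ? 16 : 0;
}

/* Toy back end that always fails. */
static long toy_op_fail(unsigned long first, unsigned long nr)
{
    (void)first;
    (void)nr;
    return -EPERM;
}
```

The same loop terminates successfully against both toy back ends without an
ifdef, which is the sense in which the two behaviours can be read as parts of
one interface.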

Anyway, the current three-bullet point description of the new ABI in the
commit message is clearly insufficient for the complexity, whether or not we
want to split hairs about how many models there are here.

At the very least the interface (_all_ aspects of it) should be thoroughly
described in domctl.h next to XEN_DOMCTL_memory_mapping (which I just
noticed describes E2BIG and isn't changed here at all).

Ian.
Jan Beulich Jan. 18, 2016, 8:11 a.m. UTC | #6
>>> On 15.01.16 at 15:55, <ian.campbell@citrix.com> wrote:
> On Fri, 2016-01-15 at 07:39 -0700, Jan Beulich wrote:
>> I don't think I agree - there are two models. The meaning of
>> -E2BIG for the caller to retry with a smaller amount doesn't exist in
>> the new model anymore, and hence libxc wouldn't need to deal
>> with that case anymore if the ARM side got updated too.
> 
> If ARM still has this behaviour then it is still part of the interface
> IMHO, whether or not x86 chooses to use this particular possibility or not.

Okay, that's a valid perspective.

>>  Whereas
>> positive return values don't exist in the present (prior to the patch)
>> model.
> 
> If there were two models in the way you suggest then there would surely be
> an ifdef somewhere in libxc. The fact that the two behaviours can coexist
> means to me that they are two halves of the same model (irrespective of
> arch code opting in to different halves, and irrespective if having updated
> ARM there are then fewer possible error cases and a follow up
> simplification to libxc).

Same here.

> Anyway, the current three-bullet point description of the new ABI in the
> commit message is clearly insufficient for the complexity whether we want
> to split hairs about how many models there are here or not.
> 
> At the very least the interface (_all_ aspects of it) should be thoroughly
> described in domctl.h next to XEN_DOMCTL_memory_mapping (which I just
> noticed describes E2BIG and isn't changed here at all).

I can certainly do that, but I'd like to avoid doing this for the current
model before having taken a decision on whether to instead use the
alternative described in the post-commit message issue list. In fact,
the more I think about it, the more I'm convinced that the alternative
provides the more consistent interface, no matter that it leaves more
of the (cleanup) work to the caller.

Jan
Ian Campbell Jan. 18, 2016, 4:32 p.m. UTC | #7
On Mon, 2016-01-18 at 01:11 -0700, Jan Beulich wrote:
> > > > On 15.01.16 at 15:55, <ian.campbell@citrix.com> wrote:
> > On Fri, 2016-01-15 at 07:39 -0700, Jan Beulich wrote:
> > > I don't think I agree - there are two models. The meaning of
> > > -E2BIG for the caller to retry with a smaller amount doesn't exist in
> > > the new model anymore, and hence libxc wouldn't need to deal
> > > with that case anymore if the ARM side got updated too.
> > 
> > If ARM still has this behaviour then it is still part of the interface
> > IMHO, whether or not x86 chooses to use this particular possibility or
> > not.
> 
> Okay, that's a valid perspective.
> 
> > >  Whereas
> > > positive return values don't exist in the present (prior to the
> > > patch)
> > > model.
> > 
> > If there were two models in the way you suggest then there would surely
> > be
> > an ifdef somewhere in libxc. The fact that the two behaviours can
> > coexist
> > means to me that they are two halves of the same model (irrespective of
> > arch code opting in to different halves, and irrespective if having
> > updated
> > ARM there are then fewer possible error cases and a follow up
> > simplification to libxc).
> 
> Same here.
> 
> > Anyway, the current three-bullet point description of the new ABI in
> > the
> > commit message is clearly insufficient for the complexity whether we
> > want
> > to split hairs about how many models there are here or not.
> > 
> > At the very least the interface (_all_ aspects of it) should be
> > thoroughly
> > described in domctl.h next to XEN_DOMCTL_memory_mapping (which I just
> > noticed describes E2BIG and isn't changed here at all).
> 
> I can certainly do that, but I'd like to avoid doing this for the current
> model before having taken a decision on whether to instead use the
> alternative described in the post-commit message issue list. In fact,
> the more I think about it, the more I'm convinced that the alternative
> provides the more consistent interface, no matter that it leaves more
> of the (cleanup) work to the caller.

I must confess I'm not entirely following what the various proposals are,
but FWIW I have no in-principle problem with the caller (by which I think
you mean the tools?) having to clean up partial success in order to allow
incremental attempts to set things up with smaller and smaller page sizes.

Ian.
Jan Beulich Jan. 18, 2016, 4:51 p.m. UTC | #8
>>> On 18.01.16 at 17:32, <ian.campbell@citrix.com> wrote:
> I must confess I'm not entirely following what the various proposals are,

What is currently implemented by the patch is that, upon error on
iteration N the hypervisor would clean up on a best effort basis and
return the error indicator. In the alternative suggested model it
wouldn't do any cleanup and return N to indicate how far success
was seen; only in the event that N=0 would an error code be
returned.
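
The difference between the two can be sketched with a toy model. Everything
below is made up for illustration (the mapped[] array, toy_map_one() and the
choice of -EINVAL as the failure code); it only mirrors the behaviour
described in the paragraph above and is not the hypervisor code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy state: mapped[i] is true once slot i has been mapped. */
#define TOY_SLOTS 8
static bool mapped[TOY_SLOTS];

static void toy_reset(void)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        mapped[i] = false;
}

static size_t toy_count_mapped(void)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_SLOTS; i++)
        n += mapped[i];
    return n;
}

/* Hypothetical per-slot operation: fails with -EINVAL at slot fail_at. */
static int toy_map_one(size_t i, size_t fail_at)
{
    if (i == fail_at)
        return -EINVAL;
    mapped[i] = true;
    return 0;
}

/* Currently implemented behaviour: on error at iteration N, undo what was
 * done so far and return the error indicator. */
static long toy_map_undo_on_error(size_t nr, size_t fail_at)
{
    for (size_t i = 0; i < nr; i++) {
        int rc = toy_map_one(i, fail_at);

        if (rc < 0) {
            while (i-- > 0)
                mapped[i] = false;  /* roll back slots 0 .. N-1 */
            return rc;
        }
    }
    return 0;
}

/* Alternative: no cleanup; return N, the number of slots that succeeded.
 * An error code is returned only when N == 0. */
static long toy_map_report_progress(size_t nr, size_t fail_at)
{
    for (size_t i = 0; i < nr; i++) {
        int rc = toy_map_one(i, fail_at);

        if (rc < 0)
            return i ? (long)i : rc;
    }
    return 0;
}
```

In the alternative, a positive return value does not by itself tell the
caller whether an error occurred; the caller learns about it by re-invoking
for the remaining slots, where the failing slot is then the first one.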

> but FWIW I have no in-principle problem with the caller (by which I think
> you mean the tools?)

Yes.

> having to clean up partial success in order to allow
> incremental attempts to set things up with smaller and smaller page sizes.

Except that in the new x86 model we're not talking about decreasing
page size, but just the splitting the hypervisor does in place of true
preemption. Decreasing page size would actually be harmful to the
goal of using large pages for the mappings.
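
As a simplified illustration of that splitting: the number of iterations per
invocation is capped, and each iteration uses the largest order that the
alignment and the remaining length allow. The toy_* names are invented for
this sketch, the order values are the usual x86 ones (9 for 2M, 18 for 1G),
and the IOMMU/HAP capability checks done by mmio_order() in the patch are
deliberately left out.

```c
#define TOY_ORDER_2M  9   /* 512 4k frames */
#define TOY_ORDER_1G 18   /* 262144 4k frames */
#define TOY_MAX_ITER 64   /* cap on iterations per invocation */

/* Largest order usable at start_fn with nr frames left (alignment and
 * length only; hardware capability checks are omitted in this toy). */
static unsigned int toy_order(unsigned long start_fn, unsigned long nr)
{
    if (!(start_fn & ((1UL << TOY_ORDER_1G) - 1)) && (nr >> TOY_ORDER_1G))
        return TOY_ORDER_1G;
    if (!(start_fn & ((1UL << TOY_ORDER_2M) - 1)) && (nr >> TOY_ORDER_2M))
        return TOY_ORDER_2M;
    return 0;
}

/* Walks the range in at most TOY_MAX_ITER steps. Returns 0 if the whole
 * range was covered, otherwise the number of frames covered so far, so
 * that the caller can re-invoke for the remainder. */
static unsigned long toy_map_bounded(unsigned long gfn, unsigned long mfn,
                                     unsigned long nr)
{
    unsigned long i = 0;
    unsigned int iter;

    /* OR'ing gfn and mfn yields an order suitable for both. */
    for (iter = 0; i < nr && iter < TOY_MAX_ITER; iter++)
        i += 1UL << toy_order((gfn + i) | (mfn + i), nr - i);

    return i == nr ? 0 : i;
}
```

With an aligned start, 64 iterations cover 64 * 512 frames using 2M
mappings, whereas a start that is off by one frame covers only 64 frames in
the same number of iterations. The bound is on work done, not on the amount
of address space mapped.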

Jan
Ian Campbell Jan. 18, 2016, 5 p.m. UTC | #9
On Mon, 2016-01-18 at 09:51 -0700, Jan Beulich wrote:
> > > > On 18.01.16 at 17:32, <ian.campbell@citrix.com> wrote:
> > I must confess I'm not entirely following what the various proposals
> > are,
> 
> What is currently implemented by the patch is that, upon error on
> iteration N the hypervisor would clean up on a best effort basis and
> return the error indicator. In the alternative suggested model it
> wouldn't do any cleanup and return N to indicate how far success
> was seen; only in the event that N=0 would an error code be
> returned.
> 
> > but FWIW I have no in-principle problem with the caller (by which I
> > think
> > you mean the tools?)
> 
> Yes.
> 
> > having to clean up partial success in order to allow
> > incremental attempts to set things up with smaller and smaller page
> > sizes.
> 
> Except that in the new x86 model we're not talking about decreasing
> page size, but just the splitting the hypervisor does in place of true
> preemption. Decreasing page size would actually be harmful to the
> goal of using large pages for the mappings.

Ah, I assumed it was to allow things to progress if no large pages were
actually around. Doing it for preemption purposes sounds ok too I guess.

Ian.

Patch

--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2206,7 +2206,7 @@  int xc_domain_memory_mapping(
 {
     DECLARE_DOMCTL;
     xc_dominfo_t info;
-    int ret = 0, err;
+    int ret = 0, rc;
     unsigned long done = 0, nr, max_batch_sz;
 
     if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
@@ -2231,19 +2231,24 @@  int xc_domain_memory_mapping(
         domctl.u.memory_mapping.nr_mfns = nr;
         domctl.u.memory_mapping.first_gfn = first_gfn + done;
         domctl.u.memory_mapping.first_mfn = first_mfn + done;
-        err = do_domctl(xch, &domctl);
-        if ( err && errno == E2BIG )
+        rc = do_domctl(xch, &domctl);
+        if ( rc < 0 && errno == E2BIG )
         {
             if ( max_batch_sz <= 1 )
                 break;
             max_batch_sz >>= 1;
             continue;
         }
+        if ( rc > 0 )
+        {
+            done += rc;
+            continue;
+        }
         /* Save the first error... */
         if ( !ret )
-            ret = err;
+            ret = rc;
         /* .. and ignore the rest of them when removing. */
-        if ( err && add_mapping != DPCI_REMOVE_MAPPING )
+        if ( rc && add_mapping != DPCI_REMOVE_MAPPING )
             break;
 
         done += nr;
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -436,7 +436,8 @@  static __init void pvh_add_mem_mapping(s
         else
             a = p2m_access_rw;
 
-        if ( (rc = set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i), a)) )
+        if ( (rc = set_mmio_p2m_entry(d, gfn + i, _mfn(mfn + i),
+                                      PAGE_ORDER_4K, a)) )
             panic("pvh_add_mem_mapping: gfn:%lx mfn:%lx i:%ld rc:%d\n",
                   gfn, mfn, i, rc);
         if ( !(i & 0xfffff) )
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2491,7 +2491,7 @@  static int vmx_alloc_vlapic_mapping(stru
     share_xen_page_with_guest(pg, d, XENSHARE_writable);
     d->arch.hvm_domain.vmx.apic_access_mfn = mfn;
     set_mmio_p2m_entry(d, paddr_to_pfn(APIC_DEFAULT_PHYS_BASE), _mfn(mfn),
-                       p2m_get_hostp2m(d)->default_access);
+                       PAGE_ORDER_4K, p2m_get_hostp2m(d)->default_access);
 
     return 0;
 }
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -899,48 +899,62 @@  void p2m_change_type_range(struct domain
     p2m_unlock(p2m);
 }
 
-/* Returns: 0 for success, -errno for failure */
+/*
+ * Returns:
+ *    0        for success
+ *    -errno   for failure
+ *    order+1  for caller to retry with order (guaranteed smaller than
+ *             the order value passed in)
+ */
 static int set_typed_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                               p2m_type_t gfn_p2mt, p2m_access_t access)
+                               unsigned int order, p2m_type_t gfn_p2mt,
+                               p2m_access_t access)
 {
     int rc = 0;
     p2m_access_t a;
     p2m_type_t ot;
     mfn_t omfn;
+    unsigned int cur_order = 0;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     if ( !paging_mode_translate(d) )
         return -EIO;
 
-    gfn_lock(p2m, gfn, 0);
-    omfn = p2m->get_entry(p2m, gfn, &ot, &a, 0, NULL, NULL);
+    gfn_lock(p2m, gfn, order);
+    omfn = p2m->get_entry(p2m, gfn, &ot, &a, 0, &cur_order, NULL);
+    if ( cur_order < order )
+    {
+        gfn_unlock(p2m, gfn, order);
+        return cur_order + 1;
+    }
     if ( p2m_is_grant(ot) || p2m_is_foreign(ot) )
     {
-        gfn_unlock(p2m, gfn, 0);
+        gfn_unlock(p2m, gfn, order);
         domain_crash(d);
         return -ENOENT;
     }
     else if ( p2m_is_ram(ot) )
     {
+        unsigned long i;
+
         ASSERT(mfn_valid(omfn));
-        set_gpfn_from_mfn(mfn_x(omfn), INVALID_M2P_ENTRY);
+        for ( i = 0; i < (1UL << order); ++i )
+            set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY);
     }
 
     P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn, mfn_x(mfn));
-    rc = p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, gfn_p2mt,
-                       access);
+    rc = p2m_set_entry(p2m, gfn, mfn, order, gfn_p2mt, access);
     if ( rc )
-        gdprintk(XENLOG_ERR,
-                 "p2m_set_entry failed! mfn=%08lx rc:%d\n",
-                 mfn_x(get_gfn_query_unlocked(p2m->domain, gfn, &ot)), rc);
+        gdprintk(XENLOG_ERR, "p2m_set_entry: %#lx:%u -> %d (0x%"PRI_mfn")\n",
+                 gfn, order, rc, mfn_x(mfn));
     else if ( p2m_is_pod(ot) )
     {
         pod_lock(p2m);
-        p2m->pod.entry_count--;
+        p2m->pod.entry_count -= 1UL << order;
         BUG_ON(p2m->pod.entry_count < 0);
         pod_unlock(p2m);
     }
-    gfn_unlock(p2m, gfn, 0);
+    gfn_unlock(p2m, gfn, order);
 
     return rc;
 }
@@ -949,14 +963,21 @@  static int set_typed_p2m_entry(struct do
 static int set_foreign_p2m_entry(struct domain *d, unsigned long gfn,
                                  mfn_t mfn)
 {
-    return set_typed_p2m_entry(d, gfn, mfn, p2m_map_foreign,
+    return set_typed_p2m_entry(d, gfn, mfn, PAGE_ORDER_4K, p2m_map_foreign,
                                p2m_get_hostp2m(d)->default_access);
 }
 
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                       p2m_access_t access)
+                       unsigned int order, p2m_access_t access)
 {
-    return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct, access);
+    if ( order &&
+         rangeset_overlaps_range(mmio_ro_ranges, mfn_x(mfn),
+                                 mfn_x(mfn) + (1UL << order) - 1) &&
+         !rangeset_contains_range(mmio_ro_ranges, mfn_x(mfn),
+                                  mfn_x(mfn) + (1UL << order) - 1) )
+        return order;
+
+    return set_typed_p2m_entry(d, gfn, mfn, order, p2m_mmio_direct, access);
 }
 
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
@@ -1009,20 +1030,33 @@  int set_identity_p2m_entry(struct domain
     return ret;
 }
 
-/* Returns: 0 for success, -errno for failure */
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
+/*
+ * Returns:
+ *    0        for success
+ *    -errno   for failure
+ *    order+1  for caller to retry with order (guaranteed smaller than
+ *             the order value passed in)
+ */
+int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
+                         unsigned int order)
 {
     int rc = -EINVAL;
     mfn_t actual_mfn;
     p2m_access_t a;
     p2m_type_t t;
+    unsigned int cur_order = 0;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     if ( !paging_mode_translate(d) )
         return -EIO;
 
-    gfn_lock(p2m, gfn, 0);
-    actual_mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL, NULL);
+    gfn_lock(p2m, gfn, order);
+    actual_mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, &cur_order, NULL);
+    if ( cur_order < order )
+    {
+        rc = cur_order + 1;
+        goto out;
+    }
 
     /* Do not use mfn_valid() here as it will usually fail for MMIO pages. */
     if ( (INVALID_MFN == mfn_x(actual_mfn)) || (t != p2m_mmio_direct) )
@@ -1035,11 +1069,11 @@  int clear_mmio_p2m_entry(struct domain *
         gdprintk(XENLOG_WARNING,
                  "no mapping between mfn %08lx and gfn %08lx\n",
                  mfn_x(mfn), gfn);
-    rc = p2m_set_entry(p2m, gfn, _mfn(INVALID_MFN), PAGE_ORDER_4K, p2m_invalid,
+    rc = p2m_set_entry(p2m, gfn, _mfn(INVALID_MFN), order, p2m_invalid,
                        p2m->default_access);
 
  out:
-    gfn_unlock(p2m, gfn, 0);
+    gfn_unlock(p2m, gfn, order);
 
     return rc;
 }
@@ -2095,6 +2129,25 @@  void *map_domain_gfn(struct p2m_domain *
     return map_domain_page(*mfn);
 }
 
+static unsigned int mmio_order(const struct domain *d,
+                               unsigned long start_fn, unsigned long nr)
+{
+    if ( !need_iommu(d) || !iommu_use_hap_pt(d) ||
+         (start_fn & ((1UL << PAGE_ORDER_2M) - 1)) || !(nr >> PAGE_ORDER_2M) )
+        return 0;
+
+    if ( !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) && (nr >> PAGE_ORDER_1G) &&
+         hap_has_1gb )
+        return PAGE_ORDER_1G;
+
+    if ( hap_has_2mb )
+        return PAGE_ORDER_2M;
+
+    return 0;
+}
+
+#define MAP_MMIO_MAX_ITER 64 /* pretty arbitrary */
+
 int map_mmio_regions(struct domain *d,
                      unsigned long start_gfn,
                      unsigned long nr,
@@ -2102,22 +2155,48 @@  int map_mmio_regions(struct domain *d,
 {
     int ret = 0;
     unsigned long i;
+    unsigned int iter, order;
 
     if ( !paging_mode_translate(d) )
         return 0;
 
-    for ( i = 0; !ret && i < nr; i++ )
+    for ( iter = i = 0; i < nr && iter < MAP_MMIO_MAX_ITER;
+          i += 1UL << order, ++iter )
     {
-        ret = set_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i),
-                                 p2m_get_hostp2m(d)->default_access);
-        if ( ret )
+        /* OR'ing gfn and mfn values will return an order suitable to both. */
+        for ( order = mmio_order(d, (start_gfn + i) | (mfn + i), nr - i); ;
+              order = ret - 1 )
+        {
+            ret = set_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i), order,
+                                     p2m_get_hostp2m(d)->default_access);
+            if ( ret <= 0 )
+                break;
+            ASSERT(ret <= order);
+        }
+        if ( ret < 0 )
         {
-            unmap_mmio_regions(d, start_gfn, i, mfn);
+            for ( nr = i, iter = i = 0; i < nr ; i += 1UL << order, ++iter )
+            {
+                int rc;
+
+                WARN_ON(iter == MAP_MMIO_MAX_ITER);
+                for ( order = mmio_order(d, (start_gfn + i) | (mfn + i),
+                                         nr - i); ; order = rc - 1 )
+                {
+                    rc = clear_mmio_p2m_entry(d, start_gfn + i,
+                                              _mfn(mfn + i), order);
+                    if ( rc <= 0 )
+                        break;
+                    ASSERT(rc <= order);
+                }
+                if ( rc < 0 )
+                    order = 0;
+            }
             break;
         }
     }
 
-    return ret;
+    return ret < 0 ? ret : i == nr ? 0 : i;
 }
 
 int unmap_mmio_regions(struct domain *d,
@@ -2127,18 +2206,33 @@  int unmap_mmio_regions(struct domain *d,
 {
     int err = 0;
     unsigned long i;
+    unsigned int iter, order;
 
     if ( !paging_mode_translate(d) )
         return 0;
 
-    for ( i = 0; i < nr; i++ )
+    for ( iter = i = 0; i < nr && iter < MAP_MMIO_MAX_ITER;
+          i += 1UL << order, ++iter )
     {
-        int ret = clear_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i));
-        if ( ret )
+        int ret;
+
+        /* OR'ing gfn and mfn values will return an order suitable to both. */
+        for ( order = mmio_order(d, (start_gfn + i) | (mfn + i), nr - i); ;
+              order = ret - 1 )
+        {
+            ret = clear_mmio_p2m_entry(d, start_gfn + i, _mfn(mfn + i), order);
+            if ( ret <= 0 )
+                break;
+            ASSERT(ret <= order);
+        }
+        if ( ret < 0 )
+        {
             err = ret;
+            order = 0;
+        }
     }
 
-    return err;
+    return err ?: i == nr ? 0 : i;
 }
 
 unsigned int p2m_find_altp2m_by_eptp(struct domain *d, uint64_t eptp)
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -136,6 +136,7 @@  static void ept_p2m_type_to_flags(struct
             entry->r = entry->x = 1;
             entry->w = !rangeset_contains_singleton(mmio_ro_ranges,
                                                     entry->mfn);
+            ASSERT(entry->w || !is_epte_superpage(entry));
             entry->a = !!cpu_has_vmx_ept_ad;
             entry->d = entry->w && cpu_has_vmx_ept_ad;
             break;
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -72,7 +72,8 @@  static const unsigned long pgt[] = {
     PGT_l3_page_table
 };
 
-static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
+static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
+                                       unsigned int level)
 {
     unsigned long flags;
     /*
@@ -107,6 +108,8 @@  static unsigned long p2m_type_to_flags(p
     case p2m_mmio_direct:
         if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
             flags |= _PAGE_RW;
+        else
+            ASSERT(!level);
         return flags | P2M_BASE_FLAGS | _PAGE_PCD;
     }
 }
@@ -436,7 +449,7 @@  static int do_recalc(struct p2m_domain *
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
                               ? p2m_ram_logdirty : p2m_ram_rw;
             unsigned long mfn = l1e_get_pfn(e);
-            unsigned long flags = p2m_type_to_flags(p2mt, _mfn(mfn));
+            unsigned long flags = p2m_type_to_flags(p2mt, _mfn(mfn), level);
 
             if ( level )
             {
@@ -573,7 +576,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         l3e_content = mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt)
             ? l3e_from_pfn(mfn_x(mfn),
-                           p2m_type_to_flags(p2mt, mfn) | _PAGE_PSE)
+                           p2m_type_to_flags(p2mt, mfn, 2) | _PAGE_PSE)
             : l3e_empty();
         entry_content.l1 = l3e_content.l3;
 
@@ -609,7 +612,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m,
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
-                                             p2m_type_to_flags(p2mt, mfn));
+                                             p2m_type_to_flags(p2mt, mfn, 0));
         else
             entry_content = l1e_empty();
 
@@ -645,7 +648,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             l2e_content = l2e_from_pfn(mfn_x(mfn),
-                                       p2m_type_to_flags(p2mt, mfn) |
+                                       p2m_type_to_flags(p2mt, mfn, 1) |
                                        _PAGE_PSE);
         else
             l2e_content = l2e_empty();
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -1046,10 +1046,12 @@  long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
              (gfn + nr_mfns - 1) < gfn ) /* wrap? */
             break;
 
+#ifndef CONFIG_X86 /* XXX ARM!? */
         ret = -E2BIG;
         /* Must break hypercall up as this could take a while. */
         if ( nr_mfns > 64 )
             break;
+#endif
 
         ret = -EPERM;
         if ( !iomem_access_permitted(current->domain, mfn, mfn_end) ||
@@ -1067,7 +1069,7 @@  long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
                    d->domain_id, gfn, mfn, nr_mfns);
 
             ret = map_mmio_regions(d, gfn, nr_mfns, mfn);
-            if ( ret )
+            if ( ret < 0 )
                 printk(XENLOG_G_WARNING
                        "memory_map:fail: dom%d gfn=%lx mfn=%lx nr=%lx ret:%ld\n",
                        d->domain_id, gfn, mfn, nr_mfns, ret);
@@ -1079,7 +1081,7 @@  long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
                    d->domain_id, gfn, mfn, nr_mfns);
 
             ret = unmap_mmio_regions(d, gfn, nr_mfns, mfn);
-            if ( ret && is_hardware_domain(current->domain) )
+            if ( ret < 0 && is_hardware_domain(current->domain) )
                 printk(XENLOG_ERR
                        "memory_map: error %ld removing dom%d access to [%lx,%lx]\n",
                        ret, d->domain_id, mfn, mfn_end);
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -259,7 +259,7 @@  int guest_remove_page(struct domain *d,
     }
     if ( p2mt == p2m_mmio_direct )
     {
-        clear_mmio_p2m_entry(d, gmfn, _mfn(mfn));
+        clear_mmio_p2m_entry(d, gmfn, _mfn(mfn), 0);
         put_gfn(d, gmfn);
         return 1;
     }
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -574,8 +574,9 @@  int p2m_is_logdirty_range(struct p2m_dom
 
 /* Set mmio addresses in the p2m table (for pass-through) */
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
-                       p2m_access_t access);
-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
+                       unsigned int order, p2m_access_t access);
+int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
+                         unsigned int order);
 
 /* Set identity addresses in the p2m table (for pass-through) */
 int set_identity_p2m_entry(struct domain *d, unsigned long gfn,