
[v2,12/18] AMD/IOMMU: allow use of superpage mappings

Message ID: cc93398d-982a-edbc-4ddd-b5459cef8f9a@suse.com (mailing list archive)
State: New, archived
Series: IOMMU: superpage support when not sharing pagetables

Commit Message

Jan Beulich Sept. 24, 2021, 9:52 a.m. UTC
No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables can take quite a while when
replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway.
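
For illustration only (not part of the patch): the page sizes announced here correspond to mapping orders 0, 9, 18 and 27, which the patch converts to AMD IOMMU page-table levels via order / PTE_PER_TABLE_SHIFT + 1. The standalone sketch below restates the constants purely for the example:

#include <stdio.h>

/* Restated for this sketch only: 512 PTEs per table, 4k base pages. */
#define PTE_PER_TABLE_SHIFT 9
#define PAGE_SHIFT          12

/* Mirrors the patch's  level = order / PTE_PER_TABLE_SHIFT + 1. */
static unsigned int order_to_level(unsigned int order)
{
    return order / PTE_PER_TABLE_SHIFT + 1;
}

int main(void)
{
    static const unsigned int orders[] = { 0, 9, 18, 27 };
    unsigned int i;

    for ( i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i )
        printf("order %2u -> level %u (mapping size 2^%u bytes)\n",
               orders[i], order_to_level(orders[i]),
               orders[i] + PAGE_SHIFT);

    return 0;
}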

Comments

Roger Pau Monne Dec. 10, 2021, 3:06 p.m. UTC | #1
On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)

Also replacing/shattering such non-standard page sizes will require
more logic, so unless there's a performance benefit I would just skip
using them.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables can take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway.

I would limit to 1G, which is what we support for CPU page tables
also.

> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)

Nit: should the last parameter be named level rather than next_level?
AFAICT it's the level of the mfn parameter.

Should we also assert that level (or next_level) is always != 0 for
extra safety?

Thanks, Roger.
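
For reference, the suggestion would amount to something like the following variant of the helper from the patch below (parameter renamed to level, plus the extra sanity check); this is only an illustration of the review comment, not part of the posted patch:

static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
{
    /* Suggested extra check: a level-0 "table" cannot exist. */
    ASSERT(level);

    if ( level > 1 )
    {
        union amd_iommu_pte *pt = map_domain_page(mfn);
        unsigned int i;

        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
            if ( pt[i].pr && pt[i].next_level )
            {
                ASSERT(pt[i].next_level < level);
                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
            }

        unmap_domain_page(pt);
    }

    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
}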
Jan Beulich Dec. 13, 2021, 8:49 a.m. UTC | #2
On 10.12.2021 16:06, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>> ---
>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>> freeing of intermediate page tables can take quite a while when
>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>> there's no present code path via which 512G chunks of memory could be
>> allocated (and hence mapped) anyway.
> 
> I would limit to 1G, which is what we support for CPU page tables
> also.

I'm not sure I buy comparing with CPU side support when not sharing
page tables. Not the least with PV in mind.

>> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>>      return 0;
>>  }
>>  
>> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
> 
> Nit: should the last parameter be named level rather than next_level?
> AFAICT it's the level of the mfn parameter.

Yeah, might make sense.

> Should we also assert that level (or next_level) is always != 0 for
> extra safety?

As said elsewhere - if this wasn't a static helper, I'd agree. But all
call sites have respective conditionals around the call. If anything
I'd move those checks into the function (but only if you think that
would improve things, as to me having them at the call sites is more
logical).

Jan
Roger Pau Monne Dec. 13, 2021, 9:45 a.m. UTC | #3
On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> On 10.12.2021 16:06, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> >> ---
> >> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >> freeing of intermediate page tables can take quite a while when
> >> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> >> there's no present code path via which 512G chunks of memory could be
> >> allocated (and hence mapped) anyway.
> > 
> > I would limit to 1G, which is what we support for CPU page tables
> > also.
> 
> I'm not sure I buy comparing with CPU side support when not sharing
> page tables. Not the least with PV in mind.

Hm, my thinking was that similar reasons that don't allow us to do
512G mappings for the CPU side would also apply to IOMMU. Regardless
of that, given the current way in which replaced page table entries
are freed, I'm not sure it's fine to allow 512G mappings as the
freeing of the possible huge amount of 4K entries could allow guests
to hog a CPU for a long time.

It would be better if we could somehow account this in a per-vCPU way,
kind of similar to what we do with vPCI BAR mappings.

> > Should we also assert that level (or next_level) is always != 0 for
> > extra safety?
> 
> As said elsewhere - if this wasn't a static helper, I'd agree. But all
> call sites have respective conditionals around the call. If anything
> I'd move those checks into the function (but only if you think that
> would improve things, as to me having them at the call sites is more
> logical).

I'm fine with leaving the checks in the callers; it was just a suggestion
in case we gain new callers that forget to do the checks themselves.

Thanks, Roger.
Jan Beulich Dec. 13, 2021, 10 a.m. UTC | #4
On 13.12.2021 10:45, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
>> On 10.12.2021 16:06, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>>>> ---
>>>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>>>> freeing of intermediate page tables can take quite a while when
>>>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>>>> there's no present code path via which 512G chunks of memory could be
>>>> allocated (and hence mapped) anyway.
>>>
>>> I would limit to 1G, which is what we support for CPU page tables
>>> also.
>>
>> I'm not sure I buy comparing with CPU side support when not sharing
>> page tables. Not the least with PV in mind.
> 
> Hm, my thinking was that similar reasons that don't allow us to do
> 512G mappings for the CPU side would also apply to IOMMU. Regardless
> of that, given the current way in which replaced page table entries
> are freed, I'm not sure it's fine to allow 512G mappings as the
> freeing of the possible huge amount of 4K entries could allow guests
> to hog a CPU for a long time.

This huge amount can occur only when replacing a hierarchy with
sufficiently many 4k leaves by a single 512G page. Yet there's no
way - afaics - that such an operation can be initiated right now.
That's, as said in the remark, because there's no way to allocate
a 512G chunk of memory in one go. When re-coalescing, the worst
that can happen is one L1 table worth of 4k mappings, one L2
table worth of 2M mappings, and one L3 table worth of 1G mappings.
All other mappings already need to have been superpage ones at the
time of the checking. Hence the total upper bound (for the
enclosing map / unmap) is again primarily determined by there not
being any way to establish 512G mappings.

Actually, thinking about it, there's one path where 512G mappings
could be established, but that's Dom0-reachable only
(XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
PCI device. Even if such a device existed, I think we're fine to
assume that Dom0 won't establish such mappings to replace
existing ones, but only ever put them in place when nothing was
mapped in that range yet.

> It would be better if we could somehow account this in a per-vCPU way,
> kind of similar to what we do with vPCI BAR mappings.

But recording them per-vCPU wouldn't make any difference to the
number of pages that could accumulate in a single run. Maybe I'm
missing something in what you're thinking about here ...

Jan
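
To put numbers on the concern (standalone back-of-the-envelope sketch, not Xen code; the 512-entry table fan-out of x86/AMD IOMMU page tables is assumed): replacing a fully populated 4k tree by a single superpage would queue the following numbers of page-table pages for freeing:

#include <stdio.h>

#define ENTRIES_PER_TABLE 512UL

int main(void)
{
    /* Page-table pages underneath a single entry of the given size when
     * everything below is mapped with 4k pages. */
    unsigned long for_2M   = 1;                     /* one L1 table */
    unsigned long for_1G   = 1 + ENTRIES_PER_TABLE; /* one L2 + its L1s */
    unsigned long for_512G = 1 + ENTRIES_PER_TABLE
                               + ENTRIES_PER_TABLE * ENTRIES_PER_TABLE;

    printf("2M   replacement frees up to %6lu tables\n", for_2M);
    printf("1G   replacement frees up to %6lu tables\n", for_1G);
    printf("512G replacement frees up to %6lu tables\n", for_512G);

    return 0;
}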
Roger Pau Monne Dec. 13, 2021, 10:33 a.m. UTC | #5
On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
> On 13.12.2021 10:45, Roger Pau Monné wrote:
> > On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> >> On 10.12.2021 16:06, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> >>>> ---
> >>>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >>>> freeing of intermediate page tables can take quite a while when
> >>>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> >>>> there's no present code path via which 512G chunks of memory could be
> >>>> allocated (and hence mapped) anyway.
> >>>
> >>> I would limit to 1G, which is what we support for CPU page tables
> >>> also.
> >>
> >> I'm not sure I buy comparing with CPU side support when not sharing
> >> page tables. Not the least with PV in mind.
> > 
> > Hm, my thinking was that similar reasons that don't allow us to do
> > 512G mappings for the CPU side would also apply to IOMMU. Regardless
> > of that, given the current way in which replaced page table entries
> > are freed, I'm not sure it's fine to allow 512G mappings as the
> > freeing of the possible huge amount of 4K entries could allow guests
> > to hog a CPU for a long time.
> 
> This huge amount can occur only when replacing a hierarchy with
> sufficiently many 4k leaves by a single 512G page. Yet there's no
> way - afaics - that such an operation can be initiated right now.
> That's, as said in the remark, because there's no way to allocate
> a 512G chunk of memory in one go. When re-coalescing, the worst
> that can happen is one L1 table worth of 4k mappings, one L2
> table worth of 2M mappings, and one L3 table worth of 1G mappings.
> All other mappings already need to have been superpage ones at the
> time of the checking. Hence the total upper bound (for the
> enclosing map / unmap) is again primarily determined by there not
> being any way to establish 512G mappings.
> 
> Actually, thinking about it, there's one path where 512G mappings
> could be established, but that's Dom0-reachable only
> (XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
> PCI device. Even if such a device existed, I think we're fine to
> assume that Dom0 won't establish such mappings to replace
> existing ones, but only ever put them in place when nothing was
> mapped in that range yet.
> 
> > It would be better if we could somehow account this in a per-vCPU way,
> > kind of similar to what we do with vPCI BAR mappings.
> 
> But recording them per-vCPU wouldn't make any difference to the
> number of pages that could accumulate in a single run. Maybe I'm
> missing something in what you're thinking about here ...

If Xen somehow did the free in guest vCPU context before resuming
guest execution then you could yield when events are pending and thus
allow other guests to run without hogging the pCPU, and the freeing
would be accounted to vCPU sched slice time.

Thanks, Roger.
Jan Beulich Dec. 13, 2021, 10:41 a.m. UTC | #6
On 13.12.2021 11:33, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
>> On 13.12.2021 10:45, Roger Pau Monné wrote:
>>> It would be better if we could somehow account this in a per-vCPU way,
>>> kind of similar to what we do with vPCI BAR mappings.
>>
>> But recording them per-vCPU wouldn't make any difference to the
>> number of pages that could accumulate in a single run. Maybe I'm
>> missing something in what you're thinking about here ...
> 
> If Xen somehow did the free in guest vCPU context before resuming
> guest execution then you could yield when events are pending and thus
> allow other guests to run without hogging the pCPU, and the freeing
> would be accounted to vCPU sched slice time.

That's an interesting thought. Shouldn't be difficult to arrange for
HVM (from {svm,vmx}_vmenter_helper()), but I can't immediately see a
clean way of having the same for PV (short of an ad hoc call out of
assembly code somewhere after test_all_events).

Jan
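
Purely as an illustration of the idea being discussed, and not of any existing Xen interface: a hypothetical per-vCPU drain helper, called from e.g. {svm,vmx}_vmenter_helper() before the vCPU resumes, might look roughly like the sketch below. The field v->arch.deferred_pt_list and the helper name are made up for this sketch.

static void vcpu_drain_deferred_pgtables(struct vcpu *v)
{
    struct page_info *pg;

    /* Work is now done in (and accounted to) this vCPU's context. */
    while ( (pg = page_list_remove_head(&v->arch.deferred_pt_list)) != NULL )
    {
        free_domheap_page(pg);

        /* Yield to pending work (e.g. the scheduler) and finish the
         * remainder on the next VM entry. */
        if ( softirq_pending(smp_processor_id()) )
            break;
    }
}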

Patch

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@  static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -288,10 +289,31 @@  static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+    if ( next_level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < next_level);
+                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
                        unsigned int flags, unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -320,7 +342,7 @@  int amd_iommu_map_page(struct domain *d,
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -330,8 +352,8 @@  int amd_iommu_map_page(struct domain *d,
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -339,8 +361,13 @@  int amd_iommu_map_page(struct domain *d,
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -349,6 +376,7 @@  int amd_iommu_unmap_page(struct domain *
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -359,7 +387,7 @@  int amd_iommu_unmap_page(struct domain *
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -371,14 +399,19 @@  int amd_iommu_unmap_page(struct domain *
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( level > 1 && old.next_level )
+            queue_free_pt(d, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -630,7 +630,7 @@  static void amd_dump_page_tables(struct
 }
 
 static const struct iommu_ops __initconstrel _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@ 
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */