
intel-iommu: optimize nodmar memory regions

Message ID: 20190313094323.18263-1-peterx@redhat.com (mailing list archive)
State: New, archived
Series: intel-iommu: optimize nodmar memory regions

Commit Message

Peter Xu March 13, 2019, 9:43 a.m. UTC
Previously we had a per-device alias to system memory for the case
where DMAR is disabled by the system.  This slows the system down when
there are lots of devices, because each aliased system address space
contains O(N) slots, and rendering those N address spaces has O(N^2)
complexity.

This patch introduces a shared nodmar memory region, and for each
device we only create an alias to that shared region.  With the
aliasing, the QEMU memory core API can detect that the devices share
the same address space (the nodmar address space) when rendering the
FlatViews, so the total number of FlatViews can be dramatically
reduced when there are a lot of devices.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---

Hi, Sergio,

This patch implements the optimization that Paolo proposed in the
other thread.  Would you please try this patch to see whether it could
help for your case?  Thanks,

 hw/i386/intel_iommu.c         | 90 ++++++++++++++++++++++-------------
 include/hw/i386/intel_iommu.h |  7 ++-
 2 files changed, 62 insertions(+), 35 deletions(-)
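
To make the idea easier to follow, here is a minimal sketch of the region
layout the patch builds, condensed from the diff below (the calls and field
names are the ones used there; this is an illustration rather than the
literal code):

    /* vtd_realize(): build the shared DMAR-disabled ("nodmar") container
     * once; it is just an alias covering the whole system memory. */
    memory_region_init(&s->mr_nodmar, OBJECT(s), "vtd-nodmar", UINT64_MAX);
    memory_region_init_alias(&s->mr_sys_alias, OBJECT(s), "vtd-sys-alias",
                             get_system_memory(), 0,
                             memory_region_size(get_system_memory()));
    memory_region_add_subregion_overlap(&s->mr_nodmar, 0,
                                        &s->mr_sys_alias, 0);

    /* vtd_find_add_as(): every device only gets a cheap alias to that
     * shared container, so the memory core can render a single FlatView
     * for all of them while DMAR is disabled. */
    memory_region_init_alias(&vtd_dev_as->nodmar, OBJECT(s), "vtd-nodmar",
                             &s->mr_nodmar, 0,
                             memory_region_size(&s->mr_nodmar));
    memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
                                        &vtd_dev_as->nodmar, 0);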

Comments

Paolo Bonzini March 13, 2019, 11:21 a.m. UTC | #1
On 13/03/19 10:43, Peter Xu wrote:
> Previously we had a per-device alias to system memory for the case
> where DMAR is disabled by the system.  This slows the system down when
> there are lots of devices, because each aliased system address space
> contains O(N) slots, and rendering those N address spaces has O(N^2)
> complexity.
> 
> This patch introduces a shared nodmar memory region, and for each
> device we only create an alias to that shared region.  With the
> aliasing, the QEMU memory core API can detect that the devices share
> the same address space (the nodmar address space) when rendering the
> FlatViews, so the total number of FlatViews can be dramatically
> reduced when there are a lot of devices.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
> Hi, Sergio,
> 
> This patch implements the optimization that Paolo proposed in the
> other thread.  Would you please try this patch to see whether it could
> help for your case?  Thanks,

Yes, this looks great.  Sergio, if you have time to test it, that
would be great.  With this patch, we switch between a few big
flatviews at boot (before the IOMMU is loaded) or with iommu=pt, and
many small flatviews after the IOMMU is loaded and iommu!=pt.  Both
should be fine for performance, and in particular the first case
should give no penalty at all compared to no IOMMU.

I only made a very small change to give different names to the various
-dmar regions, and queued the patch:

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f87b1033f6..e38c27e39c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2947,7 +2947,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     vtd_dev_as = vtd_bus->dev_as[devfn];
 
     if (!vtd_dev_as) {
-        snprintf(name, sizeof(name), "vtd-as-%02x.%x", PCI_SLOT(devfn),
+        snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
                  PCI_FUNC(devfn));
         vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
 
@@ -2983,9 +2983,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
          * region here just like what we've done above with the nodmar
          * region.
          */
+        strcat(name, "-dmar");
         memory_region_init_iommu(&vtd_dev_as->iommu, sizeof(vtd_dev_as->iommu),
                                  TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
-                                 "vtd-dmar", UINT64_MAX);
+                                 name, UINT64_MAX);
         memory_region_init_alias(&vtd_dev_as->iommu_ir, OBJECT(s), "vtd-ir",
                                  &s->mr_ir, 0, memory_region_size(&s->mr_ir));
         memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),

Paolo
Sergio Lopez Pascual March 13, 2019, 11:45 a.m. UTC | #2
Peter Xu writes:

> Previously we had a per-device alias to system memory for the case
> where DMAR is disabled by the system.  This slows the system down when
> there are lots of devices, because each aliased system address space
> contains O(N) slots, and rendering those N address spaces has O(N^2)
> complexity.
>
> This patch introduces a shared nodmar memory region, and for each
> device we only create an alias to that shared region.  With the
> aliasing, the QEMU memory core API can detect that the devices share
> the same address space (the nodmar address space) when rendering the
> FlatViews, so the total number of FlatViews can be dramatically
> reduced when there are a lot of devices.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>
> Hi, Sergio,
>
> This patch implements the optimization that Paolo proposed in the
> other thread.  Would you please try this patch to see whether it could
> help for your case?  Thanks,

Hi,

I've just given it a try and it fixes the issue here. The number of
FlatViews goes down from 119 to 4, and the initialization time for PCI
devices on the Guest is back to normal levels.
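
(For anyone who wants to check the count on their own setup, the flat
views can be dumped from the QEMU monitor, assuming a QEMU recent
enough to have the flag:

    (qemu) info mtree -f

which should print one "FlatView #N" entry per distinct flat view,
together with the address spaces that share it.)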

Thanks!
Sergio.
Paolo Bonzini March 13, 2019, 6:12 p.m. UTC | #3
On 13/03/19 12:45, Sergio Lopez wrote:
> 
> Peter Xu writes:
> 
>> Previously we had a per-device alias to system memory for the case
>> where DMAR is disabled by the system.  This slows the system down when
>> there are lots of devices, because each aliased system address space
>> contains O(N) slots, and rendering those N address spaces has O(N^2)
>> complexity.
>>
>> This patch introduces a shared nodmar memory region, and for each
>> device we only create an alias to that shared region.  With the
>> aliasing, the QEMU memory core API can detect that the devices share
>> the same address space (the nodmar address space) when rendering the
>> FlatViews, so the total number of FlatViews can be dramatically
>> reduced when there are a lot of devices.
>>
>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>>
>> Hi, Sergio,
>>
>> This patch implements the optimization that Paolo proposed in the
>> other thread.  Would you please try this patch to see whether it could
>> help for your case?  Thanks,
> 
> Hi,
> 
> I've just given it a try and it fixes the issue here. The number of
> FlatViews goes down from 119 to 4, and the initialization time for PCI
> devices on the Guest is back to normal levels.

You must be using "iommu=pt" then.  Can you also test performance
without it?  It should be fine even if the number of FlatViews goes back
to 119.

Paolo
Sergio Lopez Pascual March 14, 2019, 10:36 a.m. UTC | #4
Paolo Bonzini writes:

> On 13/03/19 12:45, Sergio Lopez wrote:
>> 
>> Peter Xu writes:
>> 
>>> Previously we had a per-device alias to system memory for the case
>>> where DMAR is disabled by the system.  This slows the system down when
>>> there are lots of devices, because each aliased system address space
>>> contains O(N) slots, and rendering those N address spaces has O(N^2)
>>> complexity.
>>>
>>> This patch introduces a shared nodmar memory region, and for each
>>> device we only create an alias to that shared region.  With the
>>> aliasing, the QEMU memory core API can detect that the devices share
>>> the same address space (the nodmar address space) when rendering the
>>> FlatViews, so the total number of FlatViews can be dramatically
>>> reduced when there are a lot of devices.
>>>
>>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>
>>> Hi, Sergio,
>>>
>>> This patch implements the optimization that Paolo proposed in the
>>> other thread.  Would you please try this patch to see whether it could
>>> help for your case?  Thanks,
>> 
>> Hi,
>> 
>> I've just given it a try and it fixes the issue here. The number of
>> FlatViews goes down from 119 to 4, and the initialization time for PCI
>> devices on the Guest is back to normal levels.
>
> You must be using "iommu=pt" then.  Can you also test performance
> without it?  It should be fine even if the number of FlatViews goes back
> to 119.

Hm... I don't have "iommu=pt" in either the Host or the Guest (the
issue can also be perceived in SeaBIOS). After taking some traces of
what QEMU is doing during that time, I'm quite convinced the slowness
comes from having to construct that number of FlatViews, each one with
up to 200 regions.

Thanks,
Sergio.
Peter Xu March 15, 2019, 6:02 a.m. UTC | #5
On Thu, Mar 14, 2019 at 11:36:39AM +0100, Sergio Lopez wrote:
> 
> Paolo Bonzini writes:
> 
> > On 13/03/19 12:45, Sergio Lopez wrote:
> >> 
> >> Peter Xu writes:
> >> 
> >>> Previously we had a per-device alias to system memory for the case
> >>> where DMAR is disabled by the system.  This slows the system down when
> >>> there are lots of devices, because each aliased system address space
> >>> contains O(N) slots, and rendering those N address spaces has O(N^2)
> >>> complexity.
> >>>
> >>> This patch introduces a shared nodmar memory region, and for each
> >>> device we only create an alias to that shared region.  With the
> >>> aliasing, the QEMU memory core API can detect that the devices share
> >>> the same address space (the nodmar address space) when rendering the
> >>> FlatViews, so the total number of FlatViews can be dramatically
> >>> reduced when there are a lot of devices.
> >>>
> >>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>> ---
> >>>
> >>> Hi, Sergio,
> >>>
> >>> This patch implements the optimization that Paolo proposed in the
> >>> other thread.  Would you please try this patch to see whether it could
> >>> help for your case?  Thanks,
> >> 
> >> Hi,
> >> 
> >> I've just given it a try and it fixes the issue here. The number of
> >> FlatViews goes down from 119 to 4, and the initialization time for PCI
> >> devices on the Guest is back to normal levels.
> >
> > You must be using "iommu=pt" then.  Can you also test performance
> > without it?  It should be fine even if the number of FlatViews goes back
> > to 119.
> 
> Hm... I don't have "iommu=pt" in either the Host or the Guest (the
> issue can also be perceived in SeaBIOS). After taking some traces of
> what QEMU is doing during that time, I'm quite convinced the slowness
> comes from having to construct that number of FlatViews, each one with
> up to 200 regions.

Thanks for giving it a shot!

And yes, with iommu=pt in the guest the state after Linux boots should
probably be the same as during the BIOS phase, as long as the emulated
VT-d device declares "hardware passthrough" support (upstream QEMU has
it on by default, which corresponds to "-device intel-iommu,pt=on").
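
(For reference, a minimal command line for this kind of setup might
look like the following sketch, assuming the q35 machine type, which
the emulated VT-d requires:

    qemu-system-x86_64 -machine q35 -device intel-iommu,pt=on ...

with the rest of the devices and drives appended as usual; since pt is
on by default in upstream QEMU, spelling it out is only for clarity.)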

Regards,
Peter Xu March 15, 2019, 6:07 a.m. UTC | #6
On Wed, Mar 13, 2019 at 12:21:34PM +0100, Paolo Bonzini wrote:
> On 13/03/19 10:43, Peter Xu wrote:
> > Previously we had a per-device alias to system memory for the case
> > where DMAR is disabled by the system.  This slows the system down when
> > there are lots of devices, because each aliased system address space
> > contains O(N) slots, and rendering those N address spaces has O(N^2)
> > complexity.
> >
> > This patch introduces a shared nodmar memory region, and for each
> > device we only create an alias to that shared region.  With the
> > aliasing, the QEMU memory core API can detect that the devices share
> > the same address space (the nodmar address space) when rendering the
> > FlatViews, so the total number of FlatViews can be dramatically
> > reduced when there are a lot of devices.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > 
> > Hi, Sergio,
> > 
> > This patch implements the optimization that Paolo proposed in the
> > other thread.  Would you please try this patch to see whether it could
> > help for your case?  Thanks,
> 
> Yes, this looks great.  Sergio, if you have time to test it, that
> would be great.  With this patch, we switch between a few big
> flatviews at boot (before the IOMMU is loaded) or with iommu=pt, and
> many small flatviews after the IOMMU is loaded and iommu!=pt.  Both
> should be fine for performance, and in particular the first case
> should give no penalty at all compared to no IOMMU.
> 
> I only made a very small change to give different names to the various
> -dmar regions, and queued the patch:
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f87b1033f6..e38c27e39c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2947,7 +2947,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      vtd_dev_as = vtd_bus->dev_as[devfn];
>  
>      if (!vtd_dev_as) {
> -        snprintf(name, sizeof(name), "vtd-as-%02x.%x", PCI_SLOT(devfn),
> +        snprintf(name, sizeof(name), "vtd-%02x.%x", PCI_SLOT(devfn),
>                   PCI_FUNC(devfn));
>          vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
>  
> @@ -2983,9 +2983,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>           * region here just like what we've done above with the nodmar
>           * region.
>           */
> +        strcat(name, "-dmar");
>          memory_region_init_iommu(&vtd_dev_as->iommu, sizeof(vtd_dev_as->iommu),
>                                   TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
> -                                 "vtd-dmar", UINT64_MAX);
> +                                 name, UINT64_MAX);
>          memory_region_init_alias(&vtd_dev_as->iommu_ir, OBJECT(s), "vtd-ir",
>                                   &s->mr_ir, 0, memory_region_size(&s->mr_ir));
>          memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),

Thanks Paolo. :)

It's a pity that we can't also append the PCI bus numbers to those
names, probably because the bus numbers have not yet been assigned by
the BIOS by the time vtd_find_add_as() is first called.  It would be
nicer if we could fix that some day (though I still have no good idea
how).

Regards,

Patch

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ee22e754f0..f87b1033f6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1171,11 +1171,11 @@  static bool vtd_switch_address_space(VTDAddressSpace *as)
 
     /* Turn off first then on the other */
     if (use_iommu) {
-        memory_region_set_enabled(&as->sys_alias, false);
+        memory_region_set_enabled(&as->nodmar, false);
         memory_region_set_enabled(MEMORY_REGION(&as->iommu), true);
     } else {
         memory_region_set_enabled(MEMORY_REGION(&as->iommu), false);
-        memory_region_set_enabled(&as->sys_alias, true);
+        memory_region_set_enabled(&as->nodmar, true);
     }
 
     if (take_bql) {
@@ -2947,7 +2947,8 @@  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     vtd_dev_as = vtd_bus->dev_as[devfn];
 
     if (!vtd_dev_as) {
-        snprintf(name, sizeof(name), "intel_iommu_devfn_%d", devfn);
+        snprintf(name, sizeof(name), "vtd-as-%02x.%x", PCI_SLOT(devfn),
+                 PCI_FUNC(devfn));
         vtd_bus->dev_as[devfn] = vtd_dev_as = g_malloc0(sizeof(VTDAddressSpace));
 
         vtd_dev_as->bus = bus;
@@ -2956,44 +2957,52 @@  VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
         vtd_dev_as->iova_tree = iova_tree_new();
 
+        memory_region_init(&vtd_dev_as->root, OBJECT(s), name, UINT64_MAX);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, "vtd-root");
+
         /*
-         * Memory region relationships looks like (Address range shows
-         * only lower 32 bits to make it short in length...):
-         *
-         * |-----------------+-------------------+----------|
-         * | Name            | Address range     | Priority |
-         * |-----------------+-------------------+----------+
-         * | vtd_root        | 00000000-ffffffff |        0 |
-         * |  intel_iommu    | 00000000-ffffffff |        1 |
-         * |  vtd_sys_alias  | 00000000-ffffffff |        1 |
-         * |  intel_iommu_ir | fee00000-feefffff |       64 |
-         * |-----------------+-------------------+----------|
+         * Build the DMAR-disabled container with aliases to the
+         * shared MRs.  Note that aliasing to a shared memory region
+         * could help the memory API to detect same FlatViews so we
+         * can have devices to share the same FlatView when DMAR is
+         * disabled (either by not providing "intel_iommu=on" or with
+         * "iommu=pt").  It will greatly reduce the total number of
+         * FlatViews of the system hence VM runs faster.
+         */
+        memory_region_init_alias(&vtd_dev_as->nodmar, OBJECT(s),
+                                 "vtd-nodmar", &s->mr_nodmar, 0,
+                                 memory_region_size(&s->mr_nodmar));
+
+        /*
+         * Build the per-device DMAR-enabled container.
          *
-         * We enable/disable DMAR by switching enablement for
-         * vtd_sys_alias and intel_iommu regions. IR region is always
-         * enabled.
+         * TODO: currently we have per-device IOMMU memory region only
+         * because we have per-device IOMMU notifiers for devices.  If
+         * one day we can abstract the IOMMU notifiers out of the
+         * memory regions then we can also share the same memory
+         * region here just like what we've done above with the nodmar
+         * region.
          */
         memory_region_init_iommu(&vtd_dev_as->iommu, sizeof(vtd_dev_as->iommu),
                                  TYPE_INTEL_IOMMU_MEMORY_REGION, OBJECT(s),
-                                 "intel_iommu_dmar",
-                                 UINT64_MAX);
-        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
-                                 "vtd_sys_alias", get_system_memory(),
-                                 0, memory_region_size(get_system_memory()));
-        memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
-                              &vtd_mem_ir_ops, s, "intel_iommu_ir",
-                              VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_init(&vtd_dev_as->root, OBJECT(s),
-                           "vtd_root", UINT64_MAX);
-        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                 "vtd-dmar", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->iommu_ir, OBJECT(s), "vtd-ir",
+                                 &s->mr_ir, 0, memory_region_size(&s->mr_ir));
+        memory_region_add_subregion_overlap(MEMORY_REGION(&vtd_dev_as->iommu),
                                             VTD_INTERRUPT_ADDR_FIRST,
-                                            &vtd_dev_as->iommu_ir, 64);
-        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
-        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
-                                            &vtd_dev_as->sys_alias, 1);
+                                            &vtd_dev_as->iommu_ir, 1);
+
+        /*
+         * Hook both the containers under the root container, we
+         * switch between DMAR & noDMAR by enable/disable
+         * corresponding sub-containers
+         */
         memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
                                             MEMORY_REGION(&vtd_dev_as->iommu),
-                                            1);
+                                            0);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->nodmar, 0);
+
         vtd_switch_address_space(vtd_dev_as);
     }
     return vtd_dev_as;
@@ -3323,6 +3332,21 @@  static void vtd_realize(DeviceState *dev, Error **errp)
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
                           "intel_iommu", DMAR_REG_SIZE);
+
+    /* Create the shared memory regions by all devices */
+    memory_region_init(&s->mr_nodmar, OBJECT(s), "vtd-nodmar",
+                       UINT64_MAX);
+    memory_region_init_io(&s->mr_ir, OBJECT(s), &vtd_mem_ir_ops,
+                          s, "vtd-ir", VTD_INTERRUPT_ADDR_SIZE);
+    memory_region_init_alias(&s->mr_sys_alias, OBJECT(s),
+                             "vtd-sys-alias", get_system_memory(), 0,
+                             memory_region_size(get_system_memory()));
+    memory_region_add_subregion_overlap(&s->mr_nodmar, 0,
+                                        &s->mr_sys_alias, 0);
+    memory_region_add_subregion_overlap(&s->mr_nodmar,
+                                        VTD_INTERRUPT_ADDR_FIRST,
+                                        &s->mr_ir, 1);
+
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->csrmem);
     /* No corresponding destroy */
     s->iotlb = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index a321cc9691..b1b8b89e80 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -86,8 +86,8 @@  struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
-    MemoryRegion root;
-    MemoryRegion sys_alias;
+    MemoryRegion root;          /* The root container of the device */
+    MemoryRegion nodmar;        /* The alias of shared nodmar MR */
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
@@ -202,6 +202,9 @@  union VTD_IR_MSIAddress {
 struct IntelIOMMUState {
     X86IOMMUState x86_iommu;
     MemoryRegion csrmem;
+    MemoryRegion mr_nodmar;
+    MemoryRegion mr_ir;
+    MemoryRegion mr_sys_alias;
     uint8_t csr[DMAR_REG_SIZE];     /* register values */
     uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
     uint8_t w1cmask[DMAR_REG_SIZE]; /* RW1C(Write 1 to Clear) bytes */