
[v6,1/8] AMD/IOMMU: don't blindly allocate interrupt remapping tables

Message ID f64deee2-7b5a-3330-997c-87bfc05b0807@suse.com (mailing list archive)
State Superseded
Series AMD IOMMU: further improvements

Commit Message

Jan Beulich Sept. 19, 2019, 1:21 p.m. UTC
ACPI tables are free to list far more device coordinates than there are
actual devices. By delaying the table allocations in most cases, and
doing them only when an actual device is known to be present at a given
position, the overall memory used for the tables goes down from over
500k pages to just over 1k (on my system, whose ACPI tables have such
contents).
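
As a minimal sketch (a hypothetical helper, not the patch's literal
code, though the types and the allocator are the real ones), the
allocate-on-first-use policy amounts to:

    /* Allocate a device's IRT only once the device is known to exist;
     * a no-op when remapping is disabled or the table already exists. */
    static int ensure_intremap_table(struct ivrs_mappings *entry,
                                     struct amd_iommu *iommu)
    {
        if ( !iommu_intremap || entry->intremap_table )
            return 0;

        entry->intremap_table =
            amd_iommu_alloc_intremap_table(iommu, &entry->intremap_inuse);

        return entry->intremap_table ? 0 : -ENOMEM;
    }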

Tables continue to get allocated right away for special entries
(IO-APIC, HPET) as well as for alias IDs. While in the former case
that's simply because there may not be any device at a given position,
in the latter case this is to avoid having to introduce ref-counting of
table usage.
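
Expressed via the new add_ivrs_mapping_entry() parameter (sketch only;
"flags" stands in for the respective IVHD entry's data_setting):

    add_ivrs_mapping_entry(bdf, bdf, flags, false, iommu);     /* plain device */
    add_ivrs_mapping_entry(bdf, alias_id, flags, true, iommu); /* alias ID */
    add_ivrs_mapping_entry(bdf, bdf, flags, true, iommu);      /* IO-APIC/HPET */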

The change involves invoking
iterate_ivrs_mappings(amd_iommu_setup_device_table) a second time,
because the function now wants to be able to find PCI devices, which
isn't possible yet when IOMMU setup happens very early during x2APIC
mode setup. In this context amd_iommu_init_interrupt() gets renamed as
well.
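
Condensed sketch of the resulting two-phase flow (the real functions do
considerably more; see the iommu_init.c changes below):

    int amd_iommu_init(bool xt)
    {
        /* PCI devices can't be found yet on the early x2APIC path. */
        pci_init = !xt;
        return iterate_ivrs_mappings(amd_iommu_setup_device_table);
    }

    int amd_iommu_init_late(void)
    {
        /* The PCI bus has been scanned by now; redo skipped allocations. */
        pci_init = true;
        return iommu_intremap
               ? iterate_ivrs_mappings(amd_iommu_setup_device_table)
               : 0;
    }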

The logic adjusting a DTE's interrupt remapping attributes also gets
changed, such that the lack of an IRT results in target-aborted rather
than un-remapped interrupts (should any occur).

Note that for now phantom functions get separate IRTs allocated, as was
the case before.
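
The allocation steps through a device's phantom function IDs like so
(mirroring the iommu_init.c hunk below):

    unsigned int req_id = bdf;

    do {
        ivrs_mappings[req_id].intremap_table =
            amd_iommu_alloc_intremap_table(
                ivrs_mappings[bdf].iommu,
                &ivrs_mappings[req_id].intremap_inuse);
        if ( !ivrs_mappings[req_id].intremap_table )
            return -ENOMEM;

        if ( !pdev->phantom_stride )
            break;
        req_id += pdev->phantom_stride;
    } while ( PCI_SLOT(req_id) == pdev->sbdf.dev ); /* stop past the slot */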

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v6: Acquire IOMMU lock in code added to amd_iommu_add_device(). Drop a
    pointless use of the conditional operator.
v5: New.
---
TBD: This retains prior (but suspicious) behavior of not calling
     amd_iommu_set_intremap_table() for "invalid" IVRS mapping entries.
     Since DTE.IV=0 means un-remapped interrupts, I wonder if this needs
     changing.
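
For reference, the resulting DTE configuration (mirroring the
iommu_map.c hunk below; iv = false is the pass-through case the note
above worries about):

    void amd_iommu_set_intremap_table(
        struct amd_iommu_dte *dte, uint64_t intremap_ptr, bool valid)
    {
        dte->it_root = intremap_ptr >> 6;
        dte->int_tab_len = intremap_ptr ? IOMMU_INTREMAP_ORDER : 0;
        /* No IRT: target-abort rather than leave interrupts un-remapped. */
        dte->int_ctl = intremap_ptr ? IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED
                                    : IOMMU_DEV_TABLE_INT_CONTROL_ABORTED;
        dte->ig = false; /* unmapped interrupts result in i/o page faults */
        dte->iv = valid;
    }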

Backporting note: This depends on b5fbe81196 "iommu / x86: move call to
scan_pci_devices() out of vendor code"!

---
 xen/drivers/passthrough/amd/iommu_acpi.c      |   65 +++++++++++++----------
 xen/drivers/passthrough/amd/iommu_init.c      |   73 ++++++++++++++++++++------
 xen/drivers/passthrough/amd/iommu_intr.c      |    4 -
 xen/drivers/passthrough/amd/iommu_map.c       |    5 +
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   43 ++++++++++++++-
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |    2 +-
 6 files changed, 143 insertions(+), 49 deletions(-)

Comments

Paul Durrant Sept. 23, 2019, 3:27 p.m. UTC | #1
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan Beulich
> Sent: 19 September 2019 14:22
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> Subject: [Xen-devel] [PATCH v6 1/8] AMD/IOMMU: don't blindly allocate interrupt remapping tables
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

Andrew Cooper Sept. 25, 2019, 1:23 p.m. UTC | #2
On 19/09/2019 14:21, Jan Beulich wrote:
> The logic adjusting a DTE's interrupt remapping attributes also gets
> changed, such that the lack of an IRT would result in target aborted
> rather than not remapped interrupts (should any occur).

non-remapped or un-remapped would be slightly less awkward grammar here.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Patch

--- a/xen/drivers/passthrough/amd/iommu_acpi.c
+++ b/xen/drivers/passthrough/amd/iommu_acpi.c
@@ -53,7 +53,8 @@  union acpi_ivhd_device {
 };
 
 static void __init add_ivrs_mapping_entry(
-    u16 bdf, u16 alias_id, u8 flags, struct amd_iommu *iommu)
+    uint16_t bdf, uint16_t alias_id, uint8_t flags, bool alloc_irt,
+    struct amd_iommu *iommu)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(iommu->seg);
 
@@ -69,27 +70,32 @@  static void __init add_ivrs_mapping_entr
     if ( iommu->bdf == bdf )
         return;
 
-    if ( !ivrs_mappings[alias_id].intremap_table )
+    /* Allocate interrupt remapping table if needed. */
+    if ( iommu_intremap && !ivrs_mappings[alias_id].intremap_table )
     {
-         /* allocate per-device interrupt remapping table */
-         if ( amd_iommu_perdev_intremap )
-             ivrs_mappings[alias_id].intremap_table =
-                 amd_iommu_alloc_intremap_table(
-                     iommu,
-                     &ivrs_mappings[alias_id].intremap_inuse);
-         else
-         {
-             if ( shared_intremap_table == NULL  )
-                 shared_intremap_table = amd_iommu_alloc_intremap_table(
-                     iommu,
-                     &shared_intremap_inuse);
-             ivrs_mappings[alias_id].intremap_table = shared_intremap_table;
-             ivrs_mappings[alias_id].intremap_inuse = shared_intremap_inuse;
-         }
-
-         if ( !ivrs_mappings[alias_id].intremap_table )
-             panic("No memory for %04x:%02x:%02x.%u's IRT\n", iommu->seg,
-                   PCI_BUS(alias_id), PCI_SLOT(alias_id), PCI_FUNC(alias_id));
+        if ( !amd_iommu_perdev_intremap )
+        {
+            if ( !shared_intremap_table )
+                shared_intremap_table = amd_iommu_alloc_intremap_table(
+                    iommu, &shared_intremap_inuse);
+
+            if ( !shared_intremap_table )
+                panic("No memory for shared IRT\n");
+
+            ivrs_mappings[alias_id].intremap_table = shared_intremap_table;
+            ivrs_mappings[alias_id].intremap_inuse = shared_intremap_inuse;
+        }
+        else if ( alloc_irt )
+        {
+            ivrs_mappings[alias_id].intremap_table =
+                amd_iommu_alloc_intremap_table(
+                    iommu, &ivrs_mappings[alias_id].intremap_inuse);
+
+            if ( !ivrs_mappings[alias_id].intremap_table )
+                panic("No memory for %04x:%02x:%02x.%u's IRT\n",
+                      iommu->seg, PCI_BUS(alias_id), PCI_SLOT(alias_id),
+                      PCI_FUNC(alias_id));
+        }
     }
 
     ivrs_mappings[alias_id].valid = true;
@@ -433,7 +439,8 @@  static u16 __init parse_ivhd_device_sele
         return 0;
     }
 
-    add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, iommu);
+    add_ivrs_mapping_entry(bdf, bdf, select->header.data_setting, false,
+                           iommu);
 
     return sizeof(*select);
 }
@@ -479,7 +486,7 @@  static u16 __init parse_ivhd_device_rang
 
     for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
         add_ivrs_mapping_entry(bdf, bdf, range->start.header.data_setting,
-                               iommu);
+                               false, iommu);
 
     return dev_length;
 }
@@ -513,7 +520,8 @@  static u16 __init parse_ivhd_device_alia
 
     AMD_IOMMU_DEBUG(" Dev_Id Alias: %#x\n", alias_id);
 
-    add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, iommu);
+    add_ivrs_mapping_entry(bdf, alias_id, alias->header.data_setting, true,
+                           iommu);
 
     return dev_length;
 }
@@ -568,7 +576,7 @@  static u16 __init parse_ivhd_device_alia
 
     for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
         add_ivrs_mapping_entry(bdf, alias_id, range->alias.header.data_setting,
-                               iommu);
+                               true, iommu);
 
     return dev_length;
 }
@@ -593,7 +601,7 @@  static u16 __init parse_ivhd_device_exte
         return 0;
     }
 
-    add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting, iommu);
+    add_ivrs_mapping_entry(bdf, bdf, ext->header.data_setting, false, iommu);
 
     return dev_length;
 }
@@ -640,7 +648,7 @@  static u16 __init parse_ivhd_device_exte
 
     for ( bdf = first_bdf; bdf <= last_bdf; bdf++ )
         add_ivrs_mapping_entry(bdf, bdf, range->extended.header.data_setting,
-                               iommu);
+                               false, iommu);
 
     return dev_length;
 }
@@ -733,7 +741,8 @@  static u16 __init parse_ivhd_device_spec
     AMD_IOMMU_DEBUG("IVHD Special: %04x:%02x:%02x.%u variety %#x handle %#x\n",
                     seg, PCI_BUS(bdf), PCI_SLOT(bdf), PCI_FUNC(bdf),
                     special->variety, special->handle);
-    add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, iommu);
+    add_ivrs_mapping_entry(bdf, bdf, special->header.data_setting, true,
+                           iommu);
 
     switch ( special->variety )
     {
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -30,6 +30,7 @@ 
 #include <xen/delay.h>
 
 static int __initdata nr_amd_iommus;
+static bool __initdata pci_init;
 
 static void do_amd_iommu_irq(unsigned long data);
 static DECLARE_SOFTIRQ_TASKLET(amd_iommu_irq_tasklet, do_amd_iommu_irq, 0);
@@ -1244,17 +1245,20 @@  static int __init amd_iommu_setup_device
 
     BUG_ON( (ivrs_bdf_entries == 0) );
 
-    /* allocate 'device table' on a 4K boundary */
-    device_table.alloc_size = PAGE_SIZE <<
-                              get_order_from_bytes(
-                              PAGE_ALIGN(ivrs_bdf_entries *
-                              IOMMU_DEV_TABLE_ENTRY_SIZE));
-    device_table.entries = device_table.alloc_size /
-                           IOMMU_DEV_TABLE_ENTRY_SIZE;
-
-    device_table.buffer = allocate_buffer(device_table.alloc_size,
-                                          "Device Table");
-    if  ( device_table.buffer == NULL )
+    if ( !device_table.buffer )
+    {
+        /* allocate 'device table' on a 4K boundary */
+        device_table.alloc_size = PAGE_SIZE <<
+                                  get_order_from_bytes(
+                                  PAGE_ALIGN(ivrs_bdf_entries *
+                                  IOMMU_DEV_TABLE_ENTRY_SIZE));
+        device_table.entries = device_table.alloc_size /
+                               IOMMU_DEV_TABLE_ENTRY_SIZE;
+
+        device_table.buffer = allocate_buffer(device_table.alloc_size,
+                                              "Device Table");
+    }
+    if ( !device_table.buffer )
         return -ENOMEM;
 
     /* Add device table entries */
@@ -1263,13 +1267,46 @@  static int __init amd_iommu_setup_device
         if ( ivrs_mappings[bdf].valid )
         {
             void *dte;
+            const struct pci_dev *pdev = NULL;
 
             /* add device table entry */
             dte = device_table.buffer + (bdf * IOMMU_DEV_TABLE_ENTRY_SIZE);
             iommu_dte_add_device_entry(dte, &ivrs_mappings[bdf]);
 
+            if ( iommu_intremap &&
+                 ivrs_mappings[bdf].dte_requestor_id == bdf &&
+                 !ivrs_mappings[bdf].intremap_table )
+            {
+                if ( !pci_init )
+                    continue;
+                pcidevs_lock();
+                pdev = pci_get_pdev(seg, PCI_BUS(bdf), PCI_DEVFN2(bdf));
+                pcidevs_unlock();
+            }
+
+            if ( pdev )
+            {
+                unsigned int req_id = bdf;
+
+                do {
+                    ivrs_mappings[req_id].intremap_table =
+                        amd_iommu_alloc_intremap_table(
+                            ivrs_mappings[bdf].iommu,
+                            &ivrs_mappings[req_id].intremap_inuse);
+                    if ( !ivrs_mappings[req_id].intremap_table )
+                        return -ENOMEM;
+
+                    if ( !pdev->phantom_stride )
+                        break;
+                    req_id += pdev->phantom_stride;
+                } while ( PCI_SLOT(req_id) == pdev->sbdf.dev );
+            }
+
             amd_iommu_set_intremap_table(
-                dte, virt_to_maddr(ivrs_mappings[bdf].intremap_table),
+                dte,
+                ivrs_mappings[bdf].intremap_table
+                ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
+                : 0,
                 iommu_intremap);
         }
     }
@@ -1402,7 +1439,8 @@  int __init amd_iommu_init(bool xt)
     if ( rc )
         goto error_out;
 
-    /* allocate and initialize a global device table shared by all iommus */
+    /* Allocate and initialize device table(s). */
+    pci_init = !xt;
     rc = iterate_ivrs_mappings(amd_iommu_setup_device_table);
     if ( rc )
         goto error_out;
@@ -1422,7 +1460,7 @@  int __init amd_iommu_init(bool xt)
         /*
          * Setting up of the IOMMU interrupts cannot occur yet at the (very
          * early) time we get here when enabling x2APIC mode. Suppress it
-         * here, and do it explicitly in amd_iommu_init_interrupt().
+         * here, and do it explicitly in amd_iommu_init_late().
          */
         rc = amd_iommu_init_one(iommu, !xt);
         if ( rc )
@@ -1436,11 +1474,16 @@  error_out:
     return rc;
 }
 
-int __init amd_iommu_init_interrupt(void)
+int __init amd_iommu_init_late(void)
 {
     struct amd_iommu *iommu;
     int rc = 0;
 
+    /* Further initialize the device table(s). */
+    pci_init = true;
+    if ( iommu_intremap )
+        rc = iterate_ivrs_mappings(amd_iommu_setup_device_table);
+
     for_each_amd_iommu ( iommu )
     {
         struct irq_desc *desc;
--- a/xen/drivers/passthrough/amd/iommu_intr.c
+++ b/xen/drivers/passthrough/amd/iommu_intr.c
@@ -789,7 +789,7 @@  void amd_iommu_read_msi_from_ire(
     }
 }
 
-int __init amd_iommu_free_intremap_table(
+int amd_iommu_free_intremap_table(
     const struct amd_iommu *iommu, struct ivrs_mappings *ivrs_mapping)
 {
     void **tblp;
@@ -814,7 +814,7 @@  int __init amd_iommu_free_intremap_table
     return 0;
 }
 
-void *__init amd_iommu_alloc_intremap_table(
+void *amd_iommu_alloc_intremap_table(
     const struct amd_iommu *iommu, unsigned long **inuse_map)
 {
     unsigned int order = intremap_table_order(iommu);
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -116,8 +116,9 @@  void __init amd_iommu_set_intremap_table
     struct amd_iommu_dte *dte, uint64_t intremap_ptr, bool valid)
 {
     dte->it_root = intremap_ptr >> 6;
-    dte->int_tab_len = IOMMU_INTREMAP_ORDER;
-    dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
+    dte->int_tab_len = intremap_ptr ? IOMMU_INTREMAP_ORDER : 0;
+    dte->int_ctl = intremap_ptr ? IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED
+                                : IOMMU_DEV_TABLE_INT_CONTROL_ABORTED;
     dte->ig = false; /* unmapped interrupts result in i/o page faults */
     dte->iv = valid;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -164,7 +164,7 @@  static int __init iov_detect(void)
     if ( !iommu_enable && !iommu_intremap )
         return 0;
 
-    if ( (init_done ? amd_iommu_init_interrupt()
+    if ( (init_done ? amd_iommu_init_late()
                     : amd_iommu_init(false)) != 0 )
     {
         printk("AMD-Vi: Error initialization\n");
@@ -428,6 +428,7 @@  static int amd_iommu_add_device(u8 devfn
 {
     struct amd_iommu *iommu;
     u16 bdf;
+    struct ivrs_mappings *ivrs_mappings;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -457,6 +458,36 @@  static int amd_iommu_add_device(u8 devfn
         return -ENODEV;
     }
 
+    ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    bdf = PCI_BDF2(pdev->bus, devfn);
+    if ( !ivrs_mappings ||
+         !ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].valid )
+        return -EPERM;
+
+    if ( iommu_intremap &&
+         ivrs_mappings[bdf].dte_requestor_id == bdf &&
+         !ivrs_mappings[bdf].intremap_table )
+    {
+        unsigned long flags;
+
+        ivrs_mappings[bdf].intremap_table =
+            amd_iommu_alloc_intremap_table(
+                iommu, &ivrs_mappings[bdf].intremap_inuse);
+        if ( !ivrs_mappings[bdf].intremap_table )
+            return -ENOMEM;
+
+        spin_lock_irqsave(&iommu->lock, flags);
+
+        amd_iommu_set_intremap_table(
+            iommu->dev_table.buffer + (bdf * IOMMU_DEV_TABLE_ENTRY_SIZE),
+            virt_to_maddr(ivrs_mappings[bdf].intremap_table),
+            iommu_intremap);
+
+        amd_iommu_flush_device(iommu, bdf);
+
+        spin_unlock_irqrestore(&iommu->lock, flags);
+    }
+
     amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
     return 0;
 }
@@ -465,6 +496,8 @@  static int amd_iommu_remove_device(u8 de
 {
     struct amd_iommu *iommu;
     u16 bdf;
+    struct ivrs_mappings *ivrs_mappings;
+
     if ( !pdev->domain )
         return -EINVAL;
 
@@ -480,6 +513,14 @@  static int amd_iommu_remove_device(u8 de
     }
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
+
+    ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    bdf = PCI_BDF2(pdev->bus, devfn);
+    if ( amd_iommu_perdev_intremap &&
+         ivrs_mappings[bdf].dte_requestor_id == bdf &&
+         ivrs_mappings[bdf].intremap_table )
+        amd_iommu_free_intremap_table(iommu, &ivrs_mappings[bdf]);
+
     return 0;
 }
 
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -50,7 +50,7 @@  void get_iommu_features(struct amd_iommu
 /* amd-iommu-init functions */
 int amd_iommu_prepare(bool xt);
 int amd_iommu_init(bool xt);
-int amd_iommu_init_interrupt(void);
+int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);