
[qemu,v13,16/16] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

Message ID 1456823441-46757-17-git-send-email-aik@ozlabs.ru (mailing list archive)
State New, archived

Commit Message

Alexey Kardashevskiy March 1, 2016, 9:10 a.m. UTC
This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

This implements DDW for emulated and VFIO devices. As all TCE root regions
are mapped at 0 and 64bit long (and actual tables are child regions),
this replaces memory_region_add_subregion() with _overlap() to make
QEMU memory API happy.

This reserves RTAS token numbers for DDW calls.

This changes the TCE table migration descriptor to support dynamic
tables as from now on, PHB will create as many stub TCE table objects
as PHB can possibly support but not all of them might be initialized at
the time of migration because DDW might or might not be requested by
the guest.

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.5 machine and older disable it.

This implements DDW for VFIO. The host kernel support is required.
This adds a "levels" property to PHB to control the number of levels
in the actual TCE table allocated by the host kernel; 0 is the default
value and tells QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If this succeeds,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.
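
The handlers exchange 64-bit quantities (the BUID on input, the window bus
address on output) as pairs of 32-bit RTAS cells, high word first. A minimal
sketch of that packing follows; the helper names are hypothetical and are not
part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 32-bit RTAS cells (high word first) into one 64-bit value. */
static uint64_t rtas_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit value into the two 32-bit cells handed back to the guest. */
static void u64_to_rtas_cells(uint64_t val, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *hi = (uint32_t)(val >> 32);
    *lo = (uint32_t)(val & 0xffffffffu);
}
```

With the default 64bit window address, the high cell would be 0x08000000 and
the low cell 0.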

TODO (which I have no idea how to implement properly):
1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
windows and 12/16/24 page shift;
2. fix container::min_iova, max_iova - as for now, they are useless,
and I'd expect IOMMU MR boundaries to serve this purpose really;
3. vfio_listener_region_add/vfio_listener_region_del do explicitly
create/remove huge DMA window as we do not have vfio_container_ioctl()
anymore, do we want to move these to some sort of callbacks? How, where?

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	include/hw/pci-host/spapr.h

# Conflicts:
#	hw/vfio/common.c
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_iommu.c        |  32 ++++-
 hw/ppc/spapr_pci.c          |  61 +++++++--
 hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/common.c            |  70 +++++++++-
 include/hw/pci-host/spapr.h |  13 ++
 include/hw/ppc/spapr.h      |  17 ++-
 trace-events                |   6 +
 9 files changed, 489 insertions(+), 24 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

Comments

David Gibson March 4, 2016, 4:51 a.m. UTC | #1
On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
> This adds support for the Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification, which allows having additional DMA window(s).
> 
> This implements DDW for emulated and VFIO devices. As all TCE root regions
> are mapped at 0 and 64bit long (and actual tables are child regions),
> this replaces memory_region_add_subregion() with _overlap() to make
> QEMU memory API happy.
> 
> This reserves RTAS token numbers for DDW calls.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel; 0 is the default
> value and tells QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> Existing Linux guests try to create one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM into it. If this succeeds,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> TODO (which I have no idea how to implement properly):
> 1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
> windows and 12/16/24 page shift;

As noted in a different subthread, this information is there in the
container.

> 2. fix container::min_iova, max_iova - as for now, they are useless,
> and I'd expect IOMMU MR boundaries to serve this purpose really;

This seems to show a similar confusion of concepts to #1.
container::min_iova and container::max_iova advertise limitations of the
host IOMMU, whereas the IOMMU MR boundaries show constraints of the guest
IOMMU.  You need to verify the guest constraints against the host
constraints.

A more flexible method than min/max iova will be necessary though, now
that the host IOMMU allows more flexible configurations than a single
window.

> 3. vfio_listener_region_add/vfio_listener_region_del do explicitly
> create/remove huge DMA window as we do not have vfio_container_ioctl()
> anymore, do we want to move these to some sort of callbacks? How, where?
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> # Conflicts:
> #	include/hw/pci-host/spapr.h
> 
> # Conflicts:
> #	hw/vfio/common.c
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   7 +-
>  hw/ppc/spapr_iommu.c        |  32 ++++-
>  hw/ppc/spapr_pci.c          |  61 +++++++--
>  hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/common.c            |  70 +++++++++-
>  include/hw/pci-host/spapr.h |  13 ++
>  include/hw/ppc/spapr.h      |  17 ++-
>  trace-events                |   6 +
>  9 files changed, 489 insertions(+), 24 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e9d4abf..2473217 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>   * pseries-2.5
>   */
>  #define SPAPR_COMPAT_2_5 \
> -        HW_COMPAT_2_5
> +        HW_COMPAT_2_5 \
> +        {\
> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +            .property = "ddw",\
> +            .value    = stringify(off),\
> +        },
>  
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8aa2238..e32f71b 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>      return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +    tcet->migtable = tcet->table;
> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>      }
>  
> +    if (tcet->enabled) {
> +        if (!tcet->table) {
> +            tcet->enabled = false;
> +            /* VFIO does not migrate so pass vfio_accel == false */
> +            spapr_tce_table_do_enable(tcet, false);
> +        }

What if there was an existing table, but its size doesn't match that
in the incoming migration?  Don't you need to free() it and
re-allocate?  IIUC this would happen in practice if you migrated a
guest which had removed the default window and replaced it with one of
a different size.

> +        memcpy(tcet->table, tcet->migtable,
> +               tcet->nb_table * sizeof(tcet->table[0]));
> +        free(tcet->migtable);
> +        tcet->migtable = NULL;
> +    }

Likewise, what if your incoming migration is of a guest which has
completely removed the default window?  Don't you need to free the
existing default table?
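
To illustrate both cases, here is a minimal sketch of a post_load that drops a
stale table when the incoming window has a different size or is gone
entirely. The struct is a simplified, hypothetical stand-in for sPAPRTCETable
(nb_allocated in particular is an assumption), not the real type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sPAPRTCETable. */
struct tce_table {
    bool      enabled;       /* from the incoming migration stream */
    uint32_t  nb_table;      /* entry count from the incoming stream */
    uint32_t  nb_allocated;  /* entry count of the table allocated locally */
    uint64_t *table;         /* live table, NULL if none */
    uint64_t *migtable;      /* buffer filled by the migration stream */
};

/* Returns 0 on success, -1 on bad input or allocation failure. */
static int tce_table_post_load(struct tce_table *t)
{
    int ret = 0;

    /* Drop the local table if the window was removed or resized. */
    if (t->table && (!t->enabled || t->nb_allocated != t->nb_table)) {
        free(t->table);
        t->table = NULL;
        t->nb_allocated = 0;
    }

    if (t->enabled) {
        if (t->nb_table == 0 || !t->migtable) {
            ret = -1;                      /* enabled window without data */
        } else if (!t->table) {
            t->table = calloc(t->nb_table, sizeof(t->table[0]));
            if (t->table) {
                t->nb_allocated = t->nb_table;
            } else {
                ret = -1;
            }
        }
        if (ret == 0) {
            memcpy(t->table, t->migtable,
                   t->nb_table * sizeof(t->table[0]));
        }
    }

    /* The migration buffer is never needed after this point. */
    free(t->migtable);
    t->migtable = NULL;
    return ret;
}
```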

>      return 0;
>  }
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>      .name = "spapr_iommu",
> -    .version_id = 2,
> +    .version_id = 3,
>      .minimum_version_id = 2,
> +    .pre_save = spapr_tce_table_pre_save,
>      .post_load = spapr_tce_table_post_load,
>      .fields      = (VMStateField []) {
>          /* Sanity check */
>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>          /* IOMMU state */
> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> +                                    vmstate_info_uint64, uint64_t),
>  
>          VMSTATE_END_OF_LIST()
>      },
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 4c6e687..1bc0710 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>      return buf;
>  }
>  
> -static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> -                                       uint32_t liobn, uint32_t page_shift,
> -                                       uint64_t window_addr,
> -                                       uint64_t window_size)
> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                uint32_t liobn, uint32_t page_shift,
> +                                uint64_t window_addr,
> +                                uint64_t window_size)
>  {
>      sPAPRTCETable *tcet;
>      uint32_t nb_table = window_size >> page_shift;
> @@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>          return -1;
>      }
>  
> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +        return -1;
> +    }
> +
>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>  
>      return 0;
>  }
>  
> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>  {
>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>  
> @@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      }
>  
>      /* DMA setup */
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_report("No default TCE table for %s", sphb->dtbusname);
> -        return;
> -    }
> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
> +    sphb->dma64_window_size = pow2ceil(ram_size);

Why do you need this value?  Isn't the size of the dma64 window
supplied when you create it with RTAS?  It makes more sense to me to
validate the value at that point rather than here where you have to
use a global.

Plus, if your machine allows memory hotplug, you probably need
maxram_size rather than ram_size here.
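
Something along these lines at window creation time is what I have in mind.
This is only a sketch: the function names are made up, and whether the window
should be bounded by RAM size at all is an assumption carried over from the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round up to the next power of two; returns 0 for 0 or on overflow. */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    if (v == 0 || v > (1ULL << 63)) {
        return 0;
    }
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/*
 * Validate the window shift supplied by the guest against the maximum
 * amount of RAM the machine can ever have (hotplug included).
 * Returns 1 if the request is acceptable, 0 otherwise.
 */
static int ddw_window_shift_ok(uint32_t window_shift, uint32_t page_shift,
                               uint64_t maxram_size)
{
    uint64_t limit = round_up_pow2(maxram_size);

    if (window_shift < page_shift || window_shift >= 64) {
        return 0;
    }
    return limit != 0 && (1ULL << window_shift) <= limit;
}
```

With 3GB of maximum RAM the limit rounds up to 4GB, so a window_shift of 32
passes and 33 does not.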

>  
> -    memory_region_add_subregion(&sphb->iommu_root, 0,
> -                                spapr_tce_get_iommu(tcet));
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb),
> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
> +    }
>  
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
> @@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> +    int i;
> +
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
> +    }
>  
>      /* Register default 32bit DMA window */
>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> @@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..b8ea910
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,306 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->enabled) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->enabled) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> +                                 uint64_t page_mask)
> +{
> +    int i, j;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> +            if ((sps[i].page_shift == masks[j].shift) &&
> +                    (page_mask & (1ULL << masks[j].shift))) {
> +                mask |= masks[j].mask;
> +            }
> +        }
> +    }

Hmm... checking against the list of page sizes supported by the vcpu
seems conceptually wrong, although it's probably correct in practice.
Is there a way of checking directly against the page sizes supported by
the host IOMMU?
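
As a rough sketch of what I mean, the query could intersect the PHB's page
size mask with a host IOMMU page size mask instead of walking the vcpu's
segment page sizes. The host mask is assumed to come from the VFIO container,
and the flag values below are placeholders for the RTAS_DDW_PGSIZE_*
constants, chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag values standing in for RTAS_DDW_PGSIZE_* (assumed). */
enum {
    DDW_PGSIZE_4K   = 0x01,
    DDW_PGSIZE_64K  = 0x02,
    DDW_PGSIZE_16M  = 0x04,
    DDW_PGSIZE_32M  = 0x08,
    DDW_PGSIZE_64M  = 0x10,
    DDW_PGSIZE_128M = 0x20,
    DDW_PGSIZE_256M = 0x40,
    DDW_PGSIZE_16G  = 0x80,
};

/*
 * Both masks have bit N set when a page size of (1 << N) bytes is
 * supported. Only sizes supported on both sides are reported.
 */
static uint32_t ddw_query_mask(uint64_t phb_mask, uint64_t host_mask)
{
    static const struct { unsigned shift; uint32_t flag; } map[] = {
        { 12, DDW_PGSIZE_4K },   { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },  { 25, DDW_PGSIZE_32M },
        { 26, DDW_PGSIZE_64M },  { 27, DDW_PGSIZE_128M },
        { 28, DDW_PGSIZE_256M }, { 34, DDW_PGSIZE_16G },
    };
    uint64_t usable = phb_mask & host_mask;
    uint32_t flags = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (usable & (1ULL << map[i].shift)) {
            flags |= map[i].flag;
        }
    }
    return flags;
}
```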

> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t avail, addr, pgmask = 0;
> +    unsigned current;
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    current = spapr_phb_get_active_win_num(sphb);
> +    avail = (sphb->windows_supported > current) ?
> +            (sphb->windows_supported - current) : 0;

sphb->windows_supported < current indicates a bug in qemu, surely?  So
you should be able to do without the ?:.

> +
> +    /* Work out supported page masks */
> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as all RAM was in 4K pages.
> +     */
> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
> +                                pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid;
> +    long ret;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +
> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
> +        goto hw_error_exit;
> +    }
> +
> +    if (window_shift < page_shift) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
> +                                      sphb->dma64_window_addr,
> +                                      1ULL << window_shift);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift,
> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
> +                                 liobn, ret);
> +    if (ret || !tcet) {
> +        goto hw_error_exit;
> +    }

!ret && !tcet indicates a qemu bug, surely; an assert() would make more
sense.

> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +    long ret;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
> +    trace_spapr_iommu_ddw_remove(liobn, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +    long ret = 0;

ret is never assigned a value other than 0; remove it.

> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
> +    if (ret) {
> +        goto hw_error_exit;
> +    }
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 42ef1eb..2332f8e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -395,6 +395,39 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>          giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);

It might make this easier to review if the guest side (non-VFIO) and
VFIO parts were in different patches.

> +        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

Might want to split this stuff out into a "new guest iommu" helper.
It would want to first check if the guest IOMMU can be supported with
the existing host IOMMU windows.  If not, and the host IOMMU supports
it (i.e. SPAPR_TCE_v2_IOMMU) it would attempt to create a new host
window.
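
Roughly, the helper could look like the sketch below. Every type and name
here is hypothetical; the "create" step only records the window where real
code would issue the VFIO ioctl and handle its failure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HOST_WINDOWS 4   /* arbitrary bound for the sketch */

/* Hypothetical description of one IOVA window. */
struct iova_window {
    uint64_t start;
    uint64_t size;
};

/* Hypothetical view of the host IOMMU state kept per container. */
struct host_iommu {
    struct iova_window windows[MAX_HOST_WINDOWS];
    size_t nr_windows;
    bool can_create;   /* true when the host supports dynamic windows */
};

/* True if host window h fully contains guest window g (overflow safe). */
static bool window_covers(const struct iova_window *h,
                          const struct iova_window *g)
{
    return g->start >= h->start && g->size <= h->size &&
           g->start - h->start <= h->size - g->size;
}

/* Returns 0 if the guest window is backed by a host window, -1 if not. */
static int host_iommu_add_guest_window(struct host_iommu *host,
                                       const struct iova_window *guest)
{
    size_t i;

    /* First see whether an existing host window already covers it. */
    for (i = 0; i < host->nr_windows; i++) {
        if (window_covers(&host->windows[i], guest)) {
            return 0;
        }
    }

    /* Otherwise try to create a matching host window, if allowed. */
    if (!host->can_create || host->nr_windows == MAX_HOST_WINDOWS) {
        return -1;
    }
    host->windows[host->nr_windows++] = *guest;
    return 0;
}
```

This also gives region_add a single place to report an error when no
matching host window can be found or created.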

> +            int ret;
> +            struct vfio_iommu_spapr_tce_create create = {
> +                .argsz = sizeof(create),
> +                .page_shift = ctz64(giommu->iova_pgsizes),
> +                .window_size = memory_region_size(section->mr),
> +                .levels = 0,
> +                .start_addr = 0,
> +            };
> +
> +            /*
> +             * Dynamic windows are supported, that means that there is no
> +             * pre-created window and we have to create one.
> +             */
> +            if (!create.levels) {

This test will always be true.

> +                unsigned entries = create.window_size >> create.page_shift;
> +                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
> +                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
> +                create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;

Hmm.. does it make more sense for qemu to apply this heuristic, or the kernel?
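
Wherever it ends up, the mapping in the comment can be written directly. Note
that pow2ceil(pages) - 1 is odd whenever pages is greater than 1, so ctz64()
of it is 0; the sketch below therefore follows the ranges given in the
comment rather than that expression. Function names are hypothetical, and
unlike the patch the page count is rounded up rather than truncated.

```c
#include <assert.h>
#include <stdint.h>

/* Number of host pages needed to hold the TCE entries of a window. */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    assert(page_shift < 64 && host_page_size > 0);

    uint64_t entries = window_size >> page_shift;
    uint64_t bytes = entries * sizeof(uint64_t);

    return (bytes + host_page_size - 1) / host_page_size;
}

/* 0..64 pages = 1 level; 65..4096 = 2; 4097..262144 = 3; more = 4. */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned levels = 1;
    uint64_t capacity = 64;   /* pages covered with the current level count */

    while (pages > capacity && levels < 4) {
        capacity *= 64;
        levels++;
    }
    return levels;
}
```

For example, a 4GB window with 64K IOMMU pages on a 64K-page host needs 8
pages of TCEs, which gives a single level.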

> +            }
> +            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +            if (ret) {
> +                error_report("Failed to create a window");
> +            }
> +
> +            if (create.start_addr != section->offset_within_address_space) {
> +                error_report("Something went wrong!");

Shouldn't you at least set start_addr before the ioctl() as a hint to
the kernel?

> +            }
> +            trace_vfio_spapr_create_window(create.page_shift,
> +                                           create.window_size,
> +                                           create.start_addr);
> +        }
> +
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>          giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
>          memory_region_iommu_replay(giommu->iommu, &giommu->n,
> @@ -500,6 +533,18 @@ static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
>                       container, iova, end - iova, ret);
>      }
>  
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        struct vfio_iommu_spapr_tce_remove remove = {
> +            .argsz = sizeof(remove),
> +            .start_addr = section->offset_within_address_space,
> +        };
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +        if (ret) {
> +            error_report("Failed to remove window");
> +        }
> +
> +        trace_vfio_spapr_remove_window(remove.start_addr);
> +    }
>      if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
>          iommu->iommu_ops->vfio_notify(section->mr, false);
>      }
> @@ -792,11 +837,6 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              }
>          }
>  
> -        /*
> -         * This only considers the host IOMMU's 32-bit window.  At
> -         * some point we need to add support for the optional 64-bit
> -         * window and dynamic windows
> -         */
>          info.argsz = sizeof(info);
>          ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
>          if (ret) {
> @@ -805,7 +845,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>              goto free_container_exit;
>          }
>          container->min_iova = info.dma32_window_start;
> -        container->max_iova = container->min_iova + info.dma32_window_size - 1;
> +        container->max_iova = (hwaddr)-1;

Rather than hacking min/max iova here, I think it makes more sense for
the "create new host window" path to *replace* the tests against
min/max iova in the add_region path.  Basically the min/max iova tests
are a (rather dumb) check of whether the new guest window is
compatible with the host windows.  When the host windows are dynamic a
static test doesn't make sense and should be replaced by the code to
create a new host window (and error if it can't make a matching one).

> +
> +        if (v2) {
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del happy, we better remove this window now
> +             * and let those iommu_listener callbacks create them when needed.
> +             */
> +            struct vfio_iommu_spapr_tce_remove remove = {
> +                .argsz = sizeof(remove),
> +                .start_addr = info.dma32_window_start,
> +            };
> +            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
> +            if (ret) {
> +                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
> +                ret = -errno;
> +                goto free_container_exit;
> +            }
> +        }
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..855e458 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -71,6 +71,12 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint32_t windows_supported;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_window_addr;
> +    uint64_t dma64_window_size;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> @@ -89,6 +95,8 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> @@ -148,5 +156,10 @@ static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  #endif
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +                                uint32_t liobn, uint32_t page_shift,
> +                                uint64_t window_addr,
> +                                uint64_t window_size);
> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
>  
>  #endif /* __HW_SPAPR_PCI_H__ */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 505cb3a..4f59d1b 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -417,6 +417,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -458,8 +468,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> @@ -545,6 +559,7 @@ struct sPAPRTCETable {
>      uint64_t bus_offset;
>      uint32_t page_shift;
>      uint64_t *table;
> +    uint64_t *migtable;
>      bool bypass;
>      int fd;
>      MemoryRegion root, iommu;
> diff --git a/trace-events b/trace-events
> index f5335ec..c7314b6 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1432,6 +1432,10 @@ spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
>  spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
>  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> @@ -1727,6 +1731,8 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
>  vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
>  vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
> +vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
>  
>  # hw/vfio/platform.c
>  vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
Alexey Kardashevskiy March 11, 2016, 9:03 a.m. UTC | #2
On 03/04/2016 03:51 PM, David Gibson wrote:
> On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.5 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> TODO (which I have no idea how to implement properly):
>> 1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
>> windows and 12/16/24 page shift;
>
> As noted in a different subthread, this information is there in the
> container.

Well, I would rather have this in rtas_ibm_query_pe_dma_window() so that it
can report the supported page sizes to the guest, but I cannot because
vfio_container_ioctl() is gone.

I guess I'll just make page_size_mask, windows_supported and
dma64_window_start PHB properties and set them to what I think the host
supports. If the host does not support something, QEMU will just fail
quickly and for an obvious reason.


>> 2. fix container::min_iova, max_iova - as for now, they are useless,
>> and I'd expect IOMMU MR boundaries to serve this purpose really;
>
> This seems to show a similar confusion of concepts to #1.
> container::min_iova and container::max_iova advertise limitations of the
> host IOMMU, whereas the IOMMU MR boundaries show constraints of the guest
> IOMMU.  You need to verify the guest constraints against the host
> constraints.
>
> A more flexible method than min/max iova will be necessary though, now
> that the host IOMMU allows more flexible configurations than a single
> window.
>
>> 3. vfio_listener_region_add/vfio_listener_region_del do explicitly
>> create/remove huge DMA window as we do not have vfio_container_ioctl()
>> anymore, do we want to move these to some sort of callbacks? How, where?
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> # Conflicts:
>> #	include/hw/pci-host/spapr.h
>>
>> # Conflicts:
>> #	hw/vfio/common.c
>> ---
>>   hw/ppc/Makefile.objs        |   1 +
>>   hw/ppc/spapr.c              |   7 +-
>>   hw/ppc/spapr_iommu.c        |  32 ++++-
>>   hw/ppc/spapr_pci.c          |  61 +++++++--
>>   hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/common.c            |  70 +++++++++-
>>   include/hw/pci-host/spapr.h |  13 ++
>>   include/hw/ppc/spapr.h      |  17 ++-
>>   trace-events                |   6 +
>>   9 files changed, 489 insertions(+), 24 deletions(-)
>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c1ffc77..986b36f 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>   obj-y += spapr_pci_vfio.o
>>   endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e9d4abf..2473217 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>    * pseries-2.5
>>    */
>>   #define SPAPR_COMPAT_2_5 \
>> -        HW_COMPAT_2_5
>> +        HW_COMPAT_2_5 \
>> +        {\
>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +            .property = "ddw",\
>> +            .value    = stringify(off),\
>> +        },
>>
>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>   {
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 8aa2238..e32f71b 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
>>       return 1ULL << tcet->page_shift;
>>   }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> +    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> +    tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
>> +
>>   static int spapr_tce_table_post_load(void *opaque, int version_id)
>>   {
>>       sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>>           spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>>       }
>>
>> +    if (tcet->enabled) {
>> +        if (!tcet->table) {
>> +            tcet->enabled = false;
>> +            /* VFIO does not migrate so pass vfio_accel == false */
>> +            spapr_tce_table_do_enable(tcet, false);
>> +        }
>
> What if there was an existing table, but its size doesn't match that
> in the incoming migration?  Don't you need to free() it and
> re-allocate?  IIUC this would happen in practice if you migrated a
> guest which had removed the default window and replaced it with one of
> a different size.
>
>> +        memcpy(tcet->table, tcet->migtable,
>> +               tcet->nb_table * sizeof(tcet->table[0]));
>> +        free(tcet->migtable);
>> +        tcet->migtable = NULL;
>> +    }
>
> Likewise, what if your incoming migration is of a guest which has
> completely removed the default window?  Don't you need to free the
> existing default table?
>
>>       return 0;
>>   }
>>
>>   static const VMStateDescription vmstate_spapr_tce_table = {
>>       .name = "spapr_iommu",
>> -    .version_id = 2,
>> +    .version_id = 3,
>>       .minimum_version_id = 2,
>> +    .pre_save = spapr_tce_table_pre_save,
>>       .post_load = spapr_tce_table_post_load,
>>       .fields      = (VMStateField []) {
>>           /* Sanity check */
>>           VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>> -        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>>
>>           /* IOMMU state */
>> +        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
>> +        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>> +        VMSTATE_UINT32(nb_table, sPAPRTCETable),
>>           VMSTATE_BOOL(bypass, sPAPRTCETable),
>> -        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
>> +        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
>> +                                    vmstate_info_uint64, uint64_t),
>>
>>           VMSTATE_END_OF_LIST()
>>       },
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 4c6e687..1bc0710 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>       return buf;
>>   }
>>
>> -static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> -                                       uint32_t liobn, uint32_t page_shift,
>> -                                       uint64_t window_addr,
>> -                                       uint64_t window_size)
>> +int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>> +                                uint32_t liobn, uint32_t page_shift,
>> +                                uint64_t window_addr,
>> +                                uint64_t window_size)
>>   {
>>       sPAPRTCETable *tcet;
>>       uint32_t nb_table = window_size >> page_shift;
>> @@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>           return -1;
>>       }
>>
>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> +        return -1;
>> +    }
>> +
>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
>>
>>       return 0;
>>   }
>>
>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>   {
>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>
>> @@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>       }
>>
>>       /* DMA setup */
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>> -        return;
>> -    }
>> +    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
>> +    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
>> +    sphb->dma64_window_size = pow2ceil(ram_size);
>
> Why do you need this value?  Isn't the size of the dma64 window
> supplied when you create it with RTAS?  It makes more sense to me to
> validate the value at that point rather than here where you have to
> use a global.
>
> Plus.. if your machine allows hotplug memory you probably need
> maxram_size, rather than ram_size here.
>
>>
>> -    memory_region_add_subregion(&sphb->iommu_root, 0,
>> -                                spapr_tce_get_iommu(tcet));
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>> +    }
>>
>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>   }
>> @@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>
>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>   {
>> -    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>> +    int i;
>> +
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
>> +    }
>>
>>       /* Register default 32bit DMA window */
>>       spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
>> @@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
>>       /* Default DMA window is 0..1GB */
>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>
>> @@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       uint32_t interrupt_map_mask[] = {
>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>       sPAPRTCETable *tcet;
>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>       sPAPRFDT s_fdt;
>> @@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>       /* Build the interrupt-map, this must matches what is done
>>        * in pci_spapr_map_irq
>>        */
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..b8ea910
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,306 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->enabled) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->enabled) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>> +                                 uint64_t page_mask)
>> +{
>> +    int i, j;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>> +            if ((sps[i].page_shift == masks[j].shift) &&
>> +                    (page_mask & (1ULL << masks[j].shift))) {
>> +                mask |= masks[j].mask;
>> +            }
>> +        }
>> +    }
>
> Hmm... checking against the list of page sizes supported by the vcpu
> seems conceptually wrong, although it's probably correct in practice.
> Is there a way of checking directly against the page sizes supported by
> the host IOMMU?


VFIO_IOMMU_SPAPR_TCE_GET_INFO returns the mask, but since
vfio_container_ioctl() is gone, there is no direct way of knowing it here;
it is now hidden in hw/vfio/common.c.

Anyway, the host IOMMU always supports 4K|64K|16M. QEMU may or may not use
huge pages for the guest RAM, and this determines whether H_PUT_TCE for a
16M page succeeds or fails.
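
If the host mask were available here, translating it would not need the
vcpu page size list at all. A minimal sketch follows, assuming the host
mask uses the same encoding as sphb->page_size_mask in this patch (bit N
set means 2^N-byte pages are supported). The helper name
ddw_pgmask_from_iommu_pgsizes() is hypothetical; the RTAS_DDW_PGSIZE_*
values are copied from the spapr.h hunk of this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the include/hw/ppc/spapr.h hunk of this patch. */
#define RTAS_DDW_PGSIZE_4K       0x01
#define RTAS_DDW_PGSIZE_64K      0x02
#define RTAS_DDW_PGSIZE_16M      0x04
#define RTAS_DDW_PGSIZE_32M      0x08
#define RTAS_DDW_PGSIZE_64M      0x10
#define RTAS_DDW_PGSIZE_128M     0x20
#define RTAS_DDW_PGSIZE_256M     0x40
#define RTAS_DDW_PGSIZE_16G      0x80

/*
 * Hypothetical helper: translate a bitmap of supported IOMMU page sizes
 * (bit N set means 2^N-byte pages are supported) into the page size mask
 * returned by ibm,query-pe-dma-window. Sizes with no RTAS encoding are
 * silently dropped.
 */
static uint32_t ddw_pgmask_from_iommu_pgsizes(uint64_t iommu_pgsizes)
{
    static const struct { unsigned shift; uint32_t mask; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M },
        { 28, RTAS_DDW_PGSIZE_256M },
        { 34, RTAS_DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < sizeof(masks) / sizeof(masks[0]); i++) {
        if (iommu_pgsizes & (1ULL << masks[i].shift)) {
            mask |= masks[i].mask;
        }
    }
    return mask;
}
```

For the 4K|64K|16M case mentioned above, the input is
(1ULL << 12) | (1ULL << 16) | (1ULL << 24) and the three lowest mask
bits come back set.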


>
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    CPUPPCState *env = &cpu->env;
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    unsigned current;
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    current = spapr_phb_get_active_win_num(sphb);
>> +    avail = (sphb->windows_supported > current) ?
>> +            (sphb->windows_supported - current) : 0;
>
> sphb->windows_supported < current indicates a bug in qemu, surely?  So
> you should be able to do without the ?:.
>
>> +
>> +    /* Work out supported page masks */
>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as all RAM was in 4K pages.
>> +     */
>> +    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
>> +                                pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid;
>> +    long ret;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +
>> +    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    if (window_shift < page_shift) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
>> +                                      sphb->dma64_window_addr,
>> +                                      1ULL << window_shift);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift,
>> +                                 tcet ? tcet->bus_offset : 0xbaadf00d,
>> +                                 liobn, ret);
>> +    if (ret || !tcet) {
>> +        goto hw_error_exit;
>> +    }
>
> !ret && !tcet indicates a qemu bug, surely; an assert would make more
> sense.

Heh. That is correct. However, spapr_phb_dma_window_enable() eventually
calls vfio_listener_region_add(), which can fail because it calls into the
host VFIO IOMMU driver, and there is no nice way of delivering that error
here...


>
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +    long ret;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    ret = spapr_phb_dma_window_disable(sphb, liobn);
>> +    trace_spapr_iommu_ddw_remove(liobn, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +    long ret = 0;
>
> ret is never assigned a value other than 0; remove it.
>
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr, ret);
>> +    if (ret) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 42ef1eb..2332f8e 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -395,6 +395,39 @@ static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
>>           giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>
> It might make this easier to review if the guest side (non-VFIO) and
> VFIO parts were in different patches.
>
>> +        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>
> Might want to split this stuff out into a "new guest iommu" helper.
> It would want to first check if the guest IOMMU can be supported with
> the existing host IOMMU windows.  If not, and the host IOMMU supports
> it (i.e. SPAPR_TCE_v2_IOMMU) it would attempt to create a new host
> window.
>
>> +            int ret;
>> +            struct vfio_iommu_spapr_tce_create create = {
>> +                .argsz = sizeof(create),
>> +                .page_shift = ctz64(giommu->iova_pgsizes),
>> +                .window_size = memory_region_size(section->mr),
>> +                .levels = 0,
>> +                .start_addr = 0,
>> +            };
>> +
>> +            /*
>> +             * Dynamic windows are supported, that means that there is no
>> +             * pre-created window and we have to create one.
>> +             */
>> +            if (!create.levels) {
>
> This test will always be true.
>
>> +                unsigned entries = create.window_size >> create.page_shift;
>> +                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
>> +                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
>> +                create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>
> Hmm.. does it make more sense for qemu to apply this heuristic, or the kernel?


If something can be done safely in userspace, why would we want to put
it in the kernel?
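
To make the thresholds in the quoted comment concrete (0..64 = 1;
65..4096 = 2; 4097..262144 = 3; 262145.. = 4), here is a minimal standalone
sketch of the same calculation done in userspace. It is not the code in the
patch: tce_levels_for_pages() and tce_table_pages() are hypothetical names,
and the cap at 4 levels simply follows the comment. Note that the quoted
expression takes ctz64() of pow2ceil(pages) - 1, which is an odd number
whenever pow2ceil(pages) is at least 2, so it is worth checking that it
really yields these thresholds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: choose the number of TCE
 * table levels from the number of host pages the table occupies, using
 * the thresholds from the quoted comment. Each additional level
 * multiplies the number of pages that can be covered by 64.
 */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned levels = 1;
    uint64_t capacity = 64;     /* pages covered by a single level */

    while (pages > capacity && levels < 4) {
        capacity *= 64;
        levels++;
    }
    return levels;
}

/* Host pages needed for a table of 8-byte TCEs, rounded up. */
static uint64_t tce_table_pages(uint64_t window_size, unsigned page_shift,
                                uint64_t host_page_size)
{
    assert(host_page_size != 0 && page_shift < 64);

    uint64_t entries = window_size >> page_shift;
    uint64_t bytes = entries * sizeof(uint64_t);

    return (bytes + host_page_size - 1) / host_page_size;
}
```

For example, a 1GB window with 4K IOMMU pages on a host with 64K pages
needs 32 table pages, which fits in a single level.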



>> +            }
>> +            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> +            if (ret) {
>> +                error_report("Failed to create a window");
>> +            }
>> +
>> +            if (create.start_addr != section->offset_within_address_space) {
>> +                error_report("Something went wrong!");
>
> Shouldn't you at least set start_addr before the ioctl() as a hint to
> the kernel?


The kernel does not take hints, at least on POWER8 (maybe it will on POWER9).
David Gibson March 15, 2016, 5:53 a.m. UTC | #3
On Fri, Mar 11, 2016 at 08:03:43PM +1100, Alexey Kardashevskiy wrote:
> On 03/04/2016 03:51 PM, David Gibson wrote:
> >On Tue, Mar 01, 2016 at 08:10:41PM +1100, Alexey Kardashevskiy wrote:
> >>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>the SPAPR specification which allows to have additional DMA window(s)
> >>
> >>This implements DDW for emulated and VFIO devices. As all TCE root regions
> >>are mapped at 0 and 64bit long (and actual tables are child regions),
> >>this replaces memory_region_add_subregion() with _overlap() to make
> >>QEMU memory API happy.
> >>
> >>This reserves RTAS token numbers for DDW calls.
> >>
> >>This changes the TCE table migration descriptor to support dynamic
> >>tables as from now on, PHB will create as many stub TCE table objects
> >>as PHB can possibly support but not all of them might be initialized at
> >>the time of migration because DDW might or might not be requested by
> >>the guest.
> >>
> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >>the pseries-2.5 machine and older disable it.
> >>
> >>This implements DDW for VFIO. The host kernel support is required.
> >>This adds a "levels" property to PHB to control the number of levels
> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >>value to tell QEMU to calculate the correct value. Current hardware
> >>supports up to 5 levels.
> >>
> >>The existing linux guests try creating one additional huge DMA window
> >>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>property which is a bus address for the 64bit window and by default
> >>set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >>uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >>This adds 4 RTAS handlers:
> >>* ibm,query-pe-dma-window
> >>* ibm,create-pe-dma-window
> >>* ibm,remove-pe-dma-window
> >>* ibm,reset-pe-dma-window
> >>These are registered from type_init() callback.
> >>
> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >>spapr_iommu.c with PCI.
> >>
> >>TODO (which I have no idea how to implement properly):
> >>1. check the host kernel actually supports SPAPR_PCI_DMA_MAX_WINDOWS
> >>windows and 12/16/24 page shift;
> >
> >As noted in a different subthread, this information is there in the
> >container.
> 
> Well, I rather want this in rtas_ibm_query_pe_dma_window() to report to the
> guest the supported page sizes but I cannot because of missing
> vfio_container_ioctl().

You'll need to add a new interface(s) in the VFIO code to retrieve
this.  It should take an AddressSpace and return the minimum
capabilities that can be simultaneously supported by all attached
containers.
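
As a rough sketch of what "minimum capabilities that can be simultaneously
supported" could mean: intersect the page size masks and take the smallest
window count across the containers. The struct and function names below are
hypothetical and do not exist in the VFIO code; a real interface would walk
the containers attached to the AddressSpace instead of taking an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability record; field names are illustrative only. */
struct iommu_caps {
    uint64_t page_size_mask;    /* bit N set => 2^N byte IOMMU pages */
    uint32_t max_windows;       /* DMA windows the container allows */
};

/*
 * Return the capabilities every container in the array can support at
 * the same time. With no containers the result is unconstrained.
 */
static struct iommu_caps common_iommu_caps(const struct iommu_caps *caps,
                                           size_t n)
{
    struct iommu_caps out = {
        .page_size_mask = UINT64_MAX,
        .max_windows = UINT32_MAX,
    };

    assert(caps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* A page size is usable only if all containers support it. */
        out.page_size_mask &= caps[i].page_size_mask;
        /* The most restrictive container limits the window count. */
        if (caps[i].max_windows < out.max_windows) {
            out.max_windows = caps[i].max_windows;
        }
    }
    return out;
}
```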

> I guess I'll just make page_size_mask, windows_supported and
> dma64_window_start PHB properties, set them to what I think the host
> supports and if the host does not support something, then QEMU will just
> fail quite quick and quite obviously why.

Actually.. that's a better idea.  In general I think it makes for
saner handling of compatibility in future if you make the guest
properties directly settable and check whether they're possible on the
host, rather than trying to autoset the guest capabilities to match
the host.

> >>2. fix container::min_iova, max_iova - as for now, they are useless,
> >>and I'd expect IOMMU MR boundaries to serve this purpose really;
> >
> >This seems to show a similar confusion of concepts to #1.
> >container::min_iova, container::max_iova advertise limitations of the
> >host IOMMU, the IOMMU MR boundaries show constraints of the guest
> >IOMMU.  You need to verify the guest constraints against the host
> >constraints.
> >
> >A more flexible method than min/max iova will be necessary though, now
> >that the host IOMMU allows more flexible configurations than a single
> >window.
> >
> >>3. vfio_listener_region_add/vfio_listener_region_del do explicitely
> >>create/remove huge DMA window as we do not have vfio_container_ioctl()
> >>anymore, do we want to move these to some sort of callbacks? How, where?
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >># Conflicts:
> >>#	include/hw/pci-host/spapr.h
> >>
> >># Conflicts:
> >>#	hw/vfio/common.c
> >>---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   7 +-
> >>  hw/ppc/spapr_iommu.c        |  32 ++++-
> >>  hw/ppc/spapr_pci.c          |  61 +++++++--
> >>  hw/ppc/spapr_rtas_ddw.c     | 306 ++++++++++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/common.c            |  70 +++++++++-
> >>  include/hw/pci-host/spapr.h |  13 ++
> >>  include/hw/ppc/spapr.h      |  17 ++-
> >>  trace-events                |   6 +
> >>  9 files changed, 489 insertions(+), 24 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>index c1ffc77..986b36f 100644
> >>--- a/hw/ppc/Makefile.objs
> >>+++ b/hw/ppc/Makefile.objs
> >>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>index e9d4abf..2473217 100644
> >>--- a/hw/ppc/spapr.c
> >>+++ b/hw/ppc/spapr.c
> >>@@ -2370,7 +2370,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>   * pseries-2.5
> >>   */
> >>  #define SPAPR_COMPAT_2_5 \
> >>-        HW_COMPAT_2_5
> >>+        HW_COMPAT_2_5 \
> >>+        {\
> >>+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>+            .property = "ddw",\
> >>+            .value    = stringify(off),\
> >>+        },
> >>
> >>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>  {
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 8aa2238..e32f71b 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -150,6 +150,15 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> >>      return 1ULL << tcet->page_shift;
> >>  }
> >>
> >>+static void spapr_tce_table_pre_save(void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>+
> >>+    tcet->migtable = tcet->table;
> >>+}
> >>+
> >>+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
> >>+
> >>  static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>  {
> >>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> >>@@ -158,22 +167,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
> >>          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
> >>      }
> >>
> >>+    if (tcet->enabled) {
> >>+        if (!tcet->table) {
> >>+            tcet->enabled = false;
> >>+            /* VFIO does not migrate so pass vfio_accel == false */
> >>+            spapr_tce_table_do_enable(tcet, false);
> >>+        }
> >
> >What if there was an existing table, but its size doesn't match that
> >in the incoming migration?  Don't you need to free() it and
> >re-allocate?  IIUC this would happen in practice if you migrated a
> >guest which had removed the default window and replaced it with one of
> >a different size.
> >
> >>+        memcpy(tcet->table, tcet->migtable,
> >>+               tcet->nb_table * sizeof(tcet->table[0]));
> >>+        free(tcet->migtable);
> >>+        tcet->migtable = NULL;
> >>+    }
> >
> >Likewise, what if your incoming migration is of a guest which has
> >completely removed the default window?  Don't you need to free the
> >existing default table?
> >
> >>      return 0;
> >>  }
> >>
> >>  static const VMStateDescription vmstate_spapr_tce_table = {
> >>      .name = "spapr_iommu",
> >>-    .version_id = 2,
> >>+    .version_id = 3,
> >>      .minimum_version_id = 2,
> >>+    .pre_save = spapr_tce_table_pre_save,
> >>      .post_load = spapr_tce_table_post_load,
> >>      .fields      = (VMStateField []) {
> >>          /* Sanity check */
> >>          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> >>-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
> >>
> >>          /* IOMMU state */
> >>+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
> >>+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
> >>          VMSTATE_BOOL(bypass, sPAPRTCETable),
> >>-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
> >>+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
> >>+                                    vmstate_info_uint64, uint64_t),
> >>
> >>          VMSTATE_END_OF_LIST()
> >>      },
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index 4c6e687..1bc0710 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,10 +803,10 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
> >>      return buf;
> >>  }
> >>
> >>-static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>-                                       uint32_t liobn, uint32_t page_shift,
> >>-                                       uint64_t window_addr,
> >>-                                       uint64_t window_size)
> >>+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+                                uint32_t liobn, uint32_t page_shift,
> >>+                                uint64_t window_addr,
> >>+                                uint64_t window_size)
> >>  {
> >>      sPAPRTCETable *tcet;
> >>      uint32_t nb_table = window_size >> page_shift;
> >>@@ -820,12 +820,16 @@ static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>          return -1;
> >>      }
> >>
> >>+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> >>+        return -1;
> >>+    }
> >>+
> >>      spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
> >>
> >>      return 0;
> >>  }
> >>
> >>-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
> >>  {
> >>      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
> >>
> >>@@ -1418,14 +1422,21 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>      }
> >>
> >>      /* DMA setup */
> >>-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> >>-    if (!tcet) {
> >>-        error_report("No default TCE table for %s", sphb->dtbusname);
> >>-        return;
> >>-    }
> >>+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
> >>+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
> >>+    sphb->dma64_window_size = pow2ceil(ram_size);
> >
> >Why do you need this value?  Isn't the size of the dma64 window
> >supplied when you create it with RTAS?  It makes more sense to me to
> >validate the value at that point rather than here where you have to
> >use a global.
> >
> >Plus.. if your machine allows hotplug memory you probably need
> >maxram_size, rather than ram_size here.
> >
> >>
> >>-    memory_region_add_subregion(&sphb->iommu_root, 0,
> >>-                                spapr_tce_get_iommu(tcet));
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        tcet = spapr_tce_new_table(DEVICE(sphb),
> >>+                                   SPAPR_PCI_LIOBN(sphb->index, i));
> >>+        if (!tcet) {
> >>+            error_setg(errp, "Creating window#%d failed for %s",
> >>+                       i, sphb->dtbusname);
> >>+            return;
> >>+        }
> >>+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> >>+                                            spapr_tce_get_iommu(tcet), 0);
> >>+    }
> >>
> >>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
> >>  }
> >>@@ -1443,7 +1454,11 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
> >>
> >>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >>  {
> >>-    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
> >>+    int i;
> >>+
> >>+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >>+        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
> >>+    }
> >>
> >>      /* Register default 32bit DMA window */
> >>      spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
> >>@@ -1481,6 +1496,9 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >>+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
> >>+                       0x800000000000000ULL),
> >>+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>
> >>@@ -1734,6 +1752,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >>+    uint32_t ddw_applicable[] = {
> >>+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >>+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >>+    };
> >>+    uint32_t ddw_extensions[] = {
> >>+        cpu_to_be32(1),
> >>+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >>+    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >>@@ -1758,6 +1785,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>
> >>+    /* Dynamic DMA window */
> >>+    if (phb->ddw_enabled) {
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >>+                         sizeof(ddw_applicable)));
> >>+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >>+                         &ddw_extensions, sizeof(ddw_extensions)));
> >>+    }
> >>+
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >>diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >>new file mode 100644
> >>index 0000000..b8ea910
> >>--- /dev/null
> >>+++ b/hw/ppc/spapr_rtas_ddw.c
> >>@@ -0,0 +1,306 @@
> >>+/*
> >>+ * QEMU sPAPR Dynamic DMA windows support
> >>+ *
> >>+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >>+ *
> >>+ *  This program is free software; you can redistribute it and/or modify
> >>+ *  it under the terms of the GNU General Public License as published by
> >>+ *  the Free Software Foundation; either version 2 of the License,
> >>+ *  or (at your option) any later version.
> >>+ *
> >>+ *  This program is distributed in the hope that it will be useful,
> >>+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>+ *  GNU General Public License for more details.
> >>+ *
> >>+ *  You should have received a copy of the GNU General Public License
> >>+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >>+ */
> >>+
> >>+#include "qemu/osdep.h"
> >>+#include "qemu/error-report.h"
> >>+#include "hw/ppc/spapr.h"
> >>+#include "hw/pci-host/spapr.h"
> >>+#include "trace.h"
> >>+
> >>+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && tcet->enabled) {
> >>+        ++*(unsigned *)opaque;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >>+{
> >>+    unsigned ret = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >>+
> >>+    return ret;
> >>+}
> >>+
> >>+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >>+{
> >>+    sPAPRTCETable *tcet;
> >>+
> >>+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >>+    if (tcet && !tcet->enabled) {
> >>+        *(uint32_t *)opaque = tcet->liobn;
> >>+        return 1;
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >>+{
> >>+    uint32_t liobn = 0;
> >>+
> >>+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >>+
> >>+    return liobn;
> >>+}
> >>+
> >>+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
> >>+                                 uint64_t page_mask)
> >>+{
> >>+    int i, j;
> >>+    uint32_t mask = 0;
> >>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>+        { 12, RTAS_DDW_PGSIZE_4K },
> >>+        { 16, RTAS_DDW_PGSIZE_64K },
> >>+        { 24, RTAS_DDW_PGSIZE_16M },
> >>+        { 25, RTAS_DDW_PGSIZE_32M },
> >>+        { 26, RTAS_DDW_PGSIZE_64M },
> >>+        { 27, RTAS_DDW_PGSIZE_128M },
> >>+        { 28, RTAS_DDW_PGSIZE_256M },
> >>+        { 34, RTAS_DDW_PGSIZE_16G },
> >>+    };
> >>+
> >>+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
> >>+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
> >>+            if ((sps[i].page_shift == masks[j].shift) &&
> >>+                    (page_mask & (1ULL << masks[j].shift))) {
> >>+                mask |= masks[j].mask;
> >>+            }
> >>+        }
> >>+    }
> >
> >Hmm... checking against the list of page sizes supported by the vcpu
> >seems conceptually wrong, although it's probably correct in practice.
> >Is there a way of checking directly against the pagesizes supported by
> >the host IOMMU?
> 
> 
> VFIO_IOMMU_SPAPR_TCE_GET_INFO returns the mask but since
> vfio_container_ioctl() is gone, there is no direct way of knowing it here,
> it is hidden now in hw/vfio/common.c.
> 
> Anyway the host IOMMU always supports 4K|64K|16M. QEMU may or may not use
> huge pages for the guest RAM, this defines whether H_PUT_TCE for 16M page
> succeeds or fails.

Ah, so you need to check against both the host IOMMU supported
pagesizes and the host pagesize backing RAM.  So.. the full set of
pagesizes in the VCPU isn't relevant, just the minimum page size used
to back RAM.

So I think you'll need something inside VFIO that acts as a variant of
kvm_fixup_page_sizes() checking the host supported IOMMU page sizes
against the RAM pagesize.  Then you'll need some interface to check
the guest IOMMU page sizes against that list.
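
Roughly, the filtering step could look like the sketch below: keep only the
host IOMMU page sizes that are no larger than the page size backing guest
RAM. The function name is hypothetical, and treating "no larger than the RAM
page size" as the whole rule is an assumption; the real check may need more
conditions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical filter: host_iommu_mask has bit N set when the host IOMMU
 * supports 2^N byte pages; ram_page_size is the (power of two) size of
 * the host pages backing guest RAM.
 */
static uint64_t usable_iommu_page_sizes(uint64_t host_iommu_mask,
                                        uint64_t ram_page_size)
{
    /* ram_page_size must be a non-zero power of two. */
    assert(ram_page_size != 0 &&
           (ram_page_size & (ram_page_size - 1)) == 0);

    /* All bits up to and including the RAM page size bit. */
    uint64_t up_to_ram = ram_page_size | (ram_page_size - 1);

    return host_iommu_mask & up_to_ram;
}
```

With a host IOMMU supporting 4K|64K|16M, RAM backed by 64K pages leaves
4K|64K, while RAM backed by 16M huge pages leaves all three sizes.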

Patch

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e9d4abf..2473217 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2370,7 +2370,12 @@  DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
  * pseries-2.5
  */
 #define SPAPR_COMPAT_2_5 \
-        HW_COMPAT_2_5
+        HW_COMPAT_2_5 \
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
+        },
 
 static void spapr_machine_2_5_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 8aa2238..e32f71b 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -150,6 +150,15 @@  static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
     return 1ULL << tcet->page_shift;
 }
 
+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->migtable = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
+
 static int spapr_tce_table_post_load(void *opaque, int version_id)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -158,22 +167,39 @@  static int spapr_tce_table_post_load(void *opaque, int version_id)
         spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
     }
 
+    if (tcet->enabled) {
+        if (!tcet->table) {
+            tcet->enabled = false;
+            /* VFIO does not migrate so pass vfio_accel == false */
+            spapr_tce_table_do_enable(tcet, false);
+        }
+        memcpy(tcet->table, tcet->migtable,
+               tcet->nb_table * sizeof(tcet->table[0]));
+        free(tcet->migtable);
+        tcet->migtable = NULL;
+    }
+
     return 0;
 }
 
 static const VMStateDescription vmstate_spapr_tce_table = {
     .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
     .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
     .post_load = spapr_tce_table_post_load,
     .fields      = (VMStateField []) {
         /* Sanity check */
         VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
 
         /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
         VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),
 
         VMSTATE_END_OF_LIST()
     },
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 4c6e687..1bc0710 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,10 +803,10 @@  static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
     return buf;
 }
 
-static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
-                                       uint32_t liobn, uint32_t page_shift,
-                                       uint64_t window_addr,
-                                       uint64_t window_size)
+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                uint32_t liobn, uint32_t page_shift,
+                                uint64_t window_addr,
+                                uint64_t window_size)
 {
     sPAPRTCETable *tcet;
     uint32_t nb_table = window_size >> page_shift;
@@ -820,12 +820,16 @@  static int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
         return -1;
     }
 
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        return -1;
+    }
+
     spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table, false);
 
     return 0;
 }
 
-static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
 {
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
@@ -1418,14 +1422,21 @@  static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* DMA setup */
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_report("No default TCE table for %s", sphb->dtbusname);
-        return;
-    }
+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
+    sphb->dma64_window_size = pow2ceil(ram_size);
 
-    memory_region_add_subregion(&sphb->iommu_root, 0,
-                                spapr_tce_get_iommu(tcet));
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb),
+                                   SPAPR_PCI_LIOBN(sphb->index, i));
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
+    }
 
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
@@ -1443,7 +1454,11 @@  static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
+    int i;
+
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        spapr_phb_dma_window_disable(sphb, SPAPR_PCI_LIOBN(sphb->index, i));
+    }
 
     /* Register default 32bit DMA window */
     spapr_phb_dma_window_enable(sphb, sphb->dma_liobn,
@@ -1481,6 +1496,9 @@  static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1734,6 +1752,15 @@  int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1758,6 +1785,14 @@  int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..b8ea910
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,306 @@ 
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->enabled) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->enabled) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
+                                 uint64_t page_mask)
+{
+    int i, j;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
+        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
+            if ((sps[i].page_shift == masks[j].shift) &&
+                    (page_mask & (1ULL << masks[j].shift))) {
+                mask |= masks[j].mask;
+            }
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    CPUPPCState *env = &cpu->env;
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t avail, addr, pgmask = 0;
+    unsigned current;
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    current = spapr_phb_get_active_win_num(sphb);
+    avail = (sphb->windows_supported > current) ?
+            (sphb->windows_supported - current) : 0;
+
+    /* Work out supported page masks */
+    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum number as if all RAM were in 4K pages.
+     */
+    rtas_st(rets, 2, sphb->dma64_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, sphb->dma64_window_size,
+                                pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid;
+    long ret;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+
+    if (!liobn || !(sphb->page_size_mask & (1ULL << page_shift))) {
+        goto hw_error_exit;
+    }
+
+    if (window_shift < page_shift) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_window_enable(sphb, liobn, page_shift,
+                                      sphb->dma64_window_addr,
+                                      1ULL << window_shift);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift,
+                                 tcet ? tcet->bus_offset : 0xbaadf00d,
+                                 liobn, ret);
+    if (ret || !tcet) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+    long ret;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    ret = spapr_phb_dma_window_disable(sphb, liobn);
+    trace_spapr_iommu_ddw_remove(liobn, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+    long ret = 0;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr, ret);
+    if (ret) {
+        goto hw_error_exit;
+    }
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 42ef1eb..2332f8e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -395,6 +395,39 @@  static void vfio_listener_region_add(VFIOMemoryListener *vlistener,
         giommu->iova_pgsizes = section->mr->iommu_ops->get_page_sizes(section->mr);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
+        if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+            int ret;
+            struct vfio_iommu_spapr_tce_create create = {
+                .argsz = sizeof(create),
+                .page_shift = ctz64(giommu->iova_pgsizes),
+                .window_size = memory_region_size(section->mr),
+                .levels = 0,
+                .start_addr = 0,
+            };
+
+            /*
+             * Dynamic windows are supported, which means that there is no
+             * pre-created window and we have to create one.
+             */
+            if (!create.levels) {
+                unsigned entries = create.window_size >> create.page_shift;
+                unsigned pages = (entries * sizeof(uint64_t)) / getpagesize();
+                /* 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 */
+                create.levels = (ctz64(pow2ceil(pages)) - 1) / 6 + 1;
+            }
+            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+            if (ret) {
+                error_report("Failed to create a window");
+            }
+
+            if (create.start_addr != section->offset_within_address_space) {
+                error_report("vfio: window created at unexpected address");
+            }
+            trace_vfio_spapr_create_window(create.page_shift,
+                                           create.window_size,
+                                           create.start_addr);
+        }
+
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
         giommu->iommu->iommu_ops->vfio_notify(section->mr, true);
         memory_region_iommu_replay(giommu->iommu, &giommu->n,
@@ -500,6 +533,18 @@  static void vfio_listener_region_del(VFIOMemoryListener *vlistener,
                      container, iova, end - iova, ret);
     }
 
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        struct vfio_iommu_spapr_tce_remove remove = {
+            .argsz = sizeof(remove),
+            .start_addr = section->offset_within_address_space,
+        };
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+        if (ret) {
+            error_report("Failed to remove window");
+        }
+
+        trace_vfio_spapr_remove_window(remove.start_addr);
+    }
     if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_notify) {
         iommu->iommu_ops->vfio_notify(section->mr, false);
     }
@@ -792,11 +837,6 @@  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             }
         }
 
-        /*
-         * This only considers the host IOMMU's 32-bit window.  At
-         * some point we need to add support for the optional 64-bit
-         * window and dynamic windows
-         */
         info.argsz = sizeof(info);
         ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
         if (ret) {
@@ -805,7 +845,25 @@  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
         container->min_iova = info.dma32_window_start;
-        container->max_iova = container->min_iova + info.dma32_window_size - 1;
+        container->max_iova = (hwaddr)-1;
+
+        if (v2) {
+            /*
+             * A newly created container has a default window.
+             * To keep region_add/del simple, remove this window now and let
+             * the iommu_listener callbacks create windows when needed.
+             */
+            struct vfio_iommu_spapr_tce_remove remove = {
+                .argsz = sizeof(remove),
+                .start_addr = info.dma32_window_start,
+            };
+            ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
+            if (ret) {
+                error_report("vfio: VFIO_IOMMU_SPAPR_TCE_REMOVE failed: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
+        }
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..855e458 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -71,6 +71,12 @@  struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint32_t windows_supported;
+    uint64_t page_size_mask;
+    uint64_t dma64_window_addr;
+    uint64_t dma64_window_size;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
@@ -89,6 +95,8 @@  struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -148,5 +156,10 @@  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 #endif
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
+int spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+                                uint32_t liobn, uint32_t page_shift,
+                                uint64_t window_addr,
+                                uint64_t window_size);
+int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn);
 
 #endif /* __HW_SPAPR_PCI_H__ */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 505cb3a..4f59d1b 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -417,6 +417,16 @@  int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -458,8 +468,12 @@  int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
@@ -545,6 +559,7 @@  struct sPAPRTCETable {
     uint64_t bus_offset;
     uint32_t page_shift;
     uint64_t *table;
+    uint64_t *migtable;
     bool bypass;
     int fd;
     MemoryRegion root, iommu;
diff --git a/trace-events b/trace-events
index f5335ec..c7314b6 100644
--- a/trace-events
+++ b/trace-events
@@ -1432,6 +1432,10 @@  spapr_iommu_pci_indirect(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t i
 spapr_iommu_pci_stuff(uint64_t liobn, uint64_t ioba, uint64_t tce_value, uint64_t npages, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tcevalue=0x%"PRIx64" npages=%"PRId64" ret=%"PRId64
 spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, unsigned pgsize) "liobn=%"PRIx64" 0x%"PRIx64" -> 0x%"PRIx64" perm=%u mask=%x"
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn, long ret) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_remove(uint32_t liobn, long ret) "liobn=%"PRIx32", ret = %ld"
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr, long ret) "buid=%"PRIx64" addr=%"PRIx32", ret = %ld"
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
@@ -1727,6 +1731,8 @@  vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
 vfio_put_base_device(int fd) "close vdev->fd=%d"
 vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_remove_window(uint64_t off) "offset=%"PRIx64
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"