[qemu,v18,5/5] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

Message ID 1466471645-5396-6-git-send-email-aik@ozlabs.ru (mailing list archive)
State New, archived

Commit Message

Alexey Kardashevskiy June 21, 2016, 1:14 a.m. UTC
This adds support for the Dynamic DMA Windows (DDW) option defined by
the SPAPR specification, which allows having additional DMA window(s).

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.6 machine and older disable it.
This also creates a single DMA window for the older machines to
maintain backward migration.

This implements DDW for PHBs with emulated and VFIO devices. Host
kernel support is required. The advertised IOMMU page sizes are 4K and
64K; 16M pages are supported but not advertised by default. To enable
them, the user has to specify the "pgsz" property for the PHB and
enable huge pages for RAM.
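
As a rough illustration of the page size handling, the sketch below
translates a "pgsz"-style mask (bit N set means pages of 1 << N bytes are
allowed) into the bitmask returned by ibm,query-pe-dma-window. The table
mirrors spapr_page_mask_to_query_mask() from this patch; the function name
ddw_query_mask() and the DDW_* constant names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Page size bits reported by ibm,query-pe-dma-window (values from this patch) */
#define DDW_PGSIZE_4K    0x01
#define DDW_PGSIZE_64K   0x02
#define DDW_PGSIZE_16M   0x04
#define DDW_PGSIZE_32M   0x08
#define DDW_PGSIZE_64M   0x10
#define DDW_PGSIZE_128M  0x20
#define DDW_PGSIZE_256M  0x40
#define DDW_PGSIZE_16G   0x80

/* Hypothetical helper: translate a "pgsz" mask into the query mask */
uint32_t ddw_query_mask(uint64_t page_mask)
{
    static const struct { int shift; uint32_t bit; } map[] = {
        { 12, DDW_PGSIZE_4K },
        { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },
        { 25, DDW_PGSIZE_32M },
        { 26, DDW_PGSIZE_64M },
        { 27, DDW_PGSIZE_128M },
        { 28, DDW_PGSIZE_256M },
        { 34, DDW_PGSIZE_16G },
    };
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); ++i) {
        if (page_mask & (1ULL << map[i].shift)) {
            mask |= map[i].bit;
        }
    }
    return mask;
}
```

With the default mask of (1ULL << 12) | (1ULL << 16) this yields 0x03;
adding bit 24 for 16M pages yields 0x07.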

Existing Linux guests try to create one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM into it. If that
succeeds, the guest switches to dma_direct_ops and never calls TCE
hypercalls (H_PUT_TCE, ...) again. This enables VFIO devices to use the
entire RAM and not waste time on map/unmap later. This adds a
"dma64_win_addr" property, which is the bus address of the 64bit window.
It defaults to 0x800.0000.0000.0000, as this is what modern POWER8
hardware uses, and this allows having emulated and VFIO devices on the
same bus.
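
The arithmetic behind such a window can be sketched as follows. The TCE
count matches what the create handler in this patch passes to
spapr_tce_table_enable(), i.e. 1ULL << (window_shift - page_shift). The
helper names are invented for the example, and rounding the RAM size up to
a power of two is an assumption about how a guest would pick window_shift,
not something this patch dictates.

```c
#include <assert.h>
#include <stdint.h>

/* Default bus address of the 64bit window: 0x800.0000.0000.0000 */
#define DDW_DMA64_WIN_ADDR (1ULL << 59)

/*
 * Hypothetical helper: number of TCEs in a window of (1 << window_shift)
 * bytes that uses IOMMU pages of (1 << page_shift) bytes.
 */
uint64_t ddw_nb_tces(uint32_t page_shift, uint32_t window_shift)
{
    assert(window_shift >= page_shift && window_shift < 64);
    return 1ULL << (window_shift - page_shift);
}

/*
 * Hypothetical helper: smallest window shift whose window covers
 * ram_size bytes (assumes the guest rounds up to a power of two).
 */
uint32_t ddw_window_shift_for(uint64_t ram_size)
{
    uint32_t shift = 0;

    while (shift < 63 && (1ULL << shift) < ram_size) {
        ++shift;
    }
    return shift;
}
```

For a guest with 4GB of RAM this gives a window shift of 32, that is
65536 TCEs with 64K pages or 256 TCEs with 16MB pages.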

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.
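
For reference, the handlers exchange 64bit values with the guest as pairs
of 32bit cells: the BUID arrives in two argument cells and the bus offset
of a created window is returned in two result cells. A minimal standalone
sketch of that packing (the function names are made up here; the patch
itself does this inline with rtas_ld() and rtas_st()):

```c
#include <stdint.h>

/* Join the two 32bit argument cells (high first) into a 64bit BUID */
uint64_t ddw_buid_from_cells(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64bit bus offset into the high and low 32bit result cells */
void ddw_split_offset(uint64_t bus_offset, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(bus_offset >> 32);
    *lo = (uint32_t)(bus_offset & 0xffffffffULL);
}
```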

This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v18:
* fixed a bug where the ddw-create RTAS call always created the window
at the 1<<59 offset
* updated minimum supported machine version
* s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr

v17:
* fixed: "query" returned a non-page-shifted value when memory hotplug is enabled

v16:
* s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
* s/SPAPR_PCI_LIOBN()/dma_liobn[]/

v15:
* moved page mask filtering to PHB realize(), use "-mem-path" to know
if there are huge pages
* fixed error reporting in RTAS handlers
* max window size now accounts for hotpluggable memory boundaries
---
 hw/ppc/Makefile.objs        |   1 +
 hw/ppc/spapr.c              |   7 +-
 hw/ppc/spapr_pci.c          |  77 +++++++++---
 hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h |   8 +-
 include/hw/ppc/spapr.h      |  16 ++-
 trace-events                |   4 +
 7 files changed, 386 insertions(+), 22 deletions(-)
 create mode 100644 hw/ppc/spapr_rtas_ddw.c

Comments

David Gibson June 22, 2016, 2:35 a.m. UTC | #1
On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification which allows to have additional DMA window(s)
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.6 machine and older disable it.
> This also creates a single DMA window for the older machines to
> maintain backward migration.
> 
> This implements DDW for PHB with emulated and VFIO devices. The host
> kernel support is required. The advertised IOMMU page sizes are 4K and
> 64K; 16M pages are supported but not advertised by default, in order to
> enable them, the user has to specify "pgsz" property for PHB and
> enable huge pages for RAM.
> 
> The existing linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

A few queries below.  Not sure if they'll require code changes or just
explanation.

> ---
> Changes:
> v18:
> * fixed bug when ddw-create rtas call was always creating window at 1<<59
> offset
> * update minimum supported machine version
> * s/dma64_window_addr/dma_win_addr/ to match dma_win_addr
> 
> v17:
> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> 
> v16:
> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> 
> v15:
> * moved page mask filtering to PHB realize(), use "-mempath" to know
> if there are huge pages
> * fixed error reporting in RTAS handlers
> * max window size accounts now hotpluggable memory boundaries
> ---
>  hw/ppc/Makefile.objs        |   1 +
>  hw/ppc/spapr.c              |   7 +-
>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |   8 +-
>  include/hw/ppc/spapr.h      |  16 ++-
>  trace-events                |   4 +
>  7 files changed, 386 insertions(+), 22 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 5cc6608..91a3420 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 778fa25..f7cff27 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>   * pseries-2.6
>   */
>  #define SPAPR_COMPAT_2_6 \
> -    HW_COMPAT_2_6
> +    HW_COMPAT_2_6 \
> +    { \
> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +        .property = "ddw",\
> +        .value    = stringify(off),\
> +    },
>  
>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 9f28fb3..0cb51dd 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -35,6 +35,7 @@
>  #include "hw/ppc/spapr.h"
>  #include "hw/pci-host/spapr.h"
>  #include "exec/address-spaces.h"
> +#include "exec/ram_addr.h"
>  #include <libfdt.h>
>  #include "trace.h"
>  #include "qemu/error-report.h"
> @@ -45,6 +46,7 @@
>  #include "hw/ppc/spapr_drc.h"
>  #include "sysemu/device_tree.h"
>  #include "sysemu/kvm.h"
> +#include "sysemu/hostmem.h"
>  
>  #include "hw/vfio/vfio.h"
>  
> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>      int fdt_start_offset = 0, fdt_size;
>  
>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>  
>          spapr_tce_set_need_vfio(tcet, true);

Now that Alex took your notifier on/off patches, can you remove this
chunk?  If it's still necessary, don't you need to loop over all the
possible liobns, rather than just acting on liobn[0]?

>      }
> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      PCIBus *bus;
>      uint64_t msi_window_size = 4096;
>      sPAPRTCETable *tcet;
> +    const unsigned windows_supported =
> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>  
>      if (sphb->index != (uint32_t)-1) {
>          hwaddr windows_base;
>  
> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> +            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>              || (sphb->mem_win_addr != (hwaddr)-1)
>              || (sphb->io_win_addr != (hwaddr)-1)) {
>              error_setg(errp, "Either \"index\" or other parameters must"
> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>  
>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
> +        for (i = 0; i < windows_supported; ++i) {
> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> +        }
>  
>          windows_base = SPAPR_PCI_WINDOW_BASE
>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    if (sphb->dma_liobn == (uint32_t)-1) {
> -        error_setg(errp, "LIOBN not specified for PHB");
> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>          return;
>      }
>  
> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>          }
>      }
>  
> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
> -    if (!tcet) {
> -        error_setg(errp, "Unable to create TCE table for %s",
> -                   sphb->dtbusname);
> -        return;
> +    /* DMA setup */
> +    for (i = 0; i < windows_supported; ++i) {
> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
> +        if (!tcet) {
> +            error_setg(errp, "Creating window#%d failed for %s",
> +                       i, sphb->dtbusname);
> +            return;
> +        }
> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> +                                            spapr_tce_get_iommu(tcet), 0);
>      }
>  
> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
> -                                        spapr_tce_get_iommu(tcet), 0);
> -
>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>  }
>  
> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>  {
> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
> +    int i;
> +    sPAPRTCETable *tcet;
>  
> -    if (tcet && tcet->nb_table) {
> -        spapr_tce_table_disable(tcet);
> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
> +
> +        if (tcet && tcet->nb_table) {
> +            spapr_tce_table_disable(tcet);
> +        }
>      }
>  
>      /* Register default 32bit DMA window */
> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>  }
> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>  static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>                         SPAPR_PCI_MMIO_WIN_SIZE),
> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>      /* Default DMA window is 0..1GB */
>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
> +                       0x800000000000000ULL),
> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> +                       (1ULL << 12) | (1ULL << 16)),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>      .post_load = spapr_pci_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> +        VMSTATE_UNUSED(4), /* dma_liobn */

It's not obvious to me why this change is necessary.

>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      uint32_t interrupt_map_mask[] = {
>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> +    uint32_t ddw_applicable[] = {
> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> +    };
> +    uint32_t ddw_extensions[] = {
> +        cpu_to_be32(1),
> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> +    };
>      sPAPRTCETable *tcet;
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      sPAPRFDT s_fdt;
> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>  
> +    /* Dynamic DMA window */
> +    if (phb->ddw_enabled) {
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> +                         sizeof(ddw_applicable)));
> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> +                         &ddw_extensions, sizeof(ddw_extensions)));
> +    }
> +
>      /* Build the interrupt-map, this must matches what is done
>       * in pci_spapr_map_irq
>       */
> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>  
> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>      if (!tcet) {
>          return -1;
>      }
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> new file mode 100644
> index 0000000..177dcff
> --- /dev/null
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -0,0 +1,295 @@
> +/*
> + * QEMU sPAPR Dynamic DMA windows support
> + *
> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "cpu.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "trace.h"
> +
> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && tcet->nb_table) {
> +        ++*(unsigned *)opaque;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> +{
> +    unsigned ret = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> +
> +    return ret;
> +}
> +
> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> +{
> +    sPAPRTCETable *tcet;
> +
> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> +    if (tcet && !tcet->nb_table) {
> +        *(uint32_t *)opaque = tcet->liobn;
> +        return 1;
> +    }
> +    return 0;
> +}
> +
> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> +{
> +    uint32_t liobn = 0;
> +
> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> +
> +    return liobn;
> +}
> +
> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> +{
> +    int i;
> +    uint32_t mask = 0;
> +    const struct { int shift; uint32_t mask; } masks[] = {
> +        { 12, RTAS_DDW_PGSIZE_4K },
> +        { 16, RTAS_DDW_PGSIZE_64K },
> +        { 24, RTAS_DDW_PGSIZE_16M },
> +        { 25, RTAS_DDW_PGSIZE_32M },
> +        { 26, RTAS_DDW_PGSIZE_64M },
> +        { 27, RTAS_DDW_PGSIZE_128M },
> +        { 28, RTAS_DDW_PGSIZE_256M },
> +        { 34, RTAS_DDW_PGSIZE_16G },
> +    };
> +
> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> +        if (page_mask & (1ULL << masks[i].shift)) {
> +            mask |= masks[i].mask;
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid, max_window_size;
> +    uint32_t avail, addr, pgmask = 0;
> +    MachineState *machine = MACHINE(spapr);
> +
> +    if ((nargs != 3) || (nret != 5)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    /* Translate page mask to LoPAPR format */
> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> +
> +    /*
> +     * This is "Largest contiguous block of TCEs allocated specifically
> +     * for (that is, are reserved for) this PE".
> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> +     */
> +    if (machine->ram_size == machine->maxram_size) {
> +        max_window_size = machine->ram_size;
> +    } else {
> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> +
> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> +    }
> +
> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, avail);
> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> +    rtas_st(rets, 3, pgmask);
> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> +
> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet = NULL;
> +    uint32_t addr, page_shift, window_shift, liobn;
> +    uint64_t buid, win_addr;
> +    int windows;
> +
> +    if ((nargs != 5) || (nret != 4)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    page_shift = rtas_ld(args, 3);
> +    window_shift = rtas_ld(args, 4);
> +    liobn = spapr_phb_get_free_liobn(sphb);
> +    windows = spapr_phb_get_active_win_num(sphb);
> +
> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> +        (window_shift < page_shift)) {
> +        goto param_error_exit;
> +    }
> +
> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
> +        goto hw_error_exit;
> +    }
> +
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto hw_error_exit;
> +    }
> +
> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;

If the guest deletes the default 32-bit window, then requests a really
big 64-bit DMA window, will that work ok with the big window at 0
instead of the usual 64-bit window address?
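
To make the scenario concrete, here is a standalone sketch of the
selection rule in the line above. The struct and function names are
invented for the example; only the conditional is taken from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two sPAPRPHBState fields involved */
struct phb_sketch {
    uint64_t dma_win_addr;    /* default 32-bit window address, 0 by default */
    uint64_t dma64_win_addr;  /* 64-bit window address, 1 << 59 by default */
};

/* Same rule as the patch: the first active window goes to dma_win_addr */
uint64_t ddw_pick_win_addr(const struct phb_sketch *phb,
                           unsigned active_windows)
{
    return (active_windows == 0) ? phb->dma_win_addr : phb->dma64_win_addr;
}
```

With the default window still active the count is 1, so a new window
lands at dma64_win_addr. After the guest removes the default window the
count is 0, so the next created window lands at dma_win_addr, whatever
its size.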

> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
> +                           1ULL << (window_shift - page_shift));
> +    if (!tcet->nb_table) {
> +        goto hw_error_exit;
> +    }
> +
> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    rtas_st(rets, 1, liobn);
> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> +
> +    return;
> +
> +hw_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          uint32_t token, uint32_t nargs,
> +                                          target_ulong args,
> +                                          uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn;
> +
> +    if ((nargs != 1) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    liobn = rtas_ld(args, 0);
> +    tcet = spapr_tce_find_by_liobn(liobn);
> +    if (!tcet) {
> +        goto param_error_exit;
> +    }
> +
> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_tce_table_disable(tcet);
> +    trace_spapr_iommu_ddw_remove(liobn);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         uint32_t token, uint32_t nargs,
> +                                         target_ulong args,
> +                                         uint32_t nret, target_ulong rets)
> +{
> +    sPAPRPHBState *sphb;
> +    uint64_t buid;
> +    uint32_t addr;
> +
> +    if ((nargs != 3) || (nret != 1)) {
> +        goto param_error_exit;
> +    }
> +
> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> +    addr = rtas_ld(args, 0);
> +    sphb = spapr_pci_find_phb(spapr, buid);
> +    if (!sphb || !sphb->ddw_enabled) {
> +        goto param_error_exit;
> +    }
> +
> +    spapr_phb_dma_reset(sphb);
> +    trace_spapr_iommu_ddw_reset(buid, addr);
> +
> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> +
> +    return;
> +
> +param_error_exit:
> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> +}
> +
> +static void spapr_rtas_ddw_init(void)
> +{
> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> +                        "ibm,query-pe-dma-window",
> +                        rtas_ibm_query_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> +                        "ibm,create-pe-dma-window",
> +                        rtas_ibm_create_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> +                        "ibm,remove-pe-dma-window",
> +                        rtas_ibm_remove_pe_dma_window);
> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> +                        "ibm,reset-pe-dma-window",
> +                        rtas_ibm_reset_pe_dma_window);
> +}
> +
> +type_init(spapr_rtas_ddw_init)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 7848366..92aa610 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -32,6 +32,8 @@
>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>  
> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> +
>  typedef struct sPAPRPHBState sPAPRPHBState;
>  
>  typedef struct spapr_pci_msi {
> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>      MemoryRegion memwindow, iowindow, msiwindow;
>  
> -    uint32_t dma_liobn;
> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>      hwaddr dma_win_addr, dma_win_size;
>      AddressSpace iommu_as;
>      MemoryRegion iommu_root;
> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>      spapr_pci_msi_mig *msi_devs;
>  
>      QLIST_ENTRY(sPAPRPHBState) list;
> +
> +    bool ddw_enabled;
> +    uint64_t page_size_mask;
> +    uint64_t dma64_win_addr;
>  };
>  
>  #define SPAPR_PCI_MAX_INDEX          255
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index e1f8274..36d1748 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>  
> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> +#define RTAS_DDW_PGSIZE_4K       0x01
> +#define RTAS_DDW_PGSIZE_64K      0x02
> +#define RTAS_DDW_PGSIZE_16M      0x04
> +#define RTAS_DDW_PGSIZE_32M      0x08
> +#define RTAS_DDW_PGSIZE_64M      0x10
> +#define RTAS_DDW_PGSIZE_128M     0x20
> +#define RTAS_DDW_PGSIZE_256M     0x40
> +#define RTAS_DDW_PGSIZE_16G      0x80
> +
>  /* RTAS tokens */
>  #define RTAS_TOKEN_BASE      0x2000
>  
> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>  
> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>  
>  /* RTAS ibm,get-system-parameter token values */
>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> diff --git a/trace-events b/trace-events
> index 7e94d92..5b52634 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>  
>  # hw/ppc/ppc.c
>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
Alexey Kardashevskiy June 22, 2016, 3:23 a.m. UTC | #2
On 22/06/16 12:35, David Gibson wrote:
> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>> This adds support for Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification which allows to have additional DMA window(s)
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.6 machine and older disable it.
>> This also creates a single DMA window for the older machines to
>> maintain backward migration.
>>
>> This implements DDW for PHB with emulated and VFIO devices. The host
>> kernel support is required. The advertised IOMMU page sizes are 4K and
>> 64K; 16M pages are supported but not advertised by default, in order to
>> enable them, the user has to specify "pgsz" property for PHB and
>> enable huge pages for RAM.
>>
>> The existing linux guests try creating one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> A few queries below.  Not sure if they'll require code changes or just
> explanation.
> 
>> ---
>> Changes:
>> v18:
>> * fixed bug when ddw-create rtas call was always creating window at 1<<59
>> offset
>> * update minimum supported machine version
>> * s/dma64_window_addr/dma_win_addr/ to match dma_win_addr
>>
>> v17:
>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>
>> v16:
>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>
>> v15:
>> * moved page mask filtering to PHB realize(), use "-mempath" to know
>> if there are huge pages
>> * fixed error reporting in RTAS handlers
>> * max window size accounts now hotpluggable memory boundaries
>> ---
>>  hw/ppc/Makefile.objs        |   1 +
>>  hw/ppc/spapr.c              |   7 +-
>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/pci-host/spapr.h |   8 +-
>>  include/hw/ppc/spapr.h      |  16 ++-
>>  trace-events                |   4 +
>>  7 files changed, 386 insertions(+), 22 deletions(-)
>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 5cc6608..91a3420 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>  obj-y += spapr_pci_vfio.o
>>  endif
>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>  # PowerPC 4xx boards
>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>  obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 778fa25..f7cff27 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>>   * pseries-2.6
>>   */
>>  #define SPAPR_COMPAT_2_6 \
>> -    HW_COMPAT_2_6
>> +    HW_COMPAT_2_6 \
>> +    { \
>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> +        .property = "ddw",\
>> +        .value    = stringify(off),\
>> +    },
>>  
>>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>>  {
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 9f28fb3..0cb51dd 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -35,6 +35,7 @@
>>  #include "hw/ppc/spapr.h"
>>  #include "hw/pci-host/spapr.h"
>>  #include "exec/address-spaces.h"
>> +#include "exec/ram_addr.h"
>>  #include <libfdt.h>
>>  #include "trace.h"
>>  #include "qemu/error-report.h"
>> @@ -45,6 +46,7 @@
>>  #include "hw/ppc/spapr_drc.h"
>>  #include "sysemu/device_tree.h"
>>  #include "sysemu/kvm.h"
>> +#include "sysemu/hostmem.h"
>>  
>>  #include "hw/vfio/vfio.h"
>>  
>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>      int fdt_start_offset = 0, fdt_size;
>>  
>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>  
>>          spapr_tce_set_need_vfio(tcet, true);
> 
> Now that Alex took your notifier on/off patches, can you remove this
> chunk? 

It will stop compiling as dma_liobn is an array now.


> If it's still necessary, don't you need to loop over all the
> possible liobns, rather than just acting on liobn[0]?

Ah, right. Forgot about it. That was the reason why I wanted those notifier
callbacks in this series, lost it in respins. I do need a loop here which
I'll have to remove soon though.
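
For illustration, such a loop could look roughly like the sketch below. It is standalone and uses stand-in types: phb_sketch, tce_table_sketch, find_by_liobn_sketch and set_need_vfio_all are hypothetical names, not the QEMU structures or helpers, and the real code would call spapr_tce_find_by_liobn() and spapr_tce_set_need_vfio() instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2  /* mirrors SPAPR_PCI_DMA_MAX_WINDOWS in the patch */

/* Hypothetical stand-ins for sPAPRTCETable and sPAPRPHBState. */
struct tce_table_sketch {
    uint32_t liobn;
    bool need_vfio;
};

struct phb_sketch {
    uint32_t dma_liobn[MAX_WINDOWS];
    struct tce_table_sketch *tables[MAX_WINDOWS];
};

/* Stand-in lookup: returns NULL when no table has the given LIOBN. */
static struct tce_table_sketch *find_by_liobn_sketch(struct phb_sketch *phb,
                                                     uint32_t liobn)
{
    for (size_t i = 0; i < MAX_WINDOWS; ++i) {
        if (phb->tables[i] && phb->tables[i]->liobn == liobn) {
            return phb->tables[i];
        }
    }
    return NULL;
}

/* Mark every window's table, not just window 0; returns how many were set. */
static unsigned set_need_vfio_all(struct phb_sketch *phb, bool need)
{
    unsigned marked = 0;

    for (size_t i = 0; i < MAX_WINDOWS; ++i) {
        struct tce_table_sketch *tcet =
            find_by_liobn_sketch(phb, phb->dma_liobn[i]);

        if (tcet) {
            tcet->need_vfio = need;
            ++marked;
        }
    }
    return marked;
}
```

The NULL check matters in this sketch because a PHB with "ddw" disabled would only have a table for the first LIOBN.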


> 
>>      }
>> @@ -1310,11 +1312,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>      PCIBus *bus;
>>      uint64_t msi_window_size = 4096;
>>      sPAPRTCETable *tcet;
>> +    const unsigned windows_supported =
>> +        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
>>  
>>      if (sphb->index != (uint32_t)-1) {
>>          hwaddr windows_base;
>>  
>> -        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
>> +        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
>> +            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
>>              || (sphb->mem_win_addr != (hwaddr)-1)
>>              || (sphb->io_win_addr != (hwaddr)-1)) {
>>              error_setg(errp, "Either \"index\" or other parameters must"
>> @@ -1329,7 +1334,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>  
>>          sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
>> -        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
>> +        for (i = 0; i < windows_supported; ++i) {
>> +            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
>> +        }
>>  
>>          windows_base = SPAPR_PCI_WINDOW_BASE
>>              + sphb->index * SPAPR_PCI_WINDOW_SPACING;
>> @@ -1342,8 +1349,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>  
>> -    if (sphb->dma_liobn == (uint32_t)-1) {
>> -        error_setg(errp, "LIOBN not specified for PHB");
>> +    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
>> +        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
>> +        error_setg(errp, "LIOBN(s) not specified for PHB");
>>          return;
>>      }
>>  
>> @@ -1461,16 +1469,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>          }
>>      }
>>  
>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>> -    if (!tcet) {
>> -        error_setg(errp, "Unable to create TCE table for %s",
>> -                   sphb->dtbusname);
>> -        return;
>> +    /* DMA setup */
>> +    for (i = 0; i < windows_supported; ++i) {
>> +        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
>> +        if (!tcet) {
>> +            error_setg(errp, "Creating window#%d failed for %s",
>> +                       i, sphb->dtbusname);
>> +            return;
>> +        }
>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> +                                            spapr_tce_get_iommu(tcet), 0);
>>      }
>>  
>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>> -                                        spapr_tce_get_iommu(tcet), 0);
>> -
>>      sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>  }
>>  
>> @@ -1487,13 +1497,19 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>  
>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>  {
>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>> +    int i;
>> +    sPAPRTCETable *tcet;
>>  
>> -    if (tcet && tcet->nb_table) {
>> -        spapr_tce_table_disable(tcet);
>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
>> +
>> +        if (tcet && tcet->nb_table) {
>> +            spapr_tce_table_disable(tcet);
>> +        }
>>      }
>>  
>>      /* Register default 32bit DMA window */
>> +    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
>>      spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
>>                             sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
>>  }
>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>  static Property spapr_phb_properties[] = {
>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>      /* Default DMA window is 0..1GB */
>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>> +                       0x800000000000000ULL),
>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>> +                       (1ULL << 12) | (1ULL << 16)),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>      .post_load = spapr_pci_post_load,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>> +        VMSTATE_UNUSED(4), /* dma_liobn */
> 
> It's not obvious to me why this change is necessary.

It is not. But I was touching liobn and this is a proper cleanup which
needs to be done anyway as _EQUAL() macros are sort of deprecated and
rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
it (leaving some inconsistency)?



> 
>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      uint32_t interrupt_map_mask[] = {
>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>> +    uint32_t ddw_applicable[] = {
>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>> +    };
>> +    uint32_t ddw_extensions[] = {
>> +        cpu_to_be32(1),
>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>> +    };
>>      sPAPRTCETable *tcet;
>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>      sPAPRFDT s_fdt;
>> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>  
>> +    /* Dynamic DMA window */
>> +    if (phb->ddw_enabled) {
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>> +                         sizeof(ddw_applicable)));
>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>> +    }
>> +
>>      /* Build the interrupt-map, this must matches what is done
>>       * in pci_spapr_map_irq
>>       */
>> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>                       sizeof(interrupt_map)));
>>  
>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>      if (!tcet) {
>>          return -1;
>>      }
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> new file mode 100644
>> index 0000000..177dcff
>> --- /dev/null
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -0,0 +1,295 @@
>> +/*
>> + * QEMU sPAPR Dynamic DMA windows support
>> + *
>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "cpu.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "trace.h"
>> +
>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && tcet->nb_table) {
>> +        ++*(unsigned *)opaque;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>> +{
>> +    unsigned ret = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>> +{
>> +    sPAPRTCETable *tcet;
>> +
>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>> +    if (tcet && !tcet->nb_table) {
>> +        *(uint32_t *)opaque = tcet->liobn;
>> +        return 1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>> +{
>> +    uint32_t liobn = 0;
>> +
>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>> +
>> +    return liobn;
>> +}
>> +
>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>> +{
>> +    int i;
>> +    uint32_t mask = 0;
>> +    const struct { int shift; uint32_t mask; } masks[] = {
>> +        { 12, RTAS_DDW_PGSIZE_4K },
>> +        { 16, RTAS_DDW_PGSIZE_64K },
>> +        { 24, RTAS_DDW_PGSIZE_16M },
>> +        { 25, RTAS_DDW_PGSIZE_32M },
>> +        { 26, RTAS_DDW_PGSIZE_64M },
>> +        { 27, RTAS_DDW_PGSIZE_128M },
>> +        { 28, RTAS_DDW_PGSIZE_256M },
>> +        { 34, RTAS_DDW_PGSIZE_16G },
>> +    };
>> +
>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>> +        if (page_mask & (1ULL << masks[i].shift)) {
>> +            mask |= masks[i].mask;
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid, max_window_size;
>> +    uint32_t avail, addr, pgmask = 0;
>> +    MachineState *machine = MACHINE(spapr);
>> +
>> +    if ((nargs != 3) || (nret != 5)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    /* Translate page mask to LoPAPR format */
>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>> +
>> +    /*
>> +     * This is "Largest contiguous block of TCEs allocated specifically
>> +     * for (that is, are reserved for) this PE".
>> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
>> +     */
>> +    if (machine->ram_size == machine->maxram_size) {
>> +        max_window_size = machine->ram_size;
>> +    } else {
>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>> +
>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>> +    }
>> +
>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, avail);
>> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>> +    rtas_st(rets, 3, pgmask);
>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>> +
>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet = NULL;
>> +    uint32_t addr, page_shift, window_shift, liobn;
>> +    uint64_t buid, win_addr;
>> +    int windows;
>> +
>> +    if ((nargs != 5) || (nret != 4)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    page_shift = rtas_ld(args, 3);
>> +    window_shift = rtas_ld(args, 4);
>> +    liobn = spapr_phb_get_free_liobn(sphb);
>> +    windows = spapr_phb_get_active_win_num(sphb);
>> +
>> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
>> +        (window_shift < page_shift)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
> 
> If the guest deletes the default 32-bit window, then requests a really
> big 64-bit DMA window, will that work ok with the big window at 0
> instead of the usual 64-bit window address?


There is no valid guest that tries that, as they all keep the 32bit window.

There was a relatively short period in the v3.0-ish era (sles11 had it and
sles11sp3 did not, if I remember correctly) when the guest would remove all
windows and create one huge window, but for some reason it expected the
window not to start at zero (perhaps a pHyp implementation detail), so it
would fail. As an experiment I removed that particular check and it worked
just fine.

Today guests always keep a 32bit window as the platform cannot tell if all
the drivers on a specific PHB will request 64bit DMA.
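
As a rough standalone sketch of the behaviour being discussed: the create handler picks the bus address purely from the number of currently active windows, so a window created after the guest has removed everything lands at dma_win_addr (0 by default), and only a second window goes to dma64_win_addr. The helper names ddw_pick_win_addr and ddw_nb_tces are made up for illustration; the expressions follow the quoted handler.

```c
#include <stdint.h>

/*
 * Mirrors the selection in rtas_ibm_create_pe_dma_window(): the first
 * window created on a PHB with no active windows goes to dma_win_addr,
 * any later one to dma64_win_addr. Names here are illustrative only.
 */
static uint64_t ddw_pick_win_addr(unsigned active_windows,
                                  uint64_t dma_win_addr,
                                  uint64_t dma64_win_addr)
{
    return active_windows == 0 ? dma_win_addr : dma64_win_addr;
}

/*
 * Number of TCEs for a window of 2^window_shift bytes with pages of
 * 2^page_shift bytes. The caller must ensure window_shift >= page_shift,
 * which the handler checks before getting this far.
 */
static uint64_t ddw_nb_tces(uint32_t page_shift, uint32_t window_shift)
{
    return 1ULL << (window_shift - page_shift);
}
```

So a guest that keeps the default window gets its huge window at the 64bit address, while one that removed all windows first would get it at the 32bit window address.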




>> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
>> +                           1ULL << (window_shift - page_shift));
>> +    if (!tcet->nb_table) {
>> +        goto hw_error_exit;
>> +    }
>> +
>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    rtas_st(rets, 1, liobn);
>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>> +
>> +    return;
>> +
>> +hw_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          uint32_t token, uint32_t nargs,
>> +                                          target_ulong args,
>> +                                          uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn;
>> +
>> +    if ((nargs != 1) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    liobn = rtas_ld(args, 0);
>> +    tcet = spapr_tce_find_by_liobn(liobn);
>> +    if (!tcet) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_tce_table_disable(tcet);
>> +    trace_spapr_iommu_ddw_remove(liobn);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         uint32_t token, uint32_t nargs,
>> +                                         target_ulong args,
>> +                                         uint32_t nret, target_ulong rets)
>> +{
>> +    sPAPRPHBState *sphb;
>> +    uint64_t buid;
>> +    uint32_t addr;
>> +
>> +    if ((nargs != 3) || (nret != 1)) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>> +    addr = rtas_ld(args, 0);
>> +    sphb = spapr_pci_find_phb(spapr, buid);
>> +    if (!sphb || !sphb->ddw_enabled) {
>> +        goto param_error_exit;
>> +    }
>> +
>> +    spapr_phb_dma_reset(sphb);
>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>> +
>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>> +
>> +    return;
>> +
>> +param_error_exit:
>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>> +}
>> +
>> +static void spapr_rtas_ddw_init(void)
>> +{
>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>> +                        "ibm,query-pe-dma-window",
>> +                        rtas_ibm_query_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>> +                        "ibm,create-pe-dma-window",
>> +                        rtas_ibm_create_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>> +                        "ibm,remove-pe-dma-window",
>> +                        rtas_ibm_remove_pe_dma_window);
>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>> +                        "ibm,reset-pe-dma-window",
>> +                        rtas_ibm_reset_pe_dma_window);
>> +}
>> +
>> +type_init(spapr_rtas_ddw_init)
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 7848366..92aa610 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -32,6 +32,8 @@
>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>  
>> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
>> +
>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>  
>>  typedef struct spapr_pci_msi {
>> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>      MemoryRegion memwindow, iowindow, msiwindow;
>>  
>> -    uint32_t dma_liobn;
>> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>>      hwaddr dma_win_addr, dma_win_size;
>>      AddressSpace iommu_as;
>>      MemoryRegion iommu_root;
>> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>>      spapr_pci_msi_mig *msi_devs;
>>  
>>      QLIST_ENTRY(sPAPRPHBState) list;
>> +
>> +    bool ddw_enabled;
>> +    uint64_t page_size_mask;
>> +    uint64_t dma64_win_addr;
>>  };
>>  
>>  #define SPAPR_PCI_MAX_INDEX          255
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index e1f8274..36d1748 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>>  
>> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
>> +#define RTAS_DDW_PGSIZE_4K       0x01
>> +#define RTAS_DDW_PGSIZE_64K      0x02
>> +#define RTAS_DDW_PGSIZE_16M      0x04
>> +#define RTAS_DDW_PGSIZE_32M      0x08
>> +#define RTAS_DDW_PGSIZE_64M      0x10
>> +#define RTAS_DDW_PGSIZE_128M     0x20
>> +#define RTAS_DDW_PGSIZE_256M     0x40
>> +#define RTAS_DDW_PGSIZE_16G      0x80
>> +
>>  /* RTAS tokens */
>>  #define RTAS_TOKEN_BASE      0x2000
>>  
>> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
>> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
>> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
>> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>>  
>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>>  
>>  /* RTAS ibm,get-system-parameter token values */
>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>> diff --git a/trace-events b/trace-events
>> index 7e94d92..5b52634 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
>> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
>> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
>> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>>  
>>  # hw/ppc/ppc.c
>>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
>
David Gibson June 22, 2016, 7:01 a.m. UTC | #3
On Wed, Jun 22, 2016 at 01:23:51PM +1000, Alexey Kardashevskiy wrote:
> On 22/06/16 12:35, David Gibson wrote:
> > On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
> >> This adds support for Dynamic DMA Windows (DDW) option defined by
> >> the SPAPR specification which allows to have additional DMA window(s)
> >>
> >> The "ddw" property is enabled by default on a PHB but for compatibility
> >> the pseries-2.6 machine and older disable it.
> >> This also creates a single DMA window for the older machines to
> >> maintain backward migration.
> >>
> >> This implements DDW for PHB with emulated and VFIO devices. The host
> >> kernel support is required. The advertised IOMMU page sizes are 4K and
> >> 64K; 16M pages are supported but not advertised by default, in order to
> >> enable them, the user has to specify "pgsz" property for PHB and
> >> enable huge pages for RAM.
> >>
> >> Existing Linux guests try to create one additional huge DMA window
> >> with 64K or 16MB pages and map the entire guest RAM into it. If that succeeds,
> >> the guest switches to dma_direct_ops and never calls TCE hypercalls
> >> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >> property which is a bus address for the 64bit window and by default
> >> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
> >> uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >> This adds 4 RTAS handlers:
> >> * ibm,query-pe-dma-window
> >> * ibm,create-pe-dma-window
> >> * ibm,remove-pe-dma-window
> >> * ibm,reset-pe-dma-window
> >> These are registered from type_init() callback.
> >>
> >> These RTAS handlers are implemented in a separate file to avoid polluting
> >> spapr_iommu.c with PCI.
> >>
> >> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > A few queries below.  Not sure if they'll require code changes or just
> > explanation.
> > 
> >> ---
> >> Changes:
> >> v18:
> >> * fixed bug when ddw-create rtas call was always creating window at 1<<59
> >> offset
> >> * update minimum supported machine version
> >> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
> >>
> >> v17:
> >> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
> >>
> >> v16:
> >> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
> >> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
> >>
> >> v15:
> >> * moved page mask filtering to PHB realize(), use "-mempath" to know
> >> if there are huge pages
> >> * fixed error reporting in RTAS handlers
> >> * max window size accounts now hotpluggable memory boundaries
> >> ---
> >>  hw/ppc/Makefile.objs        |   1 +
> >>  hw/ppc/spapr.c              |   7 +-
> >>  hw/ppc/spapr_pci.c          |  77 +++++++++---
> >>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/pci-host/spapr.h |   8 +-
> >>  include/hw/ppc/spapr.h      |  16 ++-
> >>  trace-events                |   4 +
> >>  7 files changed, 386 insertions(+), 22 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index 5cc6608..91a3420 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index 778fa25..f7cff27 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
> >>   * pseries-2.6
> >>   */
> >>  #define SPAPR_COMPAT_2_6 \
> >> -    HW_COMPAT_2_6
> >> +    HW_COMPAT_2_6 \
> >> +    { \
> >> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >> +        .property = "ddw",\
> >> +        .value    = stringify(off),\
> >> +    },
> >>  
> >>  static void spapr_machine_2_6_instance_options(MachineState *machine)
> >>  {
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 9f28fb3..0cb51dd 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -35,6 +35,7 @@
> >>  #include "hw/ppc/spapr.h"
> >>  #include "hw/pci-host/spapr.h"
> >>  #include "exec/address-spaces.h"
> >> +#include "exec/ram_addr.h"
> >>  #include <libfdt.h>
> >>  #include "trace.h"
> >>  #include "qemu/error-report.h"
> >> @@ -45,6 +46,7 @@
> >>  #include "hw/ppc/spapr_drc.h"
> >>  #include "sysemu/device_tree.h"
> >>  #include "sysemu/kvm.h"
> >> +#include "sysemu/hostmem.h"
> >>  
> >>  #include "hw/vfio/vfio.h"
> >>  
> >> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >>      int fdt_start_offset = 0, fdt_size;
> >>  
> >>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> >> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>  
> >>          spapr_tce_set_need_vfio(tcet, true);
> > 
> > Now that Alex took your notifier on/off patches, can you remove this
> > chunk? 
> 
> It will stop compiling as dma_liobn is an array now.

Sorry, I wasn't clear.  I meant remove this whole if statement, not
just remove this hunk of the patch.

> > If it's still necessary, don't you need to loop over all the
> > possible liobns, rather than just acting on liobn[0]?
> 
> Ah, right. Forgot about it. That was the reason why I wanted those notifier
> callbacks in this series, lost it in respins. I do need a loop here which
> I'll have to remove soon though.

Ok.

> >> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
> >>      /* Default DMA window is 0..1GB */
> >>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
> >>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
> >> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
> >> +                       0x800000000000000ULL),
> >> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >> +                       (1ULL << 12) | (1ULL << 16)),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
> >>      .post_load = spapr_pci_post_load,
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
> >> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
> >> +        VMSTATE_UNUSED(4), /* dma_liobn */
> > 
> > It's not obvious to me why this change is necessary.
> 
> It is not. But I was touching liobn and this is a proper cleanup which
> needs to be done anyway as _EQUAL() macros are sort of deprecated and
> rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
> I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
> it (leaving some inconsistency)?

Ah, ok, I see your point.  Yeah, I guess we can drop it.
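
To illustrate why replacing the field with a 4-byte placeholder keeps the
stream layout intact, here is a minimal sketch with a mock big-endian
reader. It assumes the placeholder simply skips the bytes on load;
MockStream, MockPHBState, skip_unused() and load_phb() are hypothetical
names, not QEMU APIs, and only two of the real fields are modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read cursor over a migration stream. */
typedef struct MockStream {
    const uint8_t *buf;
    size_t pos;
} MockStream;

/* Hypothetical subset of the PHB state that is still loaded. */
typedef struct MockPHBState {
    uint64_t buid;
    uint64_t mem_win_addr;
} MockPHBState;

/* Read a 64-bit big-endian value and advance the cursor. */
static uint64_t get_be64(MockStream *s)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | s->buf[s->pos++];
    }
    return v;
}

/* Skip bytes that the sender still emits but the receiver ignores. */
static void skip_unused(MockStream *s, size_t n)
{
    s->pos += n;
}

/* Field order mirrors the old layout: buid, dma_liobn, mem_win_addr. */
static void load_phb(MockStream *s, MockPHBState *st)
{
    st->buid = get_be64(s);
    skip_unused(s, 4);              /* formerly the 32-bit dma_liobn */
    st->mem_win_addr = get_be64(s);
}
```

Because the four bytes are still consumed, the fields that follow are read
from the same offsets as before, so no version bump is needed for the
layout itself.
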

> >>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
> >>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> >> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      uint32_t interrupt_map_mask[] = {
> >>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
> >>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
> >> +    uint32_t ddw_applicable[] = {
> >> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
> >> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
> >> +    };
> >> +    uint32_t ddw_extensions[] = {
> >> +        cpu_to_be32(1),
> >> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >> +    };
> >>      sPAPRTCETable *tcet;
> >>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >>      sPAPRFDT s_fdt;
> >> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
> >>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
> >>  
> >> +    /* Dynamic DMA window */
> >> +    if (phb->ddw_enabled) {
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
> >> +                         sizeof(ddw_applicable)));
> >> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
> >> +                         &ddw_extensions, sizeof(ddw_extensions)));
> >> +    }
> >> +
> >>      /* Build the interrupt-map, this must matches what is done
> >>       * in pci_spapr_map_irq
> >>       */
> >> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
> >>                       sizeof(interrupt_map)));
> >>  
> >> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> >> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
> >>      if (!tcet) {
> >>          return -1;
> >>      }
> >> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> >> new file mode 100644
> >> index 0000000..177dcff
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_rtas_ddw.c
> >> @@ -0,0 +1,295 @@
> >> +/*
> >> + * QEMU sPAPR Dynamic DMA windows support
> >> + *
> >> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
> >> + *
> >> + *  This program is free software; you can redistribute it and/or modify
> >> + *  it under the terms of the GNU General Public License as published by
> >> + *  the Free Software Foundation; either version 2 of the License,
> >> + *  or (at your option) any later version.
> >> + *
> >> + *  This program is distributed in the hope that it will be useful,
> >> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + *  GNU General Public License for more details.
> >> + *
> >> + *  You should have received a copy of the GNU General Public License
> >> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "cpu.h"
> >> +#include "qemu/error-report.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/pci-host/spapr.h"
> >> +#include "trace.h"
> >> +
> >> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && tcet->nb_table) {
> >> +        ++*(unsigned *)opaque;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
> >> +{
> >> +    unsigned ret = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
> >> +{
> >> +    sPAPRTCETable *tcet;
> >> +
> >> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
> >> +    if (tcet && !tcet->nb_table) {
> >> +        *(uint32_t *)opaque = tcet->liobn;
> >> +        return 1;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
> >> +{
> >> +    uint32_t liobn = 0;
> >> +
> >> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
> >> +
> >> +    return liobn;
> >> +}
> >> +
> >> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
> >> +{
> >> +    int i;
> >> +    uint32_t mask = 0;
> >> +    const struct { int shift; uint32_t mask; } masks[] = {
> >> +        { 12, RTAS_DDW_PGSIZE_4K },
> >> +        { 16, RTAS_DDW_PGSIZE_64K },
> >> +        { 24, RTAS_DDW_PGSIZE_16M },
> >> +        { 25, RTAS_DDW_PGSIZE_32M },
> >> +        { 26, RTAS_DDW_PGSIZE_64M },
> >> +        { 27, RTAS_DDW_PGSIZE_128M },
> >> +        { 28, RTAS_DDW_PGSIZE_256M },
> >> +        { 34, RTAS_DDW_PGSIZE_16G },
> >> +    };
> >> +
> >> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
> >> +        if (page_mask & (1ULL << masks[i].shift)) {
> >> +            mask |= masks[i].mask;
> >> +        }
> >> +    }
> >> +
> >> +    return mask;
> >> +}
> >> +
> >> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid, max_window_size;
> >> +    uint32_t avail, addr, pgmask = 0;
> >> +    MachineState *machine = MACHINE(spapr);
> >> +
> >> +    if ((nargs != 3) || (nret != 5)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    /* Translate page mask to LoPAPR format */
> >> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
> >> +
> >> +    /*
> >> +     * This is "Largest contiguous block of TCEs allocated specifically
> >> +     * for (that is, are reserved for) this PE".
> >> +     * Return the maximum number as maximum supported RAM size was in 4K pages.
> >> +     */
> >> +    if (machine->ram_size == machine->maxram_size) {
> >> +        max_window_size = machine->ram_size;
> >> +    } else {
> >> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
> >> +
> >> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
> >> +    }
> >> +
> >> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, avail);
> >> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
> >> +    rtas_st(rets, 3, pgmask);
> >> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >> +
> >> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet = NULL;
> >> +    uint32_t addr, page_shift, window_shift, liobn;
> >> +    uint64_t buid, win_addr;
> >> +    int windows;
> >> +
> >> +    if ((nargs != 5) || (nret != 4)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    page_shift = rtas_ld(args, 3);
> >> +    window_shift = rtas_ld(args, 4);
> >> +    liobn = spapr_phb_get_free_liobn(sphb);
> >> +    windows = spapr_phb_get_active_win_num(sphb);
> >> +
> >> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
> >> +        (window_shift < page_shift)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    if (!tcet) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
> > 
> > If the guest deletes the default 32-bit window, then requests a really
> > big 64-bit DMA window, will that work ok with the big window at 0
> > instead of the usual 64-bit window address?
> 
> 
> There is no valid guest to try that as they keep 32bit window.

Right, but we should aim to work in general, not just with known
guests.
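
To make the case concrete, the sketch below isolates the address selection
from the hunk above. pick_win_addr() is a hypothetical helper, not part of
the patch; the two default addresses are the property defaults from
spapr_phb_properties[].

```c
#include <stdint.h>

/* Property defaults taken from the patch. */
#define DMA_WIN_ADDR_DEFAULT    0x0ULL
#define DMA64_WIN_ADDR_DEFAULT  0x800000000000000ULL   /* 1ULL << 59 */

/*
 * Same expression as in rtas_ibm_create_pe_dma_window(): the first
 * active window is placed at the 32-bit window address, any further
 * window at the 64-bit window address.
 */
static uint64_t pick_win_addr(unsigned active_windows,
                              uint64_t dma_win_addr,
                              uint64_t dma64_win_addr)
{
    return (active_windows == 0) ? dma_win_addr : dma64_win_addr;
}
```

With the default window still present (one active window), a new window
lands at 1ULL << 59. If the guest has removed the default window first,
the count is zero and the new window, however large, is placed at bus
address 0.
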

> There was a relatively short period of time in v3.0-ish era (sles11 did
> have it and sles11sp3 did not if I remember correctly) when the guest would
> remove all windows and create one huge window but for some reason it
> expected the window not to start from zero (perhaps pHyp implementation
> detail) so it would fail. I did an experiment and removed that particular
> check and it worked just fine.

You mean removed the check for non-zero address from the guest?

> Today guests always keep a 32bit window as the platform cannot tell if all
> the drivers on a specific PHB will request 64bit DMA.
> 
> 
> 
> 
> >> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
> >> +                           1ULL << (window_shift - page_shift));
> >> +    if (!tcet->nb_table) {
> >> +        goto hw_error_exit;
> >> +    }
> >> +
> >> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
> >> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    rtas_st(rets, 1, liobn);
> >> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
> >> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
> >> +
> >> +    return;
> >> +
> >> +hw_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          uint32_t token, uint32_t nargs,
> >> +                                          target_ulong args,
> >> +                                          uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    sPAPRTCETable *tcet;
> >> +    uint32_t liobn;
> >> +
> >> +    if ((nargs != 1) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    liobn = rtas_ld(args, 0);
> >> +    tcet = spapr_tce_find_by_liobn(liobn);
> >> +    if (!tcet) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
> >> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_tce_table_disable(tcet);
> >> +    trace_spapr_iommu_ddw_remove(liobn);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         uint32_t token, uint32_t nargs,
> >> +                                         target_ulong args,
> >> +                                         uint32_t nret, target_ulong rets)
> >> +{
> >> +    sPAPRPHBState *sphb;
> >> +    uint64_t buid;
> >> +    uint32_t addr;
> >> +
> >> +    if ((nargs != 3) || (nret != 1)) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >> +    addr = rtas_ld(args, 0);
> >> +    sphb = spapr_pci_find_phb(spapr, buid);
> >> +    if (!sphb || !sphb->ddw_enabled) {
> >> +        goto param_error_exit;
> >> +    }
> >> +
> >> +    spapr_phb_dma_reset(sphb);
> >> +    trace_spapr_iommu_ddw_reset(buid, addr);
> >> +
> >> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >> +
> >> +    return;
> >> +
> >> +param_error_exit:
> >> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >> +}
> >> +
> >> +static void spapr_rtas_ddw_init(void)
> >> +{
> >> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
> >> +                        "ibm,query-pe-dma-window",
> >> +                        rtas_ibm_query_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
> >> +                        "ibm,create-pe-dma-window",
> >> +                        rtas_ibm_create_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
> >> +                        "ibm,remove-pe-dma-window",
> >> +                        rtas_ibm_remove_pe_dma_window);
> >> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
> >> +                        "ibm,reset-pe-dma-window",
> >> +                        rtas_ibm_reset_pe_dma_window);
> >> +}
> >> +
> >> +type_init(spapr_rtas_ddw_init)
> >> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >> index 7848366..92aa610 100644
> >> --- a/include/hw/pci-host/spapr.h
> >> +++ b/include/hw/pci-host/spapr.h
> >> @@ -32,6 +32,8 @@
> >>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
> >>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
> >>  
> >> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
> >> +
> >>  typedef struct sPAPRPHBState sPAPRPHBState;
> >>  
> >>  typedef struct spapr_pci_msi {
> >> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
> >>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
> >>      MemoryRegion memwindow, iowindow, msiwindow;
> >>  
> >> -    uint32_t dma_liobn;
> >> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
> >>      hwaddr dma_win_addr, dma_win_size;
> >>      AddressSpace iommu_as;
> >>      MemoryRegion iommu_root;
> >> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
> >>      spapr_pci_msi_mig *msi_devs;
> >>  
> >>      QLIST_ENTRY(sPAPRPHBState) list;
> >> +
> >> +    bool ddw_enabled;
> >> +    uint64_t page_size_mask;
> >> +    uint64_t dma64_win_addr;
> >>  };
> >>  
> >>  #define SPAPR_PCI_MAX_INDEX          255
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index e1f8274..36d1748 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
> >>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
> >>  
> >> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
> >> +#define RTAS_DDW_PGSIZE_4K       0x01
> >> +#define RTAS_DDW_PGSIZE_64K      0x02
> >> +#define RTAS_DDW_PGSIZE_16M      0x04
> >> +#define RTAS_DDW_PGSIZE_32M      0x08
> >> +#define RTAS_DDW_PGSIZE_64M      0x10
> >> +#define RTAS_DDW_PGSIZE_128M     0x20
> >> +#define RTAS_DDW_PGSIZE_256M     0x40
> >> +#define RTAS_DDW_PGSIZE_16G      0x80
> >> +
> >>  /* RTAS tokens */
> >>  #define RTAS_TOKEN_BASE      0x2000
> >>  
> >> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
> >>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
> >>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
> >>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
> >> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
> >> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
> >> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
> >>  
> >> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
> >> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
> >>  
> >>  /* RTAS ibm,get-system-parameter token values */
> >>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
> >> diff --git a/trace-events b/trace-events
> >> index 7e94d92..5b52634 100644
> >> --- a/trace-events
> >> +++ b/trace-events
> >> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
> >>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
> >>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
> >> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
> >> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
> >> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
> >> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
> >>  
> >>  # hw/ppc/ppc.c
> >>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
> > 
> 
>
Alexey Kardashevskiy June 22, 2016, 8:26 a.m. UTC | #4
On 22/06/16 17:01, David Gibson wrote:
> On Wed, Jun 22, 2016 at 01:23:51PM +1000, Alexey Kardashevskiy wrote:
>> On 22/06/16 12:35, David Gibson wrote:
>>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification which allows to have additional DMA window(s)
>>>>
>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>> the pseries-2.6 machine and older disable it.
>>>> This also creates a single DMA window for the older machines to
>>>> maintain backward migration.
>>>>
>>>> This implements DDW for PHB with emulated and VFIO devices. The host
>>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>>> 64K; 16M pages are supported but not advertised by default, in order to
>>>> enable them, the user has to specify "pgsz" property for PHB and
>>>> enable huge pages for RAM.
>>>>
>>>> The existing linux guests try creating one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>> property which is a bus address for the 64bit window and by default
>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> A few queries below.  Not sure if they'll require code changes or just
>>> explanation.
>>>
>>>> ---
>>>> Changes:
>>>> v18:
>>>> * fixed bug when ddw-create rtas call was always creating window at 1<<59
>>>> offset
>>>> * update minimum supported machine version
>>>> * s/dma64_window_addr/dma64_win_addr/ to match dma_win_addr
>>>>
>>>> v17:
>>>> * fixed: "query" did return non-page-shifted value when memory hotplug is enabled
>>>>
>>>> v16:
>>>> * s/dma_liobn/dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS]/
>>>> * s/SPAPR_PCI_LIOBN()/dma_liobn[]/
>>>>
>>>> v15:
>>>> * moved page mask filtering to PHB realize(), use "-mempath" to know
>>>> if there are huge pages
>>>> * fixed error reporting in RTAS handlers
>>>> * max window size accounts now hotpluggable memory boundaries
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   1 +
>>>>  hw/ppc/spapr.c              |   7 +-
>>>>  hw/ppc/spapr_pci.c          |  77 +++++++++---
>>>>  hw/ppc/spapr_rtas_ddw.c     | 295 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/pci-host/spapr.h |   8 +-
>>>>  include/hw/ppc/spapr.h      |  16 ++-
>>>>  trace-events                |   4 +
>>>>  7 files changed, 386 insertions(+), 22 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index 5cc6608..91a3420 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -8,6 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>  obj-y += spapr_pci_vfio.o
>>>>  endif
>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>  # PowerPC 4xx boards
>>>>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>  obj-y += ppc4xx_pci.o
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 778fa25..f7cff27 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2485,7 +2485,12 @@ DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
>>>>   * pseries-2.6
>>>>   */
>>>>  #define SPAPR_COMPAT_2_6 \
>>>> -    HW_COMPAT_2_6
>>>> +    HW_COMPAT_2_6 \
>>>> +    { \
>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>> +        .property = "ddw",\
>>>> +        .value    = stringify(off),\
>>>> +    },
>>>>  
>>>>  static void spapr_machine_2_6_instance_options(MachineState *machine)
>>>>  {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 9f28fb3..0cb51dd 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include "hw/ppc/spapr.h"
>>>>  #include "hw/pci-host/spapr.h"
>>>>  #include "exec/address-spaces.h"
>>>> +#include "exec/ram_addr.h"
>>>>  #include <libfdt.h>
>>>>  #include "trace.h"
>>>>  #include "qemu/error-report.h"
>>>> @@ -45,6 +46,7 @@
>>>>  #include "hw/ppc/spapr_drc.h"
>>>>  #include "sysemu/device_tree.h"
>>>>  #include "sysemu/kvm.h"
>>>> +#include "sysemu/hostmem.h"
>>>>  
>>>>  #include "hw/vfio/vfio.h"
>>>>  
>>>> @@ -1088,7 +1090,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>>>      int fdt_start_offset = 0, fdt_size;
>>>>  
>>>>      if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>>>> -        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>>>  
>>>>          spapr_tce_set_need_vfio(tcet, true);
>>>
>>> Now that Alex took your notifier on/off patches, can you remove this
>>> chunk? 
>>
>> It will stop compiling as dma_liobn is an array now.
> 
> Sorry, I wasn't clear.  I meant remove this whole if statement, not
> just remove this hunk of the patch.


Bisectability will suffer then, which we can easily avoid if this patch is
applied on top of these:

vfio, memory: Notify IOMMU about starting/stopping listening
spapr_iommu: Realloc guest visible TCE table when starting/stopping listening

All we need is for Alex to send a pull request, Peter to merge it, and you
to rebase ppc-for-2.7 on top of this :)

> 
>>> If it's still necessary, don't you need to loop over all the
>>> possible liobns, rather than just acting on liobn[0]?
>>
>> Ah, right. Forgot about it. That was the reason why I wanted those notifier
>> callbacks in this series, lost it in respins. I do need a loop here which
>> I'll have to remove soon though.
> 
> Ok.
>>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>>      /* Default DMA window is 0..1GB */
>>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>>      .post_load = spapr_pci_post_load,
>>>>      .fields = (VMStateField[]) {
>>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>>
>>> It's not obvious to me why this change is necessary.
>>
>> It is not. But I was touching liobn and this is a proper cleanup which
>> needs to be done anyway as _EQUAL() macros are sort of deprecated and
>> rather pointless. Since I am adding a new 64bit LIOBN in this patch, should
>> I add it in VMSTATE as 32bit one and bump the vmstate version? Or not add
>> it (leaving some inconsistency)?
> 
> Ah, ok, I see your point.  Yeah, I guess we can drop it.
> 
>>>>          VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
>>>>          VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
>>>>          VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
>>>> @@ -1779,6 +1801,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      uint32_t interrupt_map_mask[] = {
>>>>          cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>      uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>> +    uint32_t ddw_applicable[] = {
>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>> +    };
>>>> +    uint32_t ddw_extensions[] = {
>>>> +        cpu_to_be32(1),
>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>> +    };
>>>>      sPAPRTCETable *tcet;
>>>>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>      sPAPRFDT s_fdt;
>>>> @@ -1803,6 +1834,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>      _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>  
>>>> +    /* Dynamic DMA window */
>>>> +    if (phb->ddw_enabled) {
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>> +                         sizeof(ddw_applicable)));
>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>> +    }
>>>> +
>>>>      /* Build the interrupt-map, this must matches what is done
>>>>       * in pci_spapr_map_irq
>>>>       */
>>>> @@ -1826,7 +1865,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>>>>                       sizeof(interrupt_map)));
>>>>  
>>>> -    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
>>>> +    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
>>>>      if (!tcet) {
>>>>          return -1;
>>>>      }
>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>> new file mode 100644
>>>> index 0000000..177dcff
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>> @@ -0,0 +1,295 @@
>>>> +/*
>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>> + *
>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + *  This program is free software; you can redistribute it and/or modify
>>>> + *  it under the terms of the GNU General Public License as published by
>>>> + *  the Free Software Foundation; either version 2 of the License,
>>>> + *  or (at your option) any later version.
>>>> + *
>>>> + *  This program is distributed in the hope that it will be useful,
>>>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + *  GNU General Public License for more details.
>>>> + *
>>>> + *  You should have received a copy of the GNU General Public License
>>>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "cpu.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "trace.h"
>>>> +
>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && tcet->nb_table) {
>>>> +        ++*(unsigned *)opaque;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>> +{
>>>> +    unsigned ret = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>> +{
>>>> +    sPAPRTCETable *tcet;
>>>> +
>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>> +    if (tcet && !tcet->nb_table) {
>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>> +        return 1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>> +{
>>>> +    uint32_t liobn = 0;
>>>> +
>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>> +
>>>> +    return liobn;
>>>> +}
>>>> +
>>>> +static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
>>>> +{
>>>> +    int i;
>>>> +    uint32_t mask = 0;
>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>> +    };
>>>> +
>>>> +    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
>>>> +        if (page_mask & (1ULL << masks[i].shift)) {
>>>> +            mask |= masks[i].mask;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return mask;
>>>> +}
>>>> +
>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid, max_window_size;
>>>> +    uint32_t avail, addr, pgmask = 0;
>>>> +    MachineState *machine = MACHINE(spapr);
>>>> +
>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    /* Translate page mask to LoPAPR format */
>>>> +    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
>>>> +
>>>> +    /*
>>>> +     * This is "Largest contiguous block of TCEs allocated specifically
>>>> +     * for (that is, are reserved for) this PE".
>>>> +     * Return the number of TCEs needed to map the maximum supported RAM
>>>> +     * size with 4K pages.
>>>> +     */
>>>> +    if (machine->ram_size == machine->maxram_size) {
>>>> +        max_window_size = machine->ram_size;
>>>> +    } else {
>>>> +        MemoryHotplugState *hpms = &spapr->hotplug_memory;
>>>> +
>>>> +        max_window_size = hpms->base + memory_region_size(&hpms->mr);
>>>> +    }
>>>> +
>>>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, avail);
>>>> +    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
>>>> +    rtas_st(rets, 3, pgmask);
>>>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>>> +
>>>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRTCETable *tcet = NULL;
>>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>>> +    uint64_t buid, win_addr;
>>>> +    int windows;
>>>> +
>>>> +    if ((nargs != 5) || (nret != 4)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    page_shift = rtas_ld(args, 3);
>>>> +    window_shift = rtas_ld(args, 4);
>>>> +    liobn = spapr_phb_get_free_liobn(sphb);
>>>> +    windows = spapr_phb_get_active_win_num(sphb);
>>>> +
>>>> +    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
>>>> +        (window_shift < page_shift)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    tcet = spapr_tce_find_by_liobn(liobn);
>>>> +    if (!tcet) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
>>>
>>> If the guest deletes the default 32-bit window, then requests a really
>>> big 64-bit DMA window, will that work ok with the big window at 0
>>> instead of the usual 64-bit window address?
>>
>>
>> There is no existing guest that would try that, as they all keep the
>> 32-bit window.
> 
> Right, but we should aim to work in general, not just with known
> guests.


Well, I did the experiment described below.

The term "in general" is vague, though: if pHyp did something that deviates
from PAPR (and guests therefore came to expect it), which behavior should I
pick for QEMU? For example, the guest did not expect a new window to start
at zero, so there must have been a reason for that: perhaps pHyp can only
allocate a single window, and only at the 1<<59 offset, or perhaps nobody
actually tested it (always a possibility).
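
To make the question concrete, here is a small standalone sketch (not QEMU
code) of the placement rule used by rtas_ibm_create_pe_dma_window() as
posted. The names phb_cfg, pick_win_addr, tce_count and split_bus_offset are
hypothetical; the default addresses are the property defaults from this
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sPAPRPHBState fields used below. */
struct phb_cfg {
    uint64_t dma_win_addr;    /* "dma_win_addr" property, default 0 */
    uint64_t dma64_win_addr;  /* "dma64_win_addr" property, default 1 << 59 */
};

/*
 * Same rule as in the posted handler: the first active window goes to
 * dma_win_addr, any later one to dma64_win_addr, regardless of the
 * requested window size.
 */
static uint64_t pick_win_addr(const struct phb_cfg *phb,
                              unsigned active_windows)
{
    return active_windows == 0 ? phb->dma_win_addr : phb->dma64_win_addr;
}

/* Number of TCEs passed to spapr_tce_table_enable() for the window. */
static uint64_t tce_count(uint32_t page_shift, uint32_t window_shift)
{
    return 1ULL << (window_shift - page_shift);
}

/* Split a 64-bit bus offset into the two 32-bit RTAS return cells. */
static void split_bus_offset(uint64_t off, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(off >> 32);
    *lo = (uint32_t)(off & 0xffffffffULL);
}
```

With the defaults, a guest that removes the 32-bit window and then asks for
a 64GB window with 64K pages (page_shift 16, window_shift 36) would get a
window at bus address 0 with 1 << 20 TCEs. If the 32-bit window is kept, the
same request lands at 1 << 59, which is returned as the cells 0x08000000
and 0.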


>> There was a relatively short period in the v3.0-ish era (sles11 had it and
>> sles11sp3 did not, if I remember correctly) when the guest would remove all
>> windows and create one huge window, but for some reason it expected the
>> window to start at a non-zero address (perhaps a pHyp implementation
>> detail), so it would fail. I did an experiment and removed that particular
>> check, and it worked just fine.
> 
> You mean removed the check for non-zero address from the guest?

Yes, that one.


>> Today guests always keep a 32bit window as the platform cannot tell if all
>> the drivers on a specific PHB will request 64bit DMA.
>>
>>
>>
>>
>>>> +    spapr_tce_table_enable(tcet, page_shift, win_addr,
>>>> +                           1ULL << (window_shift - page_shift));
>>>> +    if (!tcet->nb_table) {
>>>> +        goto hw_error_exit;
>>>> +    }
>>>> +
>>>> +    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
>>>> +                                 1ULL << window_shift, tcet->bus_offset, liobn);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    rtas_st(rets, 1, liobn);
>>>> +    rtas_st(rets, 2, tcet->bus_offset >> 32);
>>>> +    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
>>>> +
>>>> +    return;
>>>> +
>>>> +hw_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          uint32_t token, uint32_t nargs,
>>>> +                                          target_ulong args,
>>>> +                                          uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    sPAPRTCETable *tcet;
>>>> +    uint32_t liobn;
>>>> +
>>>> +    if ((nargs != 1) || (nret != 1)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    liobn = rtas_ld(args, 0);
>>>> +    tcet = spapr_tce_find_by_liobn(liobn);
>>>> +    if (!tcet) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
>>>> +    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    spapr_tce_table_disable(tcet);
>>>> +    trace_spapr_iommu_ddw_remove(liobn);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         uint32_t token, uint32_t nargs,
>>>> +                                         target_ulong args,
>>>> +                                         uint32_t nret, target_ulong rets)
>>>> +{
>>>> +    sPAPRPHBState *sphb;
>>>> +    uint64_t buid;
>>>> +    uint32_t addr;
>>>> +
>>>> +    if ((nargs != 3) || (nret != 1)) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>> +    addr = rtas_ld(args, 0);
>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>> +        goto param_error_exit;
>>>> +    }
>>>> +
>>>> +    spapr_phb_dma_reset(sphb);
>>>> +    trace_spapr_iommu_ddw_reset(buid, addr);
>>>> +
>>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>>> +
>>>> +    return;
>>>> +
>>>> +param_error_exit:
>>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>>> +}
>>>> +
>>>> +static void spapr_rtas_ddw_init(void)
>>>> +{
>>>> +    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
>>>> +                        "ibm,query-pe-dma-window",
>>>> +                        rtas_ibm_query_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
>>>> +                        "ibm,create-pe-dma-window",
>>>> +                        rtas_ibm_create_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
>>>> +                        "ibm,remove-pe-dma-window",
>>>> +                        rtas_ibm_remove_pe_dma_window);
>>>> +    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
>>>> +                        "ibm,reset-pe-dma-window",
>>>> +                        rtas_ibm_reset_pe_dma_window);
>>>> +}
>>>> +
>>>> +type_init(spapr_rtas_ddw_init)
>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>> index 7848366..92aa610 100644
>>>> --- a/include/hw/pci-host/spapr.h
>>>> +++ b/include/hw/pci-host/spapr.h
>>>> @@ -32,6 +32,8 @@
>>>>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>>>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>>>  
>>>> +#define SPAPR_PCI_DMA_MAX_WINDOWS    2
>>>> +
>>>>  typedef struct sPAPRPHBState sPAPRPHBState;
>>>>  
>>>>  typedef struct spapr_pci_msi {
>>>> @@ -56,7 +58,7 @@ struct sPAPRPHBState {
>>>>      hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>>>>      MemoryRegion memwindow, iowindow, msiwindow;
>>>>  
>>>> -    uint32_t dma_liobn;
>>>> +    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
>>>>      hwaddr dma_win_addr, dma_win_size;
>>>>      AddressSpace iommu_as;
>>>>      MemoryRegion iommu_root;
>>>> @@ -71,6 +73,10 @@ struct sPAPRPHBState {
>>>>      spapr_pci_msi_mig *msi_devs;
>>>>  
>>>>      QLIST_ENTRY(sPAPRPHBState) list;
>>>> +
>>>> +    bool ddw_enabled;
>>>> +    uint64_t page_size_mask;
>>>> +    uint64_t dma64_win_addr;
>>>>  };
>>>>  
>>>>  #define SPAPR_PCI_MAX_INDEX          255
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index e1f8274..36d1748 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -416,6 +416,16 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>>  #define RTAS_OUT_NOT_AUTHORIZED                 -9002
>>>>  #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
>>>>  
>>>> +/* DDW pagesize mask values from ibm,query-pe-dma-window */
>>>> +#define RTAS_DDW_PGSIZE_4K       0x01
>>>> +#define RTAS_DDW_PGSIZE_64K      0x02
>>>> +#define RTAS_DDW_PGSIZE_16M      0x04
>>>> +#define RTAS_DDW_PGSIZE_32M      0x08
>>>> +#define RTAS_DDW_PGSIZE_64M      0x10
>>>> +#define RTAS_DDW_PGSIZE_128M     0x20
>>>> +#define RTAS_DDW_PGSIZE_256M     0x40
>>>> +#define RTAS_DDW_PGSIZE_16G      0x80
>>>> +
>>>>  /* RTAS tokens */
>>>>  #define RTAS_TOKEN_BASE      0x2000
>>>>  
>>>> @@ -457,8 +467,12 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>>>>  #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
>>>>  #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
>>>>  #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
>>>> +#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
>>>> +#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
>>>> +#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
>>>> +#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
>>>>  
>>>> -#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
>>>> +#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
>>>>  
>>>>  /* RTAS ibm,get-system-parameter token values */
>>>>  #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
>>>> diff --git a/trace-events b/trace-events
>>>> index 7e94d92..5b52634 100644
>>>> --- a/trace-events
>>>> +++ b/trace-events
>>>> @@ -1435,6 +1435,10 @@ spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
>>>>  spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
>>>>  spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>>>  spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
>>>> +spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
>>>> +spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
>>>> +spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
>>>> +spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
>>>>  
>>>>  # hw/ppc/ppc.c
>>>>  ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
Thomas Huth June 22, 2016, 9:44 a.m. UTC | #5
On 22.06.2016 05:23, Alexey Kardashevskiy wrote:
> On 22/06/16 12:35, David Gibson wrote:
>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>> the SPAPR specification, which allows additional DMA window(s).
>>>
>>> The "ddw" property is enabled by default on a PHB, but for compatibility
>>> the pseries-2.6 machine and older disable it.
>>> This also creates a single DMA window for the older machines to
>>> maintain backward migration.
>>>
>>> This implements DDW for PHBs with emulated and VFIO devices. Host
>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>> 64K; 16M pages are supported but not advertised by default. To enable
>>> them, the user has to specify the "pgsz" property for the PHB and
>>> enable huge pages for RAM.
>>>
>>> Existing Linux guests try to create one additional huge DMA window
>>> with 64K or 16MB pages and map the entire guest RAM into it. If that
>>> succeeds, the guest switches to dma_direct_ops and never calls TCE
>>> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
>>> entire RAM and not waste time on map/unmap later. This adds a
>>> "dma64_win_addr" property, which is the bus address of the 64-bit window
>>> and is set by default to 0x800.0000.0000.0000, as this is what modern
>>> POWER8 hardware uses and it allows emulated and VFIO devices on the
>>> same bus.
>>>
>>> This adds 4 RTAS handlers:
>>> * ibm,query-pe-dma-window
>>> * ibm,create-pe-dma-window
>>> * ibm,remove-pe-dma-window
>>> * ibm,reset-pe-dma-window
>>> These are registered from type_init() callback.
>>>
>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>> spapr_iommu.c with PCI.
>>>
>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[...]
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 9f28fb3..0cb51dd 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
[...]
>>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>  static Property spapr_phb_properties[] = {
>>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>      /* Default DMA window is 0..1GB */
>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>> +                       0x800000000000000ULL),
>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>      DEFINE_PROP_END_OF_LIST(),
>>>  };
>>>  
>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>      .post_load = spapr_pci_post_load,
>>>      .fields = (VMStateField[]) {
>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>
>> It's not obvious to me why this change is necessary.
> 
> It is not. But I was touching liobn and this is a proper cleanup which
> needs to be done anyway as _EQUAL() macros are sort of deprecated and
> rather pointless.

I'm not sure, but if you mark this field as unused now, does backward
migration to an older version of QEMU still work? If not, you might
need to bump the version number, too.

 Thomas
Alexey Kardashevskiy June 23, 2016, 2 a.m. UTC | #6
On 22/06/16 19:44, Thomas Huth wrote:
> On 22.06.2016 05:23, Alexey Kardashevskiy wrote:
>> On 22/06/16 12:35, David Gibson wrote:
>>> On Tue, Jun 21, 2016 at 11:14:05AM +1000, Alexey Kardashevskiy wrote:
>>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>>> the SPAPR specification, which allows additional DMA window(s).
>>>>
>>>> The "ddw" property is enabled by default on a PHB, but for compatibility
>>>> the pseries-2.6 machine and older disable it.
>>>> This also creates a single DMA window for the older machines to
>>>> maintain backward migration.
>>>>
>>>> This implements DDW for PHBs with emulated and VFIO devices. Host
>>>> kernel support is required. The advertised IOMMU page sizes are 4K and
>>>> 64K; 16M pages are supported but not advertised by default. To enable
>>>> them, the user has to specify the "pgsz" property for the PHB and
>>>> enable huge pages for RAM.
>>>>
>>>> Existing Linux guests try to create one additional huge DMA window
>>>> with 64K or 16MB pages and map the entire guest RAM into it. If that
>>>> succeeds, the guest switches to dma_direct_ops and never calls TCE
>>>> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
>>>> entire RAM and not waste time on map/unmap later. This adds a
>>>> "dma64_win_addr" property, which is the bus address of the 64-bit window
>>>> and is set by default to 0x800.0000.0000.0000, as this is what modern
>>>> POWER8 hardware uses and it allows emulated and VFIO devices on the
>>>> same bus.
>>>>
>>>> This adds 4 RTAS handlers:
>>>> * ibm,query-pe-dma-window
>>>> * ibm,create-pe-dma-window
>>>> * ibm,remove-pe-dma-window
>>>> * ibm,reset-pe-dma-window
>>>> These are registered from type_init() callback.
>>>>
>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>> spapr_iommu.c with PCI.
>>>>
>>>> This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> [...]
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 9f28fb3..0cb51dd 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
> [...]
>>>> @@ -1515,7 +1531,8 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>>  static Property spapr_phb_properties[] = {
>>>>      DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
>>>>      DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
>>>> -    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
>>>> +    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
>>>> +    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
>>>>      DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
>>>>      DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
>>>>                         SPAPR_PCI_MMIO_WIN_SIZE),
>>>> @@ -1527,6 +1544,11 @@ static Property spapr_phb_properties[] = {
>>>>      /* Default DMA window is 0..1GB */
>>>>      DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>      DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
>>>> +                       0x800000000000000ULL),
>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>> +                       (1ULL << 12) | (1ULL << 16)),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -1603,7 +1625,7 @@ static const VMStateDescription vmstate_spapr_pci = {
>>>>      .post_load = spapr_pci_post_load,
>>>>      .fields = (VMStateField[]) {
>>>>          VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
>>>> -        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
>>>> +        VMSTATE_UNUSED(4), /* dma_liobn */
>>>
>>> It's not obvious to me why this change is necessary.
>>
>> It is not. But I was touching liobn and this is a proper cleanup which
>> needs to be done anyway as _EQUAL() macros are sort of deprecated and
>> rather pointless.
> 
> I'm not sure, but if you mark this field as unused now, does backward
> migration to an older version of QEMU still work? If not, you might
> need to bump the version number, too.

Oh. Correct, it will fail. So I still need this field here. OK, I will fix
it when I resend.
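
The failure mode can be illustrated with a toy model of the two migration
directions. This is a standalone sketch, not QEMU code: liobn_slot and the
four helpers are hypothetical, and it assumes that a field marked unused is
written as zero bytes on save and skipped on load, while an _EQUAL field is
written from the device state and compared against the local value on load.
The LIOBN value used in the note below is only an example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 4-byte slot standing in for the dma_liobn field in the stream. */
typedef struct { uint8_t bytes[4]; } liobn_slot;

/* Sender with the field marked unused: assumed to emit zero padding. */
static void save_unused(liobn_slot *s)
{
    memset(s->bytes, 0, sizeof(s->bytes));
}

/* Sender with an _EQUAL field: emits the actual value, big-endian. */
static void save_equal(liobn_slot *s, uint32_t liobn)
{
    s->bytes[0] = (uint8_t)(liobn >> 24);
    s->bytes[1] = (uint8_t)(liobn >> 16);
    s->bytes[2] = (uint8_t)(liobn >> 8);
    s->bytes[3] = (uint8_t)liobn;
}

/* Receiver with an _EQUAL field: rejects the stream on a mismatch. */
static int load_equal(const liobn_slot *s, uint32_t expected)
{
    uint32_t v = ((uint32_t)s->bytes[0] << 24) |
                 ((uint32_t)s->bytes[1] << 16) |
                 ((uint32_t)s->bytes[2] << 8) |
                 (uint32_t)s->bytes[3];

    return v == expected ? 0 : -1;
}

/* Receiver with the field marked unused: skips the bytes. */
static int load_unused(const liobn_slot *s)
{
    (void)s;
    return 0;
}
```

In this model, an old sender to a new receiver works (save_equal() followed
by load_unused() returns 0), but a new sender to an old receiver does not:
save_unused() followed by load_equal() with a non-zero LIOBN such as
0x80000000 returns -1, which is the backward migration breakage discussed
above.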

Patch

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 5cc6608..91a3420 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,6 +8,7 @@  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
 obj-y += spapr_pci_vfio.o
 endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 778fa25..f7cff27 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2485,7 +2485,12 @@  DEFINE_SPAPR_MACHINE(2_7, "2.7", true);
  * pseries-2.6
  */
 #define SPAPR_COMPAT_2_6 \
-    HW_COMPAT_2_6
+    HW_COMPAT_2_6 \
+    { \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+        .property = "ddw",\
+        .value    = stringify(off),\
+    },
 
 static void spapr_machine_2_6_instance_options(MachineState *machine)
 {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 9f28fb3..0cb51dd 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -35,6 +35,7 @@ 
 #include "hw/ppc/spapr.h"
 #include "hw/pci-host/spapr.h"
 #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
 #include <libfdt.h>
 #include "trace.h"
 #include "qemu/error-report.h"
@@ -45,6 +46,7 @@ 
 #include "hw/ppc/spapr_drc.h"
 #include "sysemu/device_tree.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hostmem.h"
 
 #include "hw/vfio/vfio.h"
 
@@ -1088,7 +1090,7 @@  static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
     int fdt_start_offset = 0, fdt_size;
 
     if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
-        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
 
         spapr_tce_set_need_vfio(tcet, true);
     }
@@ -1310,11 +1312,14 @@  static void spapr_phb_realize(DeviceState *dev, Error **errp)
     PCIBus *bus;
     uint64_t msi_window_size = 4096;
     sPAPRTCETable *tcet;
+    const unsigned windows_supported =
+        sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
 
     if (sphb->index != (uint32_t)-1) {
         hwaddr windows_base;
 
-        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn != (uint32_t)-1)
+        if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
+            || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
             || (sphb->mem_win_addr != (hwaddr)-1)
             || (sphb->io_win_addr != (hwaddr)-1)) {
             error_setg(errp, "Either \"index\" or other parameters must"
@@ -1329,7 +1334,9 @@  static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
 
         sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
-        sphb->dma_liobn = SPAPR_PCI_LIOBN(sphb->index, 0);
+        for (i = 0; i < windows_supported; ++i) {
+            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
+        }
 
         windows_base = SPAPR_PCI_WINDOW_BASE
             + sphb->index * SPAPR_PCI_WINDOW_SPACING;
@@ -1342,8 +1349,9 @@  static void spapr_phb_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (sphb->dma_liobn == (uint32_t)-1) {
-        error_setg(errp, "LIOBN not specified for PHB");
+    if ((sphb->dma_liobn[0] == (uint32_t)-1) ||
+        ((sphb->dma_liobn[1] == (uint32_t)-1) && (windows_supported > 1))) {
+        error_setg(errp, "LIOBN(s) not specified for PHB");
         return;
     }
 
@@ -1461,16 +1469,18 @@  static void spapr_phb_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
-    if (!tcet) {
-        error_setg(errp, "Unable to create TCE table for %s",
-                   sphb->dtbusname);
-        return;
+    /* DMA setup */
+    for (i = 0; i < windows_supported; ++i) {
+        tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
+        if (!tcet) {
+            error_setg(errp, "Creating window#%d failed for %s",
+                       i, sphb->dtbusname);
+            return;
+        }
+        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
+                                            spapr_tce_get_iommu(tcet), 0);
     }
 
-    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
-                                        spapr_tce_get_iommu(tcet), 0);
-
     sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
 }
 
@@ -1487,13 +1497,19 @@  static int spapr_phb_children_reset(Object *child, void *opaque)
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 {
-    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
+    int i;
+    sPAPRTCETable *tcet;
 
-    if (tcet && tcet->nb_table) {
-        spapr_tce_table_disable(tcet);
+    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+        tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[i]);
+
+        if (tcet && tcet->nb_table) {
+            spapr_tce_table_disable(tcet);
+        }
     }
 
     /* Register default 32bit DMA window */
+    tcet = spapr_tce_find_by_liobn(sphb->dma_liobn[0]);
     spapr_tce_table_enable(tcet, SPAPR_TCE_PAGE_SHIFT, sphb->dma_win_addr,
                            sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT);
 }
@@ -1515,7 +1531,8 @@  static void spapr_phb_reset(DeviceState *qdev)
 static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT32("index", sPAPRPHBState, index, -1),
     DEFINE_PROP_UINT64("buid", sPAPRPHBState, buid, -1),
-    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn, -1),
+    DEFINE_PROP_UINT32("liobn", sPAPRPHBState, dma_liobn[0], -1),
+    DEFINE_PROP_UINT32("liobn64", sPAPRPHBState, dma_liobn[1], -1),
     DEFINE_PROP_UINT64("mem_win_addr", sPAPRPHBState, mem_win_addr, -1),
     DEFINE_PROP_UINT64("mem_win_size", sPAPRPHBState, mem_win_size,
                        SPAPR_PCI_MMIO_WIN_SIZE),
@@ -1527,6 +1544,11 @@  static Property spapr_phb_properties[] = {
     /* Default DMA window is 0..1GB */
     DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
     DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
+    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_win_addr,
+                       0x800000000000000ULL),
+    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
+    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
+                       (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1603,7 +1625,7 @@  static const VMStateDescription vmstate_spapr_pci = {
     .post_load = spapr_pci_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-        VMSTATE_UINT32_EQUAL(dma_liobn, sPAPRPHBState),
+        VMSTATE_UNUSED(4), /* dma_liobn */
         VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
         VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
@@ -1779,6 +1801,15 @@  int spapr_populate_pci_dt(sPAPRPHBState *phb,
     uint32_t interrupt_map_mask[] = {
         cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
     uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
+    uint32_t ddw_applicable[] = {
+        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
+        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
+    };
+    uint32_t ddw_extensions[] = {
+        cpu_to_be32(1),
+        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
+    };
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
@@ -1803,6 +1834,14 @@  int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
     _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
 
+    /* Dynamic DMA window */
+    if (phb->ddw_enabled) {
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
+                         sizeof(ddw_applicable)));
+        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
+                         &ddw_extensions, sizeof(ddw_extensions)));
+    }
+
     /* Build the interrupt-map, this must matches what is done
      * in pci_spapr_map_irq
      */
@@ -1826,7 +1865,7 @@  int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
-    tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
+    tcet = spapr_tce_find_by_liobn(phb->dma_liobn[0]);
     if (!tcet) {
         return -1;
     }
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
new file mode 100644
index 0000000..177dcff
--- /dev/null
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -0,0 +1,295 @@ 
+/*
+ * QEMU sPAPR Dynamic DMA windows support
+ *
+ * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "cpu.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "trace.h"
+
+static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && tcet->nb_table) {
+        ++*(unsigned *)opaque;
+    }
+    return 0;
+}
+
+static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
+{
+    unsigned ret = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
+
+    return ret;
+}
+
+static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
+    if (tcet && !tcet->nb_table) {
+        *(uint32_t *)opaque = tcet->liobn;
+        return 1;
+    }
+    return 0;
+}
+
+static uint32_t spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
+{
+    uint32_t liobn = 0;
+
+    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
+
+    return liobn;
+}
+
+static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
+{
+    int i;
+    uint32_t mask = 0;
+    const struct { int shift; uint32_t mask; } masks[] = {
+        { 12, RTAS_DDW_PGSIZE_4K },
+        { 16, RTAS_DDW_PGSIZE_64K },
+        { 24, RTAS_DDW_PGSIZE_16M },
+        { 25, RTAS_DDW_PGSIZE_32M },
+        { 26, RTAS_DDW_PGSIZE_64M },
+        { 27, RTAS_DDW_PGSIZE_128M },
+        { 28, RTAS_DDW_PGSIZE_256M },
+        { 34, RTAS_DDW_PGSIZE_16G },
+    };
+
+    for (i = 0; i < ARRAY_SIZE(masks); ++i) {
+        if (page_mask & (1ULL << masks[i].shift)) {
+            mask |= masks[i].mask;
+        }
+    }
+
+    return mask;
+}
+
+static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid, max_window_size;
+    uint32_t avail, addr, pgmask = 0;
+    MachineState *machine = MACHINE(spapr);
+
+    if ((nargs != 3) || (nret != 5)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    /* Translate page mask to LoPAPR format */
+    pgmask = spapr_page_mask_to_query_mask(sphb->page_size_mask);
+
+    /*
+     * This is "Largest contiguous block of TCEs allocated specifically
+     * for (that is, are reserved for) this PE".
+     * Return the maximum supported RAM size expressed in 4K pages.
+     */
+    if (machine->ram_size == machine->maxram_size) {
+        max_window_size = machine->ram_size;
+    } else {
+        MemoryHotplugState *hpms = &spapr->hotplug_memory;
+
+        max_window_size = hpms->base + memory_region_size(&hpms->mr);
+    }
+
+    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, avail);
+    rtas_st(rets, 2, max_window_size >> SPAPR_TCE_PAGE_SHIFT);
+    rtas_st(rets, 3, pgmask);
+    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
+
+    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet = NULL;
+    uint32_t addr, page_shift, window_shift, liobn;
+    uint64_t buid, win_addr;
+    int windows;
+
+    if ((nargs != 5) || (nret != 4)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    page_shift = rtas_ld(args, 3);
+    window_shift = rtas_ld(args, 4);
+    liobn = spapr_phb_get_free_liobn(sphb);
+    windows = spapr_phb_get_active_win_num(sphb);
+
+    if (!(sphb->page_size_mask & (1ULL << page_shift)) ||
+        (window_shift < page_shift)) {
+        goto param_error_exit;
+    }
+
+    if (!liobn || !sphb->ddw_enabled || windows == SPAPR_PCI_DMA_MAX_WINDOWS) {
+        goto hw_error_exit;
+    }
+
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto hw_error_exit;
+    }
+
+    win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
+    spapr_tce_table_enable(tcet, page_shift, win_addr,
+                           1ULL << (window_shift - page_shift));
+    if (!tcet->nb_table) {
+        goto hw_error_exit;
+    }
+
+    trace_spapr_iommu_ddw_create(buid, addr, 1ULL << page_shift,
+                                 1ULL << window_shift, tcet->bus_offset, liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    rtas_st(rets, 1, liobn);
+    rtas_st(rets, 2, tcet->bus_offset >> 32);
+    rtas_st(rets, 3, tcet->bus_offset & ((uint32_t) -1));
+
+    return;
+
+hw_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_HW_ERROR);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_remove_pe_dma_window(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          uint32_t token, uint32_t nargs,
+                                          target_ulong args,
+                                          uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    sPAPRTCETable *tcet;
+    uint32_t liobn;
+
+    if ((nargs != 1) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    liobn = rtas_ld(args, 0);
+    tcet = spapr_tce_find_by_liobn(liobn);
+    if (!tcet) {
+        goto param_error_exit;
+    }
+
+    sphb = SPAPR_PCI_HOST_BRIDGE(OBJECT(tcet)->parent);
+    if (!sphb || !sphb->ddw_enabled || !tcet->nb_table) {
+        goto param_error_exit;
+    }
+
+    spapr_tce_table_disable(tcet);
+    trace_spapr_iommu_ddw_remove(liobn);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void rtas_ibm_reset_pe_dma_window(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         uint32_t token, uint32_t nargs,
+                                         target_ulong args,
+                                         uint32_t nret, target_ulong rets)
+{
+    sPAPRPHBState *sphb;
+    uint64_t buid;
+    uint32_t addr;
+
+    if ((nargs != 3) || (nret != 1)) {
+        goto param_error_exit;
+    }
+
+    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
+    addr = rtas_ld(args, 0);
+    sphb = spapr_pci_find_phb(spapr, buid);
+    if (!sphb || !sphb->ddw_enabled) {
+        goto param_error_exit;
+    }
+
+    spapr_phb_dma_reset(sphb);
+    trace_spapr_iommu_ddw_reset(buid, addr);
+
+    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
+
+    return;
+
+param_error_exit:
+    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
+}
+
+static void spapr_rtas_ddw_init(void)
+{
+    spapr_rtas_register(RTAS_IBM_QUERY_PE_DMA_WINDOW,
+                        "ibm,query-pe-dma-window",
+                        rtas_ibm_query_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_CREATE_PE_DMA_WINDOW,
+                        "ibm,create-pe-dma-window",
+                        rtas_ibm_create_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_REMOVE_PE_DMA_WINDOW,
+                        "ibm,remove-pe-dma-window",
+                        rtas_ibm_remove_pe_dma_window);
+    spapr_rtas_register(RTAS_IBM_RESET_PE_DMA_WINDOW,
+                        "ibm,reset-pe-dma-window",
+                        rtas_ibm_reset_pe_dma_window);
+}
+
+type_init(spapr_rtas_ddw_init)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 7848366..92aa610 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -32,6 +32,8 @@ 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_DMA_MAX_WINDOWS    2
+
 typedef struct sPAPRPHBState sPAPRPHBState;
 
 typedef struct spapr_pci_msi {
@@ -56,7 +58,7 @@  struct sPAPRPHBState {
     hwaddr mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     MemoryRegion memwindow, iowindow, msiwindow;
 
-    uint32_t dma_liobn;
+    uint32_t dma_liobn[SPAPR_PCI_DMA_MAX_WINDOWS];
     hwaddr dma_win_addr, dma_win_size;
     AddressSpace iommu_as;
     MemoryRegion iommu_root;
@@ -71,6 +73,10 @@  struct sPAPRPHBState {
     spapr_pci_msi_mig *msi_devs;
 
     QLIST_ENTRY(sPAPRPHBState) list;
+
+    bool ddw_enabled;
+    uint64_t page_size_mask;
+    uint64_t dma64_win_addr;
 };
 
 #define SPAPR_PCI_MAX_INDEX          255
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e1f8274..36d1748 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -416,6 +416,16 @@  int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_OUT_NOT_AUTHORIZED                 -9002
 #define RTAS_OUT_SYSPARM_PARAM_ERROR            -9999
 
+/* DDW pagesize mask values from ibm,query-pe-dma-window */
+#define RTAS_DDW_PGSIZE_4K       0x01
+#define RTAS_DDW_PGSIZE_64K      0x02
+#define RTAS_DDW_PGSIZE_16M      0x04
+#define RTAS_DDW_PGSIZE_32M      0x08
+#define RTAS_DDW_PGSIZE_64M      0x10
+#define RTAS_DDW_PGSIZE_128M     0x20
+#define RTAS_DDW_PGSIZE_256M     0x40
+#define RTAS_DDW_PGSIZE_16G      0x80
+
 /* RTAS tokens */
 #define RTAS_TOKEN_BASE      0x2000
 
@@ -457,8 +467,12 @@  int spapr_allocate_irq_block(int num, bool lsi, bool msi);
 #define RTAS_IBM_SET_SLOT_RESET                 (RTAS_TOKEN_BASE + 0x23)
 #define RTAS_IBM_CONFIGURE_PE                   (RTAS_TOKEN_BASE + 0x24)
 #define RTAS_IBM_SLOT_ERROR_DETAIL              (RTAS_TOKEN_BASE + 0x25)
+#define RTAS_IBM_QUERY_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_IBM_CREATE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x27)
+#define RTAS_IBM_REMOVE_PE_DMA_WINDOW           (RTAS_TOKEN_BASE + 0x28)
+#define RTAS_IBM_RESET_PE_DMA_WINDOW            (RTAS_TOKEN_BASE + 0x29)
 
-#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x26)
+#define RTAS_TOKEN_MAX                          (RTAS_TOKEN_BASE + 0x2A)
 
 /* RTAS ibm,get-system-parameter token values */
 #define RTAS_SYSPARM_SPLPAR_CHARACTERISTICS      20
diff --git a/trace-events b/trace-events
index 7e94d92..5b52634 100644
--- a/trace-events
+++ b/trace-events
@@ -1435,6 +1435,10 @@  spapr_iommu_xlate(uint64_t liobn, uint64_t ioba, uint64_t tce, unsigned perm, un
 spapr_iommu_new_table(uint64_t liobn, void *table, int fd) "liobn=%"PRIx64" table=%p fd=%d"
 spapr_iommu_pre_save(uint64_t liobn, uint32_t nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
 spapr_iommu_post_load(uint64_t liobn, uint32_t pre_nb, uint32_t post_nb, uint64_t offs, uint32_t ps) "liobn=%"PRIx64" %"PRIx32" => %"PRIx32" bus_offset=%"PRIx64" ps=%"PRIu32
+spapr_iommu_ddw_query(uint64_t buid, uint32_t cfgaddr, unsigned wa, uint64_t win_size, uint32_t pgmask) "buid=%"PRIx64" addr=%"PRIx32", %u windows available, max window size=%"PRIx64", mask=%"PRIx32
+spapr_iommu_ddw_create(uint64_t buid, uint32_t cfgaddr, uint64_t pg_size, uint64_t req_size, uint64_t start, uint32_t liobn) "buid=%"PRIx64" addr=%"PRIx32", page size=0x%"PRIx64", requested=0x%"PRIx64", start addr=%"PRIx64", liobn=%"PRIx32
+spapr_iommu_ddw_remove(uint32_t liobn) "liobn=%"PRIx32
+spapr_iommu_ddw_reset(uint64_t buid, uint32_t cfgaddr) "buid=%"PRIx64" addr=%"PRIx32
 
 # hw/ppc/ppc.c
 ppc_tb_adjust(uint64_t offs1, uint64_t offs2, int64_t diff, int64_t seconds) "adjusted from 0x%"PRIx64" to 0x%"PRIx64", diff %"PRId64" (%"PRId64"s)"
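
For readers following the ibm,query-pe-dma-window and ibm,create-pe-dma-window handlers, the two pieces of arithmetic they rely on can be sketched outside QEMU. This is a minimal standalone illustration under stated assumptions, not the patch's code: the names ddw_query_mask, ddw_nb_tces and ddw_sketch_selfcheck and the DDW_PGSIZE_* macros are hypothetical, with the bit values copied from the RTAS_DDW_PGSIZE_* defines in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; values mirror RTAS_DDW_PGSIZE_* from this patch. */
#define DDW_PGSIZE_4K    0x01
#define DDW_PGSIZE_64K   0x02
#define DDW_PGSIZE_16M   0x04
#define DDW_PGSIZE_32M   0x08
#define DDW_PGSIZE_64M   0x10
#define DDW_PGSIZE_128M  0x20
#define DDW_PGSIZE_256M  0x40
#define DDW_PGSIZE_16G   0x80

/*
 * Translate an IOMMU page-size mask (bit N set means 2^N-byte pages are
 * supported) into the compact mask returned by the query call.
 * Page sizes without a corresponding query bit are silently dropped.
 */
uint32_t ddw_query_mask(uint64_t page_mask)
{
    static const struct { int shift; uint32_t bit; } map[] = {
        { 12, DDW_PGSIZE_4K },   { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },  { 25, DDW_PGSIZE_32M },
        { 26, DDW_PGSIZE_64M },  { 27, DDW_PGSIZE_128M },
        { 28, DDW_PGSIZE_256M }, { 34, DDW_PGSIZE_16G },
    };
    uint32_t out = 0;
    size_t i;

    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (page_mask & (1ULL << map[i].shift)) {
            out |= map[i].bit;
        }
    }
    return out;
}

/*
 * Number of TCEs needed for a window of 2^window_shift bytes made of
 * 2^page_shift-byte pages. Returns 0 for an invalid combination, which
 * is an assumption of this sketch; the handler in the patch rejects
 * window_shift < page_shift with RTAS_OUT_PARAM_ERROR instead.
 */
uint64_t ddw_nb_tces(uint32_t page_shift, uint32_t window_shift)
{
    if (window_shift < page_shift || window_shift - page_shift >= 64) {
        return 0;
    }
    return 1ULL << (window_shift - page_shift);
}

/* Small self-check covering the default advertised sizes (4K and 64K). */
void ddw_sketch_selfcheck(void)
{
    assert(ddw_query_mask((1ULL << 12) | (1ULL << 16)) ==
           (DDW_PGSIZE_4K | DDW_PGSIZE_64K));
    assert(ddw_nb_tces(16, 31) == (1ULL << 15));
}
```

With the default page_size_mask of 4K and 64K, the query mask comes out as 0x03, and a 2GB window of 64K pages needs 32768 TCEs, which is the count the create handler passes to spapr_tce_table_enable() as 1ULL << (window_shift - page_shift).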