[v2,08/27] ARM: GICv3 ITS: introduce host LPI array

Message ID 20170316112030.20419-9-andre.przywara@arm.com (mailing list archive)
State New, archived

Commit Message

Andre Przywara March 16, 2017, 11:20 a.m. UTC
The number of LPIs on a host can be potentially huge (millions),
although in practice it will be mostly reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also, to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, look up and destroy entries for a given LPI.
By using the naturally atomic access guarantee that the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
---
 xen/arch/arm/gic-v3-its.c        |  90 ++++++++++++++++++-
 xen/arch/arm/gic-v3-lpi.c        | 188 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-arm/gic.h        |   5 ++
 xen/include/asm-arm/gic_v3_its.h |  11 +++
 4 files changed, 292 insertions(+), 2 deletions(-)

Comments

Stefano Stabellini March 22, 2017, 11:38 p.m. UTC | #1
On Thu, 16 Mar 2017, Andre Przywara wrote:
> The number of LPIs on a host can be potentially huge (millions),
> although in practice it will be mostly reasonable. So prematurely allocating
> an array of struct irq_desc's for each LPI is not an option.
> However Xen itself does not care about LPIs, as every LPI will be injected
> into a guest (Dom0 for now).
> Create a dense data structure (8 bytes) for each LPI which holds just
> enough information to determine the virtual IRQ number and the VCPU into
> which the LPI needs to be injected.
> Also, to not artificially limit the number of LPIs, we create a 2-level
> table for holding those structures.
> This patch introduces functions to initialize these tables and to
> create, look up and destroy entries for a given LPI.
> By using the naturally atomic access guarantee that the native uint64_t data
> type gives us, we allocate and access LPI information in a way that does
> not require a lock.
> 
> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
> ---
>  xen/arch/arm/gic-v3-its.c        |  90 ++++++++++++++++++-
>  xen/arch/arm/gic-v3-lpi.c        | 188 +++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-arm/gic.h        |   5 ++
>  xen/include/asm-arm/gic_v3_its.h |  11 +++
>  4 files changed, 292 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/gic-v3-its.c b/xen/arch/arm/gic-v3-its.c
> index 60b15b5..ed14d95 100644
> --- a/xen/arch/arm/gic-v3-its.c
> +++ b/xen/arch/arm/gic-v3-its.c
> @@ -148,6 +148,20 @@ static int its_send_cmd_sync(struct host_its *its, unsigned int cpu)
>      return its_send_command(its, cmd);
>  }
>  
> +static int its_send_cmd_mapti(struct host_its *its,
> +                              uint32_t deviceid, uint32_t eventid,
> +                              uint32_t pintid, uint16_t icid)
> +{
> +    uint64_t cmd[4];
> +
> +    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
> +    cmd[1] = eventid | ((uint64_t)pintid << 32);
> +    cmd[2] = icid;
> +    cmd[3] = 0x00;
> +
> +    return its_send_command(its, cmd);
> +}
> +
>  static int its_send_cmd_mapc(struct host_its *its, uint32_t collection_id,
>                               unsigned int cpu)
>  {
> @@ -180,6 +194,19 @@ static int its_send_cmd_mapd(struct host_its *its, uint32_t deviceid,
>      return its_send_command(its, cmd);
>  }
>  
> +static int its_send_cmd_inv(struct host_its *its,
> +                            uint32_t deviceid, uint32_t eventid)
> +{
> +    uint64_t cmd[4];
> +
> +    cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
> +    cmd[1] = eventid;
> +    cmd[2] = 0x00;
> +    cmd[3] = 0x00;
> +
> +    return its_send_command(its, cmd);
> +}
> +
>  /* Set up the (1:1) collection mapping for the given host CPU. */
>  int gicv3_its_setup_collection(unsigned int cpu)
>  {
> @@ -462,7 +489,7 @@ int gicv3_its_init(void)
>  
>  static int remove_mapped_guest_device(struct its_devices *dev)
>  {
> -    int ret;
> +    int ret, i;
>  
>      if ( dev->hw_its )
>      {
> @@ -471,11 +498,19 @@ static int remove_mapped_guest_device(struct its_devices *dev)
>              return ret;
>      }
>  
> +    /*
> +     * The only error the function below would return is -ENOENT, in which
> +     * case there is nothing to free here. So we just ignore it.
> +     */
> +    for ( i = 0; i < DIV_ROUND_UP(dev->eventids, LPI_BLOCK); i++ )
> +        gicv3_free_host_lpi_block(dev->hw_its, dev->host_lpis[i]);
> +
>      ret = gicv3_its_wait_commands(dev->hw_its);
>      if ( ret )
>          return ret;
>  
>      xfree(dev->itt_addr);
> +    xfree(dev->host_lpis);
>      xfree(dev);
>  
>      return 0;
> @@ -513,6 +548,36 @@ static int compare_its_guest_devices(struct its_devices *dev,
>  }
>  
>  /*
> + * On the host ITS @its, map @nr_events consecutive LPIs.
> + * The mapping connects a device @devid and event @eventid pair to LPI @lpi,
> + * increasing both @eventid and @lpi to cover the number of requested LPIs.
> + */
> +int gicv3_its_map_host_events(struct host_its *its,
> +                              uint32_t devid, uint32_t eventid, uint32_t lpi,
> +                              uint32_t nr_events)
> +{
> +    uint32_t i;
> +    int ret;
> +
> +    for ( i = 0; i < nr_events; i++ )
> +    {
> +        /* For now we map every host LPI to host CPU 0 */
> +        ret = its_send_cmd_mapti(its, devid, eventid + i, lpi + i, 0);
> +        if ( ret )
> +            return ret;
> +        ret = its_send_cmd_inv(its, devid, eventid + i);
> +        if ( ret )
> +            return ret;
> +    }
> +
> +    ret = its_send_cmd_sync(its, 0);
> +    if ( ret )
> +        return ret;
> +
> +    return gicv3_its_wait_commands(its);
> +}
> +
> +/*
>   * Map a hardware device, identified by a certain host ITS and its device ID
>   * to domain d, a guest ITS (identified by its doorbell address) and device ID.
>   * Also provide the number of events (MSIs) needed for that device.
> @@ -528,7 +593,7 @@ int gicv3_its_map_guest_device(struct domain *d,
>      struct host_its *hw_its;
>      struct its_devices *dev = NULL, *temp;
>      struct rb_node **new = &d->arch.vgic.its_devices.rb_node, *parent = NULL;
> -    int ret = -ENOENT;
> +    int ret = -ENOENT, i;
>  
>      hw_its = gicv3_its_find_by_doorbell(host_doorbell);
>      if ( !hw_its )
> @@ -574,6 +639,11 @@ int gicv3_its_map_guest_device(struct domain *d,
>      if ( !dev )
>          goto out_unlock;
>  
> +    dev->host_lpis = xzalloc_array(uint32_t,
> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
> +    if ( !dev->host_lpis )
> +        goto out_unlock;

The expectation is that we are going to allocate far more than 32 lpis
for one device, possibly thousands, right?
If so, dev->host_lpis could be quite large. I am thinking we should
consider switching to a sparse data structure to save memory, as in most
cases the allocated blocks are going to be consecutive.

I would probably store the first lpi and the number of consecutive lpis,
rather than the first lpi in each block. For example, if the allocated
blocks are:

 0-31, 32-63, 64-95, 96-127, 128-159, 256-287

Now we store: 0, 32, 64, 96, 128, 256
I would store [0, 160], [256, 32]

In the case of a thousand consecutive block allocations, it would
shorten the array from 1000/32 to 1 element.

Does it make sense? Is it worth doing? This last question is mostly for
the Qualcomm and Cavium guys on the list because it depends on the
numbers of events on a given platform.
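
A minimal sketch of such a range-based encoding (hypothetical names, assuming
blocks are recorded in allocation order):

    /* One entry per run of contiguous LPI blocks. */
    struct lpi_range {
        uint32_t first_lpi;   /* first host LPI of the run */
        uint32_t nr_lpis;     /* number of consecutive LPIs in the run */
    };

    /* Record a newly allocated block, merging with the last run if adjacent. */
    static void lpi_range_append(struct lpi_range *r, unsigned int *nr,
                                 uint32_t first_lpi)
    {
        if ( *nr && r[*nr - 1].first_lpi + r[*nr - 1].nr_lpis == first_lpi )
        {
            r[*nr - 1].nr_lpis += LPI_BLOCK;
            return;
        }
        r[*nr].first_lpi = first_lpi;
        r[*nr].nr_lpis = LPI_BLOCK;
        (*nr)++;
    }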


>      ret = its_send_cmd_mapd(hw_its, host_devid, fls(nr_events - 1) - 1,
>                              virt_to_maddr(itt_addr), true);
>      if ( ret )
> @@ -591,10 +661,26 @@ int gicv3_its_map_guest_device(struct domain *d,
>  
>      spin_unlock(&d->arch.vgic.its_devices_lock);
>  
> +    /*
> +     * Map all host LPIs within this device already. We can't afford to queue
> +     * any host ITS commands later on during the guest's runtime.
> +     */
> +    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
> +    {
> +        ret = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
> +                                            i * LPI_BLOCK);
> +        if ( ret < 0 )
> +            goto out;
> +
> +        dev->host_lpis[i] = ret;
> +    }
> +
>      return 0;
>  
>  out_unlock:
>      spin_unlock(&d->arch.vgic.its_devices_lock);
> +
> +out:
>      if ( dev )
>          xfree(dev->host_lpis);
>      xfree(itt_addr);
> diff --git a/xen/arch/arm/gic-v3-lpi.c b/xen/arch/arm/gic-v3-lpi.c
> index 51d7425..59d3ba4 100644
> --- a/xen/arch/arm/gic-v3-lpi.c
> +++ b/xen/arch/arm/gic-v3-lpi.c
> @@ -20,24 +20,44 @@
>  
>  #include <xen/lib.h>
>  #include <xen/mm.h>
> +#include <xen/sched.h>
> +#include <asm/atomic.h>
> +#include <asm/domain.h>
>  #include <asm/gic.h>
>  #include <asm/gic_v3_defs.h>
>  #include <asm/gic_v3_its.h>
>  #include <asm/io.h>
>  #include <asm/page.h>
>  
> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
> +union host_lpi {
> +    uint64_t data;
> +    struct {
> +        uint32_t virt_lpi;
> +        uint16_t dom_id;
> +        uint16_t vcpu_id;
> +    };
> +};
>
>  #define LPI_PROPTABLE_NEEDS_FLUSHING    (1U << 0)
>  /* Global state */
>  static struct {
>      /* The global LPI property table, shared by all redistributors. */
>      uint8_t *lpi_property;
>      /*
> +     * A two-level table to lookup LPIs firing on the host and look up the
> +     * domain and virtual LPI number to inject.
> +     */
> +    union host_lpi **host_lpis;
> +    /*
>       * Number of physical LPIs the host supports. This is a property of
>       * the GIC hardware. We depart from the habit of naming these things
>       * "physical" in Xen, as the GICv3/4 spec uses the term "physical LPI"
>       * in a different context to differentiate them from "virtual LPIs".
>       */
>      unsigned long int nr_host_lpis;
> +    /* Protects allocation and deallocation of host LPIs, but not the access */

OK. What does protect the access to host LPIs?
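
(For context: the lock-free access the commit message alludes to would be a
single atomic 64-bit load, sketched here assuming a read_u64_atomic()
counterpart to the write_u64_atomic() used in the patch:)

    union host_lpi hlpi;

    /* One atomic read yields a consistent (virt_lpi, dom_id, vcpu_id) triple. */
    hlpi.data = read_u64_atomic(&lpi_data.host_lpis[chunk][index].data);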


> +    spinlock_t host_lpis_lock;
>      unsigned int flags;
>  } lpi_data;
>  
> @@ -50,6 +70,19 @@ struct lpi_redist_data {
>  static DEFINE_PER_CPU(struct lpi_redist_data, lpi_redist);
>  
>  #define MAX_PHYS_LPIS   (lpi_data.nr_host_lpis - LPI_OFFSET)
> +#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
> +
> +static union host_lpi *gic_get_host_lpi(uint32_t plpi)
> +{
> +    if ( !is_lpi(plpi) || plpi >= MAX_PHYS_LPIS + LPI_OFFSET )

Arguably, a plpi >= MAX_PHYS_LPIS + LPI_OFFSET is not really an LPI. I
would add the plpi >= MAX_PHYS_LPIS + LPI_OFFSET check to is_lpi.
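
(A sketch of that suggestion; nr_host_lpis stands in for the upper bound,
which would have to be made visible to the helper somehow:)

    static inline bool is_lpi(unsigned int irq)
    {
        /* Also reject IDs beyond what the host implements. */
        return irq >= LPI_OFFSET && irq < nr_host_lpis;
    }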


> +        return NULL;
> +
> +    plpi -= LPI_OFFSET;
> +    if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
> +        return NULL;
> +
> +    return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];

I am almost sure this line is longer than 80 characters. Please double-check
the coding style throughout the series.
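
(One way to keep gic_get_host_lpi within 80 columns, as a sketch:)

    union host_lpi *table = lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE];

    if ( !table )
        return NULL;

    return &table[plpi % HOST_LPIS_PER_PAGE];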


> +}
>  
>  /* Stores this redistributor's physical address and ID in a per-CPU variable */
>  void gicv3_set_redist_address(paddr_t address, unsigned int redist_id)
> @@ -204,15 +237,170 @@ int gicv3_lpi_init_rdist(void __iomem * rdist_base)
>  static unsigned int max_lpi_bits = CONFIG_MAX_PHYS_LPI_BITS;
>  integer_param("max_lpi_bits", max_lpi_bits);
>  
> +/*
> + * Allocate the 2nd level array for host LPIs. This one holds pointers
> + * to the page with the actual "union host_lpi" entries. Our LPI limit
> + * avoids excessive memory usage.
> + */
>  int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
>  {
> +    int nr_lpi_ptrs;
> +
> +    /* We rely on the data structure being atomically accessible. */
> +    BUILD_BUG_ON(sizeof(union host_lpi) > sizeof(unsigned long));

Technically, I think that arm32 is able to access uint64_t atomically
(ldrexd and strexd).
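
(If 64-bit single-copy atomicity can indeed be relied on for arm32 as well,
the build-time check could be relaxed to, as a sketch:)

    /* We rely on the data structure being atomically accessible. */
    BUILD_BUG_ON(sizeof(union host_lpi) > sizeof(uint64_t));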


>      lpi_data.nr_host_lpis = BIT_ULL(min(hw_lpi_bits, max_lpi_bits));
>  
> +    spin_lock_init(&lpi_data.host_lpis_lock);
> +
> +    nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
> +    if ( !lpi_data.host_lpis )
> +        return -ENOMEM;
> +
>      printk("GICv3: using at most %ld LPIs on the host.\n", MAX_PHYS_LPIS);
>  
>      return 0;
>  }
>  
> +/* Must be called with host_lpis_lock held. */
> +static int find_unused_host_lpi(int start, uint32_t *index)
> +{
> +    int chunk;
> +    uint32_t i = *index;
> +
> +    for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
> +    {
> +        /* If we hit an unallocated chunk, use entry 0 in that one. */
> +        if ( !lpi_data.host_lpis[chunk] )
> +        {
> +            *index = 0;
> +            return chunk;
> +        }
> +
> +        /* Find an unallocated entry in this chunk. */
> +        for ( ; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
> +        {
> +            if ( lpi_data.host_lpis[chunk][i].dom_id == INVALID_DOMID )
> +            {
> +                *index = i;
> +                return chunk;
> +            }
> +        }
> +        i = 0;
> +    }
> +
> +    return -1;
> +}
> +
> +/*
> + * Allocate a block of 32 LPIs on the given host ITS for device "devid",
> + * starting with "eventid". Put them into the respective ITT by issuing a
> + * MAPTI command for each of them.
> + */
> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
> +                                  uint32_t host_devid, uint32_t eventid)
> +{
> +    static uint32_t next_lpi = 0;
> +    uint32_t lpi, lpi_idx = next_lpi % HOST_LPIS_PER_PAGE;
> +    int chunk;
> +    int i;
> +
> +    spin_lock(&lpi_data.host_lpis_lock);
> +    chunk = find_unused_host_lpi(next_lpi / HOST_LPIS_PER_PAGE, &lpi_idx);
> +
> +    if ( chunk == - 1 )          /* rescan for a hole from the beginning */
> +    {
> +        lpi_idx = 0;
> +        chunk = find_unused_host_lpi(0, &lpi_idx);
> +        if ( chunk == -1 )
> +        {
> +            spin_unlock(&lpi_data.host_lpis_lock);
> +            return -ENOSPC;
> +        }
> +    }
> +
> +    /* If we hit an unallocated chunk, we initialize it and use entry 0. */
> +    if ( !lpi_data.host_lpis[chunk] )
> +    {
> +        union host_lpi *new_chunk;
> +
> +        new_chunk = alloc_xenheap_pages(0, 0);
> +        if ( !new_chunk )
> +        {
> +            spin_unlock(&lpi_data.host_lpis_lock);
> +            return -ENOMEM;
> +        }
> +
> +        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
> +            new_chunk[i].dom_id = INVALID_DOMID;
> +
> +        lpi_data.host_lpis[chunk] = new_chunk;
> +        lpi_idx = 0;
> +    }
> +
> +    lpi = chunk * HOST_LPIS_PER_PAGE + lpi_idx;
> +
> +    for ( i = 0; i < LPI_BLOCK; i++ )
> +    {
> +        union host_lpi hlpi;
> +
> +        /*
> +         * Mark this host LPI as belonging to the domain, but don't assign
> +         * any virtual LPI or a VCPU yet.
> +         */
> +        hlpi.virt_lpi = INVALID_LPI;
> +        hlpi.dom_id = d->domain_id;
> +        hlpi.vcpu_id = INVALID_DOMID;
> +        write_u64_atomic(&lpi_data.host_lpis[chunk][lpi_idx + i].data,
> +                         hlpi.data);
> +
> +        /*
> +         * Enable this host LPI, so we don't have to do this during the
> +         * guest's runtime.
> +         */
> +        lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;
> +    }
> +
> +    /*
> +     * We have allocated and initialized the host LPI entries, so it's safe
> +     * to drop the lock now. Access to the structures can be done concurrently
> +     * as it involves only an atomic uint64_t access.
> +     */
> +    spin_unlock(&lpi_data.host_lpis_lock);
> +
> +    if ( lpi_data.flags & LPI_PROPTABLE_NEEDS_FLUSHING )
> +        clean_and_invalidate_dcache_va_range(&lpi_data.lpi_property[lpi],
> +                                             LPI_BLOCK);

I would move this right below 
  lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;


> +    gicv3_its_map_host_events(its, host_devid, eventid, lpi + LPI_OFFSET,
> +                              LPI_BLOCK);
> +
> +    next_lpi = lpi + LPI_BLOCK;
> +    return lpi + LPI_OFFSET;
> +}
> +
> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
> +{
> +    union host_lpi *hlpi, empty_lpi = { .dom_id = INVALID_DOMID };
> +    int i;
> +
> +    hlpi = gic_get_host_lpi(lpi);

I think you need to make sure that "lpi" is the first LPI of an LPI_BLOCK.
Either that, or gic_get_host_lpi should make sure to always return the
first LPI in a block.
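
(A cheap way to enforce the former, as a sketch, relying on blocks being
allocated at LPI_BLOCK-aligned offsets:)

    /* Only accept the first LPI of a block. */
    if ( (lpi - LPI_OFFSET) % LPI_BLOCK )
        return -EINVAL;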


> +    if ( !hlpi )
> +        return -ENOENT;
> +
> +    spin_lock(&lpi_data.host_lpis_lock);
> +
> +    for ( i = 0; i < LPI_BLOCK; i++ )
> +        write_u64_atomic(&hlpi[i].data, empty_lpi.data);
> +
> +    /* TODO: Call a function in gic-v3-its.c to send DISCARDs */
> +
> +    spin_unlock(&lpi_data.host_lpis_lock);
> +
> +    return 0;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-arm/gic.h b/xen/include/asm-arm/gic.h
> index 12bd155..7825575 100644
> --- a/xen/include/asm-arm/gic.h
> +++ b/xen/include/asm-arm/gic.h
> @@ -220,7 +220,12 @@ enum gic_version {
>      GIC_V3,
>  };
>  
> +#define INVALID_LPI     0
>  #define LPI_OFFSET      8192
> +static inline bool is_lpi(unsigned int irq)
> +{
> +    return irq >= LPI_OFFSET;
> +}
>  
>  extern enum gic_version gic_hw_version(void);
>  
> diff --git a/xen/include/asm-arm/gic_v3_its.h b/xen/include/asm-arm/gic_v3_its.h
> index 3421ea0..2387972 100644
> --- a/xen/include/asm-arm/gic_v3_its.h
> +++ b/xen/include/asm-arm/gic_v3_its.h
> @@ -106,6 +106,11 @@
>  #define HOST_ITS_FLUSH_CMD_QUEUE        (1U << 0)
>  #define HOST_ITS_USES_PTA               (1U << 1)
>  
> +#define INVALID_DOMID ((uint16_t)~0)

We already have DOMID_INVALID


> +/* We allocate LPIs on the hosts in chunks of 32 to reduce handling overhead. */
> +#define LPI_BLOCK                       32
> +
>  /* data structure for each hardware ITS */
>  struct host_its {
>      struct list_head entry;
> @@ -151,6 +156,12 @@ int gicv3_its_map_guest_device(struct domain *d,
>                                 paddr_t host_doorbell, uint32_t host_devid,
>                                 paddr_t guest_doorbell, uint32_t guest_devid,
>                                 uint32_t nr_events, bool valid);
> +int gicv3_its_map_host_events(struct host_its *its,
> +                              uint32_t host_devid, uint32_t eventid,
> +                              uint32_t lpi, uint32_t nrevents);
> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
> +                                  uint32_t host_devid, uint32_t eventid);
> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi);
>  
>  #else
>  
> -- 
> 2.9.0
>
Julien Grall March 23, 2017, 8:48 a.m. UTC | #2
Hi Stefano,

On 22/03/2017 23:38, Stefano Stabellini wrote:
> On Thu, 16 Mar 2017, Andre Przywara wrote:

[...]

>> +    dev->host_lpis = xzalloc_array(uint32_t,
>> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
>> +    if ( !dev->host_lpis )
>> +        goto out_unlock;
>
> The expectation is that we are going to allocate far more than 32 lpis
> for one device, possibly thousands, right?
> If so, dev->host_lpis could be quite large. I am thinking we should
> consider switching to a sparse data structure to save memory, as in most
> cases the allocated blocks are going to be consecutive.
>
> I would probably store the first lpi and the number of consecutive lpis,
> rather than the first lpi in each block. For example, if the allocated
> blocks are:
>
>  0-31, 32-63, 64-95, 96-127, 128-159, 256-287
>
> Now we store: 0, 32, 64, 96, 128, 256
> I would store [0, 160], [256, 32]
>
> In the case of a thousand consecutive block allocations, it would
> shorten the array from 1000/32 to 1 element.
>
> Does it make sense? Is it worth doing? This last question is mostly for
> the Qualcomm and Cavium guys on the list because it depends on the
> numbers of events on a given platform.

Not answering the question directly, but I think this is an
improvement and not necessary for a first version.

I would like to see this series merged in Xen 4.9 as a tech preview, so
I would suggest gathering this list of improvements (maybe on Jira?) and
deferring them to Xen 4.10. They would need to be addressed before making
ITS stable. Stefano, does that sound good to you?

Cheers,
Andre Przywara March 23, 2017, 10:21 a.m. UTC | #3
Hi,

On 22/03/17 23:38, Stefano Stabellini wrote:
> On Thu, 16 Mar 2017, Andre Przywara wrote:

[...]

>> +    dev->host_lpis = xzalloc_array(uint32_t,
>> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
>> +    if ( !dev->host_lpis )
>> +        goto out_unlock;
> 
> The expectation is that we are going to allocate far more than 32 lpis
> for one device, possibly thousands, right?

Mmh, what makes you think so?
My understanding is that you have one MSI per interrupt "reason" (TX,
RX, error, for instance). And even if you have multiple queues, each
connected to separate MSIs, that won't go into the thousands. MSI-X for
instance is *limited* to 2048 MSIs.

But in reality I don't see nearly that many MSIs, mostly less than a
dozen per device. And since 32 is the old PCI MSI limit, I thought this
would be a good starting point. So most of the time we just need one block.

I believe this could add up with virtual functions, but those are
separate devices, right?

Cheers,
Andre.

[...]
Stefano Stabellini March 23, 2017, 5:52 p.m. UTC | #4
On Thu, 23 Mar 2017, Andre Przywara wrote:
> Hi,
> 
> On 22/03/17 23:38, Stefano Stabellini wrote:
> > On Thu, 16 Mar 2017, Andre Przywara wrote:

[...]

> >> +    dev->host_lpis = xzalloc_array(uint32_t,
> >> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
> >> +    if ( !dev->host_lpis )
> >> +        goto out_unlock;
> > 
> > The expectation is that we are going to allocate far more than 32 lpis
> > for one device, possibly thousands, right?
> 
> Mmh, what makes you think so?
> My understanding is that you have one MSI per interrupt "reason" (TX,
> RX, error, for instance). And even if you have multiple queues, each
> connected to separate MSIs, that won't go into the thousands. MSI-X for
> instance is *limited* to 2048 MSIs.
> 
> But in reality I don't see nearly that many MSIs, mostly less than a
> dozen per device. And since 32 is the old PCI MSI limit, I thought this
> would be a good starting point. So most of the time we just need one block.
> 
> I believe this could add up with virtual functions, but those are
> separate devices, right?

Yes, you are right about that, but even 2048 can make a difference when
taken 32 at a time: an array of 64 entries vs. 1 entry (if the allocation
is fully contiguous). And that is per device; in the case of SR-IOV there
can be hundreds of devices.


On Thu, 23 Mar 2017, Julien Grall wrote:
> Not answering the question directly, but I think this is an improvement and
> not necessary for a first version.
> 
> I would like to see this series merged in Xen 4.9 as a tech preview, so I
> would suggest gathering this list of improvements (maybe on Jira?) and
> deferring them to Xen 4.10. They would need to be addressed before making
> ITS stable. Stefano, does that sound good to you?

Yes, I am OK with not having this in 4.9, but we need to keep track
(Jira or otherwise) of these things.
Julien Grall March 23, 2017, 7:08 p.m. UTC | #5
Hi Andre,

On 16/03/17 11:20, Andre Przywara wrote:

[...]

> @@ -471,11 +498,19 @@ static int remove_mapped_guest_device(struct its_devices *dev)
>              return ret;
>      }
>
> +    /*
> +     * The only error the function below would return is -ENOENT, in which
> +     * case there is nothing to free here. So we just ignore it.
> +     */

The function gicv3_free_host_lpi_block will only be used here. And to be
fair, if you try to free something that does not exist, that is not
really an error...

So I would prefer the function to be void.
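
(Roughly, a sketch of the void variant:)

    void gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
    {
        union host_lpi *hlpi = gic_get_host_lpi(lpi);
        union host_lpi empty_lpi = { .dom_id = INVALID_DOMID };
        unsigned int i;

        /* Freeing a block that was never allocated is simply a no-op. */
        if ( !hlpi )
            return;

        spin_lock(&lpi_data.host_lpis_lock);

        for ( i = 0; i < LPI_BLOCK; i++ )
            write_u64_atomic(&hlpi[i].data, empty_lpi.data);

        spin_unlock(&lpi_data.host_lpis_lock);
    }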

> +    for ( i = 0; i < DIV_ROUND_UP(dev->eventids, LPI_BLOCK); i++ )
> +        gicv3_free_host_lpi_block(dev->hw_its, dev->host_lpis[i]);

Again, without looking at the implementation of
gicv3_free_host_lpi_block, I think the usage of the function is very
confusing. When I read host_lpis, I expect to see one LPI per event, but
instead each entry is the first LPI of a block. The lack of documentation
of the field in its_devices does not help in understanding what's going on.

So please add some documentation and probably rename some fields.

Lastly, you have not answered my question: should we not discard the
LPIs before removing the device? Or does MAPD take care of that for you?
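
(For instance, a sketch of the documentation and renaming being asked for,
in struct its_devices:)

    /*
     * One entry per LPI_BLOCK: the *first* host LPI of each block of
     * LPI_BLOCK consecutive LPIs mapped to this device, not one LPI
     * per event.
     */
    uint32_t *host_lpi_blocks;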

> +
>      ret = gicv3_its_wait_commands(dev->hw_its);
>      if ( ret )
>          return ret;

I know I asked to wait for the commands, thank you for addressing it. But
now, if the function fails, you will end up leaking memory. This is no
better than failing to wait for the commands.

>
>      xfree(dev->itt_addr);
> +    xfree(dev->host_lpis);
>      xfree(dev);
>
>      return 0;
> @@ -513,6 +548,36 @@ static int compare_its_guest_devices(struct its_devices *dev,
>  }
>
>  /*
> + * On the host ITS @its, map @nr_events consecutive LPIs.
> + * The mapping connects a device @devid and event @eventid pair to LPI @lpi,
> + * increasing both @eventid and @lpi to cover the number of requested LPIs.
> + */
> +int gicv3_its_map_host_events(struct host_its *its,
> +                              uint32_t devid, uint32_t eventid, uint32_t lpi,
> +                              uint32_t nr_events)
> +{
> +    uint32_t i;
> +    int ret;
> +
> +    for ( i = 0; i < nr_events; i++ )
> +    {
> +        /* For now we map every host LPI to host CPU 0 */
> +        ret = its_send_cmd_mapti(its, devid, eventid + i, lpi + i, 0);
> +        if ( ret )
> +            return ret;
> +        ret = its_send_cmd_inv(its, devid, eventid + i);


So the spec allows up to 32K events per device. As all the LPIs will be
routed to CPU0 (i.e. collection 0), it would be more efficient to do an
INVALL. I would be happy to defer that to post Xen 4.9, but the rest
needs to be fixed.

Furthermore, what if the queue is full? AFAIU, you will return an error,
but it is not propagated. So Xen will think the device has been mapped
even though it is not true.

I think we need to have a plan here, as this may well happen if a
device has many MSIs and/or the queue is nearly full.
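
(A sketch of the INVALL helper, following the style of the other command
builders and assuming GITS_CMD_INVALL is defined alongside the other opcodes:)

    static int its_send_cmd_invall(struct host_its *its, unsigned int cpu)
    {
        uint64_t cmd[4];

        cmd[0] = GITS_CMD_INVALL;
        cmd[1] = 0x00;
        cmd[2] = cpu & 0xffff;          /* ICID, mapped 1:1 to the CPU */
        cmd[3] = 0x00;

        return its_send_command(its, cmd);
    }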

> +        if ( ret )
> +            return ret;
> +    }
> +
> +    ret = its_send_cmd_sync(its, 0);
> +    if ( ret )
> +        return ret;
> +
> +    return gicv3_its_wait_commands(its);
> +}
> +
> +/*
>   * Map a hardware device, identified by a certain host ITS and its device ID
>   * to domain d, a guest ITS (identified by its doorbell address) and device ID.
>   * Also provide the number of events (MSIs) needed for that device.
> @@ -528,7 +593,7 @@ int gicv3_its_map_guest_device(struct domain *d,
>      struct host_its *hw_its;
>      struct its_devices *dev = NULL, *temp;
>      struct rb_node **new = &d->arch.vgic.its_devices.rb_node, *parent = NULL;
> -    int ret = -ENOENT;
> +    int ret = -ENOENT, i;

Again, i should be unsigned.

>
>      hw_its = gicv3_its_find_by_doorbell(host_doorbell);
>      if ( !hw_its )
> @@ -574,6 +639,11 @@ int gicv3_its_map_guest_device(struct domain *d,
>      if ( !dev )
>          goto out_unlock;
>
> +    dev->host_lpis = xzalloc_array(uint32_t,
> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));

Rather than having DIV_ROUND_UP spread everywhere, would it not be easier
to round up nr_events once and for all?
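
(E.g., as a sketch, once at the top of gicv3_its_map_guest_device, using
Xen's ROUNDUP():)

    /* Normalise once; every later division by LPI_BLOCK is then exact. */
    nr_events = ROUNDUP(nr_events, LPI_BLOCK);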

> +    if ( !dev->host_lpis )
> +        goto out_unlock;
> +
>      ret = its_send_cmd_mapd(hw_its, host_devid, fls(nr_events - 1) - 1,
>                              virt_to_maddr(itt_addr), true);
>      if ( ret )
> @@ -591,10 +661,26 @@ int gicv3_its_map_guest_device(struct domain *d,
>
>      spin_unlock(&d->arch.vgic.its_devices_lock);
>
> +    /*
> +     * Map all host LPIs within this device already. We can't afford to queue
> +     * any host ITS commands later on during the guest's runtime.
> +     */
> +    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
> +    {
> +        ret = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
> +                                            i * LPI_BLOCK);
> +        if ( ret < 0 )
> +            goto out;

This sounds quite wrong to me. If you fail to allocate a block, you will
end up with half of the device configured. However, you are freeing the
memory but not reverting the ITS commands. This is going to end very
badly.

Please be really careful with the error paths. Most of them are not
correct, and I don't want to end up dealing with security issues later on.
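
(A sketch of the missing unwinding: free the blocks allocated so far and
revert the MAPD before bailing out:)

    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
    {
        ret = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
                                            i * LPI_BLOCK);
        if ( ret < 0 )
        {
            while ( i-- )
                gicv3_free_host_lpi_block(hw_its, dev->host_lpis[i]);
            /* Unmap the device again (valid = false). */
            its_send_cmd_mapd(hw_its, host_devid, 0, 0, false);
            goto out;
        }
        dev->host_lpis[i] = ret;
    }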

> +
> +        dev->host_lpis[i] = ret;

Hmmm... as I said on the previous version, you should not use the return
value for both error codes and LPIs. LPIs are encoded in 32 bits; sooner
or later there will be a clash between the error and the LPI.
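
(E.g., a sketch of a split interface:)

    /* Return 0 or -errno; hand back the first LPI of the block separately. */
    int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
                                      uint32_t host_devid, uint32_t eventid,
                                      uint32_t *first_lpi);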

> +    }
> +
>      return 0;
>
>  out_unlock:
>      spin_unlock(&d->arch.vgic.its_devices_lock);
> +
> +out:
>      if ( dev )
>          xfree(dev->host_lpis);
>      xfree(itt_addr);

[...]

> +/*
> + * Allocate the 2nd level array for host LPIs. This one holds pointers
> + * to the page with the actual "union host_lpi" entries. Our LPI limit
> + * avoids excessive memory usage.
> + */
>  int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
>  {
> +    int nr_lpi_ptrs;

unsigned int please.

> +
> +    /* We rely on the data structure being atomically accessible. */
> +    BUILD_BUG_ON(sizeof(union host_lpi) > sizeof(unsigned long));
> +
>      lpi_data.nr_host_lpis = BIT_ULL(min(hw_lpi_bits, max_lpi_bits));
>
> +    spin_lock_init(&lpi_data.host_lpis_lock);
> +
> +    nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
> +    if ( !lpi_data.host_lpis )
> +        return -ENOMEM;
> +
>      printk("GICv3: using at most %ld LPIs on the host.\n", MAX_PHYS_LPIS);
>
>      return 0;
>  }
>
> +/* Must be called with host_lpis_lock held. */

Again, this is a call for adding an ASSERT in the function.
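
For instance (a sketch, assuming spin_is_locked() from xen/spinlock.h):

    ASSERT(spin_is_locked(&lpi_data.host_lpis_lock));

as the first statement of the function would catch any caller that
forgot to take the lock.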

> +static int find_unused_host_lpi(int start, uint32_t *index)

start should probably be unsigned.

> +{
> +    int chunk;

Ditto.

> +    uint32_t i = *index;
> +
> +    for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
> +    {
> +        /* If we hit an unallocated chunk, use entry 0 in that one. */
> +        if ( !lpi_data.host_lpis[chunk] )
> +        {
> +            *index = 0;
> +            return chunk;
> +        }
> +
> +        /* Find an unallocated entry in this chunk. */
> +        for ( ; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
> +        {
> +            if ( lpi_data.host_lpis[chunk][i].dom_id == INVALID_DOMID )
> +            {
> +                *index = i;
> +                return chunk;
> +            }
> +        }
> +        i = 0;
> +    }
> +
> +    return -1;
> +}
> +
> +/*
> + * Allocate a block of 32 LPIs on the given host ITS for device "devid",
> + * starting with "eventid". Put them into the respective ITT by issuing a
> + * MAPTI command for each of them.
> + */
> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
> +                                  uint32_t host_devid, uint32_t eventid)

Again, most of the parameters here are just for calling back into the ITS, and 
the caller should have them all in hand. So I would much prefer to avoid a call 
chain ITS -> LPI -> ITS and make the LPI code ITS-agnostic.

Furthermore, the return value is used both for an error code and for the 
LPI. Given that an LPI is encoded on 32 bits, sooner or later there will 
be a clash between the error and the LPI. So it would be wise to 
dissociate the error code from the LPIs.

The prototype of this function would look like:

int gicv3_allocate_host_lpi_block(struct domain *domain, uint32_t *first_lpi);
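
The call site would then read something like this (a sketch only):

    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
    {
        uint32_t first_lpi;

        ret = gicv3_allocate_host_lpi_block(d, &first_lpi);
        if ( ret )
            goto out;

        dev->host_lpis[i] = first_lpi;
    }

keeping the error code and the allocated LPI cleanly separated.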

> +{
> +    static uint32_t next_lpi = 0;
> +    uint32_t lpi, lpi_idx = next_lpi % HOST_LPIS_PER_PAGE;
> +    int chunk;
> +    int i;
> +
> +    spin_lock(&lpi_data.host_lpis_lock);
> +    chunk = find_unused_host_lpi(next_lpi / HOST_LPIS_PER_PAGE, &lpi_idx);
> +
> +    if ( chunk == - 1 )          /* rescan for a hole from the beginning */
> +    {
> +        lpi_idx = 0;
> +        chunk = find_unused_host_lpi(0, &lpi_idx);
> +        if ( chunk == -1 )
> +        {
> +            spin_unlock(&lpi_data.host_lpis_lock);
> +            return -ENOSPC;
> +        }

The algo does not seem to have changed since the previous version. 
Looking at it again, my understanding is that you will always try to allocate 
forward. So if the current chunk is full, you will allocate the next one 
rather than checking whether a previous chunk has space available. This 
will result in allocating more memory than necessary.

Similarly, unused chunks could be freed to save memory.

To be clear, I am asking for confirmation, documentation, and listing this as 
an improvement.

> +    }
> +
> +    /* If we hit an unallocated chunk, we initialize it and use entry 0. */
> +    if ( !lpi_data.host_lpis[chunk] )
> +    {
> +        union host_lpi *new_chunk;
> +
> +        new_chunk = alloc_xenheap_pages(0, 0);

As said on the previous version, please use alloc_xenheap_page as you 
only allocate one page.

Also, when NUMA support is added we may want to take into account 
the node associated with the device, saving us some time when reading the 
memory. You don't need to handle that now, but a TODO would be quite 
helpful.

> +        if ( !new_chunk )
> +        {
> +            spin_unlock(&lpi_data.host_lpis_lock);
> +            return -ENOMEM;
> +        }
> +
> +        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
> +            new_chunk[i].dom_id = INVALID_DOMID;
> +
> +        lpi_data.host_lpis[chunk] = new_chunk;
> +        lpi_idx = 0;
> +    }
> +
> +    lpi = chunk * HOST_LPIS_PER_PAGE + lpi_idx;
> +
> +    for ( i = 0; i < LPI_BLOCK; i++ )
> +    {
> +        union host_lpi hlpi;
> +
> +        /*
> +         * Mark this host LPI as belonging to the domain, but don't assign
> +         * any virtual LPI or a VCPU yet.
> +         */
> +        hlpi.virt_lpi = INVALID_LPI;
> +        hlpi.dom_id = d->domain_id;
> +        hlpi.vcpu_id = INVALID_DOMID;

Again, please don't mix vcpu and domain. If INVALID_VCPUID does not 
exist then it might be worth adding one.

> +        write_u64_atomic(&lpi_data.host_lpis[chunk][lpi_idx + i].data,
> +                         hlpi.data);
> +
> +        /*
> +         * Enable this host LPI, so we don't have to do this during the
> +         * guest's runtime.
> +         */
> +        lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;
> +    }
> +
> +    /*
> +     * We have allocated and initialized the host LPI entries, so it's safe
> +     * to drop the lock now. Access to the structures can be done concurrently
> +     * as it involves only an atomic uint64_t access.
> +     */
> +    spin_unlock(&lpi_data.host_lpis_lock);
> +
> +    if ( lpi_data.flags & LPI_PROPTABLE_NEEDS_FLUSHING )
> +        clean_and_invalidate_dcache_va_range(&lpi_data.lpi_property[lpi],
> +                                             LPI_BLOCK);
> +
> +    gicv3_its_map_host_events(its, host_devid, eventid, lpi + LPI_OFFSET,
> +                              LPI_BLOCK);

See my comment at the top of the function.

> +
> +    next_lpi = lpi + LPI_BLOCK;

Newline here please.

> +    return lpi + LPI_OFFSET;
> +}
> +
> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
> +{
> +    union host_lpi *hlpi, empty_lpi = { .dom_id = INVALID_DOMID };
> +    int i;
> +
> +    hlpi = gic_get_host_lpi(lpi);
> +    if ( !hlpi )
> +        return -ENOENT;
> +
> +    spin_lock(&lpi_data.host_lpis_lock);
> +
> +    for ( i = 0; i < LPI_BLOCK; i++ )
> +        write_u64_atomic(&hlpi[i].data, empty_lpi.data);
> +
> +    /* TODO: Call a function in gic-v3-its.c to send DISCARDs */

It is quite worrying to see a TODO in code that is supposed to be 
merged really soon.

I think this should be done by the caller and not here. You also need to 
disable the interrupts, which is not done here.

> +
> +    spin_unlock(&lpi_data.host_lpis_lock);
> +
> +    return 0;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-arm/gic.h b/xen/include/asm-arm/gic.h
> index 12bd155..7825575 100644
> --- a/xen/include/asm-arm/gic.h
> +++ b/xen/include/asm-arm/gic.h
> @@ -220,7 +220,12 @@ enum gic_version {
>      GIC_V3,
>  };
>
> +#define INVALID_LPI     0
>  #define LPI_OFFSET      8192

Again, newline here please.

> +static inline bool is_lpi(unsigned int irq)
> +{
> +    return irq >= LPI_OFFSET;
> +}

Again, I think both INVALID_LPI and is_lpi should be moved to irq.h.

>
>  extern enum gic_version gic_hw_version(void);
>
> diff --git a/xen/include/asm-arm/gic_v3_its.h b/xen/include/asm-arm/gic_v3_its.h
> index 3421ea0..2387972 100644
> --- a/xen/include/asm-arm/gic_v3_its.h
> +++ b/xen/include/asm-arm/gic_v3_its.h
> @@ -106,6 +106,11 @@
>  #define HOST_ITS_FLUSH_CMD_QUEUE        (1U << 0)
>  #define HOST_ITS_USES_PTA               (1U << 1)
>
> +#define INVALID_DOMID ((uint16_t)~0)

Again, rather than defining your own invalid domid, it would be better 
to use the one defined in xen.h (see DOMID_INVALID).

> +
> +/* We allocate LPIs on the hosts in chunks of 32 to reduce handling overhead. */
> +#define LPI_BLOCK                       32
> +
>  /* data structure for each hardware ITS */
>  struct host_its {
>      struct list_head entry;
> @@ -151,6 +156,12 @@ int gicv3_its_map_guest_device(struct domain *d,
>                                 paddr_t host_doorbell, uint32_t host_devid,
>                                 paddr_t guest_doorbell, uint32_t guest_devid,
>                                 uint32_t nr_events, bool valid);
> +int gicv3_its_map_host_events(struct host_its *its,
> +                              uint32_t host_devid, uint32_t eventid,
> +                              uint32_t lpi, uint32_t nrevents);
> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
> +                                  uint32_t host_devid, uint32_t eventid);
> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi);
>
>  #else
>
>

Cheers,
Julien Grall March 24, 2017, 11:54 a.m. UTC | #6
Hi Stefano,

On 03/23/2017 05:52 PM, Stefano Stabellini wrote:
> On Thu, 23 Mar 2017, Andre Przywara wrote:
> On Thu, 23 Mar 2017, Julien Grall wrote:
>> Not answering the question directly, but I think this is an improvement and
>> not necessary for a first version.
>>
>> I would like to see this series merged in Xen 4.9 as a tech preview, so I
>> would suggest gathering this list of improvements (maybe on Jira?) and deferring
>> them to Xen 4.10. They would need to be addressed before making ITS stable.
>> Stefano, does it sound good to you?
>
> Yes, I am OK with not having this in 4.9, but we need to keep track
> (Jira or otherwise) of these things.

I have created XEN-60 for this item and linked to XEN-2 (the Xen ITS task).

Cheers,
Andre Przywara April 3, 2017, 7:30 p.m. UTC | #7
Hi,

On 23/03/17 19:08, Julien Grall wrote:
> Hi Andre,
> 
> On 16/03/17 11:20, Andre Przywara wrote:
>> The number of LPIs on a host can be potentially huge (millions),
>> although in practise will be mostly reasonable. So prematurely allocating
>> an array of struct irq_desc's for each LPI is not an option.
>> However Xen itself does not care about LPIs, as every LPI will be
>> injected
>> into a guest (Dom0 for now).
>> Create a dense data structure (8 Bytes) for each LPI which holds just
>> enough information to determine the virtual IRQ number and the VCPU into
>> which the LPI needs to be injected.
>> Also to not artificially limit the number of LPIs, we create a 2-level
>> table for holding those structures.
>> This patch introduces functions to initialize these tables and to
>> create, lookup and destroy entries for a given LPI.
>> By using the naturally atomic access guarantee the native uint64_t data
>> type gives us, we allocate and access LPI information in a way that does
>> not require a lock.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@arm.com>
>> ---
>>  xen/arch/arm/gic-v3-its.c        |  90 ++++++++++++++++++-
>>  xen/arch/arm/gic-v3-lpi.c        | 188
>> +++++++++++++++++++++++++++++++++++++++
>>  xen/include/asm-arm/gic.h        |   5 ++
>>  xen/include/asm-arm/gic_v3_its.h |  11 +++
>>  4 files changed, 292 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/arm/gic-v3-its.c b/xen/arch/arm/gic-v3-its.c
>> index 60b15b5..ed14d95 100644
>> --- a/xen/arch/arm/gic-v3-its.c
>> +++ b/xen/arch/arm/gic-v3-its.c
>> @@ -148,6 +148,20 @@ static int its_send_cmd_sync(struct host_its
>> *its, unsigned int cpu)
>>      return its_send_command(its, cmd);
>>  }
>>
>> +static int its_send_cmd_mapti(struct host_its *its,
>> +                              uint32_t deviceid, uint32_t eventid,
>> +                              uint32_t pintid, uint16_t icid)
>> +{
>> +    uint64_t cmd[4];
>> +
>> +    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
>> +    cmd[1] = eventid | ((uint64_t)pintid << 32);
>> +    cmd[2] = icid;
>> +    cmd[3] = 0x00;
>> +
>> +    return its_send_command(its, cmd);
>> +}
>> +
>>  static int its_send_cmd_mapc(struct host_its *its, uint32_t
>> collection_id,
>>                               unsigned int cpu)
>>  {
>> @@ -180,6 +194,19 @@ static int its_send_cmd_mapd(struct host_its
>> *its, uint32_t deviceid,
>>      return its_send_command(its, cmd);
>>  }
>>
>> +static int its_send_cmd_inv(struct host_its *its,
>> +                            uint32_t deviceid, uint32_t eventid)
>> +{
>> +    uint64_t cmd[4];
>> +
>> +    cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
>> +    cmd[1] = eventid;
>> +    cmd[2] = 0x00;
>> +    cmd[3] = 0x00;
>> +
>> +    return its_send_command(its, cmd);
>> +}
>> +
>>  /* Set up the (1:1) collection mapping for the given host CPU. */
>>  int gicv3_its_setup_collection(unsigned int cpu)
>>  {
>> @@ -462,7 +489,7 @@ int gicv3_its_init(void)
>>
>>  static int remove_mapped_guest_device(struct its_devices *dev)
>>  {
>> -    int ret;
>> +    int ret, i;
>>
>>      if ( dev->hw_its )
>>      {
>> @@ -471,11 +498,19 @@ static int remove_mapped_guest_device(struct
>> its_devices *dev)
>>              return ret;
>>      }
>>
>> +    /*
>> +     * The only error the function below would return is -ENOENT, in
>> which
>> +     * case there is nothing to free here. So we just ignore it.
>> +     */
> 
> The function gicv3_free_host_lpi_block will only be used here. And to be
> fair, if you try to free something that does not exist then it is not
> really an error...
> 
> So I would prefer the function to be void.

Fixed in v3.

>> +    for ( i = 0; i < DIV_ROUND_UP(dev->eventids, LPI_BLOCK); i++ )
>> +        gicv3_free_host_lpi_block(dev->hw_its, dev->host_lpis[i]);
> 
> Again, without looking at the implementation of
> gicv3_free_host_lpi_block, I think the usage of the function is very
> confusing. When I read host_lpis, I expect to see one LPI per event. But
> instead it is the first LPI of a block. The lack of documentation of the
> field in its_devices does not help to understand what's going on.
> 
> So please add some documentation and probably renaming some fields.

Fixed in v3.

> 
> Lastly, you have not answered my question: "should we not discard the
> LPIs before removing the device? Or does MAPD take care of that for you?"

Yes: "MAPD removes the mapping of the specified DeviceID. and interrupt
requests from that device are discarded."

So we don't need to issue individual DISCARDs if the device in unmapped.
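
For completeness, the unmap then boils down to a single MAPD with V=0,
roughly like this (a sketch; dev->host_devid is an assumption about the
its_devices layout, the real call lives in remove_mapped_guest_device()):

    ret = its_send_cmd_mapd(dev->hw_its, dev->host_devid, 0, 0, false);

which invalidates every event behind that device in one go.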

>> +
>>      ret = gicv3_its_wait_commands(dev->hw_its);
>>      if ( ret )
>>          return ret;
> 
> I know I asked to wait for the commands, thank you for addressing it. But
> now, if the function fails you will end up leaking memory. This is no
> better than failing to wait for the commands.

Fixed in v3.

>>
>>      xfree(dev->itt_addr);
>> +    xfree(dev->host_lpis);
>>      xfree(dev);
>>
>>      return 0;
>> @@ -513,6 +548,36 @@ static int compare_its_guest_devices(struct
>> its_devices *dev,
>>  }
>>
>>  /*
>> + * On the host ITS @its, map @nr_events consecutive LPIs.
>> + * The mapping connects a device @devid and event @eventid pair to
>> LPI @lpi,
>> + * increasing both @eventid and @lpi to cover the number of requested
>> LPIs.
>> + */
>> +int gicv3_its_map_host_events(struct host_its *its,
>> +                              uint32_t devid, uint32_t eventid,
>> uint32_t lpi,
>> +                              uint32_t nr_events)
>> +{
>> +    uint32_t i;
>> +    int ret;
>> +
>> +    for ( i = 0; i < nr_events; i++ )
>> +    {
>> +        /* For now we map every host LPI to host CPU 0 */
>> +        ret = its_send_cmd_mapti(its, devid, eventid + i, lpi + i, 0);
>> +        if ( ret )
>> +            return ret;
>> +        ret = its_send_cmd_inv(its, devid, eventid + i);
> 
> 
> So the spec allows up to 32K events per device. As all the LPIs will be
> routed to CPU0 (i.e. collection 0), it would be more efficient to do an
> INVALL. I would be happy to defer that to post Xen 4.9, but the rest
> needs to be fixed.

I tried INVALL and it didn't work, at least on the model. I can't see
why, so I kept the individual INVs in.
I have the patch still lying around, so we can revisit this later.
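
For reference, the encoding I tried looked roughly like this (from
memory, so treat it as a sketch; GITS_CMD_INVALL as in the GICv3 spec,
with the ICID in bits [15:0] of the third doubleword):

    static int its_send_cmd_invall(struct host_its *its, uint16_t icid)
    {
        uint64_t cmd[4];

        cmd[0] = GITS_CMD_INVALL;
        cmd[1] = 0x00;
        cmd[2] = icid;
        cmd[3] = 0x00;

        return its_send_command(its, cmd);
    }

followed by a SYNC, instead of one INV per event.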

> Furthermore, what if the queue is full? AFAIU, you will return an error
> but it is not propagated. So Xen will think the device has been mapped
> even though that is not true.

In v3 the error is now checked in the caller and
gicv3_its_map_guest_device() cleans up and returns an error.
At the moment this is called from the MAPD emulation, so there is no
good way of communicating this back to the guest, but in the future this
could return the error to the hypercall so that Dom0 can react accordingly.

> I think we need to have a plan here, as this may well happen if a
> device has many MSIs and/or the queue is nearly full.
> 
>> +        if ( ret )
>> +            return ret;
>> +    }
>> +
>> +    ret = its_send_cmd_sync(its, 0);
>> +    if ( ret )
>> +        return ret;
>> +
>> +    return gicv3_its_wait_commands(its);
>> +}
>> +
>> +/*
>>   * Map a hardware device, identified by a certain host ITS and its
>> device ID
>>   * to domain d, a guest ITS (identified by its doorbell address) and
>> device ID.
>>   * Also provide the number of events (MSIs) needed for that device.
>> @@ -528,7 +593,7 @@ int gicv3_its_map_guest_device(struct domain *d,
>>      struct host_its *hw_its;
>>      struct its_devices *dev = NULL, *temp;
>>      struct rb_node **new = &d->arch.vgic.its_devices.rb_node, *parent
>> = NULL;
>> -    int ret = -ENOENT;
>> +    int ret = -ENOENT, i;
> 
> Again, i should be unsigned.

Fixed in v4.

>>
>>      hw_its = gicv3_its_find_by_doorbell(host_doorbell);
>>      if ( !hw_its )
>> @@ -574,6 +639,11 @@ int gicv3_its_map_guest_device(struct domain *d,
>>      if ( !dev )
>>          goto out_unlock;
>>
>> +    dev->host_lpis = xzalloc_array(uint32_t,
>> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
> 
> Rather than having DIV_ROUND_UP spread everywhere, would it not be easier
> to round up nr_events once and for all?

I'd rather keep nr_events as the actual number around.
I think we might look into actually limiting the allocation later.

> 
>> +    if ( !dev->host_lpis )
>> +        goto out_unlock;
>> +
>>      ret = its_send_cmd_mapd(hw_its, host_devid, fls(nr_events - 1) - 1,
>>                              virt_to_maddr(itt_addr), true);
>>      if ( ret )
>> @@ -591,10 +661,26 @@ int gicv3_its_map_guest_device(struct domain *d,
>>
>>      spin_unlock(&d->arch.vgic.its_devices_lock);
>>
>> +    /*
>> +     * Map all host LPIs within this device already. We can't afford
>> to queue
>> +     * any host ITS commands later on during the guest's runtime.
>> +     */
>> +    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
>> +    {
>> +        ret = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
>> +                                            i * LPI_BLOCK);
>> +        if ( ret < 0 )
>> +            goto out;
> 
> This sounds quite wrong to me. If you fail to allocate a block, you will
> end up with half of the device configured. However, you are freeing the
> memory but not reverting the ITS commands. This is going to end up very
> badly.

I need to check again whether this is really a problem here.

> Please be really careful with the error paths. Most of them are not
> correct, and I don't want to end up dealing with a security issue later on.
> 
>> +
>> +        dev->host_lpis[i] = ret;
> 
> Hmmm... as I said on the previous version, you should not use the return
> value for both an error code and an LPI. LPIs are encoded on 32 bits; sooner or
> later there will be a clash between the error and the LPI.

Fixed in v3.

> 
>> +    }
>> +
>>      return 0;
>>
>>  out_unlock:
>>      spin_unlock(&d->arch.vgic.its_devices_lock);
>> +
>> +out:
>>      if ( dev )
>>          xfree(dev->host_lpis);
>>      xfree(itt_addr);
>> diff --git a/xen/arch/arm/gic-v3-lpi.c b/xen/arch/arm/gic-v3-lpi.c
>> index 51d7425..59d3ba4 100644
>> --- a/xen/arch/arm/gic-v3-lpi.c
>> +++ b/xen/arch/arm/gic-v3-lpi.c
>> @@ -20,24 +20,44 @@
>>
>>  #include <xen/lib.h>
>>  #include <xen/mm.h>
>> +#include <xen/sched.h>
>> +#include <asm/atomic.h>
>> +#include <asm/domain.h>
>>  #include <asm/gic.h>
>>  #include <asm/gic_v3_defs.h>
>>  #include <asm/gic_v3_its.h>
>>  #include <asm/io.h>
>>  #include <asm/page.h>
>>
>> +/* LPIs on the host always go to a guest, so no struct irq_desc for
>> them. */
>> +union host_lpi {
>> +    uint64_t data;
>> +    struct {
>> +        uint32_t virt_lpi;
>> +        uint16_t dom_id;
>> +        uint16_t vcpu_id;
>> +    };
>> +};
>> +
>>  #define LPI_PROPTABLE_NEEDS_FLUSHING    (1U << 0)
>>  /* Global state */
>>  static struct {
>>      /* The global LPI property table, shared by all redistributors. */
>>      uint8_t *lpi_property;
>>      /*
>> +     * A two-level table to lookup LPIs firing on the host and look
>> up the
>> +     * domain and virtual LPI number to inject.
>> +     */
>> +    union host_lpi **host_lpis;
>> +    /*
>>       * Number of physical LPIs the host supports. This is a property of
>>       * the GIC hardware. We depart from the habit of naming these things
>>       * "physical" in Xen, as the GICv3/4 spec uses the term "physical
>> LPI"
>>       * in a different context to differentiate them from "virtual LPIs".
>>       */
>>      unsigned long int nr_host_lpis;
>> +    /* Protects allocation and deallocation of host LPIs, but not the
>> access */
>> +    spinlock_t host_lpis_lock;
>>      unsigned int flags;
>>  } lpi_data;
>>
>> @@ -50,6 +70,19 @@ struct lpi_redist_data {
>>  static DEFINE_PER_CPU(struct lpi_redist_data, lpi_redist);
>>
>>  #define MAX_PHYS_LPIS   (lpi_data.nr_host_lpis - LPI_OFFSET)
>> +#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
>> +
>> +static union host_lpi *gic_get_host_lpi(uint32_t plpi)
>> +{
>> +    if ( !is_lpi(plpi) || plpi >= MAX_PHYS_LPIS + LPI_OFFSET )
>> +        return NULL;
>> +
>> +    plpi -= LPI_OFFSET;
>> +    if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
>> +        return NULL;
>> +
>> +    return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi %
>> HOST_LPIS_PER_PAGE];
>> +}
>>
>>  /* Stores this redistributor's physical address and ID in a per-CPU
>> variable */
>>  void gicv3_set_redist_address(paddr_t address, unsigned int redist_id)
>> @@ -204,15 +237,170 @@ int gicv3_lpi_init_rdist(void __iomem *
>> rdist_base)
>>  static unsigned int max_lpi_bits = CONFIG_MAX_PHYS_LPI_BITS;
>>  integer_param("max_lpi_bits", max_lpi_bits);
>>
>> +/*
>> + * Allocate the 2nd level array for host LPIs. This one holds pointers
>> + * to the page with the actual "union host_lpi" entries. Our LPI limit
>> + * avoids excessive memory usage.
>> + */
>>  int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
>>  {
>> +    int nr_lpi_ptrs;
> 
> unsigned int please.

Fixed in v4.

>> +
>> +    /* We rely on the data structure being atomically accessible. */
>> +    BUILD_BUG_ON(sizeof(union host_lpi) > sizeof(unsigned long));
>> +
>>      lpi_data.nr_host_lpis = BIT_ULL(min(hw_lpi_bits, max_lpi_bits));
>>
>> +    spin_lock_init(&lpi_data.host_lpis_lock);
>> +
>> +    nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
>> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
>> +    if ( !lpi_data.host_lpis )
>> +        return -ENOMEM;
>> +
>>      printk("GICv3: using at most %ld LPIs on the host.\n",
>> MAX_PHYS_LPIS);
>>
>>      return 0;
>>  }
>>
>> +/* Must be called with host_lpis_lock held. */
> 
> Again, this is a call for adding an ASSERT in the function.

This comment is more like lock documentation, to give code writers
guidance on how the locking should be handled here.
I am not convinced that having ASSERTs in *static* functions is really
useful.

>> +static int find_unused_host_lpi(int start, uint32_t *index)
> 
> start should probably be unsigned.
> 
>> +{
>> +    int chunk;
> 
> Ditto.

Fixed in v3.

>> +    uint32_t i = *index;
>> +
>> +    for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE;
>> chunk++ )
>> +    {
>> +        /* If we hit an unallocated chunk, use entry 0 in that one. */
>> +        if ( !lpi_data.host_lpis[chunk] )
>> +        {
>> +            *index = 0;
>> +            return chunk;
>> +        }
>> +
>> +        /* Find an unallocated entry in this chunk. */
>> +        for ( ; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
>> +        {
>> +            if ( lpi_data.host_lpis[chunk][i].dom_id == INVALID_DOMID )
>> +            {
>> +                *index = i;
>> +                return chunk;
>> +            }
>> +        }
>> +        i = 0;
>> +    }
>> +
>> +    return -1;
>> +}
>> +
>> +/*
>> + * Allocate a block of 32 LPIs on the given host ITS for device "devid",
>> + * starting with "eventid". Put them into the respective ITT by
>> issuing a
>> + * MAPTI command for each of them.
>> + */
>> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain
>> *d,
>> +                                  uint32_t host_devid, uint32_t eventid)
> 
> Again, most of the parameters here are just for calling back into the ITS, and
> the caller should have them all in hand. So I would much prefer to avoid a call
> chain ITS -> LPI -> ITS and make the LPI code ITS-agnostic.
>
> Furthermore, the return value is used both for an error code and for the
> LPI. Given that an LPI is encoded on 32 bits, sooner or later there will
> be a clash between the error and the LPI. So it would be wise to
> dissociate the error code from the LPIs.
> 
> The prototype of this function would look like:
>
> int gicv3_allocate_host_lpi_block(struct domain *domain, uint32_t *first_lpi);

Fixed in v3.

>> +{
>> +    static uint32_t next_lpi = 0;
>> +    uint32_t lpi, lpi_idx = next_lpi % HOST_LPIS_PER_PAGE;
>> +    int chunk;
>> +    int i;
>> +
>> +    spin_lock(&lpi_data.host_lpis_lock);
>> +    chunk = find_unused_host_lpi(next_lpi / HOST_LPIS_PER_PAGE,
>> &lpi_idx);
>> +
>> +    if ( chunk == - 1 )          /* rescan for a hole from the
>> beginning */
>> +    {
>> +        lpi_idx = 0;
>> +        chunk = find_unused_host_lpi(0, &lpi_idx);
>> +        if ( chunk == -1 )
>> +        {
>> +            spin_unlock(&lpi_data.host_lpis_lock);
>> +            return -ENOSPC;
>> +        }
> 
>> The algo does not seem to have changed since the previous version.
>> Looking at it again, my understanding is that you will always try to allocate
>> forward. So if the current chunk is full, you will allocate the next one
>> rather than checking whether a previous chunk has space available. This
>> will result in allocating more memory than necessary.

In v4 I amended the code slightly to move next_lpi out of the
function. When we now free an LPI block, we check whether next_lpi
was pointing past this block and adjust it in that case.
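
In code the free path now does roughly this (a sketch, ignoring the
LPI_OFFSET adjustment; next_lpi lives at file scope in v4):

    /* Let the allocator reuse the freed block before growing further. */
    if ( next_lpi > lpi )
        next_lpi = lpi;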

>> Similarly, unused chunks could be freed to save memory.

No, we can't, because this breaks the lockless property. A user
(incoming LPI) would find the first-level pointer and go into that
block. So we can never replace this block pointer now. That smells
like a case for RCU, but I am not sure Xen can properly handle this case.
But with the above optimization (adjusting next_lpi on freeing a block)
I am pretty sure this isn't a problem for now, especially for Dom0.

>> To be clear, I am asking for confirmation, documentation, and listing this as
>> an improvement.
> 
>> +    }
>> +
>> +    /* If we hit an unallocated chunk, we initialize it and use entry
>> 0. */
>> +    if ( !lpi_data.host_lpis[chunk] )
>> +    {
>> +        union host_lpi *new_chunk;
>> +
>> +        new_chunk = alloc_xenheap_pages(0, 0);
> 
> As said on the previous version, please use alloc_xenheap_page as you
> only allocate one page.

Fixed in v3.

> Also, when NUMA support is added we may want to take into account
> the node associated with the device, saving us some time when reading the
> memory. You don't need to handle that now, but a TODO would be quite
> helpful.

TODO added in v3.

>> +        if ( !new_chunk )
>> +        {
>> +            spin_unlock(&lpi_data.host_lpis_lock);
>> +            return -ENOMEM;
>> +        }
>> +
>> +        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
>> +            new_chunk[i].dom_id = INVALID_DOMID;
>> +
>> +        lpi_data.host_lpis[chunk] = new_chunk;
>> +        lpi_idx = 0;
>> +    }
>> +
>> +    lpi = chunk * HOST_LPIS_PER_PAGE + lpi_idx;
>> +
>> +    for ( i = 0; i < LPI_BLOCK; i++ )
>> +    {
>> +        union host_lpi hlpi;
>> +
>> +        /*
>> +         * Mark this host LPI as belonging to the domain, but don't
>> assign
>> +         * any virtual LPI or a VCPU yet.
>> +         */
>> +        hlpi.virt_lpi = INVALID_LPI;
>> +        hlpi.dom_id = d->domain_id;
>> +        hlpi.vcpu_id = INVALID_DOMID;
> 
> Again, please don't mix vcpu and domain. If INVALID_VCPUID does not
> exist then it might be worth adding one.

Fixed in v4.

>> +        write_u64_atomic(&lpi_data.host_lpis[chunk][lpi_idx + i].data,
>> +                         hlpi.data);
>> +
>> +        /*
>> +         * Enable this host LPI, so we don't have to do this during the
>> +         * guest's runtime.
>> +         */
>> +        lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;
>> +    }
>> +
>> +    /*
>> +     * We have allocated and initialized the host LPI entries, so
>> it's safe
>> +     * to drop the lock now. Access to the structures can be done
>> concurrently
>> +     * as it involves only an atomic uint64_t access.
>> +     */
>> +    spin_unlock(&lpi_data.host_lpis_lock);
>> +
>> +    if ( lpi_data.flags & LPI_PROPTABLE_NEEDS_FLUSHING )
>> +        clean_and_invalidate_dcache_va_range(&lpi_data.lpi_property[lpi],
>> +                                             LPI_BLOCK);
>> +
>> +    gicv3_its_map_host_events(its, host_devid, eventid, lpi +
>> LPI_OFFSET,
>> +                              LPI_BLOCK);
> 
> See my comment at the top of the function.

Fixed in v3 (I believe).

>> +
>> +    next_lpi = lpi + LPI_BLOCK;
> 
> Newline here please.

Fixed in v3.

> 
>> +    return lpi + LPI_OFFSET;
>> +}
>> +
>> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
>> +{
>> +    union host_lpi *hlpi, empty_lpi = { .dom_id = INVALID_DOMID };
>> +    int i;
>> +
>> +    hlpi = gic_get_host_lpi(lpi);
>> +    if ( !hlpi )
>> +        return -ENOENT;
>> +
>> +    spin_lock(&lpi_data.host_lpis_lock);
>> +
>> +    for ( i = 0; i < LPI_BLOCK; i++ )
>> +        write_u64_atomic(&hlpi[i].data, empty_lpi.data);
>> +
>> +    /* TODO: Call a function in gic-v3-its.c to send DISCARDs */
> 
> It is quite worrying to see a TODO in code that is supposed to be
> merged really soon.
>
> I think this should be done by the caller and not here. You also need to
> disable the interrupts, which is not done here.

I removed the TODO in v3, because it was wrong. We unmap the device
first, so no individual DISCARDs are needed (as mentioned above).

>> +
>> +    spin_unlock(&lpi_data.host_lpis_lock);
>> +
>> +    return 0;
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/include/asm-arm/gic.h b/xen/include/asm-arm/gic.h
>> index 12bd155..7825575 100644
>> --- a/xen/include/asm-arm/gic.h
>> +++ b/xen/include/asm-arm/gic.h
>> @@ -220,7 +220,12 @@ enum gic_version {
>>      GIC_V3,
>>  };
>>
>> +#define INVALID_LPI     0
>>  #define LPI_OFFSET      8192
> 
> Again, newline here please.

Fixed in v3.

>> +static inline bool is_lpi(unsigned int irq)
>> +{
>> +    return irq >= LPI_OFFSET;
>> +}
> 
> Again, I think both INVALID_LPI and is_lpi should be moved to irq.h.

Fixed in v3.

>>
>>  extern enum gic_version gic_hw_version(void);
>>
>> diff --git a/xen/include/asm-arm/gic_v3_its.h
>> b/xen/include/asm-arm/gic_v3_its.h
>> index 3421ea0..2387972 100644
>> --- a/xen/include/asm-arm/gic_v3_its.h
>> +++ b/xen/include/asm-arm/gic_v3_its.h
>> @@ -106,6 +106,11 @@
>>  #define HOST_ITS_FLUSH_CMD_QUEUE        (1U << 0)
>>  #define HOST_ITS_USES_PTA               (1U << 1)
>>
>> +#define INVALID_DOMID ((uint16_t)~0)
> 
> Again, rather than defining your own invalid domid, it would be better
> to use the one defined in xen.h (see DOMID_INVALID).

Fixed in v3.

Cheers,
Andre.


>> +
>> +/* We allocate LPIs on the hosts in chunks of 32 to reduce handling
>> overhead. */
>> +#define LPI_BLOCK                       32
>> +
>>  /* data structure for each hardware ITS */
>>  struct host_its {
>>      struct list_head entry;
>> @@ -151,6 +156,12 @@ int gicv3_its_map_guest_device(struct domain *d,
>>                                 paddr_t host_doorbell, uint32_t
>> host_devid,
>>                                 paddr_t guest_doorbell, uint32_t
>> guest_devid,
>>                                 uint32_t nr_events, bool valid);
>> +int gicv3_its_map_host_events(struct host_its *its,
>> +                              uint32_t host_devid, uint32_t eventid,
>> +                              uint32_t lpi, uint32_t nrevents);
>> +int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain
>> *d,
>> +                                  uint32_t host_devid, uint32_t
>> eventid);
>> +int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi);
>>
>>  #else
>>
>>
> 
> Cheers,
>
Julien Grall April 3, 2017, 8:13 p.m. UTC | #8
On 04/03/2017 08:30 PM, Andre Przywara wrote:
> Hi,

Hi Andre,

> On 23/03/17 19:08, Julien Grall wrote:
>>>  /*
>>> + * On the host ITS @its, map @nr_events consecutive LPIs.
>>> + * The mapping connects a device @devid and event @eventid pair to
>>> LPI @lpi,
>>> + * increasing both @eventid and @lpi to cover the number of requested
>>> LPIs.
>>> + */
>>> +int gicv3_its_map_host_events(struct host_its *its,
>>> +                              uint32_t devid, uint32_t eventid,
>>> uint32_t lpi,
>>> +                              uint32_t nr_events)
>>> +{
>>> +    uint32_t i;
>>> +    int ret;
>>> +
>>> +    for ( i = 0; i < nr_events; i++ )
>>> +    {
>>> +        /* For now we map every host LPI to host CPU 0 */
>>> +        ret = its_send_cmd_mapti(its, devid, eventid + i, lpi + i, 0);
>>> +        if ( ret )
>>> +            return ret;
>>> +        ret = its_send_cmd_inv(its, devid, eventid + i);
>>
>>
>> So the spec allows up to 32K events per device. As all the LPIs will be
>> routed to CPU0 (i.e. collection 0), it would be more efficient to do an
>> INVALL. I would be happy to defer that to post Xen 4.9, but the rest
>> needs to be fixed.
>
> I tried INVALL and it didn't work, at least on the model. I can't see
> why, so I kept the individual INVs in.
> I have the patch still lying around, so we can revisit this later.

Can you add a TODO comment please?

>>>
>>>      hw_its = gicv3_its_find_by_doorbell(host_doorbell);
>>>      if ( !hw_its )
>>> @@ -574,6 +639,11 @@ int gicv3_its_map_guest_device(struct domain *d,
>>>      if ( !dev )
>>>          goto out_unlock;
>>>
>>> +    dev->host_lpis = xzalloc_array(uint32_t,
>>> +                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
>>
>> Rather than having DIV_ROUND_UP spread everywhere, would it not be easier
>> to round up nr_events once and for all?
>
> I'd rather keep nr_events as the actual number around.
> I think we might look into actually limiting the allocation later.

Why? This number will likely be a power of two anyway because of how the ITS 
works. You would also have to keep several different values around, which 
will make the code more complicated to read...


>>> +/* Must be called with host_lpis_lock held. */
>>
>> Again, this is a call for adding an ASSERT in the function.
>
> This comment is more like lock documentation, to give code writers
> guidance on how the locking should be handled here.
> I am not convinced that having ASSERTs in *static* functions is really
> useful.

Well, you seem to assume that you will be the only one to modify this 
code, and it is very easy to skip reading a comment by mistake.

So the ASSERT will catch such an error. Give me a reason why the ASSERT is 
a bad idea and I will re-think my position.

[...]

>> The algo does not seem to have changed since the previous version.
>> Looking at it again, my understanding is that you will always try to allocate
>> forward. So if the current chunk is full, you will allocate the next one
>> rather than checking whether a previous chunk has space available. This
>> will result in allocating more memory than necessary.
>
> In v4 I amended the code slightly to move next_lpi out of the
> function. When we now free an LPI block, we check whether next_lpi
> was pointing past this block and adjust it in that case.
>
>> Similarly, unused chunks could be freed to save memory.
>
> No, we can't, because this breaks the lockless property. A user
> (incoming LPI) would find the first-level pointer and go into that
> block. So we can never replace this block pointer now. That smells
> like a case for RCU, but I am not sure Xen can properly handle this case.
> But with the above optimization (adjusting next_lpi on freeing a block)
> I am pretty sure this isn't a problem for now, especially for Dom0.

Then document it, because otherwise this is bound to be forgotten and never 
revisited in the future.
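
Something like this next to the host_lpis field would be enough:

    /*
     * Second-level pages are only ever added, never freed or replaced:
     * readers access them locklessly, so a page, once published here,
     * must stay valid forever. Reclaiming one would require an RCU-like
     * scheme.
     */
    union host_lpi **host_lpis;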

Cheers,

Patch

diff --git a/xen/arch/arm/gic-v3-its.c b/xen/arch/arm/gic-v3-its.c
index 60b15b5..ed14d95 100644
--- a/xen/arch/arm/gic-v3-its.c
+++ b/xen/arch/arm/gic-v3-its.c
@@ -148,6 +148,20 @@  static int its_send_cmd_sync(struct host_its *its, unsigned int cpu)
     return its_send_command(its, cmd);
 }
 
+static int its_send_cmd_mapti(struct host_its *its,
+                              uint32_t deviceid, uint32_t eventid,
+                              uint32_t pintid, uint16_t icid)
+{
+    uint64_t cmd[4];
+
+    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
+    cmd[1] = eventid | ((uint64_t)pintid << 32);
+    cmd[2] = icid;
+    cmd[3] = 0x00;
+
+    return its_send_command(its, cmd);
+}
+
 static int its_send_cmd_mapc(struct host_its *its, uint32_t collection_id,
                              unsigned int cpu)
 {
@@ -180,6 +194,19 @@  static int its_send_cmd_mapd(struct host_its *its, uint32_t deviceid,
     return its_send_command(its, cmd);
 }
 
+static int its_send_cmd_inv(struct host_its *its,
+                            uint32_t deviceid, uint32_t eventid)
+{
+    uint64_t cmd[4];
+
+    cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
+    cmd[1] = eventid;
+    cmd[2] = 0x00;
+    cmd[3] = 0x00;
+
+    return its_send_command(its, cmd);
+}
+
 /* Set up the (1:1) collection mapping for the given host CPU. */
 int gicv3_its_setup_collection(unsigned int cpu)
 {
@@ -462,7 +489,7 @@  int gicv3_its_init(void)
 
 static int remove_mapped_guest_device(struct its_devices *dev)
 {
-    int ret;
+    int ret, i;
 
     if ( dev->hw_its )
     {
@@ -471,11 +498,19 @@  static int remove_mapped_guest_device(struct its_devices *dev)
             return ret;
     }
 
+    /*
+     * The only error the function below would return is -ENOENT, in which
+     * case there is nothing to free here. So we just ignore it.
+     */
+    for ( i = 0; i < DIV_ROUND_UP(dev->eventids, LPI_BLOCK); i++ )
+        gicv3_free_host_lpi_block(dev->hw_its, dev->host_lpis[i]);
+
     ret = gicv3_its_wait_commands(dev->hw_its);
     if ( ret )
         return ret;
 
     xfree(dev->itt_addr);
+    xfree(dev->host_lpis);
     xfree(dev);
 
     return 0;
@@ -513,6 +548,36 @@  static int compare_its_guest_devices(struct its_devices *dev,
 }
 
 /*
+ * On the host ITS @its, map @nr_events consecutive LPIs.
+ * The mapping connects a device @devid and event @eventid pair to LPI @lpi,
+ * increasing both @eventid and @lpi to cover the number of requested LPIs.
+ */
+int gicv3_its_map_host_events(struct host_its *its,
+                              uint32_t devid, uint32_t eventid, uint32_t lpi,
+                              uint32_t nr_events)
+{
+    uint32_t i;
+    int ret;
+
+    for ( i = 0; i < nr_events; i++ )
+    {
+        /* For now we map every host LPI to host CPU 0 */
+        ret = its_send_cmd_mapti(its, devid, eventid + i, lpi + i, 0);
+        if ( ret )
+            return ret;
+        ret = its_send_cmd_inv(its, devid, eventid + i);
+        if ( ret )
+            return ret;
+    }
+
+    ret = its_send_cmd_sync(its, 0);
+    if ( ret )
+        return ret;
+
+    return gicv3_its_wait_commands(its);
+}
+
+/*
  * Map a hardware device, identified by a certain host ITS and its device ID
  * to domain d, a guest ITS (identified by its doorbell address) and device ID.
  * Also provide the number of events (MSIs) needed for that device.
@@ -528,7 +593,7 @@  int gicv3_its_map_guest_device(struct domain *d,
     struct host_its *hw_its;
     struct its_devices *dev = NULL, *temp;
     struct rb_node **new = &d->arch.vgic.its_devices.rb_node, *parent = NULL;
-    int ret = -ENOENT;
+    int ret = -ENOENT, i;
 
     hw_its = gicv3_its_find_by_doorbell(host_doorbell);
     if ( !hw_its )
@@ -574,6 +639,11 @@  int gicv3_its_map_guest_device(struct domain *d,
     if ( !dev )
         goto out_unlock;
 
+    dev->host_lpis = xzalloc_array(uint32_t,
+                                   DIV_ROUND_UP(nr_events, LPI_BLOCK));
+    if ( !dev->host_lpis )
+        goto out_unlock;
+
     ret = its_send_cmd_mapd(hw_its, host_devid, fls(nr_events - 1) - 1,
                             virt_to_maddr(itt_addr), true);
     if ( ret )
@@ -591,10 +661,26 @@  int gicv3_its_map_guest_device(struct domain *d,
 
     spin_unlock(&d->arch.vgic.its_devices_lock);
 
+    /*
+     * Map all host LPIs within this device already. We can't afford to queue
+     * any host ITS commands later on during the guest's runtime.
+     */
+    for ( i = 0; i < DIV_ROUND_UP(nr_events, LPI_BLOCK); i++ )
+    {
+        ret = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
+                                            i * LPI_BLOCK);
+        if ( ret < 0 )
+            goto out;
+
+        dev->host_lpis[i] = ret;
+    }
+
     return 0;
 
 out_unlock:
     spin_unlock(&d->arch.vgic.its_devices_lock);
+
+out:
     if ( dev )
         xfree(dev->host_lpis);
     xfree(itt_addr);
diff --git a/xen/arch/arm/gic-v3-lpi.c b/xen/arch/arm/gic-v3-lpi.c
index 51d7425..59d3ba4 100644
--- a/xen/arch/arm/gic-v3-lpi.c
+++ b/xen/arch/arm/gic-v3-lpi.c
@@ -20,24 +20,44 @@ 
 
 #include <xen/lib.h>
 #include <xen/mm.h>
+#include <xen/sched.h>
+#include <asm/atomic.h>
+#include <asm/domain.h>
 #include <asm/gic.h>
 #include <asm/gic_v3_defs.h>
 #include <asm/gic_v3_its.h>
 #include <asm/io.h>
 #include <asm/page.h>
 
+/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
+union host_lpi {
+    uint64_t data;
+    struct {
+        uint32_t virt_lpi;
+        uint16_t dom_id;
+        uint16_t vcpu_id;
+    };
+};
+
 #define LPI_PROPTABLE_NEEDS_FLUSHING    (1U << 0)
 /* Global state */
 static struct {
     /* The global LPI property table, shared by all redistributors. */
     uint8_t *lpi_property;
     /*
+     * A two-level table to lookup LPIs firing on the host and look up the
+     * domain and virtual LPI number to inject.
+     */
+    union host_lpi **host_lpis;
+    /*
      * Number of physical LPIs the host supports. This is a property of
      * the GIC hardware. We depart from the habit of naming these things
      * "physical" in Xen, as the GICv3/4 spec uses the term "physical LPI"
      * in a different context to differentiate them from "virtual LPIs".
      */
     unsigned long int nr_host_lpis;
+    /* Protects allocation and deallocation of host LPIs, but not the access */
+    spinlock_t host_lpis_lock;
     unsigned int flags;
 } lpi_data;
 
@@ -50,6 +70,19 @@  struct lpi_redist_data {
 static DEFINE_PER_CPU(struct lpi_redist_data, lpi_redist);
 
 #define MAX_PHYS_LPIS   (lpi_data.nr_host_lpis - LPI_OFFSET)
+#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
+
+static union host_lpi *gic_get_host_lpi(uint32_t plpi)
+{
+    if ( !is_lpi(plpi) || plpi >= MAX_PHYS_LPIS + LPI_OFFSET )
+        return NULL;
+
+    plpi -= LPI_OFFSET;
+    if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
+        return NULL;
+
+    return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];
+}
 
 /* Stores this redistributor's physical address and ID in a per-CPU variable */
 void gicv3_set_redist_address(paddr_t address, unsigned int redist_id)
@@ -204,15 +237,170 @@  int gicv3_lpi_init_rdist(void __iomem * rdist_base)
 static unsigned int max_lpi_bits = CONFIG_MAX_PHYS_LPI_BITS;
 integer_param("max_lpi_bits", max_lpi_bits);
 
+/*
+ * Allocate the 2nd level array for host LPIs. This one holds pointers
+ * to the page with the actual "union host_lpi" entries. Our LPI limit
+ * avoids excessive memory usage.
+ */
 int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
 {
+    int nr_lpi_ptrs;
+
+    /* We rely on the data structure being atomically accessible. */
+    BUILD_BUG_ON(sizeof(union host_lpi) > sizeof(unsigned long));
+
     lpi_data.nr_host_lpis = BIT_ULL(min(hw_lpi_bits, max_lpi_bits));
 
+    spin_lock_init(&lpi_data.host_lpis_lock);
+
+    nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
+    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
+    if ( !lpi_data.host_lpis )
+        return -ENOMEM;
+
     printk("GICv3: using at most %ld LPIs on the host.\n", MAX_PHYS_LPIS);
 
     return 0;
 }
 
+/* Must be called with host_lpis_lock held. */
+static int find_unused_host_lpi(int start, uint32_t *index)
+{
+    int chunk;
+    uint32_t i = *index;
+
+    for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
+    {
+        /* If we hit an unallocated chunk, use entry 0 in that one. */
+        if ( !lpi_data.host_lpis[chunk] )
+        {
+            *index = 0;
+            return chunk;
+        }
+
+        /* Find an unallocated entry in this chunk. */
+        for ( ; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
+        {
+            if ( lpi_data.host_lpis[chunk][i].dom_id == INVALID_DOMID )
+            {
+                *index = i;
+                return chunk;
+            }
+        }
+        i = 0;
+    }
+
+    return -1;
+}
+
+/*
+ * Allocate a block of 32 LPIs on the given host ITS for device "devid",
+ * starting with "eventid". Put them into the respective ITT by issuing a
+ * MAPTI command for each of them.
+ */
+int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
+                                  uint32_t host_devid, uint32_t eventid)
+{
+    static uint32_t next_lpi = 0;
+    uint32_t lpi, lpi_idx = next_lpi % HOST_LPIS_PER_PAGE;
+    int chunk;
+    int i;
+
+    spin_lock(&lpi_data.host_lpis_lock);
+    chunk = find_unused_host_lpi(next_lpi / HOST_LPIS_PER_PAGE, &lpi_idx);
+
+    if ( chunk == - 1 )          /* rescan for a hole from the beginning */
+    {
+        lpi_idx = 0;
+        chunk = find_unused_host_lpi(0, &lpi_idx);
+        if ( chunk == -1 )
+        {
+            spin_unlock(&lpi_data.host_lpis_lock);
+            return -ENOSPC;
+        }
+    }
+
+    /* If we hit an unallocated chunk, we initialize it and use entry 0. */
+    if ( !lpi_data.host_lpis[chunk] )
+    {
+        union host_lpi *new_chunk;
+
+        new_chunk = alloc_xenheap_pages(0, 0);
+        if ( !new_chunk )
+        {
+            spin_unlock(&lpi_data.host_lpis_lock);
+            return -ENOMEM;
+        }
+
+        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
+            new_chunk[i].dom_id = INVALID_DOMID;
+
+        lpi_data.host_lpis[chunk] = new_chunk;
+        lpi_idx = 0;
+    }
+
+    lpi = chunk * HOST_LPIS_PER_PAGE + lpi_idx;
+
+    for ( i = 0; i < LPI_BLOCK; i++ )
+    {
+        union host_lpi hlpi;
+
+        /*
+         * Mark this host LPI as belonging to the domain, but don't assign
+         * any virtual LPI or a VCPU yet.
+         */
+        hlpi.virt_lpi = INVALID_LPI;
+        hlpi.dom_id = d->domain_id;
+        hlpi.vcpu_id = INVALID_DOMID;
+        write_u64_atomic(&lpi_data.host_lpis[chunk][lpi_idx + i].data,
+                         hlpi.data);
+
+        /*
+         * Enable this host LPI, so we don't have to do this during the
+         * guest's runtime.
+         */
+        lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;
+    }
+
+    /*
+     * We have allocated and initialized the host LPI entries, so it's safe
+     * to drop the lock now. Access to the structures can be done concurrently
+     * as it involves only an atomic uint64_t access.
+     */
+    spin_unlock(&lpi_data.host_lpis_lock);
+
+    if ( lpi_data.flags & LPI_PROPTABLE_NEEDS_FLUSHING )
+        clean_and_invalidate_dcache_va_range(&lpi_data.lpi_property[lpi],
+                                             LPI_BLOCK);
+
+    gicv3_its_map_host_events(its, host_devid, eventid, lpi + LPI_OFFSET,
+                              LPI_BLOCK);
+
+    next_lpi = lpi + LPI_BLOCK;
+    return lpi + LPI_OFFSET;
+}
+
+int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
+{
+    union host_lpi *hlpi, empty_lpi = { .dom_id = INVALID_DOMID };
+    int i;
+
+    hlpi = gic_get_host_lpi(lpi);
+    if ( !hlpi )
+        return -ENOENT;
+
+    spin_lock(&lpi_data.host_lpis_lock);
+
+    for ( i = 0; i < LPI_BLOCK; i++ )
+        write_u64_atomic(&hlpi[i].data, empty_lpi.data);
+
+    /* TODO: Call a function in gic-v3-its.c to send DISCARDs */
+
+    spin_unlock(&lpi_data.host_lpis_lock);
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-arm/gic.h b/xen/include/asm-arm/gic.h
index 12bd155..7825575 100644
--- a/xen/include/asm-arm/gic.h
+++ b/xen/include/asm-arm/gic.h
@@ -220,7 +220,12 @@  enum gic_version {
     GIC_V3,
 };
 
+#define INVALID_LPI     0
 #define LPI_OFFSET      8192
+static inline bool is_lpi(unsigned int irq)
+{
+    return irq >= LPI_OFFSET;
+}
 
 extern enum gic_version gic_hw_version(void);
 
diff --git a/xen/include/asm-arm/gic_v3_its.h b/xen/include/asm-arm/gic_v3_its.h
index 3421ea0..2387972 100644
--- a/xen/include/asm-arm/gic_v3_its.h
+++ b/xen/include/asm-arm/gic_v3_its.h
@@ -106,6 +106,11 @@ 
 #define HOST_ITS_FLUSH_CMD_QUEUE        (1U << 0)
 #define HOST_ITS_USES_PTA               (1U << 1)
 
+#define INVALID_DOMID ((uint16_t)~0)
+
+/* We allocate LPIs on the hosts in chunks of 32 to reduce handling overhead. */
+#define LPI_BLOCK                       32
+
 /* data structure for each hardware ITS */
 struct host_its {
     struct list_head entry;
@@ -151,6 +156,12 @@  int gicv3_its_map_guest_device(struct domain *d,
                                paddr_t host_doorbell, uint32_t host_devid,
                                paddr_t guest_doorbell, uint32_t guest_devid,
                                uint32_t nr_events, bool valid);
+int gicv3_its_map_host_events(struct host_its *its,
+                              uint32_t host_devid, uint32_t eventid,
+                              uint32_t lpi, uint32_t nrevents);
+int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
+                                  uint32_t host_devid, uint32_t eventid);
+int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi);
 
 #else