diff mbox

[v6,08/11] vpci/bars: add handlers to map the BARs

Message ID 20170919152936.14498-9-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monne Sept. 19, 2017, 3:29 p.m. UTC
Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to snoop BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 2 (response to
memory space accesses), and maps/unmaps the BARs of the device into
the guest p2m. A rangeset is used in order to figure out which memory
to map/unmap. This makes it easier to keep track of the possible
overlaps with other BARs, and will also simplify MSI-X support, where
certain regions of a BAR might be used for the MSI-X table or PBA.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.

Note that the long running BAR mapping and unmapping operations are
deferred to be performed by hvm_io_pending, so that they can be safely
preempted.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v5:
 - Switch to the new handler type.
 - Use pci_sbdf_t to size the BARs.
 - Use a single return for vpci_modify_bar.
 - Do not return an error code from vpci_modify_bars, just log the
   failure.
 - Remove the 'sizing' parameter. Instead just let the guest write
   directly to the BAR, and read the value back. This simplifies the
   BAR register handlers, specially the read one.
 - Ignore ROM BAR writes with memory decoding enabled and ROM enabled.
 - Do not propagate failures to setup the ROM BAR in vpci_init_bars.
 - Add preemption support to the BAR mapping/unmapping operations.

Changes since v4:
 - Expand commit message to mention the reason behind the usage of
   rangesets.
 - Fix comment related to the inclusiveness of rangesets.
 - Fix off-by-one error in the calculation of the end of memory
   regions.
 - Store the state of the BAR (mapped/unmapped) in the vpci_bar
   enabled field, previously was only used by ROMs.
 - Fix double negation of return code.
 - Modify vpci_cmd_write so it has a single call to pci_conf_write16.
 - Print a warning when trying to write to the BAR with memory
   decoding enabled (and ignore the write).
 - Remove header_type local variable, it's used only once.
 - Move the read of the command register.
 - Restore previous command register value in the exit paths.
 - Only set address to INVALID_PADDR if the initial BAR value matches
    ~0 & PCI_BASE_ADDRESS_MEM_MASK.
 - Don't disable the enabled bit in the expansion ROM register, memory
   decoding is already disabled and takes precedence.
 - Don't use INVALID_PADDR, just set the initial BAR address to the
   value found in the hardware.
 - Introduce rom_enabled to store the status of the
   PCI_ROM_ADDRESS_ENABLE bit.
 - Reorder fields of the structure to prevent holes.

Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the paramerters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/arch/x86/hvm/ioreq.c  |   4 +
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 478 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h   |   8 +
 xen/include/xen/vpci.h    |  41 ++++
 5 files changed, 532 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

Comments

Jan Beulich Oct. 4, 2017, 8:33 a.m. UTC | #1
>>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
> @@ -48,6 +49,9 @@ bool_t hvm_io_pending(struct vcpu *v)
>      struct domain *d = v->domain;
>      struct hvm_ioreq_server *s;
>  
> +    if ( has_vpci(v->domain) && vpci_check_pending(v) )

has_vpci(d)

> +      return 1;

Indentation.

> +static int vpci_map_range(unsigned long s, unsigned long e, void *data,
> +                          unsigned long *c)
> +{
> +    const struct map_data *map = data;
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        unsigned long size = e - s + 1;
> +
> +        rc = (map->map ? map_mmio_regions : unmap_mmio_regions)
> +             (map->d, _gfn(s), size, _mfn(s));
> +        if ( rc == 0 )
> +        {
> +            *c += size;
> +            break;
> +        }
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_G_WARNING
> +                   "Failed to identity %smap [%" PRI_gfn ", %" PRI_gfn ") for d%d: %d\n",
> +                   map ? "" : "un", s, e, map->d->domain_id, rc);
> +            break;
> +        }

ASSERT(rc < size) ?

> +bool vpci_check_pending(struct vcpu *v)

"check" in the function name generally suggests (to me at least) that
the parameter ought to be const. Perhaps vpci_process_pending()?

> +{
> +    if ( v->vpci.mem )
> +    {
> +        int rc = vpci_map_memory(v->domain, v->vpci.mem, v->vpci.map);
> +
> +        if ( rc == -ERESTART )
> +            return true;

There's no real need for the local variable if all other return values
are simply discarded here. However, ...

> +        rangeset_destroy(v->vpci.mem);
> +        v->vpci.mem = NULL;
> +    }
> +
> +    return false;
> +}

... I'm not convinced this is a good error handling model. I don't
recall how previous versions dealt with this, but iirc we agreed to
generally make all such Dom0 handling best effort (here: don't skip the
remaining ranges if mapping of one failed). An exception may want/need
to be -ENOMEM.

> +static int vpci_check_bar_overlap(const struct pci_dev *pdev,
> +                                  const struct vpci_bar *rom,
> +                                  struct rangeset *mem)
> +{
> +    const struct pci_dev *cmp;
> +
> +    /* Check for overlaps with other device's BARs. */
> +    list_for_each_entry(cmp, &pdev->domain->arch.pdev_list, domain_list)
> +    {
> +        unsigned int i;
> +
> +        if ( rom == NULL && pdev == cmp )
> +            continue;

This check looks rather unmotivated (or even bogus) without a comment.
The other special casing of ROM BARs further down also isn't all that
obvious (and right now I can't even convince myself it's correct).

> +static void vpci_modify_bars(const struct pci_dev *pdev, bool map)
> +{
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> +    unsigned int i;
> +    int rc;
> +
> +    if ( !mem )
> +        return;
> +
> +    /*
> +     * Create a rangeset that represents the current device BARs memory region
> +     * and compare it against all the currently active BAR memory regions. If
> +     * an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
> +     *
> +     * NB: the rangeset uses inclusive frame numbers.
> +     */
> +
> +    /* First fill the rangeset with all the BARs of this device. */
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        const struct vpci_bar *bar = &header->bars[i];
> +
> +        if ( !MAPPABLE_BAR(bar) ||
> +             (bar->type == VPCI_BAR_ROM && !bar->rom_enabled) )
> +            continue;
> +
> +        rc = rangeset_add_range(mem, PFN_DOWN(bar->addr),
> +                                PFN_DOWN(bar->addr + bar->size - 1));
> +        if ( rc )
> +        {
> +            rangeset_destroy(mem);
> +            return;

I'm afraid -ENOMEM here (which sadly is possible, as we don't maintain
any reserves) would produce a very hard to diagnose misbehavior. I think
you want to log a message here.

> +        }
> +    }
> +
> +    /* Check for overlaps with other device's BARs. */
> +    rc = vpci_check_bar_overlap(pdev, NULL, mem);

Why is this not symmetrical with vpci_modify_rom() (which also checks
overlaps inside the current device)?

> +    if ( rc )
> +    {
> +        rangeset_destroy(mem);
> +        return;

Same error handling comment as above, despite failure here being less
likely (hopefully at least). Perhaps worth joining the two paths.

> +    }
> +
> +    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
> +    if ( !rc )
> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +            if ( header->bars[i].type != VPCI_BAR_ROM ||
> +                 header->bars[i].rom_enabled )
> +            header->bars[i].enabled = map;

Hmm, you're updating state here regardless of possible failure in the
deferred operation (see the discarded error code in
vpci_check_pending()).

> +static uint32_t vpci_cmd_read(const struct pci_dev *pdev, unsigned int reg,
> +                              void *data)
> +{
> +    return pci_conf_read16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                           PCI_FUNC(pdev->devfn), reg);
> +}

Wouldn't it be worthwhile having generic read functions dealing with
simple cases like this (and the BAR) one?

> +static void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t cmd, void *data)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> +
> +    /*
> +     * Let the guest play with all the bits directly except for the
> +     * memory decoding one.
> +     */

Please could you make clear it's only Dom0 we apply this lax model to?
Perhaps simply s/the guest/Dom0/.

> +static void vpci_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    bool hi = false;
> +
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +    {
> +         gprintk(XENLOG_WARNING,
> +                 "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
> +                 seg, bus, slot, func);

Indentation. Also any chance to log which BAR it was?

> +static void vpci_rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +
> +    if ( (pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &

Please use cmd here.

> +          PCI_COMMAND_MEMORY) && rom->rom_enabled )
> +    {
> +         gprintk(XENLOG_WARNING,
> +                 "%04x:%02x:%02x.%u: ignored ROM BAR write with memory decoding enabled\n",
> +                 seg, bus, slot, func);

Indentation again.

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    pci_sbdf_t sbdf = {
> +        .seg = seg,
> +        .bus = bus,
> +        .dev = slot,
> +        .func = func,
> +    };
> +    int rc;
> +
> +    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                                   &bars[i]);
> +            if ( rc )
> +            {
> +                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +                return rc;
> +            }
> +
> +            continue;
> +        }

You don't need val up to here - please defer the read.

> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;
> +
> +        /* Size the BAR and map it. */

Isn't the map part of this comment stale now? And without it,
considering the function name called, it is perhaps no longer worth
having it.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -20,6 +20,9 @@
>  #include <xen/smp.h>
>  #include <xen/perfc.h>
>  #include <asm/atomic.h>
> +#ifdef CONFIG_HAS_PCI
> +#include <xen/vpci.h>
> +#endif

Perhaps the conditional would better live in that header.

> @@ -264,6 +267,11 @@ struct vcpu
>  
>      struct evtchn_fifo_vcpu *evtchn_fifo;
>  
> +#ifdef CONFIG_HAS_PCI
> +    /* vPCI per-vCPU area, used to store data for long running operations. */
> +    struct vpci_vcpu vpci;
> +#endif

And perhaps the header would better provide an empty structure for the
"else" case. Another option would be to include the fields ...

>      struct arch_vcpu arch;

... in this structure.

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data);
>  
> +/*
> + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
> + * should not run.
> + */
> +bool vpci_check_pending(struct vcpu *v);
> +
>  struct vpci {
>      /* List of vPCI handlers for a device. */
>      struct list_head handlers;
>      spinlock_t lock;
> +
> +#ifdef __XEN__
> +    /* Hide the rest of the vpci struct from the user-space test harness. */
> +    struct vpci_header {
> +        /* Information about the PCI BARs of this device. */
> +        struct vpci_bar {
> +            paddr_t addr;
> +            uint64_t size;
> +            enum {
> +                VPCI_BAR_EMPTY,
> +                VPCI_BAR_IO,
> +                VPCI_BAR_MEM32,
> +                VPCI_BAR_MEM64_LO,
> +                VPCI_BAR_MEM64_HI,
> +                VPCI_BAR_ROM,
> +            } type;
> +            bool prefetchable;
> +            /* Store whether the BAR is mapped into guest p2m. */
> +            bool enabled;
> +            /*
> +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
> +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
> +             */
> +            bool rom_enabled;

Especially with the error handling issue in mind that I've mentioned
earlier, I wonder whether this field shouldn't be dropped, along the
lines of you also no longer caching the memory decode enable bit in the
command register.

Jan
Roger Pau Monne Oct. 5, 2017, 9:20 a.m. UTC | #2
On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
> > +static int vpci_check_bar_overlap(const struct pci_dev *pdev,
> > +                                  const struct vpci_bar *rom,
> > +                                  struct rangeset *mem)
> > +{
> > +    const struct pci_dev *cmp;
> > +
> > +    /* Check for overlaps with other device's BARs. */
> > +    list_for_each_entry(cmp, &pdev->domain->arch.pdev_list, domain_list)
> > +    {
> > +        unsigned int i;
> > +
> > +        if ( rom == NULL && pdev == cmp )
> > +            continue;
> 
> This check looks rather unmotivated (or even bogus) without a comment.
> The other special casing of ROM BARs further down also isn't all that
> obvious (and right now I can't even convince myself it's correct).

I've added the following comment before this check, which I think
explains the logic for this check, and the one below:

Since ROM BARs can be enabled independently of the memory decoding
bit we need to check for overlapping in slightly different
ways depending on the case.

If !rom it means the memory decoding bit has been toggled, and all
BARs belonging to the device will be {un}mapped, hence the rangeset
will contain the mappings for the whole device. In this case there's
no need to check for overlaps with BARs that belong to the same
device because the rangeset is able to deal with overlapping areas.

OTOH, if rom is set if means a single ROM BAR is being {un}mapped,
and hence the check for overlaps should be performed against all
the possible BARs, even the ones that belong to the device being
modified.

> > +        }
> > +    }
> > +
> > +    /* Check for overlaps with other device's BARs. */
> > +    rc = vpci_check_bar_overlap(pdev, NULL, mem);
> 
> Why is this not symmetrical with vpci_modify_rom() (which also checks
> overlaps inside the current device)?

I think the comment above should answer the question here, the
difference is because in this case Xen is mapping a whole device, so
vpci_check_bar_overlap should not check for overlap with BARs that
belong to the same device. OTOH, when mapping a ROM BAR Xen should
check for such overlap, because the regular BARs will already be
mapped.

> > +    }
> > +
> > +    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
> > +    if ( !rc )
> > +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> > +            if ( header->bars[i].type != VPCI_BAR_ROM ||
> > +                 header->bars[i].rom_enabled )
> > +            header->bars[i].enabled = map;
> 
> Hmm, you're updating state here regardless of possible failure in the
> deferred operation (see the discarded error code in
> vpci_check_pending()).

Yes, I've fixed the code above to try to map/unmap as much as
possible, even when a failure happens.

I agree that enabling/disabling here with the operation being deferred
is not ideal, but I also think we would end up doing the same
regardless of the outcome of the deferred operation. If some
mapping/unmapping of BARs failed, the memory decoding should be
enabled anyway. I can add a comment along this lines if you think
that's OK.

> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -20,6 +20,9 @@
> >  #include <xen/smp.h>
> >  #include <xen/perfc.h>
> >  #include <asm/atomic.h>
> > +#ifdef CONFIG_HAS_PCI
> > +#include <xen/vpci.h>
> > +#endif
> 
> Perhaps the conditional would better live in that header.
> 
> > @@ -264,6 +267,11 @@ struct vcpu
> >  
> >      struct evtchn_fifo_vcpu *evtchn_fifo;
> >  
> > +#ifdef CONFIG_HAS_PCI
> > +    /* vPCI per-vCPU area, used to store data for long running operations. */
> > +    struct vpci_vcpu vpci;
> > +#endif
> 
> And perhaps the header would better provide an empty structure for the
> "else" case. Another option would be to include the fields ...
> 
> >      struct arch_vcpu arch;
> 
> ... in this structure.

I've thought about placing the vpci data inside of arch vpcu, but it
felt a little bit weird because the vpci code should be arch-agnostic,
so placing some of it's data inside of an arch specific structure
seemed wrong. I will do as you suggest and provide an empty
vpci_vpcu structure in the !CONFIG_HAS_PCI case, together with hiding
the CONFIG_HAS_PCI macros inside of the header itself.

> > --- a/xen/include/xen/vpci.h
> > +++ b/xen/include/xen/vpci.h
> > @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
> >  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >                  uint32_t data);
> >  
> > +/*
> > + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
> > + * should not run.
> > + */
> > +bool vpci_check_pending(struct vcpu *v);
> > +
> >  struct vpci {
> >      /* List of vPCI handlers for a device. */
> >      struct list_head handlers;
> >      spinlock_t lock;
> > +
> > +#ifdef __XEN__
> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
> > +    struct vpci_header {
> > +        /* Information about the PCI BARs of this device. */
> > +        struct vpci_bar {
> > +            paddr_t addr;
> > +            uint64_t size;
> > +            enum {
> > +                VPCI_BAR_EMPTY,
> > +                VPCI_BAR_IO,
> > +                VPCI_BAR_MEM32,
> > +                VPCI_BAR_MEM64_LO,
> > +                VPCI_BAR_MEM64_HI,
> > +                VPCI_BAR_ROM,
> > +            } type;
> > +            bool prefetchable;
> > +            /* Store whether the BAR is mapped into guest p2m. */
> > +            bool enabled;
> > +            /*
> > +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
> > +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
> > +             */
> > +            bool rom_enabled;
> 
> Especially with the error handling issue in mind that I've mentioned
> earlier, I wonder whether this field shouldn't be dropped, along the
> lines of you also no longer caching the memory decode enable bit in the
> command register.

Removing rom_enabled would imply doing a register read in
vpci_modify_bars in order to know whether the ROM BAR is enabled or
not, which is not trivial because depending on the header type the
position of the ROM BAR is different.

Another option would be to store the prefetch/enable bits inside of
the addr field, but that would also require more masking/unmasking of
the fields when the values are used or updated.

Roger.
Jan Beulich Oct. 5, 2017, 10:01 a.m. UTC | #3
>>> On 05.10.17 at 11:20, <roger.pau@citrix.com> wrote:
> On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
>> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
>> > +static int vpci_check_bar_overlap(const struct pci_dev *pdev,
>> > +                                  const struct vpci_bar *rom,
>> > +                                  struct rangeset *mem)
>> > +{
>> > +    const struct pci_dev *cmp;
>> > +
>> > +    /* Check for overlaps with other device's BARs. */
>> > +    list_for_each_entry(cmp, &pdev->domain->arch.pdev_list, domain_list)
>> > +    {
>> > +        unsigned int i;
>> > +
>> > +        if ( rom == NULL && pdev == cmp )
>> > +            continue;
>> 
>> This check looks rather unmotivated (or even bogus) without a comment.
>> The other special casing of ROM BARs further down also isn't all that
>> obvious (and right now I can't even convince myself it's correct).
> 
> I've added the following comment before this check, which I think
> explains the logic for this check, and the one below:
> 
> Since ROM BARs can be enabled independently of the memory decoding
> bit we need to check for overlapping in slightly different
> ways depending on the case.
> 
> If !rom it means the memory decoding bit has been toggled, and all
> BARs belonging to the device will be {un}mapped,

That's not precise: When mapping, you may still skip the ROM one
if its enable bit is clear. Whether the difference matters for
unmapping when the ROM is already unmapped I can't tell right
away. Nevertheless I think ...

> hence the rangeset
> will contain the mappings for the whole device. In this case there's
> no need to check for overlaps with BARs that belong to the same
> device because the rangeset is able to deal with overlapping areas.

... the conclusion is correct, as I would expect the ROM range to
simply not be part of the rangeset then.

>> > +        }
>> > +    }
>> > +
>> > +    /* Check for overlaps with other device's BARs. */
>> > +    rc = vpci_check_bar_overlap(pdev, NULL, mem);
>> 
>> Why is this not symmetrical with vpci_modify_rom() (which also checks
>> overlaps inside the current device)?
> 
> I think the comment above should answer the question here, the
> difference is because in this case Xen is mapping a whole device, so
> vpci_check_bar_overlap should not check for overlap with BARs that
> belong to the same device. OTOH, when mapping a ROM BAR Xen should
> check for such overlap, because the regular BARs will already be
> mapped.

Right. Part of my confusion results from the naming of these
two functions (pretty similar despite their different call sites)
as well as their placement (modify_bars() sitting ahead of
the CMD write is fine, as it's a helper of that function, but
modify_rom() would better be moved down to make clear
whose helper it is; it's questionable whether this being a
separate helper function is actually useful).

>> > +    }
>> > +
>> > +    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
>> > +    if ( !rc )
>> > +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> > +            if ( header->bars[i].type != VPCI_BAR_ROM ||
>> > +                 header->bars[i].rom_enabled )
>> > +            header->bars[i].enabled = map;
>> 
>> Hmm, you're updating state here regardless of possible failure in the
>> deferred operation (see the discarded error code in
>> vpci_check_pending()).
> 
> Yes, I've fixed the code above to try to map/unmap as much as
> possible, even when a failure happens.
> 
> I agree that enabling/disabling here with the operation being deferred
> is not ideal, but I also think we would end up doing the same
> regardless of the outcome of the deferred operation. If some
> mapping/unmapping of BARs failed, the memory decoding should be
> enabled anyway. I can add a comment along this lines if you think
> that's OK.

Yes, at least explaining why things are the (not fully correct) way
they are would help (also to tell anyone wanting to improve this
what it actually is that would need changing). Of course even
better would be if maintained state would match the state
hardware is in.

>> > --- a/xen/include/xen/vpci.h
>> > +++ b/xen/include/xen/vpci.h
>> > @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
>> >  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >                  uint32_t data);
>> >  
>> > +/*
>> > + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
>> > + * should not run.
>> > + */
>> > +bool vpci_check_pending(struct vcpu *v);
>> > +
>> >  struct vpci {
>> >      /* List of vPCI handlers for a device. */
>> >      struct list_head handlers;
>> >      spinlock_t lock;
>> > +
>> > +#ifdef __XEN__
>> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
>> > +    struct vpci_header {
>> > +        /* Information about the PCI BARs of this device. */
>> > +        struct vpci_bar {
>> > +            paddr_t addr;
>> > +            uint64_t size;
>> > +            enum {
>> > +                VPCI_BAR_EMPTY,
>> > +                VPCI_BAR_IO,
>> > +                VPCI_BAR_MEM32,
>> > +                VPCI_BAR_MEM64_LO,
>> > +                VPCI_BAR_MEM64_HI,
>> > +                VPCI_BAR_ROM,
>> > +            } type;
>> > +            bool prefetchable;
>> > +            /* Store whether the BAR is mapped into guest p2m. */
>> > +            bool enabled;
>> > +            /*
>> > +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
>> > +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
>> > +             */
>> > +            bool rom_enabled;
>> 
>> Especially with the error handling issue in mind that I've mentioned
>> earlier, I wonder whether this field shouldn't be dropped, along the
>> lines of you also no longer caching the memory decode enable bit in the
>> command register.
> 
> Removing rom_enabled would imply doing a register read in
> vpci_modify_bars in order to know whether the ROM BAR is enabled or
> not, which is not trivial because depending on the header type the
> position of the ROM BAR is different.

As said - I wouldn't mind the field if it was always in sync with the
hardware one. And it was for a reason that I mentioned the
memory decode bit, which you no longer cache. I think both
should be treated the same.

> Another option would be to store the prefetch/enable bits inside of
> the addr field, but that would also require more masking/unmasking of
> the fields when the values are used or updated.

I didn't ask for these two to be eliminated.

Jan
Julien Grall Oct. 5, 2017, 10:56 a.m. UTC | #4
Hi Roger,

On 19/09/17 16:29, Roger Pau Monne wrote:
> +static int vpci_map_range(unsigned long s, unsigned long e, void *data,
> +                          unsigned long *c)
> +{
> +    const struct map_data *map = data;
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        unsigned long size = e - s + 1;
> +
> +        rc = (map->map ? map_mmio_regions : unmap_mmio_regions)
> +             (map->d, _gfn(s), size, _mfn(s));

Again, this is relying on *map_mmio_regions to support preemption. This 
is not the case on ARM.

I am not asking to add preemption in the ARM code. But we should at 
least add a check similar to XEN_DOMCTL_memory_mapping ( if (size > 64) 
) to remind us that *map_mmio_regions have to be fixed.


Similarly, on IRC said you will add a TODO regarding the lack of passing 
the type of the BAR.

Cheers,
Roger Pau Monne Oct. 5, 2017, 11:09 a.m. UTC | #5
On Thu, Oct 05, 2017 at 10:01:46AM +0000, Jan Beulich wrote:
> >>> On 05.10.17 at 11:20, <roger.pau@citrix.com> wrote:
> > On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
> >> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
> >> > +    }
> >> > +
> >> > +    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
> >> > +    if ( !rc )
> >> > +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> >> > +            if ( header->bars[i].type != VPCI_BAR_ROM ||
> >> > +                 header->bars[i].rom_enabled )
> >> > +            header->bars[i].enabled = map;
> >> 
> >> Hmm, you're updating state here regardless of possible failure in the
> >> deferred operation (see the discarded error code in
> >> vpci_check_pending()).
> > 
> > Yes, I've fixed the code above to try to map/unmap as much as
> > possible, even when a failure happens.
> > 
> > I agree that enabling/disabling here with the operation being deferred
> > is not ideal, but I also think we would end up doing the same
> > regardless of the outcome of the deferred operation. If some
> > mapping/unmapping of BARs failed, the memory decoding should be
> > enabled anyway. I can add a comment along this lines if you think
> > that's OK.
> 
> Yes, at least explaining why things are the (not fully correct) way
> they are would help (also to tell anyone wanting to improve this
> what it actually is that would need changing). Of course even
> better would be if maintained state would match the state
> hardware is in.

I see the current code in this version is slightly confusing regarding
the usage of the 'enabled' field. I've fixed the code so that the
'enabled' field matches the following conditions:

 - For non-ROM BARs: the 'enabled' field matches the value of the
   memory decoding bit, but it doesn't guarantee that the range is
   fully mapped/unmapped in the guest p2m.
 - For ROM BARs: the 'enabled' bit matches the value of the memory
   decoding bit & the ROM enable bit, but again it doesn't guarantee
   that the memory is fully mapped/unmapped in the guest p2m.

> >> > --- a/xen/include/xen/vpci.h
> >> > +++ b/xen/include/xen/vpci.h
> >> > @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
> >> >  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >                  uint32_t data);
> >> >  
> >> > +/*
> >> > + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
> >> > + * should not run.
> >> > + */
> >> > +bool vpci_check_pending(struct vcpu *v);
> >> > +
> >> >  struct vpci {
> >> >      /* List of vPCI handlers for a device. */
> >> >      struct list_head handlers;
> >> >      spinlock_t lock;
> >> > +
> >> > +#ifdef __XEN__
> >> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
> >> > +    struct vpci_header {
> >> > +        /* Information about the PCI BARs of this device. */
> >> > +        struct vpci_bar {
> >> > +            paddr_t addr;
> >> > +            uint64_t size;
> >> > +            enum {
> >> > +                VPCI_BAR_EMPTY,
> >> > +                VPCI_BAR_IO,
> >> > +                VPCI_BAR_MEM32,
> >> > +                VPCI_BAR_MEM64_LO,
> >> > +                VPCI_BAR_MEM64_HI,
> >> > +                VPCI_BAR_ROM,
> >> > +            } type;
> >> > +            bool prefetchable;
> >> > +            /* Store whether the BAR is mapped into guest p2m. */
> >> > +            bool enabled;
> >> > +            /*
> >> > +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
> >> > +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
> >> > +             */
> >> > +            bool rom_enabled;
> >> 
> >> Especially with the error handling issue in mind that I've mentioned
> >> earlier, I wonder whether this field shouldn't be dropped, along the
> >> lines of you also no longer caching the memory decode enable bit in the
> >> command register.
> > 
> > Removing rom_enabled would imply doing a register read in
> > vpci_modify_bars in order to know whether the ROM BAR is enabled or
> > not, which is not trivial because depending on the header type the
> > position of the ROM BAR is different.
> 
> As said - I wouldn't mind the field if it was always in sync with the
> hardware one. And it was for a reason that I mentioned the
> memory decode bit, which you no longer cache. I think both
> should be treated the same.

I think I'm missing something, rom_enabled matches exactly the state
of the ROM enable bit. There's no way rom_enabled will get updated
without the BAR ROM also being updated in vpci_rom_write.

In line with my comments above regarding the 'enabled' field, what
about adding:

 - rom_enabled matches the state of the ROM BAR enable bit. It doesn't
   take into account the state of the memory decoding bit. As such, it
   cannot be used to detect if the ROM BAR memory is active or not,
   the 'enabled' bit should be used in that case.

Roger.
Roger Pau Monne Oct. 5, 2017, 11:26 a.m. UTC | #6
On Thu, Oct 05, 2017 at 10:56:20AM +0000, Julien Grall wrote:
> Hi Roger,
> 
> On 19/09/17 16:29, Roger Pau Monne wrote:
> > +static int vpci_map_range(unsigned long s, unsigned long e, void *data,
> > +                          unsigned long *c)
> > +{
> > +    const struct map_data *map = data;
> > +    int rc;
> > +
> > +    for ( ; ; )
> > +    {
> > +        unsigned long size = e - s + 1;
> > +
> > +        rc = (map->map ? map_mmio_regions : unmap_mmio_regions)
> > +             (map->d, _gfn(s), size, _mfn(s));
> 
> Again, this is relying on *map_mmio_regions to support preemption. This is
> not the case on ARM.
> 
> I am not asking to add preemption in the ARM code. But we should at least
> add a check similar to XEN_DOMCTL_memory_mapping ( if (size > 64) ) to
> remind us that *map_mmio_regions have to be fixed.

I've added a bodge for ARM in order to limit the mappings to 64 for
each call to {un}map_mmio_regions.

> 
> Similarly, on IRC said you will add a TODO regarding the lack of passing the
> type of the BAR.

Sorry, not sure if we spoke about this before or after sending this
series, but in any case I've added it now.

Thanks, Roger.
Jan Beulich Oct. 5, 2017, 11:55 a.m. UTC | #7
>>> On 05.10.17 at 13:09, <roger.pau@citrix.com> wrote:
> On Thu, Oct 05, 2017 at 10:01:46AM +0000, Jan Beulich wrote:
>> >>> On 05.10.17 at 11:20, <roger.pau@citrix.com> wrote:
>> > On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
>> >> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
>> >> > --- a/xen/include/xen/vpci.h
>> >> > +++ b/xen/include/xen/vpci.h
>> >> > @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
>> >> >  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >> >                  uint32_t data);
>> >> >  
>> >> > +/*
>> >> > + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
>> >> > + * should not run.
>> >> > + */
>> >> > +bool vpci_check_pending(struct vcpu *v);
>> >> > +
>> >> >  struct vpci {
>> >> >      /* List of vPCI handlers for a device. */
>> >> >      struct list_head handlers;
>> >> >      spinlock_t lock;
>> >> > +
>> >> > +#ifdef __XEN__
>> >> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
>> >> > +    struct vpci_header {
>> >> > +        /* Information about the PCI BARs of this device. */
>> >> > +        struct vpci_bar {
>> >> > +            paddr_t addr;
>> >> > +            uint64_t size;
>> >> > +            enum {
>> >> > +                VPCI_BAR_EMPTY,
>> >> > +                VPCI_BAR_IO,
>> >> > +                VPCI_BAR_MEM32,
>> >> > +                VPCI_BAR_MEM64_LO,
>> >> > +                VPCI_BAR_MEM64_HI,
>> >> > +                VPCI_BAR_ROM,
>> >> > +            } type;
>> >> > +            bool prefetchable;
>> >> > +            /* Store whether the BAR is mapped into guest p2m. */
>> >> > +            bool enabled;
>> >> > +            /*
>> >> > +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
>> >> > +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
>> >> > +             */
>> >> > +            bool rom_enabled;
>> >> 
>> >> Especially with the error handling issue in mind that I've mentioned
>> >> earlier, I wonder whether this field shouldn't be dropped, along the
>> >> lines of you also no longer caching the memory decode enable bit in the
>> >> command register.
>> > 
>> > Removing rom_enabled would imply doing a register read in
>> > vpci_modify_bars in order to know whether the ROM BAR is enabled or
>> > not, which is not trivial because depending on the header type the
>> > position of the ROM BAR is different.
>> 
>> As said - I wouldn't mind the field if it was always in sync with the
>> hardware one. And it was for a reason that I mentioned the
>> memory decode bit, which you no longer cache. I think both
>> should be treated the same.
> 
> I think I'm missing something, rom_enabled matches exactly the state
> of the ROM enable bit. There's no way rom_enabled will get updated
> without the BAR ROM also being updated in vpci_rom_write.

Oh, I'm sorry for not being precise here: I think the hardware
bit should only be set once the mapping is complete. That's
not how the code currently behaves, so yes, right now the
cached bit apparently properly reflects the actual one. With
the possibly deferred mapping, that wouldn't be the case.

Jan
Roger Pau Monne Oct. 5, 2017, 12:02 p.m. UTC | #8
On Thu, Oct 05, 2017 at 11:55:39AM +0000, Jan Beulich wrote:
> >>> On 05.10.17 at 13:09, <roger.pau@citrix.com> wrote:
> > On Thu, Oct 05, 2017 at 10:01:46AM +0000, Jan Beulich wrote:
> >> >>> On 05.10.17 at 11:20, <roger.pau@citrix.com> wrote:
> >> > On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
> >> >> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
> >> >> > --- a/xen/include/xen/vpci.h
> >> >> > +++ b/xen/include/xen/vpci.h
> >> >> > @@ -35,11 +35,52 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
> >> >> >  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >> >                  uint32_t data);
> >> >> >  
> >> >> > +/*
> >> >> > + * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
> >> >> > + * should not run.
> >> >> > + */
> >> >> > +bool vpci_check_pending(struct vcpu *v);
> >> >> > +
> >> >> >  struct vpci {
> >> >> >      /* List of vPCI handlers for a device. */
> >> >> >      struct list_head handlers;
> >> >> >      spinlock_t lock;
> >> >> > +
> >> >> > +#ifdef __XEN__
> >> >> > +    /* Hide the rest of the vpci struct from the user-space test harness. */
> >> >> > +    struct vpci_header {
> >> >> > +        /* Information about the PCI BARs of this device. */
> >> >> > +        struct vpci_bar {
> >> >> > +            paddr_t addr;
> >> >> > +            uint64_t size;
> >> >> > +            enum {
> >> >> > +                VPCI_BAR_EMPTY,
> >> >> > +                VPCI_BAR_IO,
> >> >> > +                VPCI_BAR_MEM32,
> >> >> > +                VPCI_BAR_MEM64_LO,
> >> >> > +                VPCI_BAR_MEM64_HI,
> >> >> > +                VPCI_BAR_ROM,
> >> >> > +            } type;
> >> >> > +            bool prefetchable;
> >> >> > +            /* Store whether the BAR is mapped into guest p2m. */
> >> >> > +            bool enabled;
> >> >> > +            /*
> >> >> > +             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
> >> >> > +             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
> >> >> > +             */
> >> >> > +            bool rom_enabled;
> >> >> 
> >> >> Especially with the error handling issue in mind that I've mentioned
> >> >> earlier, I wonder whether this field shouldn't be dropped, along the
> >> >> lines of you also no longer caching the memory decode enable bit in the
> >> >> command register.
> >> > 
> >> > Removing rom_enabled would imply doing a register read in
> >> > vpci_modify_bars in order to know whether the ROM BAR is enabled or
> >> > not, which is not trivial because depending on the header type the
> >> > position of the ROM BAR is different.
> >> 
> >> As said - I wouldn't mind the field if it was always in sync with the
> >> hardware one. And it was for a reason that I mentioned the
> >> memory decode bit, which you no longer cache. I think both
> >> should be treated the same.
> > 
> > I think I'm missing something, rom_enabled matches exactly the state
> > of the ROM enable bit. There's no way rom_enabled will get updated
> > without the BAR ROM also being updated in vpci_rom_write.
> 
> Oh, I'm sorry for not being precise here: I think the hardware
> bit should only be set once the mapping is complete. That's
> not how the code currently behaves, so yes, right now the
> cached bit apparently properly reflects the actual one. With
> the possibly deferred mapping, that wouldn't be the case.

I could add some tail code to vpci_process_pending that sets the
memory decoding or ROM BAR enable bit together with the rom_enable and
enabled fields in the header struct. Would you agree to this?

Thanks, Roger.
Jan Beulich Oct. 5, 2017, 1:09 p.m. UTC | #9
>>> On 05.10.17 at 14:02, <roger.pau@citrix.com> wrote:
> On Thu, Oct 05, 2017 at 11:55:39AM +0000, Jan Beulich wrote:
>> >>> On 05.10.17 at 13:09, <roger.pau@citrix.com> wrote:
>> > On Thu, Oct 05, 2017 at 10:01:46AM +0000, Jan Beulich wrote:
>> >> >>> On 05.10.17 at 11:20, <roger.pau@citrix.com> wrote:
>> >> > On Wed, Oct 04, 2017 at 08:33:33AM +0000, Jan Beulich wrote:
>> >> >> >>> On 19.09.17 at 17:29, <roger.pau@citrix.com> wrote:
>> >> >> > +            bool rom_enabled;
>> >> >> 
>> >> >> Especially with the error handling issue in mind that I've mentioned
>> >> >> earlier, I wonder whether this field shouldn't be dropped, along the
>> >> >> lines of you also no longer caching the memory decode enable bit in the
>> >> >> command register.
>> >> > 
>> >> > Removing rom_enabled would imply doing a register read in
>> >> > vpci_modify_bars in order to know whether the ROM BAR is enabled or
>> >> > not, which is not trivial because depending on the header type the
>> >> > position of the ROM BAR is different.
>> >> 
>> >> As said - I wouldn't mind the field if it was always in sync with the
>> >> hardware one. And it was for a reason that I mentioned the
>> >> memory decode bit, which you no longer cache. I think both
>> >> should be treated the same.
>> > 
>> > I think I'm missing something, rom_enabled matches exactly the state
>> > of the ROM enable bit. There's no way rom_enabled will get updated
>> > without the BAR ROM also being updated in vpci_rom_write.
>> 
>> Oh, I'm sorry for not being precise here: I think the hardware
>> bit should only be set once the mapping is complete. That's
>> not how the code currently behaves, so yes, right now the
>> cached bit apparently properly reflects the actual one. With
>> the possibly deferred mapping, that wouldn't be the case.
> 
> I could add some tail code to vpci_process_pending that sets the
> memory decoding or ROM BAR enable bit together with the rom_enable and
> enabled fields in the header struct. Would you agree to this?

If that's cleanly doable, sure. I had assumed you didn't do it
because you couldn't reasonably update state at that later point.

Jan
diff mbox

Patch

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 3e7a88e053..f6588ceab4 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -26,6 +26,7 @@ 
 #include <xen/domain.h>
 #include <xen/event.h>
 #include <xen/paging.h>
+#include <xen/vpci.h>
 
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/ioreq.h>
@@ -48,6 +49,9 @@  bool_t hvm_io_pending(struct vcpu *v)
     struct domain *d = v->domain;
     struct hvm_ioreq_server *s;
 
+    if ( has_vpci(v->domain) && vpci_check_pending(v) )
+      return 1;
+
     list_for_each_entry ( s,
                           &d->arch.hvm_domain.ioreq_server.list,
                           list_entry )
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@ 
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..c0d38c8b91
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,478 @@ 
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+#include <xen/softirq.h>
+
+#include <asm/event.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
+     (x)->type == VPCI_BAR_ROM)
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int vpci_map_range(unsigned long s, unsigned long e, void *data,
+                          unsigned long *c)
+{
+    const struct map_data *map = data;
+    int rc;
+
+    for ( ; ; )
+    {
+        unsigned long size = e - s + 1;
+
+        rc = (map->map ? map_mmio_regions : unmap_mmio_regions)
+             (map->d, _gfn(s), size, _mfn(s));
+        if ( rc == 0 )
+        {
+            *c += size;
+            break;
+        }
+        if ( rc < 0 )
+        {
+            printk(XENLOG_G_WARNING
+                   "Failed to identity %smap [%" PRI_gfn ", %" PRI_gfn ") for d%d: %d\n",
+                   map ? "" : "un", s, e, map->d->domain_id, rc);
+            break;
+        }
+        *c += rc;
+        s += rc;
+        if ( general_preempt_check() )
+        {
+            if ( !is_idle_vcpu(current) )
+                return -ERESTART;
+
+            process_pending_softirqs();
+        }
+    }
+
+    return rc;
+}
+
+static int vpci_map_memory(struct domain *d, struct rangeset *mem, bool map)
+{
+    struct map_data data = { .d = d, .map = map };
+
+    return rangeset_consume_ranges(mem, vpci_map_range, &data);
+}
+
+bool vpci_check_pending(struct vcpu *v)
+{
+    if ( v->vpci.mem )
+    {
+        int rc = vpci_map_memory(v->domain, v->vpci.mem, v->vpci.map);
+
+        if ( rc == -ERESTART )
+            return true;
+
+        rangeset_destroy(v->vpci.mem);
+        v->vpci.mem = NULL;
+    }
+
+    return false;
+}
+
+static int vpci_maybe_defer_map(struct domain *d, struct rangeset *mem,
+                                bool map)
+{
+    struct vcpu *curr = current;
+    int rc = 0;
+
+    if ( is_idle_vcpu(curr) )
+    {
+        rc = vpci_map_memory(d, mem, map);
+        rangeset_destroy(mem);
+    }
+    else
+    {
+        ASSERT(curr->domain == d);
+        curr->vpci.mem = mem;
+        curr->vpci.map = map;
+    }
+
+    return rc;
+}
+
+static int vpci_check_bar_overlap(const struct pci_dev *pdev,
+                                  const struct vpci_bar *rom,
+                                  struct rangeset *mem)
+{
+    const struct pci_dev *cmp;
+
+    /* Check for overlaps with other device's BARs. */
+    list_for_each_entry(cmp, &pdev->domain->arch.pdev_list, domain_list)
+    {
+        unsigned int i;
+
+        if ( rom == NULL && pdev == cmp )
+            continue;
+
+        for ( i = 0; i < ARRAY_SIZE(cmp->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &cmp->vpci->header.bars[i];
+            unsigned long start = PFN_DOWN(bar->addr);
+            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+            int rc;
+
+            if ( rom == bar || !bar->enabled || !MAPPABLE_BAR(bar) ||
+                 !rangeset_overlaps_range(mem, start, end) )
+                continue;
+
+            rc = rangeset_remove_range(mem, start, end);
+            if ( rc )
+                return rc;
+        }
+    }
+
+    return 0;
+}
+
+static void vpci_modify_bars(const struct pci_dev *pdev, bool map)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    unsigned int i;
+    int rc;
+
+    if ( !mem )
+        return;
+
+    /*
+     * Create a rangeset that represents the current device BARs memory region
+     * and compare it against all the currently active BAR memory regions. If
+     * an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
+     *
+     * NB: the rangeset uses inclusive frame numbers.
+     */
+
+    /* First fill the rangeset with all the BARs of this device. */
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        const struct vpci_bar *bar = &header->bars[i];
+
+        if ( !MAPPABLE_BAR(bar) ||
+             (bar->type == VPCI_BAR_ROM && !bar->rom_enabled) )
+            continue;
+
+        rc = rangeset_add_range(mem, PFN_DOWN(bar->addr),
+                                PFN_DOWN(bar->addr + bar->size - 1));
+        if ( rc )
+        {
+            rangeset_destroy(mem);
+            return;
+        }
+    }
+
+    /* Check for overlaps with other device's BARs. */
+    rc = vpci_check_bar_overlap(pdev, NULL, mem);
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return;
+    }
+
+    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
+    if ( !rc )
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+            if ( header->bars[i].type != VPCI_BAR_ROM ||
+                 header->bars[i].rom_enabled )
+            header->bars[i].enabled = map;
+}
+
+static void vpci_modify_rom(const struct pci_dev *pdev,
+                            struct vpci_bar *rom, bool map)
+{
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    int rc;
+
+    ASSERT(rom->type == VPCI_BAR_ROM);
+
+    if ( !mem )
+        return;
+
+    /* First fill the rangeset with the ROM BAR. */
+    rc = rangeset_add_range(mem, PFN_DOWN(rom->addr),
+                            PFN_DOWN(rom->addr + rom->size - 1));
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return;
+    }
+
+    /*
+     * Check for overlaps with other BARs (either on this device or other
+     * devices).
+     */
+    rc = vpci_check_bar_overlap(pdev, rom, mem);
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return;
+    }
+
+    rc = vpci_maybe_defer_map(pdev->domain, mem, map);
+    if ( !rc )
+        rom->enabled = map;
+}
+
+static uint32_t vpci_cmd_read(const struct pci_dev *pdev, unsigned int reg,
+                              void *data)
+{
+    return pci_conf_read16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                           PCI_FUNC(pdev->devfn), reg);
+}
+
+static void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                           uint32_t cmd, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
+
+    /*
+     * Let the guest play with all the bits directly except for the
+     * memory decoding one.
+     */
+    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
+        vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
+
+    pci_conf_write16(seg, bus, slot, func, reg, cmd);
+}
+
+static uint32_t vpci_bar_read(const struct pci_dev *pdev, unsigned int reg,
+                              void *data)
+{
+    return pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                           PCI_FUNC(pdev->devfn), reg);
+}
+
+static void vpci_bar_write(const struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    bool hi = false;
+
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+    {
+         gprintk(XENLOG_WARNING,
+                 "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
+                 seg, bus, slot, func);
+        return;
+    }
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Update the cached address, so that when memory decoding is enabled
+     * Xen can map the BAR into the guest p2m.
+     */
+    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+    {
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, val);
+}
+
+static void vpci_rom_write(const struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *rom = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+
+    if ( (pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+          PCI_COMMAND_MEMORY) && rom->rom_enabled )
+    {
+         gprintk(XENLOG_WARNING,
+                 "%04x:%02x:%02x.%u: ignored ROM BAR write with memory decoding enabled\n",
+                 seg, bus, slot, func);
+        return;
+    }
+
+    rom->addr = val & PCI_ROM_ADDRESS_MASK;
+
+    /* Check if ROM BAR should be mapped/unmapped. */
+    if ( (cmd & PCI_COMMAND_MEMORY) &&
+         rom->rom_enabled != (val & PCI_ROM_ADDRESS_ENABLE) )
+        vpci_modify_rom(pdev, rom, val & PCI_ROM_ADDRESS_ENABLE);
+
+    rom->rom_enabled = val & PCI_ROM_ADDRESS_ENABLE;
+    pci_conf_write32(pdev->seg, pdev->bus, slot, func, reg, val);
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    pci_sbdf_t sbdf = {
+        .seg = seg,
+        .bus = bus,
+        .dev = slot,
+        .func = func,
+    };
+    int rc;
+
+    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = 6;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = 2;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                                   &bars[i]);
+            if ( rc )
+            {
+                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+                return rc;
+            }
+
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        /* Size the BAR and map it. */
+        rc = pci_size_mem_bar(sbdf, reg, i == num_bars - 1, &addr, &size, 0);
+        if ( rc < 0 )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = addr;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+    }
+
+    /* Check expansion ROM. */
+    rc = pci_size_mem_bar(sbdf, rom_reg, true, &addr, &size, PCI_BAR_ROM);
+    if ( rc > 0 && size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->addr = addr;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_rom_write, rom_reg, 4,
+                               rom);
+        if ( rc )
+            rom->type = VPCI_BAR_EMPTY;
+    }
+
+    if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        vpci_modify_bars(pdev, true);
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+    }
+
+    return 0;
+}
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index b03afb450d..39a330ffca 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -20,6 +20,9 @@ 
 #include <xen/smp.h>
 #include <xen/perfc.h>
 #include <asm/atomic.h>
+#ifdef CONFIG_HAS_PCI
+#include <xen/vpci.h>
+#endif
 #include <xen/wait.h>
 #include <public/xen.h>
 #include <public/domctl.h>
@@ -264,6 +267,11 @@  struct vcpu
 
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
+#ifdef CONFIG_HAS_PCI
+    /* vPCI per-vCPU area, used to store data for long running operations. */
+    struct vpci_vcpu vpci;
+#endif
+
     struct arch_vcpu arch;
 };
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index b42e38ed54..4e0b67c2f1 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -35,11 +35,52 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size);
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data);
 
+/*
+ * Check for pending vPCI operations on this vcpu. Returns true if the vcpu
+ * should not run.
+ */
+bool vpci_check_pending(struct vcpu *v);
+
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
     spinlock_t lock;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            paddr_t addr;
+            uint64_t size;
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            bool prefetchable;
+            /* Store whether the BAR is mapped into guest p2m. */
+            bool enabled;
+            /*
+             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
+             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
+             */
+            bool rom_enabled;
+        } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
+};
+
+#ifdef __XEN__
+struct vpci_vcpu {
+    struct rangeset *mem;
+    bool map;
 };
+#endif
 
 #endif