diff mbox

[v4,6/9] xen/vpci: add handlers to map the BARs

Message ID 20170630150117.88489-7-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monné June 30, 2017, 3:01 p.m. UTC
Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to emulate BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 2 (response to memory
space accesses), and maps/unmaps the BARs of the device into the guest p2m.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the paramerters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 473 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  23 +++
 3 files changed, 497 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

Comments

Jan Beulich July 14, 2017, 3:11 p.m. UTC | #1
>>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> Introduce a set of handlers that trap accesses to the PCI BARs and the command
> register, in order to emulate BAR sizing and BAR relocation.

I don't think "emulate" is the right term here - you really don't mean to
change anything, you only want to snoop Dom0 writes.

> --- /dev/null
> +++ b/xen/drivers/vpci/header.c
> @@ -0,0 +1,473 @@
> +/*
> + * Generic functionality for handling accesses to the PCI header from the
> + * configuration space.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xen/p2m-common.h>
> +
> +#define MAPPABLE_BAR(x)                                                 \
> +    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
> +     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
> +     (x)->addr != INVALID_PADDR)
> +
> +static struct rangeset *vpci_get_bar_memory(const struct domain *d,
> +                                            const struct vpci_bar *map)
> +{
> +    const struct pci_dev *pdev;
> +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> +    int rc;
> +
> +    if ( !mem )
> +        return ERR_PTR(-ENOMEM);
> +
> +    /*
> +     * Create a rangeset that represents the current BAR memory region
> +     * and compare it against all the currently active BAR memory regions.
> +     * If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
> +     *
> +     * NB: the rangeset uses frames, and if start and end addresses are
> +     * equal it means only one frame is used, that's why PFN_DOWN is used
> +     * to calculate the end of the rangeset.
> +     */

That explanation doesn't seem to fit: Did you perhaps mean to
point out that rangeset ranges are inclusive ones?

> +    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
> +                            PFN_DOWN(map->addr + map->size));

Don't you need to subtract 1 here (and elsewhere below)?

> +    if ( rc )
> +    {
> +        rangeset_destroy(mem);
> +        return ERR_PTR(rc);
> +    }
> +
> +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
> +    {
> +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
> +                                       PCI_SLOT(pdev->devfn),
> +                                       PCI_FUNC(pdev->devfn),
> +                                       PCI_COMMAND);

This is quite a lot of overhead - a loop over all devices plus a config
space read on each one. What state the memory decode bit is in
could be recorded in the ->enabled flag, couldn't it? And devices on
different sub-branches of the topology can't possibly have
overlapping entries that we need to worry about, as the bridge
windows would suppress actual accesses.

> +        unsigned int i;
> +
> +        /* Check if memory decoding is enabled. */
> +        if ( !(cmd & PCI_COMMAND_MEMORY) )
> +            continue;
> +
> +        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> +        {
> +            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
> +
> +            if ( bar == map || !MAPPABLE_BAR(bar) ||
> +                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
> +                                          PFN_DOWN(bar->addr + bar->size)) )
> +                continue;
> +
> +            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
> +                                       PFN_DOWN(bar->addr + bar->size));

I'm struggling to convince myself of the correctness of this approach
(including other code further down which is also involved). I think you
should have taken the time to add a few words on the approach
chosen to the description. For example, it doesn't look like things will
go right if the device being dealt with has two BARs both using part
of the same page.

> +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> +                           const bool map)
> +{
> +    struct rangeset *mem;
> +    struct map_data data = { .d = d, .map = map };
> +    int rc;
> +
> +    ASSERT(MAPPABLE_BAR(bar));
> +
> +    mem = vpci_get_bar_memory(d, bar);
> +    if ( IS_ERR(mem) )
> +        return -PTR_ERR(mem);

The negation looks wrong to me.

> +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    uint16_t cmd = val.u16, current_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    int rc;
> +
> +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> +
> +    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
> +    {
> +        /*
> +         * Let the guest play with all the bits directly except for the
> +         * memory decoding one.
> +         */
> +        pci_conf_write16(seg, bus, slot, func, reg, cmd);
> +        return;

Please invert the condition and have both cases use the same write
at the end of the function.

> +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint32_t wdata = val.u32, size_mask;
> +    bool hi = false;
> +
> +    switch ( bar->type )
> +    {
> +    case VPCI_BAR_MEM32:
> +    case VPCI_BAR_MEM64_LO:
> +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> +        break;
> +    case VPCI_BAR_MEM64_HI:
> +        size_mask = ~0u;
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    if ( (wdata & size_mask) == size_mask )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        bar->sizing = true;
> +        return;

I think the comment needs extending to explain why the written
sizing value can't possibly be an address. This is particularly
relevant because I'm not sure that assumption would hold on e.g.
ARM (which I don't think has guaranteed ROM right below 4Gb).

> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    bar->sizing = false;
> +
> +    /*
> +     * Ignore attempts to change the position of the BAR if memory decoding is
> +     * active.
> +     */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +        return;

Especially as long as this code supports only Dom0 I think we want
a warning here.

> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( !hi )
> +        wdata &= PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));

Maybe shorter "0xffffffffull << (hi ? 0 : 32)"?

> +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    const uint32_t wdata = val.u32;
> +
> +    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        rom->sizing = true;
> +        return;
> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    rom->sizing = false;
> +
> +    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
> +
> +    /* Check if memory decoding is enabled. */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY &&
> +         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )

Just like you parenthesize the operands of ^, please also do so for
the ones of &. Also the ^-expression relies on the particular value
of PCI_ROM_ADDRESS_ENABLE, which I'd prefer if you avoided.

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t header_type;
> +    uint16_t cmd;
> +    uint32_t rom_val;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> +    switch ( header_type )

I'd prefer if you didn't introduce variables used just once.

> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);

This is unrelated to what you mean to do here. Please move it ...

> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */

... here.

> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                                   &bars[i]);
> +            if ( rc )
> +                return rc;
> +
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;

Perhaps ignore the 64-bit indicator if it appears in the last BAR?

> +        /* Size the BAR and map it. */
> +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> +                              &addr, &size);
> +        if ( rc < 0 )
> +            return rc;
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;

This doesn't match up with logic further up: When the memory decode
bit gets cleared, you don't zap the addresses, so I think you'd better
store it here too. Use INVALID_PADDR only when the value read has
all address bits set (same caveat as pointed out earlier).

> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    /* Check expansion ROM. */
> +    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
> +    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
> +        pci_conf_write32(seg, bus, slot, func, rom_reg,
> +                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);

Do you really need to do this when you've cleared the memory
decode bit already?

> +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);

You can't use this function here without first making it capable of
dealing with ROM BARs - it expects the low bits to be different
than what we have here (see the early ASSERT() that's there).

> +    if ( rc < 0 )
> +        return rc;

Perhaps I didn't pay attention elsewhere, but here it is quite obvious
that in the error case you return with the device in a state other than
on input.

> +    if ( size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
> +        if ( rom->enabled )
> +            rom->addr = addr;
> +        else
> +            rom->addr = INVALID_PADDR;

Same remark as further up.

Jan
Roger Pau Monné July 24, 2017, 2:58 p.m. UTC | #2
On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> > Introduce a set of handlers that trap accesses to the PCI BARs and the command
> > register, in order to emulate BAR sizing and BAR relocation.
> 
> I don't think "emulate" is the right term here - you really don't mean to
> change anything, you only want to snoop Dom0 writes.

Right, changed emulate to snoop.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/header.c
> > @@ -0,0 +1,473 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI header from the
> > + * configuration space.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +#include <xen/p2m-common.h>
> > +
> > +#define MAPPABLE_BAR(x)                                                 \
> > +    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
> > +     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
> > +     (x)->addr != INVALID_PADDR)
> > +
> > +static struct rangeset *vpci_get_bar_memory(const struct domain *d,
> > +                                            const struct vpci_bar *map)
> > +{
> > +    const struct pci_dev *pdev;
> > +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> > +    int rc;
> > +
> > +    if ( !mem )
> > +        return ERR_PTR(-ENOMEM);
> > +
> > +    /*
> > +     * Create a rangeset that represents the current BAR memory region
> > +     * and compare it against all the currently active BAR memory regions.
> > +     * If an overlap is found, subtract it from the region to be
> > +     * mapped/unmapped.
> > +     *
> > +     * NB: the rangeset uses frames, and if start and end addresses are
> > +     * equal it means only one frame is used, that's why PFN_DOWN is used
> > +     * to calculate the end of the rangeset.
> > +     */
> 
> That explanation doesn't seem to fit: Did you perhaps mean to
> point out that rangeset ranges are inclusive ones?

Yes, that's probably better.

> > +    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
> > +                            PFN_DOWN(map->addr + map->size));
> 
> Don't you need to subtract 1 here (and elsewhere below)?

Indeed.

> > +    if ( rc )
> > +    {
> > +        rangeset_destroy(mem);
> > +        return ERR_PTR(rc);
> > +    }
> > +
> > +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
> > +    {
> > +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
> > +                                       PCI_SLOT(pdev->devfn),
> > +                                       PCI_FUNC(pdev->devfn),
> > +                                       PCI_COMMAND);
> 
> This is quite a lot of overhead - a loop over all devices plus a config
> space read on each one. What state the memory decode bit is in
> could be recorded in the ->enabled flag, couldn't it? And devices on
> different sub-branches of the topology can't possibly have
> overlapping entries that we need to worry about, as the bridge
> windows would suppress actual accesses.

Oh, so Xen only needs to care about devices that share the same
bridge, because that is the only case where the same page can be
shared by multiple devices?

In any case, the Dom0 is free to wrongly position the BARs anywhere it
wants, thus possibly placing them outside of the bridge windows, in
with case I think we should better check all assigned devices.

> > +        unsigned int i;
> > +
> > +        /* Check if memory decoding is enabled. */
> > +        if ( !(cmd & PCI_COMMAND_MEMORY) )
> > +            continue;
> > +
> > +        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> > +        {
> > +            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
> > +
> > +            if ( bar == map || !MAPPABLE_BAR(bar) ||
> > +                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
> > +                                          PFN_DOWN(bar->addr + bar->size)) )
> > +                continue;
> > +
> > +            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
> > +                                       PFN_DOWN(bar->addr + bar->size));
> 
> I'm struggling to convince myself of the correctness of this approach
> (including other code further down which is also involved). I think you
> should have taken the time to add a few words on the approach
> chosen to the description.

Will do.

> For example, it doesn't look like things will
> go right if the device being dealt with has two BARs both using part
> of the same page.

Right, because the BAR won't reflect it's actual state (due to the
memory decoding being global per-device). AFAICT this will be solved
by your suggestion above of using ->enabled and keeping it updated for
BARs also.

> > +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> > +                           const bool map)
> > +{
> > +    struct rangeset *mem;
> > +    struct map_data data = { .d = d, .map = map };
> > +    int rc;
> > +
> > +    ASSERT(MAPPABLE_BAR(bar));
> > +
> > +    mem = vpci_get_bar_memory(d, bar);
> > +    if ( IS_ERR(mem) )
> > +        return -PTR_ERR(mem);
> 
> The negation looks wrong to me.

OK, this is already returning -<ERROR>, so the negation is not needed.

> > +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    uint16_t cmd = val.u16, current_cmd;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    int rc;
> > +
> > +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> > +
> > +    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
> > +    {
> > +        /*
> > +         * Let the guest play with all the bits directly except for the
> > +         * memory decoding one.
> > +         */
> > +        pci_conf_write16(seg, bus, slot, func, reg, cmd);
> > +        return;
> 
> Please invert the condition and have both cases use the same write
> at the end of the function.

Done.

> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint32_t wdata = val.u32, size_mask;
> > +    bool hi = false;
> > +
> > +    switch ( bar->type )
> > +    {
> > +    case VPCI_BAR_MEM32:
> > +    case VPCI_BAR_MEM64_LO:
> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> > +        break;
> > +    case VPCI_BAR_MEM64_HI:
> > +        size_mask = ~0u;
> > +        break;
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +        return;
> > +    }
> > +
> > +    if ( (wdata & size_mask) == size_mask )
> > +    {
> > +        /* Next reads from this register are going to return the BAR size. */
> > +        bar->sizing = true;
> > +        return;
> 
> I think the comment needs extending to explain why the written
> sizing value can't possibly be an address. This is particularly
> relevant because I'm not sure that assumption would hold on e.g.
> ARM (which I don't think has guaranteed ROM right below 4Gb).

Hm, right. Maybe it would be best to detect sizing by checking that
the address when performing a read is ~0 on the high bits and ~0 &
PCI_BASE_ADDRESS_MEM_MASK on the lower ones, instead of doing this
kind of partial guessing as done here, it's certainly not very robust.

> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    bar->sizing = false;
> > +
> > +    /*
> > +     * Ignore attempts to change the position of the BAR if memory decoding is
> > +     * active.
> > +     */
> > +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> > +         PCI_COMMAND_MEMORY )
> > +        return;
> 
> Especially as long as this code supports only Dom0 I think we want
> a warning here.

Done, I've added:

%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled

> > +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *rom = data;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    const uint32_t wdata = val.u32;
> > +
> > +    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
> > +    {
> > +        /* Next reads from this register are going to return the BAR size. */
> > +        rom->sizing = true;
> > +        return;
> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    rom->sizing = false;
> > +
> > +    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
> > +
> > +    /* Check if memory decoding is enabled. */
> > +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> > +         PCI_COMMAND_MEMORY &&
> > +         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )
> 
> Just like you parenthesize the operands of ^, please also do so for
> the ones of &. Also the ^-expression relies on the particular value
> of PCI_ROM_ADDRESS_ENABLE, which I'd prefer if you avoided.

Changed it to: rom->enabled != !!(wadata & PCI_ROM_ADDRESS_ENABLE)

> > +static int vpci_init_bars(struct pci_dev *pdev)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint8_t header_type;
> > +    uint16_t cmd;
> > +    uint32_t rom_val;
> > +    uint64_t addr, size;
> > +    unsigned int i, num_bars, rom_reg;
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    struct vpci_bar *bars = header->bars;
> > +    int rc;
> > +
> > +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> > +    switch ( header_type )
> 
> I'd prefer if you didn't introduce variables used just once.

OK, I find it cumbersome to place it as the switch expression, but it
fits in a single line so it's not that bad.

> > +    if ( cmd & PCI_COMMAND_MEMORY )
> > +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> > +                         cmd & ~PCI_COMMAND_MEMORY);
> > +
> > +    for ( i = 0; i < num_bars; i++ )
> > +    {
> > +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> > +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> > +
> > +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> > +        {
> > +            bars[i].type = VPCI_BAR_MEM64_HI;
> > +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> > +                                   &bars[i]);
> > +            if ( rc )
> > +                return rc;
> > +
> > +            continue;
> > +        }
> > +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> > +        {
> > +            bars[i].type = VPCI_BAR_IO;
> > +            continue;
> > +        }
> > +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +            bars[i].type = VPCI_BAR_MEM64_LO;
> > +        else
> > +            bars[i].type = VPCI_BAR_MEM32;
> 
> Perhaps ignore the 64-bit indicator if it appears in the last BAR?

Hm, pci_size_mem_bar is going to complain anyway and Xen won't be able
to size the BAR.

> > +        /* Size the BAR and map it. */
> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> > +                              &addr, &size);
> > +        if ( rc < 0 )
> > +            return rc;
> > +
> > +        if ( size == 0 )
> > +        {
> > +            bars[i].type = VPCI_BAR_EMPTY;
> > +            continue;
> > +        }
> > +
> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> 
> This doesn't match up with logic further up: When the memory decode
> bit gets cleared, you don't zap the addresses, so I think you'd better
> store it here too. Use INVALID_PADDR only when the value read has
> all address bits set (same caveat as pointed out earlier).

OK, note that .addr can only possibly be INVALID_PADDR at
initialization time, once the user has written something to the BAR
.addr will be different than INVALID_PADDR.

> > +        bars[i].size = size;
> > +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> > +
> > +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> > +                               &bars[i]);
> > +        if ( rc )
> > +            return rc;
> > +    }
> > +
> > +    /* Check expansion ROM. */
> > +    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
> > +    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
> > +        pci_conf_write32(seg, bus, slot, func, rom_reg,
> > +                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);
> 
> Do you really need to do this when you've cleared the memory
> decode bit already?

Oh right, this is not needed. Both bits need to be enabled for the ROM
to be mapped.

> > +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);
> 
> You can't use this function here without first making it capable of
> dealing with ROM BARs - it expects the low bits to be different
> than what we have here (see the early ASSERT() that's there).
> 
> > +    if ( rc < 0 )
> > +        return rc;
> 
> Perhaps I didn't pay attention elsewhere, but here it is quite obvious
> that in the error case you return with the device in a state other than
> on input.

Yes, there are several error paths here that will return with memory
decoding disabled. I can fix that by writing back the original command
value to the register.

> > +    if ( size )
> > +    {
> > +        struct vpci_bar *rom = &header->bars[num_bars];
> > +
> > +        rom->type = VPCI_BAR_ROM;
> > +        rom->size = size;
> > +        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
> > +        if ( rom->enabled )
> > +            rom->addr = addr;
> > +        else
> > +            rom->addr = INVALID_PADDR;
> 
> Same remark as further up.

Fixed.

Thanks, Roger.
Jan Beulich July 29, 2017, 4:44 p.m. UTC | #3
>>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
>On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
>> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> > +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
>> > +    {
>> > +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
>> > +                                       PCI_SLOT(pdev->devfn),
>> > +                                       PCI_FUNC(pdev->devfn),
>> > +                                       PCI_COMMAND);
>> 
>> This is quite a lot of overhead - a loop over all devices plus a config
>> space read on each one. What state the memory decode bit is in
>> could be recorded in the ->enabled flag, couldn't it? And devices on
>> different sub-branches of the topology can't possibly have
>> overlapping entries that we need to worry about, as the bridge
>> windows would suppress actual accesses.
>
>Oh, so Xen only needs to care about devices that share the same
>bridge, because that is the only case where the same page can be
>shared by multiple devices?

Yes, that's my understanding (unless bridge windows overlap, which
I don't know what the resulting behavior would be).

>In any case, the Dom0 is free to wrongly position the BARs anywhere it
>wants, thus possibly placing them outside of the bridge windows, in
>with case I think we should better check all assigned devices.

As an initial solution this _may_ be good enough, but beware of systems
with very many devices.

>> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
>> > +                           union vpci_val val, void *data)
>> > +{
>> > +    struct vpci_bar *bar = data;
>> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
>> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>> > +    uint32_t wdata = val.u32, size_mask;
>> > +    bool hi = false;
>> > +
>> > +    switch ( bar->type )
>> > +    {
>> > +    case VPCI_BAR_MEM32:
>> > +    case VPCI_BAR_MEM64_LO:
>> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
>> > +        break;
>> > +    case VPCI_BAR_MEM64_HI:
>> > +        size_mask = ~0u;
>> > +        break;
>> > +    default:
>> > +        ASSERT_UNREACHABLE();
>> > +        return;
>> > +    }
>> > +
>> > +    if ( (wdata & size_mask) == size_mask )
>> > +    {
>> > +        /* Next reads from this register are going to return the BAR size. */
>> > +        bar->sizing = true;
>> > +        return;
>> 
>> I think the comment needs extending to explain why the written
>> sizing value can't possibly be an address. This is particularly
>> relevant because I'm not sure that assumption would hold on e.g.
>> ARM (which I don't think has guaranteed ROM right below 4Gb).
>
>Hm, right. Maybe it would be best to detect sizing by checking that
>the address when performing a read is ~0 on the high bits and ~0 &
>PCI_BASE_ADDRESS_MEM_MASK on the lower ones, instead of doing this
>kind of partial guessing as done here, it's certainly not very robust.

I don't understand, particularly because you say "when performing a read).
Or do you mean to do away with the "sizing" flag altogether?

>> > +        /* Size the BAR and map it. */
>> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
>> > +                              &addr, &size);
>> > +        if ( rc < 0 )
>> > +            return rc;
>> > +
>> > +        if ( size == 0 )
>> > +        {
>> > +            bars[i].type = VPCI_BAR_EMPTY;
>> > +            continue;
>> > +        }
>> > +
>> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
>> 
>> This doesn't match up with logic further up: When the memory decode
>> bit gets cleared, you don't zap the addresses, so I think you'd better
>> store it here too. Use INVALID_PADDR only when the value read has
>> all address bits set (same caveat as pointed out earlier).
>
>OK, note that .addr can only possibly be INVALID_PADDR at
>initialization time, once the user has written something to the BAR
>.addr will be different than INVALID_PADDR.

Which is part of what worries me - it would be better if the field wouldn't
ever hold a special init-time-only value.

Jan
Roger Pau Monné Aug. 8, 2017, 12:35 p.m. UTC | #4
On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> >> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> >> > +                           union vpci_val val, void *data)
> >> > +{
> >> > +    struct vpci_bar *bar = data;
> >> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> >> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> >> > +    uint32_t wdata = val.u32, size_mask;
> >> > +    bool hi = false;
> >> > +
> >> > +    switch ( bar->type )
> >> > +    {
> >> > +    case VPCI_BAR_MEM32:
> >> > +    case VPCI_BAR_MEM64_LO:
> >> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> >> > +        break;
> >> > +    case VPCI_BAR_MEM64_HI:
> >> > +        size_mask = ~0u;
> >> > +        break;
> >> > +    default:
> >> > +        ASSERT_UNREACHABLE();
> >> > +        return;
> >> > +    }
> >> > +
> >> > +    if ( (wdata & size_mask) == size_mask )
> >> > +    {
> >> > +        /* Next reads from this register are going to return the BAR size. */
> >> > +        bar->sizing = true;
> >> > +        return;
> >> 
> >> I think the comment needs extending to explain why the written
> >> sizing value can't possibly be an address. This is particularly
> >> relevant because I'm not sure that assumption would hold on e.g.
> >> ARM (which I don't think has guaranteed ROM right below 4Gb).
> >
> >Hm, right. Maybe it would be best to detect sizing by checking that
> >the address when performing a read is ~0 on the high bits and ~0 &
> >PCI_BASE_ADDRESS_MEM_MASK on the lower ones, instead of doing this
> >kind of partial guessing as done here, it's certainly not very robust.
> 
> I don't understand, particularly because you say "when performing a read).
> Or do you mean to do away with the "sizing" flag altogether?

Yes, I've got rid of the "sizing" flag, and now attempts by the guest
to size the BARs are detected during read of the BAR itself, by
checking whether the address matches ~0 in the high part, or
PCI_BASE_ADDRESS_MEM_MASK in the lower part.

> >> > +        /* Size the BAR and map it. */
> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> >> > +                              &addr, &size);
> >> > +        if ( rc < 0 )
> >> > +            return rc;
> >> > +
> >> > +        if ( size == 0 )
> >> > +        {
> >> > +            bars[i].type = VPCI_BAR_EMPTY;
> >> > +            continue;
> >> > +        }
> >> > +
> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> >> 
> >> This doesn't match up with logic further up: When the memory decode
> >> bit gets cleared, you don't zap the addresses, so I think you'd better
> >> store it here too. Use INVALID_PADDR only when the value read has
> >> all address bits set (same caveat as pointed out earlier).
> >
> >OK, note that .addr can only possibly be INVALID_PADDR at
> >initialization time, once the user has written something to the BAR
> >.addr will be different than INVALID_PADDR.
> 
> Which is part of what worries me - it would be better if the field wouldn't
> ever hold a special init-time-only value.

Right, but that matches the behavior of the hardware itself. On boot
the address of the BAR is not valid, but there's no way AFAIK to
restore the BAR to this state once an address has been written (except
by doing a reset of the device itself).

Roger.
Jan Beulich Aug. 9, 2017, 8:17 a.m. UTC | #5
>>> On 08.08.17 at 14:35, <roger.pau@citrix.com> wrote:
> On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
>> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
>> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> >> > +        /* Size the BAR and map it. */
>> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
>> >> > +                              &addr, &size);
>> >> > +        if ( rc < 0 )
>> >> > +            return rc;
>> >> > +
>> >> > +        if ( size == 0 )
>> >> > +        {
>> >> > +            bars[i].type = VPCI_BAR_EMPTY;
>> >> > +            continue;
>> >> > +        }
>> >> > +
>> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
>> >> 
>> >> This doesn't match up with logic further up: When the memory decode
>> >> bit gets cleared, you don't zap the addresses, so I think you'd better
>> >> store it here too. Use INVALID_PADDR only when the value read has
>> >> all address bits set (same caveat as pointed out earlier).
>> >
>> >OK, note that .addr can only possibly be INVALID_PADDR at
>> >initialization time, once the user has written something to the BAR
>> >.addr will be different than INVALID_PADDR.
>> 
>> Which is part of what worries me - it would be better if the field wouldn't
>> ever hold a special init-time-only value.
> 
> Right, but that matches the behavior of the hardware itself. On boot
> the address of the BAR is not valid, but there's no way AFAIK to
> restore the BAR to this state once an address has been written (except
> by doing a reset of the device itself).

True, but the BARs still hold _some_ value. And hence they can
equally well be made hold a value consistent with normal runtime.

Jan
Roger Pau Monné Aug. 9, 2017, 8:22 a.m. UTC | #6
On Wed, Aug 09, 2017 at 02:17:57AM -0600, Jan Beulich wrote:
> >>> On 08.08.17 at 14:35, <roger.pau@citrix.com> wrote:
> > On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
> >> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
> >> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> >> >> > +        /* Size the BAR and map it. */
> >> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> >> >> > +                              &addr, &size);
> >> >> > +        if ( rc < 0 )
> >> >> > +            return rc;
> >> >> > +
> >> >> > +        if ( size == 0 )
> >> >> > +        {
> >> >> > +            bars[i].type = VPCI_BAR_EMPTY;
> >> >> > +            continue;
> >> >> > +        }
> >> >> > +
> >> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> >> >> 
> >> >> This doesn't match up with logic further up: When the memory decode
> >> >> bit gets cleared, you don't zap the addresses, so I think you'd better
> >> >> store it here too. Use INVALID_PADDR only when the value read has
> >> >> all address bits set (same caveat as pointed out earlier).
> >> >
> >> >OK, note that .addr can only possibly be INVALID_PADDR at
> >> >initialization time, once the user has written something to the BAR
> >> >.addr will be different than INVALID_PADDR.
> >> 
> >> Which is part of what worries me - it would be better if the field wouldn't
> >> ever hold a special init-time-only value.
> > 
> > Right, but that matches the behavior of the hardware itself. On boot
> > the address of the BAR is not valid, but there's no way AFAIK to
> > restore the BAR to this state once an address has been written (except
> > by doing a reset of the device itself).
> 
> True, but the BARs still hold _some_ value. And hence they can
> equally well be made hold a value consistent with normal runtime.

I've changed it to remove the usage of INVALID_PADDR and instead made
the BAR hold the value that Xen finds in the underlying hardware,
without Xen trying to figure out if it's initialized or not.

Roger.
diff mbox

Patch

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@ 
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..3c800c4cf7
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,473 @@ 
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
+     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
+     (x)->addr != INVALID_PADDR)
+
+static struct rangeset *vpci_get_bar_memory(const struct domain *d,
+                                            const struct vpci_bar *map)
+{
+    const struct pci_dev *pdev;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    int rc;
+
+    if ( !mem )
+        return ERR_PTR(-ENOMEM);
+
+    /*
+     * Create a rangeset that represents the current BAR memory region
+     * and compare it against all the currently active BAR memory regions.
+     * If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
+     *
+     * NB: the rangeset uses frames, and if start and end addresses are
+     * equal it means only one frame is used, that's why PFN_DOWN is used
+     * to calculate the end of the rangeset.
+     */
+    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
+                            PFN_DOWN(map->addr + map->size));
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return ERR_PTR(rc);
+    }
+
+    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
+    {
+        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
+                                       PCI_SLOT(pdev->devfn),
+                                       PCI_FUNC(pdev->devfn),
+                                       PCI_COMMAND);
+        unsigned int i;
+
+        /* Check if memory decoding is enabled. */
+        if ( !(cmd & PCI_COMMAND_MEMORY) )
+            continue;
+
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
+
+            if ( bar == map || !MAPPABLE_BAR(bar) ||
+                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
+                                          PFN_DOWN(bar->addr + bar->size)) )
+                continue;
+
+            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
+                                       PFN_DOWN(bar->addr + bar->size));
+            if ( rc )
+            {
+                rangeset_destroy(mem);
+                return ERR_PTR(rc);
+            }
+        }
+    }
+
+    return mem;
+}
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int vpci_map_range(unsigned long s, unsigned long e, void *data)
+{
+    const struct map_data *map = data;
+
+    return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
+}
+
+static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
+                           const bool map)
+{
+    struct rangeset *mem;
+    struct map_data data = { .d = d, .map = map };
+    int rc;
+
+    ASSERT(MAPPABLE_BAR(bar));
+
+    mem = vpci_get_bar_memory(d, bar);
+    if ( IS_ERR(mem) )
+        return -PTR_ERR(mem);
+
+    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
+    rangeset_destroy(mem);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)
+{
+    const struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        const struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( !MAPPABLE_BAR(bar) )
+            continue;
+
+        rc = vpci_modify_bar(pdev->domain, bar, map);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+static void vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    val->u16 = pci_conf_read16(seg, bus, slot, func, reg);
+}
+
+static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    uint16_t cmd = val.u16, current_cmd;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    int rc;
+
+    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
+
+    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
+    {
+        /*
+         * Let the guest play with all the bits directly except for the
+         * memory decoding one.
+         */
+        pci_conf_write16(seg, bus, slot, func, reg, cmd);
+        return;
+    }
+
+    /* Memory space access change. */
+    rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
+                seg, bus, slot, func,
+                cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
+        return;
+    }
+
+    pci_conf_write16(seg, bus, slot, func, reg, cmd);
+}
+
+static void vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( bar->sizing )
+        val->u32 = ~(bar->size - 1) >> (hi ? 32 : 0);
+    else
+        val->u32 = bar->addr >> (hi ? 32 : 0);
+
+    if ( !hi )
+    {
+        val->u32 |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                                : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val->u32 |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+}
+
+static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint32_t wdata = val.u32, size_mask;
+    bool hi = false;
+
+    switch ( bar->type )
+    {
+    case VPCI_BAR_MEM32:
+    case VPCI_BAR_MEM64_LO:
+        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
+        break;
+    case VPCI_BAR_MEM64_HI:
+        size_mask = ~0u;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    if ( (wdata & size_mask) == size_mask )
+    {
+        /* Next reads from this register are going to return the BAR size. */
+        bar->sizing = true;
+        return;
+    }
+
+    /* End previous sizing cycle if any. */
+    bar->sizing = false;
+
+    /*
+     * Ignore attempts to change the position of the BAR if memory decoding is
+     * active.
+     */
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+        return;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( !hi )
+        wdata &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /* Update the relevant part of the BAR address. */
+    bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)wdata << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+        wdata |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                                 PCI_FUNC(pdev->devfn), reg) &
+                                 ~PCI_BASE_ADDRESS_MEM_MASK;
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, wdata);
+}
+
+static void vpci_rom_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    const struct vpci_bar *rom = data;
+
+    if ( rom->sizing )
+        val->u32 = ~(rom->size - 1);
+    else
+        val->u32 = rom->addr;
+
+    val->u32 |= rom->enabled ? PCI_ROM_ADDRESS_ENABLE : 0;
+}
+
+static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    struct vpci_bar *rom = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    const uint32_t wdata = val.u32;
+
+    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
+    {
+        /* Next reads from this register are going to return the BAR size. */
+        rom->sizing = true;
+        return;
+    }
+
+    /* End previous sizing cycle if any. */
+    rom->sizing = false;
+
+    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
+
+    /* Check if memory decoding is enabled. */
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY &&
+         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )
+    {
+        if ( vpci_modify_bar(pdev->domain, rom,
+                             wdata & PCI_ROM_ADDRESS_ENABLE) )
+            return;
+
+        rom->enabled = wdata & PCI_ROM_ADDRESS_ENABLE;
+    }
+
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, wdata);
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint8_t header_type;
+    uint16_t cmd;
+    uint32_t rom_val;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
+    switch ( header_type )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = 6;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = 2;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                                   &bars[i]);
+            if ( rc )
+                return rc;
+
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        /* Size the BAR and map it. */
+        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
+                              &addr, &size);
+        if ( rc < 0 )
+            return rc;
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+            return rc;
+    }
+
+    /* Check expansion ROM. */
+    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
+    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
+        pci_conf_write32(seg, bus, slot, func, rom_reg,
+                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);
+
+    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);
+    if ( rc < 0 )
+        return rc;
+
+    if ( size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
+        if ( rom->enabled )
+            rom->addr = addr;
+        else
+            rom->addr = INVALID_PADDR;
+
+        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
+                               rom);
+        if ( rc )
+            return rc;
+
+        if ( rom->enabled )
+            pci_conf_write32(seg, bus, slot, func, rom_reg, rom_val);
+    }
+
+    if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        rc = vpci_modify_bars(pdev, true);
+        if ( rc )
+            return rc;
+
+        /* Enable memory decoding. */
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+    }
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 5e1b0bb3da..452ee482e8 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -63,6 +63,29 @@  void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
 struct vpci {
     /* Root pointer for the tree of vPCI handlers. */
     struct list_head handlers;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            paddr_t addr;
+            uint64_t size;
+            bool prefetchable;
+            bool sizing;
+            bool enabled;
+        } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
 };
 
 #endif