diff mbox

[v3,1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space

Message ID 20170427143546.14662-2-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monné April 27, 2017, 2:35 p.m. UTC
This functionality is going to reside in vpci.c (and the corresponding vpci.h
header), and should be arch-agnostic. The handlers introduced in this patch
setup the basic functionality required in order to trap accesses to the PCI
config space, and allow decoding the address and finding the corresponding
handler that should handle the access (although no handlers are implemented).

Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup
inside of a x86 HVM file, since that's not shared with other arches.

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen whether
a domain should use the newly introduced vPCI handlers, this is only enabled
for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic functionality
of the vPCI traps can be asserted. This has been proven quite helpful during
development, since the logic to handle partial accesses or accesses that expand
across multiple registers is not trivial.

The handlers for the registers are added to a red-black tree, that indexes them
based on their offset. Since Xen needs to handle partial accesses to the
registers and access that expand across multiple registers the logic in
xen_vpci_{read/write} is kind of convoluted, I've tried to properly comment it
in order to make it easier to understand.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v2:
 - Generalize the PCI address decoding and use it for IOREQ code also.

Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.
---
 .gitignore                        |   4 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  45 ++++
 tools/tests/vpci/emul.h           | 107 +++++++++
 tools/tests/vpci/main.c           | 206 +++++++++++++++++
 xen/arch/arm/xen.lds.S            |   3 +
 xen/arch/x86/domain.c             |  18 +-
 xen/arch/x86/hvm/hvm.c            |   2 +
 xen/arch/x86/hvm/io.c             | 147 ++++++++++++
 xen/arch/x86/hvm/ioreq.c          |   7 +-
 xen/arch/x86/setup.c              |   3 +-
 xen/arch/x86/xen.lds.S            |   3 +
 xen/drivers/Makefile              |   2 +-
 xen/drivers/passthrough/pci.c     |   3 +
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/vpci.c           | 469 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/domain.h  |   3 +
 xen/include/asm-x86/hvm/io.h      |   8 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/pci.h             |   4 +
 xen/include/xen/vpci.h            |  66 ++++++
 23 files changed, 1099 insertions(+), 11 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

Comments

Jan Beulich May 19, 2017, 11:35 a.m. UTC | #1
>>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> --- a/tools/libxl/libxl_x86.c
> +++ b/tools/libxl/libxl_x86.c
> @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
>      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
>          if (d_config->b_info.device_model_version !=
>              LIBXL_DEVICE_MODEL_VERSION_NONE) {
> -            xc_config->emulation_flags = XEN_X86_EMU_ALL;
> +            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);

I can see why you need this, but I'm not sure this is a good model.
Ideally for ordinary HVM guests you'd never have to change this
line. Therefore perhaps it might be a better idea to use a "negative"
flag here.

> --- /dev/null
> +++ b/tools/tests/vpci/Makefile
> @@ -0,0 +1,45 @@
> +
> +XEN_ROOT=$(CURDIR)/../../..
> +include $(XEN_ROOT)/tools/Rules.mk
> +
> +TARGET := test_vpci
> +
> +.PHONY: all
> +all: $(TARGET)
> +
> +.PHONY: run
> +run: $(TARGET)
> +	./$(TARGET) > $(TARGET).out
> +
> +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
> +	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
> +
> +.PHONY: clean
> +clean:
> +	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
> +
> +.PHONY: distclean
> +distclean: clean
> +
> +.PHONY: install
> +install:
> +
> +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> +	sed -e '/#include/d' <$< >$@
> +
> +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
> +	# Trick the compiler so it doesn't complain about missing symbols
> +	sed -e '/#include/d' \
> +	    -e '1s;^;#include "emul.h"\
> +	             const vpci_register_init_t __start_vpci_array[1]\;\
> +	             const vpci_register_init_t __end_vpci_array[1]\;\
> +	             ;' <$< >$@
> +
> +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
> +	sed -e '/#include/d' <$< >$@
> +
> +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
> +	sed -e "/#include/d" \
> +	    -e '1s;^;#include "emul.h"\
> +	             ;' <$< >$@

Plain symlinking and __XEN__ conditionals in the files may be the
easier to follow variant. But I'm no heavily opposed to this one,
I'm merely afraid that further adjustments may end up becoming
necessary down the road, resulting in the rules here to become
more convoluted.

> --- /dev/null
> +++ b/tools/tests/vpci/emul.h
> @@ -0,0 +1,107 @@
> +/*
> + * Unit tests for the generic vPCI handler code.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see 
> <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef _TEST_VPCI_
> +#define _TEST_VPCI_
> +
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <errno.h>
> +#include <assert.h>
> +
> +#define container_of(ptr, type, member) ({                      \
> +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
> +        (type *)( (char *)__mptr - offsetof(type,member) );})

There are a couple of stray blanks (immediately inside parentheses)
here, and a missing one after the comma in offsetof().

> +#include "rbtree.h"
> +
> +struct pci_dev {
> +    struct domain *domain;
> +    struct vpci *vpci;
> +};
> +
> +struct domain {
> +    struct pci_dev pdev;
> +};
> +
> +struct vcpu
> +{
> +    struct domain *domain;
> +};
> +
> +extern struct vcpu v;

This is odd. With ...

> +#define spin_lock(x)
> +#define spin_unlock(x)
> +#define spin_is_locked(x) true
> +
> +#define current (&v)

... this, why don't you simply have

extern struct vcpu *current;

keeping v (or however you mean to name it) static?

> +#define has_vpci(d) true
> +
> +#include "vpci.h"
> +
> +#define xzalloc(type) (type *)calloc(1, sizeof(type))

Missing an outer pair of parentheses.

> +#define xfree(p) free(p)
> +
> +#define EXPORT_SYMBOL(x)

I think we should rather get rid of them from rbtree.c.

> +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev

Missing an outer pair of parentheses again, whereas ...

> +#define atomic_read(x) 1
> +
> +/* Dummy native helpers. Writes are ignored, reads return 1's. */
> +#define pci_conf_read8(...) (0xff)
> +#define pci_conf_read16(...) (0xffff)
> +#define pci_conf_read32(...) (0xffffffff)

... here they're pointless.

> +/* Dummy hooks, write stores data, read fetches it. */
> +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
> +                      union vpci_val *val, void *data)
> +{
> +    uint8_t *priv = data;
> +
> +    val->half_word = *priv;

Half word? Half a word is at least 16 bits on any reasonable
architecture nowadays. Using it for a byte is simply confusing. I'd
suggest naming the fields what they are - u8, u16, and u32.

> +#define VPCI_READ(reg, size, data) \
> +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
> +
> +#define VPCI_READ_CHECK(reg, size, expected) ({ \
> +    uint32_t val;                               \
> +    VPCI_READ(reg, size, &val);                 \
> +    assert(val == expected);                    \
> +    })

Bad indentation - either the }) needs to move left, of the body needs
to move right.

> +#define VPCI_WRITE(reg, size, data) \
> +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))

You using fixed SBDF here, ...

> +#define VPCI_CHECK_REG(reg, size, data) ({      \
> +    VPCI_WRITE(reg, size, data);                \
> +    VPCI_READ_CHECK(reg, size, data);           \
> +    })
> +
> +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
> +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \

... why do you have this strange &d.pdev here? The assumption
that a (fake) domain has a single (fake) PCI device looks pretty odd
anyway - why can't you simply have a global (fake) PCI device?

> +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
> +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
> +
> +int
> +main(int argc, char **argv)
> +{
> +    /* Index storage by offset. */
> +    uint32_t r0 = 0xdeadbeef;
> +    uint8_t r5 = 0xef;
> +    uint8_t r6 = 0xbe;
> +    uint8_t r7 = 0xef;
> +    uint16_t r12 = 0x8696;
> +    int rc;
> +
> +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);

In the context here the macro name is pretty confusing: I'd expect
it to check the register holds the specified value, without also doing
a write. How about VPCI_WRITE_CHECK()?

> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> +    VPCI_READ_CHECK(5, 1, 0xef);
> +    VPCI_CHECK_REG(5, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> +    VPCI_READ_CHECK(6, 1, 0xbe);
> +    VPCI_CHECK_REG(6, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> +    VPCI_READ_CHECK(7, 1, 0xef);
> +    VPCI_CHECK_REG(7, 1, 0xbd);
> +
> +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> +    VPCI_READ_CHECK(12, 2, 0x8696);
> +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> +
> +    /*
> +     * At this point we have the following layout:
> +     *
> +     * 32    24    16     8     0
> +     *  +-----+-----+-----+-----+
> +     *  |          r0           | 0
> +     *  +-----+-----+-----+-----+
> +     *  | r7  |  r6 |  r5 |/////| 32
> +     *  +-----+-----+-----+-----|
> +     *  |///////////////////////| 64
> +     *  +-----------+-----------+
> +     *  |///////////|    r12    | 96
> +     *  +-----------+-----------+
> +     *             ...
> +     *  / = empty.
> +     */
> +
> +    /* Try to add an overlapping register handler. */
> +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> +
> +    /* Try to add a non-aligned register. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> +
> +    /* Try to add a register with wrong size. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> +
> +    /* Try to add a register with missing handlers. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
> +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);

Is that something which really is wrong in all cases? What about e.g.
r/o registers?

> +    /* Read/write of unset register. */
> +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> +    VPCI_READ_CHECK(8, 2, 0xffff);
> +    VPCI_READ_CHECK(8, 1, 0xff);
> +    VPCI_WRITE(10, 2, 0xbeef);
> +    VPCI_READ_CHECK(10, 2, 0xffff);
> +
> +    /* Read of multiple registers */
> +    VPCI_CHECK_REG(7, 1, 0xbd);
> +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);

I think a variant accessing mixed size registers would also be
desirable here. Perhaps it would be best to exhaustively test
all possible variations (there aren't that many after all). Same
for writes and partial accesses (below) then.

> @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
>      handler->ops = &g2m_portio_ops;
>  }
>  
> +/* Do some sanity checks. */
> +static int vpci_access_check(unsigned int reg, unsigned int len)
> +{
> +    /* Check access size. */
> +    if ( len != 1 && len != 2 && len != 4 )
> +    {
> +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> +                 reg, len);

I think many of such gdprintk()s want to go away before this series
gets committed.

> +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,

Plain bool please.

> +                                 const ioreq_t *p)
> +{
> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> +}
> +
> +static int vpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr, uint32_t size, uint64_t *data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, devfn, reg;
> +    uint32_t data32;
> +    int rc;
> +
> +    vpci_lock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = d->arch.hvm_domain.pci_cf8;
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )

Pointless "else".

> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;

You need to write to *data here, or else you need to return
false from vpci_portio_accept() already in this case (but then
you'd need to follow the stdvga model and take the lock
there, releasing it in a .complete handler).

> +    }
> +
> +    /* Decode the PCI address. */
> +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
> +
> +    if ( vpci_access_check(reg, size) || reg >= 0xff )

> 0xff or >= 0x100, but the check is pointless as
hvm_pci_decode_addr() wont return larger values.

> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_UNHANDLEABLE;

I don't think this matches real hardware behavior. If this "fails"
at all, surely by returning all ones.

> +    }
> +
> +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
> +    if ( !rc )
> +        *data = data32;
> +    vpci_unlock(d);

Please set *data outside the locked region.

And since there's no best place to make this other remark - I'd
prefer if you either kept together SBDF in one value when passing
this as arguments to functions, or alternatively pass this as four
values rather than keeping devfn artificially together.

> +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
> +}

Again the question - what's the bare hardware equivalent of
returning X86EMUL_UNHANDLEABLE here?

> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -1177,6 +1177,9 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
>           CF8_ENABLED(cf8) )
>      {
>          uint32_t sbdf, x86_fam;
> +        unsigned int bus, devfn, reg;
> +
> +        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
>  
>          /* PCI config data cycle */
>  
> @@ -1186,9 +1189,7 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
>                                   PCI_FUNC(CF8_BDF(cf8)));

Any reason you don't use bus and devfn (really dev/slot and func)
in the expression the tail of which is visible here?

> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -224,6 +224,9 @@ SECTIONS
>         __start_schedulers_array = .;
>         *(.data.schedulers)
>         __end_schedulers_array = .;
> +       __start_vpci_array = .;
> +       *(.data.vpci)
> +       __end_vpci_array = .;

With vpci.c declaring these const, they should go into .rodata.
With the type name further being vpci_register_init_t it may even
be next to .init.rodata where they belong.

> @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>          devfn += pdev->phantom_stride;
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> +
> +    xen_vpci_add_handlers(pdev);
> }

You're losing an error code here.

> --- /dev/null
> +++ b/xen/drivers/vpci/Makefile
> @@ -0,0 +1 @@
> +obj-y += vpci.o

Without having seen further patches it's not clear whether this really
needs its own directory.

> --- /dev/null
> +++ b/xen/drivers/vpci/vpci.c
> @@ -0,0 +1,469 @@
> +/*
> + * Generic functionality for handling accesses to the PCI configuration space
> + * from guests.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> +#define vpci_init __start_vpci_array

What is this last one good for?

> +/* Internal struct to store the emulated PCI registers. */
> +struct vpci_register {
> +    vpci_read_t read;
> +    vpci_write_t write;

These two are pointers - please change the typedefs so that they're
visibly pointers here. That'll then also allow the typedef to be used to
declare actual handlers, should any such declarations be needed (e.g.
if the same handler can be used by two different source files).

> +    unsigned int size;
> +    unsigned int offset;
> +    void *priv_data;

"private" (shorter and hence easier to type)?

> +    struct rb_node node;
> +};
> +
> +int xen_vpci_add_handlers(struct pci_dev *pdev)

__hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
annotated so).

> +{
> +    int i, rc = 0;

i wants to be unsigned.

> +    if ( !has_vpci(pdev->domain) )
> +        return 0;
> +
> +    pdev->vpci = xzalloc(struct vpci);
> +    if ( !pdev->vpci )
> +        return -ENOMEM;
> +
> +    pdev->vpci->handlers = RB_ROOT;
> +
> +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +    {
> +        rc = vpci_init[i](pdev);
> +        if ( rc )
> +            break;
> +    }
> +
> +    if ( rc )
> +    {
> +        struct rb_node *node = rb_first(&pdev->vpci->handlers);
> +        struct vpci_register *r;

Please move this into the more narrow scope below.

> +        /* Iterate over the tree and cleanup. */
> +        while ( node != NULL )
> +        {
> +            r = container_of(node, struct vpci_register, node);
> +            node = rb_next(node);
> +            rb_erase(&r->node, &pdev->vpci->handlers);
> +            xfree(r);
> +        }
> +        xfree(pdev->vpci);
> +    }
> +
> +    return rc;
> +}
> +
> +static bool vpci_register_overlap(const struct vpci_register *r,
> +                                  unsigned int offset)
> +{
> +    if ( offset >= r->offset && offset < r->offset + r->size )
> +        return true;
> +
> +    return false;

This can be one single return statement.

> +}
> +
> +

Stray double blank lines.

> +static int vpci_register_cmp(const struct vpci_register *r1,
> +                             const struct vpci_register *r2)
> +{
> +    /* Make sure there's no overlap between registers. */
> +    if ( vpci_register_overlap(r1, r2->offset) ||
> +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
> +         vpci_register_overlap(r2, r1->offset) ||
> +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )

Overlap checks can generally be done with just two comparisons,
so I guess the parameters chosen for vpci_register_overlap()
aren't optimal. I guess you don't need the function at all, as you
could do all that's needed here:

    if ( r1->offset < r2->offset + r2->size &&
         r2->offset < r1->offset + r1->size )
        return 0;

The comment of course is somewhat misleading here too, as
returning zero isn't really an error indication.

> +        return 0;
> +
> +    if (r1->offset < r2->offset)
> +        return -1;
> +    else if (r1->offset > r2->offset)
> +        return 1;

Coding style.

> +    ASSERT_UNREACHABLE();
> +    return 0;
> +}
> +
> +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
> +                                                const unsigned int reg,
> +                                                const unsigned int size)
> +{
> +    struct rb_node *node;

const

> +    struct vpci_register r = {
> +        .offset = reg,
> +        .size = size,
> +    };
> +
> +    ASSERT(vpci_locked(pdev->domain));
> +
> +    node = pdev->vpci->handlers.rb_node;
> +    while ( node )
> +    {
> +        struct vpci_register *t =

const

> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> +                          vpci_write_t write_handler, unsigned int offset,
> +                          unsigned int size, void *data)
> +{
> +    struct rb_node **new, *parent;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||

Off by one again in the offset check.

> +         offset & (size - 1) || read_handler == NULL || write_handler == NULL )

As said, I'm not convinced either of the read or write handlers
being NULL is really a mistake. Both of them being NULL surely
is.

> +        return -EINVAL;
> +
> +    r = xzalloc(struct vpci_register);

Looks like xmalloc() would be fine here - you initialize all fields.

> +    if ( !r )
> +        return -ENOMEM;
> +
> +    r->read = read_handler;
> +    r->write = write_handler;
> +    r->size = size;
> +    r->offset = offset;
> +    r->priv_data = data;
> +
> +    vpci_lock(pdev->domain);
> +    new = &pdev->vpci->handlers.rb_node;
> +    parent = NULL;
> +
> +    while (*new) {

Coding style.

> +        struct vpci_register *this =

const

> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
> +{
> +    struct vpci_register *r;
> +
> +    vpci_lock(pdev->domain);
> +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);

I'm not sure about this - is there anything wrong with the caller,
knowing the size, also passing it? You could then even refuse
requests to remove a register where (offset,size) doesn't match
the recorded values (as vpci_find_register() will return any
overlapping one).

> +    if ( !r )
> +    {
> +        vpci_unlock(pdev->domain);
> +        return -ENOENT;
> +    }
> +
> +    rb_erase(&r->node, &pdev->vpci->handlers);
> +    xfree(r);
> +    vpci_unlock(pdev->domain);

Please swap xfree() and unlock.

> +static void vpci_read_hw(unsigned int seg, unsigned int bus,
> +                         unsigned int devfn, unsigned int reg, uint32_t size,
> +                         uint32_t *data)

Instead of passing a pointer to the result, please consider returning
the value, as the function doesn't return anything at present.

> +{
> +    switch ( size )
> +    {
> +    case 4:
> +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                                reg);
> +        break;
> +    case 3:
> +        /*
> +         * This is possible because a 4byte read can have 1byte trapped and
> +         * the rest passed-through.
> +         */
> +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                                reg + 1) << 8;
> +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> +                               reg);

Which of the two parts to read with read16() should depend on the
low bit of reg. Also for maximum compatibility I'd strongly suggest
reading the low part before the high one.

> +/* Helper macros for the read/write handlers. */
> +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)

What do e and s stand for here?

> +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8

And at least o here?

> +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)

And d, s, and o here?

Also I can't see what addition you would want to perform below.
All you ought to do are ANDs and ORs.

> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,

The function being other than void, same question as earlier:
What's the bare hardware equivalent of this returning other
than zero?

> +                  unsigned int reg, uint32_t size, uint32_t *data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    union vpci_val val = { .double_word = 0 };
> +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> +    uint32_t tmp_data;
> +    int rc;
> +
> +    ASSERT(vpci_locked(d));
> +
> +    *data = 0;
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);

What about the global PCI devices lock here? While VT-d code,
perhaps wrongly, doesn't acquire that lock prior to calling the
function, all callers in passthrough/pci.c do or verify it is being
held.

> +    if ( !pdev )
> +        goto passthrough;
> +
> +    /* Find the vPCI register handler. */
> +    r = vpci_find_register(pdev, reg, size);

With the overlap handling in vpci_find_register() I can't see how
this would reliably return the correct (lowest) register when the
request spans multiple ones.

> +    if ( !r )
> +        goto passthrough;
> +
> +    if ( r->offset > reg )
> +    {
> +        /*
> +         * There's a heading gap into the emulated register.
> +         * NB: it's possible for this recursive call to have a size of 3.
> +         */
> +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);

I'm not particularly happy to see recursion being used here, even if
that's not going to be very deep. Both qemu and pciback get away
without, iirc, and while it's not the neatest code I find qemu's easier
to follow than the apparently written from scratch variant here. Is
there a particular reason you didn't at least take what is there as a
basis?

> +        if ( rc )
> +            return rc;
> +
> +        /* Add the head read to the partial result. */
> +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
> +        data_lshift = r->offset - reg;
> +
> +        /* Account for the read. */
> +        size -= data_lshift;
> +        reg += data_lshift;
> +    }
> +    else if ( r->offset < reg )
> +        /* There's an offset into the emulated register */
> +        data_rshift = reg - r->offset;

This could be a plain else, avoiding another conditional branch.

> +    ASSERT(data_lshift == 0 || data_rshift == 0);
> +    data_size = min(size, r->size - data_rshift);
> +    ASSERT(data_size != 0);
> +
> +    /* Perform the read of the register. */
> +    rc = r->read(pdev, r->offset, &val, r->priv_data);
> +    if ( rc )
> +        return rc;
> +
> +    val.double_word >>= data_rshift * 8;
> +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
> +
> +    /* Account for the read */
> +    size -= data_size;
> +    reg += data_size;
> +
> +    /* Read the remaining, if any. */
> +    if ( size > 0 )
> +    {
> +        /*
> +         * Read tailing data.

trailing?

> +static int vpci_write_helper(struct pci_dev *pdev,
> +                             const struct vpci_register *r, unsigned int size,
> +                             unsigned int offset, uint32_t data)
> +{
> +    union vpci_val val = { .double_word = data };
> +    int rc;
> +
> +    ASSERT(size <= r->size);
> +    if ( size != r->size )
> +    {
> +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> +        if ( rc )
> +            return rc;
> +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> +        data &= GENMASK_BYTES(size, 0);
> +        val.double_word |= data << (offset * 8);
> +    }
> +
> +    return r->write(pdev, r->offset, val, r->priv_data);
> +}

I'm not sure that writing back the value read is correct in all cases
(think of write-only or rw1c registers or even offsets where reads
and writes access different registers altogether). I think the write
handlers will need to be made capable of dealing with partial writes.

> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                   unsigned int reg, uint32_t size, uint32_t data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_size, data_offset = 0;
> +    int rc;
> +
> +    ASSERT(vpci_locked(d));
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> +    if ( !pdev )
> +        goto passthrough;
> +
> +    /* Find the vPCI register handler. */
> +    r = vpci_find_register(pdev, reg, size);
> +    if ( !r )
> +        goto passthrough;
> +
> +    else if ( r->offset > reg )

Pointless "else" again, even more so with the blank line in between.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -13,6 +13,7 @@
>  #include <xen/irq.h>
>  #include <xen/pci_regs.h>
>  #include <xen/pfn.h>
> +#include <xen/rbtree.h>

Why? All you add to this file is ...

> @@ -88,6 +89,9 @@ struct pci_dev {
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
>      u64 vf_rlen[6];
> +
> +    /* Data for vPCI. */
> +    struct vpci *vpci;

... this. I guess you really want to add the #include ...

> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,66 @@
> +#ifndef _VPCI_
> +#define _VPCI_
> +
> +#include <xen/pci.h>
> +#include <xen/types.h>

... here.

> +/* Helpers for locking/unlocking. */
> +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)

While for the code layering you don't need recursive locks, did you
consider using them nevertheless so that spin_is_locked() return
values are actually meaningful for your purposes?

> +#define REGISTER_VPCI_INIT(x) \
> +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x

To match up with the type name and assuming "REGISTER" here
means the PCI register rather than "registration", I think this
would better be VPCI_REGISTER() (I don't really mind the _INIT
suffix, but I think it's relatively pointless).

> +/* Add vPCI handlers to device. */
> +int xen_vpci_add_handlers(struct pci_dev *dev);
> +
> +/* Add/remove a register handler. */
> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> +                          vpci_write_t write_handler, unsigned int offset,
> +                          unsigned int size, void *data);
> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
> +
> +/* Generic read/write handlers for the PCI config space. */
> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                  unsigned int reg, uint32_t size, uint32_t *data);
> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> +                   unsigned int reg, uint32_t size, uint32_t data);

Along the lines of what I've said in a few places about return values,
please carefully consider where they're needed. Once you decide
they are really needed, the respective functions would likely want to
become __must_check.

Jan
Roger Pau Monné May 29, 2017, 12:57 p.m. UTC | #2
On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,
> >      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
> >          if (d_config->b_info.device_model_version !=
> >              LIBXL_DEVICE_MODEL_VERSION_NONE) {
> > -            xc_config->emulation_flags = XEN_X86_EMU_ALL;
> > +            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
> 
> I can see why you need this, but I'm not sure this is a good model.
> Ideally for ordinary HVM guests you'd never have to change this
> line. Therefore perhaps it might be a better idea to use a "negative"
> flag here.

I would expect that at some point HVM guests are also going to use the
Xen internal PCI emulation for IOREQs (XEN_DMOP_IO_RANGE_PCI), and for
PCI passthrough, so having VPCI disabled is only temporary (long term
HVM guests should again use XEN_X86_EMU_ALL).

> > --- /dev/null
> > +++ b/tools/tests/vpci/Makefile
> > @@ -0,0 +1,45 @@
> > +
> > +XEN_ROOT=$(CURDIR)/../../..
> > +include $(XEN_ROOT)/tools/Rules.mk
> > +
> > +TARGET := test_vpci
> > +
> > +.PHONY: all
> > +all: $(TARGET)
> > +
> > +.PHONY: run
> > +run: $(TARGET)
> > +	./$(TARGET) > $(TARGET).out
> > +
> > +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
> > +	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
> > +
> > +.PHONY: clean
> > +clean:
> > +	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
> > +
> > +.PHONY: distclean
> > +distclean: clean
> > +
> > +.PHONY: install
> > +install:
> > +
> > +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> > +	sed -e '/#include/d' <$< >$@
> > +
> > +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
> > +	# Trick the compiler so it doesn't complain about missing symbols
> > +	sed -e '/#include/d' \
> > +	    -e '1s;^;#include "emul.h"\
> > +	             const vpci_register_init_t __start_vpci_array[1]\;\
> > +	             const vpci_register_init_t __end_vpci_array[1]\;\
> > +	             ;' <$< >$@
> > +
> > +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
> > +	sed -e '/#include/d' <$< >$@
> > +
> > +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
> > +	sed -e "/#include/d" \
> > +	    -e '1s;^;#include "emul.h"\
> > +	             ;' <$< >$@
> 
> Plain symlinking and __XEN__ conditionals in the files may be the
> easier to follow variant. But I'm no heavily opposed to this one,
> I'm merely afraid that further adjustments may end up becoming
> necessary down the road, resulting in the rules here to become
> more convoluted.

Yes, I'm not opposed to doing that, I just think the code is cleaner
using this rather than adding __XEN__ conditionals, but I agree that
if this becomes too convoluted at some point I would be in favor of
using conditionals instead.

> > --- /dev/null
> > +++ b/tools/tests/vpci/emul.h
> > @@ -0,0 +1,107 @@
> > +/*
> > + * Unit tests for the generic vPCI handler code.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see 
> > <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _TEST_VPCI_
> > +#define _TEST_VPCI_
> > +
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <errno.h>
> > +#include <assert.h>
> > +
> > +#define container_of(ptr, type, member) ({                      \
> > +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
> > +        (type *)( (char *)__mptr - offsetof(type,member) );})
> 
> There are a couple of stray blanks (immediately inside parentheses)
> here, and a missing one after the comma in offsetof().

Sorry, I've picked this as-is from the Xen headers (kernel.h). I've
changed it to:

#define container_of(ptr, type, member) ({                      \
        typeof(((type *)0)->member) *__mptr = (ptr);            \
        (type *)((char *)__mptr - offsetof(type, member));})

> > +#include "rbtree.h"
> > +
> > +struct pci_dev {
> > +    struct domain *domain;
> > +    struct vpci *vpci;
> > +};
> > +
> > +struct domain {
> > +    struct pci_dev pdev;
> > +};
> > +
> > +struct vcpu
> > +{
> > +    struct domain *domain;
> > +};
> > +
> > +extern struct vcpu v;
> 
> This is odd. With ...
> 
> > +#define spin_lock(x)
> > +#define spin_unlock(x)
> > +#define spin_is_locked(x) true
> > +
> > +#define current (&v)
> 
> ... this, why don't you simply have
> 
> extern struct vcpu *current;
> 
> keeping v (or however you mean to name it) static?

Done.

> > +#define has_vpci(d) true
> > +
> > +#include "vpci.h"
> > +
> > +#define xzalloc(type) (type *)calloc(1, sizeof(type))
> 
> Missing an outer pair of parentheses.

Done.

> > +#define xfree(p) free(p)
> > +
> > +#define EXPORT_SYMBOL(x)
> 
> I think we should rather get rid of them from rbtree.c.
> 
> > +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
> 
> Missing an outer pair of parentheses again, whereas ...

OK, I'm adding a pre-patch then to get rid of them.

> > +#define atomic_read(x) 1
> > +
> > +/* Dummy native helpers. Writes are ignored, reads return 1's. */
> > +#define pci_conf_read8(...) (0xff)
> > +#define pci_conf_read16(...) (0xffff)
> > +#define pci_conf_read32(...) (0xffffffff)
> 
> ... here they're pointless.

Thanks, removed.

> > +/* Dummy hooks, write stores data, read fetches it. */
> > +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
> > +                      union vpci_val *val, void *data)
> > +{
> > +    uint8_t *priv = data;
> > +
> > +    val->half_word = *priv;
> 
> Half word? Half a word is at least 16 bits on any reasonable
> architecture nowadays. Using it for a byte is simply confusing. I'd
> suggest naming the fields what they are - u8, u16, and u32.

Right, I can do that. I was using this nomenclature because that's
what the PCI specification uses:

half word: 8b
word: 16b
double word: 32b

Will change it to u8, u16 and u32.

> > +#define VPCI_READ(reg, size, data) \
> > +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
> > +
> > +#define VPCI_READ_CHECK(reg, size, expected) ({ \
> > +    uint32_t val;                               \
> > +    VPCI_READ(reg, size, &val);                 \
> > +    assert(val == expected);                    \
> > +    })
> 
> Bad indentation - either the }) needs to move left, of the body needs
> to move right.

I will move the }) left.

> > +#define VPCI_WRITE(reg, size, data) \
> > +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
> 
> You using fixed SBDF here, ...
> 
> > +#define VPCI_CHECK_REG(reg, size, data) ({      \
> > +    VPCI_WRITE(reg, size, data);                \
> > +    VPCI_READ_CHECK(reg, size, data);           \
> > +    })
> > +
> > +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
> > +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \
> 
> ... why do you have this strange &d.pdev here? The assumption
> that a (fake) domain has a single (fake) PCI device looks pretty odd
> anyway - why can't you simply have a global (fake) PCI device?

Yes, I can do that. Now the root is the pci_dev itself instead of the
domain.

> > +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
> > +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
> > +
> > +int
> > +main(int argc, char **argv)
> > +{
> > +    /* Index storage by offset. */
> > +    uint32_t r0 = 0xdeadbeef;
> > +    uint8_t r5 = 0xef;
> > +    uint8_t r6 = 0xbe;
> > +    uint8_t r7 = 0xef;
> > +    uint16_t r12 = 0x8696;
> > +    int rc;
> > +
> > +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> > +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> > +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
> 
> In the context here the macro name is pretty confusing: I'd expect
> it to check the register holds the specified value, without also doing
> a write. How about VPCI_WRITE_CHECK()?

Completely agree, the read macro is already VPCI_READ_CHECK, so it
makes sense for the write one to be VPCI_WRITE_CHECK IMHO.

> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> > +    VPCI_READ_CHECK(5, 1, 0xef);
> > +    VPCI_CHECK_REG(5, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> > +    VPCI_READ_CHECK(6, 1, 0xbe);
> > +    VPCI_CHECK_REG(6, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> > +    VPCI_READ_CHECK(7, 1, 0xef);
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +
> > +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> > +    VPCI_READ_CHECK(12, 2, 0x8696);
> > +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> > +
> > +    /*
> > +     * At this point we have the following layout:
> > +     *
> > +     * 32    24    16     8     0
> > +     *  +-----+-----+-----+-----+
> > +     *  |          r0           | 0
> > +     *  +-----+-----+-----+-----+
> > +     *  | r7  |  r6 |  r5 |/////| 32
> > +     *  +-----+-----+-----+-----|
> > +     *  |///////////////////////| 64
> > +     *  +-----------+-----------+
> > +     *  |///////////|    r12    | 96
> > +     *  +-----------+-----------+
> > +     *             ...
> > +     *  / = empty.
> > +     */
> > +
> > +    /* Try to add an overlapping register handler. */
> > +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> > +
> > +    /* Try to add a non-aligned register. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> > +
> > +    /* Try to add a register with wrong size. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> > +
> > +    /* Try to add a register with missing handlers. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
> > +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
> 
> Is that something which really is wrong in all cases? What about e.g.
> r/o registers?

My initial though was that r/o registers should register something
like a noop write handler, that could be shared between all of them. I
can certainly allow registration of handlers without a read or write
function (but not both), and make it a noop (writes will be ignored,
reads will return 1's).

> > +    /* Read/write of unset register. */
> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> > +    VPCI_READ_CHECK(8, 1, 0xff);
> > +    VPCI_WRITE(10, 2, 0xbeef);
> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> > +
> > +    /* Read of multiple registers */
> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> 
> I think a variant accessing mixed size registers would also be
> desirable here. Perhaps it would be best to exhaustively test
> all possible variations (there aren't that many after all). Same
> for writes and partial accesses (below) then.

So you mean to scan the whole space (from 0 to 128 on this test) using
all possible register sizes for both read and write? That would indeed
be feasible.

> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
> >      handler->ops = &g2m_portio_ops;
> >  }
> >  
> > +/* Do some sanity checks. */
> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> > +{
> > +    /* Check access size. */
> > +    if ( len != 1 && len != 2 && len != 4 )
> > +    {
> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> > +                 reg, len);
> 
> I think many of such gdprintk()s want to go away before this series
> gets committed.

OK, I've found them useful while developing, but I guess they are not
really useful outside from that context. I guess there's no way to
leave them in place, maybe a Kconfig option?

> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
> 
> Plain bool please.

Sadly struct hvm_io_ops (which is where this function is used) expects
a bool_t as return.

> 
> > +                                 const ioreq_t *p)
> > +{
> > +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> > +}
> > +
> > +static int vpci_portio_read(const struct hvm_io_handler *handler,
> > +                            uint64_t addr, uint32_t size, uint64_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    unsigned int bus, devfn, reg;
> > +    uint32_t data32;
> > +    int rc;
> > +
> > +    vpci_lock(d);
> > +    if ( addr == 0xcf8 )
> > +    {
> > +        ASSERT(size == 4);
> > +        *data = d->arch.hvm_domain.pci_cf8;
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> 
> Pointless "else".
> 
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> 
> You need to write to *data here, or else you need to return
> false from vpci_portio_accept() already in this case (but then
> you'd need to follow the stdvga model and take the lock
> there, releasing it in a .complete handler).

Right, this should return 1's in this case then.

> > +    }
> > +
> > +    /* Decode the PCI address. */
> > +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
> > +
> > +    if ( vpci_access_check(reg, size) || reg >= 0xff )
> 
> > 0xff or >= 0x100, but the check is pointless as
> hvm_pci_decode_addr() wont return larger values.

Right, this check is pointless.

> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_UNHANDLEABLE;
> 
> I don't think this matches real hardware behavior. If this "fails"
> at all, surely by returning all ones.

Yes, I've changed this to return 1's and X86EMUL_OKAY.

> > +    }
> > +
> > +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
> > +    if ( !rc )
> > +        *data = data32;
> > +    vpci_unlock(d);
> 
> Please set *data outside the locked region.
> 
> And since there's no best place to make this other remark - I'd
> prefer if you either kept together SBDF in one value when passing
> this as arguments to functions, or alternatively pass this as four
> values rather than keeping devfn artificially together.

I guess I prefer the 4 values, that was my first approach until I
realized pci_dev internally stores devfn in a single variable.

> > +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
> > +}
> 
> Again the question - what's the bare hardware equivalent of
> returning X86EMUL_UNHANDLEABLE here?

All 1's I assume (or other random garbage). Would you be OK with me
adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?

> > --- a/xen/arch/x86/hvm/ioreq.c
> > +++ b/xen/arch/x86/hvm/ioreq.c
> > @@ -1177,6 +1177,9 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
> >           CF8_ENABLED(cf8) )
> >      {
> >          uint32_t sbdf, x86_fam;
> > +        unsigned int bus, devfn, reg;
> > +
> > +        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
> >  
> >          /* PCI config data cycle */
> >  
> > @@ -1186,9 +1189,7 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
> >                                   PCI_FUNC(CF8_BDF(cf8)));
> 
> Any reason you don't use bus and devfn (really dev/slot and func)
> in the expression the tail of which is visible here?

Good catch, thanks.

> > --- a/xen/arch/x86/xen.lds.S
> > +++ b/xen/arch/x86/xen.lds.S
> > @@ -224,6 +224,9 @@ SECTIONS
> >         __start_schedulers_array = .;
> >         *(.data.schedulers)
> >         __end_schedulers_array = .;
> > +       __start_vpci_array = .;
> > +       *(.data.vpci)
> > +       __end_vpci_array = .;
> 
> With vpci.c declaring these const, they should go into .rodata.
> With the type name further being vpci_register_init_t it may even
> be next to .init.rodata where they belong.

Done, I've placed them in .rodata now.

I don't think it makes sense to place them in .init.rodata, the _init_
tag means these functions are supposed to be used to initialize the
vPCI handlers for devices, but they could be run after Xen has
finished the initialization, for example if MMCFG areas are discovered
by Dom0 with new devices (I know this is not yet implemented).

> > @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
> >          devfn += pdev->phantom_stride;
> >      } while ( devfn != pdev->devfn &&
> >                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> > +
> > +    xen_vpci_add_handlers(pdev);
> > }
> 
> You're losing an error code here.

Fixed, thanks.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/Makefile
> > @@ -0,0 +1 @@
> > +obj-y += vpci.o
> 
> Without having seen further patches it's not clear whether this really
> needs its own directory.

vPCI emulation handlers (for PCI headers, capabilities, MSI...) will
also go into this folder.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/vpci.c
> > @@ -0,0 +1,469 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI configuration space
> > + * from guests.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +
> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> > +#define vpci_init __start_vpci_array
> 
> What is this last one good for?

It's used by xen_vpci_add_handlers in order to call the init
functions, I can drop it and call __start_vpci_array[i](...) if that's
better.

> > +/* Internal struct to store the emulated PCI registers. */
> > +struct vpci_register {
> > +    vpci_read_t read;
> > +    vpci_write_t write;
> 
> These two are pointers - please change the typedefs so that they're
> visibly pointers here. That'll then also allow the typedef to be used to
> declare actual handlers, should any such declarations be needed (e.g.
> if the same handler can be used by two different source files).
> 
> > +    unsigned int size;
> > +    unsigned int offset;
> > +    void *priv_data;
> 
> "private" (shorter and hence easier to type)?

Done to both the above comments.

> > +    struct rb_node node;
> > +};
> > +
> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
> 
> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
> annotated so).

OK, so you really want the init handlers to be inside of the
.init.rodata section then.

> > +{
> > +    int i, rc = 0;
> 
> i wants to be unsigned.

Rightfully.

> > +    if ( !has_vpci(pdev->domain) )
> > +        return 0;
> > +
> > +    pdev->vpci = xzalloc(struct vpci);
> > +    if ( !pdev->vpci )
> > +        return -ENOMEM;
> > +
> > +    pdev->vpci->handlers = RB_ROOT;
> > +
> > +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> > +    {
> > +        rc = vpci_init[i](pdev);
> > +        if ( rc )
> > +            break;
> > +    }
> > +
> > +    if ( rc )
> > +    {
> > +        struct rb_node *node = rb_first(&pdev->vpci->handlers);
> > +        struct vpci_register *r;
> 
> Please move this into the more narrow scope below.

Done.

> > +        /* Iterate over the tree and cleanup. */
> > +        while ( node != NULL )
> > +        {
> > +            r = container_of(node, struct vpci_register, node);
> > +            node = rb_next(node);
> > +            rb_erase(&r->node, &pdev->vpci->handlers);
> > +            xfree(r);
> > +        }
> > +        xfree(pdev->vpci);
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static bool vpci_register_overlap(const struct vpci_register *r,
> > +                                  unsigned int offset)
> > +{
> > +    if ( offset >= r->offset && offset < r->offset + r->size )
> > +        return true;
> > +
> > +    return false;
> 
> This can be one single return statement.
>
> > +}
> > +
> > +
> 
> Stray double blank lines.

Fixed both.

> > +static int vpci_register_cmp(const struct vpci_register *r1,
> > +                             const struct vpci_register *r2)
> > +{
> > +    /* Make sure there's no overlap between registers. */
> > +    if ( vpci_register_overlap(r1, r2->offset) ||
> > +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
> > +         vpci_register_overlap(r2, r1->offset) ||
> > +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
> 
> Overlap checks can generally be done with just two comparisons,
> so I guess the parameters chosen for vpci_register_overlap()
> aren't optimal. I guess you don't need the function at all, as you
> could do all that's needed here:
> 
>     if ( r1->offset < r2->offset + r2->size &&
>          r2->offset < r1->offset + r1->size )
>         return 0;
> 
> The comment of course is somewhat misleading here too, as
> returning zero isn't really an error indication.

I've fixed this and changed the comment to "Return 0 if registers
overlap".

> > +        return 0;
> > +
> > +    if (r1->offset < r2->offset)
> > +        return -1;
> > +    else if (r1->offset > r2->offset)
> > +        return 1;
> 
> Coding style.

Fixed (and removed the pointless else).

> > +    ASSERT_UNREACHABLE();
> > +    return 0;
> > +}
> > +
> > +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
> > +                                                const unsigned int reg,
> > +                                                const unsigned int size)
> > +{
> > +    struct rb_node *node;
> 
> const
> 
> > +    struct vpci_register r = {
> > +        .offset = reg,
> > +        .size = size,
> > +    };
> > +
> > +    ASSERT(vpci_locked(pdev->domain));
> > +
> > +    node = pdev->vpci->handlers.rb_node;
> > +    while ( node )
> > +    {
> > +        struct vpci_register *t =
> 
> const

Making both of them const means the return must also be const, and
that's not suitable by some of the consumers (ie:
xen_vpci_remove_register for example).

> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data)
> > +{
> > +    struct rb_node **new, *parent;
> > +    struct vpci_register *r;
> > +
> > +    /* Some sanity checks. */
> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
> 
> Off by one again in the offset check.

Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
PCI_MAX_REGISTER? 

> > +         offset & (size - 1) || read_handler == NULL || write_handler == NULL )
> 
> As said, I'm not convinced either of the read or write handlers
> being NULL is really a mistake. Both of them being NULL surely
> is.

Right (see above reply).

> > +        return -EINVAL;
> > +
> > +    r = xzalloc(struct vpci_register);
> 
> Looks like xmalloc() would be fine here - you initialize all fields.

Yes.

> > +    if ( !r )
> > +        return -ENOMEM;
> > +
> > +    r->read = read_handler;
> > +    r->write = write_handler;
> > +    r->size = size;
> > +    r->offset = offset;
> > +    r->priv_data = data;
> > +
> > +    vpci_lock(pdev->domain);
> > +    new = &pdev->vpci->handlers.rb_node;
> > +    parent = NULL;
> > +
> > +    while (*new) {
> 
> Coding style.
> 
> > +        struct vpci_register *this =
> 
> const

Done for both.

> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
> > +{
> > +    struct vpci_register *r;
> > +
> > +    vpci_lock(pdev->domain);
> > +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
> 
> I'm not sure about this - is there anything wrong with the caller,
> knowing the size, also passing it? You could then even refuse
> requests to remove a register where (offset,size) doesn't match
> the recorded values (as vpci_find_register() will return any
> overlapping one).

Well, I think the important bit is to check that what
vpci_find_register returns matches what the called expects to
remove.

> > +    if ( !r )
> > +    {
> > +        vpci_unlock(pdev->domain);
> > +        return -ENOENT;
> > +    }
> > +
> > +    rb_erase(&r->node, &pdev->vpci->handlers);
> > +    xfree(r);
> > +    vpci_unlock(pdev->domain);
> 
> Please swap xfree() and unlock.
> 
> > +static void vpci_read_hw(unsigned int seg, unsigned int bus,
> > +                         unsigned int devfn, unsigned int reg, uint32_t size,
> > +                         uint32_t *data)
> 
> Instead of passing a pointer to the result, please consider returning
> the value, as the function doesn't return anything at present.
> 
> > +{
> > +    switch ( size )
> > +    {
> > +    case 4:
> > +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg);
> > +        break;
> > +    case 3:
> > +        /*
> > +         * This is possible because a 4byte read can have 1byte trapped and
> > +         * the rest passed-through.
> > +         */
> > +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                                reg + 1) << 8;
> > +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> > +                               reg);
> 
> Which of the two parts to read with read16() should depend on the
> low bit of reg. Also for maximum compatibility I'd strongly suggest
> reading the low part before the high one.

Right. Changed it to:

if ( reg & 1 )
{
    data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                          reg);
    data |= pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                            reg + 1) << 8;
}
else
{
    data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg);
    data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                           reg + 2) << 16;
}

> > +/* Helper macros for the read/write handlers. */
> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
> 
> What do e and s stand for here?

e = end, s = start (in bytes).

> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
> 
> And at least o here?

o = offset (in bytes)

> > +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)
> 
> And d, s, and o here?
> 
> Also I can't see what addition you would want to perform below.
> All you ought to do are ANDs and ORs.

It's adding new data to a partial output (ie: because the output might
be split across several registers).

r = result to update
d = new data to add
s = size of the new data to add
o = offset of the newly added data in the partial result.

I will add all this as a comment to the macros:

/*
 * Helper macros for the read/write handlers (all input values are in bytes).
 *
 * GENMASK_BYTES: generate a bitmask from the byte pointed by s to the byte
 * pointed by e.
 *
 * SHIFT_RIGHT_BYTES: shift a value pointed by d the number of bytes
 * specified in o.
 *
 * ADD_RESULT: append a result to another variable pointed by r. d is the
 * variable that contains the data to be added, s contains the size of the
 * data to add, and finally o is the offset of such data in the destination
 * (r).
 */

I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
if you think it's going to make it easier to understand.

> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> 
> The function being other than void, same question as earlier:
> What's the bare hardware equivalent of this returning other
> than zero?

I though it would be useful to have some more fine-grained error
reporting if that's ever needed, although as you say, from a hardware
point of view any errors will be simply reported as the value obtained
being all 1's.

The question is, should this already return all 1's, or should the
called translate failures into all 1's?

> > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    union vpci_val val = { .double_word = 0 };
> > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > +    uint32_t tmp_data;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    *data = 0;
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> 
> What about the global PCI devices lock here? While VT-d code,
> perhaps wrongly, doesn't acquire that lock prior to calling the
> function, all callers in passthrough/pci.c do or verify it is being
> held.

Right, I will add an assert then.

> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> 
> With the overlap handling in vpci_find_register() I can't see how
> this would reliably return the correct (lowest) register when the
> request spans multiple ones.

It doesn't need to, if there's a lower register it will be found by
the recursive call to xen_vpci_read done below (before calling into
the handler pointed by r).

> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    if ( r->offset > reg )
> > +    {
> > +        /*
> > +         * There's a heading gap into the emulated register.
> > +         * NB: it's possible for this recursive call to have a size of 3.
> > +         */
> > +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);
> 
> I'm not particularly happy to see recursion being used here, even if
> that's not going to be very deep. Both qemu and pciback get away
> without, iirc, and while it's not the neatest code I find qemu's easier
> to follow than the apparently written from scratch variant here. Is
> there a particular reason you didn't at least take what is there as a
> basis?

I've been thinking about this, and I've used a RB tree because I
wanted the searches to be fast, but due to RB properties if we want to
emulate a region that expends across multiple registers Xen needs to
issue several vpci_find_register, which will always start from the
root of the Rb tree, making this slower in this case.

Using a sorted linked list will allow Xen to only perform the search
once, and then continue from the first register until all registers
needed in order to complete the emulation have been found.

In resume, I think that using a sorted linked list to store the
registers here would be better indeed, and it would also get rid of
the recursion.

> > +        if ( rc )
> > +            return rc;
> > +
> > +        /* Add the head read to the partial result. */
> > +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
> > +        data_lshift = r->offset - reg;
> > +
> > +        /* Account for the read. */
> > +        size -= data_lshift;
> > +        reg += data_lshift;
> > +    }
> > +    else if ( r->offset < reg )
> > +        /* There's an offset into the emulated register */
> > +        data_rshift = reg - r->offset;
> 
> This could be a plain else, avoiding another conditional branch.

This is likely going to change in any case...

> > +    ASSERT(data_lshift == 0 || data_rshift == 0);
> > +    data_size = min(size, r->size - data_rshift);
> > +    ASSERT(data_size != 0);
> > +
> > +    /* Perform the read of the register. */
> > +    rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    val.double_word >>= data_rshift * 8;
> > +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
> > +
> > +    /* Account for the read */
> > +    size -= data_size;
> > +    reg += data_size;
> > +
> > +    /* Read the remaining, if any. */
> > +    if ( size > 0 )
> > +    {
> > +        /*
> > +         * Read tailing data.
> 
> trailing?

Yes.

> > +static int vpci_write_helper(struct pci_dev *pdev,
> > +                             const struct vpci_register *r, unsigned int size,
> > +                             unsigned int offset, uint32_t data)
> > +{
> > +    union vpci_val val = { .double_word = data };
> > +    int rc;
> > +
> > +    ASSERT(size <= r->size);
> > +    if ( size != r->size )
> > +    {
> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> > +        if ( rc )
> > +            return rc;
> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> > +        data &= GENMASK_BYTES(size, 0);
> > +        val.double_word |= data << (offset * 8);
> > +    }
> > +
> > +    return r->write(pdev, r->offset, val, r->priv_data);
> > +}
> 
> I'm not sure that writing back the value read is correct in all cases
> (think of write-only or rw1c registers or even offsets where reads
> and writes access different registers altogether). I think the write
> handlers will need to be made capable of dealing with partial writes.

That seems to be what pciback does fro writes: read, merge value,
write back (drivers/xen/xen-pciback/conf_space.c
xen_pcibk_config_write):

err = conf_space_read(dev, cfg_entry, field_start,
		      &tmp_val);
if (err)
	break;

tmp_val = merge_value(tmp_val, value, get_mask(size),
		      offset - field_start);

err = conf_space_write(dev, cfg_entry, field_start,
		       tmp_val);

I would prefer to do it this way in order to avoid adding more
complexity to the handlers themselves. So far I haven't found any such
registers (rw1c) in the PCI config space, do you have references to
any of them?

> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    unsigned int data_size, data_offset = 0;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > +    if ( !pdev )
> > +        goto passthrough;
> > +
> > +    /* Find the vPCI register handler. */
> > +    r = vpci_find_register(pdev, reg, size);
> > +    if ( !r )
> > +        goto passthrough;
> > +
> > +    else if ( r->offset > reg )
> 
> Pointless "else" again, even more so with the blank line in between.
> 
> > --- a/xen/include/xen/pci.h
> > +++ b/xen/include/xen/pci.h
> > @@ -13,6 +13,7 @@
> >  #include <xen/irq.h>
> >  #include <xen/pci_regs.h>
> >  #include <xen/pfn.h>
> > +#include <xen/rbtree.h>
> 
> Why? All you add to this file is ...
> 
> > @@ -88,6 +89,9 @@ struct pci_dev {
> >  #define PT_FAULT_THRESHOLD 10
> >      } fault;
> >      u64 vf_rlen[6];
> > +
> > +    /* Data for vPCI. */
> > +    struct vpci *vpci;
> 
> ... this. I guess you really want to add the #include ...
> 
> > --- /dev/null
> > +++ b/xen/include/xen/vpci.h
> > @@ -0,0 +1,66 @@
> > +#ifndef _VPCI_
> > +#define _VPCI_
> > +
> > +#include <xen/pci.h>
> > +#include <xen/types.h>
> 
> ... here.

Right, the RB is going to be replaced with a linked-list anyway, but
the list header should be included here.

> > +/* Helpers for locking/unlocking. */
> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> 
> While for the code layering you don't need recursive locks, did you
> consider using them nevertheless so that spin_is_locked() return
> values are actually meaningful for your purposes?

I'm not sure I follow, spin_is_locked already return meaningful values
for my purpose AFAICT.

> > +#define REGISTER_VPCI_INIT(x) \
> > +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
> 
> To match up with the type name and assuming "REGISTER" here
> means the PCI register rather than "registration", I think this
> would better be VPCI_REGISTER() (I don't really mind the _INIT
> suffix, but I think it's relatively pointless).

OK, that's fine.

> > +/* Add vPCI handlers to device. */
> > +int xen_vpci_add_handlers(struct pci_dev *dev);
> > +
> > +/* Add/remove a register handler. */
> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> > +                          vpci_write_t write_handler, unsigned int offset,
> > +                          unsigned int size, void *data);
> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
> > +
> > +/* Generic read/write handlers for the PCI config space. */
> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                  unsigned int reg, uint32_t size, uint32_t *data);
> > +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                   unsigned int reg, uint32_t size, uint32_t data);
> 
> Along the lines of what I've said in a few places about return values,
> please carefully consider where they're needed. Once you decide
> they are really needed, the respective functions would likely want to
> become __must_check.

From a emulation PoV all those errors are going to be reported to the
VM as ignored writes or reads returning all 1's.

The question is where/who should do this translation. I though it
would be helpful to do this in the trap handlers themselves, and let
the vpci code return more meaningful error value. If you think it's
not going to be helpful I can remove them.

Roger.
Jan Beulich May 29, 2017, 2:16 p.m. UTC | #3
>>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
> On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
>> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> > +    /* Read/write of unset register. */
>> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
>> > +    VPCI_READ_CHECK(8, 2, 0xffff);
>> > +    VPCI_READ_CHECK(8, 1, 0xff);
>> > +    VPCI_WRITE(10, 2, 0xbeef);
>> > +    VPCI_READ_CHECK(10, 2, 0xffff);
>> > +
>> > +    /* Read of multiple registers */
>> > +    VPCI_CHECK_REG(7, 1, 0xbd);
>> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
>> 
>> I think a variant accessing mixed size registers would also be
>> desirable here. Perhaps it would be best to exhaustively test
>> all possible variations (there aren't that many after all). Same
>> for writes and partial accesses (below) then.
> 
> So you mean to scan the whole space (from 0 to 128 on this test) using
> all possible register sizes for both read and write? That would indeed
> be feasible.

Not sure what the "from 0 to 128" is meant to apply to. What I mean
is to test all combinations of (mix of) register sizes and access sizes.
I.e. all combinations making up a single 4-byte field ((1,1,1,1),
(1,1,2), (2,1,1), (2,2), (4)) and all four 1-byte accesses, both 2-byte
ones, and the sole possible 4-byte one.

>> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
>> >      handler->ops = &g2m_portio_ops;
>> >  }
>> >  
>> > +/* Do some sanity checks. */
>> > +static int vpci_access_check(unsigned int reg, unsigned int len)
>> > +{
>> > +    /* Check access size. */
>> > +    if ( len != 1 && len != 2 && len != 4 )
>> > +    {
>> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
>> > +                 reg, len);
>> 
>> I think many of such gdprintk()s want to go away before this series
>> gets committed.
> 
> OK, I've found them useful while developing, but I guess they are not
> really useful outside from that context. I guess there's no way to
> leave them in place, maybe a Kconfig option?

That seems overkill. I wouldn't so much mind the messages (they
get compiled out for non-debug builds anyway), but the clutter
they introduce to code: In some cases half of your functions are
the invocation of gdprintk() ...

>> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
>> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
>> 
>> Plain bool please.
> 
> Sadly struct hvm_io_ops (which is where this function is used) expects
> a bool_t as return.

I don't follow - bool_t is simply a typedef of bool nowadays, and
typedefs are all equivalent as far as the C type system goes.

>> Again the question - what's the bare hardware equivalent of
>> returning X86EMUL_UNHANDLEABLE here?
> 
> All 1's I assume (or other random garbage). Would you be OK with me
> adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?

That would probably be okay (despite my dislike of labels and goto-s).

>> > +#include <xen/sched.h>
>> > +#include <xen/vpci.h>
>> > +
>> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
>> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>> > +#define vpci_init __start_vpci_array
>> 
>> What is this last one good for?
> 
> It's used by xen_vpci_add_handlers in order to call the init
> functions, I can drop it and call __start_vpci_array[i](...) if that's
> better.

Well, if there were several users, I could see the benefit of an
abbreviating #define. But for a single user the #define adds
more code / clutter than is being saved on the use site.

>> > +    struct rb_node node;
>> > +};
>> > +
>> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
>> 
>> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
>> annotated so).
> 
> OK, so you really want the init handlers to be inside of the
> .init.rodata section then.

Only if that's correct, and it is correct as long as all possible call trees
root in a __hwdom_init function. (To avoid misunderstanding: This
clearly can't be .init.rodata uniformly, as __hwdom_init isn't always
an alias of __init).

>> > +    struct vpci_register r = {
>> > +        .offset = reg,
>> > +        .size = size,
>> > +    };
>> > +
>> > +    ASSERT(vpci_locked(pdev->domain));
>> > +
>> > +    node = pdev->vpci->handlers.rb_node;
>> > +    while ( node )
>> > +    {
>> > +        struct vpci_register *t =
>> 
>> const
> 
> Making both of them const means the return must also be const, and
> that's not suitable by some of the consumers (ie:
> xen_vpci_remove_register for example).

In that case there's no choice then, okay.

>> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,

Btw., I only now notice this further strange xen_ prefix here.

>> > +                          vpci_write_t write_handler, unsigned int offset,
>> > +                          unsigned int size, void *data)
>> > +{
>> > +    struct rb_node **new, *parent;
>> > +    struct vpci_register *r;
>> > +
>> > +    /* Some sanity checks. */
>> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
>> 
>> Off by one again in the offset check.
> 
> Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
> PCI_MAX_REGISTER? 

Could be done, but then we need one for base and one for
extended config space. May want to check whether Linux has
invented some good names for these by now.

>> > +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
>> > +{
>> > +    struct vpci_register *r;
>> > +
>> > +    vpci_lock(pdev->domain);
>> > +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
>> 
>> I'm not sure about this - is there anything wrong with the caller,
>> knowing the size, also passing it? You could then even refuse
>> requests to remove a register where (offset,size) doesn't match
>> the recorded values (as vpci_find_register() will return any
>> overlapping one).
> 
> Well, I think the important bit is to check that what
> vpci_find_register returns matches what the called expects to
> remove.

Correct. Debuggability would call for checking both offset and size.

>> > +/* Helper macros for the read/write handlers. */
>> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
>> 
>> What do e and s stand for here?
> 
> e = end, s = start (in bytes).

And where do you start counting. Having seen the rest of the
series I'm actually rather unconvinced use these macros results
in better code - to me, plain hex numbers are quite a bit easier
to read as long as the number of digits doesn't go meaningfully
beyond 10 or so.

>> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
>> 
>> And at least o here?
> 
> o = offset (in bytes)

I think simply writhing the shift expression is once again more
clear to the reader than using a macro which is longer to read
and type and which has semantics which aren't immediately
clear from its name.

> I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
> if you think it's going to make it easier to understand.

I'd prefer if the name "merge" appeared in that name - I don't see
this being usable strictly only to append to either side of a value.

>> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
>> 
>> The function being other than void, same question as earlier:
>> What's the bare hardware equivalent of this returning other
>> than zero?
> 
> I though it would be useful to have some more fine-grained error
> reporting if that's ever needed, although as you say, from a hardware
> point of view any errors will be simply reported as the value obtained
> being all 1's.
> 
> The question is, should this already return all 1's, or should the
> called translate failures into all 1's?

If you leave this to the callers, overall code would likely become
less readable, so I'd prefer this to be done in a central place.

>> > +    /* Find the vPCI register handler. */
>> > +    r = vpci_find_register(pdev, reg, size);
>> 
>> With the overlap handling in vpci_find_register() I can't see how
>> this would reliably return the correct (lowest) register when the
>> request spans multiple ones.
> 
> It doesn't need to, if there's a lower register it will be found by
> the recursive call to xen_vpci_read done below (before calling into
> the handler pointed by r).

Since that's quite non-obvious, to me this is another argument
against using recursion here.

>> > +static int vpci_write_helper(struct pci_dev *pdev,
>> > +                             const struct vpci_register *r, unsigned int size,
>> > +                             unsigned int offset, uint32_t data)
>> > +{
>> > +    union vpci_val val = { .double_word = data };
>> > +    int rc;
>> > +
>> > +    ASSERT(size <= r->size);
>> > +    if ( size != r->size )
>> > +    {
>> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
>> > +        if ( rc )
>> > +            return rc;
>> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
>> > +        data &= GENMASK_BYTES(size, 0);
>> > +        val.double_word |= data << (offset * 8);
>> > +    }
>> > +
>> > +    return r->write(pdev, r->offset, val, r->priv_data);
>> > +}
>> 
>> I'm not sure that writing back the value read is correct in all cases
>> (think of write-only or rw1c registers or even offsets where reads
>> and writes access different registers altogether). I think the write
>> handlers will need to be made capable of dealing with partial writes.
> 
> That seems to be what pciback does fro writes: read, merge value,
> write back (drivers/xen/xen-pciback/conf_space.c
> xen_pcibk_config_write):
> 
> err = conf_space_read(dev, cfg_entry, field_start,
> 		      &tmp_val);
> if (err)
> 	break;
> 
> tmp_val = merge_value(tmp_val, value, get_mask(size),
> 		      offset - field_start);
> 
> err = conf_space_write(dev, cfg_entry, field_start,
> 		       tmp_val);
> 
> I would prefer to do it this way in order to avoid adding more
> complexity to the handlers themselves. So far I haven't found any such
> registers (rw1c) in the PCI config space, do you have references to
> any of them?

The status register.

>> > +/* Helpers for locking/unlocking. */
>> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
>> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
>> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
>> 
>> While for the code layering you don't need recursive locks, did you
>> consider using them nevertheless so that spin_is_locked() return
>> values are actually meaningful for your purposes?
> 
> I'm not sure I follow, spin_is_locked already return meaningful values
> for my purpose AFAICT.

For non-recursive locks this tells you whether _any_ CPU holds
the lock, yet normally you want to know whether the CPU you
run on does.

Jan
Roger Pau Monné May 29, 2017, 3:05 p.m. UTC | #4
On Mon, May 29, 2017 at 08:16:41AM -0600, Jan Beulich wrote:
> >>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
> > On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
> >> > +    /* Read/write of unset register. */
> >> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> >> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> >> > +    VPCI_READ_CHECK(8, 1, 0xff);
> >> > +    VPCI_WRITE(10, 2, 0xbeef);
> >> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> >> > +
> >> > +    /* Read of multiple registers */
> >> > +    VPCI_CHECK_REG(7, 1, 0xbd);
> >> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> >> 
> >> I think a variant accessing mixed size registers would also be
> >> desirable here. Perhaps it would be best to exhaustively test
> >> all possible variations (there aren't that many after all). Same
> >> for writes and partial accesses (below) then.
> > 
> > So you mean to scan the whole space (from 0 to 128 on this test) using
> > all possible register sizes for both read and write? That would indeed
> > be feasible.
> 
> Not sure what the "from 0 to 128" is meant to apply to. What I mean
> is to test all combinations of (mix of) register sizes and access sizes.
> I.e. all combinations making up a single 4-byte field ((1,1,1,1),
> (1,1,2), (2,1,1), (2,2), (4)) and all four 1-byte accesses, both 2-byte
> ones, and the sole possible 4-byte one.

OK, thanks for the clarification, now I get it.

> >> > @@ -256,6 +257,152 @@ void register_g2m_portio_handler(struct domain *d)
> >> >      handler->ops = &g2m_portio_ops;
> >> >  }
> >> >  
> >> > +/* Do some sanity checks. */
> >> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> >> > +{
> >> > +    /* Check access size. */
> >> > +    if ( len != 1 && len != 2 && len != 4 )
> >> > +    {
> >> > +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
> >> > +                 reg, len);
> >> 
> >> I think many of such gdprintk()s want to go away before this series
> >> gets committed.
> > 
> > OK, I've found them useful while developing, but I guess they are not
> > really useful outside from that context. I guess there's no way to
> > leave them in place, maybe a Kconfig option?
> 
> That seems overkill. I wouldn't so much mind the messages (they
> get compiled out for non-debug builds anyway), but the clutter
> they introduce to code: In some cases half of your functions are
> the invocation of gdprintk() ...

Let me try to prune some of them.

> >> > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> >> > +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
> >> 
> >> Plain bool please.
> > 
> > Sadly struct hvm_io_ops (which is where this function is used) expects
> > a bool_t as return.
> 
> I don't follow - bool_t is simply a typedef of bool nowadays, and
> typedefs are all equivalent as far as the C type system goes.

Clearly my bad, I assumed they where actually different types.

> >> Again the question - what's the bare hardware equivalent of
> >> returning X86EMUL_UNHANDLEABLE here?
> > 
> > All 1's I assume (or other random garbage). Would you be OK with me
> > adding a "fail" label that sets data to all 1's and returns X86EMUL_OKAY?
> 
> That would probably be okay (despite my dislike of labels and goto-s).

Maybe the goto is not needed after all if vpci_read returns the data
filled with 1's and no error code.

> >> > +#include <xen/sched.h>
> >> > +#include <xen/vpci.h>
> >> > +
> >> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> >> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> >> > +#define vpci_init __start_vpci_array
> >> 
> >> What is this last one good for?
> > 
> > It's used by xen_vpci_add_handlers in order to call the init
> > functions, I can drop it and call __start_vpci_array[i](...) if that's
> > better.
> 
> Well, if there were several users, I could see the benefit of an
> abbreviating #define. But for a single user the #define adds
> more code / clutter than is being saved on the use site.

Ack.

> >> > +    struct rb_node node;
> >> > +};
> >> > +
> >> > +int xen_vpci_add_handlers(struct pci_dev *pdev)
> >> 
> >> __hwdom_init (I notice setup_one_hwdom_device() wrongly isn't
> >> annotated so).
> > 
> > OK, so you really want the init handlers to be inside of the
> > .init.rodata section then.
> 
> Only if that's correct, and it is correct as long as all possible call trees
> root in a __hwdom_init function. (To avoid misunderstanding: This
> clearly can't be .init.rodata uniformly, as __hwdom_init isn't always
> an alias of __init).

OK.

> >> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
> 
> Btw., I only now notice this further strange xen_ prefix here.

I assume this should be vpci_*, (dropping the xen_ prefix uniformly).

> >> > +                          vpci_write_t write_handler, unsigned int offset,
> >> > +                          unsigned int size, void *data)
> >> > +{
> >> > +    struct rb_node **new, *parent;
> >> > +    struct vpci_register *r;
> >> > +
> >> > +    /* Some sanity checks. */
> >> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
> >> 
> >> Off by one again in the offset check.
> > 
> > Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
> > PCI_MAX_REGISTER? 
> 
> Could be done, but then we need one for base and one for
> extended config space. May want to check whether Linux has
> invented some good names for these by now.

pci_regs.h from Linux now has:

/*
 * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
 * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
 * configuration space.
 */
#define PCI_CFG_SPACE_SIZE	256
#define PCI_CFG_SPACE_EXP_SIZE	4096

At the top. Do you think those defines are fine for importing? (this
was introduced by cc10385b6fde3, but I don't think importing this in a
more formal way makes sense).

> >> > +/* Helper macros for the read/write handlers. */
> >> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
> >> 
> >> What do e and s stand for here?
> > 
> > e = end, s = start (in bytes).
> 
> And where do you start counting. Having seen the rest of the
> series I'm actually rather unconvinced use these macros results
> in better code - to me, plain hex numbers are quite a bit easier
> to read as long as the number of digits doesn't go meaningfully
> beyond 10 or so.
> 
> >> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
> >> 
> >> And at least o here?
> > 
> > o = offset (in bytes)
> 
> I think simply writhing the shift expression is once again more
> clear to the reader than using a macro which is longer to read
> and type and which has semantics which aren't immediately
> clear from its name.
> 
> > I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
> > if you think it's going to make it easier to understand.
> 
> I'd prefer if the name "merge" appeared in that name - I don't see
> this being usable strictly only to append to either side of a value.

OK MERGE_RESULT or MERGE_REGISTER maybe? (and the rest removed).

> >> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> >> 
> >> The function being other than void, same question as earlier:
> >> What's the bare hardware equivalent of this returning other
> >> than zero?
> > 
> > I though it would be useful to have some more fine-grained error
> > reporting if that's ever needed, although as you say, from a hardware
> > point of view any errors will be simply reported as the value obtained
> > being all 1's.
> > 
> > The question is, should this already return all 1's, or should the
> > called translate failures into all 1's?
> 
> If you leave this to the callers, overall code would likely become
> less readable, so I'd prefer this to be done in a central place.

I will change the prototypes/code of vpci_{read/write} so no error
code is returned (and in the read case the value on error is going to
be 1's).

> >> > +    /* Find the vPCI register handler. */
> >> > +    r = vpci_find_register(pdev, reg, size);
> >> 
> >> With the overlap handling in vpci_find_register() I can't see how
> >> this would reliably return the correct (lowest) register when the
> >> request spans multiple ones.
> > 
> > It doesn't need to, if there's a lower register it will be found by
> > the recursive call to xen_vpci_read done below (before calling into
> > the handler pointed by r).
> 
> Since that's quite non-obvious, to me this is another argument
> against using recursion here.

Yes, will switch to a sorted list and no recursion.

> >> > +static int vpci_write_helper(struct pci_dev *pdev,
> >> > +                             const struct vpci_register *r, unsigned int size,
> >> > +                             unsigned int offset, uint32_t data)
> >> > +{
> >> > +    union vpci_val val = { .double_word = data };
> >> > +    int rc;
> >> > +
> >> > +    ASSERT(size <= r->size);
> >> > +    if ( size != r->size )
> >> > +    {
> >> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
> >> > +        if ( rc )
> >> > +            return rc;
> >> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
> >> > +        data &= GENMASK_BYTES(size, 0);
> >> > +        val.double_word |= data << (offset * 8);
> >> > +    }
> >> > +
> >> > +    return r->write(pdev, r->offset, val, r->priv_data);
> >> > +}
> >> 
> >> I'm not sure that writing back the value read is correct in all cases
> >> (think of write-only or rw1c registers or even offsets where reads
> >> and writes access different registers altogether). I think the write
> >> handlers will need to be made capable of dealing with partial writes.
> > 
> > That seems to be what pciback does fro writes: read, merge value,
> > write back (drivers/xen/xen-pciback/conf_space.c
> > xen_pcibk_config_write):
> > 
> > err = conf_space_read(dev, cfg_entry, field_start,
> > 		      &tmp_val);
> > if (err)
> > 	break;
> > 
> > tmp_val = merge_value(tmp_val, value, get_mask(size),
> > 		      offset - field_start);
> > 
> > err = conf_space_write(dev, cfg_entry, field_start,
> > 		       tmp_val);
> > 
> > I would prefer to do it this way in order to avoid adding more
> > complexity to the handlers themselves. So far I haven't found any such
> > registers (rw1c) in the PCI config space, do you have references to
> > any of them?
> 
> The status register.

But the status register is not going to be trapped, not by Dom0 or
DomUs? None of the registers that I've emulated for the header, or the
capabilities behave this way, and adding such and offset would
greatly increase the complexity of each handler IMHO.

Maybe it would be easier to add a flag to mark rw1c registers as such,
if that's ever needed? (and avoid the read in that case)

> >> > +/* Helpers for locking/unlocking. */
> >> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
> >> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
> >> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> >> 
> >> While for the code layering you don't need recursive locks, did you
> >> consider using them nevertheless so that spin_is_locked() return
> >> values are actually meaningful for your purposes?
> > 
> > I'm not sure I follow, spin_is_locked already return meaningful values
> > for my purpose AFAICT.
> 
> For non-recursive locks this tells you whether _any_ CPU holds
> the lock, yet normally you want to know whether the CPU you
> run on does.

Indeed, so if I make the lock recursive spin_is_locked is going to
return whether the current CPU holds the lock. That's kind of
counter-intuitive.

Roger.
Jan Beulich May 29, 2017, 3:26 p.m. UTC | #5
>>> On 29.05.17 at 17:05, <roger.pau@citrix.com> wrote:
> On Mon, May 29, 2017 at 08:16:41AM -0600, Jan Beulich wrote:
>> >>> On 29.05.17 at 14:57, <roger.pau@citrix.com> wrote:
>> > On Fri, May 19, 2017 at 05:35:47AM -0600, Jan Beulich wrote:
>> >> >>> On 27.04.17 at 16:35, <roger.pau@citrix.com> wrote:
>> >> > +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
>> 
>> Btw., I only now notice this further strange xen_ prefix here.
> 
> I assume this should be vpci_*, (dropping the xen_ prefix uniformly).

Yes please.

>> >> > +                          vpci_write_t write_handler, unsigned int offset,
>> >> > +                          unsigned int size, void *data)
>> >> > +{
>> >> > +    struct rb_node **new, *parent;
>> >> > +    struct vpci_register *r;
>> >> > +
>> >> > +    /* Some sanity checks. */
>> >> > +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
>> >> 
>> >> Off by one again in the offset check.
>> > 
>> > Fixed to be > 0xfff. Should this maybe be added to pci_regs.h as
>> > PCI_MAX_REGISTER? 
>> 
>> Could be done, but then we need one for base and one for
>> extended config space. May want to check whether Linux has
>> invented some good names for these by now.
> 
> pci_regs.h from Linux now has:
> 
> /*
>  * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
>  * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
>  * configuration space.
>  */
> #define PCI_CFG_SPACE_SIZE	256
> #define PCI_CFG_SPACE_EXP_SIZE	4096
> 
> At the top. Do you think those defines are fine for importing?

Sure.

> (this
> was introduced by cc10385b6fde3, but I don't think importing this in a
> more formal way makes sense).

Agreed.

>> >> > +/* Helper macros for the read/write handlers. */
>> >> > +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
>> >> 
>> >> What do e and s stand for here?
>> > 
>> > e = end, s = start (in bytes).
>> 
>> And where do you start counting. Having seen the rest of the
>> series I'm actually rather unconvinced use these macros results
>> in better code - to me, plain hex numbers are quite a bit easier
>> to read as long as the number of digits doesn't go meaningfully
>> beyond 10 or so.
>> 
>> >> > +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
>> >> 
>> >> And at least o here?
>> > 
>> > o = offset (in bytes)
>> 
>> I think simply writhing the shift expression is once again more
>> clear to the reader than using a macro which is longer to read
>> and type and which has semantics which aren't immediately
>> clear from its name.
>> 
>> > I can rename ADD_RESULT to APPEND_RESULT or something more descriptive
>> > if you think it's going to make it easier to understand.
>> 
>> I'd prefer if the name "merge" appeared in that name - I don't see
>> this being usable strictly only to append to either side of a value.
> 
> OK MERGE_RESULT or MERGE_REGISTER maybe? (and the rest removed).

Either name is fine with me, with a slight preference to the former.

>> >> > +static int vpci_write_helper(struct pci_dev *pdev,
>> >> > +                             const struct vpci_register *r, unsigned int size,
>> >> > +                             unsigned int offset, uint32_t data)
>> >> > +{
>> >> > +    union vpci_val val = { .double_word = data };
>> >> > +    int rc;
>> >> > +
>> >> > +    ASSERT(size <= r->size);
>> >> > +    if ( size != r->size )
>> >> > +    {
>> >> > +        rc = r->read(pdev, r->offset, &val, r->priv_data);
>> >> > +        if ( rc )
>> >> > +            return rc;
>> >> > +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
>> >> > +        data &= GENMASK_BYTES(size, 0);
>> >> > +        val.double_word |= data << (offset * 8);
>> >> > +    }
>> >> > +
>> >> > +    return r->write(pdev, r->offset, val, r->priv_data);
>> >> > +}
>> >> 
>> >> I'm not sure that writing back the value read is correct in all cases
>> >> (think of write-only or rw1c registers or even offsets where reads
>> >> and writes access different registers altogether). I think the write
>> >> handlers will need to be made capable of dealing with partial writes.
>> > 
>> > That seems to be what pciback does fro writes: read, merge value,
>> > write back (drivers/xen/xen-pciback/conf_space.c
>> > xen_pcibk_config_write):
>> > 
>> > err = conf_space_read(dev, cfg_entry, field_start,
>> > 		      &tmp_val);
>> > if (err)
>> > 	break;
>> > 
>> > tmp_val = merge_value(tmp_val, value, get_mask(size),
>> > 		      offset - field_start);
>> > 
>> > err = conf_space_write(dev, cfg_entry, field_start,
>> > 		       tmp_val);
>> > 
>> > I would prefer to do it this way in order to avoid adding more
>> > complexity to the handlers themselves. So far I haven't found any such
>> > registers (rw1c) in the PCI config space, do you have references to
>> > any of them?
>> 
>> The status register.
> 
> But the status register is not going to be trapped, not by Dom0 or
> DomUs? None of the registers that I've emulated for the header, or the
> capabilities behave this way, and adding such and offset would
> greatly increase the complexity of each handler IMHO.
> 
> Maybe it would be easier to add a flag to mark rw1c registers as such,
> if that's ever needed? (and avoid the read in that case)

Well, yes, that's how qemu does it. Of course with the caveat that
you need to mark individual bits (again as qemu does). Of course,
as long as no such register is being emulated, leaving a prominent
comment would probably be an acceptable alternative to coding
all this right away.

>> >> > +/* Helpers for locking/unlocking. */
>> >> > +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
>> >> > +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
>> >> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
>> >> 
>> >> While for the code layering you don't need recursive locks, did you
>> >> consider using them nevertheless so that spin_is_locked() return
>> >> values are actually meaningful for your purposes?
>> > 
>> > I'm not sure I follow, spin_is_locked already return meaningful values
>> > for my purpose AFAICT.
>> 
>> For non-recursive locks this tells you whether _any_ CPU holds
>> the lock, yet normally you want to know whether the CPU you
>> run on does.
> 
> Indeed, so if I make the lock recursive spin_is_locked is going to
> return whether the current CPU holds the lock. That's kind of
> counter-intuitive.

Counter-intuitive or not, it's a result of non-recursive locks being
more slim (and hence slightly faster). And as long as
spin_is_locked() is being used in ASSERT()s only, it's better than
no sanity checking at all.

Jan
diff mbox

Patch

diff --git a/.gitignore b/.gitignore
index 74747cb7e7..ebafba25b5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -236,6 +236,10 @@  tools/tests/regression/build/*
 tools/tests/regression/downloads/*
 tools/tests/mem-sharing/memshrtool
 tools/tests/mce-test/tools/xen-mceinj
+tools/tests/vpci/rbtree.[hc]
+tools/tests/vpci/vpci.[hc]
+tools/tests/vpci/test_vpci.out
+tools/tests/vpci/test_vpci
 tools/xcutils/lsevtchn
 tools/xcutils/readnotes
 tools/xenbackendd/_paths.h
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 455f6f0bed..dd7fc78a99 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -11,7 +11,7 @@  int libxl__arch_domain_prepare_config(libxl__gc *gc,
     if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
         if (d_config->b_info.device_model_version !=
             LIBXL_DEVICE_MODEL_VERSION_NONE) {
-            xc_config->emulation_flags = XEN_X86_EMU_ALL;
+            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
         } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) {
             /*
              * HVM guests without device model may want
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 639776130b..5cfe781e62 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -13,6 +13,7 @@  endif
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 SUBDIRS-y += xen-access
 SUBDIRS-y += xenstore
+SUBDIRS-$(CONFIG_HAS_PCI) += vpci
 
 .PHONY: all clean install distclean
 all clean distclean: %: subdirs-%
diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
new file mode 100644
index 0000000000..7969fcbd82
--- /dev/null
+++ b/tools/tests/vpci/Makefile
@@ -0,0 +1,45 @@ 
+
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test_vpci
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+	./$(TARGET) > $(TARGET).out
+
+$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
+	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
+
+.PHONY: clean
+clean:
+	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
+
+.PHONY: distclean
+distclean: clean
+
+.PHONY: install
+install:
+
+vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
+	sed -e '/#include/d' <$< >$@
+
+vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
+	# Trick the compiler so it doesn't complain about missing symbols
+	sed -e '/#include/d' \
+	    -e '1s;^;#include "emul.h"\
+	             const vpci_register_init_t __start_vpci_array[1]\;\
+	             const vpci_register_init_t __end_vpci_array[1]\;\
+	             ;' <$< >$@
+
+rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
+	sed -e '/#include/d' <$< >$@
+
+rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
+	sed -e "/#include/d" \
+	    -e '1s;^;#include "emul.h"\
+	             ;' <$< >$@
+
diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
new file mode 100644
index 0000000000..85897ed43b
--- /dev/null
+++ b/tools/tests/vpci/emul.h
@@ -0,0 +1,107 @@ 
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _TEST_VPCI_
+#define _TEST_VPCI_
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <assert.h>
+
+#define container_of(ptr, type, member) ({                      \
+        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
+        (type *)( (char *)__mptr - offsetof(type,member) );})
+
+#include "rbtree.h"
+
+struct pci_dev {
+    struct domain *domain;
+    struct vpci *vpci;
+};
+
+struct domain {
+    struct pci_dev pdev;
+};
+
+struct vcpu
+{
+    struct domain *domain;
+};
+
+extern struct vcpu v;
+
+#define spin_lock(x)
+#define spin_unlock(x)
+#define spin_is_locked(x) true
+
+#define current (&v)
+
+#define has_vpci(d) true
+
+#include "vpci.h"
+
+#define xzalloc(type) (type *)calloc(1, sizeof(type))
+#define xfree(p) free(p)
+
+#define EXPORT_SYMBOL(x)
+
+#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
+
+#define atomic_read(x) 1
+
+/* Dummy native helpers. Writes are ignored, reads return 1's. */
+#define pci_conf_read8(...) (0xff)
+#define pci_conf_read16(...) (0xffff)
+#define pci_conf_read32(...) (0xffffffff)
+#define pci_conf_write8(...)
+#define pci_conf_write16(...)
+#define pci_conf_write32(...)
+
+#define BUG() assert(0)
+#define ASSERT_UNREACHABLE() assert(0)
+#define ASSERT(x) assert(x)
+
+#ifdef _LP64
+#define BITS_PER_LONG 64
+#else
+#define BITS_PER_LONG 32
+#endif
+#define GENMASK(h, l) \
+    (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+
+#define min(x,y) ({ \
+        const typeof(x) _x = (x);       \
+        const typeof(y) _y = (y);       \
+        (void) (&_x == &_y);            \
+        _x < _y ? _x : _y; })
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
new file mode 100644
index 0000000000..0fc63de038
--- /dev/null
+++ b/tools/tests/vpci/main.c
@@ -0,0 +1,206 @@ 
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "emul.h"
+
+/* Single vcpu (current), and single domain with a single PCI device. */
+static struct vpci vpci = {
+    .handlers = RB_ROOT,
+};
+
+static struct domain d = {
+    .pdev.domain = &d,
+    .pdev.vpci = &vpci,
+};
+
+struct vcpu v = { .domain = &d };
+
+/* Dummy hooks, write stores data, read fetches it. */
+static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
+                      union vpci_val *val, void *data)
+{
+    uint8_t *priv = data;
+
+    val->half_word = *priv;
+    return 0;
+}
+
+static int vpci_write8(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val val, void *data)
+{
+    uint8_t *priv = data;
+
+    *priv = val.half_word;
+    return 0;
+}
+
+static int vpci_read16(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint16_t *priv = data;
+
+    val->word = *priv;
+    return 0;
+}
+
+static int vpci_write16(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint16_t *priv = data;
+
+    *priv = val.word;
+    return 0;
+}
+
+static int vpci_read32(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint32_t *priv = data;
+
+    val->double_word = *priv;
+    return 0;
+}
+
+static int vpci_write32(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint32_t *priv = data;
+
+    *priv = val.double_word;
+    return 0;
+}
+
+#define VPCI_READ(reg, size, data) \
+    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
+
+#define VPCI_READ_CHECK(reg, size, expected) ({ \
+    uint32_t val;                               \
+    VPCI_READ(reg, size, &val);                 \
+    assert(val == expected);                    \
+    })
+
+#define VPCI_WRITE(reg, size, data) \
+    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
+
+#define VPCI_CHECK_REG(reg, size, data) ({      \
+    VPCI_WRITE(reg, size, data);                \
+    VPCI_READ_CHECK(reg, size, data);           \
+    })
+
+#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
+    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \
+
+#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
+    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
+
+int
+main(int argc, char **argv)
+{
+    /* Index storage by offset. */
+    uint32_t r0 = 0xdeadbeef;
+    uint8_t r5 = 0xef;
+    uint8_t r6 = 0xbe;
+    uint8_t r7 = 0xef;
+    uint16_t r12 = 0x8696;
+    int rc;
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
+    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
+    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
+    VPCI_READ_CHECK(5, 1, 0xef);
+    VPCI_CHECK_REG(5, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
+    VPCI_READ_CHECK(6, 1, 0xbe);
+    VPCI_CHECK_REG(6, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
+    VPCI_READ_CHECK(7, 1, 0xef);
+    VPCI_CHECK_REG(7, 1, 0xbd);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
+    VPCI_READ_CHECK(12, 2, 0x8696);
+    VPCI_READ_CHECK(12, 4, 0xffff8696);
+
+    /*
+     * At this point we have the following layout:
+     *
+     * 32    24    16     8     0
+     *  +-----+-----+-----+-----+
+     *  |          r0           | 0
+     *  +-----+-----+-----+-----+
+     *  | r7  |  r6 |  r5 |/////| 32
+     *  +-----+-----+-----+-----|
+     *  |///////////////////////| 64
+     *  +-----------+-----------+
+     *  |///////////|    r12    | 96
+     *  +-----------+-----------+
+     *             ...
+     *  / = empty.
+     */
+
+    /* Try to add an overlapping register handler. */
+    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
+
+    /* Try to add a non-aligned register. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
+
+    /* Try to add a register with wrong size. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
+
+    /* Try to add a register with missing handlers. */
+    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
+    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
+
+    /* Read/write of unset register. */
+    VPCI_READ_CHECK(8, 4, 0xffffffff);
+    VPCI_READ_CHECK(8, 2, 0xffff);
+    VPCI_READ_CHECK(8, 1, 0xff);
+    VPCI_WRITE(10, 2, 0xbeef);
+    VPCI_READ_CHECK(10, 2, 0xffff);
+
+    /* Read of multiple registers */
+    VPCI_CHECK_REG(7, 1, 0xbd);
+    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
+
+    /* Partial read of a register. */
+    VPCI_CHECK_REG(0, 4, 0x1a1b1c1d);
+    VPCI_READ_CHECK(2, 1, 0x1b);
+    VPCI_READ_CHECK(6, 2, 0xbdba);
+
+    /* Write of multiple registers. */
+    VPCI_CHECK_REG(4, 4, 0xaabbccff);
+
+    /* Partial write of a register. */
+    VPCI_CHECK_REG(2, 1, 0xfe);
+    VPCI_CHECK_REG(6, 2, 0xfebc);
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 44bd3bf0ce..41bf9dfaf3 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -79,6 +79,9 @@  SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel)
        *(.data.rel.*)
        CONSTRUCTORS
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 90e2b1f82a..f74020facc 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -500,11 +500,21 @@  static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
     if ( is_hvm_domain(d) )
     {
         if ( is_hardware_domain(d) &&
-             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
-            return false;
-        if ( !is_hardware_domain(d) && emflags &&
-             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                         XEN_X86_EMU_VPCI) )
             return false;
+        if ( !is_hardware_domain(d) )
+        {
+            switch ( emflags )
+            {
+            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:
+            case XEN_X86_EMU_LAPIC:
+            case 0:
+                break;
+            default:
+                return false;
+            }
+        }
     }
     else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a441955322..7f3322ede6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@ 
 #include <xen/vm_event.h>
 #include <xen/monitor.h>
 #include <xen/warning.h>
+#include <xen/vpci.h>
 #include <asm/shadow.h>
 #include <asm/hap.h>
 #include <asm/current.h>
@@ -655,6 +656,7 @@  int hvm_domain_initialise(struct domain *d)
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
 
     register_g2m_portio_handler(d);
+    register_vpci_portio_handler(d);
 
     hvm_ioreq_init(d);
 
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 214ab307c4..80f842b092 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -25,6 +25,7 @@ 
 #include <xen/trace.h>
 #include <xen/event.h>
 #include <xen/hypercall.h>
+#include <xen/vpci.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -256,6 +257,152 @@  void register_g2m_portio_handler(struct domain *d)
     handler->ops = &g2m_portio_ops;
 }
 
+/* Do some sanity checks. */
+static int vpci_access_check(unsigned int reg, unsigned int len)
+{
+    /* Check access size. */
+    if ( len != 1 && len != 2 && len != 4 )
+    {
+        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    /* Check if access crosses a double-word boundary. */
+    if ( (reg & 3) + len > 4 )
+    {
+        gdprintk(XENLOG_WARNING,
+                 "invalid access across double-word boundary (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* Helper to decode a PCI address. */
+void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                         unsigned int *bus, unsigned int *devfn,
+                         unsigned int *reg)
+{
+    unsigned long bdf;
+
+    ASSERT(CF8_ENABLED(cf8));
+
+    bdf = CF8_BDF(cf8);
+    *bus = PCI_BUS(bdf);
+    *devfn = PCI_DEVFN(PCI_SLOT(bdf), PCI_FUNC(bdf));
+    /*
+     * NB: the lower 2 bits of the register address are fetched from the
+     * offset into the 0xcfc register when reading/writing to it.
+     */
+    *reg = CF8_ADDR_LO(cf8) | (addr & 3);
+}
+
+/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
+static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
+                                 const ioreq_t *p)
+{
+    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
+}
+
+static int vpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr, uint32_t size, uint64_t *data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    uint32_t data32;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = d->arch.hvm_domain.pci_cf8;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
+    if ( !rc )
+        *data = data32;
+    vpci_unlock(d);
+
+     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static int vpci_portio_write(const struct hvm_io_handler *handler,
+                             uint64_t addr, uint32_t size, uint64_t data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        d->arch.hvm_domain.pci_cf8 = data;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+    else if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_write(0, bus, devfn, reg, size, data);
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops vpci_portio_ops = {
+    .accept = vpci_portio_accept,
+    .read = vpci_portio_read,
+    .write = vpci_portio_write,
+};
+
+void register_vpci_portio_handler(struct domain *d)
+{
+    struct hvm_io_handler *handler;
+
+    if ( !has_vpci(d) )
+        return;
+
+    handler = hvm_next_io_handler(d);
+    if ( !handler )
+        return;
+
+    spin_lock_init(&d->arch.hvm_domain.vpci_lock);
+    handler->type = IOREQ_TYPE_PIO;
+    handler->ops = &vpci_portio_ops;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 07a6c2679b..70eabb809c 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -1177,6 +1177,9 @@  struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
          CF8_ENABLED(cf8) )
     {
         uint32_t sbdf, x86_fam;
+        unsigned int bus, devfn, reg;
+
+        hvm_pci_decode_addr(cf8, p->addr, &bus, &devfn, &reg);
 
         /* PCI config data cycle */
 
@@ -1186,9 +1189,7 @@  struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
                                  PCI_FUNC(CF8_BDF(cf8)));
 
         type = XEN_DMOP_IO_RANGE_PCI;
-        addr = ((uint64_t)sbdf << 32) |
-               CF8_ADDR_LO(cf8) |
-               (p->addr & 3);
+        addr = ((uint64_t)sbdf << 32) | reg;
         /* AMD extended configuration space access? */
         if ( CF8_ADDR_HI(cf8) &&
              d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f7b927858c..4cf919f206 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1566,7 +1566,8 @@  void __init noreturn __start_xen(unsigned long mbi_p)
         domcr_flags |= DOMCRF_hvm |
                        ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
                          DOMCRF_hap : 0);
-        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                                 XEN_X86_EMU_VPCI;
     }
 
     /* Create initial domain 0. */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 8289a1bf09..f5cc8e2b8d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -224,6 +224,9 @@  SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel.ro)
        *(.data.rel.ro.*)
   } :text
diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
index 19391802a8..d51c766453 100644
--- a/xen/drivers/Makefile
+++ b/xen/drivers/Makefile
@@ -1,6 +1,6 @@ 
 subdir-y += char
 subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
-subdir-$(CONFIG_HAS_PCI) += pci
+subdir-$(CONFIG_HAS_PCI) += pci vpci
 subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
 subdir-$(CONFIG_ACPI) += acpi
 subdir-$(CONFIG_VIDEO) += video
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c8e2d2d9a9..2288cf8814 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -30,6 +30,7 @@ 
 #include <xen/radix-tree.h>
 #include <xen/softirq.h>
 #include <xen/tasklet.h>
+#include <xen/vpci.h>
 #include <xsm/xsm.h>
 #include <asm/msi.h>
 #include "ats.h"
@@ -1041,6 +1042,8 @@  static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
         devfn += pdev->phantom_stride;
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+
+    xen_vpci_add_handlers(pdev);
 }
 
 static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
new file mode 100644
index 0000000000..840a906470
--- /dev/null
+++ b/xen/drivers/vpci/Makefile
@@ -0,0 +1 @@ 
+obj-y += vpci.o
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
new file mode 100644
index 0000000000..b159f0db80
--- /dev/null
+++ b/xen/drivers/vpci/vpci.c
@@ -0,0 +1,469 @@ 
+/*
+ * Generic functionality for handling accesses to the PCI configuration space
+ * from guests.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+#define vpci_init __start_vpci_array
+
+/* Internal struct to store the emulated PCI registers. */
+struct vpci_register {
+    vpci_read_t read;
+    vpci_write_t write;
+    unsigned int size;
+    unsigned int offset;
+    void *priv_data;
+    struct rb_node node;
+};
+
+int xen_vpci_add_handlers(struct pci_dev *pdev)
+{
+    int i, rc = 0;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    pdev->vpci = xzalloc(struct vpci);
+    if ( !pdev->vpci )
+        return -ENOMEM;
+
+    pdev->vpci->handlers = RB_ROOT;
+
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = vpci_init[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    if ( rc )
+    {
+        struct rb_node *node = rb_first(&pdev->vpci->handlers);
+        struct vpci_register *r;
+
+        /* Iterate over the tree and cleanup. */
+        while ( node != NULL )
+        {
+            r = container_of(node, struct vpci_register, node);
+            node = rb_next(node);
+            rb_erase(&r->node, &pdev->vpci->handlers);
+            xfree(r);
+        }
+        xfree(pdev->vpci);
+    }
+
+    return rc;
+}
+
+static bool vpci_register_overlap(const struct vpci_register *r,
+                                  unsigned int offset)
+{
+    if ( offset >= r->offset && offset < r->offset + r->size )
+        return true;
+
+    return false;
+}
+
+
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    /* Make sure there's no overlap between registers. */
+    if ( vpci_register_overlap(r1, r2->offset) ||
+         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
+         vpci_register_overlap(r2, r1->offset) ||
+         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
+        return 0;
+
+    if (r1->offset < r2->offset)
+        return -1;
+    else if (r1->offset > r2->offset)
+        return 1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
+                                                const unsigned int reg,
+                                                const unsigned int size)
+{
+    struct rb_node *node;
+    struct vpci_register r = {
+        .offset = reg,
+        .size = size,
+    };
+
+    ASSERT(vpci_locked(pdev->domain));
+
+    node = pdev->vpci->handlers.rb_node;
+    while ( node )
+    {
+        struct vpci_register *t =
+            container_of(node, struct vpci_register, node);
+
+        switch ( vpci_register_cmp(&r, t) )
+        {
+        case -1:
+            node = node->rb_left;
+            break;
+        case 1:
+            node = node->rb_right;
+            break;
+        default:
+            return t;
+        }
+    }
+
+    return NULL;
+}
+
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data)
+{
+    struct rb_node **new, *parent;
+    struct vpci_register *r;
+
+    /* Some sanity checks. */
+    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
+         offset & (size - 1) || read_handler == NULL || write_handler == NULL )
+        return -EINVAL;
+
+    r = xzalloc(struct vpci_register);
+    if ( !r )
+        return -ENOMEM;
+
+    r->read = read_handler;
+    r->write = write_handler;
+    r->size = size;
+    r->offset = offset;
+    r->priv_data = data;
+
+    vpci_lock(pdev->domain);
+    new = &pdev->vpci->handlers.rb_node;
+    parent = NULL;
+
+    while (*new) {
+        struct vpci_register *this =
+            container_of(*new, struct vpci_register, node);
+
+        parent = *new;
+        switch ( vpci_register_cmp(r, this) )
+        {
+        case -1:
+            new = &((*new)->rb_left);
+            break;
+        case 1:
+            new = &((*new)->rb_right);
+            break;
+        default:
+            xfree(r);
+            vpci_unlock(pdev->domain);
+            return -EEXIST;
+        }
+    }
+
+    rb_link_node(&r->node, parent, new);
+    rb_insert_color(&r->node, &pdev->vpci->handlers);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
+{
+    struct vpci_register *r;
+
+    vpci_lock(pdev->domain);
+    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
+    if ( !r )
+    {
+        vpci_unlock(pdev->domain);
+        return -ENOENT;
+    }
+
+    rb_erase(&r->node, &pdev->vpci->handlers);
+    xfree(r);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+/* Wrappers for performing reads/writes to the underlying hardware. */
+static void vpci_read_hw(unsigned int seg, unsigned int bus,
+                         unsigned int devfn, unsigned int reg, uint32_t size,
+                         uint32_t *data)
+{
+    switch ( size )
+    {
+    case 4:
+        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte read can have 1byte trapped and
+         * the rest passed-through.
+         */
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg + 1) << 8;
+        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    case 2:
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 1:
+        *data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    default:
+        BUG();
+    }
+}
+
+static void vpci_write_hw(unsigned int seg, unsigned int bus,
+                          unsigned int devfn, unsigned int reg, uint32_t size,
+                          uint32_t data)
+{
+    switch ( size )
+    {
+    case 4:
+        pci_conf_write32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte write can have 1byte trapped and
+         * the rest passed-through.
+         */
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg + 1,
+                         data >> 8);
+        break;
+    case 2:
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 1:
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        break;
+    default:
+        BUG();
+    }
+}
+
+/* Helper macros for the read/write handlers. */
+#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
+#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
+#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)
+
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    union vpci_val val = { .double_word = 0 };
+    unsigned int data_rshift = 0, data_lshift = 0, data_size;
+    uint32_t tmp_data;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    *data = 0;
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    if ( r->offset > reg )
+    {
+        /*
+         * There's a heading gap into the emulated register.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the head read to the partial result. */
+        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
+        data_lshift = r->offset - reg;
+
+        /* Account for the read. */
+        size -= data_lshift;
+        reg += data_lshift;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register */
+        data_rshift = reg - r->offset;
+
+    ASSERT(data_lshift == 0 || data_rshift == 0);
+    data_size = min(size, r->size - data_rshift);
+    ASSERT(data_size != 0);
+
+    /* Perform the read of the register. */
+    rc = r->read(pdev, r->offset, &val, r->priv_data);
+    if ( rc )
+        return rc;
+
+    val.double_word >>= data_rshift * 8;
+    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
+
+    /* Account for the read */
+    size -= data_size;
+    reg += data_size;
+
+    /* Read the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Read tailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, size, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the tail read to the partial result. */
+        ADD_RESULT(*data, tmp_data, size, data_size + data_lshift);
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_read_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/* Perform a maybe partial write to a register. */
+static int vpci_write_helper(struct pci_dev *pdev,
+                             const struct vpci_register *r, unsigned int size,
+                             unsigned int offset, uint32_t data)
+{
+    union vpci_val val = { .double_word = data };
+    int rc;
+
+    ASSERT(size <= r->size);
+    if ( size != r->size )
+    {
+        rc = r->read(pdev, r->offset, &val, r->priv_data);
+        if ( rc )
+            return rc;
+        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
+        data &= GENMASK_BYTES(size, 0);
+        val.double_word |= data << (offset * 8);
+    }
+
+    return r->write(pdev, r->offset, val, r->priv_data);
+}
+
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_size, data_offset = 0;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    else if ( r->offset > reg )
+    {
+        /*
+         * There's a heading gap into the emulated register found.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, r->offset - reg, data);
+        if ( rc )
+            return rc;
+
+        /* Advance the data by the written size. */
+        SHIFT_RIGHT_BYTES(data, r->offset - reg);
+        size -= r->offset - reg;
+        reg += r->offset - reg;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register. */
+        data_offset = reg - r->offset;
+
+    data_size = min(size, r->size - data_offset);
+
+    /* Perform the write of the register. */
+    ASSERT(data_size != 0);
+    rc = vpci_write_helper(pdev, r, data_size, data_offset, data);
+    if ( rc )
+        return rc;
+
+    /* Account for the read */
+    size -= data_size;
+    reg += data_size;
+    SHIFT_RIGHT_BYTES(data, data_size);
+
+    /* Write the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Write tailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, size, data);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_write_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 6ab987f231..f0741917ed 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -426,6 +426,7 @@  struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
                             XEN_X86_EMU_USE_PIRQ))
+#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9bb2..cbf4170789 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -184,6 +184,9 @@  struct hvm_domain {
     /* List of guest to machine IO ports mapping. */
     struct list_head g2m_ioport_list;
 
+    /* Lock for the PCI emulation layer (vPCI). */
+    spinlock_t vpci_lock;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 2484eb1c75..9ed1bf2e06 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -149,12 +149,20 @@  void stdvga_deinit(struct domain *d);
 
 extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
 
+/* Decode a PCI port IO access into a bus/devfn/reg. */
+void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
+                         unsigned int *bus, unsigned int *devfn,
+                         unsigned int *reg);
+
 /*
  * HVM port IO handler that performs forwarding of guest IO ports into machine
  * IO ports.
  */
 void register_g2m_portio_handler(struct domain *d);
 
+/* HVM port IO handler for PCI accesses. */
+void register_vpci_portio_handler(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index f21332e897..86a1a09a8d 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -295,12 +295,15 @@  struct xen_arch_domainconfig {
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
 #define _XEN_X86_EMU_USE_PIRQ       9
 #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 59b6e8a81c..a83c4a1276 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -13,6 +13,7 @@ 
 #include <xen/irq.h>
 #include <xen/pci_regs.h>
 #include <xen/pfn.h>
+#include <xen/rbtree.h>
 #include <asm/device.h>
 #include <asm/numa.h>
 #include <asm/pci.h>
@@ -88,6 +89,9 @@  struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    /* Data for vPCI. */
+    struct vpci *vpci;
 };
 
 #define for_each_pdev(domain, pdev) \
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
new file mode 100644
index 0000000000..56e8d1c35e
--- /dev/null
+++ b/xen/include/xen/vpci.h
@@ -0,0 +1,66 @@ 
+#ifndef _VPCI_
+#define _VPCI_
+
+#include <xen/pci.h>
+#include <xen/types.h>
+
+/* Helpers for locking/unlocking. */
+#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
+
+/* Value read or written by the handlers. */
+union vpci_val {
+    uint8_t half_word;
+    uint16_t word;
+    uint32_t double_word;
+};
+
+/*
+ * The vPCI handlers will never be called concurrently for the same domain, ii
+ * is guaranteed that the vpci domain lock will always be locked when calling
+ * any handler.
+ */
+typedef int (*vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val *val, void *data);
+
+typedef int (*vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
+                            union vpci_val val, void *data);
+
+typedef int (*vpci_register_init_t)(struct pci_dev *dev);
+
+#define REGISTER_VPCI_INIT(x) \
+  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
+
+/* Add vPCI handlers to device. */
+int xen_vpci_add_handlers(struct pci_dev *dev);
+
+/* Add/remove a register handler. */
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data);
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
+
+/* Generic read/write handlers for the PCI config space. */
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data);
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data);
+
+struct vpci {
+    /* Root pointer for the tree of vPCI handlers. */
+    struct rb_root handlers;
+};
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+