diff mbox

[v2,1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space

Message ID 20170420151743.90889-2-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monné April 20, 2017, 3:17 p.m. UTC
This functionality is going to reside in vpci.c (and the corresponding vpci.h
header), and should be arch-agnostic. The handlers introduced in this patch
setup the basic functionality required in order to trap accesses to the PCI
config space, and allow decoding the address and finding the corresponding
handler that should handle the access (although no handlers are implemented).

Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup
inside of a x86 HVM file, since that's not shared with other arches.

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen whether
a domain should use the newly introduced vPCI handlers, this is only enabled
for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic functionality
of the vPCI traps can be asserted. This has been proven quite helpful during
development, since the logic to handle partial accesses or accesses that expand
across multiple registers is not trivial.

The handlers for the registers are added to a red-black tree, that indexes them
based on their offset. Since Xen needs to handle partial accesses to the
registers and access that expand across multiple registers the logic in
xen_vpci_{read/write} is kind of convoluted, I've tried to properly comment it
in order to make it easier to understand.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.
---
 .gitignore                        |   4 +
 tools/libxl/libxl_x86.c           |   2 +-
 tools/tests/Makefile              |   1 +
 tools/tests/vpci/Makefile         |  45 ++++
 tools/tests/vpci/emul.h           | 107 +++++++++
 tools/tests/vpci/main.c           | 206 +++++++++++++++++
 xen/arch/arm/xen.lds.S            |   3 +
 xen/arch/x86/domain.c             |  18 +-
 xen/arch/x86/hvm/hvm.c            |   2 +
 xen/arch/x86/hvm/io.c             | 135 +++++++++++
 xen/arch/x86/setup.c              |   3 +-
 xen/arch/x86/xen.lds.S            |   3 +
 xen/drivers/Makefile              |   2 +-
 xen/drivers/passthrough/pci.c     |   3 +
 xen/drivers/vpci/Makefile         |   1 +
 xen/drivers/vpci/vpci.c           | 474 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/hvm/domain.h  |   3 +
 xen/include/asm-x86/hvm/io.h      |   3 +
 xen/include/public/arch-x86/xen.h |   5 +-
 xen/include/xen/pci.h             |   4 +
 xen/include/xen/vpci.h            |  66 ++++++
 22 files changed, 1083 insertions(+), 8 deletions(-)
 create mode 100644 tools/tests/vpci/Makefile
 create mode 100644 tools/tests/vpci/emul.h
 create mode 100644 tools/tests/vpci/main.c
 create mode 100644 xen/drivers/vpci/Makefile
 create mode 100644 xen/drivers/vpci/vpci.c
 create mode 100644 xen/include/xen/vpci.h

Comments

Paul Durrant April 21, 2017, 4:07 p.m. UTC | #1
> -----Original Message-----

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]

> Sent: 20 April 2017 16:18

> To: xen-devel@lists.xenproject.org

> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne

> <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu

> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper

> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>

> Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap accesses

> to the PCI config space

> 

> This functionality is going to reside in vpci.c (and the corresponding vpci.h

> header), and should be arch-agnostic. The handlers introduced in this patch

> setup the basic functionality required in order to trap accesses to the PCI

> config space, and allow decoding the address and finding the corresponding

> handler that should handle the access (although no handlers are

> implemented).

> 

> Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup

> inside of a x86 HVM file, since that's not shared with other arches.

> 

> A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen

> whether

> a domain should use the newly introduced vPCI handlers, this is only enabled

> for PVH Dom0 at the moment.

> 

> A very simple user-space test is also provided, so that the basic functionality

> of the vPCI traps can be asserted. This has been proven quite helpful during

> development, since the logic to handle partial accesses or accesses that

> expand

> across multiple registers is not trivial.

> 

> The handlers for the registers are added to a red-black tree, that indexes

> them

> based on their offset. Since Xen needs to handle partial accesses to the

> registers and access that expand across multiple registers the logic in

> xen_vpci_{read/write} is kind of convoluted, I've tried to properly comment

> it

> in order to make it easier to understand.

> 


Since config space is not exactly huge, I'm wondering why you used an r-b tree rather than a direct map from register to handler?

  Paul
Paul Durrant April 21, 2017, 4:23 p.m. UTC | #2
> -----Original Message-----

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]

> Sent: 20 April 2017 16:18

> To: xen-devel@lists.xenproject.org

> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne

> <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu

> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper

> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>

> Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap accesses

> to the PCI config space

> 

> This functionality is going to reside in vpci.c (and the corresponding vpci.h

> header), and should be arch-agnostic. The handlers introduced in this patch

> setup the basic functionality required in order to trap accesses to the PCI

> config space, and allow decoding the address and finding the corresponding

> handler that should handle the access (although no handlers are

> implemented).

> 

> Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup

> inside of a x86 HVM file, since that's not shared with other arches.

> 

> A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen

> whether

> a domain should use the newly introduced vPCI handlers, this is only enabled

> for PVH Dom0 at the moment.

> 

> A very simple user-space test is also provided, so that the basic functionality

> of the vPCI traps can be asserted. This has been proven quite helpful during

> development, since the logic to handle partial accesses or accesses that

> expand

> across multiple registers is not trivial.

> 

> The handlers for the registers are added to a red-black tree, that indexes

> them

> based on their offset. Since Xen needs to handle partial accesses to the

> registers and access that expand across multiple registers the logic in

> xen_vpci_{read/write} is kind of convoluted, I've tried to properly comment

> it

> in order to make it easier to understand.

> 

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

> ---

> Cc: Ian Jackson <ian.jackson@eu.citrix.com>

> Cc: Wei Liu <wei.liu2@citrix.com>

> Cc: Jan Beulich <jbeulich@suse.com>

> Cc: Andrew Cooper <andrew.cooper3@citrix.com>

> Cc: Paul Durrant <paul.durrant@citrix.com>

> ---

> Changes since v1:

>  - Allow access to cross a word-boundary.

>  - Add locking.

>  - Add cleanup to xen_vpci_add_handlers in case of failure.

> ---

>  .gitignore                        |   4 +

>  tools/libxl/libxl_x86.c           |   2 +-

>  tools/tests/Makefile              |   1 +

>  tools/tests/vpci/Makefile         |  45 ++++

>  tools/tests/vpci/emul.h           | 107 +++++++++

>  tools/tests/vpci/main.c           | 206 +++++++++++++++++

>  xen/arch/arm/xen.lds.S            |   3 +

>  xen/arch/x86/domain.c             |  18 +-

>  xen/arch/x86/hvm/hvm.c            |   2 +

>  xen/arch/x86/hvm/io.c             | 135 +++++++++++

>  xen/arch/x86/setup.c              |   3 +-

>  xen/arch/x86/xen.lds.S            |   3 +

>  xen/drivers/Makefile              |   2 +-

>  xen/drivers/passthrough/pci.c     |   3 +

>  xen/drivers/vpci/Makefile         |   1 +

>  xen/drivers/vpci/vpci.c           | 474

> ++++++++++++++++++++++++++++++++++++++

>  xen/include/asm-x86/domain.h      |   1 +

>  xen/include/asm-x86/hvm/domain.h  |   3 +

>  xen/include/asm-x86/hvm/io.h      |   3 +

>  xen/include/public/arch-x86/xen.h |   5 +-

>  xen/include/xen/pci.h             |   4 +

>  xen/include/xen/vpci.h            |  66 ++++++

>  22 files changed, 1083 insertions(+), 8 deletions(-)

>  create mode 100644 tools/tests/vpci/Makefile

>  create mode 100644 tools/tests/vpci/emul.h

>  create mode 100644 tools/tests/vpci/main.c

>  create mode 100644 xen/drivers/vpci/Makefile

>  create mode 100644 xen/drivers/vpci/vpci.c

>  create mode 100644 xen/include/xen/vpci.h

> 

> diff --git a/.gitignore b/.gitignore

> index 74747cb7e7..ebafba25b5 100644

> --- a/.gitignore

> +++ b/.gitignore

> @@ -236,6 +236,10 @@ tools/tests/regression/build/*

>  tools/tests/regression/downloads/*

>  tools/tests/mem-sharing/memshrtool

>  tools/tests/mce-test/tools/xen-mceinj

> +tools/tests/vpci/rbtree.[hc]

> +tools/tests/vpci/vpci.[hc]

> +tools/tests/vpci/test_vpci.out

> +tools/tests/vpci/test_vpci

>  tools/xcutils/lsevtchn

>  tools/xcutils/readnotes

>  tools/xenbackendd/_paths.h

> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c

> index 455f6f0bed..dd7fc78a99 100644

> --- a/tools/libxl/libxl_x86.c

> +++ b/tools/libxl/libxl_x86.c

> @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc,

>      if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {

>          if (d_config->b_info.device_model_version !=

>              LIBXL_DEVICE_MODEL_VERSION_NONE) {

> -            xc_config->emulation_flags = XEN_X86_EMU_ALL;

> +            xc_config->emulation_flags = (XEN_X86_EMU_ALL &

> ~XEN_X86_EMU_VPCI);

>          } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) {

>              /*

>               * HVM guests without device model may want

> diff --git a/tools/tests/Makefile b/tools/tests/Makefile

> index 639776130b..5cfe781e62 100644

> --- a/tools/tests/Makefile

> +++ b/tools/tests/Makefile

> @@ -13,6 +13,7 @@ endif

>  SUBDIRS-$(CONFIG_X86) += x86_emulator

>  SUBDIRS-y += xen-access

>  SUBDIRS-y += xenstore

> +SUBDIRS-$(CONFIG_HAS_PCI) += vpci

> 

>  .PHONY: all clean install distclean

>  all clean distclean: %: subdirs-%

> diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile

> new file mode 100644

> index 0000000000..7969fcbd82

> --- /dev/null

> +++ b/tools/tests/vpci/Makefile

> @@ -0,0 +1,45 @@

> +

> +XEN_ROOT=$(CURDIR)/../../..

> +include $(XEN_ROOT)/tools/Rules.mk

> +

> +TARGET := test_vpci

> +

> +.PHONY: all

> +all: $(TARGET)

> +

> +.PHONY: run

> +run: $(TARGET)

> +	./$(TARGET) > $(TARGET).out

> +

> +$(TARGET): vpci.c vpci.h rbtree.c rbtree.h

> +	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c

> +

> +.PHONY: clean

> +clean:

> +	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h

> +

> +.PHONY: distclean

> +distclean: clean

> +

> +.PHONY: install

> +install:

> +

> +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h

> +	sed -e '/#include/d' <$< >$@

> +

> +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c

> +	# Trick the compiler so it doesn't complain about missing symbols

> +	sed -e '/#include/d' \

> +	    -e '1s;^;#include "emul.h"\

> +	             const vpci_register_init_t __start_vpci_array[1]\;\

> +	             const vpci_register_init_t __end_vpci_array[1]\;\

> +	             ;' <$< >$@

> +

> +rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h

> +	sed -e '/#include/d' <$< >$@

> +

> +rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c

> +	sed -e "/#include/d" \

> +	    -e '1s;^;#include "emul.h"\

> +	             ;' <$< >$@

> +

> diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h

> new file mode 100644

> index 0000000000..85897ed43b

> --- /dev/null

> +++ b/tools/tests/vpci/emul.h

> @@ -0,0 +1,107 @@

> +/*

> + * Unit tests for the generic vPCI handler code.

> + *

> + * Copyright (C) 2017 Citrix Systems R&D

> + *

> + * This program is free software; you can redistribute it and/or

> + * modify it under the terms and conditions of the GNU General Public

> + * License, version 2, as published by the Free Software Foundation.

> + *

> + * This program is distributed in the hope that it will be useful,

> + * but WITHOUT ANY WARRANTY; without even the implied warranty of

> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

> GNU

> + * General Public License for more details.

> + *

> + * You should have received a copy of the GNU General Public

> + * License along with this program; If not, see

> <http://www.gnu.org/licenses/>.

> + */

> +

> +#ifndef _TEST_VPCI_

> +#define _TEST_VPCI_

> +

> +#include <stdlib.h>

> +#include <stdio.h>

> +#include <stddef.h>

> +#include <stdint.h>

> +#include <stdbool.h>

> +#include <errno.h>

> +#include <assert.h>

> +

> +#define container_of(ptr, type, member) ({                      \

> +        typeof( ((type *)0)->member ) *__mptr = (ptr);          \

> +        (type *)( (char *)__mptr - offsetof(type,member) );})

> +

> +#include "rbtree.h"

> +

> +struct pci_dev {

> +    struct domain *domain;

> +    struct vpci *vpci;

> +};

> +

> +struct domain {

> +    struct pci_dev pdev;

> +};

> +

> +struct vcpu

> +{

> +    struct domain *domain;

> +};

> +

> +extern struct vcpu v;

> +

> +#define spin_lock(x)

> +#define spin_unlock(x)

> +#define spin_is_locked(x) true

> +

> +#define current (&v)

> +

> +#define has_vpci(d) true

> +

> +#include "vpci.h"

> +

> +#define xzalloc(type) (type *)calloc(1, sizeof(type))

> +#define xfree(p) free(p)

> +

> +#define EXPORT_SYMBOL(x)

> +

> +#define pci_get_pdev_by_domain(d, ...) &(d)->pdev

> +

> +#define atomic_read(x) 1

> +

> +/* Dummy native helpers. Writes are ignored, reads return 1's. */

> +#define pci_conf_read8(...) (0xff)

> +#define pci_conf_read16(...) (0xffff)

> +#define pci_conf_read32(...) (0xffffffff)

> +#define pci_conf_write8(...)

> +#define pci_conf_write16(...)

> +#define pci_conf_write32(...)

> +

> +#define BUG() assert(0)

> +#define ASSERT_UNREACHABLE() assert(0)

> +#define ASSERT(x) assert(x)

> +

> +#ifdef _LP64

> +#define BITS_PER_LONG 64

> +#else

> +#define BITS_PER_LONG 32

> +#endif

> +#define GENMASK(h, l) \

> +    (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

> +

> +#define min(x,y) ({ \

> +        const typeof(x) _x = (x);       \

> +        const typeof(y) _y = (y);       \

> +        (void) (&_x == &_y);            \

> +        _x < _y ? _x : _y; })

> +

> +#endif

> +

> +/*

> + * Local variables:

> + * mode: C

> + * c-file-style: "BSD"

> + * c-basic-offset: 4

> + * indent-tabs-mode: nil

> + * End:

> + */

> +

> diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c

> new file mode 100644

> index 0000000000..0fc63de038

> --- /dev/null

> +++ b/tools/tests/vpci/main.c

> @@ -0,0 +1,206 @@

> +/*

> + * Unit tests for the generic vPCI handler code.

> + *

> + * Copyright (C) 2017 Citrix Systems R&D

> + *

> + * This program is free software; you can redistribute it and/or

> + * modify it under the terms and conditions of the GNU General Public

> + * License, version 2, as published by the Free Software Foundation.

> + *

> + * This program is distributed in the hope that it will be useful,

> + * but WITHOUT ANY WARRANTY; without even the implied warranty of

> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

> GNU

> + * General Public License for more details.

> + *

> + * You should have received a copy of the GNU General Public

> + * License along with this program; If not, see

> <http://www.gnu.org/licenses/>.

> + */

> +

> +#include "emul.h"

> +

> +/* Single vcpu (current), and single domain with a single PCI device. */

> +static struct vpci vpci = {

> +    .handlers = RB_ROOT,

> +};

> +

> +static struct domain d = {

> +    .pdev.domain = &d,

> +    .pdev.vpci = &vpci,

> +};

> +

> +struct vcpu v = { .domain = &d };

> +

> +/* Dummy hooks, write stores data, read fetches it. */

> +static int vpci_read8(struct pci_dev *pdev, unsigned int reg,

> +                      union vpci_val *val, void *data)

> +{

> +    uint8_t *priv = data;

> +

> +    val->half_word = *priv;

> +    return 0;

> +}

> +

> +static int vpci_write8(struct pci_dev *pdev, unsigned int reg,

> +                       union vpci_val val, void *data)

> +{

> +    uint8_t *priv = data;

> +

> +    *priv = val.half_word;

> +    return 0;

> +}

> +

> +static int vpci_read16(struct pci_dev *pdev, unsigned int reg,

> +                       union vpci_val *val, void *data)

> +{

> +    uint16_t *priv = data;

> +

> +    val->word = *priv;

> +    return 0;

> +}

> +

> +static int vpci_write16(struct pci_dev *pdev, unsigned int reg,

> +                        union vpci_val val, void *data)

> +{

> +    uint16_t *priv = data;

> +

> +    *priv = val.word;

> +    return 0;

> +}

> +

> +static int vpci_read32(struct pci_dev *pdev, unsigned int reg,

> +                       union vpci_val *val, void *data)

> +{

> +    uint32_t *priv = data;

> +

> +    val->double_word = *priv;

> +    return 0;

> +}

> +

> +static int vpci_write32(struct pci_dev *pdev, unsigned int reg,

> +                        union vpci_val val, void *data)

> +{

> +    uint32_t *priv = data;

> +

> +    *priv = val.double_word;

> +    return 0;

> +}

> +

> +#define VPCI_READ(reg, size, data) \

> +    assert(!xen_vpci_read(0, 0, 0, reg, size, data))

> +

> +#define VPCI_READ_CHECK(reg, size, expected) ({ \

> +    uint32_t val;                               \

> +    VPCI_READ(reg, size, &val);                 \

> +    assert(val == expected);                    \

> +    })

> +

> +#define VPCI_WRITE(reg, size, data) \

> +    assert(!xen_vpci_write(0, 0, 0, reg, size, data))

> +

> +#define VPCI_CHECK_REG(reg, size, data) ({      \

> +    VPCI_WRITE(reg, size, data);                \

> +    VPCI_READ_CHECK(reg, size, data);           \

> +    })

> +

> +#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \

> +    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store))

> \

> +

> +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \

> +    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \

> +

> +int

> +main(int argc, char **argv)

> +{

> +    /* Index storage by offset. */

> +    uint32_t r0 = 0xdeadbeef;

> +    uint8_t r5 = 0xef;

> +    uint8_t r6 = 0xbe;

> +    uint8_t r7 = 0xef;

> +    uint16_t r12 = 0x8696;

> +    int rc;

> +

> +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);

> +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);

> +    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);

> +

> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);

> +    VPCI_READ_CHECK(5, 1, 0xef);

> +    VPCI_CHECK_REG(5, 1, 0xba);

> +

> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);

> +    VPCI_READ_CHECK(6, 1, 0xbe);

> +    VPCI_CHECK_REG(6, 1, 0xba);

> +

> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);

> +    VPCI_READ_CHECK(7, 1, 0xef);

> +    VPCI_CHECK_REG(7, 1, 0xbd);

> +

> +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);

> +    VPCI_READ_CHECK(12, 2, 0x8696);

> +    VPCI_READ_CHECK(12, 4, 0xffff8696);

> +

> +    /*

> +     * At this point we have the following layout:

> +     *

> +     * 32    24    16     8     0

> +     *  +-----+-----+-----+-----+

> +     *  |          r0           | 0

> +     *  +-----+-----+-----+-----+

> +     *  | r7  |  r6 |  r5 |/////| 32

> +     *  +-----+-----+-----+-----|

> +     *  |///////////////////////| 64

> +     *  +-----------+-----------+

> +     *  |///////////|    r12    | 96

> +     *  +-----------+-----------+

> +     *             ...

> +     *  / = empty.

> +     */

> +

> +    /* Try to add an overlapping register handler. */

> +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);

> +

> +    /* Try to add a non-aligned register. */

> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);

> +

> +    /* Try to add a register with wrong size. */

> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);

> +

> +    /* Try to add a register with missing handlers. */

> +    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);

> +    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);

> +

> +    /* Read/write of unset register. */

> +    VPCI_READ_CHECK(8, 4, 0xffffffff);

> +    VPCI_READ_CHECK(8, 2, 0xffff);

> +    VPCI_READ_CHECK(8, 1, 0xff);

> +    VPCI_WRITE(10, 2, 0xbeef);

> +    VPCI_READ_CHECK(10, 2, 0xffff);

> +

> +    /* Read of multiple registers */

> +    VPCI_CHECK_REG(7, 1, 0xbd);

> +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);

> +

> +    /* Partial read of a register. */

> +    VPCI_CHECK_REG(0, 4, 0x1a1b1c1d);

> +    VPCI_READ_CHECK(2, 1, 0x1b);

> +    VPCI_READ_CHECK(6, 2, 0xbdba);

> +

> +    /* Write of multiple registers. */

> +    VPCI_CHECK_REG(4, 4, 0xaabbccff);

> +

> +    /* Partial write of a register. */

> +    VPCI_CHECK_REG(2, 1, 0xfe);

> +    VPCI_CHECK_REG(6, 2, 0xfebc);

> +

> +    return 0;

> +}

> +

> +/*

> + * Local variables:

> + * mode: C

> + * c-file-style: "BSD"

> + * c-basic-offset: 4

> + * indent-tabs-mode: nil

> + * End:

> + */

> +

> diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S

> index 44bd3bf0ce..41bf9dfaf3 100644

> --- a/xen/arch/arm/xen.lds.S

> +++ b/xen/arch/arm/xen.lds.S

> @@ -79,6 +79,9 @@ SECTIONS

>         __start_schedulers_array = .;

>         *(.data.schedulers)

>         __end_schedulers_array = .;

> +       __start_vpci_array = .;

> +       *(.data.vpci)

> +       __end_vpci_array = .;

>         *(.data.rel)

>         *(.data.rel.*)

>         CONSTRUCTORS

> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c

> index 90e2b1f82a..f74020facc 100644

> --- a/xen/arch/x86/domain.c

> +++ b/xen/arch/x86/domain.c

> @@ -500,11 +500,21 @@ static bool emulation_flags_ok(const struct domain

> *d, uint32_t emflags)

>      if ( is_hvm_domain(d) )

>      {

>          if ( is_hardware_domain(d) &&

> -             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )

> -            return false;

> -        if ( !is_hardware_domain(d) && emflags &&

> -             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )

> +             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|

> +                         XEN_X86_EMU_VPCI) )

>              return false;

> +        if ( !is_hardware_domain(d) )

> +        {

> +            switch ( emflags )

> +            {

> +            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:

> +            case XEN_X86_EMU_LAPIC:

> +            case 0:

> +                break;

> +            default:

> +                return false;

> +            }

> +        }

>      }

>      else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )

>      {

> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c

> index a441955322..7f3322ede6 100644

> --- a/xen/arch/x86/hvm/hvm.c

> +++ b/xen/arch/x86/hvm/hvm.c

> @@ -37,6 +37,7 @@

>  #include <xen/vm_event.h>

>  #include <xen/monitor.h>

>  #include <xen/warning.h>

> +#include <xen/vpci.h>

>  #include <asm/shadow.h>

>  #include <asm/hap.h>

>  #include <asm/current.h>

> @@ -655,6 +656,7 @@ int hvm_domain_initialise(struct domain *d)

>          d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;

> 

>      register_g2m_portio_handler(d);

> +    register_vpci_portio_handler(d);

> 

>      hvm_ioreq_init(d);

> 

> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c

> index 214ab307c4..15048da556 100644

> --- a/xen/arch/x86/hvm/io.c

> +++ b/xen/arch/x86/hvm/io.c

> @@ -25,6 +25,7 @@

>  #include <xen/trace.h>

>  #include <xen/event.h>

>  #include <xen/hypercall.h>

> +#include <xen/vpci.h>

>  #include <asm/current.h>

>  #include <asm/cpufeature.h>

>  #include <asm/processor.h>

> @@ -256,6 +257,140 @@ void register_g2m_portio_handler(struct domain

> *d)

>      handler->ops = &g2m_portio_ops;

>  }

> 

> +/* Do some sanity checks. */

> +static int vpci_access_check(unsigned int reg, unsigned int len)

> +{

> +    /* Check access size. */

> +    if ( len != 1 && len != 2 && len != 4 )

> +    {

> +        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",

> +                 reg, len);

> +        return -EINVAL;

> +    }

> +

> +    /* Check if access crosses a double-word boundary. */

> +    if ( (reg & 3) + len > 4 )

> +    {

> +        gdprintk(XENLOG_WARNING,

> +                 "invalid access across double-word boundary (reg: %#x, len:

> %u)\n",

> +                 reg, len);

> +        return -EINVAL;

> +    }

> +

> +    return 0;

> +}

> +

> +/* Helper to decode a PCI address. */

> +static void vpci_decode_addr(unsigned int cf8, unsigned int addr,

> +                             unsigned int *bus, unsigned int *devfn,

> +                             unsigned int *reg)

> +{

> +    unsigned long bdf;

> +

> +    bdf = CF8_BDF(cf8);

> +    *bus = PCI_BUS(bdf);

> +    *devfn = PCI_DEVFN(PCI_SLOT(bdf), PCI_FUNC(bdf));

> +    /*

> +     * NB: the lower 2 bits of the register address are fetched from the

> +     * offset into the 0xcfc register when reading/writing to it.

> +     */

> +    *reg = (cf8 & 0xfc) | (addr & 3);

> +}

> +

> +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */

> +static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,

> +                                 const ioreq_t *p)

> +{

> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;

> +}

> +

> +static int vpci_portio_read(const struct hvm_io_handler *handler,

> +                            uint64_t addr, uint32_t size, uint64_t *data)

> +{

> +    struct domain *d = current->domain;

> +    unsigned int bus, devfn, reg;

> +    uint32_t data32;

> +    int rc;

> +

> +    vpci_lock(d);

> +    if ( addr == 0xcf8 )

> +    {

> +        ASSERT(size == 4);

> +        *data = d->arch.hvm_domain.pci_cf8;

> +        vpci_unlock(d);

> +        return X86EMUL_OKAY;

> +    }

> +

> +    /* Decode the PCI address. */

> +    vpci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn,

> &reg);

> +

> +    if ( vpci_access_check(reg, size) || reg >= 0xff )

> +    {

> +        vpci_unlock(d);

> +        return X86EMUL_UNHANDLEABLE;

> +    }

> +

> +    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);

> +    if ( !rc )

> +        *data = data32;

> +    vpci_unlock(d);

> +

> +     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;

> +}

> +

> +static int vpci_portio_write(const struct hvm_io_handler *handler,

> +                             uint64_t addr, uint32_t size, uint64_t data)

> +{

> +    struct domain *d = current->domain;

> +    unsigned int bus, devfn, reg;

> +    int rc;

> +

> +    vpci_lock(d);

> +    if ( addr == 0xcf8 )

> +    {

> +        ASSERT(size == 4);

> +        d->arch.hvm_domain.pci_cf8 = data;

> +        vpci_unlock(d);

> +        return X86EMUL_OKAY;

> +    }

> +

> +    /* Decode the PCI address. */

> +    vpci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn,

> &reg);

> +

> +    if ( vpci_access_check(reg, size) || reg >= 0xff )

> +    {

> +        vpci_unlock(d);

> +        return X86EMUL_UNHANDLEABLE;

> +    }

> +

> +    rc = xen_vpci_write(0, bus, devfn, reg, size, data);

> +    vpci_unlock(d);

> +

> +    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;

> +}

> +

> +static const struct hvm_io_ops vpci_portio_ops = {

> +    .accept = vpci_portio_accept,

> +    .read = vpci_portio_read,

> +    .write = vpci_portio_write,

> +};

> +

> +void register_vpci_portio_handler(struct domain *d)

> +{

> +    struct hvm_io_handler *handler;

> +

> +    if ( !has_vpci(d) )

> +        return;

> +

> +    handler = hvm_next_io_handler(d);

> +    if ( !handler )

> +        return;

> +

> +    spin_lock_init(&d->arch.hvm_domain.vpci_lock);

> +    handler->type = IOREQ_TYPE_PIO;

> +    handler->ops = &vpci_portio_ops;

> +}

> +

>  /*

>   * Local variables:

>   * mode: C

> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c

> index f7b927858c..4cf919f206 100644

> --- a/xen/arch/x86/setup.c

> +++ b/xen/arch/x86/setup.c

> @@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long

> mbi_p)

>          domcr_flags |= DOMCRF_hvm |

>                         ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?

>                           DOMCRF_hap : 0);

> -        config.emulation_flags =

> XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;

> +        config.emulation_flags =

> XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|

> +                                 XEN_X86_EMU_VPCI;

>      }

> 

>      /* Create initial domain 0. */

> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S

> index 8289a1bf09..f5cc8e2b8d 100644

> --- a/xen/arch/x86/xen.lds.S

> +++ b/xen/arch/x86/xen.lds.S

> @@ -224,6 +224,9 @@ SECTIONS

>         __start_schedulers_array = .;

>         *(.data.schedulers)

>         __end_schedulers_array = .;

> +       __start_vpci_array = .;

> +       *(.data.vpci)

> +       __end_vpci_array = .;

>         *(.data.rel.ro)

>         *(.data.rel.ro.*)

>    } :text

> diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile

> index 19391802a8..d51c766453 100644

> --- a/xen/drivers/Makefile

> +++ b/xen/drivers/Makefile

> @@ -1,6 +1,6 @@

>  subdir-y += char

>  subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq

> -subdir-$(CONFIG_HAS_PCI) += pci

> +subdir-$(CONFIG_HAS_PCI) += pci vpci

>  subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough

>  subdir-$(CONFIG_ACPI) += acpi

>  subdir-$(CONFIG_VIDEO) += video

> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c

> index c8e2d2d9a9..2288cf8814 100644

> --- a/xen/drivers/passthrough/pci.c

> +++ b/xen/drivers/passthrough/pci.c

> @@ -30,6 +30,7 @@

>  #include <xen/radix-tree.h>

>  #include <xen/softirq.h>

>  #include <xen/tasklet.h>

> +#include <xen/vpci.h>

>  #include <xsm/xsm.h>

>  #include <asm/msi.h>

>  #include "ats.h"

> @@ -1041,6 +1042,8 @@ static void setup_one_hwdom_device(const struct

> setup_hwdom *ctxt,

>          devfn += pdev->phantom_stride;

>      } while ( devfn != pdev->devfn &&

>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );

> +

> +    xen_vpci_add_handlers(pdev);

>  }

> 

>  static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg,

> void *arg)

> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile

> new file mode 100644

> index 0000000000..840a906470

> --- /dev/null

> +++ b/xen/drivers/vpci/Makefile

> @@ -0,0 +1 @@

> +obj-y += vpci.o

> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c

> new file mode 100644

> index 0000000000..f4cd04f11d

> --- /dev/null

> +++ b/xen/drivers/vpci/vpci.c

> @@ -0,0 +1,474 @@

> +/*

> + * Generic functionality for handling accesses to the PCI configuration space

> + * from guests.

> + *

> + * Copyright (C) 2017 Citrix Systems R&D

> + *

> + * This program is free software; you can redistribute it and/or

> + * modify it under the terms and conditions of the GNU General Public

> + * License, version 2, as published by the Free Software Foundation.

> + *

> + * This program is distributed in the hope that it will be useful,

> + * but WITHOUT ANY WARRANTY; without even the implied warranty of

> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the

> GNU

> + * General Public License for more details.

> + *

> + * You should have received a copy of the GNU General Public

> + * License along with this program; If not, see

> <http://www.gnu.org/licenses/>.

> + */

> +

> +#include <xen/sched.h>

> +#include <xen/vpci.h>

> +

> +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];

> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)

> +#define vpci_init __start_vpci_array

> +

> +/* Helpers for locking/unlocking. */

> +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)

> +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)

> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)

> +

> +/* Internal struct to store the emulated PCI registers. */

> +struct vpci_register {

> +    vpci_read_t read;

> +    vpci_write_t write;

> +    unsigned int size;

> +    unsigned int offset;

> +    void *priv_data;

> +    struct rb_node node;

> +};

> +

> +int xen_vpci_add_handlers(struct pci_dev *pdev)

> +{

> +    int i, rc = 0;

> +

> +    if ( !has_vpci(pdev->domain) )

> +        return 0;

> +

> +    pdev->vpci = xzalloc(struct vpci);

> +    if ( !pdev->vpci )

> +        return -ENOMEM;

> +

> +    pdev->vpci->handlers = RB_ROOT;

> +

> +    for ( i = 0; i < NUM_VPCI_INIT; i++ )

> +    {

> +        rc = vpci_init[i](pdev);

> +        if ( rc )

> +            break;

> +    }

> +

> +    if ( rc )

> +    {

> +        struct rb_node *node = rb_first(&pdev->vpci->handlers);

> +        struct vpci_register *r;

> +

> +        /* Iterate over the tree and cleanup. */

> +        while ( node != NULL )

> +        {

> +            r = container_of(node, struct vpci_register, node);

> +            node = rb_next(node);

> +            rb_erase(&r->node, &pdev->vpci->handlers);

> +            xfree(r);

> +        }

> +        xfree(pdev->vpci);

> +    }

> +

> +    return rc;

> +}

> +

> +static bool vpci_register_overlap(const struct vpci_register *r,

> +                                  unsigned int offset)

> +{

> +    if ( offset >= r->offset && offset < r->offset + r->size )

> +        return true;

> +

> +    return false;

> +}

> +

> +

> +static int vpci_register_cmp(const struct vpci_register *r1,

> +                             const struct vpci_register *r2)

> +{

> +    /* Make sure there's no overlap between registers. */

> +    if ( vpci_register_overlap(r1, r2->offset) ||

> +         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||

> +         vpci_register_overlap(r2, r1->offset) ||

> +         vpci_register_overlap(r2, r1->offset + r1->size - 1) )

> +        return 0;

> +

> +    if (r1->offset < r2->offset)

> +        return -1;

> +    else if (r1->offset > r2->offset)

> +        return 1;

> +

> +    ASSERT_UNREACHABLE();

> +    return 0;

> +}

> +

> +static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,

> +                                                const unsigned int reg,

> +                                                const unsigned int size)

> +{

> +    struct rb_node *node;

> +    struct vpci_register r = {

> +        .offset = reg,

> +        .size = size,

> +    };

> +

> +    ASSERT(vpci_locked(pdev->domain));

> +

> +    node = pdev->vpci->handlers.rb_node;

> +    while ( node )

> +    {

> +        struct vpci_register *t =

> +            container_of(node, struct vpci_register, node);

> +

> +        switch ( vpci_register_cmp(&r, t) )

> +        {

> +        case -1:

> +            node = node->rb_left;

> +            break;

> +        case 1:

> +            node = node->rb_right;

> +            break;

> +        default:

> +            return t;

> +        }

> +    }

> +

> +    return NULL;

> +}

> +

> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t

> read_handler,

> +                          vpci_write_t write_handler, unsigned int offset,

> +                          unsigned int size, void *data)

> +{

> +    struct rb_node **new, *parent;

> +    struct vpci_register *r;

> +

> +    /* Some sanity checks. */

> +    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||

> +         offset & (size - 1) || read_handler == NULL || write_handler == NULL )

> +        return -EINVAL;

> +

> +    r = xzalloc(struct vpci_register);

> +    if ( !r )

> +        return -ENOMEM;

> +

> +    r->read = read_handler;

> +    r->write = write_handler;

> +    r->size = size;

> +    r->offset = offset;

> +    r->priv_data = data;

> +

> +    vpci_lock(pdev->domain);

> +    new = &pdev->vpci->handlers.rb_node;

> +    parent = NULL;

> +

> +    while (*new) {

> +        struct vpci_register *this =

> +            container_of(*new, struct vpci_register, node);

> +

> +        parent = *new;

> +        switch ( vpci_register_cmp(r, this) )

> +        {

> +        case -1:

> +            new = &((*new)->rb_left);

> +            break;

> +        case 1:

> +            new = &((*new)->rb_right);

> +            break;

> +        default:

> +            xfree(r);

> +            vpci_unlock(pdev->domain);

> +            return -EEXIST;

> +        }

> +    }

> +

> +    rb_link_node(&r->node, parent, new);

> +    rb_insert_color(&r->node, &pdev->vpci->handlers);

> +    vpci_unlock(pdev->domain);

> +

> +    return 0;

> +}

> +

> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)

> +{

> +    struct vpci_register *r;

> +

> +    vpci_lock(pdev->domain);

> +    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);

> +    if ( !r )

> +    {

> +        vpci_unlock(pdev->domain);

> +        return -ENOENT;

> +    }

> +

> +    rb_erase(&r->node, &pdev->vpci->handlers);

> +    xfree(r);

> +    vpci_unlock(pdev->domain);

> +

> +    return 0;

> +}

> +

> +/* Wrappers for performing reads/writes to the underlying hardware. */

> +static void vpci_read_hw(unsigned int seg, unsigned int bus,

> +                         unsigned int devfn, unsigned int reg, uint32_t size,

> +                         uint32_t *data)

> +{

> +    switch ( size )

> +    {

> +    case 4:

> +        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),

> +                                reg);

> +        break;

> +    case 3:

> +        /*

> +         * This is possible because a 4byte read can have 1byte trapped and

> +         * the rest passed-through.

> +         */

> +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),

> +                                reg + 1) << 8;

> +        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),

> +                               reg);

> +        break;

> +    case 2:

> +        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),

> +                                reg);

> +        break;

> +    case 1:

> +        *data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),

> +                               reg);

> +        break;

> +    default:

> +        BUG();

> +    }

> +}

> +

> +static void vpci_write_hw(unsigned int seg, unsigned int bus,

> +                          unsigned int devfn, unsigned int reg, uint32_t size,

> +                          uint32_t data)

> +{

> +    switch ( size )

> +    {

> +    case 4:

> +        pci_conf_write32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,

> +                         data);

> +        break;

> +    case 3:

> +        /*

> +         * This is possible because a 4byte write can have 1byte trapped and

> +         * the rest passed-through.

> +         */

> +        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,

> data);

> +        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg +

> 1,

> +                         data >> 8);

> +        break;

> +    case 2:

> +        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,

> +                         data);

> +        break;

> +    case 1:

> +        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,

> data);

> +        break;

> +    default:

> +        BUG();

> +    }

> +}

> +

> +/* Helper macros for the read/write handlers. */

> +#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)

> +#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8

> +#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) *

> 8)

> +

> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,

> +                  unsigned int reg, uint32_t size, uint32_t *data)

> +{

> +    struct domain *d = current->domain;

> +    struct pci_dev *pdev;

> +    const struct vpci_register *r;

> +    union vpci_val val = { .double_word = 0 };

> +    unsigned int data_rshift = 0, data_lshift = 0, data_size;

> +    uint32_t tmp_data;

> +    int rc;

> +

> +    ASSERT(vpci_locked(d));

> +

> +    *data = 0;

> +

> +    /* Find the PCI dev matching the address. */

> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);

> +    if ( !pdev )

> +        goto passthrough;


I hope this can eventually be generalised so I wonder what your intention is regarding co-existence between Xen emulated PCI config space, pass-through and PCI devices emulated externally. We already have a framework for registering PCI devices by SBDF but this code seems to make no use of it, which I suspect is likely to cause future conflict.

  Paul

> +

> +    /* Find the vPCI register handler. */

> +    r = vpci_find_register(pdev, reg, size);

> +    if ( !r )

> +        goto passthrough;

> +

> +    if ( r->offset > reg )

> +    {

> +        /*

> +         * There's a heading gap into the emulated register.

> +         * NB: it's possible for this recursive call to have a size of 3.

> +         */

> +        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);

> +        if ( rc )

> +            return rc;

> +

> +        /* Add the head read to the partial result. */

> +        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);

> +        data_lshift = r->offset - reg;

> +

> +        /* Account for the read. */

> +        size -= data_lshift;

> +        reg += data_lshift;

> +    }

> +    else if ( r->offset < reg )

> +        /* There's an offset into the emulated register */

> +        data_rshift = reg - r->offset;

> +

> +    ASSERT(data_lshift == 0 || data_rshift == 0);

> +    data_size = min(size, r->size - data_rshift);

> +    ASSERT(data_size != 0);

> +

> +    /* Perform the read of the register. */

> +    rc = r->read(pdev, r->offset, &val, r->priv_data);

> +    if ( rc )

> +        return rc;

> +

> +    val.double_word >>= data_rshift * 8;

> +    ADD_RESULT(*data, val.double_word, data_size, data_lshift);

> +

> +    /* Account for the read */

> +    size -= data_size;

> +    reg += data_size;

> +

> +    /* Read the remaining, if any. */

> +    if ( size > 0 )

> +    {

> +        /*

> +         * Read tailing data.

> +         * NB: it's possible for this recursive call to have a size of 3.

> +         */

> +        rc = xen_vpci_read(seg, bus, devfn, reg, size, &tmp_data);

> +        if ( rc )

> +            return rc;

> +

> +        /* Add the tail read to the partial result. */

> +        ADD_RESULT(*data, tmp_data, size, data_size + data_lshift);

> +    }

> +

> +    return 0;

> +

> + passthrough:

> +    vpci_read_hw(seg, bus, devfn, reg, size, data);

> +    return 0;

> +}

> +

> +/* Perform a maybe partial write to a register. */

> +static int vpci_write_helper(struct pci_dev *pdev,

> +                             const struct vpci_register *r, unsigned int size,

> +                             unsigned int offset, uint32_t data)

> +{

> +    union vpci_val val = { .double_word = data };

> +    int rc;

> +

> +    ASSERT(size <= r->size);

> +    if ( size != r->size )

> +    {

> +        rc = r->read(pdev, r->offset, &val, r->priv_data);

> +        if ( rc )

> +            return rc;

> +        val.double_word &= ~GENMASK_BYTES(size + offset, offset);

> +        data &= GENMASK_BYTES(size, 0);

> +        val.double_word |= data << (offset * 8);

> +    }

> +

> +    return r->write(pdev, r->offset, val, r->priv_data);

> +}

> +

> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,

> +                   unsigned int reg, uint32_t size, uint32_t data)

> +{

> +    struct domain *d = current->domain;

> +    struct pci_dev *pdev;

> +    const struct vpci_register *r;

> +    unsigned int data_size, data_offset = 0;

> +    int rc;

> +

> +    ASSERT(vpci_locked(d));

> +

> +    /* Find the PCI dev matching the address. */

> +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);

> +    if ( !pdev )

> +        goto passthrough;

> +

> +    /* Find the vPCI register handler. */

> +    r = vpci_find_register(pdev, reg, size);

> +    if ( !r )

> +        goto passthrough;

> +

> +    else if ( r->offset > reg )

> +    {

> +        /*

> +         * There's a heading gap into the emulated register found.

> +         * NB: it's possible for this recursive call to have a size of 3.

> +         */

> +        rc = xen_vpci_write(seg, bus, devfn, reg, r->offset - reg, data);

> +        if ( rc )

> +            return rc;

> +

> +        /* Advance the data by the written size. */

> +        SHIFT_RIGHT_BYTES(data, r->offset - reg);

> +        size -= r->offset - reg;

> +        reg += r->offset - reg;

> +    }

> +    else if ( r->offset < reg )

> +        /* There's an offset into the emulated register. */

> +        data_offset = reg - r->offset;

> +

> +    data_size = min(size, r->size - data_offset);

> +

> +    /* Perform the write of the register. */

> +    ASSERT(data_size != 0);

> +    rc = vpci_write_helper(pdev, r, data_size, data_offset, data);

> +    if ( rc )

> +        return rc;

> +

> +    /* Account for the read */

> +    size -= data_size;

> +    reg += data_size;

> +    SHIFT_RIGHT_BYTES(data, data_size);

> +

> +    /* Write the remaining, if any. */

> +    if ( size > 0 )

> +    {

> +        /*

> +         * Write tailing data.

> +         * NB: it's possible for this recursive call to have a size of 3.

> +         */

> +        rc = xen_vpci_write(seg, bus, devfn, reg, size, data);

> +        if ( rc )

> +            return rc;

> +    }

> +

> +    return 0;

> +

> + passthrough:

> +    vpci_write_hw(seg, bus, devfn, reg, size, data);

> +    return 0;

> +}

> +

> +/*

> + * Local variables:

> + * mode: C

> + * c-file-style: "BSD"

> + * c-basic-offset: 4

> + * tab-width: 4

> + * indent-tabs-mode: nil

> + * End:

> + */

> +

> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h

> index 6ab987f231..f0741917ed 100644

> --- a/xen/include/asm-x86/domain.h

> +++ b/xen/include/asm-x86/domain.h

> @@ -426,6 +426,7 @@ struct arch_domain

>  #define has_vpit(d)        (!!((d)->arch.emulation_flags &

> XEN_X86_EMU_PIT))

>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & \

>                              XEN_X86_EMU_USE_PIRQ))

> +#define has_vpci(d)        (!!((d)->arch.emulation_flags &

> XEN_X86_EMU_VPCI))

> 

>  #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))

> 

> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-

> x86/hvm/domain.h

> index d2899c9bb2..cbf4170789 100644

> --- a/xen/include/asm-x86/hvm/domain.h

> +++ b/xen/include/asm-x86/hvm/domain.h

> @@ -184,6 +184,9 @@ struct hvm_domain {

>      /* List of guest to machine IO ports mapping. */

>      struct list_head g2m_ioport_list;

> 

> +    /* Lock for the PCI emulation layer (vPCI). */

> +    spinlock_t vpci_lock;

> +

>      /* List of permanently write-mapped pages. */

>      struct {

>          spinlock_t lock;

> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h

> index 2484eb1c75..2dbf92f13e 100644

> --- a/xen/include/asm-x86/hvm/io.h

> +++ b/xen/include/asm-x86/hvm/io.h

> @@ -155,6 +155,9 @@ extern void hvm_dpci_msi_eoi(struct domain *d, int

> vector);

>   */

>  void register_g2m_portio_handler(struct domain *d);

> 

> +/* HVM port IO handler for PCI accesses. */

> +void register_vpci_portio_handler(struct domain *d);

> +

>  #endif /* __ASM_X86_HVM_IO_H__ */

> 

> 

> diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-

> x86/xen.h

> index 8a9ba7982b..c00f8cda93 100644

> --- a/xen/include/public/arch-x86/xen.h

> +++ b/xen/include/public/arch-x86/xen.h

> @@ -295,12 +295,15 @@ struct xen_arch_domainconfig {

>  #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)

>  #define _XEN_X86_EMU_USE_PIRQ       9

>  #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)

> +#define _XEN_X86_EMU_VPCI           10

> +#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)

> 

>  #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC |

> XEN_X86_EMU_HPET |  \

>                                       XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \

>                                       XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \

>                                       XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \

> -                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)

> +                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\

> +                                     XEN_X86_EMU_VPCI)

>      uint32_t emulation_flags;

>  };

> 

> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h

> index 59b6e8a81c..a83c4a1276 100644

> --- a/xen/include/xen/pci.h

> +++ b/xen/include/xen/pci.h

> @@ -13,6 +13,7 @@

>  #include <xen/irq.h>

>  #include <xen/pci_regs.h>

>  #include <xen/pfn.h>

> +#include <xen/rbtree.h>

>  #include <asm/device.h>

>  #include <asm/numa.h>

>  #include <asm/pci.h>

> @@ -88,6 +89,9 @@ struct pci_dev {

>  #define PT_FAULT_THRESHOLD 10

>      } fault;

>      u64 vf_rlen[6];

> +

> +    /* Data for vPCI. */

> +    struct vpci *vpci;

>  };

> 

>  #define for_each_pdev(domain, pdev) \

> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h

> new file mode 100644

> index 0000000000..56e8d1c35e

> --- /dev/null

> +++ b/xen/include/xen/vpci.h

> @@ -0,0 +1,66 @@

> +#ifndef _VPCI_

> +#define _VPCI_

> +

> +#include <xen/pci.h>

> +#include <xen/types.h>

> +

> +/* Helpers for locking/unlocking. */

> +#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)

> +#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)

> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)

> +

> +/* Value read or written by the handlers. */

> +union vpci_val {

> +    uint8_t half_word;

> +    uint16_t word;

> +    uint32_t double_word;

> +};

> +

> +/*

> + * The vPCI handlers will never be called concurrently for the same domain,

> ii

> + * is guaranteed that the vpci domain lock will always be locked when calling

> + * any handler.

> + */

> +typedef int (*vpci_read_t)(struct pci_dev *pdev, unsigned int reg,

> +                           union vpci_val *val, void *data);

> +

> +typedef int (*vpci_write_t)(struct pci_dev *pdev, unsigned int reg,

> +                            union vpci_val val, void *data);

> +

> +typedef int (*vpci_register_init_t)(struct pci_dev *dev);

> +

> +#define REGISTER_VPCI_INIT(x) \

> +  static const vpci_register_init_t x##_entry __used_section(".data.vpci") =

> x

> +

> +/* Add vPCI handlers to device. */

> +int xen_vpci_add_handlers(struct pci_dev *dev);

> +

> +/* Add/remove a register handler. */

> +int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t

> read_handler,

> +                          vpci_write_t write_handler, unsigned int offset,

> +                          unsigned int size, void *data);

> +int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);

> +

> +/* Generic read/write handlers for the PCI config space. */

> +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,

> +                  unsigned int reg, uint32_t size, uint32_t *data);

> +int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,

> +                   unsigned int reg, uint32_t size, uint32_t data);

> +

> +struct vpci {

> +    /* Root pointer for the tree of vPCI handlers. */

> +    struct rb_root handlers;

> +};

> +

> +#endif

> +

> +/*

> + * Local variables:

> + * mode: C

> + * c-file-style: "BSD"

> + * c-basic-offset: 4

> + * tab-width: 4

> + * indent-tabs-mode: nil

> + * End:

> + */

> +

> --

> 2.11.0 (Apple Git-81)
Roger Pau Monné April 24, 2017, 9:09 a.m. UTC | #3
On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: 20 April 2017 16:18
> > To: xen-devel@lists.xenproject.org
> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap accesses
> > to the PCI config space
> > 
> > This functionality is going to reside in vpci.c (and the corresponding vpci.h
> > header), and should be arch-agnostic. The handlers introduced in this patch
> > setup the basic functionality required in order to trap accesses to the PCI
> > config space, and allow decoding the address and finding the corresponding
> > handler that should handle the access (although no handlers are
> > implemented).
> > 
> > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup
> > inside of a x86 HVM file, since that's not shared with other arches.
> > 
> > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
> > whether
> > a domain should use the newly introduced vPCI handlers, this is only enabled
> > for PVH Dom0 at the moment.
> > 
> > A very simple user-space test is also provided, so that the basic functionality
> > of the vPCI traps can be asserted. This has been proven quite helpful during
> > development, since the logic to handle partial accesses or accesses that
> > expand
> > across multiple registers is not trivial.
> > 
> > The handlers for the registers are added to a red-black tree, that indexes
> > them
> > based on their offset. Since Xen needs to handle partial accesses to the
> > registers and access that expand across multiple registers the logic in
> > xen_vpci_{read/write} is kind of convoluted, I've tried to properly comment
> > it
> > in order to make it easier to understand.
> > 
> 
> Since config space is not exactly huge, I'm wondering why you used an r-b tree rather than a direct map from register to handler?

Hello,

For local PCI the configuration space it's 256byte only, which means using 1/2
a page (256 * 8) so that Xen can store a pointer for each possible register.
The extended configuration space (ECAM) extends the space to 4K, which means we
would use 8 pages per device (4096*8), I think that's too much.

Roger.
Paul Durrant April 24, 2017, 9:34 a.m. UTC | #4
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 10:09
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > Sent: 20 April 2017 16:18
> > > To: xen-devel@lists.xenproject.org
> > > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau
> Monne
> > > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> Cooper
> > > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> > > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses
> > > to the PCI config space
> > >
> > > This functionality is going to reside in vpci.c (and the corresponding vpci.h
> > > header), and should be arch-agnostic. The handlers introduced in this
> patch
> > > setup the basic functionality required in order to trap accesses to the PCI
> > > config space, and allow decoding the address and finding the
> corresponding
> > > handler that should handle the access (although no handlers are
> > > implemented).
> > >
> > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup
> > > inside of a x86 HVM file, since that's not shared with other arches.
> > >
> > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal
> Xen
> > > whether
> > > a domain should use the newly introduced vPCI handlers, this is only
> enabled
> > > for PVH Dom0 at the moment.
> > >
> > > A very simple user-space test is also provided, so that the basic
> functionality
> > > of the vPCI traps can be asserted. This has been proven quite helpful
> during
> > > development, since the logic to handle partial accesses or accesses that
> > > expand
> > > across multiple registers is not trivial.
> > >
> > > The handlers for the registers are added to a red-black tree, that indexes
> > > them
> > > based on their offset. Since Xen needs to handle partial accesses to the
> > > registers and access that expand across multiple registers the logic in
> > > xen_vpci_{read/write} is kind of convoluted, I've tried to properly
> comment
> > > it
> > > in order to make it easier to understand.
> > >
> >
> > Since config space is not exactly huge, I'm wondering why you used an r-b
> tree rather than a direct map from register to handler?
> 
> Hello,
> 
> For local PCI the configuration space it's 256byte only, which means using 1/2
> a page (256 * 8) so that Xen can store a pointer for each possible register.
> The extended configuration space (ECAM) extends the space to 4K, which
> means we
> would use 8 pages per device (4096*8), I think that's too much.

Ok, but I still think that adding an r-b tree implementation is just more complexity in the way that io handlers are registered in Xen.

TBH, the whole thing needs a clean-up. We don't have proper range-based handler registration for port IO or MMIO at all (instead we potentially call the 'accept' function for every handler for every I/O). We then have (IIRC) an ordered list for MSI-X BAR registrations and now you're proposing an r-b system for PCI config space. On top of that, there is then the rangeset based ioreq server selection that occurs if the I/O falls through all of this and needs sending outside Xen. There really has to be at least some scope for unificiation here; it's getting way too convoluted.

  Paul

> 
> Roger.
Roger Pau Monné April 24, 2017, 9:42 a.m. UTC | #5
On Fri, Apr 21, 2017 at 05:23:34PM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
[...]
> > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
> > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    union vpci_val val = { .double_word = 0 };
> > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > +    uint32_t tmp_data;
> > +    int rc;
> > +
> > +    ASSERT(vpci_locked(d));
> > +
> > +    *data = 0;
> > +
> > +    /* Find the PCI dev matching the address. */
> > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > +    if ( !pdev )
> > +        goto passthrough;
> 
> I hope this can eventually be generalised so I wonder what your intention is regarding co-existence between Xen emulated PCI config space, pass-through and PCI devices emulated externally. We already have a framework for registering PCI devices by SBDF but this code seems to make no use of it, which I suspect is likely to cause future conflict.

Yes, the long term aim is to use this code in order to implement
PCI-passthrough for PVH and HVM DomUs also.

TBH, I didn't know we already had such code (I assume you mean the IOREQ
related PCI code). As it is, I see a couple of issues with that, the first one
is that this code expects a ioreq client on the other end, and the code I'm
adding here is all inside of the hypervisor. The second issue is that the IOREQ
code ATM only allows for local PCI accesses, which means I should extend it to
also deal with ECAM/MMCFG areas.

I completely agree that at some point this should be made to work together, but
I'm not sure if it would be better to do that once we want to also use vPCI for
DomUs, so that the Dom0 side is not delayed further.

Roger.
Paul Durrant April 24, 2017, 9:55 a.m. UTC | #6
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 10:42
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Fri, Apr 21, 2017 at 05:23:34PM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> [...]
> > > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int
> devfn,
> > > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > > +{
> > > +    struct domain *d = current->domain;
> > > +    struct pci_dev *pdev;
> > > +    const struct vpci_register *r;
> > > +    union vpci_val val = { .double_word = 0 };
> > > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > > +    uint32_t tmp_data;
> > > +    int rc;
> > > +
> > > +    ASSERT(vpci_locked(d));
> > > +
> > > +    *data = 0;
> > > +
> > > +    /* Find the PCI dev matching the address. */
> > > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > > +    if ( !pdev )
> > > +        goto passthrough;
> >
> > I hope this can eventually be generalised so I wonder what your intention is
> regarding co-existence between Xen emulated PCI config space, pass-
> through and PCI devices emulated externally. We already have a framework
> for registering PCI devices by SBDF but this code seems to make no use of it,
> which I suspect is likely to cause future conflict.
> 
> Yes, the long term aim is to use this code in order to implement
> PCI-passthrough for PVH and HVM DomUs also.
> 
> TBH, I didn't know we already had such code (I assume you mean the IOREQ
> related PCI code). As it is, I see a couple of issues with that, the first one
> is that this code expects a ioreq client on the other end, and the code I'm
> adding here is all inside of the hypervisor. The second issue is that the IOREQ
> code ATM only allows for local PCI accesses, which means I should extend it
> to
> also deal with ECAM/MMCFG areas.
> 
> I completely agree that at some point this should be made to work together,
> but
> I'm not sure if it would be better to do that once we want to also use vPCI for
> DomUs, so that the Dom0 side is not delayed further.
> 

If the follow up work will definitely be done, then I can live with that. Is there an actual plan to deal with domU pass-through on some backlog somewhere?

  Paul

> Roger.
Paul Durrant April 24, 2017, 9:58 a.m. UTC | #7
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 10:42
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Fri, Apr 21, 2017 at 05:23:34PM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> [...]
> > > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int
> devfn,
> > > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > > +{
> > > +    struct domain *d = current->domain;
> > > +    struct pci_dev *pdev;
> > > +    const struct vpci_register *r;
> > > +    union vpci_val val = { .double_word = 0 };
> > > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > > +    uint32_t tmp_data;
> > > +    int rc;
> > > +
> > > +    ASSERT(vpci_locked(d));
> > > +
> > > +    *data = 0;
> > > +
> > > +    /* Find the PCI dev matching the address. */
> > > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > > +    if ( !pdev )
> > > +        goto passthrough;
> >
> > I hope this can eventually be generalised so I wonder what your intention is
> regarding co-existence between Xen emulated PCI config space, pass-
> through and PCI devices emulated externally. We already have a framework
> for registering PCI devices by SBDF but this code seems to make no use of it,
> which I suspect is likely to cause future conflict.
> 
> Yes, the long term aim is to use this code in order to implement
> PCI-passthrough for PVH and HVM DomUs also.
> 
> TBH, I didn't know we already had such code (I assume you mean the IOREQ
> related PCI code). As it is, I see a couple of issues with that, the first one
> is that this code expects a ioreq client on the other end, and the code I'm
> adding here is all inside of the hypervisor. The second issue is that the IOREQ
> code ATM only allows for local PCI accesses, which means I should extend it
> to
> also deal with ECAM/MMCFG areas.
> 
> I completely agree that at some point this should be made to work together,
> but
> I'm not sure if it would be better to do that once we want to also use vPCI for
> DomUs, so that the Dom0 side is not delayed further.

BTW, that's also an argument for forgetting about the r-b scheme for handler registration since, if this really is for dom0 only, 8 pages worth of direct map is not a lot.

  Paul

> 
> Roger.
Roger Pau Monné April 24, 2017, 10:08 a.m. UTC | #8
On Mon, Apr 24, 2017 at 10:34:15AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 24 April 2017 10:09
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > accesses to the PCI config space
> > 
> > On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > > Sent: 20 April 2017 16:18
> > > > To: xen-devel@lists.xenproject.org
> > > > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau
> > Monne
> > > > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > Cooper
> > > > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> > > > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > accesses
> > > > to the PCI config space
> > > >
> > > > This functionality is going to reside in vpci.c (and the corresponding vpci.h
> > > > header), and should be arch-agnostic. The handlers introduced in this
> > patch
> > > > setup the basic functionality required in order to trap accesses to the PCI
> > > > config space, and allow decoding the address and finding the
> > corresponding
> > > > handler that should handle the access (although no handlers are
> > > > implemented).
> > > >
> > > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup
> > > > inside of a x86 HVM file, since that's not shared with other arches.
> > > >
> > > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal
> > Xen
> > > > whether
> > > > a domain should use the newly introduced vPCI handlers, this is only
> > enabled
> > > > for PVH Dom0 at the moment.
> > > >
> > > > A very simple user-space test is also provided, so that the basic
> > functionality
> > > > of the vPCI traps can be asserted. This has been proven quite helpful
> > during
> > > > development, since the logic to handle partial accesses or accesses that
> > > > expand
> > > > across multiple registers is not trivial.
> > > >
> > > > The handlers for the registers are added to a red-black tree, that indexes
> > > > them
> > > > based on their offset. Since Xen needs to handle partial accesses to the
> > > > registers and access that expand across multiple registers the logic in
> > > > xen_vpci_{read/write} is kind of convoluted, I've tried to properly
> > comment
> > > > it
> > > > in order to make it easier to understand.
> > > >
> > >
> > > Since config space is not exactly huge, I'm wondering why you used an r-b
> > tree rather than a direct map from register to handler?
> > 
> > Hello,
> > 
> > For local PCI the configuration space it's 256byte only, which means using 1/2
> > a page (256 * 8) so that Xen can store a pointer for each possible register.
> > The extended configuration space (ECAM) extends the space to 4K, which
> > means we
> > would use 8 pages per device (4096*8), I think that's too much.
> 
> Ok, but I still think that adding an r-b tree implementation is just more complexity in the way that io handlers are registered in Xen.

But this complexity is completely hidden inside of the io handler itself that
traps the access to 0xcf8/cfc (or ECAM areas).

Do you mean that you would like this functionality to made available to IOREQ
clients also, so that they could register handlers for specific PCI registers
without owning the full configuration space of such device?

> TBH, the whole thing needs a clean-up. We don't have proper range-based handler registration for port IO or MMIO at all (instead we potentially call the 'accept' function for every handler for every I/O). We then have (IIRC) an ordered list for MSI-X BAR registrations and now you're proposing an r-b system for PCI config space.

One way or another Xen needs to track handlers for the PCI config space, and
currently this is not implemented inside of Xen.

The MSI-X BAR tracking will go away once this code is also used for
PCI-passthrough to DomUs. The msixtbl code is just extremely messy, because
MSI-X interrupt handling for passthrough devices is partially handled in QEMU
and partially inside of Xen.

> On top of that, there is then the rangeset based ioreq server selection that occurs if the I/O falls through all of this and needs sending outside Xen. There really has to be at least some scope for unificiation here; it's getting way too convoluted.

Yes, I agree that there's some room for sharing here, the address decoding done
in hvm_select_ioreq_server for PCI could be reused for vPCI also, it's just
that all this code expects a IOREQ server, and vPCI is not going to be an IOREQ
server.

Roger.
Roger Pau Monné April 24, 2017, 10:11 a.m. UTC | #9
On Mon, Apr 24, 2017 at 10:58:04AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 24 April 2017 10:42
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > accesses to the PCI config space
> > 
> > On Fri, Apr 21, 2017 at 05:23:34PM +0100, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > [...]
> > > > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int
> > devfn,
> > > > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > > > +{
> > > > +    struct domain *d = current->domain;
> > > > +    struct pci_dev *pdev;
> > > > +    const struct vpci_register *r;
> > > > +    union vpci_val val = { .double_word = 0 };
> > > > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > > > +    uint32_t tmp_data;
> > > > +    int rc;
> > > > +
> > > > +    ASSERT(vpci_locked(d));
> > > > +
> > > > +    *data = 0;
> > > > +
> > > > +    /* Find the PCI dev matching the address. */
> > > > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > > > +    if ( !pdev )
> > > > +        goto passthrough;
> > >
> > > I hope this can eventually be generalised so I wonder what your intention is
> > regarding co-existence between Xen emulated PCI config space, pass-
> > through and PCI devices emulated externally. We already have a framework
> > for registering PCI devices by SBDF but this code seems to make no use of it,
> > which I suspect is likely to cause future conflict.
> > 
> > Yes, the long term aim is to use this code in order to implement
> > PCI-passthrough for PVH and HVM DomUs also.
> > 
> > TBH, I didn't know we already had such code (I assume you mean the IOREQ
> > related PCI code). As it is, I see a couple of issues with that, the first one
> > is that this code expects a ioreq client on the other end, and the code I'm
> > adding here is all inside of the hypervisor. The second issue is that the IOREQ
> > code ATM only allows for local PCI accesses, which means I should extend it
> > to
> > also deal with ECAM/MMCFG areas.
> > 
> > I completely agree that at some point this should be made to work together,
> > but
> > I'm not sure if it would be better to do that once we want to also use vPCI for
> > DomUs, so that the Dom0 side is not delayed further.
> 
> BTW, that's also an argument for forgetting about the r-b scheme for handler registration since, if this really is for dom0 only, 8 pages worth of direct map is not a lot.

It's 8 pages for each device, not 8 pages for each domain, so it doesn't matter
if it's Dom0 or DomU, each PCIe device would use 8 pages.

Roger.
Paul Durrant April 24, 2017, 10:12 a.m. UTC | #10
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 11:12
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Mon, Apr 24, 2017 at 10:58:04AM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne
> > > Sent: 24 April 2017 10:42
> > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> Liu
> > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> Cooper
> > > <Andrew.Cooper3@citrix.com>
> > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > accesses to the PCI config space
> > >
> > > On Fri, Apr 21, 2017 at 05:23:34PM +0100, Paul Durrant wrote:
> > > > > -----Original Message-----
> > > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > [...]
> > > > > +int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int
> > > devfn,
> > > > > +                  unsigned int reg, uint32_t size, uint32_t *data)
> > > > > +{
> > > > > +    struct domain *d = current->domain;
> > > > > +    struct pci_dev *pdev;
> > > > > +    const struct vpci_register *r;
> > > > > +    union vpci_val val = { .double_word = 0 };
> > > > > +    unsigned int data_rshift = 0, data_lshift = 0, data_size;
> > > > > +    uint32_t tmp_data;
> > > > > +    int rc;
> > > > > +
> > > > > +    ASSERT(vpci_locked(d));
> > > > > +
> > > > > +    *data = 0;
> > > > > +
> > > > > +    /* Find the PCI dev matching the address. */
> > > > > +    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
> > > > > +    if ( !pdev )
> > > > > +        goto passthrough;
> > > >
> > > > I hope this can eventually be generalised so I wonder what your
> intention is
> > > regarding co-existence between Xen emulated PCI config space, pass-
> > > through and PCI devices emulated externally. We already have a
> framework
> > > for registering PCI devices by SBDF but this code seems to make no use of
> it,
> > > which I suspect is likely to cause future conflict.
> > >
> > > Yes, the long term aim is to use this code in order to implement
> > > PCI-passthrough for PVH and HVM DomUs also.
> > >
> > > TBH, I didn't know we already had such code (I assume you mean the
> IOREQ
> > > related PCI code). As it is, I see a couple of issues with that, the first one
> > > is that this code expects a ioreq client on the other end, and the code I'm
> > > adding here is all inside of the hypervisor. The second issue is that the
> IOREQ
> > > code ATM only allows for local PCI accesses, which means I should extend
> it
> > > to
> > > also deal with ECAM/MMCFG areas.
> > >
> > > I completely agree that at some point this should be made to work
> together,
> > > but
> > > I'm not sure if it would be better to do that once we want to also use vPCI
> for
> > > DomUs, so that the Dom0 side is not delayed further.
> >
> > BTW, that's also an argument for forgetting about the r-b scheme for
> handler registration since, if this really is for dom0 only, 8 pages worth of
> direct map is not a lot.
> 
> It's 8 pages for each device, not 8 pages for each domain, so it doesn't matter
> if it's Dom0 or DomU, each PCIe device would use 8 pages.

Sorry, yes of course it is.

  Paul

> 
> Roger.
Paul Durrant April 24, 2017, 10:19 a.m. UTC | #11
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 11:09
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Mon, Apr 24, 2017 at 10:34:15AM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne
> > > Sent: 24 April 2017 10:09
> > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> Liu
> > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> Cooper
> > > <Andrew.Cooper3@citrix.com>
> > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > accesses to the PCI config space
> > >
> > > On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > > > > -----Original Message-----
> > > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > > > Sent: 20 April 2017 16:18
> > > > > To: xen-devel@lists.xenproject.org
> > > > > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau
> > > Monne
> > > > > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> Liu
> > > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > > Cooper
> > > > > <Andrew.Cooper3@citrix.com>; Paul Durrant
> <Paul.Durrant@citrix.com>
> > > > > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > accesses
> > > > > to the PCI config space
> > > > >
> > > > > This functionality is going to reside in vpci.c (and the corresponding
> vpci.h
> > > > > header), and should be arch-agnostic. The handlers introduced in this
> > > patch
> > > > > setup the basic functionality required in order to trap accesses to the
> PCI
> > > > > config space, and allow decoding the address and finding the
> > > corresponding
> > > > > handler that should handle the access (although no handlers are
> > > > > implemented).
> > > > >
> > > > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> setup
> > > > > inside of a x86 HVM file, since that's not shared with other arches.
> > > > >
> > > > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to
> signal
> > > Xen
> > > > > whether
> > > > > a domain should use the newly introduced vPCI handlers, this is only
> > > enabled
> > > > > for PVH Dom0 at the moment.
> > > > >
> > > > > A very simple user-space test is also provided, so that the basic
> > > functionality
> > > > > of the vPCI traps can be asserted. This has been proven quite helpful
> > > during
> > > > > development, since the logic to handle partial accesses or accesses
> that
> > > > > expand
> > > > > across multiple registers is not trivial.
> > > > >
> > > > > The handlers for the registers are added to a red-black tree, that
> indexes
> > > > > them
> > > > > based on their offset. Since Xen needs to handle partial accesses to
> the
> > > > > registers and access that expand across multiple registers the logic in
> > > > > xen_vpci_{read/write} is kind of convoluted, I've tried to properly
> > > comment
> > > > > it
> > > > > in order to make it easier to understand.
> > > > >
> > > >
> > > > Since config space is not exactly huge, I'm wondering why you used an
> r-b
> > > tree rather than a direct map from register to handler?
> > >
> > > Hello,
> > >
> > > For local PCI the configuration space it's 256byte only, which means using
> 1/2
> > > a page (256 * 8) so that Xen can store a pointer for each possible register.
> > > The extended configuration space (ECAM) extends the space to 4K,
> which
> > > means we
> > > would use 8 pages per device (4096*8), I think that's too much.
> >
> > Ok, but I still think that adding an r-b tree implementation is just more
> complexity in the way that io handlers are registered in Xen.
> 
> But this complexity is completely hidden inside of the io handler itself that
> traps the access to 0xcf8/cfc (or ECAM areas).
> 
> Do you mean that you would like this functionality to made available to
> IOREQ
> clients also, so that they could register handlers for specific PCI registers
> without owning the full configuration space of such device?
> 
> > TBH, the whole thing needs a clean-up. We don't have proper range-based
> handler registration for port IO or MMIO at all (instead we potentially call the
> 'accept' function for every handler for every I/O). We then have (IIRC) an
> ordered list for MSI-X BAR registrations and now you're proposing an r-b
> system for PCI config space.
> 
> One way or another Xen needs to track handlers for the PCI config space,
> and
> currently this is not implemented inside of Xen.

What I mean is that we should have some form of range-based IO handler registration framework and then that can be used for port IO, MMIO and PCI config space. For external config space emulation then yes of course the external emulated needs to claim the whole space for that SBDF, but that's just a degenerate case of claiming a specific range within the SBDF.
Thus, if Xen can steer port IO, MMIO or PCI config accesses by range then we can potentially use that framework to register internal emulation handlers or a special emulation handler that sends the requests out to an ioreq server.

> 
> The MSI-X BAR tracking will go away once this code is also used for
> PCI-passthrough to DomUs. The msixtbl code is just extremely messy,
> because
> MSI-X interrupt handling for passthrough devices is partially handled in
> QEMU
> and partially inside of Xen.
> 
> > On top of that, there is then the rangeset based ioreq server selection that
> occurs if the I/O falls through all of this and needs sending outside Xen. There
> really has to be at least some scope for unificiation here; it's getting way too
> convoluted.
> 
> Yes, I agree that there's some room for sharing here, the address decoding
> done
> in hvm_select_ioreq_server for PCI could be reused for vPCI also, it's just
> that all this code expects a IOREQ server, and vPCI is not going to be an
> IOREQ
> server.

Indeed. Hopefully I've explained what I was thinking above.

  Paul

> 
> Roger.
Roger Pau Monné April 24, 2017, 11:02 a.m. UTC | #12
On Mon, Apr 24, 2017 at 11:19:10AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 24 April 2017 11:09
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > accesses to the PCI config space
> > 
> > On Mon, Apr 24, 2017 at 10:34:15AM +0100, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monne
> > > > Sent: 24 April 2017 10:09
> > > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> > Liu
> > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > Cooper
> > > > <Andrew.Cooper3@citrix.com>
> > > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > > accesses to the PCI config space
> > > >
> > > > On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > > > > > -----Original Message-----
> > > > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > > > > Sent: 20 April 2017 16:18
> > > > > > To: xen-devel@lists.xenproject.org
> > > > > > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau
> > > > Monne
> > > > > > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> > Liu
> > > > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > > > Cooper
> > > > > > <Andrew.Cooper3@citrix.com>; Paul Durrant
> > <Paul.Durrant@citrix.com>
> > > > > > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > > accesses
> > > > > > to the PCI config space
> > > > > >
> > > > > > This functionality is going to reside in vpci.c (and the corresponding
> > vpci.h
> > > > > > header), and should be arch-agnostic. The handlers introduced in this
> > > > patch
> > > > > > setup the basic functionality required in order to trap accesses to the
> > PCI
> > > > > > config space, and allow decoding the address and finding the
> > > > corresponding
> > > > > > handler that should handle the access (although no handlers are
> > > > > > implemented).
> > > > > >
> > > > > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> > setup
> > > > > > inside of a x86 HVM file, since that's not shared with other arches.
> > > > > >
> > > > > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to
> > signal
> > > > Xen
> > > > > > whether
> > > > > > a domain should use the newly introduced vPCI handlers, this is only
> > > > enabled
> > > > > > for PVH Dom0 at the moment.
> > > > > >
> > > > > > A very simple user-space test is also provided, so that the basic
> > > > functionality
> > > > > > of the vPCI traps can be asserted. This has been proven quite helpful
> > > > during
> > > > > > development, since the logic to handle partial accesses or accesses
> > that
> > > > > > expand
> > > > > > across multiple registers is not trivial.
> > > > > >
> > > > > > The handlers for the registers are added to a red-black tree, that
> > indexes
> > > > > > them
> > > > > > based on their offset. Since Xen needs to handle partial accesses to
> > the
> > > > > > registers and access that expand across multiple registers the logic in
> > > > > > xen_vpci_{read/write} is kind of convoluted, I've tried to properly
> > > > comment
> > > > > > it
> > > > > > in order to make it easier to understand.
> > > > > >
> > > > >
> > > > > Since config space is not exactly huge, I'm wondering why you used an
> > r-b
> > > > tree rather than a direct map from register to handler?
> > > >
> > > > Hello,
> > > >
> > > > For local PCI the configuration space it's 256byte only, which means using
> > 1/2
> > > > a page (256 * 8) so that Xen can store a pointer for each possible register.
> > > > The extended configuration space (ECAM) extends the space to 4K,
> > which
> > > > means we
> > > > would use 8 pages per device (4096*8), I think that's too much.
> > >
> > > Ok, but I still think that adding an r-b tree implementation is just more
> > complexity in the way that io handlers are registered in Xen.
> > 
> > But this complexity is completely hidden inside of the io handler itself that
> > traps the access to 0xcf8/cfc (or ECAM areas).
> > 
> > Do you mean that you would like this functionality to made available to
> > IOREQ
> > clients also, so that they could register handlers for specific PCI registers
> > without owning the full configuration space of such device?
> > 
> > > TBH, the whole thing needs a clean-up. We don't have proper range-based
> > handler registration for port IO or MMIO at all (instead we potentially call the
> > 'accept' function for every handler for every I/O). We then have (IIRC) an
> > ordered list for MSI-X BAR registrations and now you're proposing an r-b
> > system for PCI config space.
> > 
> > One way or another Xen needs to track handlers for the PCI config space,
> > and
> > currently this is not implemented inside of Xen.
> 
> What I mean is that we should have some form of range-based IO handler registration framework and then that can be used for port IO, MMIO and PCI config space. For external config space emulation then yes of course the external emulated needs to claim the whole space for that SBDF, but that's just a degenerate case of claiming a specific range within the SBDF.
> Thus, if Xen can steer port IO, MMIO or PCI config accesses by range then we can potentially use that framework to register internal emulation handlers or a special emulation handler that sends the requests out to an ioreq server.

IMHO I'm not sure Xen needs PCI register based trapping granularity. I would
argue that whatever (IOREQ or Xen internal function) that wants to trap access
to a specific PCI config device register needs to take care of all the
registers for that device.

I will look into hooking this code (vPCI) into the existing hvm_*_ioreq
functionality, so that vPCI claims the full PCI config space for each device it
manages.

Thanks, Roger.
Paul Durrant April 24, 2017, 11:50 a.m. UTC | #13
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 24 April 2017 12:03
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Mon, Apr 24, 2017 at 11:19:10AM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne
> > > Sent: 24 April 2017 11:09
> > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> Liu
> > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> Cooper
> > > <Andrew.Cooper3@citrix.com>
> > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > accesses to the PCI config space
> > >
> > > On Mon, Apr 24, 2017 at 10:34:15AM +0100, Paul Durrant wrote:
> > > > > -----Original Message-----
> > > > > From: Roger Pau Monne
> > > > > Sent: 24 April 2017 10:09
> > > > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>;
> Wei
> > > Liu
> > > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > > Cooper
> > > > > <Andrew.Cooper3@citrix.com>
> > > > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > > > accesses to the PCI config space
> > > > >
> > > > > On Fri, Apr 21, 2017 at 05:07:43PM +0100, Paul Durrant wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > > > > > Sent: 20 April 2017 16:18
> > > > > > > To: xen-devel@lists.xenproject.org
> > > > > > > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger
> Pau
> > > > > Monne
> > > > > > > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>;
> Wei
> > > Liu
> > > > > > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> > > > > Cooper
> > > > > > > <Andrew.Cooper3@citrix.com>; Paul Durrant
> > > <Paul.Durrant@citrix.com>
> > > > > > > Subject: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > > > accesses
> > > > > > > to the PCI config space
> > > > > > >
> > > > > > > This functionality is going to reside in vpci.c (and the
> corresponding
> > > vpci.h
> > > > > > > header), and should be arch-agnostic. The handlers introduced in
> this
> > > > > patch
> > > > > > > setup the basic functionality required in order to trap accesses to
> the
> > > PCI
> > > > > > > config space, and allow decoding the address and finding the
> > > > > corresponding
> > > > > > > handler that should handle the access (although no handlers are
> > > > > > > implemented).
> > > > > > >
> > > > > > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> > > setup
> > > > > > > inside of a x86 HVM file, since that's not shared with other arches.
> > > > > > >
> > > > > > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to
> > > signal
> > > > > Xen
> > > > > > > whether
> > > > > > > a domain should use the newly introduced vPCI handlers, this is
> only
> > > > > enabled
> > > > > > > for PVH Dom0 at the moment.
> > > > > > >
> > > > > > > A very simple user-space test is also provided, so that the basic
> > > > > functionality
> > > > > > > of the vPCI traps can be asserted. This has been proven quite
> helpful
> > > > > during
> > > > > > > development, since the logic to handle partial accesses or
> accesses
> > > that
> > > > > > > expand
> > > > > > > across multiple registers is not trivial.
> > > > > > >
> > > > > > > The handlers for the registers are added to a red-black tree, that
> > > indexes
> > > > > > > them
> > > > > > > based on their offset. Since Xen needs to handle partial accesses
> to
> > > the
> > > > > > > registers and access that expand across multiple registers the logic
> in
> > > > > > > xen_vpci_{read/write} is kind of convoluted, I've tried to properly
> > > > > comment
> > > > > > > it
> > > > > > > in order to make it easier to understand.
> > > > > > >
> > > > > >
> > > > > > Since config space is not exactly huge, I'm wondering why you used
> an
> > > r-b
> > > > > tree rather than a direct map from register to handler?
> > > > >
> > > > > Hello,
> > > > >
> > > > > For local PCI the configuration space it's 256byte only, which means
> using
> > > 1/2
> > > > > a page (256 * 8) so that Xen can store a pointer for each possible
> register.
> > > > > The extended configuration space (ECAM) extends the space to 4K,
> > > which
> > > > > means we
> > > > > would use 8 pages per device (4096*8), I think that's too much.
> > > >
> > > > Ok, but I still think that adding an r-b tree implementation is just more
> > > complexity in the way that io handlers are registered in Xen.
> > >
> > > But this complexity is completely hidden inside of the io handler itself
> that
> > > traps the access to 0xcf8/cfc (or ECAM areas).
> > >
> > > Do you mean that you would like this functionality to made available to
> > > IOREQ
> > > clients also, so that they could register handlers for specific PCI registers
> > > without owning the full configuration space of such device?
> > >
> > > > TBH, the whole thing needs a clean-up. We don't have proper range-
> based
> > > handler registration for port IO or MMIO at all (instead we potentially call
> the
> > > 'accept' function for every handler for every I/O). We then have (IIRC) an
> > > ordered list for MSI-X BAR registrations and now you're proposing an r-b
> > > system for PCI config space.
> > >
> > > One way or another Xen needs to track handlers for the PCI config space,
> > > and
> > > currently this is not implemented inside of Xen.
> >
> > What I mean is that we should have some form of range-based IO handler
> registration framework and then that can be used for port IO, MMIO and PCI
> config space. For external config space emulation then yes of course the
> external emulated needs to claim the whole space for that SBDF, but that's
> just a degenerate case of claiming a specific range within the SBDF.
> > Thus, if Xen can steer port IO, MMIO or PCI config accesses by range then
> we can potentially use that framework to register internal emulation
> handlers or a special emulation handler that sends the requests out to an
> ioreq server.
> 
> IMHO I'm not sure Xen needs PCI register based trapping granularity. I would
> argue that whatever (IOREQ or Xen internal function) that wants to trap
> access
> to a specific PCI config device register needs to take care of all the
> registers for that device.
> 

Having distinct handers for distinct groups of makes sense though... e.g. being able to register a BAR handler for each BAR and then maybe an MSI-X capability handler for wherever that appears in the capability chain, etc. If you don't allow such registration at the top level then it ends up getting done at the next level. That said, it may make more sense to have a top level of emulation that just handles all register reads and writes to config space and then a second level that has callbacks for BAR enumeration, bus master enable, MSI-X mask/unmask, etc.

> I will look into hooking this code (vPCI) into the existing hvm_*_ioreq
> functionality, so that vPCI claims the full PCI config space for each device it
> manages.

Cool.

  Paul

> 
> Thanks, Roger.
Roger Pau Monné April 25, 2017, 8:27 a.m. UTC | #14
On Mon, Apr 24, 2017 at 12:50:58PM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 24 April 2017 12:03
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > accesses to the PCI config space
> > IMHO I'm not sure Xen needs PCI register based trapping granularity. I would
> > argue that whatever (IOREQ or Xen internal function) that wants to trap
> > access
> > to a specific PCI config device register needs to take care of all the
> > registers for that device.
> > 
> 
> Having distinct handers for distinct groups of makes sense though... e.g. being able to register a BAR handler for each BAR and then maybe an MSI-X capability handler for wherever that appears in the capability chain, etc. If you don't allow such registration at the top level then it ends up getting done at the next level.

Yes, that's what's done here. Handlers for specific registers are added at the
next level (vPCI). See patches 5, 6, 8 or 9 for examples.

> That said, it may make more sense to have a top level of emulation that just handles all register reads and writes to config space and then a second level that has callbacks for BAR enumeration, bus master enable, MSI-X mask/unmask, etc.
> 
> > I will look into hooking this code (vPCI) into the existing hvm_*_ioreq
> > functionality, so that vPCI claims the full PCI config space for each device it
> > manages.
> 
> Cool.

I've been looking into this, and I have to say this whole emulation handling is
a mess. The fact that Xen differentiates between internal and external (IOREQ)
handlers so early in the code (hvmemul_do_io) makes it far from trivial to
unify internal and external handlers, the more that external handlers have
grown a complex set of infrastructure that internal handlers don't have at
all.

Ideally I think the IOREQ filtering code should be generalized to apply to both
internal and external handlers, and the difference between external and
internal handlers should just be the set of functions that they use. External
ones would always use generic IOREQ functions for pushing requests to the
external emulators, while internal ones would just implement their own
functions.

That said, I think this is a non-trivial amount of work, that will further
delay this series. I don't see an easy way to integrate this code with the
current IOREQ code at all. I'm willing to do this, but I would rather have this
series merged first, so that other people can start working on PVH Dom0.

ATM, the only think I can see that could be easily shared between the IOREQ
code and vPCI is the PCI address decoding code.

Thanks, Roger.
Paul Durrant April 25, 2017, 8:35 a.m. UTC | #15
> -----Original Message-----
> From: Roger Pau Monne
> Sent: 25 April 2017 09:27
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> accesses to the PCI config space
> 
> On Mon, Apr 24, 2017 at 12:50:58PM +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne
> > > Sent: 24 April 2017 12:03
> > > To: Paul Durrant <Paul.Durrant@citrix.com>
> > > Cc: xen-devel@lists.xenproject.org; konrad.wilk@oracle.com;
> > > boris.ostrovsky@oracle.com; Ian Jackson <Ian.Jackson@citrix.com>; Wei
> Liu
> > > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew
> Cooper
> > > <Andrew.Cooper3@citrix.com>
> > > Subject: Re: [PATCH v2 1/9] xen/vpci: introduce basic handlers to trap
> > > accesses to the PCI config space
> > > IMHO I'm not sure Xen needs PCI register based trapping granularity. I
> would
> > > argue that whatever (IOREQ or Xen internal function) that wants to trap
> > > access
> > > to a specific PCI config device register needs to take care of all the
> > > registers for that device.
> > >
> >
> > Having distinct handers for distinct groups of makes sense though... e.g.
> being able to register a BAR handler for each BAR and then maybe an MSI-X
> capability handler for wherever that appears in the capability chain, etc. If
> you don't allow such registration at the top level then it ends up getting done
> at the next level.
> 
> Yes, that's what's done here. Handlers for specific registers are added at the
> next level (vPCI). See patches 5, 6, 8 or 9 for examples.
> 
> > That said, it may make more sense to have a top level of emulation that
> just handles all register reads and writes to config space and then a second
> level that has callbacks for BAR enumeration, bus master enable, MSI-X
> mask/unmask, etc.
> >
> > > I will look into hooking this code (vPCI) into the existing hvm_*_ioreq
> > > functionality, so that vPCI claims the full PCI config space for each device
> it
> > > manages.
> >
> > Cool.
> 
> I've been looking into this, and I have to say this whole emulation handling is
> a mess.

Too right. It's pretty horrible.

>The fact that Xen differentiates between internal and external
> (IOREQ)
> handlers so early in the code (hvmemul_do_io) makes it far from trivial to
> unify internal and external handlers, the more that external handlers have
> grown a complex set of infrastructure that internal handlers don't have at
> all.
> 

Indeed. Arguably that's because the external emulation is asynchronous and therefore requires more infrastructure but I think a lot of the abstraction is the wrong way round.

> Ideally I think the IOREQ filtering code should be generalized to apply to both
> internal and external handlers, and the difference between external and
> internal handlers should just be the set of functions that they use.

Exactly.

> External
> ones would always use generic IOREQ functions for pushing requests to the
> external emulators, while internal ones would just implement their own
> functions.
> 

Yep. We are definitely on the same wavelength :-)

> That said, I think this is a non-trivial amount of work, that will further
> delay this series. I don't see an easy way to integrate this code with the
> current IOREQ code at all. I'm willing to do this, but I would rather have this
> series merged first, so that other people can start working on PVH Dom0.
> 

Fair enough. If you've looked and come to that conclusion then I trust your judgement.

> ATM, the only think I can see that could be easily shared between the IOREQ
> code and vPCI is the PCI address decoding code.
> 

Yes, maybe some utility functions/macros can be generalized. It's not much, but it's a start. Once 4.9 is out of the door I think there should be an I/O emulation cleanup/rationalization item for 4.10 which of course I'm happy to help with.

Cheers,

  Paul

> Thanks, Roger.
diff mbox

Patch

diff --git a/.gitignore b/.gitignore
index 74747cb7e7..ebafba25b5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -236,6 +236,10 @@  tools/tests/regression/build/*
 tools/tests/regression/downloads/*
 tools/tests/mem-sharing/memshrtool
 tools/tests/mce-test/tools/xen-mceinj
+tools/tests/vpci/rbtree.[hc]
+tools/tests/vpci/vpci.[hc]
+tools/tests/vpci/test_vpci.out
+tools/tests/vpci/test_vpci
 tools/xcutils/lsevtchn
 tools/xcutils/readnotes
 tools/xenbackendd/_paths.h
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 455f6f0bed..dd7fc78a99 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -11,7 +11,7 @@  int libxl__arch_domain_prepare_config(libxl__gc *gc,
     if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) {
         if (d_config->b_info.device_model_version !=
             LIBXL_DEVICE_MODEL_VERSION_NONE) {
-            xc_config->emulation_flags = XEN_X86_EMU_ALL;
+            xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI);
         } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) {
             /*
              * HVM guests without device model may want
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 639776130b..5cfe781e62 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -13,6 +13,7 @@  endif
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 SUBDIRS-y += xen-access
 SUBDIRS-y += xenstore
+SUBDIRS-$(CONFIG_HAS_PCI) += vpci
 
 .PHONY: all clean install distclean
 all clean distclean: %: subdirs-%
diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile
new file mode 100644
index 0000000000..7969fcbd82
--- /dev/null
+++ b/tools/tests/vpci/Makefile
@@ -0,0 +1,45 @@ 
+
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test_vpci
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+	./$(TARGET) > $(TARGET).out
+
+$(TARGET): vpci.c vpci.h rbtree.c rbtree.h
+	$(HOSTCC) -g -o $@ vpci.c main.c rbtree.c
+
+.PHONY: clean
+clean:
+	rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c rbtree.c rbtree.h
+
+.PHONY: distclean
+distclean: clean
+
+.PHONY: install
+install:
+
+vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
+	sed -e '/#include/d' <$< >$@
+
+vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c
+	# Trick the compiler so it doesn't complain about missing symbols
+	sed -e '/#include/d' \
+	    -e '1s;^;#include "emul.h"\
+	             const vpci_register_init_t __start_vpci_array[1]\;\
+	             const vpci_register_init_t __end_vpci_array[1]\;\
+	             ;' <$< >$@
+
+rbtree.h: $(XEN_ROOT)/xen/include/xen/rbtree.h
+	sed -e '/#include/d' <$< >$@
+
+rbtree.c: $(XEN_ROOT)/xen/common/rbtree.c
+	sed -e "/#include/d" \
+	    -e '1s;^;#include "emul.h"\
+	             ;' <$< >$@
+
diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
new file mode 100644
index 0000000000..85897ed43b
--- /dev/null
+++ b/tools/tests/vpci/emul.h
@@ -0,0 +1,107 @@ 
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _TEST_VPCI_
+#define _TEST_VPCI_
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <assert.h>
+
+#define container_of(ptr, type, member) ({                      \
+        typeof( ((type *)0)->member ) *__mptr = (ptr);          \
+        (type *)( (char *)__mptr - offsetof(type,member) );})
+
+#include "rbtree.h"
+
+struct pci_dev {
+    struct domain *domain;
+    struct vpci *vpci;
+};
+
+struct domain {
+    struct pci_dev pdev;
+};
+
+struct vcpu
+{
+    struct domain *domain;
+};
+
+extern struct vcpu v;
+
+#define spin_lock(x)
+#define spin_unlock(x)
+#define spin_is_locked(x) true
+
+#define current (&v)
+
+#define has_vpci(d) true
+
+#include "vpci.h"
+
+#define xzalloc(type) (type *)calloc(1, sizeof(type))
+#define xfree(p) free(p)
+
+#define EXPORT_SYMBOL(x)
+
+#define pci_get_pdev_by_domain(d, ...) &(d)->pdev
+
+#define atomic_read(x) 1
+
+/* Dummy native helpers. Writes are ignored, reads return 1's. */
+#define pci_conf_read8(...) (0xff)
+#define pci_conf_read16(...) (0xffff)
+#define pci_conf_read32(...) (0xffffffff)
+#define pci_conf_write8(...)
+#define pci_conf_write16(...)
+#define pci_conf_write32(...)
+
+#define BUG() assert(0)
+#define ASSERT_UNREACHABLE() assert(0)
+#define ASSERT(x) assert(x)
+
+#ifdef _LP64
+#define BITS_PER_LONG 64
+#else
+#define BITS_PER_LONG 32
+#endif
+#define GENMASK(h, l) \
+    (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+
+#define min(x,y) ({ \
+        const typeof(x) _x = (x);       \
+        const typeof(y) _y = (y);       \
+        (void) (&_x == &_y);            \
+        _x < _y ? _x : _y; })
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
new file mode 100644
index 0000000000..0fc63de038
--- /dev/null
+++ b/tools/tests/vpci/main.c
@@ -0,0 +1,206 @@ 
+/*
+ * Unit tests for the generic vPCI handler code.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "emul.h"
+
+/* Single vcpu (current), and single domain with a single PCI device. */
+static struct vpci vpci = {
+    .handlers = RB_ROOT,
+};
+
+static struct domain d = {
+    .pdev.domain = &d,
+    .pdev.vpci = &vpci,
+};
+
+struct vcpu v = { .domain = &d };
+
+/* Dummy hooks, write stores data, read fetches it. */
+static int vpci_read8(struct pci_dev *pdev, unsigned int reg,
+                      union vpci_val *val, void *data)
+{
+    uint8_t *priv = data;
+
+    val->half_word = *priv;
+    return 0;
+}
+
+static int vpci_write8(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val val, void *data)
+{
+    uint8_t *priv = data;
+
+    *priv = val.half_word;
+    return 0;
+}
+
+static int vpci_read16(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint16_t *priv = data;
+
+    val->word = *priv;
+    return 0;
+}
+
+static int vpci_write16(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint16_t *priv = data;
+
+    *priv = val.word;
+    return 0;
+}
+
+static int vpci_read32(struct pci_dev *pdev, unsigned int reg,
+                       union vpci_val *val, void *data)
+{
+    uint32_t *priv = data;
+
+    val->double_word = *priv;
+    return 0;
+}
+
+static int vpci_write32(struct pci_dev *pdev, unsigned int reg,
+                        union vpci_val val, void *data)
+{
+    uint32_t *priv = data;
+
+    *priv = val.double_word;
+    return 0;
+}
+
+#define VPCI_READ(reg, size, data) \
+    assert(!xen_vpci_read(0, 0, 0, reg, size, data))
+
+#define VPCI_READ_CHECK(reg, size, expected) ({ \
+    uint32_t val;                               \
+    VPCI_READ(reg, size, &val);                 \
+    assert(val == expected);                    \
+    })
+
+#define VPCI_WRITE(reg, size, data) \
+    assert(!xen_vpci_write(0, 0, 0, reg, size, data))
+
+#define VPCI_CHECK_REG(reg, size, data) ({      \
+    VPCI_WRITE(reg, size, data);                \
+    VPCI_READ_CHECK(reg, size, data);           \
+    })
+
+#define VPCI_ADD_REG(fread, fwrite, off, size, store)                         \
+    assert(!xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, &store)) \
+
+#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size)                      \
+    assert(xen_vpci_add_register(&d.pdev, fread, fwrite, off, size, NULL))  \
+
+int
+main(int argc, char **argv)
+{
+    /* Index storage by offset. */
+    uint32_t r0 = 0xdeadbeef;
+    uint8_t r5 = 0xef;
+    uint8_t r6 = 0xbe;
+    uint8_t r7 = 0xef;
+    uint16_t r12 = 0x8696;
+    int rc;
+
+    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
+    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
+    VPCI_CHECK_REG(0, 4, 0xbcbcbcbc);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
+    VPCI_READ_CHECK(5, 1, 0xef);
+    VPCI_CHECK_REG(5, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
+    VPCI_READ_CHECK(6, 1, 0xbe);
+    VPCI_CHECK_REG(6, 1, 0xba);
+
+    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
+    VPCI_READ_CHECK(7, 1, 0xef);
+    VPCI_CHECK_REG(7, 1, 0xbd);
+
+    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
+    VPCI_READ_CHECK(12, 2, 0x8696);
+    VPCI_READ_CHECK(12, 4, 0xffff8696);
+
+    /*
+     * At this point we have the following layout:
+     *
+     * 32    24    16     8     0
+     *  +-----+-----+-----+-----+
+     *  |          r0           | 0
+     *  +-----+-----+-----+-----+
+     *  | r7  |  r6 |  r5 |/////| 32
+     *  +-----+-----+-----+-----|
+     *  |///////////////////////| 64
+     *  +-----------+-----------+
+     *  |///////////|    r12    | 96
+     *  +-----------+-----------+
+     *             ...
+     *  / = empty.
+     */
+
+    /* Try to add an overlapping register handler. */
+    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
+
+    /* Try to add a non-aligned register. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
+
+    /* Try to add a register with wrong size. */
+    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
+
+    /* Try to add a register with missing handlers. */
+    VPCI_ADD_INVALID_REG(vpci_read16, NULL, 8, 2);
+    VPCI_ADD_INVALID_REG(NULL, vpci_write16, 8, 2);
+
+    /* Read/write of unset register. */
+    VPCI_READ_CHECK(8, 4, 0xffffffff);
+    VPCI_READ_CHECK(8, 2, 0xffff);
+    VPCI_READ_CHECK(8, 1, 0xff);
+    VPCI_WRITE(10, 2, 0xbeef);
+    VPCI_READ_CHECK(10, 2, 0xffff);
+
+    /* Read of multiple registers */
+    VPCI_CHECK_REG(7, 1, 0xbd);
+    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
+
+    /* Partial read of a register. */
+    VPCI_CHECK_REG(0, 4, 0x1a1b1c1d);
+    VPCI_READ_CHECK(2, 1, 0x1b);
+    VPCI_READ_CHECK(6, 2, 0xbdba);
+
+    /* Write of multiple registers. */
+    VPCI_CHECK_REG(4, 4, 0xaabbccff);
+
+    /* Partial write of a register. */
+    VPCI_CHECK_REG(2, 1, 0xfe);
+    VPCI_CHECK_REG(6, 2, 0xfebc);
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index 44bd3bf0ce..41bf9dfaf3 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -79,6 +79,9 @@  SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel)
        *(.data.rel.*)
        CONSTRUCTORS
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 90e2b1f82a..f74020facc 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -500,11 +500,21 @@  static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
     if ( is_hvm_domain(d) )
     {
         if ( is_hardware_domain(d) &&
-             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
-            return false;
-        if ( !is_hardware_domain(d) && emflags &&
-             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                         XEN_X86_EMU_VPCI) )
             return false;
+        if ( !is_hardware_domain(d) )
+        {
+            switch ( emflags )
+            {
+            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:
+            case XEN_X86_EMU_LAPIC:
+            case 0:
+                break;
+            default:
+                return false;
+            }
+        }
     }
     else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
     {
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a441955322..7f3322ede6 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -37,6 +37,7 @@ 
 #include <xen/vm_event.h>
 #include <xen/monitor.h>
 #include <xen/warning.h>
+#include <xen/vpci.h>
 #include <asm/shadow.h>
 #include <asm/hap.h>
 #include <asm/current.h>
@@ -655,6 +656,7 @@  int hvm_domain_initialise(struct domain *d)
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
 
     register_g2m_portio_handler(d);
+    register_vpci_portio_handler(d);
 
     hvm_ioreq_init(d);
 
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 214ab307c4..15048da556 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -25,6 +25,7 @@ 
 #include <xen/trace.h>
 #include <xen/event.h>
 #include <xen/hypercall.h>
+#include <xen/vpci.h>
 #include <asm/current.h>
 #include <asm/cpufeature.h>
 #include <asm/processor.h>
@@ -256,6 +257,140 @@  void register_g2m_portio_handler(struct domain *d)
     handler->ops = &g2m_portio_ops;
 }
 
+/* Do some sanity checks. */
+static int vpci_access_check(unsigned int reg, unsigned int len)
+{
+    /* Check access size. */
+    if ( len != 1 && len != 2 && len != 4 )
+    {
+        gdprintk(XENLOG_WARNING, "invalid length (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    /* Check if access crosses a double-word boundary. */
+    if ( (reg & 3) + len > 4 )
+    {
+        gdprintk(XENLOG_WARNING,
+                 "invalid access across double-word boundary (reg: %#x, len: %u)\n",
+                 reg, len);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+/* Helper to decode a PCI address. */
+static void vpci_decode_addr(unsigned int cf8, unsigned int addr,
+                             unsigned int *bus, unsigned int *devfn,
+                             unsigned int *reg)
+{
+    unsigned long bdf;
+
+    bdf = CF8_BDF(cf8);
+    *bus = PCI_BUS(bdf);
+    *devfn = PCI_DEVFN(PCI_SLOT(bdf), PCI_FUNC(bdf));
+    /*
+     * NB: the lower 2 bits of the register address are fetched from the
+     * offset into the 0xcfc register when reading/writing to it.
+     */
+    *reg = (cf8 & 0xfc) | (addr & 3);
+}
+
+/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
+static bool_t vpci_portio_accept(const struct hvm_io_handler *handler,
+                                 const ioreq_t *p)
+{
+    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
+}
+
+static int vpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr, uint32_t size, uint64_t *data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    uint32_t data32;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = d->arch.hvm_domain.pci_cf8;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    vpci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_read(0, bus, devfn, reg, size, &data32);
+    if ( !rc )
+        *data = data32;
+    vpci_unlock(d);
+
+     return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static int vpci_portio_write(const struct hvm_io_handler *handler,
+                             uint64_t addr, uint32_t size, uint64_t data)
+{
+    struct domain *d = current->domain;
+    unsigned int bus, devfn, reg;
+    int rc;
+
+    vpci_lock(d);
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        d->arch.hvm_domain.pci_cf8 = data;
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Decode the PCI address. */
+    vpci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &devfn, &reg);
+
+    if ( vpci_access_check(reg, size) || reg >= 0xff )
+    {
+        vpci_unlock(d);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    rc = xen_vpci_write(0, bus, devfn, reg, size, data);
+    vpci_unlock(d);
+
+    return rc ? X86EMUL_UNHANDLEABLE : X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops vpci_portio_ops = {
+    .accept = vpci_portio_accept,
+    .read = vpci_portio_read,
+    .write = vpci_portio_write,
+};
+
+void register_vpci_portio_handler(struct domain *d)
+{
+    struct hvm_io_handler *handler;
+
+    if ( !has_vpci(d) )
+        return;
+
+    handler = hvm_next_io_handler(d);
+    if ( !handler )
+        return;
+
+    spin_lock_init(&d->arch.hvm_domain.vpci_lock);
+    handler->type = IOREQ_TYPE_PIO;
+    handler->ops = &vpci_portio_ops;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f7b927858c..4cf919f206 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1566,7 +1566,8 @@  void __init noreturn __start_xen(unsigned long mbi_p)
         domcr_flags |= DOMCRF_hvm |
                        ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
                          DOMCRF_hap : 0);
-        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
+                                 XEN_X86_EMU_VPCI;
     }
 
     /* Create initial domain 0. */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 8289a1bf09..f5cc8e2b8d 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -224,6 +224,9 @@  SECTIONS
        __start_schedulers_array = .;
        *(.data.schedulers)
        __end_schedulers_array = .;
+       __start_vpci_array = .;
+       *(.data.vpci)
+       __end_vpci_array = .;
        *(.data.rel.ro)
        *(.data.rel.ro.*)
   } :text
diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
index 19391802a8..d51c766453 100644
--- a/xen/drivers/Makefile
+++ b/xen/drivers/Makefile
@@ -1,6 +1,6 @@ 
 subdir-y += char
 subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
-subdir-$(CONFIG_HAS_PCI) += pci
+subdir-$(CONFIG_HAS_PCI) += pci vpci
 subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
 subdir-$(CONFIG_ACPI) += acpi
 subdir-$(CONFIG_VIDEO) += video
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c8e2d2d9a9..2288cf8814 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -30,6 +30,7 @@ 
 #include <xen/radix-tree.h>
 #include <xen/softirq.h>
 #include <xen/tasklet.h>
+#include <xen/vpci.h>
 #include <xsm/xsm.h>
 #include <asm/msi.h>
 #include "ats.h"
@@ -1041,6 +1042,8 @@  static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
         devfn += pdev->phantom_stride;
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+
+    xen_vpci_add_handlers(pdev);
 }
 
 static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
new file mode 100644
index 0000000000..840a906470
--- /dev/null
+++ b/xen/drivers/vpci/Makefile
@@ -0,0 +1 @@ 
+obj-y += vpci.o
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
new file mode 100644
index 0000000000..f4cd04f11d
--- /dev/null
+++ b/xen/drivers/vpci/vpci.c
@@ -0,0 +1,474 @@ 
+/*
+ * Generic functionality for handling accesses to the PCI configuration space
+ * from guests.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+
+extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
+#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
+#define vpci_init __start_vpci_array
+
+/* Helpers for locking/unlocking. */
+#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
+
+/* Internal struct to store the emulated PCI registers. */
+struct vpci_register {
+    vpci_read_t read;
+    vpci_write_t write;
+    unsigned int size;
+    unsigned int offset;
+    void *priv_data;
+    struct rb_node node;
+};
+
+int xen_vpci_add_handlers(struct pci_dev *pdev)
+{
+    int i, rc = 0;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
+
+    pdev->vpci = xzalloc(struct vpci);
+    if ( !pdev->vpci )
+        return -ENOMEM;
+
+    pdev->vpci->handlers = RB_ROOT;
+
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = vpci_init[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    if ( rc )
+    {
+        struct rb_node *node = rb_first(&pdev->vpci->handlers);
+        struct vpci_register *r;
+
+        /* Iterate over the tree and cleanup. */
+        while ( node != NULL )
+        {
+            r = container_of(node, struct vpci_register, node);
+            node = rb_next(node);
+            rb_erase(&r->node, &pdev->vpci->handlers);
+            xfree(r);
+        }
+        xfree(pdev->vpci);
+    }
+
+    return rc;
+}
+
+static bool vpci_register_overlap(const struct vpci_register *r,
+                                  unsigned int offset)
+{
+    if ( offset >= r->offset && offset < r->offset + r->size )
+        return true;
+
+    return false;
+}
+
+
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    /* Make sure there's no overlap between registers. */
+    if ( vpci_register_overlap(r1, r2->offset) ||
+         vpci_register_overlap(r1, r2->offset + r2->size - 1) ||
+         vpci_register_overlap(r2, r1->offset) ||
+         vpci_register_overlap(r2, r1->offset + r1->size - 1) )
+        return 0;
+
+    if (r1->offset < r2->offset)
+        return -1;
+    else if (r1->offset > r2->offset)
+        return 1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+static struct vpci_register *vpci_find_register(const struct pci_dev *pdev,
+                                                const unsigned int reg,
+                                                const unsigned int size)
+{
+    struct rb_node *node;
+    struct vpci_register r = {
+        .offset = reg,
+        .size = size,
+    };
+
+    ASSERT(vpci_locked(pdev->domain));
+
+    node = pdev->vpci->handlers.rb_node;
+    while ( node )
+    {
+        struct vpci_register *t =
+            container_of(node, struct vpci_register, node);
+
+        switch ( vpci_register_cmp(&r, t) )
+        {
+        case -1:
+            node = node->rb_left;
+            break;
+        case 1:
+            node = node->rb_right;
+            break;
+        default:
+            return t;
+        }
+    }
+
+    return NULL;
+}
+
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data)
+{
+    struct rb_node **new, *parent;
+    struct vpci_register *r;
+
+    /* Some sanity checks. */
+    if ( (size != 1 && size != 2 && size != 4) || offset >= 0xFFF ||
+         offset & (size - 1) || read_handler == NULL || write_handler == NULL )
+        return -EINVAL;
+
+    r = xzalloc(struct vpci_register);
+    if ( !r )
+        return -ENOMEM;
+
+    r->read = read_handler;
+    r->write = write_handler;
+    r->size = size;
+    r->offset = offset;
+    r->priv_data = data;
+
+    vpci_lock(pdev->domain);
+    new = &pdev->vpci->handlers.rb_node;
+    parent = NULL;
+
+    while (*new) {
+        struct vpci_register *this =
+            container_of(*new, struct vpci_register, node);
+
+        parent = *new;
+        switch ( vpci_register_cmp(r, this) )
+        {
+        case -1:
+            new = &((*new)->rb_left);
+            break;
+        case 1:
+            new = &((*new)->rb_right);
+            break;
+        default:
+            xfree(r);
+            vpci_unlock(pdev->domain);
+            return -EEXIST;
+        }
+    }
+
+    rb_link_node(&r->node, parent, new);
+    rb_insert_color(&r->node, &pdev->vpci->handlers);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset)
+{
+    struct vpci_register *r;
+
+    vpci_lock(pdev->domain);
+    r = vpci_find_register(pdev, offset, 1 /* size doesn't matter here. */);
+    if ( !r )
+    {
+        vpci_unlock(pdev->domain);
+        return -ENOENT;
+    }
+
+    rb_erase(&r->node, &pdev->vpci->handlers);
+    xfree(r);
+    vpci_unlock(pdev->domain);
+
+    return 0;
+}
+
+/* Wrappers for performing reads/writes to the underlying hardware. */
+static void vpci_read_hw(unsigned int seg, unsigned int bus,
+                         unsigned int devfn, unsigned int reg, uint32_t size,
+                         uint32_t *data)
+{
+    switch ( size )
+    {
+    case 4:
+        *data = pci_conf_read32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte read can have 1byte trapped and
+         * the rest passed-through.
+         */
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg + 1) << 8;
+        *data |= pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    case 2:
+        *data = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                reg);
+        break;
+    case 1:
+        *data = pci_conf_read8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                               reg);
+        break;
+    default:
+        BUG();
+    }
+}
+
+static void vpci_write_hw(unsigned int seg, unsigned int bus,
+                          unsigned int devfn, unsigned int reg, uint32_t size,
+                          uint32_t data)
+{
+    switch ( size )
+    {
+    case 4:
+        pci_conf_write32(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 3:
+        /*
+         * This is possible because a 4byte write can have 1byte trapped and
+         * the rest passed-through.
+         */
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg + 1,
+                         data >> 8);
+        break;
+    case 2:
+        pci_conf_write16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg,
+                         data);
+        break;
+    case 1:
+        pci_conf_write8(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), reg, data);
+        break;
+    default:
+        BUG();
+    }
+}
+
+/* Helper macros for the read/write handlers. */
+#define GENMASK_BYTES(e, s) GENMASK((e) * 8, (s) * 8)
+#define SHIFT_RIGHT_BYTES(d, o) d >>= (o) * 8
+#define ADD_RESULT(r, d, s, o) r |= ((d) & GENMASK_BYTES(s, 0)) << ((o) * 8)
+
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    union vpci_val val = { .double_word = 0 };
+    unsigned int data_rshift = 0, data_lshift = 0, data_size;
+    uint32_t tmp_data;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    *data = 0;
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    if ( r->offset > reg )
+    {
+        /*
+         * There's a heading gap into the emulated register.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, r->offset - reg, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the head read to the partial result. */
+        ADD_RESULT(*data, tmp_data, r->offset - reg, 0);
+        data_lshift = r->offset - reg;
+
+        /* Account for the read. */
+        size -= data_lshift;
+        reg += data_lshift;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register */
+        data_rshift = reg - r->offset;
+
+    ASSERT(data_lshift == 0 || data_rshift == 0);
+    data_size = min(size, r->size - data_rshift);
+    ASSERT(data_size != 0);
+
+    /* Perform the read of the register. */
+    rc = r->read(pdev, r->offset, &val, r->priv_data);
+    if ( rc )
+        return rc;
+
+    val.double_word >>= data_rshift * 8;
+    ADD_RESULT(*data, val.double_word, data_size, data_lshift);
+
+    /* Account for the read */
+    size -= data_size;
+    reg += data_size;
+
+    /* Read the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Read tailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_read(seg, bus, devfn, reg, size, &tmp_data);
+        if ( rc )
+            return rc;
+
+        /* Add the tail read to the partial result. */
+        ADD_RESULT(*data, tmp_data, size, data_size + data_lshift);
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_read_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/* Perform a maybe partial write to a register. */
+static int vpci_write_helper(struct pci_dev *pdev,
+                             const struct vpci_register *r, unsigned int size,
+                             unsigned int offset, uint32_t data)
+{
+    union vpci_val val = { .double_word = data };
+    int rc;
+
+    ASSERT(size <= r->size);
+    if ( size != r->size )
+    {
+        rc = r->read(pdev, r->offset, &val, r->priv_data);
+        if ( rc )
+            return rc;
+        val.double_word &= ~GENMASK_BYTES(size + offset, offset);
+        data &= GENMASK_BYTES(size, 0);
+        val.double_word |= data << (offset * 8);
+    }
+
+    return r->write(pdev, r->offset, val, r->priv_data);
+}
+
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data)
+{
+    struct domain *d = current->domain;
+    struct pci_dev *pdev;
+    const struct vpci_register *r;
+    unsigned int data_size, data_offset = 0;
+    int rc;
+
+    ASSERT(vpci_locked(d));
+
+    /* Find the PCI dev matching the address. */
+    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    if ( !pdev )
+        goto passthrough;
+
+    /* Find the vPCI register handler. */
+    r = vpci_find_register(pdev, reg, size);
+    if ( !r )
+        goto passthrough;
+
+    else if ( r->offset > reg )
+    {
+        /*
+         * There's a heading gap into the emulated register found.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, r->offset - reg, data);
+        if ( rc )
+            return rc;
+
+        /* Advance the data by the written size. */
+        SHIFT_RIGHT_BYTES(data, r->offset - reg);
+        size -= r->offset - reg;
+        reg += r->offset - reg;
+    }
+    else if ( r->offset < reg )
+        /* There's an offset into the emulated register. */
+        data_offset = reg - r->offset;
+
+    data_size = min(size, r->size - data_offset);
+
+    /* Perform the write of the register. */
+    ASSERT(data_size != 0);
+    rc = vpci_write_helper(pdev, r, data_size, data_offset, data);
+    if ( rc )
+        return rc;
+
+    /* Account for the read */
+    size -= data_size;
+    reg += data_size;
+    SHIFT_RIGHT_BYTES(data, data_size);
+
+    /* Write the remaining, if any. */
+    if ( size > 0 )
+    {
+        /*
+         * Write tailing data.
+         * NB: it's possible for this recursive call to have a size of 3.
+         */
+        rc = xen_vpci_write(seg, bus, devfn, reg, size, data);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+
+ passthrough:
+    vpci_write_hw(seg, bus, devfn, reg, size, data);
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 6ab987f231..f0741917ed 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -426,6 +426,7 @@  struct arch_domain
 #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
                             XEN_X86_EMU_USE_PIRQ))
+#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index d2899c9bb2..cbf4170789 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -184,6 +184,9 @@  struct hvm_domain {
     /* List of guest to machine IO ports mapping. */
     struct list_head g2m_ioport_list;
 
+    /* Lock for the PCI emulation layer (vPCI). */
+    spinlock_t vpci_lock;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 2484eb1c75..2dbf92f13e 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -155,6 +155,9 @@  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
  */
 void register_g2m_portio_handler(struct domain *d);
 
+/* HVM port IO handler for PCI accesses. */
+void register_vpci_portio_handler(struct domain *d);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index 8a9ba7982b..c00f8cda93 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -295,12 +295,15 @@  struct xen_arch_domainconfig {
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
 #define _XEN_X86_EMU_USE_PIRQ       9
 #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 59b6e8a81c..a83c4a1276 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -13,6 +13,7 @@ 
 #include <xen/irq.h>
 #include <xen/pci_regs.h>
 #include <xen/pfn.h>
+#include <xen/rbtree.h>
 #include <asm/device.h>
 #include <asm/numa.h>
 #include <asm/pci.h>
@@ -88,6 +89,9 @@  struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    /* Data for vPCI. */
+    struct vpci *vpci;
 };
 
 #define for_each_pdev(domain, pdev) \
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
new file mode 100644
index 0000000000..56e8d1c35e
--- /dev/null
+++ b/xen/include/xen/vpci.h
@@ -0,0 +1,66 @@ 
+#ifndef _VPCI_
+#define _VPCI_
+
+#include <xen/pci.h>
+#include <xen/types.h>
+
+/* Helpers for locking/unlocking. */
+#define vpci_lock(d) spin_lock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_unlock(d) spin_unlock(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
+
+/* Value read or written by the handlers. */
+union vpci_val {
+    uint8_t half_word;
+    uint16_t word;
+    uint32_t double_word;
+};
+
+/*
+ * The vPCI handlers will never be called concurrently for the same domain, ii
+ * is guaranteed that the vpci domain lock will always be locked when calling
+ * any handler.
+ */
+typedef int (*vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val *val, void *data);
+
+typedef int (*vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
+                            union vpci_val val, void *data);
+
+typedef int (*vpci_register_init_t)(struct pci_dev *dev);
+
+#define REGISTER_VPCI_INIT(x) \
+  static const vpci_register_init_t x##_entry __used_section(".data.vpci") = x
+
+/* Add vPCI handlers to device. */
+int xen_vpci_add_handlers(struct pci_dev *dev);
+
+/* Add/remove a register handler. */
+int xen_vpci_add_register(struct pci_dev *pdev, vpci_read_t read_handler,
+                          vpci_write_t write_handler, unsigned int offset,
+                          unsigned int size, void *data);
+int xen_vpci_remove_register(struct pci_dev *pdev, unsigned int offset);
+
+/* Generic read/write handlers for the PCI config space. */
+int xen_vpci_read(unsigned int seg, unsigned int bus, unsigned int devfn,
+                  unsigned int reg, uint32_t size, uint32_t *data);
+int xen_vpci_write(unsigned int seg, unsigned int bus, unsigned int devfn,
+                   unsigned int reg, uint32_t size, uint32_t data);
+
+struct vpci {
+    /* Root pointer for the tree of vPCI handlers. */
+    struct rb_root handlers;
+};
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+