
[v7,RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support

Message ID 1475035789-685-1-git-send-email-ashish.mittal@veritas.com (mailing list archive)
State New, archived

Commit Message

Ashish Mittal Sept. 28, 2016, 4:09 a.m. UTC
This patch adds support for a new block device type called "vxhs".
Source code for the library that this code loads can be downloaded from:
https://github.com/MittalAshish/libqnio.git

Sample command line using JSON syntax:
./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'

Sample command line using URI syntax:
qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D

Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>
---
v7 changelog:
(1) Got rid of the header file and most of the function forward-declarations.
(2) Added wrappers for vxhs_qnio_iio_open() and vxhs_qnio_iio_close().
(3) Fixed a double close attempt of vdisk_rfd.
(4) Changed to pass QEMUIOVector * in a couple of functions instead of
    individual structure members.
(5) Got rid of VXHS_VECTOR_ALIGNED/NOT_ALIGNED.
(6) Got rid of vxhs_calculate_iovec_size().
(7) Changed to use qemu_try_memalign().
(8) Got rid of unnecessary "else" conditions in a couple of places.
(9) Limited the filename case to pass a single URI in vxhs_parse_uri().
    Users will have to use the host/port/vdisk_id syntax to specify
    multiple host information.
(10) Inlined a couple of macros, including the ones for qemu_spin_unlock.
(11) Other miscellaneous changes.

v6 changelog:
(1) Removed cJSON dependency out of the libqnioshim layer.
(2) Merged libqnioshim code into qemu vxhs driver proper.
    Now qemu-vxhs code only links with libqnio.so.
(3) Replaced use of custom spinlocks with qemu_spin_lock.

v5 changelog:
(1) Removed unused functions.
(2) Changed all qemu_ prefixes for functions defined in libqnio and vxhs.c.
(3) Fixed memory leaks in vxhs_qemu_init() and on the close of vxhs device.
(4) Added upper bounds check on num_servers.
(5) Closed channel fds wherever necessary.
(6) Changed vdisk_size to int64_t for 32-bit compilations.
(7) Added message to configure file to indicate if vxhs is enabled or not.

v4 changelog:
(1) Reworked QAPI/JSON parsing.
(2) Reworked URI parsing as suggested by Kevin.
(3) Fixes per review comments from Stefan on v1.
(4) Fixes per review comments from Daniel on v3.

v3 changelog:
(1) Implemented QAPI interface for passing VxHS block device parameters.

v2 changelog:
(1) Removed code to dlopen library. We now check if libqnio is installed during
    configure, and directly link with it.
(2) Changed file headers to mention GPLv2-or-later license.
(3) Removed unnecessary type casts and inlines.
(4) Removed custom tokenize function and modified code to use g_strsplit.
(5) Replaced malloc/free with g_new/g_free and removed code that checks for
    memory allocation failure conditions.
(6) Removed some block ops implementations that were place-holders only.
(7) Removed all custom debug messages. Added new messages in block/trace-events.
(8) Other miscellaneous corrections.

v1 changelog:
(1) First patch submission for review comments.

 block/Makefile.objs |    2 +
 block/trace-events  |   47 ++
 block/vxhs.c        | 1645 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure           |   41 ++
 4 files changed, 1735 insertions(+)
 create mode 100644 block/vxhs.c

Comments

Paolo Bonzini Sept. 28, 2016, 11:12 a.m. UTC | #1
Thanks, this looks much better!  A few more remarks and code smells, but
much smaller than the previous review.

> +static int32_t
> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
> +                    void *ctx, uint32_t flags)

The "BDRVVXHSState *s, int idx" interface would apply here too.

> +{
> +    int   ret = 0;
> +
> +    switch (opcode) {
> +    case VDISK_STAT:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_AIO_FLUSH:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_CHECK_IO_FAILOVER_READY:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
> +                                     in, ctx, flags);
> +        break;
> +
> +    default:
> +        ret = -ENOTSUP;
> +        break;
> +    }
> +
> +    if (ret) {
> +        *in = 0;
> +        trace_vxhs_qnio_iio_ioctl(opcode);
> +    }
> +
> +    return ret;
> +}
> +
> +/*
> + * Try to reopen the vDisk on one of the available hosts
> + * If vDisk reopen is successful on any of the host then
> + * check if that node is ready to accept I/O.
> + */
> +static int vxhs_reopen_vdisk(BDRVVXHSState *s, int index)
> +{
> +    VXHSvDiskHostsInfo hostinfo = s->vdisk_hostinfo[index];

This should almost definitely be

    VXHSvDiskHostsInfo *hostinfo = &s->vdisk_hostinfo[index];

and below:

    res = vxhs_qnio_iio_open(&hostinfo->qnio_cfd, of_vsa_addr,
                             &hostinfo->vdisk_rfd, file_name);

How was the failover code tested?



> +/*
> + * This helper function converts an array of iovectors into a flat buffer.
> + */
> +
> +static void *vxhs_convert_iovector_to_buffer(QEMUIOVector *qiov)

Please rename to vxhs_allocate_buffer_for_iovector.

> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
> +                              int *cfd, int *rfd, Error **errp)

Passing the cfd and rfd as int* is unnecessary, because here:

> +    s->vdisk_cur_host_idx = 0;
> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
> +
> +    /*
> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
> +     */
> +    if (global_qnio_ctx == NULL) {
> +        global_qnio_ctx = vxhs_setup_qnio();
> +        if (global_qnio_ctx == NULL) {
> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
> +            ret = -EINVAL;
> +            goto out;
> +        }
> +    }
> +
> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
> +    if (!ret) {
> +        error_setg(&local_err, "Failed qnio_iio_open");
> +        ret = -EIO;
> +    }
> +out:
> +    g_free(file_name);
> +    g_free(of_vsa_addr);


... you're basically doing the same as

     /* ... create global_qnio_ctx ... */

     s->qnio_ctx = global_qnio_ctx;
     s->vdisk_cur_host_idx = 0;
     ret = vxhs_reopen_vdisk(s, s->vdisk_cur_host_idx);

I suggest that you use vxhs_reopen_vdisk (rename it if you prefer; in
particular it's not specific to a *re*open). vxhs_qnio_iio_open would remain
an internal function used only by vxhs_reopen_vdisk.

This would have also caught the bug in vxhs_reopen_vdisk! :->


> +errout:
> +    /*
> +     * Close remote vDisk device if it was opened earlier
> +     */
> +    if (device_opened) {
> +        for (i = 0; i < s->vdisk_nhosts; i++) {
> +            vxhs_qnio_iio_close(s, i);
> +        }
> +    }

No need for device_opened, since you have s->vdisk_nhosts = 0 on entry
and qnio_cfd/vdisk_rfd initialized to -1 at the point where
s->vdisk_nhosts becomes nonzero.

On the other hand, you could also consider inverting the initialization
order.  If you open the pipe first, cleaning it up is much easier than
cleaning up the backends.

> +static void vxhs_close(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int i;
> +
> +    trace_vxhs_close(s->vdisk_guid);
> +    close(s->fds[VDISK_FD_READ]);
> +    close(s->fds[VDISK_FD_WRITE]);
> +
> +    /*
> +     * Clearing all the event handlers for oflame registered to QEMU
> +     */
> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
> +                       false, NULL, NULL, NULL);
> +    g_free(s->vdisk_guid);
> +    s->vdisk_guid = NULL;
> +
> +    for (i = 0; i < VXHS_MAX_HOSTS; i++) {

Only loop up to s->vdisk_nhosts.

> +        vxhs_qnio_iio_close(s, i);
> +        /*
> +         * Free the dynamically allocated hostip string
> +         */
> +        g_free(s->vdisk_hostinfo[i].hostip);
> +        s->vdisk_hostinfo[i].hostip = NULL;
> +        s->vdisk_hostinfo[i].port = 0;
> +    }
> +}
> +
> +/*
> + * Returns the size of vDisk in bytes. This is required
> + * by QEMU block upper block layer so that it is visible
> + * to guest.
> + */
> +static int64_t vxhs_getlength(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * Fetch the vDisk size using stat ioctl
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {
> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;

Simpler:

    BDRVVXHSState *s = bs->opaque;

    if (s->vdisk_size == 0) {
        int64_t vdisk_size = vxhs_get_vdisk_stat(s);
        if (vdisk_size == 0) {
            return -EIO;
        }
        s->vdisk_size = vdisk_size;
    }

    return s->vdisk_size;

> +}
> +
> +/*
> + * Returns actual blocks allocated for the vDisk.
> + * This is required by qemu-img utility.
> + */
> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;

Just return vxhs_getlength(bs).

Paolo
Stefan Hajnoczi Sept. 28, 2016, 11:13 a.m. UTC | #2
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c"

Why do several trace events have a %c format specifier at the end and it
always takes a '.' value?

> +#define QNIO_CONNECT_TIMOUT_SECS    120

This isn't used and there is a typo (s/TIMOUT/TIMEOUT/).  Can it be
dropped?

> +static int32_t
> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
> +                    void *ctx, uint32_t flags)
> +{
> +    int   ret = 0;
> +
> +    switch (opcode) {
> +    case VDISK_STAT:

It seems unnecessary to abstract the iio_ioctl() constants and then have
a switch statement to translate to the actual library constants.  It
makes little sense since the flags argument already uses the library
constants.  Just use the library's constants.
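
As a sketch, the stat case would then collapse to a direct call (same
fields the driver already uses, no translation layer):

    ret = iio_ioctl(s->qnio_ctx,
                    s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
                    IOR_VDISK_STAT, &vdisk_size, NULL, 0);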

> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_AIO_FLUSH:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_CHECK_IO_FAILOVER_READY:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
> +                                     in, ctx, flags);
> +        break;
> +
> +    default:
> +        ret = -ENOTSUP;
> +        break;
> +    }
> +
> +    if (ret) {
> +        *in = 0;

Some callers pass in = NULL so this will crash.

The naming seems wrong: this is an output argument, not an input
argument.  Please call it "out_val" or similar.

> +    res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx);
> +    if (res == 0) {
> +        res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
> +                  VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags);

Looking at iio_ioctl(), I'm not sure how this can ever work.  The fourth
argument is NULL and iio_ioctl() will attempt *vdisk_size = 0 so this
will crash.

Do you have tests that exercise this code path?

> +/*
> + * This is called by QEMU when a flush gets triggered from within
> + * a guest at the block layer, either for IDE or SCSI disks.
> + */
> +static int vxhs_co_flush(BlockDriverState *bs)

This is called from coroutine context, please add the coroutine_fn
function attribute to document this.

> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t size = 0;
> +    int ret = 0;
> +
> +    /*
> +     * VDISK_AIO_FLUSH ioctl is a no-op at present and will
> +     * always return success. This could change in the future.
> +     */
> +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> +            VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);

This function is not allowed to block.  It cannot do a synchronous
flush.  This line is misleading because the constant is called
VDISK_AIO_FLUSH, but looking at the library code I see it's actually a
synchronous call that ends up in a loop that sleeps (!) waiting for the
response.

Please do an async flush and qemu_coroutine_yield() to return
control to QEMU's event loop.  When the flush completes you can
qemu_coroutine_enter() again to return from this function.
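
A minimal sketch of that pattern; vxhs_qnio_iio_flush_async() and the
VXHSFlushCB struct are hypothetical names for illustration only, and the
completion hook must be invoked from the BDS AioContext (e.g. via the
existing pipe/event-reader mechanism or a BH):

    typedef struct VXHSFlushCB {
        Coroutine *co;      /* coroutine blocked in vxhs_co_flush() */
        int ret;            /* completion status filled in by the callback */
    } VXHSFlushCB;

    static void vxhs_flush_complete(void *opaque, int ret)
    {
        VXHSFlushCB *cb = opaque;

        cb->ret = ret;
        qemu_coroutine_enter(cb->co);   /* resume vxhs_co_flush() */
    }

    static int coroutine_fn vxhs_co_flush(BlockDriverState *bs)
    {
        BDRVVXHSState *s = bs->opaque;
        VXHSFlushCB cb = { .co = qemu_coroutine_self() };

        /* Hypothetical asynchronous flush entry point into libqnio. */
        if (vxhs_qnio_iio_flush_async(s, vxhs_flush_complete, &cb) < 0) {
            return -EIO;
        }
        qemu_coroutine_yield();         /* hand control back to the event loop */
        return cb.ret;
    }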

> +
> +    if (ret < 0) {
> +        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
> +        vxhs_close(bs);

This looks unsafe.  Won't it cause double close() calls for s->fds[]
when bdrv_close() is called later?

> +    }
> +
> +    return ret;
> +}
> +
> +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s)

sizeof(unsigned long) = 4 on some machines, please change it to int64_t.

I also suggest changing the function name to vxhs_get_vdisk_size() since
it only provides the size.

> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * TODO:
> +         * Once HyperScale storage-virtualizer provides
> +         * actual physical allocation of blocks then
> +         * fetch that information and return back to the
> +         * caller but for now just get the full size.
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {
> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;
> +}

Why are you implementing this function if vxhs doesn't support querying
the allocated file size?  Don't return a bogus number.  Just don't
implement it like other block drivers.
Stefan Hajnoczi Sept. 28, 2016, 11:36 a.m. UTC | #3
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using JSON syntax:
> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'
> 
> Sample command line using URI syntax:
> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D
> 
> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>

Have you tried running the qemu-iotests test suite?
http://qemu-project.org/Documentation/QemuIoTests

The test suite needs to be working in order for this block driver to be
merged.

Please also consider how to test failover.  You can use blkdebug (see
docs/blkdebug.txt) to inject I/O errors but perhaps you need something
more low-level in the library.
Daniel P. Berrangé Sept. 28, 2016, 12:06 p.m. UTC | #4
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using JSON syntax:
> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'

Please line wrap the text here

> 
> Sample command line using URI syntax:
> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D

and here.

> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>


>  block/Makefile.objs |    2 +
>  block/trace-events  |   47 ++
>  block/vxhs.c        | 1645 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  configure           |   41 ++
>  4 files changed, 1735 insertions(+)
>  create mode 100644 block/vxhs.c

This has lost the QAPI schema definition in qapi/block-core.json
that was in earlier versions.

We should not be adding new block drivers without having a QAPI
schema defined for them.

I would like to see this use the exact same syntax for specifying
the server as is used for Gluster, as it will simplify life for
libvirt to only have one format to generate.

This would simply mean renaming the current 'GlusterServer' QAPI
struct to something more generic, perhaps "BlockServer",
so that it can be shared between both.

It also means that the JSON example above must include the
'type' discriminator.

Regards,
Daniel
Stefan Hajnoczi Sept. 28, 2016, 9:45 p.m. UTC | #5
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:

Review of .bdrv_open() and .bdrv_aio_writev() code paths.

The big issues I see in this driver and libqnio:

1. Showstoppers like broken .bdrv_open() and leaking memory on every
   reply message.
2. Insecure due to missing input validation (network packets and
   configuration) and incorrect string handling.
3. Not fully asynchronous so QEMU and the guest may hang.

Please think about the whole codebase and not just the lines I've
pointed out in this review when fixing these sorts of issues.  There may
be similar instances of these bugs elsewhere and it's important that
they are fixed so that this can be merged.

> +/*
> + * Structure per vDisk maintained for state
> + */
> +typedef struct BDRVVXHSState {
> +    int                     fds[2];
> +    int64_t                 vdisk_size;
> +    int64_t                 vdisk_blocks;
> +    int64_t                 vdisk_flags;
> +    int                     vdisk_aio_count;
> +    int                     event_reader_pos;
> +    VXHSAIOCB               *qnio_event_acb;
> +    void                    *qnio_ctx;
> +    QemuSpin                vdisk_lock; /* Lock to protect BDRVVXHSState */
> +    QemuSpin                vdisk_acb_lock;  /* Protects ACB */

These comments are insufficient for documenting locking.  Not all fields
are actually protected by these locks.  Please order fields according to
lock coverage:

typedef struct VXHSAIOCB {
    ...

    /* Protected by BDRVVXHSState->vdisk_acb_lock */
    int                 segments;
    ...
};

typedef struct BDRVVXHSState {
    ...

    /* Protected by vdisk_lock */
    QemuSpin                vdisk_lock;
    int                     vdisk_aio_count;
    QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
    ...
}

> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
> +{
> +    /*
> +     * Close vDisk device
> +     */
> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);

libqnio comment:
Why does iio_devclose() take an unused cfd argument?  Perhaps it can be
dropped.

> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
> +    }
> +
> +    /*
> +     * Close QNIO channel against cached channel-fd
> +     */
> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);

libqnio comment:
Why does iio_devclose() take an int32_t cfd argument but iio_close()
takes a uint32_t cfd argument?

> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
> +    }
> +}
> +
> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
> +                              int *rfd, const char *file_name)
> +{
> +    /*
> +     * Open qnio channel to storage agent if not opened before.
> +     */
> +    if (*cfd < 0) {
> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);

libqnio comments:

1.
There is a buffer overflow in qnio_create_channel().  strncpy() is used
incorrectly so long hostname or port (both can be 99 characters long)
will overflow channel->name[] (64 characters) or channel->port[] (8
characters).

    strncpy(channel->name, hostname, strlen(hostname) + 1);
    strncpy(channel->port, port, strlen(port) + 1);

The third argument must be the size of the *destination* buffer, not the
source buffer.  Also note that strncpy() doesn't NUL-terminate the
destination string so you must do that manually to ensure there is a NUL
byte at the end of the buffer.
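
A corrected sketch, using the destination sizes noted above (name[64],
port[8]) and forcing NUL termination:

    strncpy(channel->name, hostname, sizeof(channel->name) - 1);
    channel->name[sizeof(channel->name) - 1] = '\0';
    strncpy(channel->port, port, sizeof(channel->port) - 1);
    channel->port[sizeof(channel->port) - 1] = '\0';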

2.
channel is leaked in the "Failed to open single connection" error case
in qnio_create_channel().

3.
If the host is longer than 63 characters then the ioapi_ctx->channels and
qnio_ctx->channels maps will use different keys due to string truncation
in qnio_create_channel().  This means "Channel already exists" in
qnio_create_channel() and possibly other things will not work as
expected.

> +        if (*cfd < 0) {
> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
> +            return -ENODEV;
> +        }
> +    }
> +
> +    /*
> +     * Open vdisk device
> +     */
> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);

libqnio comment:
Buffer overflow in iio_devopen() since chandev[128] is not large enough
to hold channel[100] + " " + devpath[arbitrary length] chars:

    sprintf(chandev, "%s %s", channel, devpath);
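
A bounded sketch (the error value is only an example; use whatever error
path iio_devopen() already has):

    int n = snprintf(chandev, sizeof(chandev), "%s %s", channel, devpath);
    if (n < 0 || n >= (int)sizeof(chandev)) {
        return -1;      /* reject instead of overflowing or truncating silently */
    }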

> +
> +    if (*rfd < 0) {
> +        if (*cfd >= 0) {

This check is always true.  Otherwise the return -ENODEV would have been
taken above.  The if statement isn't necessary.

> +static void vxhs_check_failover_status(int res, void *ctx)
> +{
> +    BDRVVXHSState *s = ctx;
> +
> +    if (res == 0) {
> +        /* found failover target */
> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
> +        s->vdisk_ask_failover_idx = 0;
> +        trace_vxhs_check_failover_status(
> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
> +                   s->vdisk_guid);
> +        qemu_spin_lock(&s->vdisk_lock);
> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
> +        qemu_spin_unlock(&s->vdisk_lock);
> +        vxhs_handle_queued_ios(s);
> +    } else {
> +        /* keep looking */
> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
> +        s->vdisk_ask_failover_idx++;
> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
> +            /* pause and cycle through list again */
> +            sleep(QNIO_CONNECT_RETRY_SECS);

This code is called from a QEMU thread via vxhs_aio_rw().  It is not
permitted to call sleep() since it will freeze QEMU and probably the
guest.

If you need a timer you can use QEMU's timer APIs.  See aio_timer_new(),
timer_new_ns(), timer_mod(), timer_del(), timer_free().
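
A sketch of the retry using a timer instead of sleep(); the s->retry_timer
field and the helper are assumptions added purely for illustration:

    static void vxhs_failover_retry_cb(void *opaque)
    {
        BDRVVXHSState *s = opaque;

        s->vdisk_ask_failover_idx = 0;
        vxhs_switch_storage_agent(s);   /* cycle through the host list again */
    }

    static void vxhs_schedule_failover_retry(BlockDriverState *bs)
    {
        BDRVVXHSState *s = bs->opaque;

        if (!s->retry_timer) {
            s->retry_timer = aio_timer_new(bdrv_get_aio_context(bs),
                                           QEMU_CLOCK_REALTIME, SCALE_NS,
                                           vxhs_failover_retry_cb, s);
        }
        timer_mod(s->retry_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
                  QNIO_CONNECT_RETRY_SECS * NANOSECONDS_PER_SECOND);
    }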

> +            s->vdisk_ask_failover_idx = 0;
> +        }
> +        res = vxhs_switch_storage_agent(s);
> +    }
> +}
> +
> +static int vxhs_failover_io(BDRVVXHSState *s)
> +{
> +    int res = 0;
> +
> +    trace_vxhs_failover_io(s->vdisk_guid);
> +
> +    s->vdisk_ask_failover_idx = 0;
> +    res = vxhs_switch_storage_agent(s);
> +
> +    return res;
> +}
> +
> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
> +                       uint32_t error, uint32_t opcode)

This function is doing too much.  Especially the failover code should
run in the AioContext since it's complex.  Don't do failover here
because this function is outside the AioContext lock.  Do it from
AioContext using a QEMUBH like block/rbd.c.
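
A sketch of the BH approach; the s->failover_bh field is an assumption, and
the BH would be created once in vxhs_open() so the libqnio callback only has
to schedule it:

    static void vxhs_failover_bh_cb(void *opaque)
    {
        BDRVVXHSState *s = opaque;

        /* Runs in the BlockDriverState's AioContext, as block/rbd.c does. */
        vxhs_failover_io(s);
    }

    /* In vxhs_open():
     *     s->failover_bh = aio_bh_new(bdrv_get_aio_context(bs),
     *                                 vxhs_failover_bh_cb, s);
     * and in vxhs_iio_callback(), instead of starting failover inline:
     *     qemu_bh_schedule(s->failover_bh);
     */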

> +static int32_t
> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
> +                     uint64_t offset, void *ctx, uint32_t flags)
> +{
> +    struct iovec cur;
> +    uint64_t cur_offset = 0;
> +    uint64_t cur_write_len = 0;
> +    int segcount = 0;
> +    int ret = 0;
> +    int i, nsio = 0;
> +    int iovcnt = qiov->niov;
> +    struct iovec *iov = qiov->iov;
> +
> +    errno = 0;
> +    cur.iov_base = 0;
> +    cur.iov_len = 0;
> +
> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);

libqnio comments:

1.
There are blocking connect(2) and getaddrinfo(3) calls in iio_writev()
so this may hang for arbitrary amounts of time.  This is not permitted
in .bdrv_aio_readv()/.bdrv_aio_writev().  Please make qnio actually
asynchronous.

2.
Where does client_callback() free reply?  It looks like every reply
message causes a memory leak!

3.
Buffer overflow in iio_writev() since device[128] cannot fit the device
string generated from the vdisk_guid.

4.
Buffer overflow in iio_writev() due to
strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is
larger than target[64].  Also note the previous comments about strncpy()
usage.

5.
I don't see any endianness handling or portable alignment of struct
fields in the network protocol code.  Binary network protocols need to
take care of these issue for portability.  This means libqnio compiled
for different architectures will not work.  Do you plan to support any
other architectures besides x86?
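
A sketch of what explicit (de)serialization could look like; the
qnio_msg_header struct and its fields are hypothetical, and htobe64()
assumes glibc's <endian.h>:

    #include <arpa/inet.h>
    #include <endian.h>
    #include <stdint.h>

    struct qnio_msg_header {
        uint32_t opcode;
        uint64_t payload_size;
    };

    static void qnio_hdr_to_wire(struct qnio_msg_header *h)
    {
        h->opcode = htonl(h->opcode);               /* fixed wire byte order */
        h->payload_size = htobe64(h->payload_size);
    }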

6.
The networking code doesn't look robust: kvset uses assert() on input
from the network so the other side of the connection could cause SIGABRT
(coredump), the client uses the msg pointer as the cookie for the
response packet so the server can easily crash the client by sending a
bogus cookie value, etc.  Even on the client side these things are
troublesome but on a server they are guaranteed security issues.  I
didn't look into it deeply.  Please audit the code.

> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
> +                              int *cfd, int *rfd, Error **errp)
> +{
> +    QDict *backing_options = NULL;
> +    QemuOpts *opts, *tcp_opts;
> +    const char *vxhs_filename;
> +    char *of_vsa_addr = NULL;
> +    Error *local_err = NULL;
> +    const char *vdisk_id_opt;
> +    char *file_name = NULL;
> +    size_t num_servers = 0;
> +    char *str = NULL;
> +    int ret = 0;
> +    int i;
> +
> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> +    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
> +    if (vxhs_filename) {
> +        trace_vxhs_qemu_init_filename(vxhs_filename);
> +    }
> +
> +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
> +    if (!vdisk_id_opt) {
> +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    s->vdisk_guid = g_strdup(vdisk_id_opt);
> +    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
> +
> +    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
> +    if (num_servers < 1) {
> +        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
> +        ret = -EINVAL;
> +        goto out;
> +    } else if (num_servers > VXHS_MAX_HOSTS) {
> +        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
> +        error_append_hint(errp, "Maximum %d servers allowed.\n",
> +                          VXHS_MAX_HOSTS);
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    trace_vxhs_qemu_init_numservers(num_servers);
> +
> +    for (i = 0; i < num_servers; i++) {
> +        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
> +        qdict_extract_subqdict(options, &backing_options, str);
> +
> +        /* Create opts info from runtime_tcp_opts list */
> +        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
> +        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
> +        if (local_err) {
> +            qdict_del(backing_options, str);

backing_options is leaked and there's no need to delete the str key.

> +            qemu_opts_del(tcp_opts);
> +            g_free(str);
> +            ret = -EINVAL;
> +            goto out;
> +        }
> +
> +        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
> +                                                            VXHS_OPT_HOST));
> +        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
> +                                                                 VXHS_OPT_PORT),
> +                                                    NULL, 0);

This will segfault if the port option was missing.
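
A minimal guard before the conversion (sketch, reusing the error style from
the vdisk_id check above):

    const char *port = qemu_opt_get(tcp_opts, VXHS_OPT_PORT);
    if (!port) {
        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_PORT);
        ret = -EINVAL;
        goto out;   /* after the same per-iteration cleanup as the error path above */
    }
    s->vdisk_hostinfo[i].port = g_ascii_strtoll(port, NULL, 0);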

> +
> +        s->vdisk_hostinfo[i].qnio_cfd = -1;
> +        s->vdisk_hostinfo[i].vdisk_rfd = -1;
> +        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
> +                             s->vdisk_hostinfo[i].port);

It's not safe to use the %s format specifier for a trace event with a
NULL value.  In the case where hostip is NULL this could crash on some
systems.

> +
> +        qdict_del(backing_options, str);
> +        qemu_opts_del(tcp_opts);
> +        g_free(str);
> +    }

backing_options is leaked.

> +
> +    s->vdisk_nhosts = i;
> +    s->vdisk_cur_host_idx = 0;
> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);

Can we get here with num_servers == 0?  In that case this would access
uninitialized memory.  I guess num_servers == 0 does not make sense and
there should be an error case for it.

> +
> +    /*
> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
> +     */
> +    if (global_qnio_ctx == NULL) {
> +        global_qnio_ctx = vxhs_setup_qnio();

libqnio comment:
The client epoll thread should mask all signals (like
qemu_thread_create()).  Otherwise it may receive signals that it cannot
deal with.
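
A sketch of the masking around thread creation, mirroring what
qemu_thread_create() does (qnio_spawn_epoll_thread() is a hypothetical
wrapper name):

    #include <pthread.h>
    #include <signal.h>

    static int qnio_spawn_epoll_thread(pthread_t *tid,
                                       void *(*fn)(void *), void *arg)
    {
        sigset_t all, old;
        int ret;

        sigfillset(&all);
        pthread_sigmask(SIG_SETMASK, &all, &old);   /* child inherits full mask */
        ret = pthread_create(tid, NULL, fn, arg);
        pthread_sigmask(SIG_SETMASK, &old, NULL);   /* restore caller's mask */
        return ret;
    }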

> +        if (global_qnio_ctx == NULL) {
> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
> +            ret = -EINVAL;
> +            goto out;
> +        }
> +    }
> +
> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
> +    if (!ret) {
> +        error_setg(&local_err, "Failed qnio_iio_open");
> +        ret = -EIO;
> +    }

The return value of vxhs_qnio_iio_open() is 0 for success or -errno for
error.

I guess you never ran this code!  The block driver won't even open
successfully.

> +
> +out:
> +    g_free(file_name);
> +    g_free(of_vsa_addr);
> +    qemu_opts_del(opts);
> +
> +    if (ret < 0) {
> +        for (i = 0; i < num_servers; i++) {
> +            g_free(s->vdisk_hostinfo[i].hostip);
> +        }
> +        g_free(s->vdisk_guid);
> +        s->vdisk_guid = NULL;
> +        errno = -ret;

There is no need to set errno here.  The return value already contains
the error and the caller doesn't look at errno.

> +    }
> +    error_propagate(errp, local_err);
> +
> +    return ret;
> +}
> +
> +static int vxhs_open(BlockDriverState *bs, QDict *options,
> +              int bdrv_flags, Error **errp)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    AioContext *aio_context;
> +    int qemu_qnio_cfd = -1;
> +    int device_opened = 0;
> +    int qemu_rfd = -1;
> +    int ret = 0;
> +    int i;
> +
> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
> +    if (ret < 0) {
> +        trace_vxhs_open_fail(ret);
> +        return ret;
> +    }
> +
> +    device_opened = 1;
> +    s->qnio_ctx = global_qnio_ctx;
> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
> +    s->vdisk_size = 0;
> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
> +
> +    /*
> +     * Create a pipe for communicating between two threads in different
> +     * context. Set handler for read event, which gets triggered when
> +     * IO completion is done by non-QEMU context.
> +     */
> +    ret = qemu_pipe(s->fds);
> +    if (ret < 0) {
> +        trace_vxhs_open_epipe('.');
> +        ret = -errno;
> +        goto errout;

This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc.
bdrv_close() will not be called so this function must do cleanup itself.

> +    }
> +    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
> +
> +    aio_context = bdrv_get_aio_context(bs);
> +    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
> +                       false, vxhs_aio_event_reader, NULL, s);
> +
> +    /*
> +     * Initialize the spin-locks.
> +     */
> +    qemu_spin_init(&s->vdisk_lock);
> +    qemu_spin_init(&s->vdisk_acb_lock);
> +
> +    return 0;
> +
> +errout:
> +    /*
> +     * Close remote vDisk device if it was opened earlier
> +     */
> +    if (device_opened) {

This is always true.  The device_opened variable can be removed.

> +/*
> + * This allocates QEMU-VXHS callback for each IO
> + * and is passed to QNIO. When QNIO completes the work,
> + * it will be passed back through the callback.
> + */
> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
> +                                int64_t sector_num, QEMUIOVector *qiov,
> +                                int nb_sectors,
> +                                BlockCompletionFunc *cb,
> +                                void *opaque, int iodir)
> +{
> +    VXHSAIOCB *acb = NULL;
> +    BDRVVXHSState *s = bs->opaque;
> +    size_t size;
> +    uint64_t offset;
> +    int iio_flags = 0;
> +    int ret = 0;
> +    void *qnio_ctx = s->qnio_ctx;
> +    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
> +
> +    offset = sector_num * BDRV_SECTOR_SIZE;
> +    size = nb_sectors * BDRV_SECTOR_SIZE;
> +
> +    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
> +    /*
> +     * Setup or initialize VXHSAIOCB.
> +     * Every single field should be initialized since
> +     * acb will be picked up from the slab without
> +     * initializing with zero.
> +     */
> +    acb->io_offset = offset;
> +    acb->size = size;
> +    acb->ret = 0;
> +    acb->flags = 0;
> +    acb->aio_done = VXHS_IO_INPROGRESS;
> +    acb->segments = 0;
> +    acb->buffer = 0;
> +    acb->qiov = qiov;
> +    acb->direction = iodir;
> +
> +    qemu_spin_lock(&s->vdisk_lock);
> +    if (OF_VDISK_FAILED(s)) {
> +        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
> +        qemu_spin_unlock(&s->vdisk_lock);
> +        goto errout;
> +    }
> +    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
> +        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
> +        s->vdisk_aio_retry_qd++;
> +        OF_AIOCB_FLAGS_SET_QUEUED(acb);
> +        qemu_spin_unlock(&s->vdisk_lock);
> +        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
> +        goto out;
> +    }
> +    s->vdisk_aio_count++;
> +    qemu_spin_unlock(&s->vdisk_lock);
> +
> +    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
> +
> +    switch (iodir) {
> +    case VDISK_AIO_WRITE:
> +            vxhs_inc_acb_segment_count(acb, 1);
> +            ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
> +                                       offset, (void *)acb, iio_flags);
> +            break;
> +    case VDISK_AIO_READ:
> +            vxhs_inc_acb_segment_count(acb, 1);
> +            ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
> +                                       offset, (void *)acb, iio_flags);
> +            break;
> +    default:
> +            trace_vxhs_aio_rw_invalid(iodir);
> +            goto errout;

s->vdisk_aio_count must be decremented before returning.

> +static void vxhs_close(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int i;
> +
> +    trace_vxhs_close(s->vdisk_guid);
> +    close(s->fds[VDISK_FD_READ]);
> +    close(s->fds[VDISK_FD_WRITE]);
> +
> +    /*
> +     * Clearing all the event handlers for oflame registered to QEMU
> +     */
> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
> +                       false, NULL, NULL, NULL);

Please remove the event handler before closing the fd.  I don't think it
matters in this case but in other scenarios there could be race
conditions if another thread opens an fd and the file descriptor number
is reused.
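
I.e., a sketch of the reordering:

    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
                       false, NULL, NULL, NULL);
    close(s->fds[VDISK_FD_READ]);
    close(s->fds[VDISK_FD_WRITE]);
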
Jeff Cody Sept. 29, 2016, 1:46 a.m. UTC | #6
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using JSON syntax:
> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'
> 
> Sample command line using URI syntax:
> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D
> 
> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>

It would be very useful for reviewing the code if there were any sort of
technical documentation (whitepaper, text file, email, etc.) that describes
the details of VXHS.

[...]


There is still a good hunk of this patch that I have yet to fully go
through, but there is enough here to address first (both from my comments,
and others).


> + * IO specific flags
> + */
> +#define IIO_FLAG_ASYNC              0x00000001
> +#define IIO_FLAG_DONE               0x00000010
> +#define IIO_FLAG_SYNC               0
> +
> +#define VDISK_FD_READ               0
> +#define VDISK_FD_WRITE              1
> +#define VXHS_MAX_HOSTS              4
> +
> +#define VXHS_OPT_FILENAME           "filename"
> +#define VXHS_OPT_VDISK_ID           "vdisk_id"
> +#define VXHS_OPT_SERVER             "server."
> +#define VXHS_OPT_HOST               "host"
> +#define VXHS_OPT_PORT               "port"
> +
> +/* qnio client ioapi_ctx */
> +static void *global_qnio_ctx;
> +

The use of a global qnio_ctx means that you are only going to be able to
connect to one vxhs server.  I.e., QEMU will not be able to have multiple
drives with different VXHS servers, unless I am missing something.

Is that by design?

I don't see any reason why you could not contain this to the BDRVVXHSState
struct, so that it is limited in scope to each drive instance.


[...]


> +
> +static int32_t
> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
> +                    void *ctx, uint32_t flags)
> +{

I looked at the underlying iio_ioctl() code, and it is a bit odd that
vdisk_size is always required by it, especially since it doesn't allow a
NULL parameter in case you don't care about it.

> +    int   ret = 0;
> +
> +    switch (opcode) {
> +    case VDISK_STAT:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_AIO_FLUSH:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_CHECK_IO_FAILOVER_READY:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
> +                                     in, ctx, flags);
> +        break;
> +
> +    default:
> +        ret = -ENOTSUP;
> +        break;
> +    }

This whole function is necessary only because the underlying iio_ioctl()
returns success for unknown opcodes.  I think this is a mistake, and you
probably don't want to do it this way.

The iio_ioctl() function should return -ENOTSUP for unknown opcodes; this is
not a hard failure, and it allows us to determine what the underlying
library supports.  And then this function can disappear completely.

Since QEMU and libqnio are decoupled, you won't know at runtime what version
of libqnio is available on whatever system it is running on.  As the library
and driver evolve, there will likely come a time when you will want to know
in the QEMU driver if the QNIO version installed supports a certain feature.
If iio_ioctl (and other functions, as appropriate) return -ENOTSUP for
unsupported features, then it becomes easy to probe to see what is
supported.

I'd actually go a step further - have the iio_ioctl() function not filter on
opcodes either, and just blindly pass the opcodes to the server via
iio_ioctl_json().  That way you can let the server tell QEMU what it
supports, at least as far as ioctl operations go.
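
A sketch of the QEMU-side probe that becomes possible once iio_ioctl()
returns -ENOTSUP for unknown opcodes (as suggested above):

    ret = iio_ioctl(s->qnio_ctx,
                    s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
                    IOR_VDISK_CHECK_IO_FAILOVER_READY, &out_val, s, flags);
    if (ret == -ENOTSUP) {
        /* older libqnio/server: failover readiness probing unavailable */
    }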


> +
> +    if (ret) {
> +        *in = 0;
> +        trace_vxhs_qnio_iio_ioctl(opcode);
> +    }
> +
> +    return ret;
> +}
> +
> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
> +{
> +    /*
> +     * Close vDisk device
> +     */
> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
> +    }
> +
> +    /*
> +     * Close QNIO channel against cached channel-fd
> +     */
> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
> +    }
> +}
> +
> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
> +                              int *rfd, const char *file_name)
> +{
> +    /*
> +     * Open qnio channel to storage agent if not opened before.
> +     */
> +    if (*cfd < 0) {
> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
> +        if (*cfd < 0) {
> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
> +            return -ENODEV;
> +        }
> +    }
> +
> +    /*
> +     * Open vdisk device
> +     */
> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
> +
> +    if (*rfd < 0) {
> +        if (*cfd >= 0) {

This if conditional can be dropped, *cfd is always >= 0 here.

> +            iio_close(global_qnio_ctx, *cfd);
> +            *cfd = -1;
> +            *rfd = -1;
> +        }
> +
> +        trace_vxhs_qnio_iio_devopen(file_name);
> +        return -ENODEV;
> +    }
> +
> +    return 0;
> +}
> +

[...]

> +
> +static int vxhs_switch_storage_agent(BDRVVXHSState *s)
> +{
> +    int res = 0;
> +    int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE);
> +
> +    trace_vxhs_switch_storage_agent(
> +              s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
> +              s->vdisk_guid);
> +
> +    res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx);
> +    if (res == 0) {
> +        res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
> +                  VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags);

Segfault here.  The libqnio library doesn't allow NULL for the vdisk size
argument (although it should).


> +    } else {
> +        trace_vxhs_switch_storage_agent_failed(
> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
> +                  s->vdisk_guid, res, errno);
> +        /*
> +         * Try the next host.
> +         * Calling vxhs_check_failover_status from here ties up the qnio
> +         * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1)
> +         * for all the hosts in the IO target list.
> +         */
> +
> +        vxhs_check_failover_status(res, s);
> +    }
> +    return res;
> +}
> +
> +static void vxhs_check_failover_status(int res, void *ctx)
> +{
> +    BDRVVXHSState *s = ctx;
> +
> +    if (res == 0) {
> +        /* found failover target */
> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
> +        s->vdisk_ask_failover_idx = 0;
> +        trace_vxhs_check_failover_status(
> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
> +                   s->vdisk_guid);
> +        qemu_spin_lock(&s->vdisk_lock);
> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
> +        qemu_spin_unlock(&s->vdisk_lock);
> +        vxhs_handle_queued_ios(s);
> +    } else {
> +        /* keep looking */
> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
> +        s->vdisk_ask_failover_idx++;
> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
> +            /* pause and cycle through list again */
> +            sleep(QNIO_CONNECT_RETRY_SECS);

Repeating my v6 review comments: this absolutely cannot happen here.
This is not just called from your callback, but also from your
.bdrv_aio_readv, .bdrv_aio_writev implementations.  Don't sleep() QEMU.

Resolving this will probably require some redesign of the failover code
and of when it is called; because of the variety of code paths that can invoke
this code, you also cannot do a coroutine yield here, either.


[...]


> +static int32_t
> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
> +                     uint64_t offset, void *ctx, uint32_t flags)
> +{
> +    struct iovec cur;
> +    uint64_t cur_offset = 0;
> +    uint64_t cur_write_len = 0;
> +    int segcount = 0;
> +    int ret = 0;
> +    int i, nsio = 0;
> +    int iovcnt = qiov->niov;
> +    struct iovec *iov = qiov->iov;
> +
> +    errno = 0;
> +    cur.iov_base = 0;
> +    cur.iov_len = 0;
> +
> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
> +
> +    if (ret == -1 && errno == EFBIG) {
> +        trace_vxhs_qnio_iio_writev(ret);
> +        /*
> +         * IO size is larger than IIO_IO_BUF_SIZE hence need to
> +         * split the I/O at IIO_IO_BUF_SIZE boundary
> +         * There are two cases here:
> +         *  1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE
> +         *  2. iovcnt is greater than 1 and IO size is greater than
> +         *     IIO_IO_BUF_SIZE.
> +         *
> +         * Need to adjust the segment count, for that we need to compute
> +         * the segment count and increase the segment count in one shot
> +         * instead of setting iteratively in for loop. It is required to
> +         * prevent any race between the splitted IO submission and IO
> +         * completion.
> +         */

If I understand the for loop below correctly, it is all done to set nsio to the
correct value, right?

> +        cur_offset = offset;
> +        for (i = 0; i < iovcnt; i++) {
> +            if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) {
> +                cur_offset += iov[i].iov_len;
> +                nsio++;

Is this chunk:

> +            } else if (iov[i].iov_len > 0) {
> +                cur.iov_base = iov[i].iov_base;
> +                cur.iov_len = IIO_IO_BUF_SIZE;
> +                cur_write_len = 0;
> +                while (1) {
> +                    nsio++;
> +                    cur_write_len += cur.iov_len;
> +                    if (cur_write_len == iov[i].iov_len) {
> +                        break;
> +                    }
> +                    cur_offset += cur.iov_len;
> +                    cur.iov_base += cur.iov_len;
> +                    if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) {
> +                        cur.iov_len = IIO_IO_BUF_SIZE;
> +                    } else {
> +                        cur.iov_len = (iov[i].iov_len - cur_write_len);
> +                    }
> +                }
> +            }

... effectively just doing this?

tmp = iov[i].iov_len / IIO_IO_BUF_SIZE;
nsio += tmp;
nsio += iov[i].iov_len % IIO_IO_BUF_SIZE ? 1 : 0;

> +        }
> +
> +        segcount = nsio - 1;
> +        vxhs_inc_acb_segment_count(ctx, segcount);
> +


[...]


> +static int vxhs_open(BlockDriverState *bs, QDict *options,
> +              int bdrv_flags, Error **errp)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    AioContext *aio_context;
> +    int qemu_qnio_cfd = -1;
> +    int device_opened = 0;
> +    int qemu_rfd = -1;
> +    int ret = 0;
> +    int i;
> +
> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
> +    if (ret < 0) {
> +        trace_vxhs_open_fail(ret);
> +        return ret;
> +    }
> +
> +    device_opened = 1;

This is still unneeded; it is always == 1 in any path that can hit errout.

> +    s->qnio_ctx = global_qnio_ctx;
> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
> +    s->vdisk_size = 0;
> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);


[...]

> +
> +/*
> + * This is called by QEMU when a flush gets triggered from within
> + * a guest at the block layer, either for IDE or SCSI disks.
> + */
> +static int vxhs_co_flush(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t size = 0;
> +    int ret = 0;
> +
> +    /*
> +     * VDISK_AIO_FLUSH ioctl is a no-op at present and will
> +     * always return success. This could change in the future.
> +     */

Rather than always returning success, since it is a no-op how about having it
return -ENOTSUP?  That can then be filtered out here, but it gives us the
ability to determine later if a version of vxhs supports flush or not.

Or, you could just not implement vxhs_co_flush() at all; the block layer
will assume success in that case.

> +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> +            VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);
> +
> +    if (ret < 0) {
> +        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
> +        vxhs_close(bs);

But if we leave this here, and if you are using vxhs_close() for cleanup
(and I like that you do), you need to make sure to set bs->drv = NULL.
Otherwise, 1) subsequent I/O will not have a good day, and 2) the final
invocation of bdrv_close() will double free resources.

> +    }
> +
> +    return ret;
> +}
> +
> +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s)
> +{
> +    int64_t vdisk_size = 0;
> +    int ret = 0;
> +
> +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> +            VDISK_STAT, &vdisk_size, NULL, 0);
> +
> +    if (ret < 0) {
> +        trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
> +        return 0;

Why return 0, rather than the error?

> +    }
> +
> +    trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size);
> +    return vdisk_size;
> +}
> +
> +/*
> + * Returns the size of vDisk in bytes. This is required
> + * by QEMU block upper block layer so that it is visible
> + * to guest.
> + */
> +static int64_t vxhs_getlength(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * Fetch the vDisk size using stat ioctl
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {

If vxhs_get_vdisk_stat() returned the error rather than 0, then this check
is unnecessary (assuming vxhs_qnio_iio_ioctl() returns useful errors
itself).

> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;
> +}
> +
> +/*
> + * Returns actual blocks allocated for the vDisk.
> + * This is required by qemu-img utility.
> + */

This returns bytes, not blocks.

> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * TODO:
> +         * Once HyperScale storage-virtualizer provides
> +         * actual physical allocation of blocks then
> +         * fetch that information and return back to the
> +         * caller but for now just get the full size.
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {
> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;
> +}

1. This is identical to vxhs_getlength(), so you could just call that, but:

2. Don't, because you are not returning what is expected by this function.
   If it is not supported yet by VXHS, just don't implement
   bdrv_get_allocated_file_size.  The block driver code will do the right
   thing, and return -ENOTSUP (which ends up showing up as this value is
   unavailable).  This is much more useful than having the wrong value
   returned.  
   
3. Unless, of course, 100% file allocation is always true, in which case you
   can ignore point #2 and just follow #1.

-Jeff
Jeff Cody Sept. 29, 2016, 2:18 a.m. UTC | #7
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using JSON syntax:
> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'
> 
> Sample command line using URI syntax:
> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D
> 
> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>

Hi Ashish,

You've received a lot of feedback to digest on your patch -- creating a
whole new block driver can be difficult!

If I may make a suggestion: it is usually more productive to address the
feedback via email _before_ you code up and send out the next patch version,
unless the comments are straightforward and don't need any discussion (many
comments often don't).

For instance, I appreciate your reply to the feedback on v6, but by the time
it hit my inbox v7 was already there, so it more or less kills the
discussion and starts the review cycle afresh.  Not a big deal, but overall
it would probably be more productive if that reply was sent first, and we
could discuss things like sleep, coroutines, caching, global ctx instances,
etc... :) Those discussion might then help shape the next patch even more,
and result in fewer iterations.


Thanks, and happy hacking!
-Jeff
Ashish Mittal Sept. 29, 2016, 5:30 p.m. UTC | #8
That makes perfect sense. I will try and follow this method from now on. Thanks!

On Wed, Sep 28, 2016 at 7:18 PM, Jeff Cody <jcody@redhat.com> wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>> This patch adds support for a new block device type called "vxhs".
>> Source code for the library that this code loads can be downloaded from:
>> https://github.com/MittalAshish/libqnio.git
>>
>> Sample command line using JSON syntax:
>> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'
>>
>> Sample command line using URI syntax:
>> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D
>>
>> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>
>
> Hi Ashish,
>
> You've received a lot of feedback to digest on your patch -- creating a
> whole new block driver can be difficult!
>
> If I may make a suggestion: it is usually more productive to address the
> feedback via email _before_ you code up and send out the next patch version,
> unless the comments are straightforward and don't need any discussion (many
> comments often don't).
>
> For instance, I appreciate your reply to the feedback on v6, but by the time
> it hit my inbox v7 was already there, so it more or less kills the
> discussion and starts the review cycle afresh.  Not a big deal, but overall
> it would probably be more productive if that reply was sent first, and we
> could discuss things like sleep, coroutines, caching, global ctx instances,
> etc... :) Those discussion might then help shape the next patch even more,
> and result in fewer iterations.
>
>
> Thanks, and happy hacking!
> -Jeff
Stefan Hajnoczi Sept. 30, 2016, 8:36 a.m. UTC | #9
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git

The QEMU block driver should deal with BlockDriver<->libqnio integration
and libqnio should deal with vxhs logic (network protocol, failover,
etc).  Right now the vxhs logic is spread between both components.  If
responsibilities aren't cleanly separated between QEMU and libqnio then
I see no point in having libqnio.

Failover code should move into libqnio so that programs using libqnio
avoid duplicating the failover code.

Similarly IIO_IO_BUF_SIZE/segments should be handled internally by
libqnio so programs using libqnio do not duplicate this code.

libqnio itself can be simplified significantly:

The multi-threading is not necessary and adds complexity.  Right now
there seem to be two reasons for multi-threading: shared contexts and
the epoll thread.  Both can be eliminated as follows.

Shared contexts do not make sense in a multi-disk, multi-core
environment.  Why is it advantageous to tie disks to a single context?
It's simpler and more multi-core friendly to let every disk have its own
connection.

The epoll thread forces library users to use thread synchronization when
processing callbacks.  Look at libiscsi for an example of how to
eliminate it.  Two APIs are defined: int iscsi_get_fd(iscsi) and int
iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT).  The program using the
library can integrate the fd into its own event loop.  The advantage of
doing this is that no library threads are necessary and all callbacks
are invoked from the program's event loop.  Therefore no thread
synchronization is needed.

If you make these changes then all multi-threading in libqnio and the
QEMU block driver can be dropped.  There will be less code and it will
be simpler.
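
As a rough illustration of the libiscsi-style integration described above, here is a minimal sketch of how the QEMU side could look if libqnio exposed an fd per connection. The functions qnio_get_fd() and qnio_service() are hypothetical names and not part of the current libqnio API; BDRVVXHSState, aio_set_fd_handler() and bdrv_get_aio_context() are taken from the patch under review.

    /* Hypothetical fd-based integration, modelled on libiscsi.  Completion
     * callbacks would run from this handler, i.e. inside QEMU's event loop,
     * so no library threads and no extra locking would be needed. */
    static void vxhs_fd_read(void *opaque)
    {
        BDRVVXHSState *s = opaque;

        qnio_service(s->qnio_ctx);           /* assumed: process ready events */
    }

    static void vxhs_attach_aio_context(BlockDriverState *bs, AioContext *ctx)
    {
        BDRVVXHSState *s = bs->opaque;

        aio_set_fd_handler(ctx, qnio_get_fd(s->qnio_ctx), false,
                           vxhs_fd_read, NULL, s);
    }

    static void vxhs_detach_aio_context(BlockDriverState *bs)
    {
        BDRVVXHSState *s = bs->opaque;

        aio_set_fd_handler(bdrv_get_aio_context(bs), qnio_get_fd(s->qnio_ctx),
                           false, NULL, NULL, NULL);
    }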
Ashish Mittal Oct. 1, 2016, 3:10 a.m. UTC | #10
Hi Stefan, others,

Thank you for all the review comments.

On Wed, Sep 28, 2016 at 4:13 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>> +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c"
>
> Why do several trace events have a %c format specifier at the end and it
> always takes a '.' value?
>
>> +#define QNIO_CONNECT_TIMOUT_SECS    120
>
> This isn't used and there is a typo (s/TIMOUT/TIMEOUT/).  Can it be
> dropped?
>
>> +static int32_t
>> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
>> +                    void *ctx, uint32_t flags)
>> +{
>> +    int   ret = 0;
>> +
>> +    switch (opcode) {
>> +    case VDISK_STAT:
>
> It seems unnecessary to abstract the iio_ioctl() constants and then have
> a switch statement to translate to the actual library constants.  It
> makes little sense since the flags argument already uses the library
> constants.  Just use the library's constants.
>
>> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
>> +                                     in, ctx, flags);
>> +        break;
>> +
>> +    case VDISK_AIO_FLUSH:
>> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
>> +                                     in, ctx, flags);
>> +        break;
>> +
>> +    case VDISK_CHECK_IO_FAILOVER_READY:
>> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
>> +                                     in, ctx, flags);
>> +        break;
>> +
>> +    default:
>> +        ret = -ENOTSUP;
>> +        break;
>> +    }
>> +
>> +    if (ret) {
>> +        *in = 0;
>
> Some callers pass in = NULL so this will crash.
>
> The naming seems wrong: this is an output argument, not an input
> argument.  Please call it "out_val" or similar.
>
>> +    res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx);
>> +    if (res == 0) {
>> +        res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
>> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
>> +                  VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags);
>
> Looking at iio_ioctl(), I'm not sure how this can ever work.  The fourth
> argument is NULL and iio_ioctl() will attempt *vdisk_size = 0 so this
> will crash.
>
> Do you have tests that exercise this code path?
>

You are right. This bug crept into the FAILOVER path when I moved the
qnio shim code to qemu. Earlier code in libqnio did *in = 0 on a
per-case basis and skipped it for VDISK_CHECK_IO_FAILOVER_READY. I
will fix this.

We do thoroughly test these code paths, but the problem is that the
existing tests do not fully work with the new changes I am doing. I do
not yet have a test case to exercise failover with the latest code. I do
frequently test using qemu-io (open a vdisk, read, write and re-read
to check the written data) and also try to bring up an existing guest VM
with the latest qemu-system-x86_64 binary to make sure I don't regress
the main functionality. I did not, however, run these tests for the v7
patch, so some of the v7 changes do break the code. This patch involved
some major code restructuring over v6, so my intention was just to get a
feel for whether the main code structure looks good.

A lot of changes have been proposed. I will discuss these with the
team and get back with inputs. I guess having a test framework is
really important at this time.

Regards,
Ashish

On Fri, Sep 30, 2016 at 1:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>> This patch adds support for a new block device type called "vxhs".
>> Source code for the library that this code loads can be downloaded from:
>> https://github.com/MittalAshish/libqnio.git
>
> The QEMU block driver should deal with BlockDriver<->libqnio integration
> and libqnio should deal with vxhs logic (network protocol, failover,
> etc).  Right now the vxhs logic is spread between both components.  If
> responsibilities aren't cleanly separated between QEMU and libqnio then
> I see no point in having libqnio.
>
> Failover code should move into libqnio so that programs using libqnio
> avoid duplicating the failover code.
>
> Similarly IIO_IO_BUF_SIZE/segments should be handled internally by
> libqnio so programs using libqnio do not duplicate this code.
>
> libqnio itself can be simplified significantly:
>
> The multi-threading is not necessary and adds complexity.  Right now
> there seem to be two reasons for multi-threading: shared contexts and
> the epoll thread.  Both can be eliminated as follows.
>
> Shared contexts do not make sense in a multi-disk, multi-core
> environment.  Why is it advantages to tie disks to a single context?
> It's simpler and more multi-core friendly to let every disk have its own
> connection.
>
> The epoll thread forces library users to use thread synchronization when
> processing callbacks.  Look at libiscsi for an example of how to
> eliminate it.  Two APIs are defined: int iscsi_get_fd(iscsi) and int
> iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT).  The program using the
> library can integrate the fd into its own event loop.  The advantage of
> doing this is that no library threads are necessary and all callbacks
> are invoked from the program's event loop.  Therefore no thread
> synchronization is needed.
>
> If you make these changes then all multi-threading in libqnio and the
> QEMU block driver can be dropped.  There will be less code and it will
> be simpler.
Stefan Hajnoczi Oct. 3, 2016, 2:10 p.m. UTC | #11
On Sat, Oct 1, 2016 at 4:10 AM, ashish mittal <ashmit602@gmail.com> wrote:
>> If you make these changes then all multi-threading in libqnio and the
>> QEMU block driver can be dropped.  There will be less code and it will
>> be simpler.

You'll get a lot of basic tests for free if you add vxhs support to
the existing tests/qemu-iotests/ framework.

A vxhs server is required so qemu-iotests has something to run
against.  I saw some server code in libqnio but haven't investigated
how complete it is.  The main thing is to support read/write/flush.

Stefan
Jeff Cody Oct. 5, 2016, 4:02 a.m. UTC | #12
On Wed, Sep 28, 2016 at 12:13:32PM +0100, Stefan Hajnoczi wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:

[...]

> > +/*
> > + * This is called by QEMU when a flush gets triggered from within
> > + * a guest at the block layer, either for IDE or SCSI disks.
> > + */
> > +static int vxhs_co_flush(BlockDriverState *bs)
> 
> This is called from coroutine context, please add the coroutine_fn
> function attribute to document this.
> 
> > +{
> > +    BDRVVXHSState *s = bs->opaque;
> > +    int64_t size = 0;
> > +    int ret = 0;
> > +
> > +    /*
> > +     * VDISK_AIO_FLUSH ioctl is a no-op at present and will
> > +     * always return success. This could change in the future.
> > +     */
> > +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> > +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> > +            VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);
> 
> This function is not allowed to block.  It cannot do a synchronous
> flush.  This line is misleading because the constant is called
> VDISK_AIO_FLUSH, but looking at the library code I see it's actually a
> synchronous call that ends up in a loop that sleeps (!) waiting for the
> response.
> 
> Please do an async flush and qemu_coroutine_yield() to return
> control to QEMU's event loop.  When the flush completes you can
> qemu_coroutine_enter() again to return from this function.
> 
> > +
> > +    if (ret < 0) {
> > +        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
> > +        vxhs_close(bs);
> 
> This looks unsafe.  Won't it cause double close() calls for s->fds[]
> when bdrv_close() is called later?
> 

Calling the close on a failed flush is a good idea, in my opinion.  However,
to make it safe, bs->drv MUST be set to NULL after the call to vxhs_close().
That will prevent double free / close calls,  and also fail out new I/O.
(This is actually what the gluster driver does, for instance).

Jeff
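
For illustration, a sketch of the error path Jeff describes, applied to the flush function quoted above. This only shows the bs->drv = NULL change and the coroutine_fn attribute; it does not address Stefan's separate point that the flush itself should be made asynchronous.

    static coroutine_fn int vxhs_co_flush(BlockDriverState *bs)
    {
        BDRVVXHSState *s = bs->opaque;
        int64_t size = 0;
        int ret;

        ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
                s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
                VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);
        if (ret < 0) {
            trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
            vxhs_close(bs);
            /* Fail out new I/O and prevent a second close of s->fds[] when
             * bdrv_close() runs later, as block/gluster.c does. */
            bs->drv = NULL;
        }
        return ret;
    }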
Ketan Nilangekar Oct. 20, 2016, 1:31 a.m. UTC | #13
Hi Stefan,

1. Your suggestion to move the failover implementation to libqnio is well taken. In fact we are proposing that service/network failovers should not be handled in the qemu address space at all. The vxhs driver will know and talk to only a single virtual IP. The service behind the virtual IP may fail and move to another node without the qemu driver noticing it.

This way the failover logic will be completely out of qemu address space. We are considering use of some of our proprietary clustering/monitoring services to implement service failover.


2. The idea of having a multi-threaded, epoll-based network client was to drive more throughput by using a multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (a typical VM is assumed to have at least 2) across 8 connections.
Each connection is serviced by a single epoll and does not share its context with other connections/epolls. All memory pools/queues are in the context of a connection/epoll.
The qemu thread enqueues an IO request in one of the 8 epoll queues using round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback handles the split IOs, which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
The number of client epoll threads (8) is a #define in qnio and can easily be changed. However, our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.

Let us know what you think about these proposals.

Thanks,
Ketan.
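
For illustration only, a minimal sketch of the round-robin dispatch described in point 2 above; the types and helper names here are invented placeholders and are not the actual qnio code.

    #define QNIO_NUM_EPOLL_THREADS 8          /* the #define mentioned above */

    struct qnio_request;                      /* invented placeholder types */
    struct qnio_queue;
    void qnio_queue_push(struct qnio_queue *q, struct qnio_request *req);

    struct qnio_client {
        struct qnio_queue *queues[QNIO_NUM_EPOLL_THREADS]; /* one per epoll thread */
        unsigned int next;                                 /* round-robin cursor */
    };

    /* Called from the qemu thread: pick the next connection/epoll in turn. */
    static void qnio_submit(struct qnio_client *c, struct qnio_request *req)
    {
        qnio_queue_push(c->queues[c->next++ % QNIO_NUM_EPOLL_THREADS], req);
    }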


On 9/30/16, 1:36 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>> This patch adds support for a new block device type called "vxhs".
>> Source code for the library that this code loads can be downloaded from:
>> https://github.com/MittalAshish/libqnio.git
>
>The QEMU block driver should deal with BlockDriver<->libqnio integration
>and libqnio should deal with vxhs logic (network protocol, failover,
>etc).  Right now the vxhs logic is spread between both components.  If
>responsibilities aren't cleanly separated between QEMU and libqnio then
>I see no point in having libqnio.
>
>Failover code should move into libqnio so that programs using libqnio
>avoid duplicating the failover code.
>
>Similarly IIO_IO_BUF_SIZE/segments should be handled internally by
>libqnio so programs using libqnio do not duplicate this code.
>
>libqnio itself can be simplified significantly:
>
>The multi-threading is not necessary and adds complexity.  Right now
>there seem to be two reasons for multi-threading: shared contexts and
>the epoll thread.  Both can be eliminated as follows.
>
>Shared contexts do not make sense in a multi-disk, multi-core
>environment.  Why is it advantages to tie disks to a single context?
>It's simpler and more multi-core friendly to let every disk have its own
>connection.
>
>The epoll thread forces library users to use thread synchronization when
>processing callbacks.  Look at libiscsi for an example of how to
>eliminate it.  Two APIs are defined: int iscsi_get_fd(iscsi) and int
>iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT).  The program using the
>library can integrate the fd into its own event loop.  The advantage of
>doing this is that no library threads are necessary and all callbacks
>are invoked from the program's event loop.  Therefore no thread
>synchronization is needed.
>
>If you make these changes then all multi-threading in libqnio and the
>QEMU block driver can be dropped.  There will be less code and it will
>be simpler.
Paolo Bonzini Oct. 24, 2016, 2:24 p.m. UTC | #14
On 20/10/2016 03:31, Ketan Nilangekar wrote:
> This way the failover logic will be completely out of qemu address
> space. We are considering use of some of our proprietary
> clustering/monitoring services to implement service failover.

Are you implementing a different protocol just for the sake of QEMU, in
other words, and forwarding from that protocol to your proprietary code?

If that is what you are doing, you don't need at all a vxhs driver in
QEMU.  Just implement NBD or iSCSI on your side, QEMU already has
drivers for that.

Paolo
Abhijit Dey Oct. 25, 2016, 1:56 a.m. UTC | #15
+ Venky from Storage engineering

Sent from my iPhone

> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> 
> 
>> On 20/10/2016 03:31, Ketan Nilangekar wrote:
>> This way the failover logic will be completely out of qemu address
>> space. We are considering use of some of our proprietary
>> clustering/monitoring services to implement service failover.
> 
> Are you implementing a different protocol just for the sake of QEMU, in
> other words, and forwarding from that protocol to your proprietary code?
> 
> If that is what you are doing, you don't need at all a vxhs driver in
> QEMU.  Just implement NBD or iSCSI on your side, QEMU already has
> drivers for that.
> 
> Paolo
Ketan Nilangekar Oct. 25, 2016, 5:07 a.m. UTC | #16
We are able to derive significant performance from the qemu block driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd based io tap in the past and the performance of qemu block driver is significantly better. Hence we would like to go with the vxhs driver for now.

Ketan


> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> 
> 
>> On 20/10/2016 03:31, Ketan Nilangekar wrote:
>> This way the failover logic will be completely out of qemu address
>> space. We are considering use of some of our proprietary
>> clustering/monitoring services to implement service failover.
> 
> Are you implementing a different protocol just for the sake of QEMU, in
> other words, and forwarding from that protocol to your proprietary code?
> 
> If that is what you are doing, you don't need at all a vxhs driver in
> QEMU.  Just implement NBD or iSCSI on your side, QEMU already has
> drivers for that.
> 
> Paolo
Abhijit Dey Oct. 25, 2016, 5:15 a.m. UTC | #17
+ Venky

Sent from my iPhone

> On Oct 25, 2016, at 7:07 AM, Ketan Nilangekar <Ketan.Nilangekar@veritas.com> wrote:
> 
> We are able to derive significant performance from the qemu block driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd based io tap in the past and the performance of qemu block driver is significantly better. Hence we would like to go with the vxhs driver for now.
> 
> Ketan
> 
> 
>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> 
>> 
>> 
>>> On 20/10/2016 03:31, Ketan Nilangekar wrote:
>>> This way the failover logic will be completely out of qemu address
>>> space. We are considering use of some of our proprietary
>>> clustering/monitoring services to implement service failover.
>> 
>> Are you implementing a different protocol just for the sake of QEMU, in
>> other words, and forwarding from that protocol to your proprietary code?
>> 
>> If that is what you are doing, you don't need at all a vxhs driver in
>> QEMU.  Just implement NBD or iSCSI on your side, QEMU already has
>> drivers for that.
>> 
>> Paolo
Paolo Bonzini Oct. 25, 2016, 11:01 a.m. UTC | #18
On 25/10/2016 07:07, Ketan Nilangekar wrote:
> We are able to derive significant performance from the qemu block
> driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd
> based io tap in the past and the performance of qemu block driver is
> significantly better. Hence we would like to go with the vxhs driver
> for now.

Is this still true with failover implemented outside QEMU (which
requires I/O to be proxied, if I'm not mistaken)?  What does the benefit
come from if so, is it the threaded backend and performing multiple
connections to the same server?

Paolo

> Ketan
> 
> 
>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com>
>> wrote:
>> 
>> 
>> 
>>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the
>>> failover logic will be completely out of qemu address space. We
>>> are considering use of some of our proprietary 
>>> clustering/monitoring services to implement service failover.
>> 
>> Are you implementing a different protocol just for the sake of
>> QEMU, in other words, and forwarding from that protocol to your
>> proprietary code?
>> 
>> If that is what you are doing, you don't need at all a vxhs driver
>> in QEMU.  Just implement NBD or iSCSI on your side, QEMU already
>> has drivers for that.
>> 
>> Paolo
Paolo Bonzini Oct. 25, 2016, 9:59 p.m. UTC | #19
On 25/10/2016 23:53, Ketan Nilangekar wrote:
> We need to confirm the perf numbers but it really depends on the way we do failover outside qemu.
> 
> We are looking at a vip based failover implementation which may need
> some handling code in qnio but that overhead should be minimal (atleast
> no more than the current impl in qemu driver)

Then it's not outside QEMU's address space, it's only outside
block/vxhs.c... I don't understand.

Paolo

> IMO, the real benefit of qemu + qnio perf comes from:
> 1. the epoll based io multiplexer
> 2. 8 epoll threads
> 3. Zero buffer copies in userland code
> 4. Minimal locking
>
> We are also looking at replacing the existing qnio socket code with
> memory readv/writev calls available with the latest kernel for even
> better performance.

> 
> Ketan
> 
>> On Oct 25, 2016, at 1:01 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>>
>>> On 25/10/2016 07:07, Ketan Nilangekar wrote:
>>> We are able to derive significant performance from the qemu block
>>> driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd
>>> based io tap in the past and the performance of qemu block driver is
>>> significantly better. Hence we would like to go with the vxhs driver
>>> for now.
>>
>> Is this still true with failover implemented outside QEMU (which
>> requires I/O to be proxied, if I'm not mistaken)?  What does the benefit
>> come from if so, is it the threaded backend and performing multiple
>> connections to the same server?
>>
>> Paolo
>>
>>> Ketan
>>>
>>>
>>>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the
>>>>> failover logic will be completely out of qemu address space. We
>>>>> are considering use of some of our proprietary 
>>>>> clustering/monitoring services to implement service failover.
>>>>
>>>> Are you implementing a different protocol just for the sake of
>>>> QEMU, in other words, and forwarding from that protocol to your
>>>> proprietary code?
>>>>
>>>> If that is what you are doing, you don't need at all a vxhs driver
>>>> in QEMU.  Just implement NBD or iSCSI on your side, QEMU already
>>>> has drivers for that.
>>>>
>>>> Paolo
> 
>
Ketan Nilangekar Oct. 26, 2016, 10:17 p.m. UTC | #20
Including the rest of the folks from the original thread.


Ketan.

On 10/26/16, 9:33 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:

>
>
>On 26/10/2016 00:39, Ketan Nilangekar wrote:
>> 
>> 
>>> On Oct 26, 2016, at 12:00 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>>
>>>> On 25/10/2016 23:53, Ketan Nilangekar wrote:
>>>> We need to confirm the perf numbers but it really depends on the way we do failover outside qemu.
>>>>
>>>> We are looking at a vip based failover implementation which may need
>>>> some handling code in qnio but that overhead should be minimal (atleast
>>>> no more than the current impl in qemu driver)
>>>
>>> Then it's not outside QEMU's address space, it's only outside
>>> block/vxhs.c... I don't understand.
>>>
>>> Paolo
>>>
>> 
>> Yes and that is something that we are considering and not finalized on a design. But even if some of the failover code is in the qnio library, is that a problem? 
>> As per my understanding the original suggestions were around getting the failover code out of the block driver and into the network library.
>> If an optimal design for this means that some of the failover handling needs to be done in qnio, is that not acceptable?
>> The way we see it, driver/qnio will talk to the storage service using a single IP but may have some retry code for retransmitting failed IOs in a failover scenario.
>
>Sure, that's fine.  It's just that it seemed different from the previous
>explanation.
>
>Paolo
>
>>>> IMO, the real benefit of qemu + qnio perf comes from:
>>>> 1. the epoll based io multiplexer
>>>> 2. 8 epoll threads
>>>> 3. Zero buffer copies in userland code
>>>> 4. Minimal locking
>>>>
>>>> We are also looking at replacing the existing qnio socket code with
>>>> memory readv/writev calls available with the latest kernel for even
>>>> better performance.
>>>
>>>>
>>>> Ketan
>>>>
>>>>> On Oct 25, 2016, at 1:01 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 25/10/2016 07:07, Ketan Nilangekar wrote:
>>>>>> We are able to derive significant performance from the qemu block
>>>>>> driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd
>>>>>> based io tap in the past and the performance of qemu block driver is
>>>>>> significantly better. Hence we would like to go with the vxhs driver
>>>>>> for now.
>>>>>
>>>>> Is this still true with failover implemented outside QEMU (which
>>>>> requires I/O to be proxied, if I'm not mistaken)?  What does the benefit
>>>>> come from if so, is it the threaded backend and performing multiple
>>>>> connections to the same server?
>>>>>
>>>>> Paolo
>>>>>
>>>>>> Ketan
>>>>>>
>>>>>>
>>>>>>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the
>>>>>>>> failover logic will be completely out of qemu address space. We
>>>>>>>> are considering use of some of our proprietary 
>>>>>>>> clustering/monitoring services to implement service failover.
>>>>>>>
>>>>>>> Are you implementing a different protocol just for the sake of
>>>>>>> QEMU, in other words, and forwarding from that protocol to your
>>>>>>> proprietary code?
>>>>>>>
>>>>>>> If that is what you are doing, you don't need at all a vxhs driver
>>>>>>> in QEMU.  Just implement NBD or iSCSI on your side, QEMU already
>>>>>>> has drivers for that.
>>>>>>>
>>>>>>> Paolo
>>>>
>>>>
Stefan Hajnoczi Nov. 4, 2016, 9:49 a.m. UTC | #21
On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.

Have you benchmarked against just 1 epoll thread with 8 connections?

Stefan
Stefan Hajnoczi Nov. 4, 2016, 9:52 a.m. UTC | #22
On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.

By the way, when you benchmark with 8 epoll threads, are there any other
guests with vxhs running on the machine?

In a real-life situation where multiple VMs are running on a single host
it may turn out that giving each VM 8 epoll threads doesn't help at all
because the host CPUs are busy with other tasks.

Stefan
Ketan Nilangekar Nov. 4, 2016, 6:30 p.m. UTC | #23
> On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
>> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
>> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
>> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
>> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
>> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
>> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.
> 
> By the way, when you benchmark with 8 epoll threads, are there any other
> guests with vxhs running on the machine?
> 

Yes. In fact the total throughput with around 4-5 VMs scales well to saturate around 90% of the available storage throughput of a typical PCIe SSD device.

> In a real-life situation where multiple VMs are running on a single host
> it may turn out that giving each VM 8 epoll threads doesn't help at all
> because the host CPUs are busy with other tasks.

The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. 

But as I mentioned, today we can change this simply by rebuilding qnio with a different value for the #define.

Ketan
> Stefan
Ketan Nilangekar Nov. 4, 2016, 6:44 p.m. UTC | #24
> On Nov 4, 2016, at 2:49 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
>> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
>> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
>> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
>> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
>> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
>> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.
> 
> Have you benchmarked against just 1 epoll thread with 8 connections?
> 

The first implementation of qnio was actually single-threaded with 8 connections. The single-VM throughput at the time, IIRC, was less than half of what we are getting now, especially with a workload doing IOs on multiple vdisks. I assume we will need some sort of CPU/core affinity to get the most out of a single epoll-threaded design.

Ketan

> Stefan
Stefan Hajnoczi Nov. 7, 2016, 10:22 a.m. UTC | #25
On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote:
> > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
> >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
> >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
> >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
> >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
> >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.
> > 
> > By the way, when you benchmark with 8 epoll threads, are there any other
> > guests with vxhs running on the machine?
> > 
> 
> Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device.
> 
> > In a real-life situation where multiple VMs are running on a single host
> > it may turn out that giving each VM 8 epoll threads doesn't help at all
> > because the host CPUs are busy with other tasks.
> 
> The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. 
> 
> But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define

In QEMU there is currently work to add multiqueue support to the block
layer.  This enables true multiqueue from the guest down to the storage
backend.

virtio-blk already supports multiple queues but they are all processed
from the same thread in QEMU today.  Once multiple threads are able to
process the queues it would make sense to continue down into the vxhs
block driver.

So I don't think implementing multiple epoll threads in libqnio is
useful in the long term.  Rather, a straightforward approach of
integrating with the libqnio user's event loop (as described in my
previous emails) would simplify the code and allow you to take advantage
of full multiqueue support in the future.

Stefan
Ketan Nilangekar Nov. 7, 2016, 8:27 p.m. UTC | #26
On 11/7/16, 2:22 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote:
>> > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
>> >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections.
>> >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
>> >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
>> >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
>> >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.
>> >
>> > By the way, when you benchmark with 8 epoll threads, are there any other
>> > guests with vxhs running on the machine?
>> >
>>
>> Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device.
>>
>> > In a real-life situation where multiple VMs are running on a single host
>> > it may turn out that giving each VM 8 epoll threads doesn't help at all
>> > because the host CPUs are busy with other tasks.
>>
>> The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions.
>>
>> But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define
>
>In QEMU there is currently work to add multiqueue support to the block
>layer.  This enables true multiqueue from the guest down to the storage
>backend.

Is there any spec or documentation on this that you can point us to?

>
>virtio-blk already supports multiple queues but they are all processed
>from the same thread in QEMU today.  Once multiple threads are able to
>process the queues it would make sense to continue down into the vxhs
>block driver.
>
>So I don't think implementing multiple epoll threads in libqnio is
>useful in the long term.  Rather, a straightforward approach of
>integrating with the libqnio user's event loop (as described in my
>previous emails) would simplify the code and allow you to take advantage
>of full multiqueue support in the future.

Makes sense. We will take this up in the next iteration of libqnio.

Thanks,
Ketan.

>
>Stefan
Stefan Hajnoczi Nov. 8, 2016, 3:39 p.m. UTC | #27
On Mon, Nov 07, 2016 at 08:27:39PM +0000, Ketan Nilangekar wrote:
> On 11/7/16, 2:22 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> >On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote:
> >> > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote:
> >> >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. 
> >> >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll.
> >> >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size.
> >> >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls.
> >> >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory.
> >> > 
> >> > By the way, when you benchmark with 8 epoll threads, are there any other
> >> > guests with vxhs running on the machine?
> >> > 
> >> 
> >> Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device.
> >> 
> >> > In a real-life situation where multiple VMs are running on a single host
> >> > it may turn out that giving each VM 8 epoll threads doesn't help at all
> >> > because the host CPUs are busy with other tasks.
> >> 
> >> The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. 
> >> 
> >> But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define
> >
> >In QEMU there is currently work to add multiqueue support to the block
> >layer.  This enables true multiqueue from the guest down to the storage
> >backend.
> 
> Is there any spec or documentation on this that you can point us to?

The current status is:

1. virtio-blk and virtio-scsi support multiple queues but these queues
   are processed from a single thread today.

2. MemoryRegions can be marked with !global_locking so its handler
   functions are dispatched without taking the QEMU global mutex.  This
   allows device emulation to run in multiple threads.

3. Paolo Bonzini (CCed) is currently working on making the block layer
   (BlockDriverState and co) support access from multiple threads and
   multiqueue.  This is work in progress.

If you are interested in this work keep an eye out for patch series from
Paolo Bonzini and Fam Zheng.

Stefan
Paolo Bonzini Nov. 9, 2016, 12:47 p.m. UTC | #28
On 08/11/2016 16:39, Stefan Hajnoczi wrote:
> The current status is:
> 
> 1. virtio-blk and virtio-scsi support multiple queues but these queues
>    are processed from a single thread today.
> 
> 2. MemoryRegions can be marked with !global_locking so its handler
>    functions are dispatched without taking the QEMU global mutex.  This
>    allows device emulation to run in multiple threads.

Alternatively, virtio-blk and virtio-scsi can already use ioeventfd and
"-object iothread,id=FOO -device virtio-blk-pci,iothread=FOO" to let
device emulation run in a separate thread that doesn't take the QEMU
global mutex.

> 3. Paolo Bonzini (CCed) is currently working on make the block layer
>    (BlockDriverState and co) support access from multiple threads and
>    multiqueue.  This is work in progress.
> 
> If you are interested in this work keep an eye out for patch series from
> Paolo Bonzini and Fam Zheng.

The first part (drop RFifoLock) was committed for 2.8.  It's a
relatively long road, but these are the currently ready parts of the work:

- take AioContext acquire/release in small critical sections
- push AioContext down to individual callbacks
- make BlockDriverState thread-safe

The latter needs rebasing after the last changes to dirty bitmaps, but I
think these patches should be ready for 2.9.

These are the planned bits:

- replace AioContext with fine-grained mutex in bdrv_aio_*
- protect everything with CoMutex in bdrv_co_*
- remove aio_context_acquire/release

For now I was not planning to make network backends support multiqueue,
only files.

Paolo
Stefan Hajnoczi Nov. 14, 2016, 3:05 p.m. UTC | #29
On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>
> Review of .bdrv_open() and .bdrv_aio_writev() code paths.
>
> The big issues I see in this driver and libqnio:
>
> 1. Showstoppers like broken .bdrv_open() and leaking memory on every
>    reply message.
> 2. Insecure due to missing input validation (network packets and
>    configuration) and incorrect string handling.
> 3. Not fully asynchronous so QEMU and the guest may hang.
>
> Please think about the whole codebase and not just the lines I've
> pointed out in this review when fixing these sorts of issues.  There may
> be similar instances of these bugs elsewhere and it's important that
> they are fixed so that this can be merged.

Ping?

You didn't respond to the comments I raised.  The libqnio buffer
overflows and other issues from this email are still present.

I put a lot of time into reviewing this patch series and libqnio.  If
you want to get reviews please address feedback before sending a new
patch series.

>
>> +/*
>> + * Structure per vDisk maintained for state
>> + */
>> +typedef struct BDRVVXHSState {
>> +    int                     fds[2];
>> +    int64_t                 vdisk_size;
>> +    int64_t                 vdisk_blocks;
>> +    int64_t                 vdisk_flags;
>> +    int                     vdisk_aio_count;
>> +    int                     event_reader_pos;
>> +    VXHSAIOCB               *qnio_event_acb;
>> +    void                    *qnio_ctx;
>> +    QemuSpin                vdisk_lock; /* Lock to protect BDRVVXHSState */
>> +    QemuSpin                vdisk_acb_lock;  /* Protects ACB */
>
> These comments are insufficient for documenting locking.  Not all fields
> are actually protected by these locks.  Please order fields according to
> lock coverage:
>
> typedef struct VXHSAIOCB {
>     ...
>
>     /* Protected by BDRVVXHSState->vdisk_acb_lock */
>     int                 segments;
>     ...
> };
>
> typedef struct BDRVVXHSState {
>     ...
>
>     /* Protected by vdisk_lock */
>     QemuSpin                vdisk_lock;
>     int                     vdisk_aio_count;
>     QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
>     ...
> }
>
>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
>> +{
>> +    /*
>> +     * Close vDisk device
>> +     */
>> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
>> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
>
> libqnio comment:
> Why does iio_devclose() take an unused cfd argument?  Perhaps it can be
> dropped.
>
>> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
>> +    }
>> +
>> +    /*
>> +     * Close QNIO channel against cached channel-fd
>> +     */
>> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
>> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
>
> libqnio comment:
> Why does iio_devclose() take an int32_t cfd argument but iio_close()
> takes a uint32_t cfd argument?
>
>> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
>> +    }
>> +}
>> +
>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
>> +                              int *rfd, const char *file_name)
>> +{
>> +    /*
>> +     * Open qnio channel to storage agent if not opened before.
>> +     */
>> +    if (*cfd < 0) {
>> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
>
> libqnio comments:
>
> 1.
> There is a buffer overflow in qnio_create_channel().  strncpy() is used
> incorrectly so long hostname or port (both can be 99 characters long)
> will overflow channel->name[] (64 characters) or channel->port[] (8
> characters).
>
>     strncpy(channel->name, hostname, strlen(hostname) + 1);
>     strncpy(channel->port, port, strlen(port) + 1);
>
> The third argument must be the size of the *destination* buffer, not the
> source buffer.  Also note that strncpy() doesn't NUL-terminate the
> destination string so you must do that manually to ensure there is a NUL
> byte at the end of the buffer.
>
> 2.
> channel is leaked in the "Failed to open single connection" error case
> in qnio_create_channel().
>
> 3.
> If host is longer the 63 characters then the ioapi_ctx->channels and
> qnio_ctx->channels maps will use different keys due to string truncation
> in qnio_create_channel().  This means "Channel already exists" in
> qnio_create_channel() and possibly other things will not work as
> expected.
>
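
As a sketch of the fix for the strncpy() misuse in point 1 above, assuming channel->name and channel->port are the fixed-size arrays described there:

    /* Bound the copy by the destination size and NUL-terminate explicitly. */
    strncpy(channel->name, hostname, sizeof(channel->name) - 1);
    channel->name[sizeof(channel->name) - 1] = '\0';
    strncpy(channel->port, port, sizeof(channel->port) - 1);
    channel->port[sizeof(channel->port) - 1] = '\0';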
>> +        if (*cfd < 0) {
>> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
>> +            return -ENODEV;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Open vdisk device
>> +     */
>> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
>
> libqnio comment:
> Buffer overflow in iio_devopen() since chandev[128] is not large enough
> to hold channel[100] + " " + devpath[arbitrary length] chars:
>
>     sprintf(chandev, "%s %s", channel, devpath);
>
>> +
>> +    if (*rfd < 0) {
>> +        if (*cfd >= 0) {
>
> This check is always true.  Otherwise the return -ENODEV would have been
> taken above.  The if statement isn't necessary.
>
>> +static void vxhs_check_failover_status(int res, void *ctx)
>> +{
>> +    BDRVVXHSState *s = ctx;
>> +
>> +    if (res == 0) {
>> +        /* found failover target */
>> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
>> +        s->vdisk_ask_failover_idx = 0;
>> +        trace_vxhs_check_failover_status(
>> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>> +                   s->vdisk_guid);
>> +        qemu_spin_lock(&s->vdisk_lock);
>> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        vxhs_handle_queued_ios(s);
>> +    } else {
>> +        /* keep looking */
>> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
>> +        s->vdisk_ask_failover_idx++;
>> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
>> +            /* pause and cycle through list again */
>> +            sleep(QNIO_CONNECT_RETRY_SECS);
>
> This code is called from a QEMU thread via vxhs_aio_rw().  It is not
> permitted to call sleep() since it will freeze QEMU and probably the
> guest.
>
> If you need a timer you can use QEMU's timer APIs.  See aio_timer_new(),
> timer_new_ns(), timer_mod(), timer_del(), timer_free().
>
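
A rough sketch of the timer-based retry suggested here, assuming a new QEMUTimer *retry_timer field in BDRVVXHSState (the field does not exist in the patch) and reusing the existing vxhs_switch_storage_agent():

    /* Runs in the AioContext when the retry delay expires, instead of
     * blocking the QEMU thread with sleep(). */
    static void vxhs_failover_retry_cb(void *opaque)
    {
        BDRVVXHSState *s = opaque;

        s->vdisk_ask_failover_idx = 0;
        vxhs_switch_storage_agent(s);    /* cycle through the host list again */
    }

    /* Replaces the sleep(QNIO_CONNECT_RETRY_SECS) call; s->retry_timer would
     * be created once with aio_timer_new(bdrv_get_aio_context(bs),
     * QEMU_CLOCK_REALTIME, SCALE_NS, vxhs_failover_retry_cb, s). */
    static void vxhs_schedule_failover_retry(BDRVVXHSState *s)
    {
        timer_mod(s->retry_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
                  QNIO_CONNECT_RETRY_SECS * NANOSECONDS_PER_SECOND);
    }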
>> +            s->vdisk_ask_failover_idx = 0;
>> +        }
>> +        res = vxhs_switch_storage_agent(s);
>> +    }
>> +}
>> +
>> +static int vxhs_failover_io(BDRVVXHSState *s)
>> +{
>> +    int res = 0;
>> +
>> +    trace_vxhs_failover_io(s->vdisk_guid);
>> +
>> +    s->vdisk_ask_failover_idx = 0;
>> +    res = vxhs_switch_storage_agent(s);
>> +
>> +    return res;
>> +}
>> +
>> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
>> +                       uint32_t error, uint32_t opcode)
>
> This function is doing too much.  Especially the failover code should
> run in the AioContext since it's complex.  Don't do failover here
> because this function is outside the AioContext lock.  Do it from
> AioContext using a QEMUBH like block/rbd.c.
>
>> +static int32_t
>> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
>> +                     uint64_t offset, void *ctx, uint32_t flags)
>> +{
>> +    struct iovec cur;
>> +    uint64_t cur_offset = 0;
>> +    uint64_t cur_write_len = 0;
>> +    int segcount = 0;
>> +    int ret = 0;
>> +    int i, nsio = 0;
>> +    int iovcnt = qiov->niov;
>> +    struct iovec *iov = qiov->iov;
>> +
>> +    errno = 0;
>> +    cur.iov_base = 0;
>> +    cur.iov_len = 0;
>> +
>> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
>
> libqnio comments:
>
> 1.
> There are blocking connect(2) and getaddrinfo(3) calls in iio_writev()
> so this may hang for arbitrary amounts of time.  This is not permitted
> in .bdrv_aio_readv()/.bdrv_aio_writev().  Please make qnio actually
> asynchronous.
>
> 2.
> Where does client_callback() free reply?  It looks like every reply
> message causes a memory leak!
>
> 3.
> Buffer overflow in iio_writev() since device[128] cannot fit the device
> string generated from the vdisk_guid.
>
> 4.
> Buffer overflow in iio_writev() due to
> strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is
> larger than target[64].  Also note the previous comments about strncpy()
> usage.
>
> 5.
> I don't see any endianness handling or portable alignment of struct
> fields in the network protocol code.  Binary network protocols need to
> take care of these issue for portability.  This means libqnio compiled
> for different architectures will not work.  Do you plan to support any
> other architectures besides x86?
>
> 6.
> The networking code doesn't look robust: kvset uses assert() on input
> from the network so the other side of the connection could cause SIGABRT
> (coredump), the client uses the msg pointer as the cookie for the
> response packet so the server can easily crash the client by sending a
> bogus cookie value, etc.  Even on the client side these things are
> troublesome but on a server they are guaranteed security issues.  I
> didn't look into it deeply.  Please audit the code.
>
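
On point 5 above, the usual approach for a binary wire protocol is to give every message field a fixed size and a fixed byte order and convert at the boundaries; the header below is an invented example for illustration, not the actual qnio message format:

    #include <stdint.h>
    #include <arpa/inet.h>   /* htonl()/ntohl() */

    /* Invented example header: explicitly sized fields, big-endian on the wire. */
    struct example_hdr {
        uint32_t opcode;
        uint32_t payload_len;
    } __attribute__((packed));

    static void example_hdr_to_wire(struct example_hdr *h,
                                    uint32_t opcode, uint32_t len)
    {
        h->opcode = htonl(opcode);
        h->payload_len = htonl(len);
    }

    static void example_hdr_from_wire(const struct example_hdr *h,
                                      uint32_t *opcode, uint32_t *len)
    {
        *opcode = ntohl(h->opcode);
        *len = ntohl(h->payload_len);
    }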
>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
>> +                              int *cfd, int *rfd, Error **errp)
>> +{
>> +    QDict *backing_options = NULL;
>> +    QemuOpts *opts, *tcp_opts;
>> +    const char *vxhs_filename;
>> +    char *of_vsa_addr = NULL;
>> +    Error *local_err = NULL;
>> +    const char *vdisk_id_opt;
>> +    char *file_name = NULL;
>> +    size_t num_servers = 0;
>> +    char *str = NULL;
>> +    int ret = 0;
>> +    int i;
>> +
>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>> +    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
>> +    if (vxhs_filename) {
>> +        trace_vxhs_qemu_init_filename(vxhs_filename);
>> +    }
>> +
>> +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
>> +    if (!vdisk_id_opt) {
>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +    s->vdisk_guid = g_strdup(vdisk_id_opt);
>> +    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
>> +
>> +    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
>> +    if (num_servers < 1) {
>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
>> +        ret = -EINVAL;
>> +        goto out;
>> +    } else if (num_servers > VXHS_MAX_HOSTS) {
>> +        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
>> +        error_append_hint(errp, "Maximum %d servers allowed.\n",
>> +                          VXHS_MAX_HOSTS);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +    trace_vxhs_qemu_init_numservers(num_servers);
>> +
>> +    for (i = 0; i < num_servers; i++) {
>> +        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
>> +        qdict_extract_subqdict(options, &backing_options, str);
>> +
>> +        /* Create opts info from runtime_tcp_opts list */
>> +        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
>> +        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
>> +        if (local_err) {
>> +            qdict_del(backing_options, str);
>
> backing_options is leaked and there's no need to delete the str key.
>
>> +            qemu_opts_del(tcp_opts);
>> +            g_free(str);
>> +            ret = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
>> +                                                            VXHS_OPT_HOST));
>> +        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
>> +                                                                 VXHS_OPT_PORT),
>> +                                                    NULL, 0);
>
> This will segfault if the port option was missing.
>
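
A sketch of the missing check; port_str is a new local, everything else is taken from the code quoted above:

    /* qemu_opt_get() returns NULL when the option is absent, so check it
     * before handing the value to g_ascii_strtoll(). */
    const char *port_str = qemu_opt_get(tcp_opts, VXHS_OPT_PORT);
    if (!port_str) {
        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_PORT);
        qdict_del(backing_options, str);
        qemu_opts_del(tcp_opts);
        g_free(str);
        ret = -EINVAL;
        goto out;
    }
    s->vdisk_hostinfo[i].port = g_ascii_strtoll(port_str, NULL, 0);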
>> +
>> +        s->vdisk_hostinfo[i].qnio_cfd = -1;
>> +        s->vdisk_hostinfo[i].vdisk_rfd = -1;
>> +        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
>> +                             s->vdisk_hostinfo[i].port);
>
> It's not safe to use the %s format specifier for a trace event with a
> NULL value.  In the case where hostip is NULL this could crash on some
> systems.
>
>> +
>> +        qdict_del(backing_options, str);
>> +        qemu_opts_del(tcp_opts);
>> +        g_free(str);
>> +    }
>
> backing_options is leaked.
>
>> +
>> +    s->vdisk_nhosts = i;
>> +    s->vdisk_cur_host_idx = 0;
>> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
>> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
>
> Can we get here with num_servers == 0?  In that case this would access
> uninitialized memory.  I guess num_servers == 0 does not make sense and
> there should be an error case for it.
>
>> +
>> +    /*
>> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
>> +     */
>> +    if (global_qnio_ctx == NULL) {
>> +        global_qnio_ctx = vxhs_setup_qnio();
>
> libqnio comment:
> The client epoll thread should mask all signals (like
> qemu_thread_create()).  Otherwise it may receive signals that it cannot
> deal with.
>
>> +        if (global_qnio_ctx == NULL) {
>> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
>> +            ret = -EINVAL;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
>> +    if (!ret) {
>> +        error_setg(&local_err, "Failed qnio_iio_open");
>> +        ret = -EIO;
>> +    }
>
> The return value of vxhs_qnio_iio_open() is 0 for success or -errno for
> error.
>
> I guess you never ran this code!  The block driver won't even open
> successfully.
>
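> That is, the success test is inverted; a minimal correction would be
> along the lines of:
>
>     ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
>     if (ret < 0) {
>         error_setg(&local_err, "Failed qnio_iio_open");
>         ret = -EIO;
>     }
>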
>> +
>> +out:
>> +    g_free(file_name);
>> +    g_free(of_vsa_addr);
>> +    qemu_opts_del(opts);
>> +
>> +    if (ret < 0) {
>> +        for (i = 0; i < num_servers; i++) {
>> +            g_free(s->vdisk_hostinfo[i].hostip);
>> +        }
>> +        g_free(s->vdisk_guid);
>> +        s->vdisk_guid = NULL;
>> +        errno = -ret;
>
> There is no need to set errno here.  The return value already contains
> the error and the caller doesn't look at errno.
>
>> +    }
>> +    error_propagate(errp, local_err);
>> +
>> +    return ret;
>> +}
>> +
>> +static int vxhs_open(BlockDriverState *bs, QDict *options,
>> +              int bdrv_flags, Error **errp)
>> +{
>> +    BDRVVXHSState *s = bs->opaque;
>> +    AioContext *aio_context;
>> +    int qemu_qnio_cfd = -1;
>> +    int device_opened = 0;
>> +    int qemu_rfd = -1;
>> +    int ret = 0;
>> +    int i;
>> +
>> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
>> +    if (ret < 0) {
>> +        trace_vxhs_open_fail(ret);
>> +        return ret;
>> +    }
>> +
>> +    device_opened = 1;
>> +    s->qnio_ctx = global_qnio_ctx;
>> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
>> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
>> +    s->vdisk_size = 0;
>> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
>> +
>> +    /*
>> +     * Create a pipe for communicating between two threads in different
>> +     * context. Set handler for read event, which gets triggered when
>> +     * IO completion is done by non-QEMU context.
>> +     */
>> +    ret = qemu_pipe(s->fds);
>> +    if (ret < 0) {
>> +        trace_vxhs_open_epipe('.');
>> +        ret = -errno;
>> +        goto errout;
>
> This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc.
> bdrv_close() will not be called so this function must do cleanup itself.
>
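> One possible shape of that cleanup (a sketch only, using the fields
> shown in BDRVVXHSState) would be:
>
>     errout:
>         vxhs_qnio_iio_close(s, 0);
>         for (i = 0; i < s->vdisk_nhosts; i++) {
>             g_free(s->vdisk_hostinfo[i].hostip);
>         }
>         g_free(s->vdisk_guid);
>         s->vdisk_guid = NULL;
>         return ret;
>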
>> +    }
>> +    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
>> +
>> +    aio_context = bdrv_get_aio_context(bs);
>> +    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
>> +                       false, vxhs_aio_event_reader, NULL, s);
>> +
>> +    /*
>> +     * Initialize the spin-locks.
>> +     */
>> +    qemu_spin_init(&s->vdisk_lock);
>> +    qemu_spin_init(&s->vdisk_acb_lock);
>> +
>> +    return 0;
>> +
>> +errout:
>> +    /*
>> +     * Close remote vDisk device if it was opened earlier
>> +     */
>> +    if (device_opened) {
>
> This is always true.  The device_opened variable can be removed.
>
>> +/*
>> + * This allocates QEMU-VXHS callback for each IO
>> + * and is passed to QNIO. When QNIO completes the work,
>> + * it will be passed back through the callback.
>> + */
>> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
>> +                                int64_t sector_num, QEMUIOVector *qiov,
>> +                                int nb_sectors,
>> +                                BlockCompletionFunc *cb,
>> +                                void *opaque, int iodir)
>> +{
>> +    VXHSAIOCB *acb = NULL;
>> +    BDRVVXHSState *s = bs->opaque;
>> +    size_t size;
>> +    uint64_t offset;
>> +    int iio_flags = 0;
>> +    int ret = 0;
>> +    void *qnio_ctx = s->qnio_ctx;
>> +    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
>> +
>> +    offset = sector_num * BDRV_SECTOR_SIZE;
>> +    size = nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
>> +    /*
>> +     * Setup or initialize VXHSAIOCB.
>> +     * Every single field should be initialized since
>> +     * acb will be picked up from the slab without
>> +     * initializing with zero.
>> +     */
>> +    acb->io_offset = offset;
>> +    acb->size = size;
>> +    acb->ret = 0;
>> +    acb->flags = 0;
>> +    acb->aio_done = VXHS_IO_INPROGRESS;
>> +    acb->segments = 0;
>> +    acb->buffer = 0;
>> +    acb->qiov = qiov;
>> +    acb->direction = iodir;
>> +
>> +    qemu_spin_lock(&s->vdisk_lock);
>> +    if (OF_VDISK_FAILED(s)) {
>> +        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        goto errout;
>> +    }
>> +    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
>> +        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
>> +        s->vdisk_aio_retry_qd++;
>> +        OF_AIOCB_FLAGS_SET_QUEUED(acb);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
>> +        goto out;
>> +    }
>> +    s->vdisk_aio_count++;
>> +    qemu_spin_unlock(&s->vdisk_lock);
>> +
>> +    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
>> +
>> +    switch (iodir) {
>> +    case VDISK_AIO_WRITE:
>> +            vxhs_inc_acb_segment_count(acb, 1);
>> +            ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
>> +                                       offset, (void *)acb, iio_flags);
>> +            break;
>> +    case VDISK_AIO_READ:
>> +            vxhs_inc_acb_segment_count(acb, 1);
>> +            ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
>> +                                       offset, (void *)acb, iio_flags);
>> +            break;
>> +    default:
>> +            trace_vxhs_aio_rw_invalid(iodir);
>> +            goto errout;
>
> s->vdisk_aio_count must be decremented before returning.
>
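> For instance, the invalid-direction case could drop the count again
> before bailing out (sketch):
>
>     default:
>             trace_vxhs_aio_rw_invalid(iodir);
>             qemu_spin_lock(&s->vdisk_lock);
>             s->vdisk_aio_count--;
>             qemu_spin_unlock(&s->vdisk_lock);
>             goto errout;
>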
>> +static void vxhs_close(BlockDriverState *bs)
>> +{
>> +    BDRVVXHSState *s = bs->opaque;
>> +    int i;
>> +
>> +    trace_vxhs_close(s->vdisk_guid);
>> +    close(s->fds[VDISK_FD_READ]);
>> +    close(s->fds[VDISK_FD_WRITE]);
>> +
>> +    /*
>> +     * Clearing all the event handlers for oflame registered to QEMU
>> +     */
>> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>> +                       false, NULL, NULL, NULL);
>
> Please remove the event handler before closing the fd.  I don't think it
> matters in this case but in other scenarios there could be race
> conditions if another thread opens an fd and the file descriptor number
> is reused.
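> In other words, roughly:
>
>     aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>                        false, NULL, NULL, NULL);
>     close(s->fds[VDISK_FD_READ]);
>     close(s->fds[VDISK_FD_WRITE]);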
Ashish Mittal Nov. 14, 2016, 6:01 p.m. UTC | #30
Will look into these ASAP.

On Mon, Nov 14, 2016 at 7:05 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>>
>> Review of .bdrv_open() and .bdrv_aio_writev() code paths.
>>
>> The big issues I see in this driver and libqnio:
>>
>> 1. Showstoppers like broken .bdrv_open() and leaking memory on every
>>    reply message.
>> 2. Insecure due to missing input validation (network packets and
>>    configuration) and incorrect string handling.
>> 3. Not fully asynchronous so QEMU and the guest may hang.
>>
>> Please think about the whole codebase and not just the lines I've
>> pointed out in this review when fixing these sorts of issues.  There may
>> be similar instances of these bugs elsewhere and it's important that
>> they are fixed so that this can be merged.
>
> Ping?
>
> You didn't respond to the comments I raised.  The libqnio buffer
> overflows and other issues from this email are still present.
>
> I put a lot of time into reviewing this patch series and libqnio.  If
> you want to get reviews please address feedback before sending a new
> patch series.
>
>>
>>> +/*
>>> + * Structure per vDisk maintained for state
>>> + */
>>> +typedef struct BDRVVXHSState {
>>> +    int                     fds[2];
>>> +    int64_t                 vdisk_size;
>>> +    int64_t                 vdisk_blocks;
>>> +    int64_t                 vdisk_flags;
>>> +    int                     vdisk_aio_count;
>>> +    int                     event_reader_pos;
>>> +    VXHSAIOCB               *qnio_event_acb;
>>> +    void                    *qnio_ctx;
>>> +    QemuSpin                vdisk_lock; /* Lock to protect BDRVVXHSState */
>>> +    QemuSpin                vdisk_acb_lock;  /* Protects ACB */
>>
>> These comments are insufficient for documenting locking.  Not all fields
>> are actually protected by these locks.  Please order fields according to
>> lock coverage:
>>
>> typedef struct VXHSAIOCB {
>>     ...
>>
>>     /* Protected by BDRVVXHSState->vdisk_acb_lock */
>>     int                 segments;
>>     ...
>> };
>>
>> typedef struct BDRVVXHSState {
>>     ...
>>
>>     /* Protected by vdisk_lock */
>>     QemuSpin                vdisk_lock;
>>     int                     vdisk_aio_count;
>>     QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
>>     ...
>> }
>>
>>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
>>> +{
>>> +    /*
>>> +     * Close vDisk device
>>> +     */
>>> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
>>> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
>>
>> libqnio comment:
>> Why does iio_devclose() take an unused cfd argument?  Perhaps it can be
>> dropped.
>>
>>> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
>>> +    }
>>> +
>>> +    /*
>>> +     * Close QNIO channel against cached channel-fd
>>> +     */
>>> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
>>> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
>>
>> libqnio comment:
>> Why does iio_devclose() take an int32_t cfd argument but iio_close()
>> takes a uint32_t cfd argument?
>>
>>> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
>>> +    }
>>> +}
>>> +
>>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
>>> +                              int *rfd, const char *file_name)
>>> +{
>>> +    /*
>>> +     * Open qnio channel to storage agent if not opened before.
>>> +     */
>>> +    if (*cfd < 0) {
>>> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
>>
>> libqnio comments:
>>
>> 1.
>> There is a buffer overflow in qnio_create_channel().  strncpy() is used
>> incorrectly so long hostname or port (both can be 99 characters long)
>> will overflow channel->name[] (64 characters) or channel->port[] (8
>> characters).
>>
>>     strncpy(channel->name, hostname, strlen(hostname) + 1);
>>     strncpy(channel->port, port, strlen(port) + 1);
>>
>> The third argument must be the size of the *destination* buffer, not the
>> source buffer.  Also note that strncpy() doesn't NUL-terminate the
>> destination string so you must do that manually to ensure there is a NUL
>> byte at the end of the buffer.
>>
>> 2.
>> channel is leaked in the "Failed to open single connection" error case
>> in qnio_create_channel().
>>
>> 3.
>> If host is longer than 63 characters then the ioapi_ctx->channels and
>> qnio_ctx->channels maps will use different keys due to string truncation
>> in qnio_create_channel().  This means "Channel already exists" in
>> qnio_create_channel() and possibly other things will not work as
>> expected.
>>
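>> For the strncpy() calls in (1) above, a bounded copy plus explicit
>> termination would look roughly like this (or use g_strlcpy(), which
>> handles the terminator for you):
>>
>>     strncpy(channel->name, hostname, sizeof(channel->name) - 1);
>>     channel->name[sizeof(channel->name) - 1] = '\0';
>>     strncpy(channel->port, port, sizeof(channel->port) - 1);
>>     channel->port[sizeof(channel->port) - 1] = '\0';
>>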
>>> +        if (*cfd < 0) {
>>> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
>>> +            return -ENODEV;
>>> +        }
>>> +    }
>>> +
>>> +    /*
>>> +     * Open vdisk device
>>> +     */
>>> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
>>
>> libqnio comment:
>> Buffer overflow in iio_devopen() since chandev[128] is not large enough
>> to hold channel[100] + " " + devpath[arbitrary length] chars:
>>
>>     sprintf(chandev, "%s %s", channel, devpath);
>>
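>> A bounded format with a truncation check would be safer here, e.g.:
>>
>>     if (snprintf(chandev, sizeof(chandev), "%s %s",
>>                  channel, devpath) >= (int)sizeof(chandev)) {
>>         return -1;   /* or grow the buffer / report the error */
>>     }
>>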
>>> +
>>> +    if (*rfd < 0) {
>>> +        if (*cfd >= 0) {
>>
>> This check is always true.  Otherwise the return -ENODEV would have been
>> taken above.  The if statement isn't necessary.
>>
>>> +static void vxhs_check_failover_status(int res, void *ctx)
>>> +{
>>> +    BDRVVXHSState *s = ctx;
>>> +
>>> +    if (res == 0) {
>>> +        /* found failover target */
>>> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
>>> +        s->vdisk_ask_failover_idx = 0;
>>> +        trace_vxhs_check_failover_status(
>>> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>>> +                   s->vdisk_guid);
>>> +        qemu_spin_lock(&s->vdisk_lock);
>>> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        vxhs_handle_queued_ios(s);
>>> +    } else {
>>> +        /* keep looking */
>>> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
>>> +        s->vdisk_ask_failover_idx++;
>>> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
>>> +            /* pause and cycle through list again */
>>> +            sleep(QNIO_CONNECT_RETRY_SECS);
>>
>> This code is called from a QEMU thread via vxhs_aio_rw().  It is not
>> permitted to call sleep() since it will freeze QEMU and probably the
>> guest.
>>
>> If you need a timer you can use QEMU's timer APIs.  See aio_timer_new(),
>> timer_new_ns(), timer_mod(), timer_del(), timer_free().
>>
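>> A rough sketch of arming a retry timer instead of sleeping (the
>> retry_timer field, the vxhs_failover_timer_cb() callback and the ctx
>> AioContext handle are all hypothetical additions):
>>
>>     /* instead of sleep(QNIO_CONNECT_RETRY_SECS): */
>>     s->retry_timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_NS,
>>                                    vxhs_failover_timer_cb, s);
>>     timer_mod(s->retry_timer,
>>               qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
>>               QNIO_CONNECT_RETRY_SECS * NANOSECONDS_PER_SECOND);
>>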
>>> +            s->vdisk_ask_failover_idx = 0;
>>> +        }
>>> +        res = vxhs_switch_storage_agent(s);
>>> +    }
>>> +}
>>> +
>>> +static int vxhs_failover_io(BDRVVXHSState *s)
>>> +{
>>> +    int res = 0;
>>> +
>>> +    trace_vxhs_failover_io(s->vdisk_guid);
>>> +
>>> +    s->vdisk_ask_failover_idx = 0;
>>> +    res = vxhs_switch_storage_agent(s);
>>> +
>>> +    return res;
>>> +}
>>> +
>>> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
>>> +                       uint32_t error, uint32_t opcode)
>>
>> This function is doing too much.  Especially the failover code should
>> run in the AioContext since it's complex.  Don't do failover here
>> because this function is outside the AioContext lock.  Do it from
>> AioContext using a QEMUBH like block/rbd.c.
>>
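>> A sketch of that pattern (the failover_bh field and
>> vxhs_failover_bh_cb() are hypothetical, and it assumes the state
>> keeps a handle on its AioContext):
>>
>>     static void vxhs_failover_bh_cb(void *opaque)
>>     {
>>         BDRVVXHSState *s = opaque;
>>
>>         qemu_bh_delete(s->failover_bh);
>>         s->failover_bh = NULL;
>>         vxhs_failover_io(s);    /* now runs under the AioContext */
>>     }
>>
>>     /* in vxhs_iio_callback(), instead of failing over inline: */
>>     s->failover_bh = aio_bh_new(s->aio_context, vxhs_failover_bh_cb, s);
>>     qemu_bh_schedule(s->failover_bh);
>>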
>>> +static int32_t
>>> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
>>> +                     uint64_t offset, void *ctx, uint32_t flags)
>>> +{
>>> +    struct iovec cur;
>>> +    uint64_t cur_offset = 0;
>>> +    uint64_t cur_write_len = 0;
>>> +    int segcount = 0;
>>> +    int ret = 0;
>>> +    int i, nsio = 0;
>>> +    int iovcnt = qiov->niov;
>>> +    struct iovec *iov = qiov->iov;
>>> +
>>> +    errno = 0;
>>> +    cur.iov_base = 0;
>>> +    cur.iov_len = 0;
>>> +
>>> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
>>
>> libqnio comments:
>>
>> 1.
>> There are blocking connect(2) and getaddrinfo(3) calls in iio_writev()
>> so this may hang for arbitrary amounts of time.  This is not permitted
>> in .bdrv_aio_readv()/.bdrv_aio_writev().  Please make qnio actually
>> asynchronous.
>>
>> 2.
>> Where does client_callback() free reply?  It looks like every reply
>> message causes a memory leak!
>>
>> 3.
>> Buffer overflow in iio_writev() since device[128] cannot fit the device
>> string generated from the vdisk_guid.
>>
>> 4.
>> Buffer overflow in iio_writev() due to
>> strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is
>> larger than target[64].  Also note the previous comments about strncpy()
>> usage.
>>
>> 5.
>> I don't see any endianness handling or portable alignment of struct
>> fields in the network protocol code.  Binary network protocols need to
>> take care of these issues for portability.  This means libqnio compiled
>> for different architectures will not work.  Do you plan to support any
>> other architectures besides x86?
>>
>> 6.
>> The networking code doesn't look robust: kvset uses assert() on input
>> from the network so the other side of the connection could cause SIGABRT
>> (coredump), the client uses the msg pointer as the cookie for the
>> response packet so the server can easily crash the client by sending a
>> bogus cookie value, etc.  Even on the client side these things are
>> troublesome but on a server they are guaranteed security issues.  I
>> didn't look into it deeply.  Please audit the code.
>>
>>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
>>> +                              int *cfd, int *rfd, Error **errp)
>>> +{
>>> +    QDict *backing_options = NULL;
>>> +    QemuOpts *opts, *tcp_opts;
>>> +    const char *vxhs_filename;
>>> +    char *of_vsa_addr = NULL;
>>> +    Error *local_err = NULL;
>>> +    const char *vdisk_id_opt;
>>> +    char *file_name = NULL;
>>> +    size_t num_servers = 0;
>>> +    char *str = NULL;
>>> +    int ret = 0;
>>> +    int i;
>>> +
>>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>>> +    if (local_err) {
>>> +        error_propagate(errp, local_err);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +
>>> +    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
>>> +    if (vxhs_filename) {
>>> +        trace_vxhs_qemu_init_filename(vxhs_filename);
>>> +    }
>>> +
>>> +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
>>> +    if (!vdisk_id_opt) {
>>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +    s->vdisk_guid = g_strdup(vdisk_id_opt);
>>> +    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
>>> +
>>> +    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
>>> +    if (num_servers < 1) {
>>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    } else if (num_servers > VXHS_MAX_HOSTS) {
>>> +        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
>>> +        error_append_hint(errp, "Maximum %d servers allowed.\n",
>>> +                          VXHS_MAX_HOSTS);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +    trace_vxhs_qemu_init_numservers(num_servers);
>>> +
>>> +    for (i = 0; i < num_servers; i++) {
>>> +        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
>>> +        qdict_extract_subqdict(options, &backing_options, str);
>>> +
>>> +        /* Create opts info from runtime_tcp_opts list */
>>> +        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
>>> +        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
>>> +        if (local_err) {
>>> +            qdict_del(backing_options, str);
>>
>> backing_options is leaked and there's no need to delete the str key.
>>
>>> +            qemu_opts_del(tcp_opts);
>>> +            g_free(str);
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +        }
>>> +
>>> +        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
>>> +                                                            VXHS_OPT_HOST));
>>> +        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
>>> +                                                                 VXHS_OPT_PORT),
>>> +                                                    NULL, 0);
>>
>> This will segfault if the port option was missing.
>>
>>> +
>>> +        s->vdisk_hostinfo[i].qnio_cfd = -1;
>>> +        s->vdisk_hostinfo[i].vdisk_rfd = -1;
>>> +        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
>>> +                             s->vdisk_hostinfo[i].port);
>>
>> It's not safe to use the %s format specifier for a trace event with a
>> NULL value.  In the case where hostip is NULL this could crash on some
>> systems.
>>
>>> +
>>> +        qdict_del(backing_options, str);
>>> +        qemu_opts_del(tcp_opts);
>>> +        g_free(str);
>>> +    }
>>
>> backing_options is leaked.
>>
>>> +
>>> +    s->vdisk_nhosts = i;
>>> +    s->vdisk_cur_host_idx = 0;
>>> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
>>> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
>>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
>>
>> Can we get here with num_servers == 0?  In that case this would access
>> uninitialized memory.  I guess num_servers == 0 does not make sense and
>> there should be an error case for it.
>>
>>> +
>>> +    /*
>>> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
>>> +     */
>>> +    if (global_qnio_ctx == NULL) {
>>> +        global_qnio_ctx = vxhs_setup_qnio();
>>
>> libqnio comment:
>> The client epoll thread should mask all signals (like
>> qemu_thread_create()).  Otherwise it may receive signals that it cannot
>> deal with.
>>
>>> +        if (global_qnio_ctx == NULL) {
>>> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +        }
>>> +    }
>>> +
>>> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
>>> +    if (!ret) {
>>> +        error_setg(&local_err, "Failed qnio_iio_open");
>>> +        ret = -EIO;
>>> +    }
>>
>> The return value of vxhs_qnio_iio_open() is 0 for success or -errno for
>> error.
>>
>> I guess you never ran this code!  The block driver won't even open
>> successfully.
>>
>>> +
>>> +out:
>>> +    g_free(file_name);
>>> +    g_free(of_vsa_addr);
>>> +    qemu_opts_del(opts);
>>> +
>>> +    if (ret < 0) {
>>> +        for (i = 0; i < num_servers; i++) {
>>> +            g_free(s->vdisk_hostinfo[i].hostip);
>>> +        }
>>> +        g_free(s->vdisk_guid);
>>> +        s->vdisk_guid = NULL;
>>> +        errno = -ret;
>>
>> There is no need to set errno here.  The return value already contains
>> the error and the caller doesn't look at errno.
>>
>>> +    }
>>> +    error_propagate(errp, local_err);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int vxhs_open(BlockDriverState *bs, QDict *options,
>>> +              int bdrv_flags, Error **errp)
>>> +{
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    AioContext *aio_context;
>>> +    int qemu_qnio_cfd = -1;
>>> +    int device_opened = 0;
>>> +    int qemu_rfd = -1;
>>> +    int ret = 0;
>>> +    int i;
>>> +
>>> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
>>> +    if (ret < 0) {
>>> +        trace_vxhs_open_fail(ret);
>>> +        return ret;
>>> +    }
>>> +
>>> +    device_opened = 1;
>>> +    s->qnio_ctx = global_qnio_ctx;
>>> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
>>> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
>>> +    s->vdisk_size = 0;
>>> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
>>> +
>>> +    /*
>>> +     * Create a pipe for communicating between two threads in different
>>> +     * context. Set handler for read event, which gets triggered when
>>> +     * IO completion is done by non-QEMU context.
>>> +     */
>>> +    ret = qemu_pipe(s->fds);
>>> +    if (ret < 0) {
>>> +        trace_vxhs_open_epipe('.');
>>> +        ret = -errno;
>>> +        goto errout;
>>
>> This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc.
>> bdrv_close() will not be called so this function must do cleanup itself.
>>
>>> +    }
>>> +    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
>>> +
>>> +    aio_context = bdrv_get_aio_context(bs);
>>> +    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
>>> +                       false, vxhs_aio_event_reader, NULL, s);
>>> +
>>> +    /*
>>> +     * Initialize the spin-locks.
>>> +     */
>>> +    qemu_spin_init(&s->vdisk_lock);
>>> +    qemu_spin_init(&s->vdisk_acb_lock);
>>> +
>>> +    return 0;
>>> +
>>> +errout:
>>> +    /*
>>> +     * Close remote vDisk device if it was opened earlier
>>> +     */
>>> +    if (device_opened) {
>>
>> This is always true.  The device_opened variable can be removed.
>>
>>> +/*
>>> + * This allocates QEMU-VXHS callback for each IO
>>> + * and is passed to QNIO. When QNIO completes the work,
>>> + * it will be passed back through the callback.
>>> + */
>>> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
>>> +                                int64_t sector_num, QEMUIOVector *qiov,
>>> +                                int nb_sectors,
>>> +                                BlockCompletionFunc *cb,
>>> +                                void *opaque, int iodir)
>>> +{
>>> +    VXHSAIOCB *acb = NULL;
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    size_t size;
>>> +    uint64_t offset;
>>> +    int iio_flags = 0;
>>> +    int ret = 0;
>>> +    void *qnio_ctx = s->qnio_ctx;
>>> +    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
>>> +
>>> +    offset = sector_num * BDRV_SECTOR_SIZE;
>>> +    size = nb_sectors * BDRV_SECTOR_SIZE;
>>> +
>>> +    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
>>> +    /*
>>> +     * Setup or initialize VXHSAIOCB.
>>> +     * Every single field should be initialized since
>>> +     * acb will be picked up from the slab without
>>> +     * initializing with zero.
>>> +     */
>>> +    acb->io_offset = offset;
>>> +    acb->size = size;
>>> +    acb->ret = 0;
>>> +    acb->flags = 0;
>>> +    acb->aio_done = VXHS_IO_INPROGRESS;
>>> +    acb->segments = 0;
>>> +    acb->buffer = 0;
>>> +    acb->qiov = qiov;
>>> +    acb->direction = iodir;
>>> +
>>> +    qemu_spin_lock(&s->vdisk_lock);
>>> +    if (OF_VDISK_FAILED(s)) {
>>> +        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        goto errout;
>>> +    }
>>> +    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
>>> +        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
>>> +        s->vdisk_aio_retry_qd++;
>>> +        OF_AIOCB_FLAGS_SET_QUEUED(acb);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
>>> +        goto out;
>>> +    }
>>> +    s->vdisk_aio_count++;
>>> +    qemu_spin_unlock(&s->vdisk_lock);
>>> +
>>> +    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
>>> +
>>> +    switch (iodir) {
>>> +    case VDISK_AIO_WRITE:
>>> +            vxhs_inc_acb_segment_count(acb, 1);
>>> +            ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
>>> +                                       offset, (void *)acb, iio_flags);
>>> +            break;
>>> +    case VDISK_AIO_READ:
>>> +            vxhs_inc_acb_segment_count(acb, 1);
>>> +            ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
>>> +                                       offset, (void *)acb, iio_flags);
>>> +            break;
>>> +    default:
>>> +            trace_vxhs_aio_rw_invalid(iodir);
>>> +            goto errout;
>>
>> s->vdisk_aio_count must be decremented before returning.
>>
>>> +static void vxhs_close(BlockDriverState *bs)
>>> +{
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    int i;
>>> +
>>> +    trace_vxhs_close(s->vdisk_guid);
>>> +    close(s->fds[VDISK_FD_READ]);
>>> +    close(s->fds[VDISK_FD_WRITE]);
>>> +
>>> +    /*
>>> +     * Clearing all the event handlers for oflame registered to QEMU
>>> +     */
>>> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>>> +                       false, NULL, NULL, NULL);
>>
>> Please remove the event handler before closing the fd.  I don't think it
>> matters in this case but in other scenarios there could be race
>> conditions if another thread opens an fd and the file descriptor number
>> is reused.
Ashish Mittal Nov. 15, 2016, 10:38 p.m. UTC | #31
Hi Stefan


On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>
> Review of .bdrv_open() and .bdrv_aio_writev() code paths.
>
> The big issues I see in this driver and libqnio:
>
> 1. Showstoppers like broken .bdrv_open() and leaking memory on every
>    reply message.
> 2. Insecure due to missing input validation (network packets and
>    configuration) and incorrect string handling.
> 3. Not fully asynchronous so QEMU and the guest may hang.
>
> Please think about the whole codebase and not just the lines I've
> pointed out in this review when fixing these sorts of issues.  There may
> be similar instances of these bugs elsewhere and it's important that
> they are fixed so that this can be merged.
>
>> +/*
>> + * Structure per vDisk maintained for state
>> + */
>> +typedef struct BDRVVXHSState {
>> +    int                     fds[2];
>> +    int64_t                 vdisk_size;
>> +    int64_t                 vdisk_blocks;
>> +    int64_t                 vdisk_flags;
>> +    int                     vdisk_aio_count;
>> +    int                     event_reader_pos;
>> +    VXHSAIOCB               *qnio_event_acb;
>> +    void                    *qnio_ctx;
>> +    QemuSpin                vdisk_lock; /* Lock to protect BDRVVXHSState */
>> +    QemuSpin                vdisk_acb_lock;  /* Protects ACB */
>
> These comments are insufficient for documenting locking.  Not all fields
> are actually protected by these locks.  Please order fields according to
> lock coverage:
>
> typedef struct VXHSAIOCB {
>     ...
>
>     /* Protected by BDRVVXHSState->vdisk_acb_lock */
>     int                 segments;
>     ...
> };
>
> typedef struct BDRVVXHSState {
>     ...
>
>     /* Protected by vdisk_lock */
>     QemuSpin                vdisk_lock;
>     int                     vdisk_aio_count;
>     QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
>     ...
> }
>
>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
>> +{
>> +    /*
>> +     * Close vDisk device
>> +     */
>> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
>> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
>
> libqnio comment:
> Why does iio_devclose() take an unused cfd argument?  Perhaps it can be
> dropped.
>
>> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
>> +    }
>> +
>> +    /*
>> +     * Close QNIO channel against cached channel-fd
>> +     */
>> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
>> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
>
> libqnio comment:
> Why does iio_devclose() take an int32_t cfd argument but iio_close()
> takes a uint32_t cfd argument?
>
>> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
>> +    }
>> +}
>> +
>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
>> +                              int *rfd, const char *file_name)
>> +{
>> +    /*
>> +     * Open qnio channel to storage agent if not opened before.
>> +     */
>> +    if (*cfd < 0) {
>> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
>
> libqnio comments:
>
> 1.
> There is a buffer overflow in qnio_create_channel().  strncpy() is used
> incorrectly so long hostname or port (both can be 99 characters long)
> will overflow channel->name[] (64 characters) or channel->port[] (8
> characters).
>
>     strncpy(channel->name, hostname, strlen(hostname) + 1);
>     strncpy(channel->port, port, strlen(port) + 1);
>
> The third argument must be the size of the *destination* buffer, not the
> source buffer.  Also note that strncpy() doesn't NUL-terminate the
> destination string so you must do that manually to ensure there is a NUL
> byte at the end of the buffer.
>
> 2.
> channel is leaked in the "Failed to open single connection" error case
> in qnio_create_channel().
>
> 3.
> If host is longer than 63 characters then the ioapi_ctx->channels and
> qnio_ctx->channels maps will use different keys due to string truncation
> in qnio_create_channel().  This means "Channel already exists" in
> qnio_create_channel() and possibly other things will not work as
> expected.
>
>> +        if (*cfd < 0) {
>> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
>> +            return -ENODEV;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Open vdisk device
>> +     */
>> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
>
> libqnio comment:
> Buffer overflow in iio_devopen() since chandev[128] is not large enough
> to hold channel[100] + " " + devpath[arbitrary length] chars:
>
>     sprintf(chandev, "%s %s", channel, devpath);
>
>> +
>> +    if (*rfd < 0) {
>> +        if (*cfd >= 0) {
>
> This check is always true.  Otherwise the return -ENODEV would have been
> taken above.  The if statement isn't necessary.
>
>> +static void vxhs_check_failover_status(int res, void *ctx)
>> +{
>> +    BDRVVXHSState *s = ctx;
>> +
>> +    if (res == 0) {
>> +        /* found failover target */
>> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
>> +        s->vdisk_ask_failover_idx = 0;
>> +        trace_vxhs_check_failover_status(
>> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>> +                   s->vdisk_guid);
>> +        qemu_spin_lock(&s->vdisk_lock);
>> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        vxhs_handle_queued_ios(s);
>> +    } else {
>> +        /* keep looking */
>> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
>> +        s->vdisk_ask_failover_idx++;
>> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
>> +            /* pause and cycle through list again */
>> +            sleep(QNIO_CONNECT_RETRY_SECS);
>
> This code is called from a QEMU thread via vxhs_aio_rw().  It is not
> permitted to call sleep() since it will freeze QEMU and probably the
> guest.
>
> If you need a timer you can use QEMU's timer APIs.  See aio_timer_new(),
> timer_new_ns(), timer_mod(), timer_del(), timer_free().
>
>> +            s->vdisk_ask_failover_idx = 0;
>> +        }
>> +        res = vxhs_switch_storage_agent(s);
>> +    }
>> +}
>> +
>> +static int vxhs_failover_io(BDRVVXHSState *s)
>> +{
>> +    int res = 0;
>> +
>> +    trace_vxhs_failover_io(s->vdisk_guid);
>> +
>> +    s->vdisk_ask_failover_idx = 0;
>> +    res = vxhs_switch_storage_agent(s);
>> +
>> +    return res;
>> +}
>> +
>> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
>> +                       uint32_t error, uint32_t opcode)
>
> This function is doing too much.  Especially the failover code should
> run in the AioContext since it's complex.  Don't do failover here
> because this function is outside the AioContext lock.  Do it from
> AioContext using a QEMUBH like block/rbd.c.
>
>> +static int32_t
>> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
>> +                     uint64_t offset, void *ctx, uint32_t flags)
>> +{
>> +    struct iovec cur;
>> +    uint64_t cur_offset = 0;
>> +    uint64_t cur_write_len = 0;
>> +    int segcount = 0;
>> +    int ret = 0;
>> +    int i, nsio = 0;
>> +    int iovcnt = qiov->niov;
>> +    struct iovec *iov = qiov->iov;
>> +
>> +    errno = 0;
>> +    cur.iov_base = 0;
>> +    cur.iov_len = 0;
>> +
>> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
>
> libqnio comments:
>
> 1.
> There are blocking connect(2) and getaddrinfo(3) calls in iio_writev()
> so this may hang for arbitrary amounts of time.  This is not permitted
> in .bdrv_aio_readv()/.bdrv_aio_writev().  Please make qnio actually
> asynchronous.
>
> 2.
> Where does client_callback() free reply?  It looks like every reply
> message causes a memory leak!
>
> 3.
> Buffer overflow in iio_writev() since device[128] cannot fit the device
> string generated from the vdisk_guid.
>
> 4.
> Buffer overflow in iio_writev() due to
> strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is
> larger than target[64].  Also note the previous comments about strncpy()
> usage.
>
> 5.
> I don't see any endianness handling or portable alignment of struct
> fields in the network protocol code.  Binary network protocols need to
> take care of these issue for portability.  This means libqnio compiled
> for different architectures will not work.  Do you plan to support any
> other architectures besides x86?
>

No, we support only x86 and do not plan to support any other arch.
Please let me know if this necessitates any changes to the configure
script.

> 6.
> The networking code doesn't look robust: kvset uses assert() on input
> from the network so the other side of the connection could cause SIGABRT
> (coredump), the client uses the msg pointer as the cookie for the
> response packet so the server can easily crash the client by sending a
> bogus cookie value, etc.  Even on the client side these things are
> troublesome but on a server they are guaranteed security issues.  I
> didn't look into it deeply.  Please audit the code.
>

By design, our solution on the OpenStack platform uses a closed set of
nodes communicating on dedicated networks. VxHS servers on all the
nodes are on a dedicated network. Clients (qemu) connect to these
only after reading the server IP from the XML (read by libvirt). The
XML cannot be modified without proper access. Therefore, IMO this
problem would be relevant only if someone were to use qnio as a
generic mode of communication/data transfer; for our use case, we
will not run into this problem. Is this explanation acceptable?

Will reply to the other comments in this email that are still unaddressed. Thanks!


>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
>> +                              int *cfd, int *rfd, Error **errp)
>> +{
>> +    QDict *backing_options = NULL;
>> +    QemuOpts *opts, *tcp_opts;
>> +    const char *vxhs_filename;
>> +    char *of_vsa_addr = NULL;
>> +    Error *local_err = NULL;
>> +    const char *vdisk_id_opt;
>> +    char *file_name = NULL;
>> +    size_t num_servers = 0;
>> +    char *str = NULL;
>> +    int ret = 0;
>> +    int i;
>> +
>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>> +    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
>> +    if (vxhs_filename) {
>> +        trace_vxhs_qemu_init_filename(vxhs_filename);
>> +    }
>> +
>> +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
>> +    if (!vdisk_id_opt) {
>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +    s->vdisk_guid = g_strdup(vdisk_id_opt);
>> +    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
>> +
>> +    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
>> +    if (num_servers < 1) {
>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
>> +        ret = -EINVAL;
>> +        goto out;
>> +    } else if (num_servers > VXHS_MAX_HOSTS) {
>> +        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
>> +        error_append_hint(errp, "Maximum %d servers allowed.\n",
>> +                          VXHS_MAX_HOSTS);
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +    trace_vxhs_qemu_init_numservers(num_servers);
>> +
>> +    for (i = 0; i < num_servers; i++) {
>> +        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
>> +        qdict_extract_subqdict(options, &backing_options, str);
>> +
>> +        /* Create opts info from runtime_tcp_opts list */
>> +        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
>> +        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
>> +        if (local_err) {
>> +            qdict_del(backing_options, str);
>
> backing_options is leaked and there's no need to delete the str key.
>
>> +            qemu_opts_del(tcp_opts);
>> +            g_free(str);
>> +            ret = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
>> +                                                            VXHS_OPT_HOST));
>> +        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
>> +                                                                 VXHS_OPT_PORT),
>> +                                                    NULL, 0);
>
> This will segfault if the port option was missing.
>
>> +
>> +        s->vdisk_hostinfo[i].qnio_cfd = -1;
>> +        s->vdisk_hostinfo[i].vdisk_rfd = -1;
>> +        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
>> +                             s->vdisk_hostinfo[i].port);
>
> It's not safe to use the %s format specifier for a trace event with a
> NULL value.  In the case where hostip is NULL this could crash on some
> systems.
>
>> +
>> +        qdict_del(backing_options, str);
>> +        qemu_opts_del(tcp_opts);
>> +        g_free(str);
>> +    }
>
> backing_options is leaked.
>
>> +
>> +    s->vdisk_nhosts = i;
>> +    s->vdisk_cur_host_idx = 0;
>> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
>> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>> +                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
>
> Can we get here with num_servers == 0?  In that case this would access
> uninitialized memory.  I guess num_servers == 0 does not make sense and
> there should be an error case for it.
>
>> +
>> +    /*
>> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
>> +     */
>> +    if (global_qnio_ctx == NULL) {
>> +        global_qnio_ctx = vxhs_setup_qnio();
>
> libqnio comment:
> The client epoll thread should mask all signals (like
> qemu_thread_create()).  Otherwise it may receive signals that it cannot
> deal with.
>
>> +        if (global_qnio_ctx == NULL) {
>> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
>> +            ret = -EINVAL;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
>> +    if (!ret) {
>> +        error_setg(&local_err, "Failed qnio_iio_open");
>> +        ret = -EIO;
>> +    }
>
> The return value of vxhs_qnio_iio_open() is 0 for success or -errno for
> error.
>
> I guess you never ran this code!  The block driver won't even open
> successfully.
>
>> +
>> +out:
>> +    g_free(file_name);
>> +    g_free(of_vsa_addr);
>> +    qemu_opts_del(opts);
>> +
>> +    if (ret < 0) {
>> +        for (i = 0; i < num_servers; i++) {
>> +            g_free(s->vdisk_hostinfo[i].hostip);
>> +        }
>> +        g_free(s->vdisk_guid);
>> +        s->vdisk_guid = NULL;
>> +        errno = -ret;
>
> There is no need to set errno here.  The return value already contains
> the error and the caller doesn't look at errno.
>
>> +    }
>> +    error_propagate(errp, local_err);
>> +
>> +    return ret;
>> +}
>> +
>> +static int vxhs_open(BlockDriverState *bs, QDict *options,
>> +              int bdrv_flags, Error **errp)
>> +{
>> +    BDRVVXHSState *s = bs->opaque;
>> +    AioContext *aio_context;
>> +    int qemu_qnio_cfd = -1;
>> +    int device_opened = 0;
>> +    int qemu_rfd = -1;
>> +    int ret = 0;
>> +    int i;
>> +
>> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
>> +    if (ret < 0) {
>> +        trace_vxhs_open_fail(ret);
>> +        return ret;
>> +    }
>> +
>> +    device_opened = 1;
>> +    s->qnio_ctx = global_qnio_ctx;
>> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
>> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
>> +    s->vdisk_size = 0;
>> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
>> +
>> +    /*
>> +     * Create a pipe for communicating between two threads in different
>> +     * context. Set handler for read event, which gets triggered when
>> +     * IO completion is done by non-QEMU context.
>> +     */
>> +    ret = qemu_pipe(s->fds);
>> +    if (ret < 0) {
>> +        trace_vxhs_open_epipe('.');
>> +        ret = -errno;
>> +        goto errout;
>
> This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc.
> bdrv_close() will not be called so this function must do cleanup itself.
>
>> +    }
>> +    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
>> +
>> +    aio_context = bdrv_get_aio_context(bs);
>> +    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
>> +                       false, vxhs_aio_event_reader, NULL, s);
>> +
>> +    /*
>> +     * Initialize the spin-locks.
>> +     */
>> +    qemu_spin_init(&s->vdisk_lock);
>> +    qemu_spin_init(&s->vdisk_acb_lock);
>> +
>> +    return 0;
>> +
>> +errout:
>> +    /*
>> +     * Close remote vDisk device if it was opened earlier
>> +     */
>> +    if (device_opened) {
>
> This is always true.  The device_opened variable can be removed.
>
>> +/*
>> + * This allocates QEMU-VXHS callback for each IO
>> + * and is passed to QNIO. When QNIO completes the work,
>> + * it will be passed back through the callback.
>> + */
>> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
>> +                                int64_t sector_num, QEMUIOVector *qiov,
>> +                                int nb_sectors,
>> +                                BlockCompletionFunc *cb,
>> +                                void *opaque, int iodir)
>> +{
>> +    VXHSAIOCB *acb = NULL;
>> +    BDRVVXHSState *s = bs->opaque;
>> +    size_t size;
>> +    uint64_t offset;
>> +    int iio_flags = 0;
>> +    int ret = 0;
>> +    void *qnio_ctx = s->qnio_ctx;
>> +    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
>> +
>> +    offset = sector_num * BDRV_SECTOR_SIZE;
>> +    size = nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
>> +    /*
>> +     * Setup or initialize VXHSAIOCB.
>> +     * Every single field should be initialized since
>> +     * acb will be picked up from the slab without
>> +     * initializing with zero.
>> +     */
>> +    acb->io_offset = offset;
>> +    acb->size = size;
>> +    acb->ret = 0;
>> +    acb->flags = 0;
>> +    acb->aio_done = VXHS_IO_INPROGRESS;
>> +    acb->segments = 0;
>> +    acb->buffer = 0;
>> +    acb->qiov = qiov;
>> +    acb->direction = iodir;
>> +
>> +    qemu_spin_lock(&s->vdisk_lock);
>> +    if (OF_VDISK_FAILED(s)) {
>> +        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        goto errout;
>> +    }
>> +    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
>> +        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
>> +        s->vdisk_aio_retry_qd++;
>> +        OF_AIOCB_FLAGS_SET_QUEUED(acb);
>> +        qemu_spin_unlock(&s->vdisk_lock);
>> +        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
>> +        goto out;
>> +    }
>> +    s->vdisk_aio_count++;
>> +    qemu_spin_unlock(&s->vdisk_lock);
>> +
>> +    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
>> +
>> +    switch (iodir) {
>> +    case VDISK_AIO_WRITE:
>> +            vxhs_inc_acb_segment_count(acb, 1);
>> +            ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
>> +                                       offset, (void *)acb, iio_flags);
>> +            break;
>> +    case VDISK_AIO_READ:
>> +            vxhs_inc_acb_segment_count(acb, 1);
>> +            ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
>> +                                       offset, (void *)acb, iio_flags);
>> +            break;
>> +    default:
>> +            trace_vxhs_aio_rw_invalid(iodir);
>> +            goto errout;
>
> s->vdisk_aio_count must be decremented before returning.
>
>> +static void vxhs_close(BlockDriverState *bs)
>> +{
>> +    BDRVVXHSState *s = bs->opaque;
>> +    int i;
>> +
>> +    trace_vxhs_close(s->vdisk_guid);
>> +    close(s->fds[VDISK_FD_READ]);
>> +    close(s->fds[VDISK_FD_WRITE]);
>> +
>> +    /*
>> +     * Clearing all the event handlers for oflame registered to QEMU
>> +     */
>> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>> +                       false, NULL, NULL, NULL);
>
> Please remove the event handler before closing the fd.  I don't think it
> matters in this case but in other scenarios there could be race
> conditions if another thread opens an fd and the file descriptor number
> is reused.
Stefan Hajnoczi Nov. 16, 2016, 8:12 a.m. UTC | #32
On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>> 5.
>> I don't see any endianness handling or portable alignment of struct
>> fields in the network protocol code.  Binary network protocols need to
>> take care of these issues for portability.  This means libqnio compiled
>> for different architectures will not work.  Do you plan to support any
>> other architectures besides x86?
>>
>
> No, we support only x86 and do not plan to support any other arch.
> Please let me know if this necessitates any changes to the configure
> script.

I think no change to ./configure is necessary.  The library will only
ship on x86 so other platforms will never attempt to compile the code.

>> 6.
>> The networking code doesn't look robust: kvset uses assert() on input
>> from the network so the other side of the connection could cause SIGABRT
>> (coredump), the client uses the msg pointer as the cookie for the
>> response packet so the server can easily crash the client by sending a
>> bogus cookie value, etc.  Even on the client side these things are
>> troublesome but on a server they are guaranteed security issues.  I
>> didn't look into it deeply.  Please audit the code.
>>
>
> By design, our solution on OpenStack platform uses a closed set of
> nodes communicating on dedicated networks. VxHS servers on all the
> nodes are on a dedicated network. Clients (qemu) connects to these
> only after reading the server IP from the XML (read by libvirt). The
> XML cannot be modified without proper access. Therefore, IMO this
> problem would be  relevant only if someone were to use qnio as a
> generic mode of communication/data transfer, but for our use-case, we
> will not run into this problem. Is this explanation acceptable?

No.  The trust model is that the guest is untrusted and in the worst
case may gain code execution in QEMU due to security bugs.

You are assuming block/vxhs.c and libqnio are trusted but that
assumption violates the trust model.

In other words:
1. Guest exploits a security hole inside QEMU and gains code execution
on the host.
2. Guest uses VxHS client file descriptor on host to send a malicious
packet to VxHS server.
3. VxHS server is compromised by guest.
4. Compromised VxHS server sends malicious packets to all other
connected clients.
5. All clients have been compromised.

This means both the VxHS client and server must be robust.  They have
to validate inputs to avoid buffer overflows, assertion failures,
infinite loops, etc.

Stefan
Jeff Cody Nov. 18, 2016, 7:26 a.m. UTC | #33
On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote:
> On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote:
> > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> >> 5.
> >> I don't see any endianness handling or portable alignment of struct
> >> fields in the network protocol code.  Binary network protocols need to
> >> take care of these issues for portability.  This means libqnio compiled
> >> for different architectures will not work.  Do you plan to support any
> >> other architectures besides x86?
> >>
> >
> > No, we support only x86 and do not plan to support any other arch.
> > Please let me know if this necessitates any changes to the configure
> > script.
> 
> I think no change to ./configure is necessary.  The library will only
> ship on x86 so other platforms will never attempt to compile the code.
> 
> >> 6.
> >> The networking code doesn't look robust: kvset uses assert() on input
> >> from the network so the other side of the connection could cause SIGABRT
> >> (coredump), the client uses the msg pointer as the cookie for the
> >> response packet so the server can easily crash the client by sending a
> >> bogus cookie value, etc.  Even on the client side these things are
> >> troublesome but on a server they are guaranteed security issues.  I
> >> didn't look into it deeply.  Please audit the code.
> >>
> >
> > By design, our solution on OpenStack platform uses a closed set of
> > nodes communicating on dedicated networks. VxHS servers on all the
> > nodes are on a dedicated network. Clients (qemu) connects to these
> > only after reading the server IP from the XML (read by libvirt). The
> > XML cannot be modified without proper access. Therefore, IMO this
> > problem would be  relevant only if someone were to use qnio as a
> > generic mode of communication/data transfer, but for our use-case, we
> > will not run into this problem. Is this explanation acceptable?
> 
> No.  The trust model is that the guest is untrusted and in the worst
> case may gain code execution in QEMU due to security bugs.
> 
> You are assuming block/vxhs.c and libqnio are trusted but that
> assumption violates the trust model.
> 
> In other words:
> 1. Guest exploits a security hole inside QEMU and gains code execution
> on the host.
> 2. Guest uses VxHS client file descriptor on host to send a malicious
> packet to VxHS server.
> 3. VxHS server is compromised by guest.
> 4. Compromised VxHS server sends malicious packets to all other
> connected clients.
> 5. All clients have been compromised.
> 
> This means both the VxHS client and server must be robust.  They have
> to validate inputs to avoid buffer overflows, assertion failures,
> infinite loops, etc.
> 
> Stefan


The libqnio code is important with respect to the VxHS driver.  It is a bit
different than other existing external protocol drivers, in that the current
user and developer base is small, and the code itself is pretty new.  So I
think for the VxHS driver here upstream, we really do need to get some of
the libqnio issues squared away.  I don't know if we've ever explicitly
addressed the extent to which libqnio issues affect merging the driver,
so I figure it is probably worth discussing here.

To try and consolidate libqnio discussion, here is what I think I've read /
seen from others as the major issues that should be addressed in libqnio:

* Code auditing, static analysis, and general code cleanup.  Things like
  memory leaks shouldn't be happening, and some prior libqnio compiler
  warnings imply that there is more code analysis that should be done with
  libqnio.

  (With regards to memory leaks:  Valgrind may be useful to track these down:

    # valgrind  ./qemu-io -c 'write -pP 0xae 66000 128k' \
            vxhs://localhost/test.raw

    ==30369== LEAK SUMMARY:
    ==30369==    definitely lost: 4,168 bytes in 2 blocks
    ==30369==    indirectly lost: 1,207,720 bytes in 58,085 blocks) 

* Potential security issues such as buffer overruns, input validation, etc., 
  need to be audited.

* Async operations need to be truly asynchronous, without blocking calls.

* Daniel pointed out that there is no authentication method for talking to a
  remote server.  This seems a bit scary.  Maybe all that is needed here is
  some clarification of the security scheme for authentication?  My
  impression from above is that you are relying on the networks being
  private to provide some sort of implicit authentication, though, and this
  seems fragile (and doesn't protect against a compromised guest or other
  process on the server, for one).

(if I've missed anything, please add it here!)

-Jeff
Daniel P. Berrangé Nov. 18, 2016, 8:57 a.m. UTC | #34
On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
> On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote:
> > On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote:
> > > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> > >> 5.
> > >> I don't see any endianness handling or portable alignment of struct
> > >> fields in the network protocol code.  Binary network protocols need to
> > >> take care of these issue for portability.  This means libqnio compiled
> > >> for different architectures will not work.  Do you plan to support any
> > >> other architectures besides x86?
> > >>
> > >
> > > No, we support only x86 and do not plan to support any other arch.
> > > Please let me know if this necessitates any changes to the configure
> > > script.
> > 
> > I think no change to ./configure is necessary.  The library will only
> > ship on x86 so other platforms will never attempt to compile the code.
> > 
> > >> 6.
> > >> The networking code doesn't look robust: kvset uses assert() on input
> > >> from the network so the other side of the connection could cause SIGABRT
> > >> (coredump), the client uses the msg pointer as the cookie for the
> > >> response packet so the server can easily crash the client by sending a
> > >> bogus cookie value, etc.  Even on the client side these things are
> > >> troublesome but on a server they are guaranteed security issues.  I
> > >> didn't look into it deeply.  Please audit the code.
> > >>
> > >
> > > By design, our solution on OpenStack platform uses a closed set of
> > > nodes communicating on dedicated networks. VxHS servers on all the
> > > nodes are on a dedicated network. Clients (qemu) connects to these
> > > only after reading the server IP from the XML (read by libvirt). The
> > > XML cannot be modified without proper access. Therefore, IMO this
> > > problem would be  relevant only if someone were to use qnio as a
> > > generic mode of communication/data transfer, but for our use-case, we
> > > will not run into this problem. Is this explanation acceptable?
> > 
> > No.  The trust model is that the guest is untrusted and in the worst
> > case may gain code execution in QEMU due to security bugs.
> > 
> > You are assuming block/vxhs.c and libqnio are trusted but that
> > assumption violates the trust model.
> > 
> > In other words:
> > 1. Guest exploits a security hole inside QEMU and gains code execution
> > on the host.
> > 2. Guest uses VxHS client file descriptor on host to send a malicious
> > packet to VxHS server.
> > 3. VxHS server is compromised by guest.
> > 4. Compromised VxHS server sends malicious packets to all other
> > connected clients.
> > 5. All clients have been compromised.
> > 
> > This means both the VxHS client and server must be robust.  They have
> > to validate inputs to avoid buffer overflows, assertion failures,
> > infinite loops, etc.
> > 
> > Stefan
> 
> 
> The libqnio code is important with respect to the VxHS driver.  It is a bit
> different than other existing external protocol drivers, in that the current
> user and developer base is small, and the code itself is pretty new.  So I
> think for the VxHS driver here upstream, we really do need to get some of
> the libqnio issues squared away.  I don't know if we've ever explicitly
> address the extent to which libqnio issues affect the driver
> merging, so I figure it is probably worth discussing here.
> 
> To try and consolidate libqnio discussion, here is what I think I've read /
> seen from others as the major issues that should be addressed in libqnio:
> 
> * Code auditing, static analysis, and general code cleanup.  Things like
>   memory leaks shouldn't be happening, and some prior libqnio compiler
>   warnings imply that there is more code analysis that should be done with
>   libqnio.
> 
>   (With regards to memory leaks:  Valgrind may be useful to track these down:
> 
>     # valgrind  ./qemu-io -c 'write -pP 0xae 66000 128k' \
>             vxhs://localhost/test.raw
> 
>     ==30369== LEAK SUMMARY:
>     ==30369==    definitely lost: 4,168 bytes in 2 blocks
>     ==30369==    indirectly lost: 1,207,720 bytes in 58,085 blocks) 
> 
> * Potential security issues such as buffer overruns, input validation, etc., 
>   need to be audited.
> 
> * Async operations need to be truly asynchronous, without blocking calls.
> 
> * Daniel pointed out that there is no authentication method for taking to a
>   remote server.  This seems a bit scary.  Maybe all that is needed here is
>   some clarification of the security scheme for authentication?  My
>   impression from above is that you are relying on the networks being
>   private to provide some sort of implicit authentication, though, and this
>   seems fragile (and doesn't protect against a compromised guest or other
>   process on the server, for one).

While relying on some kind of private network may have been acceptable
10 years ago, I don't think it is a credible authentication / security
strategy in the current (increasingly) hostile network environments. You
really have to assume as a starting position that even internal networks
are compromised these days. 

Regards,
Daniel
Stefan Hajnoczi Nov. 18, 2016, 10:02 a.m. UTC | #35
On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
> * Daniel pointed out that there is no authentication method for taking to a
>   remote server.  This seems a bit scary.  Maybe all that is needed here is
>   some clarification of the security scheme for authentication?  My
>   impression from above is that you are relying on the networks being
>   private to provide some sort of implicit authentication, though, and this
>   seems fragile (and doesn't protect against a compromised guest or other
>   process on the server, for one).

Exactly, from the QEMU trust model you must assume that QEMU has been
compromised by the guest.  The escaped guest can connect to the VxHS
server since it controls the QEMU process.

An escaped guest must not have access to other guests' volumes.
Therefore authentication is necessary.

By the way, QEMU has a secrets API for providing passwords and other
sensitive data without passing them on the command-line.  The
command-line is vulnerable to snooping by other processes so using this
API is mandatory.  Please see include/crypto/secret.h.
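
For illustration, a minimal sketch (not from this patch; the "vxhs-sec0" id and
the function name are hypothetical) of how a block driver could consume such a
secret through that API:

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "crypto/secret.h"

    /* Resolve a password defined with -object secret,id=vxhs-sec0,... */
    static char *vxhs_get_password(const char *secret_id, Error **errp)
    {
        /* Returns a heap-allocated, NUL-terminated copy of the secret data,
         * or NULL with errp set if no such secret object exists. */
        return qcrypto_secret_lookup_as_utf8(secret_id, errp);
    }

The secret data itself is then supplied out of band, e.g. with
-object secret,id=vxhs-sec0,file=/path/to/password.txt, rather than as a
plain-text command-line option.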

Stefan
Ketan Nilangekar Nov. 18, 2016, 10:34 a.m. UTC | #36
On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote:

>On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote:

>> On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote:

>> > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:

>> >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:

>> >> 5.

>> >> I don't see any endianness handling or portable alignment of struct

>> >> fields in the network protocol code.  Binary network protocols need to

>> >> take care of these issue for portability.  This means libqnio compiled

>> >> for different architectures will not work.  Do you plan to support any

>> >> other architectures besides x86?

>> >>

>> >

>> > No, we support only x86 and do not plan to support any other arch.

>> > Please let me know if this necessitates any changes to the configure

>> > script.

>> 

>> I think no change to ./configure is necessary.  The library will only

>> ship on x86 so other platforms will never attempt to compile the code.

>> 

>> >> 6.

>> >> The networking code doesn't look robust: kvset uses assert() on input

>> >> from the network so the other side of the connection could cause SIGABRT

>> >> (coredump), the client uses the msg pointer as the cookie for the

>> >> response packet so the server can easily crash the client by sending a

>> >> bogus cookie value, etc.  Even on the client side these things are

>> >> troublesome but on a server they are guaranteed security issues.  I

>> >> didn't look into it deeply.  Please audit the code.

>> >>

>> >

>> > By design, our solution on OpenStack platform uses a closed set of

>> > nodes communicating on dedicated networks. VxHS servers on all the

>> > nodes are on a dedicated network. Clients (qemu) connects to these

>> > only after reading the server IP from the XML (read by libvirt). The

>> > XML cannot be modified without proper access. Therefore, IMO this

>> > problem would be  relevant only if someone were to use qnio as a

>> > generic mode of communication/data transfer, but for our use-case, we

>> > will not run into this problem. Is this explanation acceptable?

>> 

>> No.  The trust model is that the guest is untrusted and in the worst

>> case may gain code execution in QEMU due to security bugs.

>> 

>> You are assuming block/vxhs.c and libqnio are trusted but that

>> assumption violates the trust model.

>> 

>> In other words:

>> 1. Guest exploits a security hole inside QEMU and gains code execution

>> on the host.

>> 2. Guest uses VxHS client file descriptor on host to send a malicious

>> packet to VxHS server.

>> 3. VxHS server is compromised by guest.

>> 4. Compromised VxHS server sends malicious packets to all other

>> connected clients.

>> 5. All clients have been compromised.

>> 

>> This means both the VxHS client and server must be robust.  They have

>> to validate inputs to avoid buffer overflows, assertion failures,

>> infinite loops, etc.

>> 

>> Stefan

>

>

>The libqnio code is important with respect to the VxHS driver.  It is a bit

>different than other existing external protocol drivers, in that the current

>user and developer base is small, and the code itself is pretty new.  So I

>think for the VxHS driver here upstream, we really do need to get some of

>the libqnio issues squared away.  I don't know if we've ever explicitly

>address the extent to which libqnio issues affect the driver

>merging, so I figure it is probably worth discussing here.

>

>To try and consolidate libqnio discussion, here is what I think I've read /

>seen from others as the major issues that should be addressed in libqnio:

>

>* Code auditing, static analysis, and general code cleanup.  Things like

>  memory leaks shouldn't be happening, and some prior libqnio compiler

>  warnings imply that there is more code analysis that should be done with

>  libqnio.

>

>  (With regards to memory leaks:  Valgrind may be useful to track these down:

>

>    # valgrind  ./qemu-io -c 'write -pP 0xae 66000 128k' \

>            vxhs://localhost/test.raw

>

>    ==30369== LEAK SUMMARY:

>    ==30369==    definitely lost: 4,168 bytes in 2 blocks

>    ==30369==    indirectly lost: 1,207,720 bytes in 58,085 blocks) 


We have done and are doing exhaustive memory leak tests using valgrind. Memory leaks within qnio have been addressed to some extent. We will post detailed valgrind results to this thread.

>

>* Potential security issues such as buffer overruns, input validation, etc., 

>  need to be audited.


We have learned of a few such issues from previous comments and have addressed some of those. If there are any important outstanding ones, please let us know and we will fix them on priority.

>

>* Async operations need to be truly asynchronous, without blocking calls.


There is only one blocking call, in the libqnio reconnect path, which has been pointed out. We will fix this soon.

>

>* Daniel pointed out that there is no authentication method for taking to a

>  remote server.  This seems a bit scary.  Maybe all that is needed here is

>  some clarification of the security scheme for authentication?  My

>  impression from above is that you are relying on the networks being

>  private to provide some sort of implicit authentication, though, and this

>  seems fragile (and doesn't protect against a compromised guest or other

>  process on the server, for one).


Our auth scheme is based on network isolation at the L2/L3 level. If there is a simplified authentication mechanism which we can implement without imposing significant penalties on IO performance, please let us know and we will implement it if feasible.

>

>(if I've missed anything, please add it here!)

>

>-Jeff
Ketan Nilangekar Nov. 18, 2016, 10:57 a.m. UTC | #37
On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:

>> * Daniel pointed out that there is no authentication method for taking to a

>>   remote server.  This seems a bit scary.  Maybe all that is needed here is

>>   some clarification of the security scheme for authentication?  My

>>   impression from above is that you are relying on the networks being

>>   private to provide some sort of implicit authentication, though, and this

>>   seems fragile (and doesn't protect against a compromised guest or other

>>   process on the server, for one).

>

>Exactly, from the QEMU trust model you must assume that QEMU has been

>compromised by the guest.  The escaped guest can connect to the VxHS

>server since it controls the QEMU process.

>

>An escaped guest must not have access to other guests' volumes.

>Therefore authentication is necessary.

>

>By the way, QEMU has a secrets API for providing passwords and other

>sensistive data without passing them on the command-line.  The

>command-line is vulnerable to snooping by other processes so using this

>API is mandatory.  Please see include/crypto/secret.h.


Stefan, do you have any details on the authentication implemented by qemu-nbd as part of the nbd protocol?

>

>Stefan
Daniel P. Berrangé Nov. 18, 2016, 11:03 a.m. UTC | #38
On Fri, Nov 18, 2016 at 10:57:00AM +0000, Ketan Nilangekar wrote:
> 
> 
> 
> 
> 
> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> 
> >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
> >> * Daniel pointed out that there is no authentication method for taking to a
> >>   remote server.  This seems a bit scary.  Maybe all that is needed here is
> >>   some clarification of the security scheme for authentication?  My
> >>   impression from above is that you are relying on the networks being
> >>   private to provide some sort of implicit authentication, though, and this
> >>   seems fragile (and doesn't protect against a compromised guest or other
> >>   process on the server, for one).
> >
> >Exactly, from the QEMU trust model you must assume that QEMU has been
> >compromised by the guest.  The escaped guest can connect to the VxHS
> >server since it controls the QEMU process.
> >
> >An escaped guest must not have access to other guests' volumes.
> >Therefore authentication is necessary.
> >
> >By the way, QEMU has a secrets API for providing passwords and other
> >sensistive data without passing them on the command-line.  The
> >command-line is vulnerable to snooping by other processes so using this
> >API is mandatory.  Please see include/crypto/secret.h.
> 
> Stefan, do you have any details on the authentication implemented by
> qemu-nbd as part of the nbd protocol?

Historically the NBD protocol has zero authentication or encryption
facilities.

I recently added support for running TLS over the NBD channel. When
doing this, the server is able to request an x509 certificate from
the client and use the distinguished name from the cert as an identity
against which to control access.

The glusterfs protocol takes a similar approach of using TLS and x509
certs to provide identities for access control.

The Ceph/RBD protocol uses an explicit username+password pair.

NFS these days can use Kerberos.

My recommendation would be either TLS with x509 certs, or to integrate
with SASL, which is a pluggable authentication framework, or better yet
to support both.  This is what we do for VNC & SPICE auth, as well as
for libvirt auth.

Regards,
Daniel
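
(For reference, a sketch of how the client side of TLS could be wired up in the
driver, modelled loosely on the QEMU NBD client's lookup of a tls-creds-x509
object; the vxhs_* function name here is hypothetical:

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "qom/object.h"
    #include "crypto/tlscreds.h"

    static QCryptoTLSCreds *vxhs_get_tls_creds(const char *id, Error **errp)
    {
        Object *obj;
        QCryptoTLSCreds *creds;

        obj = object_resolve_path_component(object_get_objects_root(), id);
        if (!obj) {
            error_setg(errp, "No TLS credentials with id '%s'", id);
            return NULL;
        }

        creds = (QCryptoTLSCreds *)
            object_dynamic_cast(obj, TYPE_QCRYPTO_TLS_CREDS);
        if (!creds) {
            error_setg(errp, "Object with id '%s' is not TLS credentials", id);
            return NULL;
        }

        /* The block driver acts as a client, so require client-endpoint creds */
        if (creds->endpoint != QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT) {
            error_setg(errp, "Expecting TLS credentials with a client endpoint");
            return NULL;
        }

        object_ref(obj);
        return creds;
    }

The credentials object itself would be defined separately, e.g. with
-object tls-creds-x509,id=tls0,dir=/etc/pki/qemu,endpoint=client, and the
actual handshake would go through the qcrypto_tls_session_* helpers.)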
Ketan Nilangekar Nov. 18, 2016, 11:36 a.m. UTC | #39
On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:

>> * Daniel pointed out that there is no authentication method for taking to a

>>   remote server.  This seems a bit scary.  Maybe all that is needed here is

>>   some clarification of the security scheme for authentication?  My

>>   impression from above is that you are relying on the networks being

>>   private to provide some sort of implicit authentication, though, and this

>>   seems fragile (and doesn't protect against a compromised guest or other

>>   process on the server, for one).

>

>Exactly, from the QEMU trust model you must assume that QEMU has been

>compromised by the guest.  The escaped guest can connect to the VxHS

>server since it controls the QEMU process.

>

>An escaped guest must not have access to other guests' volumes.

>Therefore authentication is necessary.


Just so I am clear on this, how will such an escaped guest get to know the other guests' vdisk IDs?

>

>By the way, QEMU has a secrets API for providing passwords and other

>sensistive data without passing them on the command-line.  The

>command-line is vulnerable to snooping by other processes so using this

>API is mandatory.  Please see include/crypto/secret.h.

>

>Stefan
Daniel P. Berrangé Nov. 18, 2016, 11:54 a.m. UTC | #40
On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:
> 
> 
> 
> 
> 
> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> 
> >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
> >> * Daniel pointed out that there is no authentication method for taking to a
> >>   remote server.  This seems a bit scary.  Maybe all that is needed here is
> >>   some clarification of the security scheme for authentication?  My
> >>   impression from above is that you are relying on the networks being
> >>   private to provide some sort of implicit authentication, though, and this
> >>   seems fragile (and doesn't protect against a compromised guest or other
> >>   process on the server, for one).
> >
> >Exactly, from the QEMU trust model you must assume that QEMU has been
> >compromised by the guest.  The escaped guest can connect to the VxHS
> >server since it controls the QEMU process.
> >
> >An escaped guest must not have access to other guests' volumes.
> >Therefore authentication is necessary.
> 
> Just so I am clear on this, how will such an escaped guest get to know
> the other guest vdisk IDs?

There can be multiple approaches depending on the deployment scenario.
At the very simplest it could directly read the IDs out of the libvirt
XML files in /var/run/libvirt. Or it can run "ps" to list other running
QEMU processes and see the vdisk IDs in the command line args of those
processes. Or the mgmt app may be creating vdisk IDs based on some
particular scheme, and the attacker may have info about this which lets
them determine likely IDs.  Or the QEMU may have previously been
permitted to use the disk and remembered the ID for use later
after access to the disk has been removed.

IOW, you can't rely on security-through-obscurity of the vdisk IDs

Regards,
Daniel
Ketan Nilangekar Nov. 18, 2016, 1:25 p.m. UTC | #41
> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
> 
>> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:
>> 
>> 
>> 
>> 
>> 
>>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>> 
>>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
>>>> * Daniel pointed out that there is no authentication method for taking to a
>>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is
>>>>  some clarification of the security scheme for authentication?  My
>>>>  impression from above is that you are relying on the networks being
>>>>  private to provide some sort of implicit authentication, though, and this
>>>>  seems fragile (and doesn't protect against a compromised guest or other
>>>>  process on the server, for one).
>>> 
>>> Exactly, from the QEMU trust model you must assume that QEMU has been
>>> compromised by the guest.  The escaped guest can connect to the VxHS
>>> server since it controls the QEMU process.
>>> 
>>> An escaped guest must not have access to other guests' volumes.
>>> Therefore authentication is necessary.
>> 
>> Just so I am clear on this, how will such an escaped guest get to know
>> the other guest vdisk IDs?
> 
> There can be a multiple approaches depending on the deployment scenario.
> At the very simplest it could directly read the IDs out of the libvirt
> XML files in /var/run/libvirt. Or it can rnu "ps" to list other running
> QEMU processes and see the vdisk IDs in the command line args of those
> processes. Or the mgmt app may be creating vdisk IDs based on some
> particular scheme, and the attacker may have info about this which lets
> them determine likely IDs.  Or the QEMU may have previously been
> permitted to the use the disk and remembered the ID for use later
> after access to the disk has been removed.
> 

Are we talking about a compromised guest here or a compromised hypervisor? How will a compromised guest read the XML file or list running qemu processes?

> IOW, you can't rely on security-through-obscurity of the vdisk IDs
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
Daniel P. Berrangé Nov. 18, 2016, 1:36 p.m. UTC | #42
On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote:
> 
> 
> > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
> > 
> >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:
> >> 
> >> 
> >> 
> >> 
> >> 
> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> >>> 
> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
> >>>> * Daniel pointed out that there is no authentication method for taking to a
> >>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is
> >>>>  some clarification of the security scheme for authentication?  My
> >>>>  impression from above is that you are relying on the networks being
> >>>>  private to provide some sort of implicit authentication, though, and this
> >>>>  seems fragile (and doesn't protect against a compromised guest or other
> >>>>  process on the server, for one).
> >>> 
> >>> Exactly, from the QEMU trust model you must assume that QEMU has been
> >>> compromised by the guest.  The escaped guest can connect to the VxHS
> >>> server since it controls the QEMU process.
> >>> 
> >>> An escaped guest must not have access to other guests' volumes.
> >>> Therefore authentication is necessary.
> >> 
> >> Just so I am clear on this, how will such an escaped guest get to know
> >> the other guest vdisk IDs?
> > 
> > There can be a multiple approaches depending on the deployment scenario.
> > At the very simplest it could directly read the IDs out of the libvirt
> > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running
> > QEMU processes and see the vdisk IDs in the command line args of those
> > processes. Or the mgmt app may be creating vdisk IDs based on some
> > particular scheme, and the attacker may have info about this which lets
> > them determine likely IDs.  Or the QEMU may have previously been
> > permitted to the use the disk and remembered the ID for use later
> > after access to the disk has been removed.
> > 
> 
> Are we talking about a compromised guest here or compromised hypervisor?
> How will a compromised guest read the xml file or list running qemu
> processes?

Compromised QEMU process, aka hypervisor userspace


Regards,
Daniel
Markus Armbruster Nov. 18, 2016, 2:49 p.m. UTC | #43
Ketan Nilangekar <Ketan.Nilangekar@veritas.com> writes:

> On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote:
[...]
>>* Daniel pointed out that there is no authentication method for taking to a
>>  remote server.  This seems a bit scary.  Maybe all that is needed here is
>>  some clarification of the security scheme for authentication?  My
>>  impression from above is that you are relying on the networks being
>>  private to provide some sort of implicit authentication, though, and this
>>  seems fragile (and doesn't protect against a compromised guest or other
>>  process on the server, for one).
>
> Our auth scheme is based on network isolation at L2/L3 level.

Stefan already explained the trust model.  Since understanding it is
crucial to security work, let me use the opportunity to explain it once
more.

The guest is untrusted.  It interacts only with QEMU and, if enabled,
KVM.  KVM has a relatively small attack surface, but if the guest
penetrates it, game's over.  There's nothing we can do to mitigate.
QEMU has a much larger attack surface, but we *can* do something to
mitigate a compromise: nothing on the host trusts QEMU.  Second line of
defense.

A line of defense is as strong as its weakest point.  Adding an
interface between QEMU and the host that requires the host to trust QEMU
basically destroys the second line of defense.  That's a big deal.

You might argue that you don't require "the host" to trust, but only
your daemon (or whatever it is your driver talks to).  But that puts
that daemon in the same security domain as QEMU itself, i.e. it should
not be trusted by anything else on the host.  Now you have a second
problem.

If you rely on "network isolation at L2/L3 level", chances are
*everything* on this isolated network joins QEMU's security domain.  You
almost certainly need a separate isolated network per guest to have a
chance at being credible.  Even then, I'd rather not bet my own money on
it.

It's better to stick to the common trust model, and have *nothing* on
the host trust QEMU.

> If there is a simplified authentication mechanism which we can implement without imposing significant penalties on IO performance, please let us know and we will implement that if feasible.

Daniel already listed available mechanisms.
Jeff Cody Nov. 18, 2016, 4:19 p.m. UTC | #44
On Fri, Nov 18, 2016 at 10:34:54AM +0000, Ketan Nilangekar wrote:
> 
> 
> 
> 
> 
> On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote:
> 
> >On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote:
> >> On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote:
> >> > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> >> >> 5.
> >> >> I don't see any endianness handling or portable alignment of struct
> >> >> fields in the network protocol code.  Binary network protocols need to
> >> >> take care of these issue for portability.  This means libqnio compiled
> >> >> for different architectures will not work.  Do you plan to support any
> >> >> other architectures besides x86?
> >> >>
> >> >
> >> > No, we support only x86 and do not plan to support any other arch.
> >> > Please let me know if this necessitates any changes to the configure
> >> > script.
> >> 
> >> I think no change to ./configure is necessary.  The library will only
> >> ship on x86 so other platforms will never attempt to compile the code.
> >> 
> >> >> 6.
> >> >> The networking code doesn't look robust: kvset uses assert() on input
> >> >> from the network so the other side of the connection could cause SIGABRT
> >> >> (coredump), the client uses the msg pointer as the cookie for the
> >> >> response packet so the server can easily crash the client by sending a
> >> >> bogus cookie value, etc.  Even on the client side these things are
> >> >> troublesome but on a server they are guaranteed security issues.  I
> >> >> didn't look into it deeply.  Please audit the code.
> >> >>
> >> >
> >> > By design, our solution on OpenStack platform uses a closed set of
> >> > nodes communicating on dedicated networks. VxHS servers on all the
> >> > nodes are on a dedicated network. Clients (qemu) connects to these
> >> > only after reading the server IP from the XML (read by libvirt). The
> >> > XML cannot be modified without proper access. Therefore, IMO this
> >> > problem would be  relevant only if someone were to use qnio as a
> >> > generic mode of communication/data transfer, but for our use-case, we
> >> > will not run into this problem. Is this explanation acceptable?
> >> 
> >> No.  The trust model is that the guest is untrusted and in the worst
> >> case may gain code execution in QEMU due to security bugs.
> >> 
> >> You are assuming block/vxhs.c and libqnio are trusted but that
> >> assumption violates the trust model.
> >> 
> >> In other words:
> >> 1. Guest exploits a security hole inside QEMU and gains code execution
> >> on the host.
> >> 2. Guest uses VxHS client file descriptor on host to send a malicious
> >> packet to VxHS server.
> >> 3. VxHS server is compromised by guest.
> >> 4. Compromised VxHS server sends malicious packets to all other
> >> connected clients.
> >> 5. All clients have been compromised.
> >> 
> >> This means both the VxHS client and server must be robust.  They have
> >> to validate inputs to avoid buffer overflows, assertion failures,
> >> infinite loops, etc.
> >> 
> >> Stefan
> >
> >
> >The libqnio code is important with respect to the VxHS driver.  It is a bit
> >different than other existing external protocol drivers, in that the current
> >user and developer base is small, and the code itself is pretty new.  So I
> >think for the VxHS driver here upstream, we really do need to get some of
> >the libqnio issues squared away.  I don't know if we've ever explicitly
> >address the extent to which libqnio issues affect the driver
> >merging, so I figure it is probably worth discussing here.
> >
> >To try and consolidate libqnio discussion, here is what I think I've read /
> >seen from others as the major issues that should be addressed in libqnio:
> >
> >* Code auditing, static analysis, and general code cleanup.  Things like
> >  memory leaks shouldn't be happening, and some prior libqnio compiler
> >  warnings imply that there is more code analysis that should be done with
> >  libqnio.
> >
> >  (With regards to memory leaks:  Valgrind may be useful to track these down:
> >
> >    # valgrind  ./qemu-io -c 'write -pP 0xae 66000 128k' \
> >            vxhs://localhost/test.raw
> >
> >    ==30369== LEAK SUMMARY:
> >    ==30369==    definitely lost: 4,168 bytes in 2 blocks
> >    ==30369==    indirectly lost: 1,207,720 bytes in 58,085 blocks) 
> 
> We have done and are doing exhaustive memory leak tests using valgrind.
> Memory leaks within qnio have been addressed to some extent. We will post
> detailed valgrind results to this thread.
>

That is good to hear.  I ran the above on the latest HEAD from the qnio
github repo, so I look forward to checking out the latest code once it is
available.

> >
> >* Potential security issues such as buffer overruns, input validation, etc., 
> >  need to be audited.
> 
> We have known a few such issues from previous comments and have addressed
> some of those. If there are any important outstanding ones, please let us
> know and we will fix them on priority.
>

One concern is that the issues noted are not from an exhaustive review on
Stefan's part, AFAIK.  When Stefan called for auditing the code, that is
really a call to look for other potential security flaws as well, using the
perspective he outlined on the trust model.


> >
> >* Async operations need to be truly asynchronous, without blocking calls.
> 
> There is only one blocking call in libqnio reconnect which we has been
> pointed out. We will fix this soon.
>

Great, thanks!


> >
> >* Daniel pointed out that there is no authentication method for taking to a
> >  remote server.  This seems a bit scary.  Maybe all that is needed here is
> >  some clarification of the security scheme for authentication?  My
> >  impression from above is that you are relying on the networks being
> >  private to provide some sort of implicit authentication, though, and this
> >  seems fragile (and doesn't protect against a compromised guest or other
> >  process on the server, for one).
> 
> Our auth scheme is based on network isolation at L2/L3 level. If there is
> a simplified authentication mechanism which we can implement without
> imposing significant penalties on IO performance, please let us know and
> we will implement that if feasible.
> 
> >
> >(if I've missed anything, please add it here!)
> >
> >-Jeff
Ketan Nilangekar Nov. 23, 2016, 8:25 a.m. UTC | #45
+Nitin Jerath from Veritas.




On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote:

>On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote:

>> 

>> 

>> > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:

>> > 

>> >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:

>> >> 

>> >> 

>> >> 

>> >> 

>> >> 

>> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>> >>> 

>> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:

>> >>>> * Daniel pointed out that there is no authentication method for taking to a

>> >>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is

>> >>>>  some clarification of the security scheme for authentication?  My

>> >>>>  impression from above is that you are relying on the networks being

>> >>>>  private to provide some sort of implicit authentication, though, and this

>> >>>>  seems fragile (and doesn't protect against a compromised guest or other

>> >>>>  process on the server, for one).

>> >>> 

>> >>> Exactly, from the QEMU trust model you must assume that QEMU has been

>> >>> compromised by the guest.  The escaped guest can connect to the VxHS

>> >>> server since it controls the QEMU process.

>> >>> 

>> >>> An escaped guest must not have access to other guests' volumes.

>> >>> Therefore authentication is necessary.

>> >> 

>> >> Just so I am clear on this, how will such an escaped guest get to know

>> >> the other guest vdisk IDs?

>> > 

>> > There can be a multiple approaches depending on the deployment scenario.

>> > At the very simplest it could directly read the IDs out of the libvirt

>> > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running

>> > QEMU processes and see the vdisk IDs in the command line args of those

>> > processes. Or the mgmt app may be creating vdisk IDs based on some

>> > particular scheme, and the attacker may have info about this which lets

>> > them determine likely IDs.  Or the QEMU may have previously been

>> > permitted to the use the disk and remembered the ID for use later

>> > after access to the disk has been removed.

>> > 

>> 

>> Are we talking about a compromised guest here or compromised hypervisor?

>> How will a compromised guest read the xml file or list running qemu

>> processes?

>

>Compromised QEMU process, aka hypervisor userspace

>

>

>Regards,

>Daniel

>-- 

>|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|

>|: http://libvirt.org              -o-             http://virt-manager.org :|

>|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
Ashish Mittal Nov. 23, 2016, 10:09 p.m. UTC | #46
On the topic of protocol security -

Would it be enough for the first patch to implement only
authentication and not encryption?

On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar
<Ketan.Nilangekar@veritas.com> wrote:
> +Nitin Jerath from Veritas.
>
>
>
>
> On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote:
>
>>On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote:
>>>
>>>
>>> > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
>>> >
>>> >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>> >>>
>>> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
>>> >>>> * Daniel pointed out that there is no authentication method for taking to a
>>> >>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is
>>> >>>>  some clarification of the security scheme for authentication?  My
>>> >>>>  impression from above is that you are relying on the networks being
>>> >>>>  private to provide some sort of implicit authentication, though, and this
>>> >>>>  seems fragile (and doesn't protect against a compromised guest or other
>>> >>>>  process on the server, for one).
>>> >>>
>>> >>> Exactly, from the QEMU trust model you must assume that QEMU has been
>>> >>> compromised by the guest.  The escaped guest can connect to the VxHS
>>> >>> server since it controls the QEMU process.
>>> >>>
>>> >>> An escaped guest must not have access to other guests' volumes.
>>> >>> Therefore authentication is necessary.
>>> >>
>>> >> Just so I am clear on this, how will such an escaped guest get to know
>>> >> the other guest vdisk IDs?
>>> >
>>> > There can be a multiple approaches depending on the deployment scenario.
>>> > At the very simplest it could directly read the IDs out of the libvirt
>>> > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running
>>> > QEMU processes and see the vdisk IDs in the command line args of those
>>> > processes. Or the mgmt app may be creating vdisk IDs based on some
>>> > particular scheme, and the attacker may have info about this which lets
>>> > them determine likely IDs.  Or the QEMU may have previously been
>>> > permitted to the use the disk and remembered the ID for use later
>>> > after access to the disk has been removed.
>>> >
>>>
>>> Are we talking about a compromised guest here or compromised hypervisor?
>>> How will a compromised guest read the xml file or list running qemu
>>> processes?
>>
>>Compromised QEMU process, aka hypervisor userspace
>>
>>
>>Regards,
>>Daniel
>>--
>>|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
>>|: http://libvirt.org              -o-             http://virt-manager.org :|
>>|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
Paolo Bonzini Nov. 23, 2016, 10:37 p.m. UTC | #47
On 23/11/2016 23:09, ashish mittal wrote:
> On the topic of protocol security -
> 
> Would it be enough for the first patch to implement only
> authentication and not encryption?

Yes, of course.  However, as we introduce more and more QEMU-specific
characteristics to a protocol that is already QEMU-specific (it doesn't
do failover, etc.), I am still not sure of the actual benefit of using
libqnio versus having an NBD server or FUSE driver.

You have already mentioned performance, but the design has changed so
much that I think one of the two things has to change: either failover
moves back to QEMU and there is no (closed source) translator running on
the node, or the translator needs to speak a well-known and
already-supported protocol.

Paolo

> On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar
> <Ketan.Nilangekar@veritas.com> wrote:
>> +Nitin Jerath from Veritas.
>>
>>
>>
>>
>> On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote:
>>
>>> On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote:
>>>>
>>>>
>>>>> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
>>>>>
>>>>>> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:
>>>>>>>> * Daniel pointed out that there is no authentication method for taking to a
>>>>>>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is
>>>>>>>>  some clarification of the security scheme for authentication?  My
>>>>>>>>  impression from above is that you are relying on the networks being
>>>>>>>>  private to provide some sort of implicit authentication, though, and this
>>>>>>>>  seems fragile (and doesn't protect against a compromised guest or other
>>>>>>>>  process on the server, for one).
>>>>>>>
>>>>>>> Exactly, from the QEMU trust model you must assume that QEMU has been
>>>>>>> compromised by the guest.  The escaped guest can connect to the VxHS
>>>>>>> server since it controls the QEMU process.
>>>>>>>
>>>>>>> An escaped guest must not have access to other guests' volumes.
>>>>>>> Therefore authentication is necessary.
>>>>>>
>>>>>> Just so I am clear on this, how will such an escaped guest get to know
>>>>>> the other guest vdisk IDs?
>>>>>
>>>>> There can be a multiple approaches depending on the deployment scenario.
>>>>> At the very simplest it could directly read the IDs out of the libvirt
>>>>> XML files in /var/run/libvirt. Or it can rnu "ps" to list other running
>>>>> QEMU processes and see the vdisk IDs in the command line args of those
>>>>> processes. Or the mgmt app may be creating vdisk IDs based on some
>>>>> particular scheme, and the attacker may have info about this which lets
>>>>> them determine likely IDs.  Or the QEMU may have previously been
>>>>> permitted to the use the disk and remembered the ID for use later
>>>>> after access to the disk has been removed.
>>>>>
>>>>
>>>> Are we talking about a compromised guest here or compromised hypervisor?
>>>> How will a compromised guest read the xml file or list running qemu
>>>> processes?
>>>
>>> Compromised QEMU process, aka hypervisor userspace
>>>
>>>
>>> Regards,
>>> Daniel
>>> --
>>> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
>>> |: http://libvirt.org              -o-             http://virt-manager.org :|
>>> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
Ketan Nilangekar Nov. 24, 2016, 5:44 a.m. UTC | #48
On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:

>

>

>On 23/11/2016 23:09, ashish mittal wrote:

>> On the topic of protocol security -

>> 

>> Would it be enough for the first patch to implement only

>> authentication and not encryption?

>

>Yes, of course.  However, as we introduce more and more QEMU-specific

>characteristics to a protocol that is already QEMU-specific (it doesn't

>do failover, etc.), I am still not sure of the actual benefit of using

>libqnio versus having an NBD server or FUSE driver.

>

>You have already mentioned performance, but the design has changed so

>much that I think one of the two things has to change: either failover

>moves back to QEMU and there is no (closed source) translator running on

>the node, or the translator needs to speak a well-known and

>already-supported protocol.


IMO the design has not changed. The implementation has changed significantly. I would propose that we keep resiliency/failover code out of the QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all.

Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline, which will use cross memory attach calls to further boost performance. Of course a big reason for the performance is also the HyperScale storage backend, but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.

Ketan

>

>Paolo

>

>> On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar

>> <Ketan.Nilangekar@veritas.com> wrote:

>>> +Nitin Jerath from Veritas.

>>>

>>>

>>>

>>>

>>> On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote:

>>>

>>>> On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote:

>>>>>

>>>>>

>>>>>> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote:

>>>>>>

>>>>>>> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote:

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

>>>>>>>>

>>>>>>>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote:

>>>>>>>>> * Daniel pointed out that there is no authentication method for taking to a

>>>>>>>>>  remote server.  This seems a bit scary.  Maybe all that is needed here is

>>>>>>>>>  some clarification of the security scheme for authentication?  My

>>>>>>>>>  impression from above is that you are relying on the networks being

>>>>>>>>>  private to provide some sort of implicit authentication, though, and this

>>>>>>>>>  seems fragile (and doesn't protect against a compromised guest or other

>>>>>>>>>  process on the server, for one).

>>>>>>>>

>>>>>>>> Exactly, from the QEMU trust model you must assume that QEMU has been

>>>>>>>> compromised by the guest.  The escaped guest can connect to the VxHS

>>>>>>>> server since it controls the QEMU process.

>>>>>>>>

>>>>>>>> An escaped guest must not have access to other guests' volumes.

>>>>>>>> Therefore authentication is necessary.

>>>>>>>

>>>>>>> Just so I am clear on this, how will such an escaped guest get to know

>>>>>>> the other guest vdisk IDs?

>>>>>>

>>>>>> There can be a multiple approaches depending on the deployment scenario.

>>>>>> At the very simplest it could directly read the IDs out of the libvirt

>>>>>> XML files in /var/run/libvirt. Or it can rnu "ps" to list other running

>>>>>> QEMU processes and see the vdisk IDs in the command line args of those

>>>>>> processes. Or the mgmt app may be creating vdisk IDs based on some

>>>>>> particular scheme, and the attacker may have info about this which lets

>>>>>> them determine likely IDs.  Or the QEMU may have previously been

>>>>>> permitted to the use the disk and remembered the ID for use later

>>>>>> after access to the disk has been removed.

>>>>>>

>>>>>

>>>>> Are we talking about a compromised guest here or compromised hypervisor?

>>>>> How will a compromised guest read the xml file or list running qemu

>>>>> processes?

>>>>

>>>> Compromised QEMU process, aka hypervisor userspace

>>>>

>>>>

>>>> Regards,

>>>> Daniel

>>>> --

>>>> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|

>>>> |: http://libvirt.org              -o-             http://virt-manager.org :|

>>>> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
Daniel P. Berrangé Nov. 24, 2016, 10:15 a.m. UTC | #49
On Wed, Nov 23, 2016 at 02:09:50PM -0800, ashish mittal wrote:
> On the topic of protocol security -
> 
> Would it be enough for the first patch to implement only
> authentication and not encryption?

Yes, authentication is the only critical thing from my POV. While encryption
is a nice to have, there are plenty of storage systems which do *not* do
encryption. Guest data can still be protected simply by running LUKS on the
guest disks, so lack of encryption is not a serious security risk, provided
the authentication scheme itself does not require encryption in order to
be secure.


Regards,
Daniel
Stefan Hajnoczi Nov. 24, 2016, 11:11 a.m. UTC | #50
On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
> On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
> >On 23/11/2016 23:09, ashish mittal wrote:
> >> On the topic of protocol security -
> >> 
> >> Would it be enough for the first patch to implement only
> >> authentication and not encryption?
> >
> >Yes, of course.  However, as we introduce more and more QEMU-specific
> >characteristics to a protocol that is already QEMU-specific (it doesn't
> >do failover, etc.), I am still not sure of the actual benefit of using
> >libqnio versus having an NBD server or FUSE driver.
> >
> >You have already mentioned performance, but the design has changed so
> >much that I think one of the two things has to change: either failover
> >moves back to QEMU and there is no (closed source) translator running on
> >the node, or the translator needs to speak a well-known and
> >already-supported protocol.
> 
> IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 
> 
> Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.

By "cross memory attach" do you mean
process_vm_readv(2)/process_vm_writev(2)?

That puts us back to square one in terms of security.  You have
(untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
another process on the same machine.  That process is therefore also
untrusted and may only process data for one guest so that guests stay
isolated from each other.

There's an easier way to get even better performance: get rid of libqnio
and the external process.  Move the code from the external process into
QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
context switching.

Can you remind me why there needs to be an external process?

Stefan
Ketan Nilangekar Nov. 24, 2016, 11:31 a.m. UTC | #51
On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
    > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:

    > >On 23/11/2016 23:09, ashish mittal wrote:

    > >> On the topic of protocol security -

    > >> 

    > >> Would it be enough for the first patch to implement only

    > >> authentication and not encryption?

    > >

    > >Yes, of course.  However, as we introduce more and more QEMU-specific

    > >characteristics to a protocol that is already QEMU-specific (it doesn't

    > >do failover, etc.), I am still not sure of the actual benefit of using

    > >libqnio versus having an NBD server or FUSE driver.

    > >

    > >You have already mentioned performance, but the design has changed so

    > >much that I think one of the two things has to change: either failover

    > >moves back to QEMU and there is no (closed source) translator running on

    > >the node, or the translator needs to speak a well-known and

    > >already-supported protocol.

    > 

    > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 

    > 

    > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.

    
    By "cross memory attach" do you mean
    process_vm_readv(2)/process_vm_writev(2)?
  
Ketan> Yes.
  
    That puts us back to square one in terms of security.  You have
    (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
    another process on the same machine.  That process is therefore also
    untrusted and may only process data for one guest so that guests stay
    isolated from each other.
    
Ketan> Understood, but this will be no worse than the current network-based communication between qnio and the vxhs server. And although we have questions around the QEMU trust/vulnerability issues, we are looking to implement a basic authentication scheme between libqnio and the vxhs server.

    There's an easier way to get even better performance: get rid of libqnio
    and the external process.  Move the code from the external process into
    QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
    context switching.
    
    Can you remind me why there needs to be an external process?
 
Ketan>  Apart from virtualizing the available direct attached storage on the compute node, the vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots), etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.

Ketan.
 
    Stefan
Stefan Hajnoczi Nov. 24, 2016, 4:08 p.m. UTC | #52
On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
> 
> 
> On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> 
>     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>     > >On 23/11/2016 23:09, ashish mittal wrote:
>     > >> On the topic of protocol security -
>     > >> 
>     > >> Would it be enough for the first patch to implement only
>     > >> authentication and not encryption?
>     > >
>     > >Yes, of course.  However, as we introduce more and more QEMU-specific
>     > >characteristics to a protocol that is already QEMU-specific (it doesn't
>     > >do failover, etc.), I am still not sure of the actual benefit of using
>     > >libqnio versus having an NBD server or FUSE driver.
>     > >
>     > >You have already mentioned performance, but the design has changed so
>     > >much that I think one of the two things has to change: either failover
>     > >moves back to QEMU and there is no (closed source) translator running on
>     > >the node, or the translator needs to speak a well-known and
>     > >already-supported protocol.
>     > 
>     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 
>     > 
>     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.
>     
>     By "cross memory attach" do you mean
>     process_vm_readv(2)/process_vm_writev(2)?
>   
> Ketan> Yes.
>   
>     That puts us back to square one in terms of security.  You have
>     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>     another process on the same machine.  That process is therefore also
>     untrusted and may only process data for one guest so that guests stay
>     isolated from each other.
>     
> Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.

This is incorrect.

Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
It means process A reads/writes directly from/to process B memory.  Both
processes must have the same uid/gid.  There is no trust boundary
between them.

Network communication does not require both processes to have the same
uid/gid.  If you want multiple QEMU processes talking to a single server
there must be a trust boundary between client and server.  The server
can validate the input from the client and reject undesired operations.

Hope this makes sense now.
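
To make that concrete, a minimal sketch of the reader side of cross memory
attach (the pid and remote address are placeholders passed on the command
line):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Read len bytes at remote_addr in process pid into buf.  The kernel
     * applies the same permission check as ptrace(PTRACE_ATTACH): same uid
     * or CAP_SYS_PTRACE, i.e. no trust boundary between the two processes. */
    static ssize_t read_remote(pid_t pid, void *remote_addr, void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }

    int main(int argc, char **argv)
    {
        char buf[64];

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
            return 1;
        }

        ssize_t n = read_remote((pid_t)atoi(argv[1]),
                                (void *)strtoul(argv[2], NULL, 16),
                                buf, sizeof(buf));
        if (n < 0) {
            perror("process_vm_readv");  /* EPERM without ptrace-level access */
            return 1;
        }
        printf("read %zd bytes\n", n);
        return 0;
    }

Without ptrace-level access to the target the call fails with EPERM, which is
exactly the "same security domain" property described above.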

Two architectures that implement the QEMU trust model correctly are:

1. Cross memory attach: each QEMU process has a dedicated vxhs server
   process to prevent guests from attacking each other.  This is where I
   said you might as well put the code inside QEMU since there is no
   isolation anyway.  From what you've said it sounds like the vxhs
   server needs a host-wide view and is responsible for all guests
   running on the host, so I guess we have to rule out this
   architecture.

2. Network communication: one vxhs server process and multiple guests.
   Here you might as well use NBD or iSCSI because it already exists and
   the vxhs driver doesn't add any unique functionality over existing
   protocols.

>     There's an easier way to get even better performance: get rid of libqnio
>     and the external process.  Move the code from the external process into
>     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
>     context switching.
>     
>     Can you remind me why there needs to be an external process?
>  
> Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.

This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
etc) already do.  It brings to mind a third architecture:

3. A Linux driver or file system.  Then QEMU opens a raw block device.
   This is what the Ceph rbd block driver in Linux does.  This
   architecture has a kernel-userspace boundary so vxhs does not have to
   trust QEMU.

I suggest Architecture #2.  You'll be able to deploy on existing systems
because QEMU already supports NBD or iSCSI.  Use the time you gain from
switching to this architecture on benchmarking and optimizing NBD or
iSCSI so performance is closer to your goal.

Stefan
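
(As a concrete illustration of architecture 2, assuming a hypothetical volume
path and export name, an NBD-based deployment could look roughly like:

    # storage node: export the volume over NBD
    $ qemu-nbd -f raw -x vol0 -p 10809 /dev/vxhs/vol0

    # compute node: attach the export as a guest disk
    $ qemu-system-x86_64 ... \
        -drive file=nbd://storagenode:10809/vol0,format=raw,if=virtio

with TLS credentials (--tls-creds on qemu-nbd, tls-creds on the client side)
layered on top for authentication, as Daniel described earlier.)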
Ketan Nilangekar Nov. 25, 2016, 8:27 a.m. UTC | #53
On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
    > 

    > 

    > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    > 

    >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:

    >     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:

    >     > >On 23/11/2016 23:09, ashish mittal wrote:

    >     > >> On the topic of protocol security -

    >     > >> 

    >     > >> Would it be enough for the first patch to implement only

    >     > >> authentication and not encryption?

    >     > >

    >     > >Yes, of course.  However, as we introduce more and more QEMU-specific

    >     > >characteristics to a protocol that is already QEMU-specific (it doesn't

    >     > >do failover, etc.), I am still not sure of the actual benefit of using

    >     > >libqnio versus having an NBD server or FUSE driver.

    >     > >

    >     > >You have already mentioned performance, but the design has changed so

    >     > >much that I think one of the two things has to change: either failover

    >     > >moves back to QEMU and there is no (closed source) translator running on

    >     > >the node, or the translator needs to speak a well-known and

    >     > >already-supported protocol.

    >     > 

    >     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 

    >     > 

    >     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.

    >     

    >     By "cross memory attach" do you mean

    >     process_vm_readv(2)/process_vm_writev(2)?

    >   

    > Ketan> Yes.

    >   

    >     That puts us back to square one in terms of security.  You have

    >     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of

    >     another process on the same machine.  That process is therefore also

    >     untrusted and may only process data for one guest so that guests stay

    >     isolated from each other.

    >     

    > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.

    
    This is incorrect.
    
    Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
    It means process A reads/writes directly from/to process B memory.  Both
    processes must have the same uid/gid.  There is no trust boundary
    between them.
    
Ketan> Not if the vxhs server is running as root and initiating the cross memory attach, which is also why we are proposing a basic authentication mechanism between QEMU and vxhs. But anyway, cross memory attach is a near-future implementation.

    Network communication does not require both processes to have the same
    uid/gid.  If you want multiple QEMU processes talking to a single server
    there must be a trust boundary between client and server.  The server
    can validate the input from the client and reject undesired operations.

Ketan> This is what we are trying to propose. With the addition of authentication between QEMU and the vxhs server, we should be able to achieve this. The question is, would that be acceptable?
    
    Hope this makes sense now.
    
    Two architectures that implement the QEMU trust model correctly are:
    
    1. Cross memory attach: each QEMU process has a dedicated vxhs server
       process to prevent guests from attacking each other.  This is where I
       said you might as well put the code inside QEMU since there is no
       isolation anyway.  From what you've said it sounds like the vxhs
       server needs a host-wide view and is responsible for all guests
       running on the host, so I guess we have to rule out this
       architecture.
    
    2. Network communication: one vxhs server process and multiple guests.
       Here you might as well use NBD or iSCSI because it already exists and
       the vxhs driver doesn't add any unique functionality over existing
       protocols.

Ketan> NBD does not give us the performance we are trying to achieve. Besides, NBD does not have any authentication support.
There is a hybrid 2.a approach which uses both 1 & 2, but I’d keep that for a later discussion.

    >     There's an easier way to get even better performance: get rid of libqnio

    >     and the external process.  Move the code from the external process into

    >     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and

    >     context switching.

    >     

    >     Can you remind me why there needs to be an external process?

    >  

    > Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.

    
    This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
    etc) already do.  It brings to mind a third architecture:
    
    3. A Linux driver or file system.  Then QEMU opens a raw block device.
       This is what the Ceph rbd block driver in Linux does.  This
       architecture has a kernel-userspace boundary so vxhs does not have to
       trust QEMU.
    
    I suggest Architecture #2.  You'll be able to deploy on existing systems
    because QEMU already supports NBD or iSCSI.  Use the time you gain from
    switching to this architecture on benchmarking and optimizing NBD or
    iSCSI so performance is closer to your goal.
    
Ketan> We have made a choice to go with the QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms, including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap, which we believe will provide an alternative to the existing mechanisms and hopefully benefit the community.

    Stefan
Stefan Hajnoczi Nov. 25, 2016, 11:35 a.m. UTC | #54
On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
> On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
>     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>     >     > >On 23/11/2016 23:09, ashish mittal wrote:
>     >     > >> On the topic of protocol security -
>     >     > >> 
>     >     > >> Would it be enough for the first patch to implement only
>     >     > >> authentication and not encryption?
>     >     > >
>     >     > >Yes, of course.  However, as we introduce more and more QEMU-specific
>     >     > >characteristics to a protocol that is already QEMU-specific (it doesn't
>     >     > >do failover, etc.), I am still not sure of the actual benefit of using
>     >     > >libqnio versus having an NBD server or FUSE driver.
>     >     > >
>     >     > >You have already mentioned performance, but the design has changed so
>     >     > >much that I think one of the two things has to change: either failover
>     >     > >moves back to QEMU and there is no (closed source) translator running on
>     >     > >the node, or the translator needs to speak a well-known and
>     >     > >already-supported protocol.
>     >     > 
>     >     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 
>     >     > 
>     >     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.
>     >     
>     >     By "cross memory attach" do you mean
>     >     process_vm_readv(2)/process_vm_writev(2)?
>     >   
>     > Ketan> Yes.
>     >   
>     >     That puts us back to square one in terms of security.  You have
>     >     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>     >     another process on the same machine.  That process is therefore also
>     >     untrusted and may only process data for one guest so that guests stay
>     >     isolated from each other.
>     >     
>     > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.
>     
>     This is incorrect.
>     
>     Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
>     It means process A reads/writes directly from/to process B memory.  Both
>     processes must have the same uid/gid.  There is no trust boundary
>     between them.
>     
> Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. 
> 
>     Network communication does not require both processes to have the same
>     uid/gid.  If you want multiple QEMU processes talking to a single server
>     there must be a trust boundary between client and server.  The server
>     can validate the input from the client and reject undesired operations.
> 
> Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable?
>     
>     Hope this makes sense now.
>     
>     Two architectures that implement the QEMU trust model correctly are:
>     
>     1. Cross memory attach: each QEMU process has a dedicated vxhs server
>        process to prevent guests from attacking each other.  This is where I
>        said you might as well put the code inside QEMU since there is no
>        isolation anyway.  From what you've said it sounds like the vxhs
>        server needs a host-wide view and is responsible for all guests
>        running on the host, so I guess we have to rule out this
>        architecture.
>     
>     2. Network communication: one vxhs server process and multiple guests.
>        Here you might as well use NBD or iSCSI because it already exists and
>        the vxhs driver doesn't add any unique functionality over existing
>        protocols.
> 
> Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support.

NBD over TCP supports TLS with X.509 certificate authentication.  I
think Daniel Berrange mentioned that.

NBD over AF_UNIX does not need authentication because it relies on file
permissions for access control.  Each guest should have its own UNIX
domain socket that it connects to.  That socket can only see exports
that have been assigned to the guest.
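
As a minimal sketch of the per-guest AF_UNIX idea (the path and uid are
hypothetical), the access control is simply file ownership and
permissions on the socket, so only that guest's QEMU process can connect:

#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind one export socket per guest and restrict it to that guest's uid. */
static int bind_guest_socket(const char *path, uid_t qemu_uid)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };

    if (fd < 0) {
        return -1;
    }
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        chown(path, qemu_uid, (gid_t)-1) < 0 ||
        chmod(path, 0600) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}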

> There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion.

Please discuss it now so everyone gets on the same page.  I think there
is a big gap and we need to communicate so that progress can be made.

>     >     There's an easier way to get even better performance: get rid of libqnio
>     >     and the external process.  Move the code from the external process into
>     >     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
>     >     context switching.
>     >     
>     >     Can you remind me why there needs to be an external process?
>     >  
>     > Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.
>     
>     This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
>     etc) already do.  It brings to mind a third architecture:
>     
>     3. A Linux driver or file system.  Then QEMU opens a raw block device.
>        This is what the Ceph rbd block driver in Linux does.  This
>        architecture has a kernel-userspace boundary so vxhs does not have to
>        trust QEMU.
>     
>     I suggest Architecture #2.  You'll be able to deploy on existing systems
>     because QEMU already supports NBD or iSCSI.  Use the time you gain from
>     switching to this architecture on benchmarking and optimizing NBD or
>     iSCSI so performance is closer to your goal.
>     
> Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community.

I thought the VxHS block driver was another network block driver like
GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
with the goal of better performance.

Please share fio(1) or other standard benchmark configuration files and
performance results.

NBD and libqnio wire protocols have comparable performance
characteristics.  There is no magic that should give either one a
fundamental edge over the other.  Am I missing something?

The main performance difference is probably that libqnio opens 8
simultaneous connections but that's not unique to the wire protocol.
What happens when you run 8 simultaneous NBD TCP connections?

Stefan
Fam Zheng Nov. 28, 2016, 7:15 a.m. UTC | #55
On Fri, 11/25 08:27, Ketan Nilangekar wrote:
> Ketan> We have made a choice to go with QEMU driver approach after serious
> evaluation of most if not all standard IO tapping mechanisms including NFS,
> NBD and FUSE. None of these has been able to deliver the performance that we
> have set ourselves to achieve. Hence the effort to propose this new IO tap
> which we believe will provide an alternate to the existing mechanisms and
> hopefully benefit the community.

Out of curiosity: have you also evaluated the kernel TCMU interface [1] that can
do native command exchange very efficiently? It is a relatively new IO tapping
mechanism but provides better isolation between QEMU and the backend process
under the supervision of the kernel. With its "loopback" frontend, theoretically the
SG lists can be passed back and forth for local clients. For remote clients,
iSCSI can be used as the protocol.

Fam

[1]: https://www.kernel.org/doc/Documentation/target/tcmu-design.txt
Ketan Nilangekar Nov. 28, 2016, 10:23 a.m. UTC | #56
On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
    > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    >     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:

    >     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:

    >     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:

    >     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:

    >     >     > >On 23/11/2016 23:09, ashish mittal wrote:

    >     >     > >> On the topic of protocol security -

    >     >     > >> 

    >     >     > >> Would it be enough for the first patch to implement only

    >     >     > >> authentication and not encryption?

    >     >     > >

    >     >     > >Yes, of course.  However, as we introduce more and more QEMU-specific

    >     >     > >characteristics to a protocol that is already QEMU-specific (it doesn't

    >     >     > >do failover, etc.), I am still not sure of the actual benefit of using

    >     >     > >libqnio versus having an NBD server or FUSE driver.

    >     >     > >

    >     >     > >You have already mentioned performance, but the design has changed so

    >     >     > >much that I think one of the two things has to change: either failover

    >     >     > >moves back to QEMU and there is no (closed source) translator running on

    >     >     > >the node, or the translator needs to speak a well-known and

    >     >     > >already-supported protocol.

    >     >     > 

    >     >     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 

    >     >     > 

    >     >     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.

    >     >     

    >     >     By "cross memory attach" do you mean

    >     >     process_vm_readv(2)/process_vm_writev(2)?

    >     >   

    >     > Ketan> Yes.

    >     >   

    >     >     That puts us back to square one in terms of security.  You have

    >     >     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of

    >     >     another process on the same machine.  That process is therefore also

    >     >     untrusted and may only process data for one guest so that guests stay

    >     >     isolated from each other.

    >     >     

    >     > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.

    >     

    >     This is incorrect.

    >     

    >     Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.

    >     It means process A reads/writes directly from/to process B memory.  Both

    >     processes must have the same uid/gid.  There is no trust boundary

    >     between them.

    >     

    > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. 

    > 

    >     Network communication does not require both processes to have the same

    >     uid/gid.  If you want multiple QEMU processes talking to a single server

    >     there must be a trust boundary between client and server.  The server

    >     can validate the input from the client and reject undesired operations.

    > 

    > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable?

    >     

    >     Hope this makes sense now.

    >     

    >     Two architectures that implement the QEMU trust model correctly are:

    >     

    >     1. Cross memory attach: each QEMU process has a dedicated vxhs server

    >        process to prevent guests from attacking each other.  This is where I

    >        said you might as well put the code inside QEMU since there is no

    >        isolation anyway.  From what you've said it sounds like the vxhs

    >        server needs a host-wide view and is responsible for all guests

    >        running on the host, so I guess we have to rule out this

    >        architecture.

    >     

    >     2. Network communication: one vxhs server process and multiple guests.

    >        Here you might as well use NBD or iSCSI because it already exists and

    >        the vxhs driver doesn't add any unique functionality over existing

    >        protocols.

    > 

    > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support.

    
    NBD over TCP supports TLS with X.509 certificate authentication.  I
    think Daniel Berrange mentioned that.

Ketan> I saw the patch to NBD that was merged in 2015. Before that, NBD did not have any authentication, as Daniel Berrange mentioned.
    
    NBD over AF_UNIX does not need authentication because it relies on file
    permissions for access control.  Each guest should have its own UNIX
    domain socket that it connects to.  That socket can only see exports
    that have been assigned to the guest.
    
    > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion.

    
    Please discuss it now so everyone gets on the same page.  I think there
    is a big gap and we need to communicate so that progress can be made.

Ketan> The approach was to use cross memory attach for the IO path and a simplified network IO lib for resiliency/failover. We did not want to derail the current discussion, hence the suggestion to take it up later.

    
    >     >     There's an easier way to get even better performance: get rid of libqnio

    >     >     and the external process.  Move the code from the external process into

    >     >     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and

    >     >     context switching.

    >     >     

    >     >     Can you remind me why there needs to be an external process?

    >     >  

    >     > Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.

    >     

    >     This sounds similar to what QEMU and Linux (file systems, LVM, RAID,

    >     etc) already do.  It brings to mind a third architecture:

    >     

    >     3. A Linux driver or file system.  Then QEMU opens a raw block device.

    >        This is what the Ceph rbd block driver in Linux does.  This

    >        architecture has a kernel-userspace boundary so vxhs does not have to

    >        trust QEMU.

    >     

    >     I suggest Architecture #2.  You'll be able to deploy on existing systems

    >     because QEMU already supports NBD or iSCSI.  Use the time you gain from

    >     switching to this architecture on benchmarking and optimizing NBD or

    >     iSCSI so performance is closer to your goal.

    >     

    > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community.

    
    I thought the VxHS block driver was another network block driver like
    GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
    with the goal of better performance.

Ketan> The VxHS block driver is a new local IO tap with the goal of better performance, specifically when used with the VxHS server. This, coupled with shared-memory IPC (like cross memory attach), could be a much better IO tap option for QEMU users. It will also avoid the context switches between QEMU, the network stack, and the service that happen today with NBD.

    
    Please share fio(1) or other standard benchmark configuration files and
    performance results.

Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum. 
    
    NBD and libqnio wire protocols have comparable performance
    characteristics.  There is no magic that should give either one a
    fundamental edge over the other.  Am I missing something?

Ketan> I have not seen the NBD code, but a few things we considered, which are part of libqnio (though not exclusively), are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read, etc. Again, these are not exclusive to libqnio but could give one protocol the edge over the other. Part of the “magic” is also in the VxHS storage backend, which is able to ingest the IOs with lower latencies.
    
    The main performance difference is probably that libqnio opens 8
    simultaneous connections but that's not unique to the wire protocol.
    What happens when you run 8 NBD simultaneous TCP connections?

Ketan> Possibly. We have not benchmarked this.

    Stefan
Stefan Hajnoczi Nov. 28, 2016, 2:17 p.m. UTC | #57
On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
> 
> 
> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
> 
>     On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>     > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>     >     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
>     >     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>     >     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>     >     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>     >     >     > >On 23/11/2016 23:09, ashish mittal wrote:
>     >     >     > >> On the topic of protocol security -
>     >     >     > >> 
>     >     >     > >> Would it be enough for the first patch to implement only
>     >     >     > >> authentication and not encryption?
>     >     >     > >
>     >     >     > >Yes, of course.  However, as we introduce more and more QEMU-specific
>     >     >     > >characteristics to a protocol that is already QEMU-specific (it doesn't
>     >     >     > >do failover, etc.), I am still not sure of the actual benefit of using
>     >     >     > >libqnio versus having an NBD server or FUSE driver.
>     >     >     > >
>     >     >     > >You have already mentioned performance, but the design has changed so
>     >     >     > >much that I think one of the two things has to change: either failover
>     >     >     > >moves back to QEMU and there is no (closed source) translator running on
>     >     >     > >the node, or the translator needs to speak a well-known and
>     >     >     > >already-supported protocol.
>     >     >     > 
>     >     >     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. 
>     >     >     > 
>     >     >     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.
>     >     >     
>     >     >     By "cross memory attach" do you mean
>     >     >     process_vm_readv(2)/process_vm_writev(2)?
>     >     >   
>     >     > Ketan> Yes.
>     >     >   
>     >     >     That puts us back to square one in terms of security.  You have
>     >     >     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>     >     >     another process on the same machine.  That process is therefore also
>     >     >     untrusted and may only process data for one guest so that guests stay
>     >     >     isolated from each other.
>     >     >     
>     >     > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.
>     >     
>     >     This is incorrect.
>     >     
>     >     Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
>     >     It means process A reads/writes directly from/to process B memory.  Both
>     >     processes must have the same uid/gid.  There is no trust boundary
>     >     between them.
>     >     
>     > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. 
>     > 
>     >     Network communication does not require both processes to have the same
>     >     uid/gid.  If you want multiple QEMU processes talking to a single server
>     >     there must be a trust boundary between client and server.  The server
>     >     can validate the input from the client and reject undesired operations.
>     > 
>     > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable?
>     >     
>     >     Hope this makes sense now.
>     >     
>     >     Two architectures that implement the QEMU trust model correctly are:
>     >     
>     >     1. Cross memory attach: each QEMU process has a dedicated vxhs server
>     >        process to prevent guests from attacking each other.  This is where I
>     >        said you might as well put the code inside QEMU since there is no
>     >        isolation anyway.  From what you've said it sounds like the vxhs
>     >        server needs a host-wide view and is responsible for all guests
>     >        running on the host, so I guess we have to rule out this
>     >        architecture.
>     >     
>     >     2. Network communication: one vxhs server process and multiple guests.
>     >        Here you might as well use NBD or iSCSI because it already exists and
>     >        the vxhs driver doesn't add any unique functionality over existing
>     >        protocols.
>     > 
>     > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support.
>     
>     NBD over TCP supports TLS with X.509 certificate authentication.  I
>     think Daniel Berrange mentioned that.
> 
> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD did not have any auth as Daniel Berrange mentioned. 
>     
>     NBD over AF_UNIX does not need authentication because it relies on file
>     permissions for access control.  Each guest should have its own UNIX
>     domain socket that it connects to.  That socket can only see exports
>     that have been assigned to the guest.
>     
>     > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion.
>     
>     Please discuss it now so everyone gets on the same page.  I think there
>     is a big gap and we need to communicate so that progress can be made.
> 
> Ketan> The approach was to use cross mem attach for IO path and a simplified network IO lib for resiliency/failover. Did not want to derail the current discussion hence the suggestion to take it up later.

Why does the client have to know about failover if it's connected to a
server process on the same host?  I thought the server process manages
networking issues (like the actual protocol to speak to other VxHS nodes
and for failover).

>     >     >     There's an easier way to get even better performance: get rid of libqnio
>     >     >     and the external process.  Move the code from the external process into
>     >     >     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
>     >     >     context switching.
>     >     >     
>     >     >     Can you remind me why there needs to be an external process?
>     >     >  
>     >     > Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.
>     >     
>     >     This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
>     >     etc) already do.  It brings to mind a third architecture:
>     >     
>     >     3. A Linux driver or file system.  Then QEMU opens a raw block device.
>     >        This is what the Ceph rbd block driver in Linux does.  This
>     >        architecture has a kernel-userspace boundary so vxhs does not have to
>     >        trust QEMU.
>     >     
>     >     I suggest Architecture #2.  You'll be able to deploy on existing systems
>     >     because QEMU already supports NBD or iSCSI.  Use the time you gain from
>     >     switching to this architecture on benchmarking and optimizing NBD or
>     >     iSCSI so performance is closer to your goal.
>     >     
>     > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community.
>     
>     I thought the VxHS block driver was another network block driver like
>     GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
>     with the goal of better performance.
> 
> Ketan> The VxHS block driver is a new local IO tap with the goal of better performance specifically when used with the VxHS server. This coupled with shared mem IPC (like cross mem attach) could be a much better IO tap option for qemu users. This will also avoid context switch between qemu/network stack to service which happens today in NBD.
> 
>     
>     Please share fio(1) or other standard benchmark configuration files and
>     performance results.
> 
> Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum. 
>     
>     NBD and libqnio wire protocols have comparable performance
>     characteristics.  There is no magic that should give either one a
>     fundamental edge over the other.  Am I missing something?
> 
> Ketan> I have not seen the NBD code but few things which we considered and are part of libqnio (though not exclusively) are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read etc. Again these are not exclusive to libqnio but could give one protocol the edge over the other. Also part of the “magic” is also in the VxHS storage backend which is able to ingest the IOs with lower latencies.
>     
>     The main performance difference is probably that libqnio opens 8
>     simultaneous connections but that's not unique to the wire protocol.
>     What happens when you run 8 NBD simultaneous TCP connections?
> 
> Ketan> Possibly. We have not benchmarked this.

There must be benchmark data if you want to add a new feature or modify
existing code for performance reasons.  This rule is followed in QEMU so
that performance changes are justified.

I'm afraid that when you look into the performance you'll find that any
performance difference between NBD and this VxHS patch series is due to
implementation differences that can be ported across to QEMU NBD, rather
than wire protocol differences.

If that's the case then it would save a lot of time to use NBD over
AF_UNIX for now.  You could focus efforts on achieving the final
architecture you've explained with cross memory attach.

Please take a look at vhost-user-scsi, which folks from Nutanix are
currently working on.  See "[PATCH v2 0/3] Introduce vhost-user-scsi and
sample application" on qemu-devel.  It is a true zero-copy local I/O tap
because it shares guest RAM.  This is more efficient than cross memory
attach's single memory copy.  It does not require running the server as
root.  This is the #1 thing you should evaluate for your final
architecture.
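
To illustrate the zero-copy idea only (this is a toy sketch, not the
vhost-user protocol itself, and the fd-passing over a UNIX socket via
SCM_RIGHTS is omitted): guest RAM is a shareable fd that the server
mmap()s, so I/O buffers are accessed in place with no per-request copy.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t ram_size = 64 << 20;              /* toy "guest RAM" size */
    int fd = memfd_create("guest-ram", 0);   /* created on the QEMU side */

    if (fd < 0 || ftruncate(fd, ram_size) < 0) {
        return 1;
    }
    /* ... the fd would be handed to the server process here ... */

    /* Server side: map the same pages and touch guest buffers directly. */
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) {
        return 1;
    }
    printf("mapped %zu bytes of shared guest RAM\n", ram_size);
    return 0;
}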

vhost-user-scsi works on the virtio-scsi emulation level.  That means
the server must implement the virtio-scsi vring and device emulation.
It is not a block driver.  By hooking in at this level you can achieve
the best performance but you lose all QEMU block layer functionality and
need to implement your own SCSI target.  You also need to consider live
migration.

Stefan
Ashish Mittal Nov. 30, 2016, 12:45 a.m. UTC | #58
+ Rakesh from Veritas

On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
>>
>>
>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>
>>     On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>>     > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>     >     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
>>     >     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>     >     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>>     >     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote:
>>     >     >     > >On 23/11/2016 23:09, ashish mittal wrote:
>>     >     >     > >> On the topic of protocol security -
>>     >     >     > >>
>>     >     >     > >> Would it be enough for the first patch to implement only
>>     >     >     > >> authentication and not encryption?
>>     >     >     > >
>>     >     >     > >Yes, of course.  However, as we introduce more and more QEMU-specific
>>     >     >     > >characteristics to a protocol that is already QEMU-specific (it doesn't
>>     >     >     > >do failover, etc.), I am still not sure of the actual benefit of using
>>     >     >     > >libqnio versus having an NBD server or FUSE driver.
>>     >     >     > >
>>     >     >     > >You have already mentioned performance, but the design has changed so
>>     >     >     > >much that I think one of the two things has to change: either failover
>>     >     >     > >moves back to QEMU and there is no (closed source) translator running on
>>     >     >     > >the node, or the translator needs to speak a well-known and
>>     >     >     > >already-supported protocol.
>>     >     >     >
>>     >     >     > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all.
>>     >     >     >
>>     >     >     > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well.
>>     >     >
>>     >     >     By "cross memory attach" do you mean
>>     >     >     process_vm_readv(2)/process_vm_writev(2)?
>>     >     >
>>     >     > Ketan> Yes.
>>     >     >
>>     >     >     That puts us back to square one in terms of security.  You have
>>     >     >     (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>>     >     >     another process on the same machine.  That process is therefore also
>>     >     >     untrusted and may only process data for one guest so that guests stay
>>     >     >     isolated from each other.
>>     >     >
>>     >     > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server.
>>     >
>>     >     This is incorrect.
>>     >
>>     >     Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
>>     >     It means process A reads/writes directly from/to process B memory.  Both
>>     >     processes must have the same uid/gid.  There is no trust boundary
>>     >     between them.
>>     >
>>     > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation.
>>     >
>>     >     Network communication does not require both processes to have the same
>>     >     uid/gid.  If you want multiple QEMU processes talking to a single server
>>     >     there must be a trust boundary between client and server.  The server
>>     >     can validate the input from the client and reject undesired operations.
>>     >
>>     > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable?
>>     >
>>     >     Hope this makes sense now.
>>     >
>>     >     Two architectures that implement the QEMU trust model correctly are:
>>     >
>>     >     1. Cross memory attach: each QEMU process has a dedicated vxhs server
>>     >        process to prevent guests from attacking each other.  This is where I
>>     >        said you might as well put the code inside QEMU since there is no
>>     >        isolation anyway.  From what you've said it sounds like the vxhs
>>     >        server needs a host-wide view and is responsible for all guests
>>     >        running on the host, so I guess we have to rule out this
>>     >        architecture.
>>     >
>>     >     2. Network communication: one vxhs server process and multiple guests.
>>     >        Here you might as well use NBD or iSCSI because it already exists and
>>     >        the vxhs driver doesn't add any unique functionality over existing
>>     >        protocols.
>>     >
>>     > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support.
>>
>>     NBD over TCP supports TLS with X.509 certificate authentication.  I
>>     think Daniel Berrange mentioned that.
>>
>> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD did not have any auth as Daniel Berrange mentioned.
>>
>>     NBD over AF_UNIX does not need authentication because it relies on file
>>     permissions for access control.  Each guest should have its own UNIX
>>     domain socket that it connects to.  That socket can only see exports
>>     that have been assigned to the guest.
>>
>>     > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion.
>>
>>     Please discuss it now so everyone gets on the same page.  I think there
>>     is a big gap and we need to communicate so that progress can be made.
>>
>> Ketan> The approach was to use cross mem attach for IO path and a simplified network IO lib for resiliency/failover. Did not want to derail the current discussion hence the suggestion to take it up later.
>
> Why does the client have to know about failover if it's connected to a
> server process on the same host?  I thought the server process manages
> networking issues (like the actual protocol to speak to other VxHS nodes
> and for failover).
>
>>     >     >     There's an easier way to get even better performance: get rid of libqnio
>>     >     >     and the external process.  Move the code from the external process into
>>     >     >     QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and
>>     >     >     context switching.
>>     >     >
>>     >     >     Can you remind me why there needs to be an external process?
>>     >     >
>>     >     > Ketan>  Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver.
>>     >
>>     >     This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
>>     >     etc) already do.  It brings to mind a third architecture:
>>     >
>>     >     3. A Linux driver or file system.  Then QEMU opens a raw block device.
>>     >        This is what the Ceph rbd block driver in Linux does.  This
>>     >        architecture has a kernel-userspace boundary so vxhs does not have to
>>     >        trust QEMU.
>>     >
>>     >     I suggest Architecture #2.  You'll be able to deploy on existing systems
>>     >     because QEMU already supports NBD or iSCSI.  Use the time you gain from
>>     >     switching to this architecture on benchmarking and optimizing NBD or
>>     >     iSCSI so performance is closer to your goal.
>>     >
>>     > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community.
>>
>>     I thought the VxHS block driver was another network block driver like
>>     GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
>>     with the goal of better performance.
>>
>> Ketan> The VxHS block driver is a new local IO tap with the goal of better performance specifically when used with the VxHS server. This coupled with shared mem IPC (like cross mem attach) could be a much better IO tap option for qemu users. This will also avoid context switch between qemu/network stack to service which happens today in NBD.
>>
>>
>>     Please share fio(1) or other standard benchmark configuration files and
>>     performance results.
>>
>> Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum.
>>
>>     NBD and libqnio wire protocols have comparable performance
>>     characteristics.  There is no magic that should give either one a
>>     fundamental edge over the other.  Am I missing something?
>>
>> Ketan> I have not seen the NBD code but few things which we considered and are part of libqnio (though not exclusively) are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read etc. Again these are not exclusive to libqnio but could give one protocol the edge over the other. Also part of the “magic” is also in the VxHS storage backend which is able to ingest the IOs with lower latencies.
>>
>>     The main performance difference is probably that libqnio opens 8
>>     simultaneous connections but that's not unique to the wire protocol.
>>     What happens when you run 8 NBD simultaneous TCP connections?
>>
>> Ketan> Possibly. We have not benchmarked this.
>
> There must be benchmark data if you want to add a new feature or modify
> existing code for performance reasons.  This rule is followed in QEMU so
> that performance changes are justified.
>
> I'm afraid that when you look into the performance you'll find that any
> performance difference between NBD and this VxHS patch series is due to
> implementation differences that can be ported across to QEMU NBD, rather
> than wire protocol differences.
>
> If that's the case then it would save a lot of time to use NBD over
> AF_UNIX for now.  You could focus efforts on achieving the final
> architecture you've explained with cross memory attach.
>
> Please take a look at vhost-user-scsi, which folks from Nutanix are
> currently working on.  See "[PATCH v2 0/3] Introduce vhost-user-scsi and
> sample application" on qemu-devel.  It is a true zero-copy local I/O tap
> because it shares guest RAM.  This is more efficient than cross memory
> attach's single memory copy.  It does not require running the server as
> root.  This is the #1 thing you should evaluate for your final
> architecture.
>
> vhost-user-scsi works on the virtio-scsi emulation level.  That means
> the server must implement the virtio-scsi vring and device emulation.
> It is not a block driver.  By hooking in at this level you can achieve
> the best performance but you lose all QEMU block layer functionality and
> need to implement your own SCSI target.  You also need to consider live
> migration.
>
> Stefan
Rakesh Ranjan Nov. 30, 2016, 4:20 a.m. UTC | #59
Hello Stefan,

>>>>> Why does the client have to know about failover if it's connected to
>>>>>a server process on the same host?  I thought the server process
>>>>>manages networking issues (like the actual protocol to speak to other
>>>>>VxHS nodes and for failover).

Just to comment on this, the model being followed within HyperScale is to
allow application I/O continuity (resiliency) in the cases mentioned
below. It adds real value for the consumer/customer and tries to avoid
single points of failure.

1. HyperScale storage service failure (QNIO Server)
	- A daemon managing local storage for VMs runs on each compute node.
	- The daemon can run as a service on the hypervisor itself or within a
VSA (Virtual Storage Appliance, a virtual machine running on the
hypervisor), depending on the ecosystem where HyperScale is supported.
	- A daemon or storage service that is down, crashed, or crashing in a
loop shouldn't have a huge impact on all the VMs running on that
hypervisor or compute node, so service-level resiliency is very useful
for application I/O continuity in this case.

   Solution:
	- The service failure can only be handled on the client side, not on the
server side, since the service acting as the server is itself down.
	- The client detects an I/O error and, depending on the logic, fails the
application I/O over to another available/active QNIO server or
HyperScale storage service running on a different compute node
(reflection/replication node).
	- Once the original server comes back online, the client receives a
negotiated error (not a real application error) telling it to fail the
application I/O back to the original server or local HyperScale storage
service to get better I/O performance.

2. Local physical storage or media failure
	- Once the server or HyperScale storage service detects the media or
local disk failure, then depending on the vDisk (guest disk)
configuration, if another storage copy is available on a different
compute node it handles the fault internally and serves the application
read and write requests; otherwise the application or client gets the
fault.
	- The client doesn't know about any I/O failure, since the server or
storage service manages/handles the fault tolerance.
	- In this case, in order to get some I/O performance benefit, once the
client gets a negotiated error (not an application error) from the local
server or storage service, the client can initiate I/O failover and send
application I/O directly to another compute node where a storage copy is
available to serve the application's needs, instead of sending it locally
where the media is faulted.
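
To make the flow concrete, a rough client-side sketch of the
failover/failback logic described above (all names here are hypothetical
and this is not libqnio code; submit() is a stub standing in for issuing
one request on a given connection):

enum io_status { IO_OK, IO_ERROR, IO_NEGOTIATED_FAILBACK };

struct vdisk_client {
    int local;       /* connection to the local QNIO service      */
    int remote;      /* connection to the reflection node         */
    int use_remote;  /* currently failed over to the remote node? */
};

/* Stub: real code would send the request on this connection. */
static enum io_status submit(int conn, const void *req)
{
    (void)conn;
    (void)req;
    return IO_OK;
}

static enum io_status do_io(struct vdisk_client *c, const void *req)
{
    enum io_status st = submit(c->use_remote ? c->remote : c->local, req);

    if (st == IO_ERROR && !c->use_remote) {
        /* Case 1: local service down, fail over and retry remotely. */
        c->use_remote = 1;
        return submit(c->remote, req);
    }
    if (st == IO_NEGOTIATED_FAILBACK && c->use_remote) {
        /* Local service is back, fail back for better performance. */
        c->use_remote = 0;
        return submit(c->local, req);
    }
    return st;
}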

-Rakesh


On 11/29/16, 4:45 PM, "ashish mittal" <ashmit602@gmail.com> wrote:

>+ Rakesh from Veritas
>
>On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefanha@gmail.com>
>wrote:
>> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
>>>
>>>
>>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote:
>>>
>>>     On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>>>     > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com>
>>>wrote:
>>>     >     On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar
>>>wrote:
>>>     >     > On 11/24/16, 4:41 PM, "Stefan Hajnoczi"
>>><stefanha@gmail.com> wrote:
>>>     >     >     On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan
>>>Nilangekar wrote:
>>>     >     >     > On 11/24/16, 4:07 AM, "Paolo Bonzini"
>>><pbonzini@redhat.com> wrote:
>>>     >     >     > >On 23/11/2016 23:09, ashish mittal wrote:
>>>     >     >     > >> On the topic of protocol security -
>>>     >     >     > >>
>>>     >     >     > >> Would it be enough for the first patch to
>>>implement only
>>>     >     >     > >> authentication and not encryption?
>>>     >     >     > >
>>>     >     >     > >Yes, of course.  However, as we introduce more and
>>>more QEMU-specific
>>>     >     >     > >characteristics to a protocol that is already
>>>QEMU-specific (it doesn't
>>>     >     >     > >do failover, etc.), I am still not sure of the
>>>actual benefit of using
>>>     >     >     > >libqnio versus having an NBD server or FUSE driver.
>>>     >     >     > >
>>>     >     >     > >You have already mentioned performance, but the
>>>design has changed so
>>>     >     >     > >much that I think one of the two things has to
>>>change: either failover
>>>     >     >     > >moves back to QEMU and there is no (closed source)
>>>translator running on
>>>     >     >     > >the node, or the translator needs to speak a
>>>well-known and
>>>     >     >     > >already-supported protocol.
>>>     >     >     >
>>>     >     >     > IMO design has not changed. Implementation has
>>>changed significantly. I would propose that we keep resiliency/failover
>>>code out of QEMU driver and implement it entirely in libqnio as planned
>>>in a subsequent revision. The VxHS server does not need to
>>>understand/handle failover at all.
>>>     >     >     >
>>>     >     >     > Today libqnio gives us significantly better
>>>performance than any NBD/FUSE implementation. We know because we have
>>>prototyped with both. Significant improvements to libqnio are also in
>>>the pipeline which will use cross memory attach calls to further boost
>>>performance. Ofcourse a big reason for the performance is also the
>>>HyperScale storage backend but we believe this method of IO
>>>tapping/redirecting can be leveraged by other solutions as well.
>>>     >     >
>>>     >     >     By "cross memory attach" do you mean
>>>     >     >     process_vm_readv(2)/process_vm_writev(2)?
>>>     >     >
>>>     >     > Ketan> Yes.
>>>     >     >
>>>     >     >     That puts us back to square one in terms of security.
>>>You have
>>>     >     >     (untrusted) QEMU + (untrusted) libqnio directly
>>>accessing the memory of
>>>     >     >     another process on the same machine.  That process is
>>>therefore also
>>>     >     >     untrusted and may only process data for one guest so
>>>that guests stay
>>>     >     >     isolated from each other.
>>>     >     >
>>>     >     > Ketan> Understood but this will be no worse than the
>>>current network based communication between qnio and vxhs server. And
>>>although we have questions around QEMU trust/vulnerability issues, we
>>>are looking to implement basic authentication scheme between libqnio
>>>and vxhs server.
>>>     >
>>>     >     This is incorrect.
>>>     >
>>>     >     Cross memory attach is equivalent to ptrace(2) (i.e.
>>>debugger) access.
>>>     >     It means process A reads/writes directly from/to process B
>>>memory.  Both
>>>     >     processes must have the same uid/gid.  There is no trust
>>>boundary
>>>     >     between them.
>>>     >
>>>     > Ketan> Not if vxhs server is running as root and initiating the
>>>cross mem attach. Which is also why we are proposing a basic
>>>authentication mechanism between qemu-vxhs. But anyway the cross memory
>>>attach is for a near future implementation.
>>>     >
>>>     >     Network communication does not require both processes to
>>>have the same
>>>     >     uid/gid.  If you want multiple QEMU processes talking to a
>>>single server
>>>     >     there must be a trust boundary between client and server.
>>>The server
>>>     >     can validate the input from the client and reject undesired
>>>operations.
>>>     >
>>>     > Ketan> This is what we are trying to propose. With the addition
>>>of authentication between qemu-vxhs server, we should be able to
>>>achieve this. Question is, would that be acceptable?
>>>     >
>>>     >     Hope this makes sense now.
>>>     >
>>>     >     Two architectures that implement the QEMU trust model
>>>correctly are:
>>>     >
>>>     >     1. Cross memory attach: each QEMU process has a dedicated
>>>vxhs server
>>>     >        process to prevent guests from attacking each other.
>>>This is where I
>>>     >        said you might as well put the code inside QEMU since
>>>there is no
>>>     >        isolation anyway.  From what you've said it sounds like
>>>the vxhs
>>>     >        server needs a host-wide view and is responsible for all
>>>guests
>>>     >        running on the host, so I guess we have to rule out this
>>>     >        architecture.
>>>     >
>>>     >     2. Network communication: one vxhs server process and
>>>multiple guests.
>>>     >        Here you might as well use NBD or iSCSI because it
>>>already exists and
>>>     >        the vxhs driver doesn't add any unique functionality over
>>>existing
>>>     >        protocols.
>>>     >
>>>     > Ketan> NBD does not give us the performance we are trying to
>>>achieve. Besides NBD does not have any authentication support.
>>>
>>>     NBD over TCP supports TLS with X.509 certificate authentication.  I
>>>     think Daniel Berrange mentioned that.
>>>
>>> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD
>>>did not have any auth as Daniel Berrange mentioned.
>>>
>>>     NBD over AF_UNIX does not need authentication because it relies on
>>>file
>>>     permissions for access control.  Each guest should have its own
>>>UNIX
>>>     domain socket that it connects to.  That socket can only see
>>>exports
>>>     that have been assigned to the guest.
>>>
>>>     > There is a hybrid 2.a approach which uses both 1 & 2 but I'd
>>>keep that for a later discussion.
>>>
>>>     Please discuss it now so everyone gets on the same page.  I think
>>>there
>>>     is a big gap and we need to communicate so that progress can be
>>>made.
>>>
>>> Ketan> The approach was to use cross mem attach for IO path and a
>>>simplified network IO lib for resiliency/failover. Did not want to
>>>derail the current discussion hence the suggestion to take it up later.
>>
>> Why does the client have to know about failover if it's connected to a
>> server process on the same host?  I thought the server process manages
>> networking issues (like the actual protocol to speak to other VxHS nodes
>> and for failover).
>>
>>>     >     >     There's an easier way to get even better performance:
>>>get rid of libqnio
>>>     >     >     and the external process.  Move the code from the
>>>external process into
>>>     >     >     QEMU to eliminate the
>>>process_vm_readv(2)/process_vm_writev(2) and
>>>     >     >     context switching.
>>>     >     >
>>>     >     >     Can you remind me why there needs to be an external
>>>process?
>>>     >     >
>>>     >     > Ketan>  Apart from virtualizing the available direct
>>>attached storage on the compute, vxhs storage backend (the external
>>>process) provides features such as storage QoS, resiliency, efficient
>>>use of direct attached storage, automatic storage recovery points
>>>(snapshots) etc. Implementing this in QEMU is not practical and not the
>>>purpose of proposing this driver.
>>>     >
>>>     >     This sounds similar to what QEMU and Linux (file systems,
>>>LVM, RAID,
>>>     >     etc) already do.  It brings to mind a third architecture:
>>>     >
>>>     >     3. A Linux driver or file system.  Then QEMU opens a raw
>>>block device.
>>>     >        This is what the Ceph rbd block driver in Linux does.
>>>This
>>>     >        architecture has a kernel-userspace boundary so vxhs does
>>>not have to
>>>     >        trust QEMU.
>>>     >
>>>     >     I suggest Architecture #2.  You'll be able to deploy on
>>>existing systems
>>>     >     because QEMU already supports NBD or iSCSI.  Use the time
>>>you gain from
>>>     >     switching to this architecture on benchmarking and
>>>optimizing NBD or
>>>     >     iSCSI so performance is closer to your goal.
>>>     >
>>>     > Ketan> We have made a choice to go with QEMU driver approach
>>>after serious evaluation of most if not all standard IO tapping
>>>mechanisms including NFS, NBD and FUSE. None of these has been able to
>>>deliver the performance that we have set ourselves to achieve. Hence
>>>the effort to propose this new IO tap which we believe will provide an
>>>alternative to the existing mechanisms and hopefully benefit the
>>>community.
>>>
>>>     I thought the VxHS block driver was another network block driver
>>>like
>>>     GlusterFS or Sheepdog but you are actually proposing a new local
>>>I/O tap
>>>     with the goal of better performance.
>>>
>>> Ketan> The VxHS block driver is a new local IO tap with the goal of
>>>better performance specifically when used with the VxHS server. This
>>>coupled with shared mem IPC (like cross mem attach) could be a much
>>>better IO tap option for qemu users. This will also avoid the context
>>>switches between qemu/network stack and the service that happen today with NBD.
>>>
>>>
>>>     Please share fio(1) or other standard benchmark configuration
>>>files and
>>>     performance results.
>>>
>>> Ketan> We have fio results with the VxHS storage backend which I am
>>>not sure I can share in a public forum.
>>>
>>>     NBD and libqnio wire protocols have comparable performance
>>>     characteristics.  There is no magic that should give either one a
>>>     fundamental edge over the other.  Am I missing something?
>>>
>>> Ketan> I have not seen the NBD code but a few things which we considered
>>>and are part of libqnio (though not exclusively) are low protocol
>>>overhead, threading model, queueing, latencies, memory pools, zero data
>>>copies in user-land, scatter-gather write/read etc. Again these are not
>>>exclusive to libqnio but could give one protocol the edge over the
>>>other. Part of the "magic" is also in the VxHS storage backend
>>>which is able to ingest the IOs with lower latencies.
>>>
>>>     The main performance difference is probably that libqnio opens 8
>>>     simultaneous connections but that's not unique to the wire
>>>protocol.
>>>     What happens when you run 8 NBD simultaneous TCP connections?
>>>
>>> Ketan> Possibly. We have not benchmarked this.
>>
>> There must be benchmark data if you want to add a new feature or modify
>> existing code for performance reasons.  This rule is followed in QEMU so
>> that performance changes are justified.
>>
>> I'm afraid that when you look into the performance you'll find that any
>> performance difference between NBD and this VxHS patch series is due to
>> implementation differences that can be ported across to QEMU NBD, rather
>> than wire protocol differences.
>>
>> If that's the case then it would save a lot of time to use NBD over
>> AF_UNIX for now.  You could focus efforts on achieving the final
>> architecture you've explained with cross memory attach.
>>
>> Please take a look at vhost-user-scsi, which folks from Nutanix are
>> currently working on.  See "[PATCH v2 0/3] Introduce vhost-user-scsi and
>> sample application" on qemu-devel.  It is a true zero-copy local I/O tap
>> because it shares guest RAM.  This is more efficient than cross memory
>> attach's single memory copy.  It does not require running the server as
>> root.  This is the #1 thing you should evaluate for your final
>> architecture.
>>
>> vhost-user-scsi works on the virtio-scsi emulation level.  That means
>> the server must implement the virtio-scsi vring and device emulation.
>> It is not a block driver.  By hooking in at this level you can achieve
>> the best performance but you lose all QEMU block layer functionality and
>> need to implement your own SCSI target.  You also need to consider live
>> migration.
>>
>> Stefan
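
As a rough illustration of the NBD-over-AF_UNIX setup suggested above (a
sketch only; option spellings vary between QEMU versions, and the image and
socket paths here are made up):

  # export a raw test image over a per-guest UNIX domain socket
  qemu-nbd --format=raw --socket=/run/vm1-disk0.sock /path/to/disk0.raw

  # attach it to the guest; only a process that can open that socket
  # can reach the export
  qemu-system-x86_64 ... -drive file=nbd+unix:///?socket=/run/vm1-disk0.sock,format=raw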
Stefan Hajnoczi Nov. 30, 2016, 8:35 a.m. UTC | #60
On Wed, Nov 30, 2016 at 04:20:03AM +0000, Rakesh Ranjan wrote:
> >>>>> Why does the client have to know about failover if it's connected to
> >>>>>a server process on the same host?  I thought the server process
> >>>>>manages networking issues (like the actual protocol to speak to other
> >>>>>VxHS nodes and for failover).
> 
> Just to comment on this, the model being followed within HyperScale is to
> allow application I/O continuity (resiliency) in various cases as
> mentioned below. It really adds value for the consumer/customer and tries to
> avoid single points of failure.
> 
> 1. HyperScale storage service failure (QNIO Server)
> 	- Daemon managing local storage for VMs and runs on each compute node
> 	- Daemon can run as a service on Hypervisor itself as well as within VSA
> (Virtual Storage Appliance or Virtual Machine running on the hypervisor),
> which depends on ecosystem where HyperScale is supported
> 	- Daemon or storage service down/crash/crash-in-loop shouldn't lead to a
> huge impact on all the VMs running on that hypervisor or compute node
> hence providing service level resiliency is very useful for
>           application I/O continuity in such case.
> 
>    Solution:
> 	- The service failure handling can be only done at the client side and
> not at the server side since service running as a server itself is down.
> 	- Client detects an I/O error and depending on the logic, it does
> application I/O failover to another available/active QNIO server or
> HyperScale Storage service running on different compute node
> (reflection/replication node)
> 	- Once the orig/old server comes back online, the client receives a
> negotiated error (not a real application error) to do the application I/O
> failback to the original server or local HyperScale storage service to get
> better I/O performance.
>   
> 2. Local physical storage or media failure
> 	- Once server or HyperScale storage service detects the media or local
> disk failure, depending on the vDisk (guest disk) configuration, if
> another storage copy is available
>           on different compute node then it internally handles the local
> fault and serves the application read and write requests otherwise
> application or client gets the fault.
> 	- Client doesn't know about any I/O failure since Server or Storage
> service manages/handles the fault tolerance.
>         - In such case, in order to get some I/O performance benefit, once
> client gets a negotiated error (not an application error) from local
> server or storage service,
>           client can initiate I/O failover and can directly send
> application I/O to another compute node where storage copy is available to
> serve the application need instead of sending it locally where media is
> faulted.       

Thanks for explaining the model.

The new information for me here is that the qnio server may run in a VM
instead of on the host and that the client will attempt to use a remote
qnio server if the local qnio server fails.

This means that although the discussion most recently focussed on local
I/O tap performance, there is a requirement for a network protocol too.
The local I/O tap stuff is just an optimization for when the local qnio
server can be used.

Stefan
Stefan Hajnoczi Nov. 30, 2016, 9:01 a.m. UTC | #61
On Mon, Nov 28, 2016 at 02:17:56PM +0000, Stefan Hajnoczi wrote:
> Please take a look at vhost-user-scsi, which folks from Nutanix are
> currently working on.  See "[PATCH v2 0/3] Introduce vhost-user-scsi and
> sample application" on qemu-devel.  It is a true zero-copy local I/O tap
> because it shares guest RAM.  This is more efficient than cross memory
> attach's single memory copy.  It does not require running the server as
> root.  This is the #1 thing you should evaluate for your final
> architecture.
> 
> vhost-user-scsi works on the virtio-scsi emulation level.  That means
> the server must implement the virtio-scsi vring and device emulation.
> It is not a block driver.  By hooking in at this level you can achieve
> the best performance but you lose all QEMU block layer functionality and
> need to implement your own SCSI target.  You also need to consider live
> migration.

To clarify why I think vhost-user-scsi is best suited to your
requirements for performance:

With vhost-user-scsi the qnio server would be notified by kvm.ko via
eventfd when the VM submits new I/O requests to the virtio-scsi HBA.
The QEMU process is completely bypassed for I/O request submission and
the qnio server processes the SCSI command instead.  This avoids the
context switch to QEMU and then to the qnio server.  With cross memory
attach QEMU first needs to process the I/O request and hand it to
libqnio before the qnio server can be scheduled.
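
As a minimal sketch of that notification path (assuming the backend has
already received the vring kick eventfd over the vhost-user socket via
VHOST_USER_SET_VRING_KICK; everything else, including vring processing, is
omitted):

  /* Block until the guest kicks the virtqueue, then drain the counter. */
  #include <poll.h>
  #include <stdint.h>
  #include <unistd.h>

  static void wait_for_kicks(int kick_fd)
  {
      struct pollfd pfd = { .fd = kick_fd, .events = POLLIN };

      for (;;) {
          if (poll(&pfd, 1, -1) <= 0) {
              continue;               /* e.g. interrupted by a signal */
          }
          uint64_t count;
          if (read(kick_fd, &count, sizeof(count)) == sizeof(count)) {
              /* 'count' guest notifications arrived: process the available
               * virtio-scsi requests directly from shared guest memory,
               * with no QEMU involvement on this path. */
          }
      }
  }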

The vhost-user-scsi qnio server has shared memory access to guest RAM
and is therefore able to do zero-copy I/O into guest buffers.  Cross
memory attach always incurs a memory copy.
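
For comparison, a sketch of the single copy that cross memory attach implies
(the pid and remote address are placeholders for values the server would
have to learn out of band, e.g. over the qnio protocol):

  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <sys/uio.h>

  /* Pull one request buffer out of the QEMU process into a local buffer.
   * This is one memory copy, and it only works if the caller is allowed
   * to ptrace QEMU (same uid/gid or CAP_SYS_PTRACE). */
  static ssize_t copy_from_qemu(pid_t qemu_pid, void *remote_addr,
                                void *local_buf, size_t len)
  {
      struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
      struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

      return process_vm_readv(qemu_pid, &local, 1, &remote, 1, 0);
  }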

Using this high-performance architecture requires significant changes
though.  vhost-user-scsi hooks into the stack at a different layer so a
QEMU block driver is not used at all.  QEMU also wouldn't use libqnio.
Instead everything will live in your qnio server process (not part of
QEMU).

You'd have to rethink the resiliency strategy because you currently rely
on the QEMU block driver connecting to a different qnio server if the
local qnio server fails.  In the vhost-user-scsi world it's more like
> having a physical SCSI adapter - redundancy and multipathing are used to
achieve resiliency.

For example, virtio-scsi HBA #1 would connect to the local qnio server
process.  virtio-scsi HBA #2 would connect to another local process
called the "proxy process" which forwards requests to a remote qnio
server (using libqnio?).  If HBA #1 fails then I/O is sent to HBA #2
instead.  The path can reset back to HBA #1 once that becomes
operational again.

If the qnio server is supposed to run in a VM instead of directly in the
host environment then it's worth looking at the vhost-pci work that Wei
Wang <wei.w.wang@intel.com> is working on.  The email thread is called
"[PATCH v2 0/4] *** vhost-user spec extension for vhost-pci ***".  The
idea here is to allow inter-VM virtio device emulation so that instead
of terminating the virtio-scsi device in the qnio server process on the
host, you can terminate it inside another VM with good performance
characteristics.

Stefan
diff mbox

Patch

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 7d4031d..1861bb9 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -18,6 +18,7 @@  block-obj-$(CONFIG_LIBNFS) += nfs.o
 block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
 block-obj-$(CONFIG_GLUSTERFS) += gluster.o
+block-obj-$(CONFIG_VXHS) += vxhs.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
@@ -38,6 +39,7 @@  rbd.o-cflags       := $(RBD_CFLAGS)
 rbd.o-libs         := $(RBD_LIBS)
 gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
 gluster.o-libs     := $(GLUSTERFS_LIBS)
+vxhs.o-libs        := $(VXHS_LIBS)
 ssh.o-cflags       := $(LIBSSH2_CFLAGS)
 ssh.o-libs         := $(LIBSSH2_LIBS)
 archipelago.o-libs := $(ARCHIPELAGO_LIBS)
diff --git a/block/trace-events b/block/trace-events
index 05fa13c..44de452 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -114,3 +114,50 @@  qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s
 qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64
 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+
+# block/vxhs.c
+vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c"
+vxhs_iio_callback(int error, int reason) "ctx is NULL: error %d, reason %d"
+vxhs_setup_qnio(void *s) "Context to HyperScale IO manager = %p"
+vxhs_setup_qnio_nwerror(char c) "Could not initialize the network channel. Bailing out%c"
+vxhs_iio_callback_iofail(int err, int reason, void *acb, int seg) "Read/Write failed: error %d, reason %d, acb %p, segment %d"
+vxhs_iio_callback_retry(char *guid, void *acb) "vDisk %s, added acb %p to retry queue (5)"
+vxhs_iio_callback_chnlfail(int error) "QNIO channel failed, no i/o (%d)"
+vxhs_iio_callback_fail(int r, void *acb, int seg, uint64_t size, int err) " ALERT: reason = %d , acb = %p, acb->segments = %d, acb->size = %lu Error = %d"
+vxhs_fail_aio(char * guid, void *acb) "vDisk %s, failing acb %p"
+vxhs_iio_callback_ready(char *vd, int err) "async vxhs_iio_callback: IRP_VDISK_CHECK_IO_FAILOVER_READY completed for vdisk %s with error %d"
+vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o %d, %d"
+vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno %d"
+vxhs_open_fail(int ret) "Could not open the device. Error = %d"
+vxhs_open_epipe(char c) "Could not create a pipe for device. Bailing out%c"
+vxhs_aio_rw(char *guid, int iodir, uint64_t size, uint64_t offset) "vDisk %s, vDisk device is in failed state iodir = %d size = %lu offset = %lu"
+vxhs_aio_rw_retry(char *guid, void *acb, int queue) "vDisk %s, added acb %p to retry queue(%d)"
+vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d"
+vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void *acb, int seg, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %lu offset = %lu ACB = %p Segments = %d. Error = %d, errno = %d"
+vxhs_co_flush(char *guid, int ret, int err) "vDisk (%s) Flush ioctl failed ret = %d errno = %d"
+vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl failed, ret = %d, errno = %d"
+vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat ioctl returned size %lu"
+vxhs_switch_storage_agent(char *ip, char *guid) "Query host %s for vdisk %s"
+vxhs_switch_storage_agent_failed(char *ip, char *guid, int res, int err) "Query to host %s for vdisk %s failed, res = %d, errno = %d"
+vxhs_check_failover_status(char *ip, char *guid) "Switched to storage server host-IP %s for vdisk %s"
+vxhs_check_failover_status_retry(char *guid) "failover_ioctl_cb: keep looking for io target for vdisk %s"
+vxhs_failover_io(char *vdisk) "I/O Failover starting for vDisk %s"
+vxhs_qnio_iio_open(const char *ip) "Failed to connect to storage agent on host-ip %s"
+vxhs_qnio_iio_devopen(const char *fname) "Failed to open vdisk device: %s"
+vxhs_handle_queued_ios(void *acb, int res) "Restarted acb %p res %d"
+vxhs_restart_aio(int dir, int res, int err) "IO ERROR FOR: Read/Write = %d Error = %d, errno = %d"
+vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %ld"
+vxhs_aio_rw_iofail(char *guid) "vDisk %s, I/O operation failed."
+vxhs_aio_rw_devfail(char *guid, int dir, uint64_t size, uint64_t off) "vDisk %s, vDisk device failed iodir = %d size = %lu offset = %lu"
+vxhs_parse_uri_filename(const char *filename) "URI passed via bdrv_parse_filename %s"
+vxhs_qemu_init_vdisk(const char *vdisk_id) "vdisk_id from json %s"
+vxhs_qemu_init_numservers(int num_servers) "Number of servers passed = %d"
+vxhs_parse_uri_hostinfo(int num, char *host, int port) "Host %d: IP %s, Port %d"
+vxhs_qemu_init(char *of_vsa_addr, int port) "Adding host %s:%d to BDRVVXHSState"
+vxhs_qemu_init_filename(const char *filename) "Filename passed as %s"
+vxhs_close(char *vdisk_guid) "Closing vdisk %s"
+vxhs_convert_iovector_to_buffer(size_t len) "Could not allocate buffer for size %zu bytes"
+vxhs_qnio_iio_writev(int res) "iio_writev returned %d"
+vxhs_qnio_iio_writev_err(int iter, uint64_t iov_len, int err) "Error for iteration : %d, iov_len = %lu errno = %d"
+vxhs_qnio_iio_readv(void *ctx, int ret, int error) "Error while issuing read to QNIO. ctx %p Error = %d, errno = %d"
+vxhs_qnio_iio_ioctl(uint32_t opcode) "Error while executing IOCTL. Opcode = %u"
diff --git a/block/vxhs.c b/block/vxhs.c
new file mode 100644
index 0000000..90a4343
--- /dev/null
+++ b/block/vxhs.c
@@ -0,0 +1,1645 @@ 
+/*
+ * QEMU Block driver for Veritas HyperScale (VxHS)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "block/block_int.h"
+#include <qnio/qnio_api.h>
+#include "qapi/qmp/qerror.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "trace.h"
+#include "qemu/uri.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+
+#define QNIO_CONNECT_RETRY_SECS     5
+#define QNIO_CONNECT_TIMOUT_SECS    120
+
+/*
+ * IO specific flags
+ */
+#define IIO_FLAG_ASYNC              0x00000001
+#define IIO_FLAG_DONE               0x00000010
+#define IIO_FLAG_SYNC               0
+
+#define VDISK_FD_READ               0
+#define VDISK_FD_WRITE              1
+#define VXHS_MAX_HOSTS              4
+
+#define VXHS_OPT_FILENAME           "filename"
+#define VXHS_OPT_VDISK_ID           "vdisk_id"
+#define VXHS_OPT_SERVER             "server."
+#define VXHS_OPT_HOST               "host"
+#define VXHS_OPT_PORT               "port"
+
+/* qnio client ioapi_ctx */
+static void *global_qnio_ctx;
+
+/* vdisk prefix to pass to qnio */
+static const char vdisk_prefix[] = "/dev/of/vdisk";
+
+typedef enum {
+    VXHS_IO_INPROGRESS,
+    VXHS_IO_COMPLETED,
+    VXHS_IO_ERROR
+} VXHSIOState;
+
+typedef enum {
+    VDISK_AIO_READ,
+    VDISK_AIO_WRITE,
+    VDISK_STAT,
+    VDISK_TRUNC,
+    VDISK_AIO_FLUSH,
+    VDISK_AIO_RECLAIM,
+    VDISK_GET_GEOMETRY,
+    VDISK_CHECK_IO_FAILOVER_READY,
+    VDISK_AIO_LAST_CMD
+} VDISKAIOCmd;
+
+typedef void (*qnio_callback_t)(ssize_t retval, void *arg);
+
+/*
+ * BDRVVXHSState specific flags
+ */
+#define OF_VDISK_FLAGS_STATE_ACTIVE             0x0000000000000001
+#define OF_VDISK_FLAGS_STATE_FAILED             0x0000000000000002
+#define OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS   0x0000000000000004
+
+#define OF_VDISK_ACTIVE(s)                                              \
+        ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_ACTIVE)
+#define OF_VDISK_SET_ACTIVE(s)                                          \
+        ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_ACTIVE)
+#define OF_VDISK_RESET_ACTIVE(s)                                        \
+        ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_ACTIVE)
+
+#define OF_VDISK_FAILED(s)                                              \
+        ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_FAILED)
+#define OF_VDISK_SET_FAILED(s)                                          \
+        ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_FAILED)
+#define OF_VDISK_RESET_FAILED(s)                                        \
+        ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_FAILED)
+
+#define OF_VDISK_IOFAILOVER_IN_PROGRESS(s)                              \
+        ((s)->vdisk_flags & OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS)
+#define OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s)                          \
+        ((s)->vdisk_flags |= OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS)
+#define OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s)                        \
+        ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS)
+
+/*
+ * VXHSAIOCB specific flags
+ */
+#define OF_ACB_QUEUED               0x00000001
+
+#define OF_AIOCB_FLAGS_QUEUED(a)            \
+        ((a)->flags & OF_ACB_QUEUED)
+#define OF_AIOCB_FLAGS_SET_QUEUED(a)        \
+        ((a)->flags |= OF_ACB_QUEUED)
+#define OF_AIOCB_FLAGS_RESET_QUEUED(a)      \
+        ((a)->flags &= ~OF_ACB_QUEUED)
+
+typedef struct qemu2qnio_ctx {
+    uint32_t            qnio_flag;
+    uint64_t            qnio_size;
+    char                *qnio_channel;
+    char                *target;
+    qnio_callback_t     qnio_cb;
+} qemu2qnio_ctx_t;
+
+typedef qemu2qnio_ctx_t qnio2qemu_ctx_t;
+
+typedef struct LibQNIOSymbol {
+        const char *name;
+        gpointer *addr;
+} LibQNIOSymbol;
+
+/*
+ * HyperScale AIO callbacks structure
+ */
+typedef struct VXHSAIOCB {
+    BlockAIOCB          common;
+    size_t              ret;
+    size_t              size;
+    QEMUBH              *bh;
+    int                 aio_done;
+    int                 segments;
+    int                 flags;
+    size_t              io_offset;
+    QEMUIOVector        *qiov;
+    void                *buffer;
+    int                 direction;  /* IO direction (r/w) */
+    QSIMPLEQ_ENTRY(VXHSAIOCB) retry_entry;
+} VXHSAIOCB;
+
+typedef struct VXHSvDiskHostsInfo {
+    int                 qnio_cfd;   /* Channel FD */
+    int                 vdisk_rfd;  /* vDisk remote FD */
+    char                *hostip;    /* Host's IP addresses */
+    int                 port;       /* Host's port number */
+} VXHSvDiskHostsInfo;
+
+/*
+ * Structure per vDisk maintained for state
+ */
+typedef struct BDRVVXHSState {
+    int                     fds[2];
+    int64_t                 vdisk_size;
+    int64_t                 vdisk_blocks;
+    int64_t                 vdisk_flags;
+    int                     vdisk_aio_count;
+    int                     event_reader_pos;
+    VXHSAIOCB               *qnio_event_acb;
+    void                    *qnio_ctx;
+    QemuSpin                vdisk_lock; /* Lock to protect BDRVVXHSState */
+    QemuSpin                vdisk_acb_lock;  /* Protects ACB */
+    VXHSvDiskHostsInfo      vdisk_hostinfo[VXHS_MAX_HOSTS]; /* Per host info */
+    int                     vdisk_nhosts;   /* Total number of hosts */
+    int                     vdisk_cur_host_idx; /* IOs are being shipped to */
+    int                     vdisk_ask_failover_idx; /*asking permsn to ship io*/
+    QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
+    int                     vdisk_aio_retry_qd; /* Currently for debugging */
+    char                    *vdisk_guid;
+} BDRVVXHSState;
+
+static int vxhs_restart_aio(VXHSAIOCB *acb);
+static void vxhs_check_failover_status(int res, void *ctx);
+
+static void vxhs_inc_acb_segment_count(void *ptr, int count)
+{
+    VXHSAIOCB *acb = ptr;
+    BDRVVXHSState *s = acb->common.bs->opaque;
+
+    qemu_spin_lock(&s->vdisk_acb_lock);
+    acb->segments += count;
+    qemu_spin_unlock(&s->vdisk_acb_lock);
+}
+
+static void vxhs_dec_acb_segment_count(void *ptr, int count)
+{
+    VXHSAIOCB *acb = ptr;
+    BDRVVXHSState *s = acb->common.bs->opaque;
+
+    qemu_spin_lock(&s->vdisk_acb_lock);
+    acb->segments -= count;
+    qemu_spin_unlock(&s->vdisk_acb_lock);
+}
+
+static void vxhs_set_acb_buffer(void *ptr, void *buffer)
+{
+    VXHSAIOCB *acb = ptr;
+
+    acb->buffer = buffer;
+}
+
+static void vxhs_inc_vdisk_iocount(void *ptr, uint32_t count)
+{
+    BDRVVXHSState *s = ptr;
+
+    qemu_spin_lock(&s->vdisk_lock);
+    s->vdisk_aio_count += count;
+    qemu_spin_unlock(&s->vdisk_lock);
+}
+
+static void vxhs_dec_vdisk_iocount(void *ptr, uint32_t count)
+{
+    BDRVVXHSState *s = ptr;
+
+    qemu_spin_lock(&s->vdisk_lock);
+    s->vdisk_aio_count -= count;
+    qemu_spin_unlock(&s->vdisk_lock);
+}
+
+static int32_t
+vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
+                    void *ctx, uint32_t flags)
+{
+    int   ret = 0;
+
+    switch (opcode) {
+    case VDISK_STAT:
+        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
+                                     in, ctx, flags);
+        break;
+
+    case VDISK_AIO_FLUSH:
+        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
+                                     in, ctx, flags);
+        break;
+
+    case VDISK_CHECK_IO_FAILOVER_READY:
+        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
+                                     in, ctx, flags);
+        break;
+
+    default:
+        ret = -ENOTSUP;
+        break;
+    }
+
+    if (ret) {
+        *in = 0;
+        trace_vxhs_qnio_iio_ioctl(opcode);
+    }
+
+    return ret;
+}
+
+static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
+{
+    /*
+     * Close vDisk device
+     */
+    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
+        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
+        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
+    }
+
+    /*
+     * Close QNIO channel against cached channel-fd
+     */
+    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
+        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
+        s->vdisk_hostinfo[idx].qnio_cfd = -1;
+    }
+}
+
+static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
+                              int *rfd, const char *file_name)
+{
+    /*
+     * Open qnio channel to storage agent if not opened before.
+     */
+    if (*cfd < 0) {
+        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
+        if (*cfd < 0) {
+            trace_vxhs_qnio_iio_open(of_vsa_addr);
+            return -ENODEV;
+        }
+    }
+
+    /*
+     * Open vdisk device
+     */
+    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
+
+    if (*rfd < 0) {
+        if (*cfd >= 0) {
+            iio_close(global_qnio_ctx, *cfd);
+            *cfd = -1;
+            *rfd = -1;
+        }
+
+        trace_vxhs_qnio_iio_devopen(file_name);
+        return -ENODEV;
+    }
+
+    return 0;
+}
+
+/*
+ * Try to reopen the vDisk on one of the available hosts
+ * If vDisk reopen is successful on any of the hosts then
+ * check if that node is ready to accept I/O.
+ */
+static int vxhs_reopen_vdisk(BDRVVXHSState *s, int index)
+{
+    VXHSvDiskHostsInfo hostinfo = s->vdisk_hostinfo[index];
+    char *of_vsa_addr = NULL;
+    char *file_name = NULL;
+    int  res = 0;
+
+    /*
+     * Close stale vdisk device remote-fd and channel-fd
+     * since it could be invalid after a channel disconnect.
+     * We will reopen the vdisk later to get the new fd.
+     */
+    vxhs_qnio_iio_close(s, index);
+
+    /*
+     * Build storage agent address and vdisk device name strings
+     */
+    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
+    of_vsa_addr = g_strdup_printf("of://%s:%d",
+                                  hostinfo.hostip, hostinfo.port);
+
+    res = vxhs_qnio_iio_open(&hostinfo.qnio_cfd, of_vsa_addr,
+                             &hostinfo.vdisk_rfd, file_name);
+
+    g_free(of_vsa_addr);
+    g_free(file_name);
+    return res;
+}
+
+static void vxhs_fail_aio(VXHSAIOCB *acb, int err)
+{
+    BDRVVXHSState *s = NULL;
+    int segcount = 0;
+    int rv = 0;
+
+    s = acb->common.bs->opaque;
+
+    trace_vxhs_fail_aio(s->vdisk_guid, acb);
+    if (!acb->ret) {
+        acb->ret = err;
+    }
+    qemu_spin_lock(&s->vdisk_acb_lock);
+    segcount = acb->segments;
+    qemu_spin_unlock(&s->vdisk_acb_lock);
+    if (segcount == 0) {
+        /*
+         * Complete the io request
+         */
+        rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb));
+        if (rv != sizeof(acb)) {
+            error_report("VXHS AIO completion failed: %s",
+                         strerror(errno));
+            abort();
+        }
+    }
+}
+
+static int vxhs_handle_queued_ios(BDRVVXHSState *s)
+{
+    VXHSAIOCB *acb = NULL;
+    int res = 0;
+
+    qemu_spin_lock(&s->vdisk_lock);
+    while ((acb = QSIMPLEQ_FIRST(&s->vdisk_aio_retryq)) != NULL) {
+        /*
+         * Before we process the acb, check whether I/O failover
+         * started again due to failback or cascading failure.
+         */
+        if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
+            qemu_spin_unlock(&s->vdisk_lock);
+            goto out;
+        }
+        QSIMPLEQ_REMOVE_HEAD(&s->vdisk_aio_retryq, retry_entry);
+        s->vdisk_aio_retry_qd--;
+        OF_AIOCB_FLAGS_RESET_QUEUED(acb);
+        if (OF_VDISK_FAILED(s)) {
+            qemu_spin_unlock(&s->vdisk_lock);
+            vxhs_fail_aio(acb, EIO);
+            qemu_spin_lock(&s->vdisk_lock);
+        } else {
+            qemu_spin_unlock(&s->vdisk_lock);
+            res = vxhs_restart_aio(acb);
+            trace_vxhs_handle_queued_ios(acb, res);
+            qemu_spin_lock(&s->vdisk_lock);
+            if (res) {
+                QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq,
+                                     acb, retry_entry);
+                OF_AIOCB_FLAGS_SET_QUEUED(acb);
+                qemu_spin_unlock(&s->vdisk_lock);
+                goto out;
+            }
+        }
+    }
+    qemu_spin_unlock(&s->vdisk_lock);
+out:
+    return res;
+}
+
+/*
+ * If errors are consistent with storage agent failure:
+ *  - Try to reconnect in case the error is transient or the agent restarted.
+ *  - Currently failover is triggered on a per-vDisk basis. There is scope
+ *    for further optimization where failover could be global (per VM).
+ *  - In case of network (storage agent) failure, I/Os for vDisks that have
+ *    no redundancy will be failed without attempting I/O failover, because
+ *    of the stateless nature of the vDisk.
+ *  - If the local or source storage agent is down then send an ioctl to the
+ *    remote storage agent to check whether it is in a state to accept
+ *    application I/Os.
+ *  - Once the remote storage agent is ready to accept I/O, start I/O shipping.
+ *  - If I/Os cannot be serviced then the vDisk will be marked failed so that
+ *    new incoming I/Os are returned with failure immediately.
+ *  - If vDisk I/O failover is in progress then all new/inflight I/Os will be
+ *    queued and will be restarted or failed depending on whether the failover
+ *    operation succeeds.
+ *  - I/O failover can be started either in the I/O forward or I/O backward
+ *    path.
+ *  - I/O failover will be started as soon as all the pending acb(s)
+ *    are queued and there is no pending I/O count.
+ *  - If I/O failover couldn't be completed within QNIO_CONNECT_TIMOUT_SECS
+ *    then the vDisk will be marked failed and all I/Os will be completed with
+ *    an error.
+ */
+
+static int vxhs_switch_storage_agent(BDRVVXHSState *s)
+{
+    int res = 0;
+    int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE);
+
+    trace_vxhs_switch_storage_agent(
+              s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
+              s->vdisk_guid);
+
+    res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx);
+    if (res == 0) {
+        res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
+                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
+                  VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags);
+    } else {
+        trace_vxhs_switch_storage_agent_failed(
+                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
+                  s->vdisk_guid, res, errno);
+        /*
+         * Try the next host.
+         * Calling vxhs_check_failover_status from here ties up the qnio
+         * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1)
+         * for all the hosts in the IO target list.
+         */
+
+        vxhs_check_failover_status(res, s);
+    }
+    return res;
+}
+
+static void vxhs_check_failover_status(int res, void *ctx)
+{
+    BDRVVXHSState *s = ctx;
+
+    if (res == 0) {
+        /* found failover target */
+        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
+        s->vdisk_ask_failover_idx = 0;
+        trace_vxhs_check_failover_status(
+                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
+                   s->vdisk_guid);
+        qemu_spin_lock(&s->vdisk_lock);
+        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
+        qemu_spin_unlock(&s->vdisk_lock);
+        vxhs_handle_queued_ios(s);
+    } else {
+        /* keep looking */
+        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
+        s->vdisk_ask_failover_idx++;
+        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
+            /* pause and cycle through list again */
+            sleep(QNIO_CONNECT_RETRY_SECS);
+            s->vdisk_ask_failover_idx = 0;
+        }
+        res = vxhs_switch_storage_agent(s);
+    }
+}
+
+static int vxhs_failover_io(BDRVVXHSState *s)
+{
+    int res = 0;
+
+    trace_vxhs_failover_io(s->vdisk_guid);
+
+    s->vdisk_ask_failover_idx = 0;
+    res = vxhs_switch_storage_agent(s);
+
+    return res;
+}
+
+static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
+                       uint32_t error, uint32_t opcode)
+{
+    VXHSAIOCB *acb = NULL;
+    BDRVVXHSState *s = NULL;
+    int rv = 0;
+    int segcount = 0;
+
+    switch (opcode) {
+    case IRP_READ_REQUEST:
+    case IRP_WRITE_REQUEST:
+
+    /*
+     * ctx is VXHSAIOCB*
+     * ctx is NULL if error is QNIOERROR_CHANNEL_HUP or reason is IIO_REASON_HUP
+     */
+    if (ctx) {
+        acb = ctx;
+        s = acb->common.bs->opaque;
+    } else {
+        trace_vxhs_iio_callback(error, reason);
+        goto out;
+    }
+
+    if (error) {
+        trace_vxhs_iio_callback_iofail(error, reason, acb, acb->segments);
+
+        if (reason == IIO_REASON_DONE || reason == IIO_REASON_EVENT) {
+            /*
+             * Storage agent failed while I/O was in progress
+             * Fail over only if the qnio channel dropped, indicating
+             * storage agent failure. Don't fail over in response to other
+             * I/O errors such as disk failure.
+             */
+            if (error == QNIOERROR_RETRY_ON_SOURCE || error == QNIOERROR_HUP ||
+                error == QNIOERROR_CHANNEL_HUP || error == -1) {
+                /*
+                 * Start vDisk IO failover once callback is
+                 * called against all the pending IOs.
+                 * If vDisk has no redundancy enabled
+                 * then IO failover routine will mark
+                 * the vDisk failed and fail all the
+                 * AIOs without retry (stateless vDisk)
+                 */
+                qemu_spin_lock(&s->vdisk_lock);
+                if (!OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
+                    OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s);
+                }
+                /*
+                 * Check if this acb has already been queued.
+                 * This is possible if I/Os are submitted
+                 * in multiple segments (QNIO_MAX_IO_SIZE).
+                 */
+                qemu_spin_lock(&s->vdisk_acb_lock);
+                if (!OF_AIOCB_FLAGS_QUEUED(acb)) {
+                    QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq,
+                                         acb, retry_entry);
+                    OF_AIOCB_FLAGS_SET_QUEUED(acb);
+                    s->vdisk_aio_retry_qd++;
+                    trace_vxhs_iio_callback_retry(s->vdisk_guid, acb);
+                }
+                segcount = --acb->segments;
+                qemu_spin_unlock(&s->vdisk_acb_lock);
+                /*
+                 * Decrement AIO count only when callback is called
+                 * against all the segments of aiocb.
+                 */
+                if (segcount == 0 && --s->vdisk_aio_count == 0) {
+                    /*
+                     * Start vDisk I/O failover
+                     */
+                    qemu_spin_unlock(&s->vdisk_lock);
+                    /*
+                     * TODO:
+                     * Need to explore further whether it is possible to
+                     * make the failover operation Virtual-Machine (global)
+                     * specific rather than vDisk specific.
+                     */
+                    vxhs_failover_io(s);
+                    goto out;
+                }
+                qemu_spin_unlock(&s->vdisk_lock);
+                goto out;
+            }
+        } else if (reason == IIO_REASON_HUP) {
+            /*
+             * Channel failed, spontaneous notification,
+             * not in response to I/O
+             */
+            trace_vxhs_iio_callback_chnlfail(error);
+            /*
+             * TODO: Start channel failover when no I/O is outstanding
+             */
+            goto out;
+        } else {
+            trace_vxhs_iio_callback_fail(reason, acb, acb->segments,
+                                         acb->size, error);
+        }
+    }
+    /*
+     * Set the error into the acb if not already set. If the acb is
+     * being submitted in multiple segments then the error needs to
+     * be set only once.
+     *
+     * Once the acb done callback has been called for the last
+     * segment, the acb->ret return status is sent back to the
+     * caller.
+     */
+    qemu_spin_lock(&s->vdisk_acb_lock);
+    if (error && !acb->ret) {
+        acb->ret = error;
+    }
+    --acb->segments;
+    segcount = acb->segments;
+    assert(segcount >= 0);
+    qemu_spin_unlock(&s->vdisk_acb_lock);
+    /*
+     * Check if all the outstanding I/Os are done against acb.
+     * If yes then send signal for AIO completion.
+     */
+    if (segcount == 0) {
+        rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb));
+        if (rv != sizeof(acb)) {
+            error_report("VXHS AIO completion failed: %s", strerror(errno));
+            abort();
+        }
+    }
+    break;
+
+    case IRP_VDISK_CHECK_IO_FAILOVER_READY:
+        /* ctx is BDRVVXHSState* */
+        assert(ctx);
+        trace_vxhs_iio_callback_ready(((BDRVVXHSState *)ctx)->vdisk_guid,
+                                      error);
+        vxhs_check_failover_status(error, ctx);
+        break;
+
+    default:
+        if (reason == IIO_REASON_HUP) {
+            /*
+             * Channel failed, spontaneous notification,
+             * not in response to I/O
+             */
+            trace_vxhs_iio_callback_chnfail(error, errno);
+            /*
+             * TODO: Start channel failover when no I/O is outstanding
+             */
+        } else {
+            trace_vxhs_iio_callback_unknwn(opcode, error);
+        }
+        break;
+    }
+out:
+    return;
+}
+
+static void vxhs_complete_aio(VXHSAIOCB *acb, BDRVVXHSState *s)
+{
+    BlockCompletionFunc *cb = acb->common.cb;
+    void *opaque = acb->common.opaque;
+    int ret = 0;
+
+    if (acb->ret != 0) {
+        trace_vxhs_complete_aio(acb, acb->ret);
+        /*
+         * We mask all the IO errors generically as EIO for upper layers
+         * Right now our IO Manager uses non standard error codes. Instead
+         * of confusing upper layers with incorrect interpretation we are
+         * doing this workaround.
+         */
+        ret = (-EIO);
+    }
+    /*
+     * Copy back contents from stabilization buffer into original iovector
+     * before returning the IO
+     */
+    if (acb->buffer != NULL) {
+        qemu_iovec_from_buf(acb->qiov, 0, acb->buffer, acb->qiov->size);
+        qemu_vfree(acb->buffer);
+        acb->buffer = NULL;
+    }
+    vxhs_dec_vdisk_iocount(s, 1);
+    acb->aio_done = VXHS_IO_COMPLETED;
+    qemu_aio_unref(acb);
+    cb(opaque, ret);
+}
+
+/*
+ * This is the HyperScale event handler registered to QEMU.
+ * It is invoked when an IO completion is written to the pipe
+ * by the callback running in QNIO thread context. It then marks
+ * the AIO as completed and releases the HyperScale AIO callback.
+ */
+static void vxhs_aio_event_reader(void *opaque)
+{
+    BDRVVXHSState *s = opaque;
+    ssize_t ret;
+
+    do {
+        char *p = (char *)&s->qnio_event_acb;
+
+        ret = read(s->fds[VDISK_FD_READ], p + s->event_reader_pos,
+                   sizeof(s->qnio_event_acb) - s->event_reader_pos);
+        if (ret > 0) {
+            s->event_reader_pos += ret;
+            if (s->event_reader_pos == sizeof(s->qnio_event_acb)) {
+                s->event_reader_pos = 0;
+                vxhs_complete_aio(s->qnio_event_acb, s);
+            }
+        }
+    } while (ret < 0 && errno == EINTR);
+}
+
+/*
+ * Call QNIO operation to create channels to do IO on vDisk.
+ */
+
+static void *vxhs_setup_qnio(void)
+{
+    void *qnio_ctx = NULL;
+
+    qnio_ctx = iio_init(vxhs_iio_callback);
+
+    if (qnio_ctx != NULL) {
+        trace_vxhs_setup_qnio(qnio_ctx);
+    } else {
+        trace_vxhs_setup_qnio_nwerror('.');
+    }
+
+    return qnio_ctx;
+}
+
+/*
+ * This helper function allocates a flat bounce buffer the size of the iovector.
+ */
+
+static void *vxhs_convert_iovector_to_buffer(QEMUIOVector *qiov)
+{
+    void *buf = NULL;
+    size_t size = 0;
+
+    if (qiov->niov == 0) {
+        return buf;
+    }
+
+    size = qiov->size;
+    buf = qemu_try_memalign(BDRV_SECTOR_SIZE, size);
+    if (!buf) {
+        trace_vxhs_convert_iovector_to_buffer(size);
+        errno = ENOMEM;
+        return NULL;
+    }
+    return buf;
+}
+
+/*
+ * This helper function iterates over the iovector and checks
+ * if the length of every element is an integral multiple
+ * of the sector size.
+ * Return Value:
+ *      true  : every element is an integral multiple of the sector size
+ *      false : otherwise
+ */
+static int vxhs_is_iovector_read_aligned(QEMUIOVector *qiov, size_t sector)
+{
+    struct iovec *iov = qiov->iov;
+    int niov = qiov->niov;
+    int i;
+
+    for (i = 0; i < niov; i++) {
+        if (iov[i].iov_len % sector != 0) {
+            return false;
+        }
+    }
+    return true;
+}
+
+static int32_t
+vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
+                     uint64_t offset, void *ctx, uint32_t flags)
+{
+    struct iovec cur;
+    uint64_t cur_offset = 0;
+    uint64_t cur_write_len = 0;
+    int segcount = 0;
+    int ret = 0;
+    int i, nsio = 0;
+    int iovcnt = qiov->niov;
+    struct iovec *iov = qiov->iov;
+
+    errno = 0;
+    cur.iov_base = 0;
+    cur.iov_len = 0;
+
+    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
+
+    if (ret == -1 && errno == EFBIG) {
+        trace_vxhs_qnio_iio_writev(ret);
+        /*
+         * The IO size is larger than IIO_IO_BUF_SIZE, hence the I/O needs
+         * to be split at the IIO_IO_BUF_SIZE boundary.
+         * There are two cases here:
+         *  1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE
+         *  2. iovcnt is greater than 1 and IO size is greater than
+         *     IIO_IO_BUF_SIZE.
+         *
+         * The segment count needs to be adjusted: compute it up front
+         * and increase it in one shot instead of incrementing it
+         * iteratively in the loop below. This is required to prevent
+         * any race between the split IO submissions and IO
+         * completion.
+         */
+        cur_offset = offset;
+        for (i = 0; i < iovcnt; i++) {
+            if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) {
+                cur_offset += iov[i].iov_len;
+                nsio++;
+            } else if (iov[i].iov_len > 0) {
+                cur.iov_base = iov[i].iov_base;
+                cur.iov_len = IIO_IO_BUF_SIZE;
+                cur_write_len = 0;
+                while (1) {
+                    nsio++;
+                    cur_write_len += cur.iov_len;
+                    if (cur_write_len == iov[i].iov_len) {
+                        break;
+                    }
+                    cur_offset += cur.iov_len;
+                    cur.iov_base += cur.iov_len;
+                    if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) {
+                        cur.iov_len = IIO_IO_BUF_SIZE;
+                    } else {
+                        cur.iov_len = (iov[i].iov_len - cur_write_len);
+                    }
+                }
+            }
+        }
+
+        segcount = nsio - 1;
+        vxhs_inc_acb_segment_count(ctx, segcount);
+        /*
+         * Split the IO and submit it to QNIO.
+         * Reset the cur_offset before splitting the IO.
+         */
+        cur_offset = offset;
+        nsio = 0;
+        for (i = 0; i < iovcnt; i++) {
+            if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) {
+                errno = 0;
+                ret = iio_writev(qnio_ctx, rfd, &iov[i], 1, cur_offset, ctx,
+                                 flags);
+                if (ret == -1) {
+                    trace_vxhs_qnio_iio_writev_err(i, iov[i].iov_len, errno);
+                    /*
+                     * Need to adjust the AIOCB segment count to prevent
+                     * blocking of AIOCB completion within QEMU block driver.
+                     */
+                    if (segcount > 0 && (segcount - nsio) > 0) {
+                        vxhs_dec_acb_segment_count(ctx, segcount - nsio);
+                    }
+                    return ret;
+                }
+                cur_offset += iov[i].iov_len;
+                nsio++;
+            } else if (iov[i].iov_len > 0) {
+                /*
+                 * This case is where one element of the io vector is > 4MB.
+                 */
+                cur.iov_base = iov[i].iov_base;
+                cur.iov_len = IIO_IO_BUF_SIZE;
+                cur_write_len = 0;
+                while (1) {
+                    nsio++;
+                    errno = 0;
+                    ret = iio_writev(qnio_ctx, rfd, &cur, 1, cur_offset, ctx,
+                                     flags);
+                    if (ret == -1) {
+                        trace_vxhs_qnio_iio_writev_err(i, cur.iov_len, errno);
+                        /*
+                         * Need to adjust the AIOCB segment count to prevent
+                         * blocking of AIOCB completion within the
+                         * QEMU block driver.
+                         */
+                        if (segcount > 0 && (segcount - nsio) > 0) {
+                            vxhs_dec_acb_segment_count(ctx, segcount - nsio);
+                        }
+                        return ret;
+                    }
+
+                    cur_write_len += cur.iov_len;
+                    if (cur_write_len == iov[i].iov_len) {
+                        break;
+                    }
+                    cur_offset += cur.iov_len;
+                    cur.iov_base += cur.iov_len;
+                    if ((iov[i].iov_len - cur_write_len) >
+                                                IIO_IO_BUF_SIZE) {
+                        cur.iov_len = IIO_IO_BUF_SIZE;
+                    } else {
+                        cur.iov_len = (iov[i].iov_len - cur_write_len);
+                    }
+                }
+            }
+        }
+    }
+    return ret;
+}
+
+/*
+ * Iterate over the i/o vector and send read request
+ * to QNIO one by one.
+ */
+static int32_t
+vxhs_qnio_iio_readv(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
+               uint64_t offset, void *ctx, uint32_t flags)
+{
+    uint64_t read_offset = offset;
+    void *buffer = NULL;
+    size_t size;
+    int aligned, segcount;
+    int i, ret = 0;
+    int iovcnt = qiov->niov;
+    struct iovec *iov = qiov->iov;
+
+    aligned = vxhs_is_iovector_read_aligned(qiov, BDRV_SECTOR_SIZE);
+    size = qiov->size;
+
+    if (!aligned) {
+        buffer = vxhs_convert_iovector_to_buffer(qiov);
+        if (buffer == NULL) {
+            return -ENOMEM;
+        }
+
+        errno = 0;
+        ret = iio_read(qnio_ctx, rfd, buffer, size, read_offset, ctx, flags);
+        if (ret != 0) {
+            trace_vxhs_qnio_iio_readv(ctx, ret, errno);
+            qemu_vfree(buffer);
+            return ret;
+        }
+        vxhs_set_acb_buffer(ctx, buffer);
+        return ret;
+    }
+
+    /*
+     * Since the read request is going to be split based on
+     * the number of iovecs, increment the segment count
+     * accordingly before submitting the read request to
+     * QNIO.
+     * This is needed to protect the QEMU block driver IO
+     * completion while read requests for the same IO are
+     * still being submitted to QNIO.
+     */
+    segcount = iovcnt - 1;
+    if (segcount > 0) {
+        vxhs_inc_acb_segment_count(ctx, segcount);
+    }
+
+    for (i = 0; i < iovcnt; i++) {
+        errno = 0;
+        ret = iio_read(qnio_ctx, rfd, iov[i].iov_base, iov[i].iov_len,
+                       read_offset, ctx, flags);
+        if (ret != 0) {
+            trace_vxhs_qnio_iio_readv(ctx, ret, errno);
+            /*
+             * Need to adjust the AIOCB segment count to prevent
+             * blocking of AIOCB completion within QEMU block driver.
+             */
+            if (segcount > 0 && (segcount - i) > 0) {
+                vxhs_dec_acb_segment_count(ctx, segcount - i);
+            }
+            return ret;
+        }
+        read_offset += iov[i].iov_len;
+    }
+
+    return ret;
+}
+
+static int vxhs_restart_aio(VXHSAIOCB *acb)
+{
+    BDRVVXHSState *s = NULL;
+    int iio_flags = 0;
+    int res = 0;
+
+    s = acb->common.bs->opaque;
+
+    if (acb->direction == VDISK_AIO_WRITE) {
+        vxhs_inc_vdisk_iocount(s, 1);
+        vxhs_inc_acb_segment_count(acb, 1);
+        iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
+        res = vxhs_qnio_iio_writev(s->qnio_ctx,
+                s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
+                acb->qiov, acb->io_offset, (void *)acb, iio_flags);
+    }
+
+    if (acb->direction == VDISK_AIO_READ) {
+        vxhs_inc_vdisk_iocount(s, 1);
+        vxhs_inc_acb_segment_count(acb, 1);
+        iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
+        res = vxhs_qnio_iio_readv(s->qnio_ctx,
+                s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
+                acb->qiov, acb->io_offset, (void *)acb, iio_flags);
+    }
+
+    if (res != 0) {
+        vxhs_dec_vdisk_iocount(s, 1);
+        vxhs_dec_acb_segment_count(acb, 1);
+        trace_vxhs_restart_aio(acb->direction, res, errno);
+    }
+
+    return res;
+}
+
+static QemuOptsList runtime_opts = {
+    .name = "vxhs",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = VXHS_OPT_FILENAME,
+            .type = QEMU_OPT_STRING,
+            .help = "URI to the Veritas HyperScale image",
+        },
+        {
+            .name = VXHS_OPT_VDISK_ID,
+            .type = QEMU_OPT_STRING,
+            .help = "UUID of the VxHS vdisk",
+        },
+        { /* end of list */ }
+    },
+};
+
+static QemuOptsList runtime_tcp_opts = {
+    .name = "vxhs_tcp",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head),
+    .desc = {
+        {
+            .name = VXHS_OPT_HOST,
+            .type = QEMU_OPT_STRING,
+            .help = "host address (ipv4 addresses)",
+        },
+        {
+            .name = VXHS_OPT_PORT,
+            .type = QEMU_OPT_NUMBER,
+            .help = "port number on which VxHSD is listening (default 9999)",
+            .def_value_str = "9999"
+        },
+        { /* end of list */ }
+    },
+};
+
+/*
+ * Parse the incoming URI and populate *options with the host information.
+ * The URI syntax only supports a single host. To pass information
+ * for multiple hosts, use the JSON syntax.
+ */
+static int vxhs_parse_uri(const char *filename, QDict *options)
+{
+    URI *uri = NULL;
+    char *hoststr, *portstr;
+    char *port;
+    int ret = 0;
+
+    trace_vxhs_parse_uri_filename(filename);
+
+    uri = uri_parse(filename);
+    if (!uri || !uri->server || !uri->path) {
+        uri_free(uri);
+        return -EINVAL;
+    }
+
+    hoststr = g_strdup(VXHS_OPT_SERVER"0.host");
+    qdict_put(options, hoststr, qstring_from_str(uri->server));
+    g_free(hoststr);
+
+    portstr = g_strdup(VXHS_OPT_SERVER"0.port");
+    if (uri->port) {
+        port = g_strdup_printf("%d", uri->port);
+        qdict_put(options, portstr, qstring_from_str(port));
+        g_free(port);
+    }
+    g_free(portstr);
+
+    if (strstr(uri->path, "vxhs") == NULL) {
+        qdict_put(options, "vdisk_id", qstring_from_str(uri->path));
+    }
+
+    trace_vxhs_parse_uri_hostinfo(1, uri->server, uri->port);
+    uri_free(uri);
+
+    return ret;
+}
+
+static void vxhs_parse_filename(const char *filename, QDict *options,
+                               Error **errp)
+{
+    if (qdict_haskey(options, "vdisk_id") ||
+        qdict_haskey(options, "server")) {
+        error_setg(errp, "vdisk_id/server and a file name may not be specified "
+                         "at the same time");
+        return;
+    }
+
+    if (strstr(filename, "://")) {
+        int ret = vxhs_parse_uri(filename, options);
+        if (ret < 0) {
+            error_setg(errp, "Invalid URI. URI should be of the form "
+                       "  vxhs://<host_ip>:<port>/{<vdisk_id>}");
+        }
+    }
+}
+
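+/*
+ * Absorb the runtime options into the driver state: record the vdisk_id
+ * and the per-host address information, then open the QNIO channel and
+ * the vDisk device on the first host. On success *cfd and *rfd hold the
+ * channel and device descriptors; on failure a negative errno value is
+ * returned and any allocated host information is freed.
+ */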
+static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
+                              int *cfd, int *rfd, Error **errp)
+{
+    QDict *backing_options = NULL;
+    QemuOpts *opts, *tcp_opts;
+    const char *vxhs_filename;
+    char *of_vsa_addr = NULL;
+    Error *local_err = NULL;
+    const char *vdisk_id_opt;
+    char *file_name = NULL;
+    size_t num_servers = 0;
+    char *str = NULL;
+    int ret = 0;
+    int i;
+
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
+    if (vxhs_filename) {
+        trace_vxhs_qemu_init_filename(vxhs_filename);
+    }
+
+    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
+    if (!vdisk_id_opt) {
+        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
+        ret = -EINVAL;
+        goto out;
+    }
+    s->vdisk_guid = g_strdup(vdisk_id_opt);
+    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
+
+    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
+    if (num_servers < 1) {
+        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
+        ret = -EINVAL;
+        goto out;
+    } else if (num_servers > VXHS_MAX_HOSTS) {
+        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
+        error_append_hint(&local_err, "Maximum %d servers allowed.\n",
+                          VXHS_MAX_HOSTS);
+        ret = -EINVAL;
+        goto out;
+    }
+    trace_vxhs_qemu_init_numservers(num_servers);
+
+    for (i = 0; i < num_servers; i++) {
+        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
+        qdict_extract_subqdict(options, &backing_options, str);
+
+        /* Create opts info from runtime_tcp_opts list */
+        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
+        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
+        if (local_err) {
+            qdict_del(backing_options, str);
+            qemu_opts_del(tcp_opts);
+            g_free(str);
+            ret = -EINVAL;
+            goto out;
+        }
+
+        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
+                                                            VXHS_OPT_HOST));
+        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
+                                                                 VXHS_OPT_PORT),
+                                                    NULL, 0);
+
+        s->vdisk_hostinfo[i].qnio_cfd = -1;
+        s->vdisk_hostinfo[i].vdisk_rfd = -1;
+        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
+                             s->vdisk_hostinfo[i].port);
+
+        qdict_del(backing_options, str);
+        qemu_opts_del(tcp_opts);
+        g_free(str);
+    }
+
+    s->vdisk_nhosts = i;
+    s->vdisk_cur_host_idx = 0;
+    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
+    of_vsa_addr = g_strdup_printf("of://%s:%d",
+                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
+                                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
+
+    /*
+     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
+     */
+    if (global_qnio_ctx == NULL) {
+        global_qnio_ctx = vxhs_setup_qnio();
+        if (global_qnio_ctx == NULL) {
+            error_setg(&local_err, "Failed vxhs_setup_qnio");
+            ret = -EINVAL;
+            goto out;
+        }
+    }
+
+    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
+    if (ret < 0) {
+        error_setg(&local_err, "Failed qnio_iio_open");
+        ret = -EIO;
+    }
+
+out:
+    g_free(file_name);
+    g_free(of_vsa_addr);
+    qemu_opts_del(opts);
+
+    if (ret < 0) {
+        for (i = 0; i < num_servers; i++) {
+            g_free(s->vdisk_hostinfo[i].hostip);
+        }
+        g_free(s->vdisk_guid);
+        s->vdisk_guid = NULL;
+        errno = -ret;
+    }
+    error_propagate(errp, local_err);
+
+    return ret;
+}
+
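+/*
+ * Open the vDisk: initialize the driver state from the options, create
+ * the completion pipe, register its read end with the AioContext of the
+ * BDS, and initialize the spin-locks used by the I/O path.
+ */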
+static int vxhs_open(BlockDriverState *bs, QDict *options,
+              int bdrv_flags, Error **errp)
+{
+    BDRVVXHSState *s = bs->opaque;
+    AioContext *aio_context;
+    int qemu_qnio_cfd = -1;
+    int device_opened = 0;
+    int qemu_rfd = -1;
+    int ret = 0;
+    int i;
+
+    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
+    if (ret < 0) {
+        trace_vxhs_open_fail(ret);
+        return ret;
+    }
+
+    device_opened = 1;
+    s->qnio_ctx = global_qnio_ctx;
+    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
+    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
+    s->vdisk_size = 0;
+    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
+
+    /*
+     * Create a pipe for communication between two threads running in
+     * different contexts. Register a handler for the read end, which is
+     * triggered when an I/O completion is signalled by the non-QEMU
+     * context.
+     */
+    ret = qemu_pipe(s->fds);
+    if (ret < 0) {
+        trace_vxhs_open_epipe('.');
+        ret = -errno;
+        goto errout;
+    }
+    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
+
+    aio_context = bdrv_get_aio_context(bs);
+    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
+                       false, vxhs_aio_event_reader, NULL, s);
+
+    /*
+     * Initialize the spin-locks.
+     */
+    qemu_spin_init(&s->vdisk_lock);
+    qemu_spin_init(&s->vdisk_acb_lock);
+
+    return 0;
+
+errout:
+    /*
+     * Close remote vDisk device if it was opened earlier
+     */
+    if (device_opened) {
+        for (i = 0; i < s->vdisk_nhosts; i++) {
+            vxhs_qnio_iio_close(s, i);
+        }
+    }
+    trace_vxhs_open_fail(ret);
+    return ret;
+}
+
+static const AIOCBInfo vxhs_aiocb_info = {
+    .aiocb_size = sizeof(VXHSAIOCB)
+};
+
+/*
+ * Allocate a QEMU-VXHS callback block (ACB) for each I/O request and
+ * pass it to QNIO. When QNIO completes the work, the same ACB is
+ * handed back through the completion callback.
+ */
+static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
+                                int64_t sector_num, QEMUIOVector *qiov,
+                                int nb_sectors,
+                                BlockCompletionFunc *cb,
+                                void *opaque, int iodir)
+{
+    VXHSAIOCB *acb = NULL;
+    BDRVVXHSState *s = bs->opaque;
+    size_t size;
+    uint64_t offset;
+    int iio_flags = 0;
+    int ret = 0;
+    void *qnio_ctx = s->qnio_ctx;
+    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
+
+    offset = sector_num * BDRV_SECTOR_SIZE;
+    size = nb_sectors * BDRV_SECTOR_SIZE;
+
+    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
+    /*
+     * Initialize the VXHSAIOCB.
+     * Every field must be set explicitly because the ACB is
+     * allocated from a slab and is not zero-initialized.
+     */
+    acb->io_offset = offset;
+    acb->size = size;
+    acb->ret = 0;
+    acb->flags = 0;
+    acb->aio_done = VXHS_IO_INPROGRESS;
+    acb->segments = 0;
+    acb->buffer = 0;
+    acb->qiov = qiov;
+    acb->direction = iodir;
+
+    qemu_spin_lock(&s->vdisk_lock);
+    if (OF_VDISK_FAILED(s)) {
+        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
+        qemu_spin_unlock(&s->vdisk_lock);
+        goto errout;
+    }
+    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
+        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
+        s->vdisk_aio_retry_qd++;
+        OF_AIOCB_FLAGS_SET_QUEUED(acb);
+        qemu_spin_unlock(&s->vdisk_lock);
+        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
+        goto out;
+    }
+    s->vdisk_aio_count++;
+    qemu_spin_unlock(&s->vdisk_lock);
+
+    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
+
+    switch (iodir) {
+    case VDISK_AIO_WRITE:
+        vxhs_inc_acb_segment_count(acb, 1);
+        ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
+                                   offset, (void *)acb, iio_flags);
+        break;
+    case VDISK_AIO_READ:
+        vxhs_inc_acb_segment_count(acb, 1);
+        ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
+                                  offset, (void *)acb, iio_flags);
+        break;
+    default:
+        trace_vxhs_aio_rw_invalid(iodir);
+        goto errout;
+    }
+
+    if (ret != 0) {
+        trace_vxhs_aio_rw_ioerr(
+                  s->vdisk_guid, iodir, size, offset,
+                  acb, acb->segments, ret, errno);
+        /*
+         * Don't retry I/Os against a vDisk that has no
+         * redundancy or whose stateful storage is on the
+         * compute node.
+         *
+         * TODO: Revisit this code path to see whether any
+         *       particular error needs special handling.
+         *       For now the I/O is simply failed.
+         */
+        qemu_spin_lock(&s->vdisk_lock);
+        if (s->vdisk_nhosts == 1) {
+            trace_vxhs_aio_rw_iofail(s->vdisk_guid);
+            s->vdisk_aio_count--;
+            vxhs_dec_acb_segment_count(acb, 1);
+            qemu_spin_unlock(&s->vdisk_lock);
+            goto errout;
+        }
+        if (OF_VDISK_FAILED(s)) {
+            trace_vxhs_aio_rw_devfail(
+                      s->vdisk_guid, iodir, size, offset);
+            s->vdisk_aio_count--;
+            vxhs_dec_acb_segment_count(acb, 1);
+            qemu_spin_unlock(&s->vdisk_lock);
+            goto errout;
+        }
+        if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
+            /*
+             * Queue all incoming I/O requests once failover starts.
+             * The number of requests that can arrive is bounded by the
+             * I/O queue depth, so an application issuing many independent
+             * I/Os cannot exhaust memory.
+             */
+            QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
+            s->vdisk_aio_retry_qd++;
+            OF_AIOCB_FLAGS_SET_QUEUED(acb);
+            s->vdisk_aio_count--;
+            vxhs_dec_acb_segment_count(acb, 1);
+            qemu_spin_unlock(&s->vdisk_lock);
+            trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 2);
+            goto out;
+        }
+        OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s);
+        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
+        s->vdisk_aio_retry_qd++;
+        OF_AIOCB_FLAGS_SET_QUEUED(acb);
+        vxhs_dec_acb_segment_count(acb, 1);
+        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 3);
+        /*
+         * Start I/O failover if there is no other active
+         * AIO within the vxhs block driver.
+         */
+        if (--s->vdisk_aio_count == 0) {
+            qemu_spin_unlock(&s->vdisk_lock);
+            /*
+             * Start IO failover
+             */
+            vxhs_failover_io(s);
+            goto out;
+        }
+        qemu_spin_unlock(&s->vdisk_lock);
+    }
+
+out:
+    return &acb->common;
+
+errout:
+    qemu_aio_unref(acb);
+    return NULL;
+}
+
+static BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs,
+                                   int64_t sector_num, QEMUIOVector *qiov,
+                                   int nb_sectors,
+                                   BlockCompletionFunc *cb, void *opaque)
+{
+    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors,
+                         cb, opaque, VDISK_AIO_READ);
+}
+
+static BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs,
+                                    int64_t sector_num, QEMUIOVector *qiov,
+                                    int nb_sectors,
+                                    BlockCompletionFunc *cb, void *opaque)
+{
+    return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors,
+                         cb, opaque, VDISK_AIO_WRITE);
+}
+
+static void vxhs_close(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+    int i;
+
+    trace_vxhs_close(s->vdisk_guid);
+
+    /*
+     * Clear the completion-pipe event handler registered with QEMU
+     * before closing the pipe file descriptors.
+     */
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
+                       false, NULL, NULL, NULL);
+    close(s->fds[VDISK_FD_READ]);
+    close(s->fds[VDISK_FD_WRITE]);
+    g_free(s->vdisk_guid);
+    s->vdisk_guid = NULL;
+
+    for (i = 0; i < VXHS_MAX_HOSTS; i++) {
+        vxhs_qnio_iio_close(s, i);
+        /*
+         * Free the dynamically allocated hostip string
+         */
+        g_free(s->vdisk_hostinfo[i].hostip);
+        s->vdisk_hostinfo[i].hostip = NULL;
+        s->vdisk_hostinfo[i].port = 0;
+    }
+}
+
+/*
+ * This is called by QEMU when a flush gets triggered from within
+ * a guest at the block layer, either for IDE or SCSI disks.
+ */
+static int vxhs_co_flush(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+    int64_t size = 0;
+    int ret = 0;
+
+    /*
+     * VDISK_AIO_FLUSH ioctl is a no-op at present and will
+     * always return success. This could change in the future.
+     */
+    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
+            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
+            VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);
+
+    if (ret < 0) {
+        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
+        vxhs_close(bs);
+    }
+
+    return ret;
+}
+
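+/*
+ * Query the vDisk size in bytes via the VDISK_STAT ioctl.
+ * Returns 0 if the query fails.
+ */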
+static int64_t vxhs_get_vdisk_stat(BDRVVXHSState *s)
+{
+    int64_t vdisk_size = 0;
+    int ret = 0;
+
+    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
+            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
+            VDISK_STAT, &vdisk_size, NULL, 0);
+
+    if (ret < 0) {
+        trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
+        return 0;
+    }
+
+    trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size);
+    return vdisk_size;
+}
+
+/*
+ * Returns the size of the vDisk in bytes. This is required by the
+ * QEMU block layer so that the size is visible to the guest.
+ */
+static int64_t vxhs_getlength(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+    int64_t vdisk_size = 0;
+
+    if (s->vdisk_size > 0) {
+        vdisk_size = s->vdisk_size;
+    } else {
+        /*
+         * Fetch the vDisk size using stat ioctl
+         */
+        vdisk_size = vxhs_get_vdisk_stat(s);
+        if (vdisk_size > 0) {
+            s->vdisk_size = vdisk_size;
+        }
+    }
+
+    if (vdisk_size > 0) {
+        return vdisk_size; /* return size in bytes */
+    }
+
+    return -EIO;
+}
+
+/*
+ * Returns the allocated size of the vDisk in bytes. This is required
+ * by the qemu-img utility.
+ */
+static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+    int64_t vdisk_size = 0;
+
+    if (s->vdisk_size > 0) {
+        vdisk_size = s->vdisk_size;
+    } else {
+        /*
+         * TODO:
+         * Once the HyperScale storage virtualizer reports the actual
+         * physical block allocation, fetch that value and return it
+         * to the caller; for now just report the full vDisk size.
+         */
+        vdisk_size = vxhs_get_vdisk_stat(s);
+        if (vdisk_size > 0) {
+            s->vdisk_size = vdisk_size;
+        }
+    }
+
+    if (vdisk_size > 0) {
+        return vdisk_size; /* return size in bytes */
+    }
+
+    return -EIO;
+}
+
+static void vxhs_detach_aio_context(BlockDriverState *bs)
+{
+    BDRVVXHSState *s = bs->opaque;
+
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
+                       false, NULL, NULL, NULL);
+}
+
+static void vxhs_attach_aio_context(BlockDriverState *bs,
+                                   AioContext *new_context)
+{
+    BDRVVXHSState *s = bs->opaque;
+
+    aio_set_fd_handler(new_context, s->fds[VDISK_FD_READ],
+                       false, vxhs_aio_event_reader, NULL, s);
+}
+
+static BlockDriver bdrv_vxhs = {
+    .format_name                  = "vxhs",
+    .protocol_name                = "vxhs",
+    .instance_size                = sizeof(BDRVVXHSState),
+    .bdrv_file_open               = vxhs_open,
+    .bdrv_parse_filename          = vxhs_parse_filename,
+    .bdrv_close                   = vxhs_close,
+    .bdrv_getlength               = vxhs_getlength,
+    .bdrv_get_allocated_file_size = vxhs_get_allocated_blocks,
+    .bdrv_aio_readv               = vxhs_aio_readv,
+    .bdrv_aio_writev              = vxhs_aio_writev,
+    .bdrv_co_flush_to_disk        = vxhs_co_flush,
+    .bdrv_detach_aio_context      = vxhs_detach_aio_context,
+    .bdrv_attach_aio_context      = vxhs_attach_aio_context,
+};
+
+static void bdrv_vxhs_init(void)
+{
+    trace_vxhs_bdrv_init('.');
+    bdrv_register(&bdrv_vxhs);
+}
+
+block_init(bdrv_vxhs_init);
diff --git a/configure b/configure
index 8fa62ad..50fe935 100755
--- a/configure
+++ b/configure
@@ -320,6 +320,7 @@  numa=""
 tcmalloc="no"
 jemalloc="no"
 replication="yes"
+vxhs=""
 
 # parse CC options first
 for opt do
@@ -1159,6 +1160,11 @@  for opt do
   ;;
   --enable-replication) replication="yes"
   ;;
+  --disable-vxhs) vxhs="no"
+  ;;
+  --enable-vxhs) vxhs="yes"
+  ;;
+
   *)
       echo "ERROR: unknown option $opt"
       echo "Try '$0 --help' for more information"
@@ -1388,6 +1394,7 @@  disabled with --disable-FEATURE, default is enabled if available:
   tcmalloc        tcmalloc support
   jemalloc        jemalloc support
   replication     replication support
+  vxhs            Veritas HyperScale vDisk backend support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -4513,6 +4520,33 @@  if do_cc -nostdlib -Wl,-r -Wl,--no-relax -o $TMPMO $TMPO; then
 fi
 
 ##########################################
+# Veritas HyperScale block driver VxHS
+# Check if libqnio is installed
+
+if test "$vxhs" != "no" ; then
+  cat > $TMPC <<EOF
+#include <stdint.h>
+#include <qnio/qnio_api.h>
+
+void *vxhs_callback;
+
+int main(void) {
+    iio_init(vxhs_callback);
+    return 0;
+}
+EOF
+  vxhs_libs="-lqnio"
+  if compile_prog "" "$vxhs_libs" ; then
+    vxhs=yes
+  else
+    if test "$vxhs" = "yes" ; then
+      feature_not_found "vxhs block device" "Install libqnio from its GitHub repository"
+    fi
+    vxhs=no
+  fi
+fi
+
+##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
 
@@ -4877,6 +4911,7 @@  echo "tcmalloc support  $tcmalloc"
 echo "jemalloc support  $jemalloc"
 echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
+echo "VxHS block device $vxhs"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -5465,6 +5500,12 @@  if test "$pthread_setname_np" = "yes" ; then
   echo "CONFIG_PTHREAD_SETNAME_NP=y" >> $config_host_mak
 fi
 
+if test "$vxhs" = "yes" ; then
+  echo "CONFIG_VXHS=y" >> $config_host_mak
+  echo "VXHS_CFLAGS=$vxhs_cflags" >> $config_host_mak
+  echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then