Message ID | 1475035789-685-1-git-send-email-ashish.mittal@veritas.com (mailing list archive) |
---|---|
State | New, archived |
Thanks, this looks much better! A few more remarks and code smells, but much smaller than the previous review. > +static int32_t > +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, > + void *ctx, uint32_t flags) The "BDRVVXHSState *s, int idx" interface would apply here too. > +{ > + int ret = 0; > + > + switch (opcode) { > + case VDISK_STAT: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, > + in, ctx, flags); > + break; > + > + case VDISK_AIO_FLUSH: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, > + in, ctx, flags); > + break; > + > + case VDISK_CHECK_IO_FAILOVER_READY: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, > + in, ctx, flags); > + break; > + > + default: > + ret = -ENOTSUP; > + break; > + } > + > + if (ret) { > + *in = 0; > + trace_vxhs_qnio_iio_ioctl(opcode); > + } > + > + return ret; > +} > + > +/* > + * Try to reopen the vDisk on one of the available hosts > + * If vDisk reopen is successful on any of the host then > + * check if that node is ready to accept I/O. > + */ > +static int vxhs_reopen_vdisk(BDRVVXHSState *s, int index) > +{ > + VXHSvDiskHostsInfo hostinfo = s->vdisk_hostinfo[index]; This should almost definitely be VXHSvDiskHostsInfo *hostinfo = &s->vdisk_hostinfo[index]; and below: res = vxhs_qnio_iio_open(&hostinfo->qnio_cfd, of_vsa_addr, &hostinfo->vdisk_rfd, file_name); How was the failover code tested? > +/* > + * This helper function converts an array of iovectors into a flat buffer. > + */ > + > +static void *vxhs_convert_iovector_to_buffer(QEMUIOVector *qiov) Please rename to vxhs_allocate_buffer_for_iovector. > +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, > + int *cfd, int *rfd, Error **errp) Passing the cfd and rfd as int* is unnecessary, because here: > + s->vdisk_cur_host_idx = 0; > + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); > + of_vsa_addr = g_strdup_printf("of://%s:%d", > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); > + > + /* > + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. > + */ > + if (global_qnio_ctx == NULL) { > + global_qnio_ctx = vxhs_setup_qnio(); > + if (global_qnio_ctx == NULL) { > + error_setg(&local_err, "Failed vxhs_setup_qnio"); > + ret = -EINVAL; > + goto out; > + } > + } > + > + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); > + if (!ret) { > + error_setg(&local_err, "Failed qnio_iio_open"); > + ret = -EIO; > + } > +out: > + g_free(file_name); > + g_free(of_vsa_addr); ... you're basically doing the same as /* ... create global_qnio_ctx ... */ s->qnio_ctx = global_qnio_ctx; s->vdisk_cur_host_idx = 0; ret = vxhs_reopen_vdisk(s, s->vdisk_cur_host_idx); I suggest that you use vxhs_reopen_vdisk (rename it if you prefer; in particular it's not specific to a *re*open). vxhs_qnio_iio_open remains an internal function to vxhs_reopen_vdisk. This would have also caught the bug in vxhs_reopen_vdisk! :-> > +errout: > + /* > + * Close remote vDisk device if it was opened earlier > + */ > + if (device_opened) { > + for (i = 0; i < s->vdisk_nhosts; i++) { > + vxhs_qnio_iio_close(s, i); > + } > + } No need for device_opened, since you have s->vdisk_nhosts = 0 on entry and qnio_cfd/vdisk_rfd initialized to -1 at the point where s->vdisk_nhosts become nonzero. On the other hand, you could also consider inverting the initialization order. If you open the pipe first, cleaning it up is much easier than cleaning up the backends. 
> +static void vxhs_close(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int i; > + > + trace_vxhs_close(s->vdisk_guid); > + close(s->fds[VDISK_FD_READ]); > + close(s->fds[VDISK_FD_WRITE]); > + > + /* > + * Clearing all the event handlers for oflame registered to QEMU > + */ > + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], > + false, NULL, NULL, NULL); > + g_free(s->vdisk_guid); > + s->vdisk_guid = NULL; > + > + for (i = 0; i < VXHS_MAX_HOSTS; i++) { Only loop up to s->vdisk_nhosts. > + vxhs_qnio_iio_close(s, i); > + /* > + * Free the dynamically allocated hostip string > + */ > + g_free(s->vdisk_hostinfo[i].hostip); > + s->vdisk_hostinfo[i].hostip = NULL; > + s->vdisk_hostinfo[i].port = 0; > + } > +} > + > +/* > + * Returns the size of vDisk in bytes. This is required > + * by QEMU block upper block layer so that it is visible > + * to guest. > + */ > +static int64_t vxhs_getlength(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t vdisk_size = 0; > + > + if (s->vdisk_size > 0) { > + vdisk_size = s->vdisk_size; > + } else { > + /* > + * Fetch the vDisk size using stat ioctl > + */ > + vdisk_size = vxhs_get_vdisk_stat(s); > + if (vdisk_size > 0) { > + s->vdisk_size = vdisk_size; > + } > + } > + > + if (vdisk_size > 0) { > + return vdisk_size; /* return size in bytes */ > + } > + > + return -EIO; Simpler: BDRVVXHSState *s = bs->opaque; if (s->vdisk_size == 0) { int64_t vdisk_size = vxhs_get_vdisk_stat(s); if (vdisk_size == 0) { return -EIO; } s->vdisk_size = vdisk_size; } return s->vdisk_size; > +} > + > +/* > + * Returns actual blocks allocated for the vDisk. > + * This is required by qemu-img utility. > + */ > +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t vdisk_size = 0; Just return vxhs_getlength(bs). Paolo
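[Editorial note] To make the pointer fix above concrete, here is a minimal sketch of how vxhs_reopen_vdisk() might look with the struct copy replaced by a pointer. It assumes the vdisk_prefix, qnio_cfd and vdisk_rfd names from the quoted patch and is not the submitted implementation:

```c
/* Sketch only: assumes the helpers and field names quoted in this review. */
static int vxhs_reopen_vdisk(BDRVVXHSState *s, int index)
{
    VXHSvDiskHostsInfo *hostinfo = &s->vdisk_hostinfo[index];
    char *of_vsa_addr = g_strdup_printf("of://%s:%d",
                                        hostinfo->hostip, hostinfo->port);
    char *file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
    int res;

    /* Close any stale descriptors for this host before reopening. */
    vxhs_qnio_iio_close(s, index);

    res = vxhs_qnio_iio_open(&hostinfo->qnio_cfd, of_vsa_addr,
                             &hostinfo->vdisk_rfd, file_name);

    g_free(of_vsa_addr);
    g_free(file_name);
    return res;
}
```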
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c" Why do several trace events have a %c format specifier at the end and it always takes a '.' value? > +#define QNIO_CONNECT_TIMOUT_SECS 120 This isn't used and there is a typo (s/TIMOUT/TIMEOUT/). Can it be dropped? > +static int32_t > +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, > + void *ctx, uint32_t flags) > +{ > + int ret = 0; > + > + switch (opcode) { > + case VDISK_STAT: It seems unnecessary to abstract the iio_ioctl() constants and then have a switch statement to translate to the actual library constants. It makes little sense since the flags argument already uses the library constants. Just use the library's constants. > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, > + in, ctx, flags); > + break; > + > + case VDISK_AIO_FLUSH: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, > + in, ctx, flags); > + break; > + > + case VDISK_CHECK_IO_FAILOVER_READY: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, > + in, ctx, flags); > + break; > + > + default: > + ret = -ENOTSUP; > + break; > + } > + > + if (ret) { > + *in = 0; Some callers pass in = NULL so this will crash. The naming seems wrong: this is an output argument, not an input argument. Please call it "out_val" or similar. > + res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx); > + if (res == 0) { > + res = vxhs_qnio_iio_ioctl(s->qnio_ctx, > + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd, > + VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags); Looking at iio_ioctl(), I'm not sure how this can ever work. The fourth argument is NULL and iio_ioctl() will attempt *vdisk_size = 0 so this will crash. Do you have tests that exercise this code path? > +/* > + * This is called by QEMU when a flush gets triggered from within > + * a guest at the block layer, either for IDE or SCSI disks. > + */ > +static int vxhs_co_flush(BlockDriverState *bs) This is called from coroutine context, please add the coroutine_fn function attribute to document this. > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t size = 0; > + int ret = 0; > + > + /* > + * VDISK_AIO_FLUSH ioctl is a no-op at present and will > + * always return success. This could change in the future. > + */ > + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, > + VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC); This function is not allowed to block. It cannot do a synchronous flush. This line is misleading because the constant is called VDISK_AIO_FLUSH, but looking at the library code I see it's actually a synchronous call that ends up in a loop that sleeps (!) waiting for the response. Please do an async flush and qemu_coroutine_yield() to return control to QEMU's event loop. When the flush completes you can qemu_coroutine_enter() again to return from this function. > + > + if (ret < 0) { > + trace_vxhs_co_flush(s->vdisk_guid, ret, errno); > + vxhs_close(bs); This looks unsafe. Won't it cause double close() calls for s->fds[] when bdrv_close() is called later? > + } > + > + return ret; > +} > + > +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s) sizeof(unsigned long) = 4 on some machines, please change it to int64_t. I also suggest changing the function name to vxhs_get_vdisk_size() since it only provides the size. 
> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t vdisk_size = 0; > + > + if (s->vdisk_size > 0) { > + vdisk_size = s->vdisk_size; > + } else { > + /* > + * TODO: > + * Once HyperScale storage-virtualizer provides > + * actual physical allocation of blocks then > + * fetch that information and return back to the > + * caller but for now just get the full size. > + */ > + vdisk_size = vxhs_get_vdisk_stat(s); > + if (vdisk_size > 0) { > + s->vdisk_size = vdisk_size; > + } > + } > + > + if (vdisk_size > 0) { > + return vdisk_size; /* return size in bytes */ > + } > + > + return -EIO; > +} Why are you implementing this function if vxhs doesn't support querying the allocated file size? Don't return a bogus number. Just don't implement it like other block drivers.
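[Editorial note] As a rough illustration of the async-flush suggestion above, the flush path could take roughly the following shape. The completion callback and the vxhs_qnio_iio_flush_async() helper are hypothetical (libqnio has no such call today), and the callback would have to be dispatched to the BDS AioContext before re-entering the coroutine:

```c
/* Hypothetical sketch, not the driver's current code. */
typedef struct VXHSFlushCB {
    Coroutine *co;
    int ret;
} VXHSFlushCB;

static void vxhs_flush_complete(void *opaque, int ret)
{
    VXHSFlushCB *cb = opaque;

    cb->ret = ret;
    qemu_coroutine_enter(cb->co);       /* resume vxhs_co_flush() */
}

static coroutine_fn int vxhs_co_flush(BlockDriverState *bs)
{
    BDRVVXHSState *s = bs->opaque;
    VXHSFlushCB cb = {
        .co = qemu_coroutine_self(),
        .ret = -EINPROGRESS,
    };
    int ret;

    /* Hypothetical async helper; completion invokes vxhs_flush_complete(). */
    ret = vxhs_qnio_iio_flush_async(s, vxhs_flush_complete, &cb);
    if (ret < 0) {
        return ret;
    }

    qemu_coroutine_yield();             /* give control back to the event loop */
    return cb.ret;                      /* filled in by the completion callback */
}
```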
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > This patch adds support for a new block device type called "vxhs". > Source code for the library that this code loads can be downloaded from: > https://github.com/MittalAshish/libqnio.git > > Sample command line using JSON syntax: > ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}' > > Sample command line using URI syntax: > qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D > > Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com> Have you tried running the qemu-iotests test suite? http://qemu-project.org/Documentation/QemuIoTests The test suite needs to be working in order for this block driver to be merged. Please also consider how to test failover. You can use blkdebug (see docs/blkdebug.txt) to inject I/O errors but perhaps you need something more low-level in the library.
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > This patch adds support for a new block device type called "vxhs". > Source code for the library that this code loads can be downloaded from: > https://github.com/MittalAshish/libqnio.git > > Sample command line using JSON syntax: > ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}' Please line wrap the text here > > Sample command line using URI syntax: > qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D and here. > Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com> > block/Makefile.objs | 2 + > block/trace-events | 47 ++ > block/vxhs.c | 1645 +++++++++++++++++++++++++++++++++++++++++++++++++++ > configure | 41 ++ > 4 files changed, 1735 insertions(+) > create mode 100644 block/vxhs.c This has lost the QAPI schema definition in qapi/block-core.json that was in earlier versions. We should not be adding new block drivers without having a QAPI schema defined for them. I would like to see this use the exact same syntax for specifying the server as is used for Gluster, as it will simplify life for libvirt to only have one format to generate. This would simply renaming the current 'GlusterServer' QAPI struct to be something more generic perhaps "BlockServer" so that it can be shared between both. It also means that the JSON example above must include the 'type' discriminator. Regards, Daniel
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: Review of .bdrv_open() and .bdrv_aio_writev() code paths. The big issues I see in this driver and libqnio: 1. Showstoppers like broken .bdrv_open() and leaking memory on every reply message. 2. Insecure due to missing input validation (network packets and configuration) and incorrect string handling. 3. Not fully asynchronous so QEMU and the guest may hang. Please think about the whole codebase and not just the lines I've pointed out in this review when fixing these sorts of issues. There may be similar instances of these bugs elsewhere and it's important that they are fixed so that this can be merged. > +/* > + * Structure per vDisk maintained for state > + */ > +typedef struct BDRVVXHSState { > + int fds[2]; > + int64_t vdisk_size; > + int64_t vdisk_blocks; > + int64_t vdisk_flags; > + int vdisk_aio_count; > + int event_reader_pos; > + VXHSAIOCB *qnio_event_acb; > + void *qnio_ctx; > + QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */ > + QemuSpin vdisk_acb_lock; /* Protects ACB */ These comments are insufficient for documenting locking. Not all fields are actually protected by these locks. Please order fields according to lock coverage: typedef struct VXHSAIOCB { ... /* Protected by BDRVVXHSState->vdisk_acb_lock */ int segments; ... }; typedef struct BDRVVXHSState { ... /* Protected by vdisk_lock */ QemuSpin vdisk_lock; int vdisk_aio_count; QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; ... } > +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) > +{ > + /* > + * Close vDisk device > + */ > + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { > + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); libqnio comment: Why does iio_devclose() take an unused cfd argument? Perhaps it can be dropped. > + s->vdisk_hostinfo[idx].vdisk_rfd = -1; > + } > + > + /* > + * Close QNIO channel against cached channel-fd > + */ > + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { > + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); libqnio comment: Why does iio_devclose() take an int32_t cfd argument but iio_close() takes a uint32_t cfd argument? > + s->vdisk_hostinfo[idx].qnio_cfd = -1; > + } > +} > + > +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, > + int *rfd, const char *file_name) > +{ > + /* > + * Open qnio channel to storage agent if not opened before. > + */ > + if (*cfd < 0) { > + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); libqnio comments: 1. There is a buffer overflow in qnio_create_channel(). strncpy() is used incorrectly so long hostname or port (both can be 99 characters long) will overflow channel->name[] (64 characters) or channel->port[] (8 characters). strncpy(channel->name, hostname, strlen(hostname) + 1); strncpy(channel->port, port, strlen(port) + 1); The third argument must be the size of the *destination* buffer, not the source buffer. Also note that strncpy() doesn't NUL-terminate the destination string so you must do that manually to ensure there is a NUL byte at the end of the buffer. 2. channel is leaked in the "Failed to open single connection" error case in qnio_create_channel(). 3. If host is longer the 63 characters then the ioapi_ctx->channels and qnio_ctx->channels maps will use different keys due to string truncation in qnio_create_channel(). This means "Channel already exists" in qnio_create_channel() and possibly other things will not work as expected. 
> + if (*cfd < 0) { > + trace_vxhs_qnio_iio_open(of_vsa_addr); > + return -ENODEV; > + } > + } > + > + /* > + * Open vdisk device > + */ > + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); libqnio comment: Buffer overflow in iio_devopen() since chandev[128] is not large enough to hold channel[100] + " " + devpath[arbitrary length] chars: sprintf(chandev, "%s %s", channel, devpath); > + > + if (*rfd < 0) { > + if (*cfd >= 0) { This check is always true. Otherwise the return -ENODEV would have been taken above. The if statement isn't necessary. > +static void vxhs_check_failover_status(int res, void *ctx) > +{ > + BDRVVXHSState *s = ctx; > + > + if (res == 0) { > + /* found failover target */ > + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; > + s->vdisk_ask_failover_idx = 0; > + trace_vxhs_check_failover_status( > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, > + s->vdisk_guid); > + qemu_spin_lock(&s->vdisk_lock); > + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); > + qemu_spin_unlock(&s->vdisk_lock); > + vxhs_handle_queued_ios(s); > + } else { > + /* keep looking */ > + trace_vxhs_check_failover_status_retry(s->vdisk_guid); > + s->vdisk_ask_failover_idx++; > + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { > + /* pause and cycle through list again */ > + sleep(QNIO_CONNECT_RETRY_SECS); This code is called from a QEMU thread via vxhs_aio_rw(). It is not permitted to call sleep() since it will freeze QEMU and probably the guest. If you need a timer you can use QEMU's timer APIs. See aio_timer_new(), timer_new_ns(), timer_mod(), timer_del(), timer_free(). > + s->vdisk_ask_failover_idx = 0; > + } > + res = vxhs_switch_storage_agent(s); > + } > +} > + > +static int vxhs_failover_io(BDRVVXHSState *s) > +{ > + int res = 0; > + > + trace_vxhs_failover_io(s->vdisk_guid); > + > + s->vdisk_ask_failover_idx = 0; > + res = vxhs_switch_storage_agent(s); > + > + return res; > +} > + > +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, > + uint32_t error, uint32_t opcode) This function is doing too much. Especially the failover code should run in the AioContext since it's complex. Don't do failover here because this function is outside the AioContext lock. Do it from AioContext using a QEMUBH like block/rbd.c. > +static int32_t > +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, > + uint64_t offset, void *ctx, uint32_t flags) > +{ > + struct iovec cur; > + uint64_t cur_offset = 0; > + uint64_t cur_write_len = 0; > + int segcount = 0; > + int ret = 0; > + int i, nsio = 0; > + int iovcnt = qiov->niov; > + struct iovec *iov = qiov->iov; > + > + errno = 0; > + cur.iov_base = 0; > + cur.iov_len = 0; > + > + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); libqnio comments: 1. There are blocking connect(2) and getaddrinfo(3) calls in iio_writev() so this may hang for arbitrary amounts of time. This is not permitted in .bdrv_aio_readv()/.bdrv_aio_writev(). Please make qnio actually asynchronous. 2. Where does client_callback() free reply? It looks like every reply message causes a memory leak! 3. Buffer overflow in iio_writev() since device[128] cannot fit the device string generated from the vdisk_guid. 4. Buffer overflow in iio_writev() due to strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is larger than target[64]. Also note the previous comments about strncpy() usage. 5. I don't see any endianness handling or portable alignment of struct fields in the network protocol code. 
Binary network protocols need to take care of these issue for portability. This means libqnio compiled for different architectures will not work. Do you plan to support any other architectures besides x86? 6. The networking code doesn't look robust: kvset uses assert() on input from the network so the other side of the connection could cause SIGABRT (coredump), the client uses the msg pointer as the cookie for the response packet so the server can easily crash the client by sending a bogus cookie value, etc. Even on the client side these things are troublesome but on a server they are guaranteed security issues. I didn't look into it deeply. Please audit the code. > +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, > + int *cfd, int *rfd, Error **errp) > +{ > + QDict *backing_options = NULL; > + QemuOpts *opts, *tcp_opts; > + const char *vxhs_filename; > + char *of_vsa_addr = NULL; > + Error *local_err = NULL; > + const char *vdisk_id_opt; > + char *file_name = NULL; > + size_t num_servers = 0; > + char *str = NULL; > + int ret = 0; > + int i; > + > + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); > + qemu_opts_absorb_qdict(opts, options, &local_err); > + if (local_err) { > + error_propagate(errp, local_err); > + ret = -EINVAL; > + goto out; > + } > + > + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); > + if (vxhs_filename) { > + trace_vxhs_qemu_init_filename(vxhs_filename); > + } > + > + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); > + if (!vdisk_id_opt) { > + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); > + ret = -EINVAL; > + goto out; > + } > + s->vdisk_guid = g_strdup(vdisk_id_opt); > + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); > + > + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); > + if (num_servers < 1) { > + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); > + ret = -EINVAL; > + goto out; > + } else if (num_servers > VXHS_MAX_HOSTS) { > + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); > + error_append_hint(errp, "Maximum %d servers allowed.\n", > + VXHS_MAX_HOSTS); > + ret = -EINVAL; > + goto out; > + } > + trace_vxhs_qemu_init_numservers(num_servers); > + > + for (i = 0; i < num_servers; i++) { > + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); > + qdict_extract_subqdict(options, &backing_options, str); > + > + /* Create opts info from runtime_tcp_opts list */ > + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); > + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); > + if (local_err) { > + qdict_del(backing_options, str); backing_options is leaked and there's no need to delete the str key. > + qemu_opts_del(tcp_opts); > + g_free(str); > + ret = -EINVAL; > + goto out; > + } > + > + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, > + VXHS_OPT_HOST)); > + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, > + VXHS_OPT_PORT), > + NULL, 0); This will segfault if the port option was missing. > + > + s->vdisk_hostinfo[i].qnio_cfd = -1; > + s->vdisk_hostinfo[i].vdisk_rfd = -1; > + trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, > + s->vdisk_hostinfo[i].port); It's not safe to use the %s format specifier for a trace event with a NULL value. In the case where hostip is NULL this could crash on some systems. > + > + qdict_del(backing_options, str); > + qemu_opts_del(tcp_opts); > + g_free(str); > + } backing_options is leaked. 
> + > + s->vdisk_nhosts = i; > + s->vdisk_cur_host_idx = 0; > + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); > + of_vsa_addr = g_strdup_printf("of://%s:%d", > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); Can we get here with num_servers == 0? In that case this would access uninitialized memory. I guess num_servers == 0 does not make sense and there should be an error case for it. > + > + /* > + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. > + */ > + if (global_qnio_ctx == NULL) { > + global_qnio_ctx = vxhs_setup_qnio(); libqnio comment: The client epoll thread should mask all signals (like qemu_thread_create()). Otherwise it may receive signals that it cannot deal with. > + if (global_qnio_ctx == NULL) { > + error_setg(&local_err, "Failed vxhs_setup_qnio"); > + ret = -EINVAL; > + goto out; > + } > + } > + > + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); > + if (!ret) { > + error_setg(&local_err, "Failed qnio_iio_open"); > + ret = -EIO; > + } The return value of vxhs_qnio_iio_open() is 0 for success or -errno for error. I guess you never ran this code! The block driver won't even open successfully. > + > +out: > + g_free(file_name); > + g_free(of_vsa_addr); > + qemu_opts_del(opts); > + > + if (ret < 0) { > + for (i = 0; i < num_servers; i++) { > + g_free(s->vdisk_hostinfo[i].hostip); > + } > + g_free(s->vdisk_guid); > + s->vdisk_guid = NULL; > + errno = -ret; There is no need to set errno here. The return value already contains the error and the caller doesn't look at errno. > + } > + error_propagate(errp, local_err); > + > + return ret; > +} > + > +static int vxhs_open(BlockDriverState *bs, QDict *options, > + int bdrv_flags, Error **errp) > +{ > + BDRVVXHSState *s = bs->opaque; > + AioContext *aio_context; > + int qemu_qnio_cfd = -1; > + int device_opened = 0; > + int qemu_rfd = -1; > + int ret = 0; > + int i; > + > + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); > + if (ret < 0) { > + trace_vxhs_open_fail(ret); > + return ret; > + } > + > + device_opened = 1; > + s->qnio_ctx = global_qnio_ctx; > + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; > + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; > + s->vdisk_size = 0; > + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); > + > + /* > + * Create a pipe for communicating between two threads in different > + * context. Set handler for read event, which gets triggered when > + * IO completion is done by non-QEMU context. > + */ > + ret = qemu_pipe(s->fds); > + if (ret < 0) { > + trace_vxhs_open_epipe('.'); > + ret = -errno; > + goto errout; This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc. bdrv_close() will not be called so this function must do cleanup itself. > + } > + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); > + > + aio_context = bdrv_get_aio_context(bs); > + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], > + false, vxhs_aio_event_reader, NULL, s); > + > + /* > + * Initialize the spin-locks. > + */ > + qemu_spin_init(&s->vdisk_lock); > + qemu_spin_init(&s->vdisk_acb_lock); > + > + return 0; > + > +errout: > + /* > + * Close remote vDisk device if it was opened earlier > + */ > + if (device_opened) { This is always true. The device_opened variable can be removed. > +/* > + * This allocates QEMU-VXHS callback for each IO > + * and is passed to QNIO. When QNIO completes the work, > + * it will be passed back through the callback. 
> + */ > +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, > + int64_t sector_num, QEMUIOVector *qiov, > + int nb_sectors, > + BlockCompletionFunc *cb, > + void *opaque, int iodir) > +{ > + VXHSAIOCB *acb = NULL; > + BDRVVXHSState *s = bs->opaque; > + size_t size; > + uint64_t offset; > + int iio_flags = 0; > + int ret = 0; > + void *qnio_ctx = s->qnio_ctx; > + uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd; > + > + offset = sector_num * BDRV_SECTOR_SIZE; > + size = nb_sectors * BDRV_SECTOR_SIZE; > + > + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); > + /* > + * Setup or initialize VXHSAIOCB. > + * Every single field should be initialized since > + * acb will be picked up from the slab without > + * initializing with zero. > + */ > + acb->io_offset = offset; > + acb->size = size; > + acb->ret = 0; > + acb->flags = 0; > + acb->aio_done = VXHS_IO_INPROGRESS; > + acb->segments = 0; > + acb->buffer = 0; > + acb->qiov = qiov; > + acb->direction = iodir; > + > + qemu_spin_lock(&s->vdisk_lock); > + if (OF_VDISK_FAILED(s)) { > + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); > + qemu_spin_unlock(&s->vdisk_lock); > + goto errout; > + } > + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { > + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); > + s->vdisk_aio_retry_qd++; > + OF_AIOCB_FLAGS_SET_QUEUED(acb); > + qemu_spin_unlock(&s->vdisk_lock); > + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); > + goto out; > + } > + s->vdisk_aio_count++; > + qemu_spin_unlock(&s->vdisk_lock); > + > + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); > + > + switch (iodir) { > + case VDISK_AIO_WRITE: > + vxhs_inc_acb_segment_count(acb, 1); > + ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov, > + offset, (void *)acb, iio_flags); > + break; > + case VDISK_AIO_READ: > + vxhs_inc_acb_segment_count(acb, 1); > + ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov, > + offset, (void *)acb, iio_flags); > + break; > + default: > + trace_vxhs_aio_rw_invalid(iodir); > + goto errout; s->vdisk_aio_count must be decremented before returning. > +static void vxhs_close(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int i; > + > + trace_vxhs_close(s->vdisk_guid); > + close(s->fds[VDISK_FD_READ]); > + close(s->fds[VDISK_FD_WRITE]); > + > + /* > + * Clearing all the event handlers for oflame registered to QEMU > + */ > + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], > + false, NULL, NULL, NULL); Please remove the event handler before closing the fd. I don't think it matters in this case but in other scenarios there could be race conditions if another thread opens an fd and the file descriptor number is reused.
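[Editorial note] Two small sketches for points raised in this review. First, for the strncpy() overflows in qnio_create_channel(), bounding the copy by the destination (snprintf() also guarantees NUL termination) would look roughly like:

```c
/* Destination-bounded copies; the sizes are whatever libqnio actually
 * declares for channel->name[] and channel->port[]. */
snprintf(channel->name, sizeof(channel->name), "%s", hostname);
snprintf(channel->port, sizeof(channel->port), "%s", port);
```

Second, for the sleep() in vxhs_check_failover_status(), a timer-driven retry could look like the following; failover_timer would be a new BDRVVXHSState field and is only an assumption here:

```c
/* Sketch only: retry the host list from the event loop instead of sleeping. */
static void vxhs_failover_retry_cb(void *opaque)
{
    BDRVVXHSState *s = opaque;

    s->vdisk_ask_failover_idx = 0;
    vxhs_switch_storage_agent(s);
}

/* Created once, e.g. in vxhs_open(): */
s->failover_timer = aio_timer_new(bdrv_get_aio_context(bs),
                                  QEMU_CLOCK_REALTIME, SCALE_NS,
                                  vxhs_failover_retry_cb, s);

/* And armed where the code currently calls sleep(): */
timer_mod(s->failover_timer,
          qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
          QNIO_CONNECT_RETRY_SECS * NANOSECONDS_PER_SECOND);
```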
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > This patch adds support for a new block device type called "vxhs". > Source code for the library that this code loads can be downloaded from: > https://github.com/MittalAshish/libqnio.git > > Sample command line using JSON syntax: > ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}' > > Sample command line using URI syntax: > qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D > > Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com> It would be very useful for reviewing the code if there is any sort of technical documentation (whitepaper, text file, email, etc..) that describes the details of VXHS. [...] There is still a good hunk of this patch that I have yet to fully go through, but there is enough here to address first (both from my comments, and others). > + * IO specific flags > + */ > +#define IIO_FLAG_ASYNC 0x00000001 > +#define IIO_FLAG_DONE 0x00000010 > +#define IIO_FLAG_SYNC 0 > + > +#define VDISK_FD_READ 0 > +#define VDISK_FD_WRITE 1 > +#define VXHS_MAX_HOSTS 4 > + > +#define VXHS_OPT_FILENAME "filename" > +#define VXHS_OPT_VDISK_ID "vdisk_id" > +#define VXHS_OPT_SERVER "server." > +#define VXHS_OPT_HOST "host" > +#define VXHS_OPT_PORT "port" > + > +/* qnio client ioapi_ctx */ > +static void *global_qnio_ctx; > + The use of a global qnio_ctx means that you are only going to be able to connect to one vxhs server. I.e., QEMU will not be able to have multiple drives with different VXHS servers, unless I am missing something. Is that by design? I don't see any reason why you could not contain this to the BDRVVXHSState struct, so that it is limited in scope to each drive instance. [...] > + > +static int32_t > +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, > + void *ctx, uint32_t flags) > +{ I looked at the underlying iio_ioctl() code, and it is a bit odd that vdisk_size is always required by it. Especially since it doesn't allow a NULL parameter for it, in case you don't care about it. > + int ret = 0; > + > + switch (opcode) { > + case VDISK_STAT: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, > + in, ctx, flags); > + break; > + > + case VDISK_AIO_FLUSH: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, > + in, ctx, flags); > + break; > + > + case VDISK_CHECK_IO_FAILOVER_READY: > + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, > + in, ctx, flags); > + break; > + > + default: > + ret = -ENOTSUP; > + break; > + } This whole function is necessary only because the underlying iio_ioctl() returns success for unknown opcodes. I think this is a mistake, and you probably don't want to do it this way. The iio_ioctl() function should return -ENOTSUP for unknown opcodes; this is not a hard failure, and it allows us to determine what the underlying library supports. And then this function can disappear completely. Since QEMU and libqnio are decoupled, you won't know at runtime what version of libqnio is available on whatever system it is running on. 
As the library and driver evolve, there will likely come a time when you will want to know in the QEMU driver if the QNIO version installed supports a certain feature. If iio_ioctl (and other functions, as appropriate) return -ENOTSUP for unsupported features, then it becomes easy to probe to see what is supported. I'd actually go a step further - have the iio_ioctl() function not filter on opcodes either, and just blindly pass the opcodes to the server via iio_ioctl_json(). That way you can let the server tell QEMU what it supports, at least as far as ioctl operations go. > + > + if (ret) { > + *in = 0; > + trace_vxhs_qnio_iio_ioctl(opcode); > + } > + > + return ret; > +} > + > +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) > +{ > + /* > + * Close vDisk device > + */ > + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { > + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); > + s->vdisk_hostinfo[idx].vdisk_rfd = -1; > + } > + > + /* > + * Close QNIO channel against cached channel-fd > + */ > + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { > + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); > + s->vdisk_hostinfo[idx].qnio_cfd = -1; > + } > +} > + > +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, > + int *rfd, const char *file_name) > +{ > + /* > + * Open qnio channel to storage agent if not opened before. > + */ > + if (*cfd < 0) { > + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); > + if (*cfd < 0) { > + trace_vxhs_qnio_iio_open(of_vsa_addr); > + return -ENODEV; > + } > + } > + > + /* > + * Open vdisk device > + */ > + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); > + > + if (*rfd < 0) { > + if (*cfd >= 0) { This if conditional can be dropped, *cfd is always >= 0 here. > + iio_close(global_qnio_ctx, *cfd); > + *cfd = -1; > + *rfd = -1; > + } > + > + trace_vxhs_qnio_iio_devopen(file_name); > + return -ENODEV; > + } > + > + return 0; > +} > + [...] > + > +static int vxhs_switch_storage_agent(BDRVVXHSState *s) > +{ > + int res = 0; > + int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE); > + > + trace_vxhs_switch_storage_agent( > + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, > + s->vdisk_guid); > + > + res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx); > + if (res == 0) { > + res = vxhs_qnio_iio_ioctl(s->qnio_ctx, > + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd, > + VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags); Segfault here. The libqnio library doesn't allow NULL for the vdisk size argument (although it should). > + } else { > + trace_vxhs_switch_storage_agent_failed( > + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, > + s->vdisk_guid, res, errno); > + /* > + * Try the next host. > + * Calling vxhs_check_failover_status from here ties up the qnio > + * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1) > + * for all the hosts in the IO target list. 
> + */ > + > + vxhs_check_failover_status(res, s); > + } > + return res; > +} > + > +static void vxhs_check_failover_status(int res, void *ctx) > +{ > + BDRVVXHSState *s = ctx; > + > + if (res == 0) { > + /* found failover target */ > + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; > + s->vdisk_ask_failover_idx = 0; > + trace_vxhs_check_failover_status( > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, > + s->vdisk_guid); > + qemu_spin_lock(&s->vdisk_lock); > + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); > + qemu_spin_unlock(&s->vdisk_lock); > + vxhs_handle_queued_ios(s); > + } else { > + /* keep looking */ > + trace_vxhs_check_failover_status_retry(s->vdisk_guid); > + s->vdisk_ask_failover_idx++; > + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { > + /* pause and cycle through list again */ > + sleep(QNIO_CONNECT_RETRY_SECS); Repeat from the my v6 review comments: this absolutely cannot happen here. This is not just called from your callback, but also from your .bdrv_aio_readv, .bdrv_aio_writev implementations. Don't sleep() QEMU. To resolve this will probably require some redesign of the failover code, and when it is called; because of the variety of code paths that can invoke this code, you also cannot do a coroutine yield here, either. [...] > +static int32_t > +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, > + uint64_t offset, void *ctx, uint32_t flags) > +{ > + struct iovec cur; > + uint64_t cur_offset = 0; > + uint64_t cur_write_len = 0; > + int segcount = 0; > + int ret = 0; > + int i, nsio = 0; > + int iovcnt = qiov->niov; > + struct iovec *iov = qiov->iov; > + > + errno = 0; > + cur.iov_base = 0; > + cur.iov_len = 0; > + > + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); > + > + if (ret == -1 && errno == EFBIG) { > + trace_vxhs_qnio_iio_writev(ret); > + /* > + * IO size is larger than IIO_IO_BUF_SIZE hence need to > + * split the I/O at IIO_IO_BUF_SIZE boundary > + * There are two cases here: > + * 1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE > + * 2. iovcnt is greater than 1 and IO size is greater than > + * IIO_IO_BUF_SIZE. > + * > + * Need to adjust the segment count, for that we need to compute > + * the segment count and increase the segment count in one shot > + * instead of setting iteratively in for loop. It is required to > + * prevent any race between the splitted IO submission and IO > + * completion. > + */ If I understand the for loop below correctly, it is all done to set nsio to the correct value, right? > + cur_offset = offset; > + for (i = 0; i < iovcnt; i++) { > + if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) { > + cur_offset += iov[i].iov_len; > + nsio++; Is this chunk: > + } else if (iov[i].iov_len > 0) { > + cur.iov_base = iov[i].iov_base; > + cur.iov_len = IIO_IO_BUF_SIZE; > + cur_write_len = 0; > + while (1) { > + nsio++; > + cur_write_len += cur.iov_len; > + if (cur_write_len == iov[i].iov_len) { > + break; > + } > + cur_offset += cur.iov_len; > + cur.iov_base += cur.iov_len; > + if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) { > + cur.iov_len = IIO_IO_BUF_SIZE; > + } else { > + cur.iov_len = (iov[i].iov_len - cur_write_len); > + } > + } > + } ... effectively just doing this? tmp = iov[1].iov_len / IIO_IO_BUF_SIZE; nsio += tmp; nsio += iov[1].iov_len % IIO_IO_BUF_SIZE ? 1: 0; > + } > + > + segcount = nsio - 1; > + vxhs_inc_acb_segment_count(ctx, segcount); > + [...] 
> +static int vxhs_open(BlockDriverState *bs, QDict *options, > + int bdrv_flags, Error **errp) > +{ > + BDRVVXHSState *s = bs->opaque; > + AioContext *aio_context; > + int qemu_qnio_cfd = -1; > + int device_opened = 0; > + int qemu_rfd = -1; > + int ret = 0; > + int i; > + > + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); > + if (ret < 0) { > + trace_vxhs_open_fail(ret); > + return ret; > + } > + > + device_opened = 1; This is still unneeded; it is always == 1 in any path that can hit errout. > + s->qnio_ctx = global_qnio_ctx; > + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; > + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; > + s->vdisk_size = 0; > + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); [...] > + > +/* > + * This is called by QEMU when a flush gets triggered from within > + * a guest at the block layer, either for IDE or SCSI disks. > + */ > +static int vxhs_co_flush(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t size = 0; > + int ret = 0; > + > + /* > + * VDISK_AIO_FLUSH ioctl is a no-op at present and will > + * always return success. This could change in the future. > + */ Rather than always return success, since it is a no-op how about have it return -ENOTSUP? That can then be filtered out here, but it gives us the ability to determine later if a version of vxhs supports flush or not. Or, you could just not implement vxhs_co_flush() at all; the block layer will assume success in that case. > + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, > + VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC); > + > + if (ret < 0) { > + trace_vxhs_co_flush(s->vdisk_guid, ret, errno); > + vxhs_close(bs); But if we leave this here, and if you are using the vxhs_close() for cleanup (and I like that you do), you need to make sure to set bs->drv = NULL. Otherwise, 1) subsequent I/O will not have a good day, and 2) the final invocation of bdrv_close() will double free resources. > + } > + > + return ret; > +} > + > +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s) > +{ > + int64_t vdisk_size = 0; > + int ret = 0; > + > + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, > + VDISK_STAT, &vdisk_size, NULL, 0); > + > + if (ret < 0) { > + trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno); > + return 0; Why return 0, rather than the error? > + } > + > + trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size); > + return vdisk_size; > +} > + > +/* > + * Returns the size of vDisk in bytes. This is required > + * by QEMU block upper block layer so that it is visible > + * to guest. > + */ > +static int64_t vxhs_getlength(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t vdisk_size = 0; > + > + if (s->vdisk_size > 0) { > + vdisk_size = s->vdisk_size; > + } else { > + /* > + * Fetch the vDisk size using stat ioctl > + */ > + vdisk_size = vxhs_get_vdisk_stat(s); > + if (vdisk_size > 0) { > + s->vdisk_size = vdisk_size; > + } > + } > + > + if (vdisk_size > 0) { If vxhs_get_vdisk_stat() returned the error rather than 0, then this check is unnecessary (assuming vxhs_qnio_iio_ioctl() returns useful errors itself). > + return vdisk_size; /* return size in bytes */ > + } > + > + return -EIO; > +} > + > +/* > + * Returns actual blocks allocated for the vDisk. > + * This is required by qemu-img utility. > + */ This returns bytes, not blocks. 
> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs) > +{ > + BDRVVXHSState *s = bs->opaque; > + int64_t vdisk_size = 0; > + > + if (s->vdisk_size > 0) { > + vdisk_size = s->vdisk_size; > + } else { > + /* > + * TODO: > + * Once HyperScale storage-virtualizer provides > + * actual physical allocation of blocks then > + * fetch that information and return back to the > + * caller but for now just get the full size. > + */ > + vdisk_size = vxhs_get_vdisk_stat(s); > + if (vdisk_size > 0) { > + s->vdisk_size = vdisk_size; > + } > + } > + > + if (vdisk_size > 0) { > + return vdisk_size; /* return size in bytes */ > + } > + > + return -EIO; > +} 1. This is identical to vxhs_getlength(), so you could just call that, but: 2. Don't, because you are not returning what is expected by this function. If it is not supported yet by VXHS, just don't implement bdrv_get_allocated_file_size. The block driver code will do the right thing, and return -ENOTSUP (which ends up showing up as this value is unavailable). This is much more useful than having the wrong value returned. 3. I guess unless, of course, 100% file allocation is always true, then you can ignore point #2, and just follow #1. -Jeff
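[Editorial note] For reference, the segment-count simplification suggested above, written out per iovec element (the iov[1] in the snippet appears to be a typo for iov[i]), might look like this; the helper name is illustrative:

```c
/* Sketch: count how many IIO_IO_BUF_SIZE-sized requests the split will issue. */
static int vxhs_count_split_segments(QEMUIOVector *qiov)
{
    int i, nsio = 0;

    for (i = 0; i < qiov->niov; i++) {
        if (qiov->iov[i].iov_len > 0) {
            /* DIV_ROUND_UP(len, BUF) == len / BUF + (len % BUF ? 1 : 0) */
            nsio += DIV_ROUND_UP(qiov->iov[i].iov_len, IIO_IO_BUF_SIZE);
        }
    }

    return nsio;
}
```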
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > This patch adds support for a new block device type called "vxhs". > Source code for the library that this code loads can be downloaded from: > https://github.com/MittalAshish/libqnio.git > > Sample command line using JSON syntax: > ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}' > > Sample command line using URI syntax: > qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D > > Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com> Hi Ashish, You've received a lot of feedback to digest on your patch -- creating a whole new block driver can be difficult! If I may make a suggestion: it is usually more productive to address the feedback via email _before_ you code up and send out the next patch version, unless the comments are straightforward and don't need any discussion (many comments often don't). For instance, I appreciate your reply to the feedback on v6, but by the time it hit my inbox v7 was already there, so it more or less kills the discussion and starts the review cycle afresh. Not a big deal, but overall it would probably be more productive if that reply was sent first, and we could discuss things like sleep, coroutines, caching, global ctx instances, etc... :) Those discussion might then help shape the next patch even more, and result in fewer iterations. Thanks, and happy hacking! -Jeff
That makes perfect sense. I will try and follow this method now onwards. Thanks! On Wed, Sep 28, 2016 at 7:18 PM, Jeff Cody <jcody@redhat.com> wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> This patch adds support for a new block device type called "vxhs". >> Source code for the library that this code loads can be downloaded from: >> https://github.com/MittalAshish/libqnio.git >> >> Sample command line using JSON syntax: >> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}' >> >> Sample command line using URI syntax: >> qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D >> >> Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com> > > Hi Ashish, > > You've received a lot of feedback to digest on your patch -- creating a > whole new block driver can be difficult! > > If I may make a suggestion: it is usually more productive to address the > feedback via email _before_ you code up and send out the next patch version, > unless the comments are straightforward and don't need any discussion (many > comments often don't). > > For instance, I appreciate your reply to the feedback on v6, but by the time > it hit my inbox v7 was already there, so it more or less kills the > discussion and starts the review cycle afresh. Not a big deal, but overall > it would probably be more productive if that reply was sent first, and we > could discuss things like sleep, coroutines, caching, global ctx instances, > etc... :) Those discussion might then help shape the next patch even more, > and result in fewer iterations. > > > Thanks, and happy hacking! > -Jeff
On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > This patch adds support for a new block device type called "vxhs". > Source code for the library that this code loads can be downloaded from: > https://github.com/MittalAshish/libqnio.git The QEMU block driver should deal with BlockDriver<->libqnio integration and libqnio should deal with vxhs logic (network protocol, failover, etc). Right now the vxhs logic is spread between both components. If responsibilities aren't cleanly separated between QEMU and libqnio then I see no point in having libqnio. Failover code should move into libqnio so that programs using libqnio avoid duplicating the failover code. Similarly IIO_IO_BUF_SIZE/segments should be handled internally by libqnio so programs using libqnio do not duplicate this code. libqnio itself can be simplified significantly: The multi-threading is not necessary and adds complexity. Right now there seem to be two reasons for multi-threading: shared contexts and the epoll thread. Both can be eliminated as follows. Shared contexts do not make sense in a multi-disk, multi-core environment. Why is it advantages to tie disks to a single context? It's simpler and more multi-core friendly to let every disk have its own connection. The epoll thread forces library users to use thread synchronization when processing callbacks. Look at libiscsi for an example of how to eliminate it. Two APIs are defined: int iscsi_get_fd(iscsi) and int iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT). The program using the library can integrate the fd into its own event loop. The advantage of doing this is that no library threads are necessary and all callbacks are invoked from the program's event loop. Therefore no thread synchronization is needed. If you make these changes then all multi-threading in libqnio and the QEMU block driver can be dropped. There will be less code and it will be simpler.
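[Editorial note] To make the libiscsi-style proposal more concrete, the libqnio surface being described might look roughly like the following. None of these functions exist in libqnio today, and struct qnio_ctx stands in for whatever handle the library would expose:

```c
/* Hypothetical API sketch, in the spirit of iscsi_get_fd()/iscsi_which_events(). */
int qnio_get_fd(struct qnio_ctx *ctx);               /* fd for the caller's event loop */
int qnio_which_events(struct qnio_ctx *ctx);         /* POLLIN and/or POLLOUT wanted */
int qnio_service(struct qnio_ctx *ctx, int revents); /* progress I/O, run callbacks */
```

On the QEMU side the driver would then register that fd with its own AioContext via aio_set_fd_handler(), so completion callbacks run in the BDS AioContext and no extra locking is needed.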
Hi Stefan, others, Thank you for all the review comments. On Wed, Sep 28, 2016 at 4:13 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c" > > Why do several trace events have a %c format specifier at the end and it > always takes a '.' value? > >> +#define QNIO_CONNECT_TIMOUT_SECS 120 > > This isn't used and there is a typo (s/TIMOUT/TIMEOUT/). Can it be > dropped? > >> +static int32_t >> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, >> + void *ctx, uint32_t flags) >> +{ >> + int ret = 0; >> + >> + switch (opcode) { >> + case VDISK_STAT: > > It seems unnecessary to abstract the iio_ioctl() constants and then have > a switch statement to translate to the actual library constants. It > makes little sense since the flags argument already uses the library > constants. Just use the library's constants. > >> + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, >> + in, ctx, flags); >> + break; >> + >> + case VDISK_AIO_FLUSH: >> + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, >> + in, ctx, flags); >> + break; >> + >> + case VDISK_CHECK_IO_FAILOVER_READY: >> + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, >> + in, ctx, flags); >> + break; >> + >> + default: >> + ret = -ENOTSUP; >> + break; >> + } >> + >> + if (ret) { >> + *in = 0; > > Some callers pass in = NULL so this will crash. > > The naming seems wrong: this is an output argument, not an input > argument. Please call it "out_val" or similar. > >> + res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx); >> + if (res == 0) { >> + res = vxhs_qnio_iio_ioctl(s->qnio_ctx, >> + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd, >> + VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags); > > Looking at iio_ioctl(), I'm not sure how this can ever work. The fourth > argument is NULL and iio_ioctl() will attempt *vdisk_size = 0 so this > will crash. > > Do you have tests that exercise this code path? > You are right. This bug crept in to the FAILOVER path when I moved the qnio shim code to qemu. Earlier code in libqnio did *in = 0 on a per-case basis and skipped it for VDISK_CHECK_IO_FAILOVER_READY. I will fix this. We do thoroughly test these code paths, but the problem is that the existing tests do not fully work with the new changes I am doing. I do not yet have test case to test failover with the latest code. I do frequently test using qemu-io (open a vdisk, read, write and re-read to check written data) and also try to bring up an existing guest VM using latest qemu-system-x86_64 binary to make sure I don't regress main functionality. I did not however run these tests for v7 patch, therefore some of the v7 changes do break the code. This patch had some major code reconfiguration over v6, therefore my intention was to just get a feel for whether the main code structure looks good. A lot of changes have been proposed. I will discuss these with the team and get back with inputs. I guess having a test framework is really important at this time. Regards, Ashish On Fri, Sep 30, 2016 at 1:36 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> This patch adds support for a new block device type called "vxhs". 
>> Source code for the library that this code loads can be downloaded from: >> https://github.com/MittalAshish/libqnio.git > > The QEMU block driver should deal with BlockDriver<->libqnio integration > and libqnio should deal with vxhs logic (network protocol, failover, > etc). Right now the vxhs logic is spread between both components. If > responsibilities aren't cleanly separated between QEMU and libqnio then > I see no point in having libqnio. > > Failover code should move into libqnio so that programs using libqnio > avoid duplicating the failover code. > > Similarly IIO_IO_BUF_SIZE/segments should be handled internally by > libqnio so programs using libqnio do not duplicate this code. > > libqnio itself can be simplified significantly: > > The multi-threading is not necessary and adds complexity. Right now > there seem to be two reasons for multi-threading: shared contexts and > the epoll thread. Both can be eliminated as follows. > > Shared contexts do not make sense in a multi-disk, multi-core > environment. Why is it advantages to tie disks to a single context? > It's simpler and more multi-core friendly to let every disk have its own > connection. > > The epoll thread forces library users to use thread synchronization when > processing callbacks. Look at libiscsi for an example of how to > eliminate it. Two APIs are defined: int iscsi_get_fd(iscsi) and int > iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT). The program using the > library can integrate the fd into its own event loop. The advantage of > doing this is that no library threads are necessary and all callbacks > are invoked from the program's event loop. Therefore no thread > synchronization is needed. > > If you make these changes then all multi-threading in libqnio and the > QEMU block driver can be dropped. There will be less code and it will > be simpler.
On Sat, Oct 1, 2016 at 4:10 AM, ashish mittal <ashmit602@gmail.com> wrote: >> If you make these changes then all multi-threading in libqnio and the >> QEMU block driver can be dropped. There will be less code and it will >> be simpler. You'll get a lot of basic tests for free if you add vxhs support to the existing tests/qemu-iotests/ framework. A vxhs server is required so qemu-iotests has something to run against. I saw some server code in libqnio but haven't investigated how complete it is. The main thing is to support read/write/flush. Stefan
On Wed, Sep 28, 2016 at 12:13:32PM +0100, Stefan Hajnoczi wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: [...] > > +/* > > + * This is called by QEMU when a flush gets triggered from within > > + * a guest at the block layer, either for IDE or SCSI disks. > > + */ > > +static int vxhs_co_flush(BlockDriverState *bs) > > This is called from coroutine context, please add the coroutine_fn > function attribute to document this. > > > +{ > > + BDRVVXHSState *s = bs->opaque; > > + int64_t size = 0; > > + int ret = 0; > > + > > + /* > > + * VDISK_AIO_FLUSH ioctl is a no-op at present and will > > + * always return success. This could change in the future. > > + */ > > + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, > > + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, > > + VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC); > > This function is not allowed to block. It cannot do a synchronous > flush. This line is misleading because the constant is called > VDISK_AIO_FLUSH, but looking at the library code I see it's actually a > synchronous call that ends up in a loop that sleeps (!) waiting for the > response. > > Please do an async flush and qemu_coroutine_yield() to return > control to QEMU's event loop. When the flush completes you can > qemu_coroutine_enter() again to return from this function. > > > + > > + if (ret < 0) { > > + trace_vxhs_co_flush(s->vdisk_guid, ret, errno); > > + vxhs_close(bs); > > This looks unsafe. Won't it cause double close() calls for s->fds[] > when bdrv_close() is called later? > Calling the close on a failed flush is a good idea, in my opinion. However, to make it safe, bs->drv MUST be set to NULL after the call to vxhs_close(). That will prevent double free / close calls, and also fail out new I/O. (This is actually what the gluster driver does, for instance). Jeff
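[Editorial note] A sketch of the error-path change described above, shown as the tail of vxhs_co_flush() from the patch with the extra assignment added:

```c
    if (ret < 0) {
        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
        vxhs_close(bs);
        bs->drv = NULL;    /* fail further I/O and keep bdrv_close() from
                              freeing the same resources a second time */
    }

    return ret;
```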
Hi Stefan, 1. Your suggestion to move the failover implementation to libqnio is well taken. In fact, we are proposing that service/network failovers should not be handled in qemu address space at all. The vxhs driver will know and talk to only a single virtual IP. The service behind the virtual IP may fail and move to another node without the qemu driver noticing it. This way the failover logic will be completely out of qemu address space. We are considering use of some of our proprietary clustering/monitoring services to implement service failover. 2. The idea of having a multi-threaded epoll based network client was to drive more throughput by using a multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (a typical VM is assumed to have at least 2) across 8 connections. Each connection is serviced by a single epoll and does not share its context with other connections/epolls. All memory pools/queues are in the context of a connection/epoll. The qemu thread enqueues IO requests in one of the 8 epoll queues using round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs, which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. The number of client epoll threads (8) is a #define in qnio and can easily be changed. However, our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. Let us know what you think about these proposals. Thanks, Ketan. On 9/30/16, 1:36 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> This patch adds support for a new block device type called "vxhs". >> Source code for the library that this code loads can be downloaded from: >> https://github.com/MittalAshish/libqnio.git > >The QEMU block driver should deal with BlockDriver<->libqnio integration >and libqnio should deal with vxhs logic (network protocol, failover, >etc). Right now the vxhs logic is spread between both components. If >responsibilities aren't cleanly separated between QEMU and libqnio then >I see no point in having libqnio. > >Failover code should move into libqnio so that programs using libqnio >avoid duplicating the failover code. > >Similarly IIO_IO_BUF_SIZE/segments should be handled internally by >libqnio so programs using libqnio do not duplicate this code. > >libqnio itself can be simplified significantly: > >The multi-threading is not necessary and adds complexity. Right now >there seem to be two reasons for multi-threading: shared contexts and >the epoll thread. Both can be eliminated as follows. > >Shared contexts do not make sense in a multi-disk, multi-core >environment. Why is it advantages to tie disks to a single context? >It's simpler and more multi-core friendly to let every disk have its own >connection. > >The epoll thread forces library users to use thread synchronization when >processing callbacks. Look at libiscsi for an example of how to >eliminate it. Two APIs are defined: int iscsi_get_fd(iscsi) and int >iscsi_which_events(iscsi) (e.g. POLLIN, POLLOUT). The program using the >library can integrate the fd into its own event loop.
The advantage of >doing this is that no library threads are necessary and all callbacks >are invoked from the program's event loop. Therefore no thread >synchronization is needed. > >If you make these changes then all multi-threading in libqnio and the >QEMU block driver can be dropped. There will be less code and it will >be simpler.
On 20/10/2016 03:31, Ketan Nilangekar wrote: > This way the failover logic will be completely out of qemu address > space. We are considering use of some of our proprietary > clustering/monitoring services to implement service failover. Are you implementing a different protocol just for the sake of QEMU, in other words, and forwarding from that protocol to your proprietary code? If that is what you are doing, you don't need at all a vxhs driver in QEMU. Just implement NBD or iSCSI on your side, QEMU already has drivers for that. Paolo
+ Venky from Storage engineering Sent from my iPhone > On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > >> On 20/10/2016 03:31, Ketan Nilangekar wrote: >> This way the failover logic will be completely out of qemu address >> space. We are considering use of some of our proprietary >> clustering/monitoring services to implement service failover. > > Are you implementing a different protocol just for the sake of QEMU, in > other words, and forwarding from that protocol to your proprietary code? > > If that is what you are doing, you don't need at all a vxhs driver in > QEMU. Just implement NBD or iSCSI on your side, QEMU already has > drivers for that. > > Paolo
We are able to derive significantly better performance from the qemu block driver compared to nbd/iscsi/nfs. We have prototyped nfs- and nbd-based io taps in the past, and the performance of the qemu block driver is significantly better. Hence we would like to go with the vxhs driver for now. Ketan > On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > >> On 20/10/2016 03:31, Ketan Nilangekar wrote: >> This way the failover logic will be completely out of qemu address >> space. We are considering use of some of our proprietary >> clustering/monitoring services to implement service failover. > > Are you implementing a different protocol just for the sake of QEMU, in > other words, and forwarding from that protocol to your proprietary code? > > If that is what you are doing, you don't need at all a vxhs driver in > QEMU. Just implement NBD or iSCSI on your side, QEMU already has > drivers for that. > > Paolo
+ Venky Sent from my iPhone > On Oct 25, 2016, at 7:07 AM, Ketan Nilangekar <Ketan.Nilangekar@veritas.com> wrote: > > We are able to derive significant performance from the qemu block driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd based io tap in the past and the performance of qemu block driver is significantly better. Hence we would like to go with the vxhs driver for now. > > Ketan > > >> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: >> >> >> >>> On 20/10/2016 03:31, Ketan Nilangekar wrote: >>> This way the failover logic will be completely out of qemu address >>> space. We are considering use of some of our proprietary >>> clustering/monitoring services to implement service failover. >> >> Are you implementing a different protocol just for the sake of QEMU, in >> other words, and forwarding from that protocol to your proprietary code? >> >> If that is what you are doing, you don't need at all a vxhs driver in >> QEMU. Just implement NBD or iSCSI on your side, QEMU already has >> drivers for that. >> >> Paolo
On 25/10/2016 07:07, Ketan Nilangekar wrote: > We are able to derive significant performance from the qemu block > driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd > based io tap in the past and the performance of qemu block driver is > significantly better. Hence we would like to go with the vxhs driver > for now. Is this still true with failover implemented outside QEMU (which requires I/O to be proxied, if I'm not mistaken)? What does the benefit come from if so, is it the threaded backend and performing multiple connections to the same server? Paolo > Ketan > > >> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> >> wrote: >> >> >> >>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the >>> failover logic will be completely out of qemu address space. We >>> are considering use of some of our proprietary >>> clustering/monitoring services to implement service failover. >> >> Are you implementing a different protocol just for the sake of >> QEMU, in other words, and forwarding from that protocol to your >> proprietary code? >> >> If that is what you are doing, you don't need at all a vxhs driver >> in QEMU. Just implement NBD or iSCSI on your side, QEMU already >> has drivers for that. >> >> Paolo
On 25/10/2016 23:53, Ketan Nilangekar wrote: > We need to confirm the perf numbers but it really depends on the way we do failover outside qemu. > > We are looking at a vip based failover implementation which may need > some handling code in qnio but that overhead should be minimal (atleast > no more than the current impl in qemu driver) Then it's not outside QEMU's address space, it's only outside block/vxhs.c... I don't understand. Paolo > IMO, the real benefit of qemu + qnio perf comes from: > 1. the epoll based io multiplexer > 2. 8 epoll threads > 3. Zero buffer copies in userland code > 4. Minimal locking > > We are also looking at replacing the existing qnio socket code with > memory readv/writev calls available with the latest kernel for even > better performance. > > Ketan > >> On Oct 25, 2016, at 1:01 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: >> >> >> >>> On 25/10/2016 07:07, Ketan Nilangekar wrote: >>> We are able to derive significant performance from the qemu block >>> driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd >>> based io tap in the past and the performance of qemu block driver is >>> significantly better. Hence we would like to go with the vxhs driver >>> for now. >> >> Is this still true with failover implemented outside QEMU (which >> requires I/O to be proxied, if I'm not mistaken)? What does the benefit >> come from if so, is it the threaded backend and performing multiple >> connections to the same server? >> >> Paolo >> >>> Ketan >>> >>> >>>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> >>>> wrote: >>>> >>>> >>>> >>>>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the >>>>> failover logic will be completely out of qemu address space. We >>>>> are considering use of some of our proprietary >>>>> clustering/monitoring services to implement service failover. >>>> >>>> Are you implementing a different protocol just for the sake of >>>> QEMU, in other words, and forwarding from that protocol to your >>>> proprietary code? >>>> >>>> If that is what you are doing, you don't need at all a vxhs driver >>>> in QEMU. Just implement NBD or iSCSI on your side, QEMU already >>>> has drivers for that. >>>> >>>> Paolo > >
Including the rest of the folks from the original thread. Ketan. On 10/26/16, 9:33 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > >On 26/10/2016 00:39, Ketan Nilangekar wrote: >> >> >>> On Oct 26, 2016, at 12:00 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: >>> >>> >>> >>>> On 25/10/2016 23:53, Ketan Nilangekar wrote: >>>> We need to confirm the perf numbers but it really depends on the way we do failover outside qemu. >>>> >>>> We are looking at a vip based failover implementation which may need >>>> some handling code in qnio but that overhead should be minimal (atleast >>>> no more than the current impl in qemu driver) >>> >>> Then it's not outside QEMU's address space, it's only outside >>> block/vxhs.c... I don't understand. >>> >>> Paolo >>> >> >> Yes and that is something that we are considering and not finalized on a design. But even if some of the failover code is in the qnio library, is that a problem? >> As per my understanding the original suggestions were around getting the failover code out of the block driver and into the network library. >> If an optimal design for this means that some of the failover handling needs to be done in qnio, is that not acceptable? >> The way we see it, driver/qnio will talk to the storage service using a single IP but may have some retry code for retransmitting failed IOs in a failover scenario. > >Sure, that's fine. It's just that it seemed different from the previous >explanation. > >Paolo > >>>> IMO, the real benefit of qemu + qnio perf comes from: >>>> 1. the epoll based io multiplexer >>>> 2. 8 epoll threads >>>> 3. Zero buffer copies in userland code >>>> 4. Minimal locking >>>> >>>> We are also looking at replacing the existing qnio socket code with >>>> memory readv/writev calls available with the latest kernel for even >>>> better performance. >>> >>>> >>>> Ketan >>>> >>>>> On Oct 25, 2016, at 1:01 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: >>>>> >>>>> >>>>> >>>>>> On 25/10/2016 07:07, Ketan Nilangekar wrote: >>>>>> We are able to derive significant performance from the qemu block >>>>>> driver as compared to nbd/iscsi/nfs. We have prototyped nfs and nbd >>>>>> based io tap in the past and the performance of qemu block driver is >>>>>> significantly better. Hence we would like to go with the vxhs driver >>>>>> for now. >>>>> >>>>> Is this still true with failover implemented outside QEMU (which >>>>> requires I/O to be proxied, if I'm not mistaken)? What does the benefit >>>>> come from if so, is it the threaded backend and performing multiple >>>>> connections to the same server? >>>>> >>>>> Paolo >>>>> >>>>>> Ketan >>>>>> >>>>>> >>>>>>> On Oct 24, 2016, at 4:24 PM, Paolo Bonzini <pbonzini@redhat.com> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 20/10/2016 03:31, Ketan Nilangekar wrote: This way the >>>>>>>> failover logic will be completely out of qemu address space. We >>>>>>>> are considering use of some of our proprietary >>>>>>>> clustering/monitoring services to implement service failover. >>>>>>> >>>>>>> Are you implementing a different protocol just for the sake of >>>>>>> QEMU, in other words, and forwarding from that protocol to your >>>>>>> proprietary code? >>>>>>> >>>>>>> If that is what you are doing, you don't need at all a vxhs driver >>>>>>> in QEMU. Just implement NBD or iSCSI on your side, QEMU already >>>>>>> has drivers for that. >>>>>>> >>>>>>> Paolo >>>> >>>>
On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: > 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. > Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. > The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. > The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. > I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. Have you benchmarked against just 1 epoll thread with 8 connections? Stefan
On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: > 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. > Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. > The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. > The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. > I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. By the way, when you benchmark with 8 epoll threads, are there any other guests with vxhs running on the machine? In a real-life situation where multiple VMs are running on a single host it may turn out that giving each VM 8 epoll threads doesn't help at all because the host CPUs are busy with other tasks. Stefan
> On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. > > By the way, when you benchmark with 8 epoll threads, are there any other > guests with vxhs running on the machine? > Yes. In fact, the total throughput with around 4-5 VMs scales well to saturate around 90% of the available storage throughput of a typical PCIe SSD device. > In a real-life situation where multiple VMs are running on a single host > it may turn out that giving each VM 8 epoll threads doesn't help at all > because the host CPUs are busy with other tasks. The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. But as I mentioned, today we can change this simply by rebuilding qnio with a different value for the #define. Ketan > Stefan
> On Nov 4, 2016, at 2:49 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. > > Have you benchmarked against just 1 epoll thread with 8 connections? > The first implementation of qnio was actually single-threaded with 8 connections. The single-VM throughput at the time, IIRC, was less than half of what we are getting now, especially with a workload doing IOs on multiple vdisks. I assume we will need some sort of CPU/core affinity to drive the most out of a single-epoll-thread design. Ketan > Stefan
On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote: > > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: > >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. > >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. > >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. > >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. > >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. > > > > By the way, when you benchmark with 8 epoll threads, are there any other > > guests with vxhs running on the machine? > > > > Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device. > > > In a real-life situation where multiple VMs are running on a single host > > it may turn out that giving each VM 8 epoll threads doesn't help at all > > because the host CPUs are busy with other tasks. > > The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. > > But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define In QEMU there is currently work to add multiqueue support to the block layer. This enables true multiqueue from the guest down to the storage backend. virtio-blk already supports multiple queues but they are all processed from the same thread in QEMU today. Once multiple threads are able to process the queues it would make sense to continue down into the vxhs block driver. So I don't think implementing multiple epoll threads in libqnio is useful in the long term. Rather, a straightforward approach of integrating with the libqnio user's event loop (as described in my previous emails) would simplify the code and allow you to take advantage of full multiqueue support in the future. Stefan
On 11/7/16, 2:22 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote: >> > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: >> >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. >> >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. >> >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. >> >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. >> >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. >> > >> > By the way, when you benchmark with 8 epoll threads, are there any other >> > guests with vxhs running on the machine? >> > >> >> Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device. >> >> > In a real-life situation where multiple VMs are running on a single host >> > it may turn out that giving each VM 8 epoll threads doesn't help at all >> > because the host CPUs are busy with other tasks. >> >> The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. >> >> But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define > >In QEMU there is currently work to add multiqueue support to the block >layer. This enables true multiqueue from the guest down to the storage >backend. Is there any spec or documentation on this that you can point us to? > >virtio-blk already supports multiple queues but they are all processed >from the same thread in QEMU today. Once multiple threads are able to >process the queues it would make sense to continue down into the vxhs >block driver. > >So I don't think implementing multiple epoll threads in libqnio is >useful in the long term. Rather, a straightforward approach of >integrating with the libqnio user's event loop (as described in my >previous emails) would simplify the code and allow you to take advantage >of full multiqueue support in the future. Makes sense. We will take this up in the next iteration of libqnio. Thanks, Katan. > >Stefan
On Mon, Nov 07, 2016 at 08:27:39PM +0000, Ketan Nilangekar wrote: > On 11/7/16, 2:22 AM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > >On Fri, Nov 04, 2016 at 06:30:47PM +0000, Ketan Nilangekar wrote: > >> > On Nov 4, 2016, at 2:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> >> On Thu, Oct 20, 2016 at 01:31:15AM +0000, Ketan Nilangekar wrote: > >> >> 2. The idea of having multi-threaded epoll based network client was to drive more throughput by using multiplexed epoll implementation and (fairly) distributing IOs from several vdisks (typical VM assumed to have atleast 2) across 8 connections. > >> >> Each connection is serviced by single epoll and does not share its context with other connections/epoll. All memory pools/queues are in the context of a connection/epoll. > >> >> The qemu thread enqueues IO request in one of the 8 epoll queues using a round-robin. Responses are also handled in the context of an epoll loop and do not share context with other epolls. Any synchronization code that you see today in the driver callback is code that handles the split IOs which we plan to address by a) implementing readv in libqnio and b) removing the 4MB limit on write IO size. > >> >> The number of client epoll threads (8) is a #define in qnio and can easily be changed. However our tests indicate that we are able to drive a good number of IOs using 8 threads/epolls. > >> >> I am sure there are ways to simplify the library implementation, but for now the performance of the epoll threads is more than satisfactory. > >> > > >> > By the way, when you benchmark with 8 epoll threads, are there any other > >> > guests with vxhs running on the machine? > >> > > >> > >> Yes. Infact the total througput with around 4-5 VMs scales well to saturate around 90% of available storage throughput of a typical pcie ssd device. > >> > >> > In a real-life situation where multiple VMs are running on a single host > >> > it may turn out that giving each VM 8 epoll threads doesn't help at all > >> > because the host CPUs are busy with other tasks. > >> > >> The exact number of epolls required to achieve optimal throughput may be something that can be adjusted dynamically by the qnio library in subsequent revisions. > >> > >> But as I mentioned today we can change this by simply rebuilding qnio with a different value for the #define > > > >In QEMU there is currently work to add multiqueue support to the block > >layer. This enables true multiqueue from the guest down to the storage > >backend. > > Is there any spec or documentation on this that you can point us to? The current status is: 1. virtio-blk and virtio-scsi support multiple queues but these queues are processed from a single thread today. 2. MemoryRegions can be marked with !global_locking so its handler functions are dispatched without taking the QEMU global mutex. This allows device emulation to run in multiple threads. 3. Paolo Bonzini (CCed) is currently working on make the block layer (BlockDriverState and co) support access from multiple threads and multiqueue. This is work in progress. If you are interested in this work keep an eye out for patch series from Paolo Bonzini and Fam Zheng. Stefan
On 08/11/2016 16:39, Stefan Hajnoczi wrote: > The current status is: > > 1. virtio-blk and virtio-scsi support multiple queues but these queues > are processed from a single thread today. > > 2. MemoryRegions can be marked with !global_locking so its handler > functions are dispatched without taking the QEMU global mutex. This > allows device emulation to run in multiple threads. Alternatively, virtio-blk and virtio-scsi can already use ioeventfd and "-object iothread,id=FOO -device virtio-blk-pci,iothread=FOO" to let device emulation run in a separate thread that doesn't take the QEMU global mutex. > 3. Paolo Bonzini (CCed) is currently working on make the block layer > (BlockDriverState and co) support access from multiple threads and > multiqueue. This is work in progress. > > If you are interested in this work keep an eye out for patch series from > Paolo Bonzini and Fam Zheng. The first part (drop RFifoLock) was committed for 2.8. It's a relatively long road, but these are the currently ready parts of the work: - take AioContext acquire/release in small critical sections - push AioContext down to individual callbacks - make BlockDriverState thread-safe The latter needs rebasing after the last changes to dirty bitmaps, but I think these patches should be ready for 2.9. These are the planned bits: - replace AioContext with fine-grained mutex in bdrv_aio_* - protect everything with CoMutex in bdrv_co_* - remove aio_context_acquire/release For now I was not planning to make network backends support multiqueue, only files. Paolo
On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > > Review of .bdrv_open() and .bdrv_aio_writev() code paths. > > The big issues I see in this driver and libqnio: > > 1. Showstoppers like broken .bdrv_open() and leaking memory on every > reply message. > 2. Insecure due to missing input validation (network packets and > configuration) and incorrect string handling. > 3. Not fully asynchronous so QEMU and the guest may hang. > > Please think about the whole codebase and not just the lines I've > pointed out in this review when fixing these sorts of issues. There may > be similar instances of these bugs elsewhere and it's important that > they are fixed so that this can be merged. Ping? You didn't respond to the comments I raised. The libqnio buffer overflows and other issues from this email are still present. I put a lot of time into reviewing this patch series and libqnio. If you want to get reviews please address feedback before sending a new patch series. > >> +/* >> + * Structure per vDisk maintained for state >> + */ >> +typedef struct BDRVVXHSState { >> + int fds[2]; >> + int64_t vdisk_size; >> + int64_t vdisk_blocks; >> + int64_t vdisk_flags; >> + int vdisk_aio_count; >> + int event_reader_pos; >> + VXHSAIOCB *qnio_event_acb; >> + void *qnio_ctx; >> + QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */ >> + QemuSpin vdisk_acb_lock; /* Protects ACB */ > > These comments are insufficient for documenting locking. Not all fields > are actually protected by these locks. Please order fields according to > lock coverage: > > typedef struct VXHSAIOCB { > ... > > /* Protected by BDRVVXHSState->vdisk_acb_lock */ > int segments; > ... > }; > > typedef struct BDRVVXHSState { > ... > > /* Protected by vdisk_lock */ > QemuSpin vdisk_lock; > int vdisk_aio_count; > QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; > ... > } > >> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) >> +{ >> + /* >> + * Close vDisk device >> + */ >> + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { >> + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); > > libqnio comment: > Why does iio_devclose() take an unused cfd argument? Perhaps it can be > dropped. > >> + s->vdisk_hostinfo[idx].vdisk_rfd = -1; >> + } >> + >> + /* >> + * Close QNIO channel against cached channel-fd >> + */ >> + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { >> + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); > > libqnio comment: > Why does iio_devclose() take an int32_t cfd argument but iio_close() > takes a uint32_t cfd argument? > >> + s->vdisk_hostinfo[idx].qnio_cfd = -1; >> + } >> +} >> + >> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, >> + int *rfd, const char *file_name) >> +{ >> + /* >> + * Open qnio channel to storage agent if not opened before. >> + */ >> + if (*cfd < 0) { >> + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); > > libqnio comments: > > 1. > There is a buffer overflow in qnio_create_channel(). strncpy() is used > incorrectly so long hostname or port (both can be 99 characters long) > will overflow channel->name[] (64 characters) or channel->port[] (8 > characters). > > strncpy(channel->name, hostname, strlen(hostname) + 1); > strncpy(channel->port, port, strlen(port) + 1); > > The third argument must be the size of the *destination* buffer, not the > source buffer. 
Also note that strncpy() doesn't NUL-terminate the > destination string so you must do that manually to ensure there is a NUL > byte at the end of the buffer. > > 2. > channel is leaked in the "Failed to open single connection" error case > in qnio_create_channel(). > > 3. > If host is longer the 63 characters then the ioapi_ctx->channels and > qnio_ctx->channels maps will use different keys due to string truncation > in qnio_create_channel(). This means "Channel already exists" in > qnio_create_channel() and possibly other things will not work as > expected. > >> + if (*cfd < 0) { >> + trace_vxhs_qnio_iio_open(of_vsa_addr); >> + return -ENODEV; >> + } >> + } >> + >> + /* >> + * Open vdisk device >> + */ >> + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); > > libqnio comment: > Buffer overflow in iio_devopen() since chandev[128] is not large enough > to hold channel[100] + " " + devpath[arbitrary length] chars: > > sprintf(chandev, "%s %s", channel, devpath); > >> + >> + if (*rfd < 0) { >> + if (*cfd >= 0) { > > This check is always true. Otherwise the return -ENODEV would have been > taken above. The if statement isn't necessary. > >> +static void vxhs_check_failover_status(int res, void *ctx) >> +{ >> + BDRVVXHSState *s = ctx; >> + >> + if (res == 0) { >> + /* found failover target */ >> + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; >> + s->vdisk_ask_failover_idx = 0; >> + trace_vxhs_check_failover_status( >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >> + s->vdisk_guid); >> + qemu_spin_lock(&s->vdisk_lock); >> + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); >> + qemu_spin_unlock(&s->vdisk_lock); >> + vxhs_handle_queued_ios(s); >> + } else { >> + /* keep looking */ >> + trace_vxhs_check_failover_status_retry(s->vdisk_guid); >> + s->vdisk_ask_failover_idx++; >> + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { >> + /* pause and cycle through list again */ >> + sleep(QNIO_CONNECT_RETRY_SECS); > > This code is called from a QEMU thread via vxhs_aio_rw(). It is not > permitted to call sleep() since it will freeze QEMU and probably the > guest. > > If you need a timer you can use QEMU's timer APIs. See aio_timer_new(), > timer_new_ns(), timer_mod(), timer_del(), timer_free(). > >> + s->vdisk_ask_failover_idx = 0; >> + } >> + res = vxhs_switch_storage_agent(s); >> + } >> +} >> + >> +static int vxhs_failover_io(BDRVVXHSState *s) >> +{ >> + int res = 0; >> + >> + trace_vxhs_failover_io(s->vdisk_guid); >> + >> + s->vdisk_ask_failover_idx = 0; >> + res = vxhs_switch_storage_agent(s); >> + >> + return res; >> +} >> + >> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, >> + uint32_t error, uint32_t opcode) > > This function is doing too much. Especially the failover code should > run in the AioContext since it's complex. Don't do failover here > because this function is outside the AioContext lock. Do it from > AioContext using a QEMUBH like block/rbd.c. > >> +static int32_t >> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, >> + uint64_t offset, void *ctx, uint32_t flags) >> +{ >> + struct iovec cur; >> + uint64_t cur_offset = 0; >> + uint64_t cur_write_len = 0; >> + int segcount = 0; >> + int ret = 0; >> + int i, nsio = 0; >> + int iovcnt = qiov->niov; >> + struct iovec *iov = qiov->iov; >> + >> + errno = 0; >> + cur.iov_base = 0; >> + cur.iov_len = 0; >> + >> + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); > > libqnio comments: > > 1. 
> There are blocking connect(2) and getaddrinfo(3) calls in iio_writev() > so this may hang for arbitrary amounts of time. This is not permitted > in .bdrv_aio_readv()/.bdrv_aio_writev(). Please make qnio actually > asynchronous. > > 2. > Where does client_callback() free reply? It looks like every reply > message causes a memory leak! > > 3. > Buffer overflow in iio_writev() since device[128] cannot fit the device > string generated from the vdisk_guid. > > 4. > Buffer overflow in iio_writev() due to > strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is > larger than target[64]. Also note the previous comments about strncpy() > usage. > > 5. > I don't see any endianness handling or portable alignment of struct > fields in the network protocol code. Binary network protocols need to > take care of these issue for portability. This means libqnio compiled > for different architectures will not work. Do you plan to support any > other architectures besides x86? > > 6. > The networking code doesn't look robust: kvset uses assert() on input > from the network so the other side of the connection could cause SIGABRT > (coredump), the client uses the msg pointer as the cookie for the > response packet so the server can easily crash the client by sending a > bogus cookie value, etc. Even on the client side these things are > troublesome but on a server they are guaranteed security issues. I > didn't look into it deeply. Please audit the code. > >> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, >> + int *cfd, int *rfd, Error **errp) >> +{ >> + QDict *backing_options = NULL; >> + QemuOpts *opts, *tcp_opts; >> + const char *vxhs_filename; >> + char *of_vsa_addr = NULL; >> + Error *local_err = NULL; >> + const char *vdisk_id_opt; >> + char *file_name = NULL; >> + size_t num_servers = 0; >> + char *str = NULL; >> + int ret = 0; >> + int i; >> + >> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); >> + qemu_opts_absorb_qdict(opts, options, &local_err); >> + if (local_err) { >> + error_propagate(errp, local_err); >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); >> + if (vxhs_filename) { >> + trace_vxhs_qemu_init_filename(vxhs_filename); >> + } >> + >> + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); >> + if (!vdisk_id_opt) { >> + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); >> + ret = -EINVAL; >> + goto out; >> + } >> + s->vdisk_guid = g_strdup(vdisk_id_opt); >> + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); >> + >> + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); >> + if (num_servers < 1) { >> + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); >> + ret = -EINVAL; >> + goto out; >> + } else if (num_servers > VXHS_MAX_HOSTS) { >> + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); >> + error_append_hint(errp, "Maximum %d servers allowed.\n", >> + VXHS_MAX_HOSTS); >> + ret = -EINVAL; >> + goto out; >> + } >> + trace_vxhs_qemu_init_numservers(num_servers); >> + >> + for (i = 0; i < num_servers; i++) { >> + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); >> + qdict_extract_subqdict(options, &backing_options, str); >> + >> + /* Create opts info from runtime_tcp_opts list */ >> + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); >> + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); >> + if (local_err) { >> + qdict_del(backing_options, str); > > backing_options is leaked and there's no need to delete the str key. 
> >> + qemu_opts_del(tcp_opts); >> + g_free(str); >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, >> + VXHS_OPT_HOST)); >> + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, >> + VXHS_OPT_PORT), >> + NULL, 0); > > This will segfault if the port option was missing. > >> + >> + s->vdisk_hostinfo[i].qnio_cfd = -1; >> + s->vdisk_hostinfo[i].vdisk_rfd = -1; >> + trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, >> + s->vdisk_hostinfo[i].port); > > It's not safe to use the %s format specifier for a trace event with a > NULL value. In the case where hostip is NULL this could crash on some > systems. > >> + >> + qdict_del(backing_options, str); >> + qemu_opts_del(tcp_opts); >> + g_free(str); >> + } > > backing_options is leaked. > >> + >> + s->vdisk_nhosts = i; >> + s->vdisk_cur_host_idx = 0; >> + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); >> + of_vsa_addr = g_strdup_printf("of://%s:%d", >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); > > Can we get here with num_servers == 0? In that case this would access > uninitialized memory. I guess num_servers == 0 does not make sense and > there should be an error case for it. > >> + >> + /* >> + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. >> + */ >> + if (global_qnio_ctx == NULL) { >> + global_qnio_ctx = vxhs_setup_qnio(); > > libqnio comment: > The client epoll thread should mask all signals (like > qemu_thread_create()). Otherwise it may receive signals that it cannot > deal with. > >> + if (global_qnio_ctx == NULL) { >> + error_setg(&local_err, "Failed vxhs_setup_qnio"); >> + ret = -EINVAL; >> + goto out; >> + } >> + } >> + >> + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); >> + if (!ret) { >> + error_setg(&local_err, "Failed qnio_iio_open"); >> + ret = -EIO; >> + } > > The return value of vxhs_qnio_iio_open() is 0 for success or -errno for > error. > > I guess you never ran this code! The block driver won't even open > successfully. > >> + >> +out: >> + g_free(file_name); >> + g_free(of_vsa_addr); >> + qemu_opts_del(opts); >> + >> + if (ret < 0) { >> + for (i = 0; i < num_servers; i++) { >> + g_free(s->vdisk_hostinfo[i].hostip); >> + } >> + g_free(s->vdisk_guid); >> + s->vdisk_guid = NULL; >> + errno = -ret; > > There is no need to set errno here. The return value already contains > the error and the caller doesn't look at errno. > >> + } >> + error_propagate(errp, local_err); >> + >> + return ret; >> +} >> + >> +static int vxhs_open(BlockDriverState *bs, QDict *options, >> + int bdrv_flags, Error **errp) >> +{ >> + BDRVVXHSState *s = bs->opaque; >> + AioContext *aio_context; >> + int qemu_qnio_cfd = -1; >> + int device_opened = 0; >> + int qemu_rfd = -1; >> + int ret = 0; >> + int i; >> + >> + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); >> + if (ret < 0) { >> + trace_vxhs_open_fail(ret); >> + return ret; >> + } >> + >> + device_opened = 1; >> + s->qnio_ctx = global_qnio_ctx; >> + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; >> + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; >> + s->vdisk_size = 0; >> + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); >> + >> + /* >> + * Create a pipe for communicating between two threads in different >> + * context. Set handler for read event, which gets triggered when >> + * IO completion is done by non-QEMU context. 
>> + */ >> + ret = qemu_pipe(s->fds); >> + if (ret < 0) { >> + trace_vxhs_open_epipe('.'); >> + ret = -errno; >> + goto errout; > > This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc. > bdrv_close() will not be called so this function must do cleanup itself. > >> + } >> + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); >> + >> + aio_context = bdrv_get_aio_context(bs); >> + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], >> + false, vxhs_aio_event_reader, NULL, s); >> + >> + /* >> + * Initialize the spin-locks. >> + */ >> + qemu_spin_init(&s->vdisk_lock); >> + qemu_spin_init(&s->vdisk_acb_lock); >> + >> + return 0; >> + >> +errout: >> + /* >> + * Close remote vDisk device if it was opened earlier >> + */ >> + if (device_opened) { > > This is always true. The device_opened variable can be removed. > >> +/* >> + * This allocates QEMU-VXHS callback for each IO >> + * and is passed to QNIO. When QNIO completes the work, >> + * it will be passed back through the callback. >> + */ >> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, >> + int64_t sector_num, QEMUIOVector *qiov, >> + int nb_sectors, >> + BlockCompletionFunc *cb, >> + void *opaque, int iodir) >> +{ >> + VXHSAIOCB *acb = NULL; >> + BDRVVXHSState *s = bs->opaque; >> + size_t size; >> + uint64_t offset; >> + int iio_flags = 0; >> + int ret = 0; >> + void *qnio_ctx = s->qnio_ctx; >> + uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd; >> + >> + offset = sector_num * BDRV_SECTOR_SIZE; >> + size = nb_sectors * BDRV_SECTOR_SIZE; >> + >> + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); >> + /* >> + * Setup or initialize VXHSAIOCB. >> + * Every single field should be initialized since >> + * acb will be picked up from the slab without >> + * initializing with zero. >> + */ >> + acb->io_offset = offset; >> + acb->size = size; >> + acb->ret = 0; >> + acb->flags = 0; >> + acb->aio_done = VXHS_IO_INPROGRESS; >> + acb->segments = 0; >> + acb->buffer = 0; >> + acb->qiov = qiov; >> + acb->direction = iodir; >> + >> + qemu_spin_lock(&s->vdisk_lock); >> + if (OF_VDISK_FAILED(s)) { >> + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); >> + qemu_spin_unlock(&s->vdisk_lock); >> + goto errout; >> + } >> + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { >> + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); >> + s->vdisk_aio_retry_qd++; >> + OF_AIOCB_FLAGS_SET_QUEUED(acb); >> + qemu_spin_unlock(&s->vdisk_lock); >> + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); >> + goto out; >> + } >> + s->vdisk_aio_count++; >> + qemu_spin_unlock(&s->vdisk_lock); >> + >> + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); >> + >> + switch (iodir) { >> + case VDISK_AIO_WRITE: >> + vxhs_inc_acb_segment_count(acb, 1); >> + ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov, >> + offset, (void *)acb, iio_flags); >> + break; >> + case VDISK_AIO_READ: >> + vxhs_inc_acb_segment_count(acb, 1); >> + ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov, >> + offset, (void *)acb, iio_flags); >> + break; >> + default: >> + trace_vxhs_aio_rw_invalid(iodir); >> + goto errout; > > s->vdisk_aio_count must be decremented before returning. 
> >> +static void vxhs_close(BlockDriverState *bs) >> +{ >> + BDRVVXHSState *s = bs->opaque; >> + int i; >> + >> + trace_vxhs_close(s->vdisk_guid); >> + close(s->fds[VDISK_FD_READ]); >> + close(s->fds[VDISK_FD_WRITE]); >> + >> + /* >> + * Clearing all the event handlers for oflame registered to QEMU >> + */ >> + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], >> + false, NULL, NULL, NULL); > > Please remove the event handler before closing the fd. I don't think it > matters in this case but in other scenarios there could be race > conditions if another thread opens an fd and the file descriptor number > is reused.
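A minimal sketch of the teardown order requested in the last comment, using only calls already present in the patch; the remaining cleanup is elided:

    static void vxhs_close(BlockDriverState *bs)
    {
        BDRVVXHSState *s = bs->opaque;

        /* Deregister the handler while the fd is still valid, then close
         * the pipe, so a stale handler can never fire on a reused fd. */
        aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
                           false, NULL, NULL, NULL);

        close(s->fds[VDISK_FD_READ]);
        close(s->fds[VDISK_FD_WRITE]);

        /* ... remaining cleanup (guid, host info, qnio close) ... */
    }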
Will look into these ASAP. On Mon, Nov 14, 2016 at 7:05 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> >> Review of .bdrv_open() and .bdrv_aio_writev() code paths. >> >> The big issues I see in this driver and libqnio: >> >> 1. Showstoppers like broken .bdrv_open() and leaking memory on every >> reply message. >> 2. Insecure due to missing input validation (network packets and >> configuration) and incorrect string handling. >> 3. Not fully asynchronous so QEMU and the guest may hang. >> >> Please think about the whole codebase and not just the lines I've >> pointed out in this review when fixing these sorts of issues. There may >> be similar instances of these bugs elsewhere and it's important that >> they are fixed so that this can be merged. > > Ping? > > You didn't respond to the comments I raised. The libqnio buffer > overflows and other issues from this email are still present. > > I put a lot of time into reviewing this patch series and libqnio. If > you want to get reviews please address feedback before sending a new > patch series. > >> >>> +/* >>> + * Structure per vDisk maintained for state >>> + */ >>> +typedef struct BDRVVXHSState { >>> + int fds[2]; >>> + int64_t vdisk_size; >>> + int64_t vdisk_blocks; >>> + int64_t vdisk_flags; >>> + int vdisk_aio_count; >>> + int event_reader_pos; >>> + VXHSAIOCB *qnio_event_acb; >>> + void *qnio_ctx; >>> + QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */ >>> + QemuSpin vdisk_acb_lock; /* Protects ACB */ >> >> These comments are insufficient for documenting locking. Not all fields >> are actually protected by these locks. Please order fields according to >> lock coverage: >> >> typedef struct VXHSAIOCB { >> ... >> >> /* Protected by BDRVVXHSState->vdisk_acb_lock */ >> int segments; >> ... >> }; >> >> typedef struct BDRVVXHSState { >> ... >> >> /* Protected by vdisk_lock */ >> QemuSpin vdisk_lock; >> int vdisk_aio_count; >> QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; >> ... >> } >> >>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) >>> +{ >>> + /* >>> + * Close vDisk device >>> + */ >>> + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { >>> + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); >> >> libqnio comment: >> Why does iio_devclose() take an unused cfd argument? Perhaps it can be >> dropped. >> >>> + s->vdisk_hostinfo[idx].vdisk_rfd = -1; >>> + } >>> + >>> + /* >>> + * Close QNIO channel against cached channel-fd >>> + */ >>> + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { >>> + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); >> >> libqnio comment: >> Why does iio_devclose() take an int32_t cfd argument but iio_close() >> takes a uint32_t cfd argument? >> >>> + s->vdisk_hostinfo[idx].qnio_cfd = -1; >>> + } >>> +} >>> + >>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, >>> + int *rfd, const char *file_name) >>> +{ >>> + /* >>> + * Open qnio channel to storage agent if not opened before. >>> + */ >>> + if (*cfd < 0) { >>> + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); >> >> libqnio comments: >> >> 1. >> There is a buffer overflow in qnio_create_channel(). strncpy() is used >> incorrectly so long hostname or port (both can be 99 characters long) >> will overflow channel->name[] (64 characters) or channel->port[] (8 >> characters). 
>> >> strncpy(channel->name, hostname, strlen(hostname) + 1); >> strncpy(channel->port, port, strlen(port) + 1); >> >> The third argument must be the size of the *destination* buffer, not the >> source buffer. Also note that strncpy() doesn't NUL-terminate the >> destination string so you must do that manually to ensure there is a NUL >> byte at the end of the buffer. >> >> 2. >> channel is leaked in the "Failed to open single connection" error case >> in qnio_create_channel(). >> >> 3. >> If host is longer the 63 characters then the ioapi_ctx->channels and >> qnio_ctx->channels maps will use different keys due to string truncation >> in qnio_create_channel(). This means "Channel already exists" in >> qnio_create_channel() and possibly other things will not work as >> expected. >> >>> + if (*cfd < 0) { >>> + trace_vxhs_qnio_iio_open(of_vsa_addr); >>> + return -ENODEV; >>> + } >>> + } >>> + >>> + /* >>> + * Open vdisk device >>> + */ >>> + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); >> >> libqnio comment: >> Buffer overflow in iio_devopen() since chandev[128] is not large enough >> to hold channel[100] + " " + devpath[arbitrary length] chars: >> >> sprintf(chandev, "%s %s", channel, devpath); >> >>> + >>> + if (*rfd < 0) { >>> + if (*cfd >= 0) { >> >> This check is always true. Otherwise the return -ENODEV would have been >> taken above. The if statement isn't necessary. >> >>> +static void vxhs_check_failover_status(int res, void *ctx) >>> +{ >>> + BDRVVXHSState *s = ctx; >>> + >>> + if (res == 0) { >>> + /* found failover target */ >>> + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; >>> + s->vdisk_ask_failover_idx = 0; >>> + trace_vxhs_check_failover_status( >>> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >>> + s->vdisk_guid); >>> + qemu_spin_lock(&s->vdisk_lock); >>> + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); >>> + qemu_spin_unlock(&s->vdisk_lock); >>> + vxhs_handle_queued_ios(s); >>> + } else { >>> + /* keep looking */ >>> + trace_vxhs_check_failover_status_retry(s->vdisk_guid); >>> + s->vdisk_ask_failover_idx++; >>> + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { >>> + /* pause and cycle through list again */ >>> + sleep(QNIO_CONNECT_RETRY_SECS); >> >> This code is called from a QEMU thread via vxhs_aio_rw(). It is not >> permitted to call sleep() since it will freeze QEMU and probably the >> guest. >> >> If you need a timer you can use QEMU's timer APIs. See aio_timer_new(), >> timer_new_ns(), timer_mod(), timer_del(), timer_free(). >> >>> + s->vdisk_ask_failover_idx = 0; >>> + } >>> + res = vxhs_switch_storage_agent(s); >>> + } >>> +} >>> + >>> +static int vxhs_failover_io(BDRVVXHSState *s) >>> +{ >>> + int res = 0; >>> + >>> + trace_vxhs_failover_io(s->vdisk_guid); >>> + >>> + s->vdisk_ask_failover_idx = 0; >>> + res = vxhs_switch_storage_agent(s); >>> + >>> + return res; >>> +} >>> + >>> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, >>> + uint32_t error, uint32_t opcode) >> >> This function is doing too much. Especially the failover code should >> run in the AioContext since it's complex. Don't do failover here >> because this function is outside the AioContext lock. Do it from >> AioContext using a QEMUBH like block/rbd.c. 
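A minimal sketch of the timer-based retry suggested above in place of sleep(), reusing the driver's existing vxhs_switch_storage_agent() and QNIO_CONNECT_RETRY_SECS; the failover_timer field is an assumed addition to BDRVVXHSState:

    /* Fires in the AioContext after QNIO_CONNECT_RETRY_SECS. */
    static void vxhs_failover_retry_cb(void *opaque)
    {
        BDRVVXHSState *s = opaque;

        s->vdisk_ask_failover_idx = 0;
        vxhs_switch_storage_agent(s);   /* cycle through the host list again */
    }

    static void vxhs_schedule_failover_retry(BlockDriverState *bs,
                                             BDRVVXHSState *s)
    {
        if (!s->failover_timer) {
            s->failover_timer = aio_timer_new(bdrv_get_aio_context(bs),
                                              QEMU_CLOCK_REALTIME, SCALE_NS,
                                              vxhs_failover_retry_cb, s);
        }
        timer_mod(s->failover_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
                  QNIO_CONNECT_RETRY_SECS * NANOSECONDS_PER_SECOND);
    }

The QEMU thread never blocks; when the timer fires, the scan resumes from the top of the host list, which is what the sleep() was approximating.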
>> >>> +static int32_t >>> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, >>> + uint64_t offset, void *ctx, uint32_t flags) >>> +{ >>> + struct iovec cur; >>> + uint64_t cur_offset = 0; >>> + uint64_t cur_write_len = 0; >>> + int segcount = 0; >>> + int ret = 0; >>> + int i, nsio = 0; >>> + int iovcnt = qiov->niov; >>> + struct iovec *iov = qiov->iov; >>> + >>> + errno = 0; >>> + cur.iov_base = 0; >>> + cur.iov_len = 0; >>> + >>> + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); >> >> libqnio comments: >> >> 1. >> There are blocking connect(2) and getaddrinfo(3) calls in iio_writev() >> so this may hang for arbitrary amounts of time. This is not permitted >> in .bdrv_aio_readv()/.bdrv_aio_writev(). Please make qnio actually >> asynchronous. >> >> 2. >> Where does client_callback() free reply? It looks like every reply >> message causes a memory leak! >> >> 3. >> Buffer overflow in iio_writev() since device[128] cannot fit the device >> string generated from the vdisk_guid. >> >> 4. >> Buffer overflow in iio_writev() due to >> strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is >> larger than target[64]. Also note the previous comments about strncpy() >> usage. >> >> 5. >> I don't see any endianness handling or portable alignment of struct >> fields in the network protocol code. Binary network protocols need to >> take care of these issue for portability. This means libqnio compiled >> for different architectures will not work. Do you plan to support any >> other architectures besides x86? >> >> 6. >> The networking code doesn't look robust: kvset uses assert() on input >> from the network so the other side of the connection could cause SIGABRT >> (coredump), the client uses the msg pointer as the cookie for the >> response packet so the server can easily crash the client by sending a >> bogus cookie value, etc. Even on the client side these things are >> troublesome but on a server they are guaranteed security issues. I >> didn't look into it deeply. Please audit the code. 
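A minimal sketch of the bounded string copy that several of the comments above ask for: size the copy by the destination buffer and terminate it explicitly. The channel->name[64] and channel->port[8] usage in the trailing comments is illustrative only, following the sizes quoted in the review:

    #include <string.h>

    static void bounded_copy(char *dst, size_t dst_size, const char *src)
    {
        if (dst_size == 0) {
            return;
        }
        strncpy(dst, src, dst_size - 1);   /* bound by the destination size */
        dst[dst_size - 1] = '\0';          /* strncpy may not NUL-terminate */
    }

    /* usage, assuming char name[64]; char port[8]; members:
     *   bounded_copy(channel->name, sizeof(channel->name), hostname);
     *   bounded_copy(channel->port, sizeof(channel->port), port);
     */

Silent truncation still changes behaviour (see the 63-character key mismatch noted earlier), so rejecting over-long hostnames up front is arguably better than truncating them.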
>> >>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, >>> + int *cfd, int *rfd, Error **errp) >>> +{ >>> + QDict *backing_options = NULL; >>> + QemuOpts *opts, *tcp_opts; >>> + const char *vxhs_filename; >>> + char *of_vsa_addr = NULL; >>> + Error *local_err = NULL; >>> + const char *vdisk_id_opt; >>> + char *file_name = NULL; >>> + size_t num_servers = 0; >>> + char *str = NULL; >>> + int ret = 0; >>> + int i; >>> + >>> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); >>> + qemu_opts_absorb_qdict(opts, options, &local_err); >>> + if (local_err) { >>> + error_propagate(errp, local_err); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >>> + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); >>> + if (vxhs_filename) { >>> + trace_vxhs_qemu_init_filename(vxhs_filename); >>> + } >>> + >>> + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); >>> + if (!vdisk_id_opt) { >>> + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + s->vdisk_guid = g_strdup(vdisk_id_opt); >>> + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); >>> + >>> + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); >>> + if (num_servers < 1) { >>> + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); >>> + ret = -EINVAL; >>> + goto out; >>> + } else if (num_servers > VXHS_MAX_HOSTS) { >>> + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); >>> + error_append_hint(errp, "Maximum %d servers allowed.\n", >>> + VXHS_MAX_HOSTS); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + trace_vxhs_qemu_init_numservers(num_servers); >>> + >>> + for (i = 0; i < num_servers; i++) { >>> + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); >>> + qdict_extract_subqdict(options, &backing_options, str); >>> + >>> + /* Create opts info from runtime_tcp_opts list */ >>> + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); >>> + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); >>> + if (local_err) { >>> + qdict_del(backing_options, str); >> >> backing_options is leaked and there's no need to delete the str key. >> >>> + qemu_opts_del(tcp_opts); >>> + g_free(str); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >>> + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, >>> + VXHS_OPT_HOST)); >>> + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, >>> + VXHS_OPT_PORT), >>> + NULL, 0); >> >> This will segfault if the port option was missing. >> >>> + >>> + s->vdisk_hostinfo[i].qnio_cfd = -1; >>> + s->vdisk_hostinfo[i].vdisk_rfd = -1; >>> + trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, >>> + s->vdisk_hostinfo[i].port); >> >> It's not safe to use the %s format specifier for a trace event with a >> NULL value. In the case where hostip is NULL this could crash on some >> systems. >> >>> + >>> + qdict_del(backing_options, str); >>> + qemu_opts_del(tcp_opts); >>> + g_free(str); >>> + } >> >> backing_options is leaked. >> >>> + >>> + s->vdisk_nhosts = i; >>> + s->vdisk_cur_host_idx = 0; >>> + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); >>> + of_vsa_addr = g_strdup_printf("of://%s:%d", >>> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >>> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); >> >> Can we get here with num_servers == 0? In that case this would access >> uninitialized memory. I guess num_servers == 0 does not make sense and >> there should be an error case for it. 
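For the missing-port crash and the backing_options leaks noted above, a sketch of a more defensive per-server loop body; it reuses the names from the patch and assumes the sub-dict is released once per iteration:

        const char *host = qemu_opt_get(tcp_opts, VXHS_OPT_HOST);
        const char *port = qemu_opt_get(tcp_opts, VXHS_OPT_PORT);

        if (!host || !port) {
            error_setg(&local_err, QERR_MISSING_PARAMETER,
                       !host ? VXHS_OPT_HOST : VXHS_OPT_PORT);
            qemu_opts_del(tcp_opts);
            g_free(str);
            QDECREF(backing_options);           /* don't leak the sub-dict */
            ret = -EINVAL;
            goto out;
        }

        s->vdisk_hostinfo[i].hostip = g_strdup(host);
        s->vdisk_hostinfo[i].port = g_ascii_strtoll(port, NULL, 0);
        s->vdisk_hostinfo[i].qnio_cfd = -1;
        s->vdisk_hostinfo[i].vdisk_rfd = -1;
        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
                             s->vdisk_hostinfo[i].port);

        qemu_opts_del(tcp_opts);
        g_free(str);
        QDECREF(backing_options);
        backing_options = NULL;

With the NULL checks in place the trace event can no longer see a NULL hostip either.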
>> >>> + >>> + /* >>> + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. >>> + */ >>> + if (global_qnio_ctx == NULL) { >>> + global_qnio_ctx = vxhs_setup_qnio(); >> >> libqnio comment: >> The client epoll thread should mask all signals (like >> qemu_thread_create()). Otherwise it may receive signals that it cannot >> deal with. >> >>> + if (global_qnio_ctx == NULL) { >>> + error_setg(&local_err, "Failed vxhs_setup_qnio"); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + } >>> + >>> + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); >>> + if (!ret) { >>> + error_setg(&local_err, "Failed qnio_iio_open"); >>> + ret = -EIO; >>> + } >> >> The return value of vxhs_qnio_iio_open() is 0 for success or -errno for >> error. >> >> I guess you never ran this code! The block driver won't even open >> successfully. >> >>> + >>> +out: >>> + g_free(file_name); >>> + g_free(of_vsa_addr); >>> + qemu_opts_del(opts); >>> + >>> + if (ret < 0) { >>> + for (i = 0; i < num_servers; i++) { >>> + g_free(s->vdisk_hostinfo[i].hostip); >>> + } >>> + g_free(s->vdisk_guid); >>> + s->vdisk_guid = NULL; >>> + errno = -ret; >> >> There is no need to set errno here. The return value already contains >> the error and the caller doesn't look at errno. >> >>> + } >>> + error_propagate(errp, local_err); >>> + >>> + return ret; >>> +} >>> + >>> +static int vxhs_open(BlockDriverState *bs, QDict *options, >>> + int bdrv_flags, Error **errp) >>> +{ >>> + BDRVVXHSState *s = bs->opaque; >>> + AioContext *aio_context; >>> + int qemu_qnio_cfd = -1; >>> + int device_opened = 0; >>> + int qemu_rfd = -1; >>> + int ret = 0; >>> + int i; >>> + >>> + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); >>> + if (ret < 0) { >>> + trace_vxhs_open_fail(ret); >>> + return ret; >>> + } >>> + >>> + device_opened = 1; >>> + s->qnio_ctx = global_qnio_ctx; >>> + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; >>> + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; >>> + s->vdisk_size = 0; >>> + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); >>> + >>> + /* >>> + * Create a pipe for communicating between two threads in different >>> + * context. Set handler for read event, which gets triggered when >>> + * IO completion is done by non-QEMU context. >>> + */ >>> + ret = qemu_pipe(s->fds); >>> + if (ret < 0) { >>> + trace_vxhs_open_epipe('.'); >>> + ret = -errno; >>> + goto errout; >> >> This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc. >> bdrv_close() will not be called so this function must do cleanup itself. >> >>> + } >>> + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); >>> + >>> + aio_context = bdrv_get_aio_context(bs); >>> + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], >>> + false, vxhs_aio_event_reader, NULL, s); >>> + >>> + /* >>> + * Initialize the spin-locks. >>> + */ >>> + qemu_spin_init(&s->vdisk_lock); >>> + qemu_spin_init(&s->vdisk_acb_lock); >>> + >>> + return 0; >>> + >>> +errout: >>> + /* >>> + * Close remote vDisk device if it was opened earlier >>> + */ >>> + if (device_opened) { >> >> This is always true. The device_opened variable can be removed. >> >>> +/* >>> + * This allocates QEMU-VXHS callback for each IO >>> + * and is passed to QNIO. When QNIO completes the work, >>> + * it will be passed back through the callback. 
>>> + */ >>> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, >>> + int64_t sector_num, QEMUIOVector *qiov, >>> + int nb_sectors, >>> + BlockCompletionFunc *cb, >>> + void *opaque, int iodir) >>> +{ >>> + VXHSAIOCB *acb = NULL; >>> + BDRVVXHSState *s = bs->opaque; >>> + size_t size; >>> + uint64_t offset; >>> + int iio_flags = 0; >>> + int ret = 0; >>> + void *qnio_ctx = s->qnio_ctx; >>> + uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd; >>> + >>> + offset = sector_num * BDRV_SECTOR_SIZE; >>> + size = nb_sectors * BDRV_SECTOR_SIZE; >>> + >>> + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); >>> + /* >>> + * Setup or initialize VXHSAIOCB. >>> + * Every single field should be initialized since >>> + * acb will be picked up from the slab without >>> + * initializing with zero. >>> + */ >>> + acb->io_offset = offset; >>> + acb->size = size; >>> + acb->ret = 0; >>> + acb->flags = 0; >>> + acb->aio_done = VXHS_IO_INPROGRESS; >>> + acb->segments = 0; >>> + acb->buffer = 0; >>> + acb->qiov = qiov; >>> + acb->direction = iodir; >>> + >>> + qemu_spin_lock(&s->vdisk_lock); >>> + if (OF_VDISK_FAILED(s)) { >>> + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); >>> + qemu_spin_unlock(&s->vdisk_lock); >>> + goto errout; >>> + } >>> + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { >>> + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); >>> + s->vdisk_aio_retry_qd++; >>> + OF_AIOCB_FLAGS_SET_QUEUED(acb); >>> + qemu_spin_unlock(&s->vdisk_lock); >>> + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); >>> + goto out; >>> + } >>> + s->vdisk_aio_count++; >>> + qemu_spin_unlock(&s->vdisk_lock); >>> + >>> + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); >>> + >>> + switch (iodir) { >>> + case VDISK_AIO_WRITE: >>> + vxhs_inc_acb_segment_count(acb, 1); >>> + ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov, >>> + offset, (void *)acb, iio_flags); >>> + break; >>> + case VDISK_AIO_READ: >>> + vxhs_inc_acb_segment_count(acb, 1); >>> + ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov, >>> + offset, (void *)acb, iio_flags); >>> + break; >>> + default: >>> + trace_vxhs_aio_rw_invalid(iodir); >>> + goto errout; >> >> s->vdisk_aio_count must be decremented before returning. >> >>> +static void vxhs_close(BlockDriverState *bs) >>> +{ >>> + BDRVVXHSState *s = bs->opaque; >>> + int i; >>> + >>> + trace_vxhs_close(s->vdisk_guid); >>> + close(s->fds[VDISK_FD_READ]); >>> + close(s->fds[VDISK_FD_WRITE]); >>> + >>> + /* >>> + * Clearing all the event handlers for oflame registered to QEMU >>> + */ >>> + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], >>> + false, NULL, NULL, NULL); >> >> Please remove the event handler before closing the fd. I don't think it >> matters in this case but in other scenarios there could be race >> conditions if another thread opens an fd and the file descriptor number >> is reused.
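A sketch of the fix asked for in the invalid-iodir branch above: undo the in-flight count taken a few lines earlier before bailing out (locking style as in the posted patch). The same would apply if iio_writev()/iio_readv() themselves fail:

        default:
            trace_vxhs_aio_rw_invalid(iodir);
            qemu_spin_lock(&s->vdisk_lock);
            s->vdisk_aio_count--;       /* undo the increment taken above */
            qemu_spin_unlock(&s->vdisk_lock);
            goto errout;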
Hi Stefan On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > > Review of .bdrv_open() and .bdrv_aio_writev() code paths. > > The big issues I see in this driver and libqnio: > > 1. Showstoppers like broken .bdrv_open() and leaking memory on every > reply message. > 2. Insecure due to missing input validation (network packets and > configuration) and incorrect string handling. > 3. Not fully asynchronous so QEMU and the guest may hang. > > Please think about the whole codebase and not just the lines I've > pointed out in this review when fixing these sorts of issues. There may > be similar instances of these bugs elsewhere and it's important that > they are fixed so that this can be merged. > >> +/* >> + * Structure per vDisk maintained for state >> + */ >> +typedef struct BDRVVXHSState { >> + int fds[2]; >> + int64_t vdisk_size; >> + int64_t vdisk_blocks; >> + int64_t vdisk_flags; >> + int vdisk_aio_count; >> + int event_reader_pos; >> + VXHSAIOCB *qnio_event_acb; >> + void *qnio_ctx; >> + QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */ >> + QemuSpin vdisk_acb_lock; /* Protects ACB */ > > These comments are insufficient for documenting locking. Not all fields > are actually protected by these locks. Please order fields according to > lock coverage: > > typedef struct VXHSAIOCB { > ... > > /* Protected by BDRVVXHSState->vdisk_acb_lock */ > int segments; > ... > }; > > typedef struct BDRVVXHSState { > ... > > /* Protected by vdisk_lock */ > QemuSpin vdisk_lock; > int vdisk_aio_count; > QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; > ... > } > >> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) >> +{ >> + /* >> + * Close vDisk device >> + */ >> + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { >> + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); > > libqnio comment: > Why does iio_devclose() take an unused cfd argument? Perhaps it can be > dropped. > >> + s->vdisk_hostinfo[idx].vdisk_rfd = -1; >> + } >> + >> + /* >> + * Close QNIO channel against cached channel-fd >> + */ >> + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { >> + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); > > libqnio comment: > Why does iio_devclose() take an int32_t cfd argument but iio_close() > takes a uint32_t cfd argument? > >> + s->vdisk_hostinfo[idx].qnio_cfd = -1; >> + } >> +} >> + >> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, >> + int *rfd, const char *file_name) >> +{ >> + /* >> + * Open qnio channel to storage agent if not opened before. >> + */ >> + if (*cfd < 0) { >> + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); > > libqnio comments: > > 1. > There is a buffer overflow in qnio_create_channel(). strncpy() is used > incorrectly so long hostname or port (both can be 99 characters long) > will overflow channel->name[] (64 characters) or channel->port[] (8 > characters). > > strncpy(channel->name, hostname, strlen(hostname) + 1); > strncpy(channel->port, port, strlen(port) + 1); > > The third argument must be the size of the *destination* buffer, not the > source buffer. Also note that strncpy() doesn't NUL-terminate the > destination string so you must do that manually to ensure there is a NUL > byte at the end of the buffer. > > 2. > channel is leaked in the "Failed to open single connection" error case > in qnio_create_channel(). > > 3. 
> If host is longer the 63 characters then the ioapi_ctx->channels and > qnio_ctx->channels maps will use different keys due to string truncation > in qnio_create_channel(). This means "Channel already exists" in > qnio_create_channel() and possibly other things will not work as > expected. > >> + if (*cfd < 0) { >> + trace_vxhs_qnio_iio_open(of_vsa_addr); >> + return -ENODEV; >> + } >> + } >> + >> + /* >> + * Open vdisk device >> + */ >> + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); > > libqnio comment: > Buffer overflow in iio_devopen() since chandev[128] is not large enough > to hold channel[100] + " " + devpath[arbitrary length] chars: > > sprintf(chandev, "%s %s", channel, devpath); > >> + >> + if (*rfd < 0) { >> + if (*cfd >= 0) { > > This check is always true. Otherwise the return -ENODEV would have been > taken above. The if statement isn't necessary. > >> +static void vxhs_check_failover_status(int res, void *ctx) >> +{ >> + BDRVVXHSState *s = ctx; >> + >> + if (res == 0) { >> + /* found failover target */ >> + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; >> + s->vdisk_ask_failover_idx = 0; >> + trace_vxhs_check_failover_status( >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >> + s->vdisk_guid); >> + qemu_spin_lock(&s->vdisk_lock); >> + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); >> + qemu_spin_unlock(&s->vdisk_lock); >> + vxhs_handle_queued_ios(s); >> + } else { >> + /* keep looking */ >> + trace_vxhs_check_failover_status_retry(s->vdisk_guid); >> + s->vdisk_ask_failover_idx++; >> + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { >> + /* pause and cycle through list again */ >> + sleep(QNIO_CONNECT_RETRY_SECS); > > This code is called from a QEMU thread via vxhs_aio_rw(). It is not > permitted to call sleep() since it will freeze QEMU and probably the > guest. > > If you need a timer you can use QEMU's timer APIs. See aio_timer_new(), > timer_new_ns(), timer_mod(), timer_del(), timer_free(). > >> + s->vdisk_ask_failover_idx = 0; >> + } >> + res = vxhs_switch_storage_agent(s); >> + } >> +} >> + >> +static int vxhs_failover_io(BDRVVXHSState *s) >> +{ >> + int res = 0; >> + >> + trace_vxhs_failover_io(s->vdisk_guid); >> + >> + s->vdisk_ask_failover_idx = 0; >> + res = vxhs_switch_storage_agent(s); >> + >> + return res; >> +} >> + >> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, >> + uint32_t error, uint32_t opcode) > > This function is doing too much. Especially the failover code should > run in the AioContext since it's complex. Don't do failover here > because this function is outside the AioContext lock. Do it from > AioContext using a QEMUBH like block/rbd.c. > >> +static int32_t >> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, >> + uint64_t offset, void *ctx, uint32_t flags) >> +{ >> + struct iovec cur; >> + uint64_t cur_offset = 0; >> + uint64_t cur_write_len = 0; >> + int segcount = 0; >> + int ret = 0; >> + int i, nsio = 0; >> + int iovcnt = qiov->niov; >> + struct iovec *iov = qiov->iov; >> + >> + errno = 0; >> + cur.iov_base = 0; >> + cur.iov_len = 0; >> + >> + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); > > libqnio comments: > > 1. > There are blocking connect(2) and getaddrinfo(3) calls in iio_writev() > so this may hang for arbitrary amounts of time. This is not permitted > in .bdrv_aio_readv()/.bdrv_aio_writev(). Please make qnio actually > asynchronous. > > 2. > Where does client_callback() free reply? It looks like every reply > message causes a memory leak! > > 3. 
> Buffer overflow in iio_writev() since device[128] cannot fit the device > string generated from the vdisk_guid. > > 4. > Buffer overflow in iio_writev() due to > strncpy(msg->hinfo.target,device,strlen(device)) where device[128] is > larger than target[64]. Also note the previous comments about strncpy() > usage. > > 5. > I don't see any endianness handling or portable alignment of struct > fields in the network protocol code. Binary network protocols need to > take care of these issue for portability. This means libqnio compiled > for different architectures will not work. Do you plan to support any > other architectures besides x86? > No, we support only x86 and do not plan to support any other arch. Please let me know if this necessitates any changes to the configure script. > 6. > The networking code doesn't look robust: kvset uses assert() on input > from the network so the other side of the connection could cause SIGABRT > (coredump), the client uses the msg pointer as the cookie for the > response packet so the server can easily crash the client by sending a > bogus cookie value, etc. Even on the client side these things are > troublesome but on a server they are guaranteed security issues. I > didn't look into it deeply. Please audit the code. > By design, our solution on OpenStack platform uses a closed set of nodes communicating on dedicated networks. VxHS servers on all the nodes are on a dedicated network. Clients (qemu) connects to these only after reading the server IP from the XML (read by libvirt). The XML cannot be modified without proper access. Therefore, IMO this problem would be relevant only if someone were to use qnio as a generic mode of communication/data transfer, but for our use-case, we will not run into this problem. Is this explanation acceptable? Will reply to other comments in this email that are still not addressed. Thanks! 
>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, >> + int *cfd, int *rfd, Error **errp) >> +{ >> + QDict *backing_options = NULL; >> + QemuOpts *opts, *tcp_opts; >> + const char *vxhs_filename; >> + char *of_vsa_addr = NULL; >> + Error *local_err = NULL; >> + const char *vdisk_id_opt; >> + char *file_name = NULL; >> + size_t num_servers = 0; >> + char *str = NULL; >> + int ret = 0; >> + int i; >> + >> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); >> + qemu_opts_absorb_qdict(opts, options, &local_err); >> + if (local_err) { >> + error_propagate(errp, local_err); >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); >> + if (vxhs_filename) { >> + trace_vxhs_qemu_init_filename(vxhs_filename); >> + } >> + >> + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); >> + if (!vdisk_id_opt) { >> + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); >> + ret = -EINVAL; >> + goto out; >> + } >> + s->vdisk_guid = g_strdup(vdisk_id_opt); >> + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); >> + >> + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); >> + if (num_servers < 1) { >> + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); >> + ret = -EINVAL; >> + goto out; >> + } else if (num_servers > VXHS_MAX_HOSTS) { >> + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); >> + error_append_hint(errp, "Maximum %d servers allowed.\n", >> + VXHS_MAX_HOSTS); >> + ret = -EINVAL; >> + goto out; >> + } >> + trace_vxhs_qemu_init_numservers(num_servers); >> + >> + for (i = 0; i < num_servers; i++) { >> + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); >> + qdict_extract_subqdict(options, &backing_options, str); >> + >> + /* Create opts info from runtime_tcp_opts list */ >> + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); >> + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); >> + if (local_err) { >> + qdict_del(backing_options, str); > > backing_options is leaked and there's no need to delete the str key. > >> + qemu_opts_del(tcp_opts); >> + g_free(str); >> + ret = -EINVAL; >> + goto out; >> + } >> + >> + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, >> + VXHS_OPT_HOST)); >> + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, >> + VXHS_OPT_PORT), >> + NULL, 0); > > This will segfault if the port option was missing. > >> + >> + s->vdisk_hostinfo[i].qnio_cfd = -1; >> + s->vdisk_hostinfo[i].vdisk_rfd = -1; >> + trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, >> + s->vdisk_hostinfo[i].port); > > It's not safe to use the %s format specifier for a trace event with a > NULL value. In the case where hostip is NULL this could crash on some > systems. > >> + >> + qdict_del(backing_options, str); >> + qemu_opts_del(tcp_opts); >> + g_free(str); >> + } > > backing_options is leaked. > >> + >> + s->vdisk_nhosts = i; >> + s->vdisk_cur_host_idx = 0; >> + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); >> + of_vsa_addr = g_strdup_printf("of://%s:%d", >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, >> + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); > > Can we get here with num_servers == 0? In that case this would access > uninitialized memory. I guess num_servers == 0 does not make sense and > there should be an error case for it. > >> + >> + /* >> + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. 
>> + */ >> + if (global_qnio_ctx == NULL) { >> + global_qnio_ctx = vxhs_setup_qnio(); > > libqnio comment: > The client epoll thread should mask all signals (like > qemu_thread_create()). Otherwise it may receive signals that it cannot > deal with. > >> + if (global_qnio_ctx == NULL) { >> + error_setg(&local_err, "Failed vxhs_setup_qnio"); >> + ret = -EINVAL; >> + goto out; >> + } >> + } >> + >> + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); >> + if (!ret) { >> + error_setg(&local_err, "Failed qnio_iio_open"); >> + ret = -EIO; >> + } > > The return value of vxhs_qnio_iio_open() is 0 for success or -errno for > error. > > I guess you never ran this code! The block driver won't even open > successfully. > >> + >> +out: >> + g_free(file_name); >> + g_free(of_vsa_addr); >> + qemu_opts_del(opts); >> + >> + if (ret < 0) { >> + for (i = 0; i < num_servers; i++) { >> + g_free(s->vdisk_hostinfo[i].hostip); >> + } >> + g_free(s->vdisk_guid); >> + s->vdisk_guid = NULL; >> + errno = -ret; > > There is no need to set errno here. The return value already contains > the error and the caller doesn't look at errno. > >> + } >> + error_propagate(errp, local_err); >> + >> + return ret; >> +} >> + >> +static int vxhs_open(BlockDriverState *bs, QDict *options, >> + int bdrv_flags, Error **errp) >> +{ >> + BDRVVXHSState *s = bs->opaque; >> + AioContext *aio_context; >> + int qemu_qnio_cfd = -1; >> + int device_opened = 0; >> + int qemu_rfd = -1; >> + int ret = 0; >> + int i; >> + >> + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); >> + if (ret < 0) { >> + trace_vxhs_open_fail(ret); >> + return ret; >> + } >> + >> + device_opened = 1; >> + s->qnio_ctx = global_qnio_ctx; >> + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; >> + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; >> + s->vdisk_size = 0; >> + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); >> + >> + /* >> + * Create a pipe for communicating between two threads in different >> + * context. Set handler for read event, which gets triggered when >> + * IO completion is done by non-QEMU context. >> + */ >> + ret = qemu_pipe(s->fds); >> + if (ret < 0) { >> + trace_vxhs_open_epipe('.'); >> + ret = -errno; >> + goto errout; > > This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc. > bdrv_close() will not be called so this function must do cleanup itself. > >> + } >> + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); >> + >> + aio_context = bdrv_get_aio_context(bs); >> + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], >> + false, vxhs_aio_event_reader, NULL, s); >> + >> + /* >> + * Initialize the spin-locks. >> + */ >> + qemu_spin_init(&s->vdisk_lock); >> + qemu_spin_init(&s->vdisk_acb_lock); >> + >> + return 0; >> + >> +errout: >> + /* >> + * Close remote vDisk device if it was opened earlier >> + */ >> + if (device_opened) { > > This is always true. The device_opened variable can be removed. > >> +/* >> + * This allocates QEMU-VXHS callback for each IO >> + * and is passed to QNIO. When QNIO completes the work, >> + * it will be passed back through the callback. 
>> + */ >> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, >> + int64_t sector_num, QEMUIOVector *qiov, >> + int nb_sectors, >> + BlockCompletionFunc *cb, >> + void *opaque, int iodir) >> +{ >> + VXHSAIOCB *acb = NULL; >> + BDRVVXHSState *s = bs->opaque; >> + size_t size; >> + uint64_t offset; >> + int iio_flags = 0; >> + int ret = 0; >> + void *qnio_ctx = s->qnio_ctx; >> + uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd; >> + >> + offset = sector_num * BDRV_SECTOR_SIZE; >> + size = nb_sectors * BDRV_SECTOR_SIZE; >> + >> + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); >> + /* >> + * Setup or initialize VXHSAIOCB. >> + * Every single field should be initialized since >> + * acb will be picked up from the slab without >> + * initializing with zero. >> + */ >> + acb->io_offset = offset; >> + acb->size = size; >> + acb->ret = 0; >> + acb->flags = 0; >> + acb->aio_done = VXHS_IO_INPROGRESS; >> + acb->segments = 0; >> + acb->buffer = 0; >> + acb->qiov = qiov; >> + acb->direction = iodir; >> + >> + qemu_spin_lock(&s->vdisk_lock); >> + if (OF_VDISK_FAILED(s)) { >> + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); >> + qemu_spin_unlock(&s->vdisk_lock); >> + goto errout; >> + } >> + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { >> + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); >> + s->vdisk_aio_retry_qd++; >> + OF_AIOCB_FLAGS_SET_QUEUED(acb); >> + qemu_spin_unlock(&s->vdisk_lock); >> + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); >> + goto out; >> + } >> + s->vdisk_aio_count++; >> + qemu_spin_unlock(&s->vdisk_lock); >> + >> + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); >> + >> + switch (iodir) { >> + case VDISK_AIO_WRITE: >> + vxhs_inc_acb_segment_count(acb, 1); >> + ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov, >> + offset, (void *)acb, iio_flags); >> + break; >> + case VDISK_AIO_READ: >> + vxhs_inc_acb_segment_count(acb, 1); >> + ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov, >> + offset, (void *)acb, iio_flags); >> + break; >> + default: >> + trace_vxhs_aio_rw_invalid(iodir); >> + goto errout; > > s->vdisk_aio_count must be decremented before returning. > >> +static void vxhs_close(BlockDriverState *bs) >> +{ >> + BDRVVXHSState *s = bs->opaque; >> + int i; >> + >> + trace_vxhs_close(s->vdisk_guid); >> + close(s->fds[VDISK_FD_READ]); >> + close(s->fds[VDISK_FD_WRITE]); >> + >> + /* >> + * Clearing all the event handlers for oflame registered to QEMU >> + */ >> + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], >> + false, NULL, NULL, NULL); > > Please remove the event handler before closing the fd. I don't think it > matters in this case but in other scenarios there could be race > conditions if another thread opens an fd and the file descriptor number > is reused.
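For the leak on the qemu_pipe() failure path noted above, a sketch of explicit unwinding inside .bdrv_open(), since bdrv_close() never runs when open fails; it reuses the helpers and fields from the patch:

        ret = qemu_pipe(s->fds);
        if (ret < 0) {
            trace_vxhs_open_epipe('.');
            ret = -errno;
            for (i = 0; i < s->vdisk_nhosts; i++) {
                vxhs_qnio_iio_close(s, i);
                g_free(s->vdisk_hostinfo[i].hostip);
                s->vdisk_hostinfo[i].hostip = NULL;
            }
            g_free(s->vdisk_guid);
            s->vdisk_guid = NULL;
            return ret;
        }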
On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote: > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> 5. >> I don't see any endianness handling or portable alignment of struct >> fields in the network protocol code. Binary network protocols need to >> take care of these issue for portability. This means libqnio compiled >> for different architectures will not work. Do you plan to support any >> other architectures besides x86? >> > > No, we support only x86 and do not plan to support any other arch. > Please let me know if this necessitates any changes to the configure > script. I think no change to ./configure is necessary. The library will only ship on x86 so other platforms will never attempt to compile the code. >> 6. >> The networking code doesn't look robust: kvset uses assert() on input >> from the network so the other side of the connection could cause SIGABRT >> (coredump), the client uses the msg pointer as the cookie for the >> response packet so the server can easily crash the client by sending a >> bogus cookie value, etc. Even on the client side these things are >> troublesome but on a server they are guaranteed security issues. I >> didn't look into it deeply. Please audit the code. >> > > By design, our solution on OpenStack platform uses a closed set of > nodes communicating on dedicated networks. VxHS servers on all the > nodes are on a dedicated network. Clients (qemu) connects to these > only after reading the server IP from the XML (read by libvirt). The > XML cannot be modified without proper access. Therefore, IMO this > problem would be relevant only if someone were to use qnio as a > generic mode of communication/data transfer, but for our use-case, we > will not run into this problem. Is this explanation acceptable? No. The trust model is that the guest is untrusted and in the worst case may gain code execution in QEMU due to security bugs. You are assuming block/vxhs.c and libqnio are trusted but that assumption violates the trust model. In other words: 1. Guest exploits a security hole inside QEMU and gains code execution on the host. 2. Guest uses VxHS client file descriptor on host to send a malicious packet to VxHS server. 3. VxHS server is compromised by guest. 4. Compromised VxHS server sends malicious packets to all other connected clients. 5. All clients have been compromised. This means both the VxHS client and server must be robust. They have to validate inputs to avoid buffer overflows, assertion failures, infinite loops, etc. Stefan
On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote: > On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote: > > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > >> 5. > >> I don't see any endianness handling or portable alignment of struct > >> fields in the network protocol code. Binary network protocols need to > >> take care of these issue for portability. This means libqnio compiled > >> for different architectures will not work. Do you plan to support any > >> other architectures besides x86? > >> > > > > No, we support only x86 and do not plan to support any other arch. > > Please let me know if this necessitates any changes to the configure > > script. > > I think no change to ./configure is necessary. The library will only > ship on x86 so other platforms will never attempt to compile the code. > > >> 6. > >> The networking code doesn't look robust: kvset uses assert() on input > >> from the network so the other side of the connection could cause SIGABRT > >> (coredump), the client uses the msg pointer as the cookie for the > >> response packet so the server can easily crash the client by sending a > >> bogus cookie value, etc. Even on the client side these things are > >> troublesome but on a server they are guaranteed security issues. I > >> didn't look into it deeply. Please audit the code. > >> > > > > By design, our solution on OpenStack platform uses a closed set of > > nodes communicating on dedicated networks. VxHS servers on all the > > nodes are on a dedicated network. Clients (qemu) connects to these > > only after reading the server IP from the XML (read by libvirt). The > > XML cannot be modified without proper access. Therefore, IMO this > > problem would be relevant only if someone were to use qnio as a > > generic mode of communication/data transfer, but for our use-case, we > > will not run into this problem. Is this explanation acceptable? > > No. The trust model is that the guest is untrusted and in the worst > case may gain code execution in QEMU due to security bugs. > > You are assuming block/vxhs.c and libqnio are trusted but that > assumption violates the trust model. > > In other words: > 1. Guest exploits a security hole inside QEMU and gains code execution > on the host. > 2. Guest uses VxHS client file descriptor on host to send a malicious > packet to VxHS server. > 3. VxHS server is compromised by guest. > 4. Compromised VxHS server sends malicious packets to all other > connected clients. > 5. All clients have been compromised. > > This means both the VxHS client and server must be robust. They have > to validate inputs to avoid buffer overflows, assertion failures, > infinite loops, etc. > > Stefan The libqnio code is important with respect to the VxHS driver. It is a bit different than other existing external protocol drivers, in that the current user and developer base is small, and the code itself is pretty new. So I think for the VxHS driver here upstream, we really do need to get some of the libqnio issues squared away. I don't know if we've ever explicitly address the extent to which libqnio issues affect the driver merging, so I figure it is probably worth discussing here. To try and consolidate libqnio discussion, here is what I think I've read / seen from others as the major issues that should be addressed in libqnio: * Code auditing, static analysis, and general code cleanup. 
Things like memory leaks shouldn't be happening, and some prior libqnio compiler warnings imply that there is more code analysis that should be done with libqnio. (With regards to memory leaks: Valgrind may be useful to track these down: # valgrind ./qemu-io -c 'write -pP 0xae 66000 128k' \ vxhs://localhost/test.raw ==30369== LEAK SUMMARY: ==30369== definitely lost: 4,168 bytes in 2 blocks ==30369== indirectly lost: 1,207,720 bytes in 58,085 blocks) * Potential security issues such as buffer overruns, input validation, etc., need to be audited. * Async operations need to be truly asynchronous, without blocking calls. * Daniel pointed out that there is no authentication method for talking to a remote server. This seems a bit scary. Maybe all that is needed here is some clarification of the security scheme for authentication? My impression from above is that you are relying on the networks being private to provide some sort of implicit authentication, though, and this seems fragile (and doesn't protect against a compromised guest or other process on the server, for one). (if I've missed anything, please add it here!) -Jeff
On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: > On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote: > > On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote: > > > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > > >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > > >> 5. > > >> I don't see any endianness handling or portable alignment of struct > > >> fields in the network protocol code. Binary network protocols need to > > >> take care of these issue for portability. This means libqnio compiled > > >> for different architectures will not work. Do you plan to support any > > >> other architectures besides x86? > > >> > > > > > > No, we support only x86 and do not plan to support any other arch. > > > Please let me know if this necessitates any changes to the configure > > > script. > > > > I think no change to ./configure is necessary. The library will only > > ship on x86 so other platforms will never attempt to compile the code. > > > > >> 6. > > >> The networking code doesn't look robust: kvset uses assert() on input > > >> from the network so the other side of the connection could cause SIGABRT > > >> (coredump), the client uses the msg pointer as the cookie for the > > >> response packet so the server can easily crash the client by sending a > > >> bogus cookie value, etc. Even on the client side these things are > > >> troublesome but on a server they are guaranteed security issues. I > > >> didn't look into it deeply. Please audit the code. > > >> > > > > > > By design, our solution on OpenStack platform uses a closed set of > > > nodes communicating on dedicated networks. VxHS servers on all the > > > nodes are on a dedicated network. Clients (qemu) connects to these > > > only after reading the server IP from the XML (read by libvirt). The > > > XML cannot be modified without proper access. Therefore, IMO this > > > problem would be relevant only if someone were to use qnio as a > > > generic mode of communication/data transfer, but for our use-case, we > > > will not run into this problem. Is this explanation acceptable? > > > > No. The trust model is that the guest is untrusted and in the worst > > case may gain code execution in QEMU due to security bugs. > > > > You are assuming block/vxhs.c and libqnio are trusted but that > > assumption violates the trust model. > > > > In other words: > > 1. Guest exploits a security hole inside QEMU and gains code execution > > on the host. > > 2. Guest uses VxHS client file descriptor on host to send a malicious > > packet to VxHS server. > > 3. VxHS server is compromised by guest. > > 4. Compromised VxHS server sends malicious packets to all other > > connected clients. > > 5. All clients have been compromised. > > > > This means both the VxHS client and server must be robust. They have > > to validate inputs to avoid buffer overflows, assertion failures, > > infinite loops, etc. > > > > Stefan > > > The libqnio code is important with respect to the VxHS driver. It is a bit > different than other existing external protocol drivers, in that the current > user and developer base is small, and the code itself is pretty new. So I > think for the VxHS driver here upstream, we really do need to get some of > the libqnio issues squared away. I don't know if we've ever explicitly > address the extent to which libqnio issues affect the driver > merging, so I figure it is probably worth discussing here. 
> > To try and consolidate libqnio discussion, here is what I think I've read / > seen from others as the major issues that should be addressed in libqnio: > > * Code auditing, static analysis, and general code cleanup. Things like > memory leaks shouldn't be happening, and some prior libqnio compiler > warnings imply that there is more code analysis that should be done with > libqnio. > > (With regards to memory leaks: Valgrind may be useful to track these down: > > # valgrind ./qemu-io -c 'write -pP 0xae 66000 128k' \ > vxhs://localhost/test.raw > > ==30369== LEAK SUMMARY: > ==30369== definitely lost: 4,168 bytes in 2 blocks > ==30369== indirectly lost: 1,207,720 bytes in 58,085 blocks) > > * Potential security issues such as buffer overruns, input validation, etc., > need to be audited. > > * Async operations need to be truly asynchronous, without blocking calls. > > * Daniel pointed out that there is no authentication method for taking to a > remote server. This seems a bit scary. Maybe all that is needed here is > some clarification of the security scheme for authentication? My > impression from above is that you are relying on the networks being > private to provide some sort of implicit authentication, though, and this > seems fragile (and doesn't protect against a compromised guest or other > process on the server, for one). While relying on some kind of private network may have been acceptable 10 years ago, I don't think it is a credible authentication / security strategy in the current (increasingly) hostile network environments. You really have to assume as a starting position that even internal networks are compromised these days. Regards, Daniel
On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: > * Daniel pointed out that there is no authentication method for taking to a > remote server. This seems a bit scary. Maybe all that is needed here is > some clarification of the security scheme for authentication? My > impression from above is that you are relying on the networks being > private to provide some sort of implicit authentication, though, and this > seems fragile (and doesn't protect against a compromised guest or other > process on the server, for one). Exactly, from the QEMU trust model you must assume that QEMU has been compromised by the guest. The escaped guest can connect to the VxHS server since it controls the QEMU process. An escaped guest must not have access to other guests' volumes. Therefore authentication is necessary. By the way, QEMU has a secrets API for providing passwords and other sensitive data without passing them on the command-line. The command-line is vulnerable to snooping by other processes so using this API is mandatory. Please see include/crypto/secret.h. Stefan
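As a concrete illustration of that API, a sketch of how the driver could take a credential out of band; the "password-secret" option name for vxhs is hypothetical here and simply mirrors what block/iscsi.c and block/rbd.c already do:

    #include "crypto/secret.h"

    /* Command line (illustrative):
     *   -object secret,id=vxhs0,file=/etc/qemu/vxhs.passwd
     *   -drive driver=vxhs,...,password-secret=vxhs0
     */
        const char *secretid = qemu_opt_get(opts, "password-secret");
        char *password = NULL;

        if (secretid) {
            password = qcrypto_secret_lookup_as_utf8(secretid, &local_err);
            if (!password) {
                ret = -EINVAL;
                goto out;               /* "out:" propagates local_err */
            }
        }
        /* ... hand the credential to libqnio when opening the channel ... */
        g_free(password);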
On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote: >On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote: >> On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote: >> > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: >> >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: >> >> 5. >> >> I don't see any endianness handling or portable alignment of struct >> >> fields in the network protocol code. Binary network protocols need to >> >> take care of these issue for portability. This means libqnio compiled >> >> for different architectures will not work. Do you plan to support any >> >> other architectures besides x86? >> >> >> > >> > No, we support only x86 and do not plan to support any other arch. >> > Please let me know if this necessitates any changes to the configure >> > script. >> >> I think no change to ./configure is necessary. The library will only >> ship on x86 so other platforms will never attempt to compile the code. >> >> >> 6. >> >> The networking code doesn't look robust: kvset uses assert() on input >> >> from the network so the other side of the connection could cause SIGABRT >> >> (coredump), the client uses the msg pointer as the cookie for the >> >> response packet so the server can easily crash the client by sending a >> >> bogus cookie value, etc. Even on the client side these things are >> >> troublesome but on a server they are guaranteed security issues. I >> >> didn't look into it deeply. Please audit the code. >> >> >> > >> > By design, our solution on OpenStack platform uses a closed set of >> > nodes communicating on dedicated networks. VxHS servers on all the >> > nodes are on a dedicated network. Clients (qemu) connects to these >> > only after reading the server IP from the XML (read by libvirt). The >> > XML cannot be modified without proper access. Therefore, IMO this >> > problem would be relevant only if someone were to use qnio as a >> > generic mode of communication/data transfer, but for our use-case, we >> > will not run into this problem. Is this explanation acceptable? >> >> No. The trust model is that the guest is untrusted and in the worst >> case may gain code execution in QEMU due to security bugs. >> >> You are assuming block/vxhs.c and libqnio are trusted but that >> assumption violates the trust model. >> >> In other words: >> 1. Guest exploits a security hole inside QEMU and gains code execution >> on the host. >> 2. Guest uses VxHS client file descriptor on host to send a malicious >> packet to VxHS server. >> 3. VxHS server is compromised by guest. >> 4. Compromised VxHS server sends malicious packets to all other >> connected clients. >> 5. All clients have been compromised. >> >> This means both the VxHS client and server must be robust. They have >> to validate inputs to avoid buffer overflows, assertion failures, >> infinite loops, etc. >> >> Stefan > > >The libqnio code is important with respect to the VxHS driver. It is a bit >different than other existing external protocol drivers, in that the current >user and developer base is small, and the code itself is pretty new. So I >think for the VxHS driver here upstream, we really do need to get some of >the libqnio issues squared away. I don't know if we've ever explicitly >address the extent to which libqnio issues affect the driver >merging, so I figure it is probably worth discussing here. 
> >To try and consolidate libqnio discussion, here is what I think I've read / >seen from others as the major issues that should be addressed in libqnio: > >* Code auditing, static analysis, and general code cleanup. Things like > memory leaks shouldn't be happening, and some prior libqnio compiler > warnings imply that there is more code analysis that should be done with > libqnio. > > (With regards to memory leaks: Valgrind may be useful to track these down: > > # valgrind ./qemu-io -c 'write -pP 0xae 66000 128k' \ > vxhs://localhost/test.raw > > ==30369== LEAK SUMMARY: > ==30369== definitely lost: 4,168 bytes in 2 blocks > ==30369== indirectly lost: 1,207,720 bytes in 58,085 blocks) We have done and are doing exhaustive memory leak tests using valgrind. Memory leaks within qnio have been addressed to some extent. We will post detailed valgrind results to this thread. > >* Potential security issues such as buffer overruns, input validation, etc., > need to be audited. We know of a few such issues from previous comments and have addressed some of those. If there are any important outstanding ones, please let us know and we will fix them as a priority. > >* Async operations need to be truly asynchronous, without blocking calls. There is only one blocking call, in the libqnio reconnect path, which has been pointed out. We will fix this soon. > >* Daniel pointed out that there is no authentication method for taking to a > remote server. This seems a bit scary. Maybe all that is needed here is > some clarification of the security scheme for authentication? My > impression from above is that you are relying on the networks being > private to provide some sort of implicit authentication, though, and this > seems fragile (and doesn't protect against a compromised guest or other > process on the server, for one). Our auth scheme is based on network isolation at the L2/L3 level. If there is a simplified authentication mechanism which we can implement without imposing significant penalties on IO performance, please let us know and we will implement that if feasible. > >(if I've missed anything, please add it here!) > >-Jeff
On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >> * Daniel pointed out that there is no authentication method for taking to a >> remote server. This seems a bit scary. Maybe all that is needed here is >> some clarification of the security scheme for authentication? My >> impression from above is that you are relying on the networks being >> private to provide some sort of implicit authentication, though, and this >> seems fragile (and doesn't protect against a compromised guest or other >> process on the server, for one). > >Exactly, from the QEMU trust model you must assume that QEMU has been >compromised by the guest. The escaped guest can connect to the VxHS >server since it controls the QEMU process. > >An escaped guest must not have access to other guests' volumes. >Therefore authentication is necessary. > >By the way, QEMU has a secrets API for providing passwords and other >sensistive data without passing them on the command-line. The >command-line is vulnerable to snooping by other processes so using this >API is mandatory. Please see include/crypto/secret.h. Stefan, do you have any details on the authentication implemented by qemu-nbd as part of the nbd protocol? > >Stefan
On Fri, Nov 18, 2016 at 10:57:00AM +0000, Ketan Nilangekar wrote: > > > > > > On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: > >> * Daniel pointed out that there is no authentication method for taking to a > >> remote server. This seems a bit scary. Maybe all that is needed here is > >> some clarification of the security scheme for authentication? My > >> impression from above is that you are relying on the networks being > >> private to provide some sort of implicit authentication, though, and this > >> seems fragile (and doesn't protect against a compromised guest or other > >> process on the server, for one). > > > >Exactly, from the QEMU trust model you must assume that QEMU has been > >compromised by the guest. The escaped guest can connect to the VxHS > >server since it controls the QEMU process. > > > >An escaped guest must not have access to other guests' volumes. > >Therefore authentication is necessary. > > > >By the way, QEMU has a secrets API for providing passwords and other > >sensistive data without passing them on the command-line. The > >command-line is vulnerable to snooping by other processes so using this > >API is mandatory. Please see include/crypto/secret.h. > > Stefan, do you have any details on the authentication implemented by > qemu-nbd as part of the nbd protocol? Historically the NBD protocol has zero authentication or encryption facilities. I recently added support for running TLS over the NBD channel. When doing this, the server is able to request an x509 certificate from the client and use the distinguished name from the cert as an identity against which to control access. The glusterfs protocol takes a similar approach of using TLS and x509 certs to provide identities for access control. The Ceph/RBD protocol uses an explicit username+password pair. NFS these days can use Kerberos. My recommendation would be either TLS with x509 certs, or to integrate with SASL, which is a pluggable authentication framework, or better yet to support both. This is what we do for VNC & SPICE auth, as well as for libvirt auth. Regards, Daniel
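For reference, a sketch of how a block driver resolves an x509 credentials object passed as a tls-creds=<id> option, roughly following what block/nbd.c does; actually running the TLS handshake inside libqnio is a separate piece of work and is not shown:

    #include "crypto/tlscreds.h"
    #include "qapi/error.h"
    #include "qom/object.h"

    static QCryptoTLSCreds *vxhs_get_tls_creds(const char *id, Error **errp)
    {
        Object *obj;
        QCryptoTLSCreds *creds;

        obj = object_resolve_path_component(object_get_objects_root(), id);
        if (!obj) {
            error_setg(errp, "No TLS credentials with id '%s'", id);
            return NULL;
        }
        creds = (QCryptoTLSCreds *)object_dynamic_cast(obj,
                                                       TYPE_QCRYPTO_TLS_CREDS);
        if (!creds) {
            error_setg(errp, "Object with id '%s' is not TLS credentials", id);
            return NULL;
        }
        if (creds->endpoint != QCRYPTO_TLS_CREDS_ENDPOINT_CLIENT) {
            error_setg(errp, "Expecting TLS credentials with a client endpoint");
            return NULL;
        }
        object_ref(obj);
        return creds;
    }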
On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >> * Daniel pointed out that there is no authentication method for taking to a >> remote server. This seems a bit scary. Maybe all that is needed here is >> some clarification of the security scheme for authentication? My >> impression from above is that you are relying on the networks being >> private to provide some sort of implicit authentication, though, and this >> seems fragile (and doesn't protect against a compromised guest or other >> process on the server, for one). > >Exactly, from the QEMU trust model you must assume that QEMU has been >compromised by the guest. The escaped guest can connect to the VxHS >server since it controls the QEMU process. > >An escaped guest must not have access to other guests' volumes. >Therefore authentication is necessary. Just so I am clear on this, how will such an escaped guest get to know the other guest vdisk IDs? > >By the way, QEMU has a secrets API for providing passwords and other >sensistive data without passing them on the command-line. The >command-line is vulnerable to snooping by other processes so using this >API is mandatory. Please see include/crypto/secret.h. > >Stefan
On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: > > > > > > On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > >On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: > >> * Daniel pointed out that there is no authentication method for taking to a > >> remote server. This seems a bit scary. Maybe all that is needed here is > >> some clarification of the security scheme for authentication? My > >> impression from above is that you are relying on the networks being > >> private to provide some sort of implicit authentication, though, and this > >> seems fragile (and doesn't protect against a compromised guest or other > >> process on the server, for one). > > > >Exactly, from the QEMU trust model you must assume that QEMU has been > >compromised by the guest. The escaped guest can connect to the VxHS > >server since it controls the QEMU process. > > > >An escaped guest must not have access to other guests' volumes. > >Therefore authentication is necessary. > > Just so I am clear on this, how will such an escaped guest get to know > the other guest vdisk IDs? There can be multiple approaches depending on the deployment scenario. At the very simplest it could directly read the IDs out of the libvirt XML files in /var/run/libvirt. Or it can run "ps" to list other running QEMU processes and see the vdisk IDs in the command line args of those processes. Or the mgmt app may be creating vdisk IDs based on some particular scheme, and the attacker may have info about this which lets them determine likely IDs. Or the QEMU may have previously been permitted to use the disk and remembered the ID for use later after access to the disk has been removed. IOW, you can't rely on security-through-obscurity of the vdisk IDs. Regards, Daniel
> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: > >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: >> >> >> >> >> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >>> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >>>> * Daniel pointed out that there is no authentication method for taking to a >>>> remote server. This seems a bit scary. Maybe all that is needed here is >>>> some clarification of the security scheme for authentication? My >>>> impression from above is that you are relying on the networks being >>>> private to provide some sort of implicit authentication, though, and this >>>> seems fragile (and doesn't protect against a compromised guest or other >>>> process on the server, for one). >>> >>> Exactly, from the QEMU trust model you must assume that QEMU has been >>> compromised by the guest. The escaped guest can connect to the VxHS >>> server since it controls the QEMU process. >>> >>> An escaped guest must not have access to other guests' volumes. >>> Therefore authentication is necessary. >> >> Just so I am clear on this, how will such an escaped guest get to know >> the other guest vdisk IDs? > > There can be a multiple approaches depending on the deployment scenario. > At the very simplest it could directly read the IDs out of the libvirt > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running > QEMU processes and see the vdisk IDs in the command line args of those > processes. Or the mgmt app may be creating vdisk IDs based on some > particular scheme, and the attacker may have info about this which lets > them determine likely IDs. Or the QEMU may have previously been > permitted to the use the disk and remembered the ID for use later > after access to the disk has been removed. > Are we talking about a compromised guest here or compromised hypervisor? How will a compromised guest read the xml file or list running qemu processes? > IOW, you can't rely on security-through-obscurity of the vdisk IDs > > Regards, > Daniel > -- > |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| > |: http://libvirt.org -o- http://virt-manager.org :| > |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote: > > > > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: > > > >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: > >> > >> > >> > >> > >> > >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > >>> > >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: > >>>> * Daniel pointed out that there is no authentication method for taking to a > >>>> remote server. This seems a bit scary. Maybe all that is needed here is > >>>> some clarification of the security scheme for authentication? My > >>>> impression from above is that you are relying on the networks being > >>>> private to provide some sort of implicit authentication, though, and this > >>>> seems fragile (and doesn't protect against a compromised guest or other > >>>> process on the server, for one). > >>> > >>> Exactly, from the QEMU trust model you must assume that QEMU has been > >>> compromised by the guest. The escaped guest can connect to the VxHS > >>> server since it controls the QEMU process. > >>> > >>> An escaped guest must not have access to other guests' volumes. > >>> Therefore authentication is necessary. > >> > >> Just so I am clear on this, how will such an escaped guest get to know > >> the other guest vdisk IDs? > > > > There can be a multiple approaches depending on the deployment scenario. > > At the very simplest it could directly read the IDs out of the libvirt > > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running > > QEMU processes and see the vdisk IDs in the command line args of those > > processes. Or the mgmt app may be creating vdisk IDs based on some > > particular scheme, and the attacker may have info about this which lets > > them determine likely IDs. Or the QEMU may have previously been > > permitted to the use the disk and remembered the ID for use later > > after access to the disk has been removed. > > > > Are we talking about a compromised guest here or compromised hypervisor? > How will a compromised guest read the xml file or list running qemu > processes? Compromised QEMU process, aka hypervisor userspace Regards, Daniel
Ketan Nilangekar <Ketan.Nilangekar@veritas.com> writes: > On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote: [...] >>* Daniel pointed out that there is no authentication method for taking to a >> remote server. This seems a bit scary. Maybe all that is needed here is >> some clarification of the security scheme for authentication? My >> impression from above is that you are relying on the networks being >> private to provide some sort of implicit authentication, though, and this >> seems fragile (and doesn't protect against a compromised guest or other >> process on the server, for one). > > Our auth scheme is based on network isolation at L2/L3 level. Stefan already explained the trust model. Since understanding it is crucial to security work, let me use the opportunity to explain it once more. The guest is untrusted. It interacts only with QEMU and, if enabled, KVM. KVM has a relatively small attack surface, but if the guest penetrates it, game's over. There's nothing we can do to mitigate. QEMU has a much larger attack surface, but we *can* do something to mitigate a compromise: nothing on the host trusts QEMU. Second line of defense. A line of defense is as strong as its weakest point. Adding an interface between QEMU and the host that requires the host to trust QEMU basically destroys the second line of defense. That's a big deal. You might argue that you don't require "the host" to trust, but only your daemon (or whatever it is your driver talks to). But that puts that daemon in the same security domain as QEMU itself, i.e. it should not be trusted by anything else on the host. Now you have a second problem. If you rely on "network isolation at L2/L3 level", chances are *everything* on this isolated network joins QEMU's security domain. You almost certainly need a separate isolated network per guest to have a chance at being credible. Even then, I'd rather not bet my own money on it. It's better to stick to the common trust model, and have *nothing* on the host trust QEMU. > If there is a simplified authentication mechanism which we can implement without imposing significant penalties on IO performance, please let us know and we will implement that if feasible. Daniel already listed available mechanisms.
On Fri, Nov 18, 2016 at 10:34:54AM +0000, Ketan Nilangekar wrote: > > > > > > On 11/18/16, 12:56 PM, "Jeff Cody" <jcody@redhat.com> wrote: > > >On Wed, Nov 16, 2016 at 08:12:41AM +0000, Stefan Hajnoczi wrote: > >> On Tue, Nov 15, 2016 at 10:38 PM, ashish mittal <ashmit602@gmail.com> wrote: > >> > On Wed, Sep 28, 2016 at 2:45 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> >> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote: > >> >> 5. > >> >> I don't see any endianness handling or portable alignment of struct > >> >> fields in the network protocol code. Binary network protocols need to > >> >> take care of these issue for portability. This means libqnio compiled > >> >> for different architectures will not work. Do you plan to support any > >> >> other architectures besides x86? > >> >> > >> > > >> > No, we support only x86 and do not plan to support any other arch. > >> > Please let me know if this necessitates any changes to the configure > >> > script. > >> > >> I think no change to ./configure is necessary. The library will only > >> ship on x86 so other platforms will never attempt to compile the code. > >> > >> >> 6. > >> >> The networking code doesn't look robust: kvset uses assert() on input > >> >> from the network so the other side of the connection could cause SIGABRT > >> >> (coredump), the client uses the msg pointer as the cookie for the > >> >> response packet so the server can easily crash the client by sending a > >> >> bogus cookie value, etc. Even on the client side these things are > >> >> troublesome but on a server they are guaranteed security issues. I > >> >> didn't look into it deeply. Please audit the code. > >> >> > >> > > >> > By design, our solution on OpenStack platform uses a closed set of > >> > nodes communicating on dedicated networks. VxHS servers on all the > >> > nodes are on a dedicated network. Clients (qemu) connects to these > >> > only after reading the server IP from the XML (read by libvirt). The > >> > XML cannot be modified without proper access. Therefore, IMO this > >> > problem would be relevant only if someone were to use qnio as a > >> > generic mode of communication/data transfer, but for our use-case, we > >> > will not run into this problem. Is this explanation acceptable? > >> > >> No. The trust model is that the guest is untrusted and in the worst > >> case may gain code execution in QEMU due to security bugs. > >> > >> You are assuming block/vxhs.c and libqnio are trusted but that > >> assumption violates the trust model. > >> > >> In other words: > >> 1. Guest exploits a security hole inside QEMU and gains code execution > >> on the host. > >> 2. Guest uses VxHS client file descriptor on host to send a malicious > >> packet to VxHS server. > >> 3. VxHS server is compromised by guest. > >> 4. Compromised VxHS server sends malicious packets to all other > >> connected clients. > >> 5. All clients have been compromised. > >> > >> This means both the VxHS client and server must be robust. They have > >> to validate inputs to avoid buffer overflows, assertion failures, > >> infinite loops, etc. > >> > >> Stefan > > > > > >The libqnio code is important with respect to the VxHS driver. It is a bit > >different than other existing external protocol drivers, in that the current > >user and developer base is small, and the code itself is pretty new. So I > >think for the VxHS driver here upstream, we really do need to get some of > >the libqnio issues squared away. 
I don't know if we've ever explicitly > >address the extent to which libqnio issues affect the driver > >merging, so I figure it is probably worth discussing here. > > > >To try and consolidate libqnio discussion, here is what I think I've read / > >seen from others as the major issues that should be addressed in libqnio: > > > >* Code auditing, static analysis, and general code cleanup. Things like > > memory leaks shouldn't be happening, and some prior libqnio compiler > > warnings imply that there is more code analysis that should be done with > > libqnio. > > > > (With regards to memory leaks: Valgrind may be useful to track these down: > > > > # valgrind ./qemu-io -c 'write -pP 0xae 66000 128k' \ > > vxhs://localhost/test.raw > > > > ==30369== LEAK SUMMARY: > > ==30369== definitely lost: 4,168 bytes in 2 blocks > > ==30369== indirectly lost: 1,207,720 bytes in 58,085 blocks) > > We have done and are doing exhaustive memory leak tests using valgrind. > Memory leaks within qnio have been addressed to some extent. We will post > detailed valgrind results to this thread. > That is good to hear. I ran the above on the latest HEAD from the qnio github repo, so I look forward to checking out the latest code once it is available. > > > >* Potential security issues such as buffer overruns, input validation, etc., > > need to be audited. > > We have known a few such issues from previous comments and have addressed > some of those. If there are any important outstanding ones, please let us > know and we will fix them on priority. > One concern is that the issues noted are not from an exhaustive review on Stefan's part, AFAIK. When Stefan called for auditing the code, that is really a call to look for other potential security flaws as well, using the perspective he outlined on the trust model. > > > >* Async operations need to be truly asynchronous, without blocking calls. > > There is only one blocking call in libqnio reconnect which we has been > pointed out. We will fix this soon. > Great, thanks! > > > >* Daniel pointed out that there is no authentication method for taking to a > > remote server. This seems a bit scary. Maybe all that is needed here is > > some clarification of the security scheme for authentication? My > > impression from above is that you are relying on the networks being > > private to provide some sort of implicit authentication, though, and this > > seems fragile (and doesn't protect against a compromised guest or other > > process on the server, for one). > > Our auth scheme is based on network isolation at L2/L3 level. If there is > a simplified authentication mechanism which we can implement without > imposing significant penalties on IO performance, please let us know and > we will implement that if feasible. > > > > >(if I've missed anything, please add it here!) > > > >-Jeff
+Nitin Jerath from Veritas. On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote: >On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote: >> >> >> > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: >> > >> >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: >> >> >> >> >> >> >> >> >> >> >> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >> >>> >> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >> >>>> * Daniel pointed out that there is no authentication method for taking to a >> >>>> remote server. This seems a bit scary. Maybe all that is needed here is >> >>>> some clarification of the security scheme for authentication? My >> >>>> impression from above is that you are relying on the networks being >> >>>> private to provide some sort of implicit authentication, though, and this >> >>>> seems fragile (and doesn't protect against a compromised guest or other >> >>>> process on the server, for one). >> >>> >> >>> Exactly, from the QEMU trust model you must assume that QEMU has been >> >>> compromised by the guest. The escaped guest can connect to the VxHS >> >>> server since it controls the QEMU process. >> >>> >> >>> An escaped guest must not have access to other guests' volumes. >> >>> Therefore authentication is necessary. >> >> >> >> Just so I am clear on this, how will such an escaped guest get to know >> >> the other guest vdisk IDs? >> > >> > There can be a multiple approaches depending on the deployment scenario. >> > At the very simplest it could directly read the IDs out of the libvirt >> > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running >> > QEMU processes and see the vdisk IDs in the command line args of those >> > processes. Or the mgmt app may be creating vdisk IDs based on some >> > particular scheme, and the attacker may have info about this which lets >> > them determine likely IDs. Or the QEMU may have previously been >> > permitted to the use the disk and remembered the ID for use later >> > after access to the disk has been removed. >> > >> >> Are we talking about a compromised guest here or compromised hypervisor? >> How will a compromised guest read the xml file or list running qemu >> processes? > >Compromised QEMU process, aka hypervisor userspace > > >Regards, >Daniel >-- >|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| >|: http://libvirt.org -o- http://virt-manager.org :| >|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
On the topic of protocol security - Would it be enough for the first patch to implement only authentication and not encryption? On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar <Ketan.Nilangekar@veritas.com> wrote: > +Nitin Jerath from Veritas. > > > > > On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote: > >>On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote: >>> >>> >>> > On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: >>> > >>> >> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: >>> >> >>> >> >>> >> >>> >> >>> >> >>> >>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >>> >>> >>> >>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >>> >>>> * Daniel pointed out that there is no authentication method for taking to a >>> >>>> remote server. This seems a bit scary. Maybe all that is needed here is >>> >>>> some clarification of the security scheme for authentication? My >>> >>>> impression from above is that you are relying on the networks being >>> >>>> private to provide some sort of implicit authentication, though, and this >>> >>>> seems fragile (and doesn't protect against a compromised guest or other >>> >>>> process on the server, for one). >>> >>> >>> >>> Exactly, from the QEMU trust model you must assume that QEMU has been >>> >>> compromised by the guest. The escaped guest can connect to the VxHS >>> >>> server since it controls the QEMU process. >>> >>> >>> >>> An escaped guest must not have access to other guests' volumes. >>> >>> Therefore authentication is necessary. >>> >> >>> >> Just so I am clear on this, how will such an escaped guest get to know >>> >> the other guest vdisk IDs? >>> > >>> > There can be a multiple approaches depending on the deployment scenario. >>> > At the very simplest it could directly read the IDs out of the libvirt >>> > XML files in /var/run/libvirt. Or it can rnu "ps" to list other running >>> > QEMU processes and see the vdisk IDs in the command line args of those >>> > processes. Or the mgmt app may be creating vdisk IDs based on some >>> > particular scheme, and the attacker may have info about this which lets >>> > them determine likely IDs. Or the QEMU may have previously been >>> > permitted to the use the disk and remembered the ID for use later >>> > after access to the disk has been removed. >>> > >>> >>> Are we talking about a compromised guest here or compromised hypervisor? >>> How will a compromised guest read the xml file or list running qemu >>> processes? >> >>Compromised QEMU process, aka hypervisor userspace >> >> >>Regards, >>Daniel >>-- >>|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| >>|: http://libvirt.org -o- http://virt-manager.org :| >>|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
On 23/11/2016 23:09, ashish mittal wrote: > On the topic of protocol security - > > Would it be enough for the first patch to implement only > authentication and not encryption? Yes, of course. However, as we introduce more and more QEMU-specific characteristics to a protocol that is already QEMU-specific (it doesn't do failover, etc.), I am still not sure of the actual benefit of using libqnio versus having an NBD server or FUSE driver. You have already mentioned performance, but the design has changed so much that I think one of the two things has to change: either failover moves back to QEMU and there is no (closed source) translator running on the node, or the translator needs to speak a well-known and already-supported protocol. Paolo > On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar > <Ketan.Nilangekar@veritas.com> wrote: >> +Nitin Jerath from Veritas. >> >> >> >> >> On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote: >> >>> On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote: >>>> >>>> >>>>> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: >>>>> >>>>>> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >>>>>>> >>>>>>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >>>>>>>> * Daniel pointed out that there is no authentication method for taking to a >>>>>>>> remote server. This seems a bit scary. Maybe all that is needed here is >>>>>>>> some clarification of the security scheme for authentication? My >>>>>>>> impression from above is that you are relying on the networks being >>>>>>>> private to provide some sort of implicit authentication, though, and this >>>>>>>> seems fragile (and doesn't protect against a compromised guest or other >>>>>>>> process on the server, for one). >>>>>>> >>>>>>> Exactly, from the QEMU trust model you must assume that QEMU has been >>>>>>> compromised by the guest. The escaped guest can connect to the VxHS >>>>>>> server since it controls the QEMU process. >>>>>>> >>>>>>> An escaped guest must not have access to other guests' volumes. >>>>>>> Therefore authentication is necessary. >>>>>> >>>>>> Just so I am clear on this, how will such an escaped guest get to know >>>>>> the other guest vdisk IDs? >>>>> >>>>> There can be a multiple approaches depending on the deployment scenario. >>>>> At the very simplest it could directly read the IDs out of the libvirt >>>>> XML files in /var/run/libvirt. Or it can rnu "ps" to list other running >>>>> QEMU processes and see the vdisk IDs in the command line args of those >>>>> processes. Or the mgmt app may be creating vdisk IDs based on some >>>>> particular scheme, and the attacker may have info about this which lets >>>>> them determine likely IDs. Or the QEMU may have previously been >>>>> permitted to the use the disk and remembered the ID for use later >>>>> after access to the disk has been removed. >>>>> >>>> >>>> Are we talking about a compromised guest here or compromised hypervisor? >>>> How will a compromised guest read the xml file or list running qemu >>>> processes? >>> >>> Compromised QEMU process, aka hypervisor userspace >>> >>> >>> Regards, >>> Daniel >>> -- >>> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| >>> |: http://libvirt.org -o- http://virt-manager.org :| >>> |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > >On 23/11/2016 23:09, ashish mittal wrote: >> On the topic of protocol security - >> >> Would it be enough for the first patch to implement only >> authentication and not encryption? > >Yes, of course. However, as we introduce more and more QEMU-specific >characteristics to a protocol that is already QEMU-specific (it doesn't >do failover, etc.), I am still not sure of the actual benefit of using >libqnio versus having an NBD server or FUSE driver. > >You have already mentioned performance, but the design has changed so >much that I think one of the two things has to change: either failover >moves back to QEMU and there is no (closed source) translator running on >the node, or the translator needs to speak a well-known and >already-supported protocol. IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. Ketan > >Paolo > >> On Wed, Nov 23, 2016 at 12:25 AM, Ketan Nilangekar >> <Ketan.Nilangekar@veritas.com> wrote: >>> +Nitin Jerath from Veritas. >>> >>> >>> >>> >>> On 11/18/16, 7:06 PM, "Daniel P. Berrange" <berrange@redhat.com> wrote: >>> >>>> On Fri, Nov 18, 2016 at 01:25:43PM +0000, Ketan Nilangekar wrote: >>>>> >>>>> >>>>>> On Nov 18, 2016, at 5:25 PM, Daniel P. Berrange <berrange@redhat.com> wrote: >>>>>> >>>>>>> On Fri, Nov 18, 2016 at 11:36:02AM +0000, Ketan Nilangekar wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 11/18/16, 3:32 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >>>>>>>> >>>>>>>>> On Fri, Nov 18, 2016 at 02:26:21AM -0500, Jeff Cody wrote: >>>>>>>>> * Daniel pointed out that there is no authentication method for taking to a >>>>>>>>> remote server. This seems a bit scary. Maybe all that is needed here is >>>>>>>>> some clarification of the security scheme for authentication? My >>>>>>>>> impression from above is that you are relying on the networks being >>>>>>>>> private to provide some sort of implicit authentication, though, and this >>>>>>>>> seems fragile (and doesn't protect against a compromised guest or other >>>>>>>>> process on the server, for one). >>>>>>>> >>>>>>>> Exactly, from the QEMU trust model you must assume that QEMU has been >>>>>>>> compromised by the guest. The escaped guest can connect to the VxHS >>>>>>>> server since it controls the QEMU process. >>>>>>>> >>>>>>>> An escaped guest must not have access to other guests' volumes. >>>>>>>> Therefore authentication is necessary. >>>>>>> >>>>>>> Just so I am clear on this, how will such an escaped guest get to know >>>>>>> the other guest vdisk IDs? >>>>>> >>>>>> There can be a multiple approaches depending on the deployment scenario. >>>>>> At the very simplest it could directly read the IDs out of the libvirt >>>>>> XML files in /var/run/libvirt. 
Or it can rnu "ps" to list other running >>>>>> QEMU processes and see the vdisk IDs in the command line args of those >>>>>> processes. Or the mgmt app may be creating vdisk IDs based on some >>>>>> particular scheme, and the attacker may have info about this which lets >>>>>> them determine likely IDs. Or the QEMU may have previously been >>>>>> permitted to the use the disk and remembered the ID for use later >>>>>> after access to the disk has been removed. >>>>>> >>>>> >>>>> Are we talking about a compromised guest here or compromised hypervisor? >>>>> How will a compromised guest read the xml file or list running qemu >>>>> processes? >>>> >>>> Compromised QEMU process, aka hypervisor userspace >>>> >>>> >>>> Regards, >>>> Daniel >>>> -- >>>> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| >>>> |: http://libvirt.org -o- http://virt-manager.org :| >>>> |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
On Wed, Nov 23, 2016 at 02:09:50PM -0800, ashish mittal wrote: > On the topic of protocol security - > > Would it be enough for the first patch to implement only > authentication and not encryption? Yes, authentication is the only critical thing from my POV. While encryption is a nice-to-have, there are plenty of storage systems which do *not* do encryption. Guest data can still be protected simply by running LUKS on the guest disks, so lack of encryption is not a serious security risk, provided the authentication scheme itself does not require encryption in order to be secure. Regards, Daniel
On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > >On 23/11/2016 23:09, ashish mittal wrote: > >> On the topic of protocol security - > >> > >> Would it be enough for the first patch to implement only > >> authentication and not encryption? > > > >Yes, of course. However, as we introduce more and more QEMU-specific > >characteristics to a protocol that is already QEMU-specific (it doesn't > >do failover, etc.), I am still not sure of the actual benefit of using > >libqnio versus having an NBD server or FUSE driver. > > > >You have already mentioned performance, but the design has changed so > >much that I think one of the two things has to change: either failover > >moves back to QEMU and there is no (closed source) translator running on > >the node, or the translator needs to speak a well-known and > >already-supported protocol. > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. By "cross memory attach" do you mean process_vm_readv(2)/process_vm_writev(2)? That puts us back to square one in terms of security. You have (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of another process on the same machine. That process is therefore also untrusted and may only process data for one guest so that guests stay isolated from each other. There's an easier way to get even better performance: get rid of libqnio and the external process. Move the code from the external process into QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and context switching. Can you remind me why there needs to be an external process? Stefan
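For readers unfamiliar with the calls named above, the following is a minimal sketch of what "cross memory attach" means in practice. The target pid, remote address and length are hypothetical arguments that a real client and server would first have to exchange over some IPC channel; note also that the kernel subjects these calls to the same PTRACE_MODE_ATTACH permission check as ptrace(2), i.e. roughly the caller must run as the same user as the target or hold CAP_SYS_PTRACE, which is the trust-boundary issue discussed in the following messages.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <pid> <remote-addr-hex> <len>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    void *remote_addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 16);
    size_t len = (size_t)strtoull(argv[3], NULL, 10);

    char *buf = malloc(len);
    if (!buf) {
        return 1;
    }

    struct iovec local  = { .iov_base = buf,         .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* Copy len bytes directly out of the target process's address space. */
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        perror("process_vm_readv");
        free(buf);
        return 1;
    }
    printf("read %zd bytes from pid %d\n", n, (int)pid);
    free(buf);
    return 0;
}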
On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > >On 23/11/2016 23:09, ashish mittal wrote: > >> On the topic of protocol security - > >> > >> Would it be enough for the first patch to implement only > >> authentication and not encryption? > > > >Yes, of course. However, as we introduce more and more QEMU-specific > >characteristics to a protocol that is already QEMU-specific (it doesn't > >do failover, etc.), I am still not sure of the actual benefit of using > >libqnio versus having an NBD server or FUSE driver. > > > >You have already mentioned performance, but the design has changed so > >much that I think one of the two things has to change: either failover > >moves back to QEMU and there is no (closed source) translator running on > >the node, or the translator needs to speak a well-known and > >already-supported protocol. > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. By "cross memory attach" do you mean process_vm_readv(2)/process_vm_writev(2)? Ketan> Yes. That puts us back to square one in terms of security. You have (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of another process on the same machine. That process is therefore also untrusted and may only process data for one guest so that guests stay isolated from each other. Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. There's an easier way to get even better performance: get rid of libqnio and the external process. Move the code from the external process into QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and context switching. Can you remind me why there needs to be an external process? Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. Ketan. Stefan
On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: > > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > >On 23/11/2016 23:09, ashish mittal wrote: > > >> On the topic of protocol security - > > >> > > >> Would it be enough for the first patch to implement only > > >> authentication and not encryption? > > > > > >Yes, of course. However, as we introduce more and more QEMU-specific > > >characteristics to a protocol that is already QEMU-specific (it doesn't > > >do failover, etc.), I am still not sure of the actual benefit of using > > >libqnio versus having an NBD server or FUSE driver. > > > > > >You have already mentioned performance, but the design has changed so > > >much that I think one of the two things has to change: either failover > > >moves back to QEMU and there is no (closed source) translator running on > > >the node, or the translator needs to speak a well-known and > > >already-supported protocol. > > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. > > By "cross memory attach" do you mean > process_vm_readv(2)/process_vm_writev(2)? > > Ketan> Yes. > > That puts us back to square one in terms of security. You have > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of > another process on the same machine. That process is therefore also > untrusted and may only process data for one guest so that guests stay > isolated from each other. > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. This is incorrect. Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. It means process A reads/writes directly from/to process B memory. Both processes must have the same uid/gid. There is no trust boundary between them. Network communication does not require both processes to have the same uid/gid. If you want multiple QEMU processes talking to a single server there must be a trust boundary between client and server. The server can validate the input from the client and reject undesired operations. Hope this makes sense now. Two architectures that implement the QEMU trust model correctly are: 1. Cross memory attach: each QEMU process has a dedicated vxhs server process to prevent guests from attacking each other. This is where I said you might as well put the code inside QEMU since there is no isolation anyway. 
From what you've said it sounds like the vxhs server needs a host-wide view and is responsible for all guests running on the host, so I guess we have to rule out this architecture. 2. Network communication: one vxhs server process and multiple guests. Here you might as well use NBD or iSCSI because it already exists and the vxhs driver doesn't add any unique functionality over existing protocols. > There's an easier way to get even better performance: get rid of libqnio > and the external process. Move the code from the external process into > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and > context switching. > > Can you remind me why there needs to be an external process? > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. This sounds similar to what QEMU and Linux (file systems, LVM, RAID, etc) already do. It brings to mind a third architecture: 3. A Linux driver or file system. Then QEMU opens a raw block device. This is what the Ceph rbd block driver in Linux does. This architecture has a kernel-userspace boundary so vxhs does not have to trust QEMU. I suggest Architecture #2. You'll be able to deploy on existing systems because QEMU already supports NBD or iSCSI. Use the time you gain from switching to this architecture on benchmarking and optimizing NBD or iSCSI so performance is closer to your goal. Stefan
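To make the point above about the server validating client input concrete, here is a rough sketch of the kind of defensive parsing a server exposed to untrusted QEMU clients needs. The header layout, magic value and limits are hypothetical and are not the libqnio wire format, and endianness conversion is omitted for brevity; this mirrors the earlier review comment that assert() on data received from the network must be replaced by validation that rejects the request.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REQ_MAGIC    0x56584853u          /* hypothetical magic value */
#define MAX_OPCODE   3u                   /* hypothetical highest opcode */
#define MAX_PAYLOAD  (4u * 1024 * 1024)   /* hypothetical per-request cap */

struct req_header {
    uint32_t magic;
    uint32_t opcode;
    uint64_t offset;
    uint32_t payload_len;
};

/*
 * Return true only if every field received from the (untrusted) client is
 * within the limits the server is willing to serve.  Malformed input is
 * rejected with an error reply, never with assert() or abort().
 */
static bool req_header_valid(const uint8_t *buf, size_t buf_len,
                             uint64_t vdisk_size)
{
    struct req_header hdr;

    if (buf_len < sizeof(hdr)) {
        return false;                       /* short request */
    }
    memcpy(&hdr, buf, sizeof(hdr));         /* avoid unaligned access */

    if (hdr.magic != REQ_MAGIC ||
        hdr.opcode > MAX_OPCODE ||
        hdr.payload_len > MAX_PAYLOAD) {
        return false;
    }
    /*
     * offset + payload_len must stay inside the exported vdisk; the check
     * is written so that it cannot itself overflow.
     */
    if (hdr.offset > vdisk_size ||
        hdr.payload_len > vdisk_size - hdr.offset) {
        return false;
    }
    return true;
}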
On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: > > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > >On 23/11/2016 23:09, ashish mittal wrote: > > >> On the topic of protocol security - > > >> > > >> Would it be enough for the first patch to implement only > > >> authentication and not encryption? > > > > > >Yes, of course. However, as we introduce more and more QEMU-specific > > >characteristics to a protocol that is already QEMU-specific (it doesn't > > >do failover, etc.), I am still not sure of the actual benefit of using > > >libqnio versus having an NBD server or FUSE driver. > > > > > >You have already mentioned performance, but the design has changed so > > >much that I think one of the two things has to change: either failover > > >moves back to QEMU and there is no (closed source) translator running on > > >the node, or the translator needs to speak a well-known and > > >already-supported protocol. > > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. > > By "cross memory attach" do you mean > process_vm_readv(2)/process_vm_writev(2)? > > Ketan> Yes. > > That puts us back to square one in terms of security. You have > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of > another process on the same machine. That process is therefore also > untrusted and may only process data for one guest so that guests stay > isolated from each other. > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. This is incorrect. Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. It means process A reads/writes directly from/to process B memory. Both processes must have the same uid/gid. There is no trust boundary between them. Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. Network communication does not require both processes to have the same uid/gid. If you want multiple QEMU processes talking to a single server there must be a trust boundary between client and server. The server can validate the input from the client and reject undesired operations. Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. 
Question is, would that be acceptable? Hope this makes sense now. Two architectures that implement the QEMU trust model correctly are: 1. Cross memory attach: each QEMU process has a dedicated vxhs server process to prevent guests from attacking each other. This is where I said you might as well put the code inside QEMU since there is no isolation anyway. From what you've said it sounds like the vxhs server needs a host-wide view and is responsible for all guests running on the host, so I guess we have to rule out this architecture. 2. Network communication: one vxhs server process and multiple guests. Here you might as well use NBD or iSCSI because it already exists and the vxhs driver doesn't add any unique functionality over existing protocols. Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support. There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion. > There's an easier way to get even better performance: get rid of libqnio > and the external process. Move the code from the external process into > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and > context switching. > > Can you remind me why there needs to be an external process? > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. This sounds similar to what QEMU and Linux (file systems, LVM, RAID, etc) already do. It brings to mind a third architecture: 3. A Linux driver or file system. Then QEMU opens a raw block device. This is what the Ceph rbd block driver in Linux does. This architecture has a kernel-userspace boundary so vxhs does not have to trust QEMU. I suggest Architecture #2. You'll be able to deploy on existing systems because QEMU already supports NBD or iSCSI. Use the time you gain from switching to this architecture on benchmarking and optimizing NBD or iSCSI so performance is closer to your goal. Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community. Stefan
On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote: > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > > >On 23/11/2016 23:09, ashish mittal wrote: > > > >> On the topic of protocol security - > > > >> > > > >> Would it be enough for the first patch to implement only > > > >> authentication and not encryption? > > > > > > > >Yes, of course. However, as we introduce more and more QEMU-specific > > > >characteristics to a protocol that is already QEMU-specific (it doesn't > > > >do failover, etc.), I am still not sure of the actual benefit of using > > > >libqnio versus having an NBD server or FUSE driver. > > > > > > > >You have already mentioned performance, but the design has changed so > > > >much that I think one of the two things has to change: either failover > > > >moves back to QEMU and there is no (closed source) translator running on > > > >the node, or the translator needs to speak a well-known and > > > >already-supported protocol. > > > > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > > > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. > > > > By "cross memory attach" do you mean > > process_vm_readv(2)/process_vm_writev(2)? > > > > Ketan> Yes. > > > > That puts us back to square one in terms of security. You have > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of > > another process on the same machine. That process is therefore also > > untrusted and may only process data for one guest so that guests stay > > isolated from each other. > > > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. > > This is incorrect. > > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. > It means process A reads/writes directly from/to process B memory. Both > processes must have the same uid/gid. There is no trust boundary > between them. > > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. > > Network communication does not require both processes to have the same > uid/gid. If you want multiple QEMU processes talking to a single server > there must be a trust boundary between client and server. The server > can validate the input from the client and reject undesired operations. 
> > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable? > > Hope this makes sense now. > > Two architectures that implement the QEMU trust model correctly are: > > 1. Cross memory attach: each QEMU process has a dedicated vxhs server > process to prevent guests from attacking each other. This is where I > said you might as well put the code inside QEMU since there is no > isolation anyway. From what you've said it sounds like the vxhs > server needs a host-wide view and is responsible for all guests > running on the host, so I guess we have to rule out this > architecture. > > 2. Network communication: one vxhs server process and multiple guests. > Here you might as well use NBD or iSCSI because it already exists and > the vxhs driver doesn't add any unique functionality over existing > protocols. > > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support. NBD over TCP supports TLS with X.509 certificate authentication. I think Daniel Berrange mentioned that. NBD over AF_UNIX does not need authentication because it relies on file permissions for access control. Each guest should have its own UNIX domain socket that it connects to. That socket can only see exports that have been assigned to the guest. > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion. Please discuss it now so everyone gets on the same page. I think there is a big gap and we need to communicate so that progress can be made. > > There's an easier way to get even better performance: get rid of libqnio > > and the external process. Move the code from the external process into > > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and > > context switching. > > > > Can you remind me why there needs to be an external process? > > > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. > > This sounds similar to what QEMU and Linux (file systems, LVM, RAID, > etc) already do. It brings to mind a third architecture: > > 3. A Linux driver or file system. Then QEMU opens a raw block device. > This is what the Ceph rbd block driver in Linux does. This > architecture has a kernel-userspace boundary so vxhs does not have to > trust QEMU. > > I suggest Architecture #2. You'll be able to deploy on existing systems > because QEMU already supports NBD or iSCSI. Use the time you gain from > switching to this architecture on benchmarking and optimizing NBD or > iSCSI so performance is closer to your goal. > > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community. I thought the VxHS block driver was another network block driver like GlusterFS or Sheepdog but you are actually proposing a new local I/O tap with the goal of better performance. 
Please share fio(1) or other standard benchmark configuration files and performance results. NBD and libqnio wire protocols have comparable performance characteristics. There is no magic that should give either one a fundamental edge over the other. Am I missing something? The main performance difference is probably that libqnio opens 8 simultaneous connections, but that's not unique to the wire protocol. What happens when you run 8 simultaneous NBD TCP connections? Stefan
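As an illustration of the NBD-over-AF_UNIX access control described above, the sketch below shows how a per-guest UNIX domain socket can be restricted with ordinary file permissions so that only that guest's QEMU uid can connect. The socket path and uid are hypothetical, the server is assumed to run with enough privilege to chown the socket, and a real deployment would normally let the management layer (e.g. libvirt driving qemu-nbd) arrange the equivalent.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Create one listening socket per guest.  The trust boundary is the
 * filesystem permission on the socket inode, not a protocol-level
 * authentication step.  A real implementation would create the socket
 * inside a directory only accessible to the guest's uid to avoid the
 * window between bind() and chmod().
 */
static int listen_for_guest(const char *path, uid_t qemu_uid)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);                           /* remove a stale socket, if any */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        perror("bind/listen");
        close(fd);
        return -1;
    }

    /* Only the owning uid may connect to this socket. */
    if (chown(path, qemu_uid, (gid_t)-1) < 0 || chmod(path, 0600) < 0) {
        perror("chown/chmod");
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = listen_for_guest("/run/vdisk-guest1.sock", 1000);
    if (fd < 0) {
        return 1;
    }
    /* accept(2) loop handling one guest's requests would follow here */
    close(fd);
    return 0;
}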
On Fri, 11/25 08:27, Ketan Nilangekar wrote: > Ketan> We have made a choice to go with QEMU driver approach after serious > evaluation of most if not all standard IO tapping mechanisms including NFS, > NBD and FUSE. None of these has been able to deliver the performance that we > have set ourselves to achieve. Hence the effort to propose this new IO tap > which we believe will provide an alternate to the existing mechanisms and > hopefully benefit the community. Out of curiosity: have you also evaluated the kernel TCMU interface [1] that can do native command exchange very efficiently? It is a relatively new IO tapping mechanism but provides better isolation between QEMU and the backend process under the supervision of the kernel. With its "loopback" frontend, theoretically the SG lists can be passed back and forth for local clients. For remote clients, iSCSI can be used as the protocol. Fam [1]: https://www.kernel.org/doc/Documentation/target/tcmu-design.txt
On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote: > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > > >On 23/11/2016 23:09, ashish mittal wrote: > > > >> On the topic of protocol security - > > > >> > > > >> Would it be enough for the first patch to implement only > > > >> authentication and not encryption? > > > > > > > >Yes, of course. However, as we introduce more and more QEMU-specific > > > >characteristics to a protocol that is already QEMU-specific (it doesn't > > > >do failover, etc.), I am still not sure of the actual benefit of using > > > >libqnio versus having an NBD server or FUSE driver. > > > > > > > >You have already mentioned performance, but the design has changed so > > > >much that I think one of the two things has to change: either failover > > > >moves back to QEMU and there is no (closed source) translator running on > > > >the node, or the translator needs to speak a well-known and > > > >already-supported protocol. > > > > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > > > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. > > > > By "cross memory attach" do you mean > > process_vm_readv(2)/process_vm_writev(2)? > > > > Ketan> Yes. > > > > That puts us back to square one in terms of security. You have > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of > > another process on the same machine. That process is therefore also > > untrusted and may only process data for one guest so that guests stay > > isolated from each other. > > > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. > > This is incorrect. > > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. > It means process A reads/writes directly from/to process B memory. Both > processes must have the same uid/gid. There is no trust boundary > between them. > > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. > > Network communication does not require both processes to have the same > uid/gid. If you want multiple QEMU processes talking to a single server > there must be a trust boundary between client and server. 
The server > can validate the input from the client and reject undesired operations. > > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable? > > Hope this makes sense now. > > Two architectures that implement the QEMU trust model correctly are: > > 1. Cross memory attach: each QEMU process has a dedicated vxhs server > process to prevent guests from attacking each other. This is where I > said you might as well put the code inside QEMU since there is no > isolation anyway. From what you've said it sounds like the vxhs > server needs a host-wide view and is responsible for all guests > running on the host, so I guess we have to rule out this > architecture. > > 2. Network communication: one vxhs server process and multiple guests. > Here you might as well use NBD or iSCSI because it already exists and > the vxhs driver doesn't add any unique functionality over existing > protocols. > > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support. NBD over TCP supports TLS with X.509 certificate authentication. I think Daniel Berrange mentioned that. Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD did not have any auth as Daniel Berrange mentioned. NBD over AF_UNIX does not need authentication because it relies on file permissions for access control. Each guest should have its own UNIX domain socket that it connects to. That socket can only see exports that have been assigned to the guest. > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion. Please discuss it now so everyone gets on the same page. I think there is a big gap and we need to communicate so that progress can be made. Ketan> The approach was to use cross mem attach for IO path and a simplified network IO lib for resiliency/failover. Did not want to derail the current discussion hence the suggestion to take it up later. > > There's an easier way to get even better performance: get rid of libqnio > > and the external process. Move the code from the external process into > > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and > > context switching. > > > > Can you remind me why there needs to be an external process? > > > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. > > This sounds similar to what QEMU and Linux (file systems, LVM, RAID, > etc) already do. It brings to mind a third architecture: > > 3. A Linux driver or file system. Then QEMU opens a raw block device. > This is what the Ceph rbd block driver in Linux does. This > architecture has a kernel-userspace boundary so vxhs does not have to > trust QEMU. > > I suggest Architecture #2. You'll be able to deploy on existing systems > because QEMU already supports NBD or iSCSI. Use the time you gain from > switching to this architecture on benchmarking and optimizing NBD or > iSCSI so performance is closer to your goal. > > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. 
None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community. I thought the VxHS block driver was another network block driver like GlusterFS or Sheepdog but you are actually proposing a new local I/O tap with the goal of better performance. Ketan> The VxHS block driver is a new local IO tap with the goal of better performance specifically when used with the VxHS server. This coupled with shared mem IPC (like cross mem attach) could be a much better IO tap option for qemu users. This will also avoid context switch between qemu/network stack to service which happens today in NBD. Please share fio(1) or other standard benchmark configuration files and performance results. Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum. NBD and libqnio wire protocols have comparable performance characteristics. There is no magic that should give either one a fundamental edge over the other. Am I missing something? Ketan> I have not seen the NBD code but few things which we considered and are part of libqnio (though not exclusively) are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read etc. Again these are not exclusive to libqnio but could give one protocol the edge over the other. Also part of the “magic” is also in the VxHS storage backend which is able to ingest the IOs with lower latencies. The main performance difference is probably that libqnio opens 8 simultaneous connections but that's not unique to the wire protocol. What happens when you run 8 NBD simultaneous TCP connections? Ketan> Possibly. We have not benchmarked this. Stefan
On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote: > > > On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote: > > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: > > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: > > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: > > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: > > > > >On 23/11/2016 23:09, ashish mittal wrote: > > > > >> On the topic of protocol security - > > > > >> > > > > >> Would it be enough for the first patch to implement only > > > > >> authentication and not encryption? > > > > > > > > > >Yes, of course. However, as we introduce more and more QEMU-specific > > > > >characteristics to a protocol that is already QEMU-specific (it doesn't > > > > >do failover, etc.), I am still not sure of the actual benefit of using > > > > >libqnio versus having an NBD server or FUSE driver. > > > > > > > > > >You have already mentioned performance, but the design has changed so > > > > >much that I think one of the two things has to change: either failover > > > > >moves back to QEMU and there is no (closed source) translator running on > > > > >the node, or the translator needs to speak a well-known and > > > > >already-supported protocol. > > > > > > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. > > > > > > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. > > > > > > By "cross memory attach" do you mean > > > process_vm_readv(2)/process_vm_writev(2)? > > > > > > Ketan> Yes. > > > > > > That puts us back to square one in terms of security. You have > > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of > > > another process on the same machine. That process is therefore also > > > untrusted and may only process data for one guest so that guests stay > > > isolated from each other. > > > > > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. > > > > This is incorrect. > > > > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. > > It means process A reads/writes directly from/to process B memory. Both > > processes must have the same uid/gid. There is no trust boundary > > between them. > > > > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. 
> > > > Network communication does not require both processes to have the same > > uid/gid. If you want multiple QEMU processes talking to a single server > > there must be a trust boundary between client and server. The server > > can validate the input from the client and reject undesired operations. > > > > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable? > > > > Hope this makes sense now. > > > > Two architectures that implement the QEMU trust model correctly are: > > > > 1. Cross memory attach: each QEMU process has a dedicated vxhs server > > process to prevent guests from attacking each other. This is where I > > said you might as well put the code inside QEMU since there is no > > isolation anyway. From what you've said it sounds like the vxhs > > server needs a host-wide view and is responsible for all guests > > running on the host, so I guess we have to rule out this > > architecture. > > > > 2. Network communication: one vxhs server process and multiple guests. > > Here you might as well use NBD or iSCSI because it already exists and > > the vxhs driver doesn't add any unique functionality over existing > > protocols. > > > > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support. > > NBD over TCP supports TLS with X.509 certificate authentication. I > think Daniel Berrange mentioned that. > > Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD did not have any auth as Daniel Berrange mentioned. > > NBD over AF_UNIX does not need authentication because it relies on file > permissions for access control. Each guest should have its own UNIX > domain socket that it connects to. That socket can only see exports > that have been assigned to the guest. > > > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion. > > Please discuss it now so everyone gets on the same page. I think there > is a big gap and we need to communicate so that progress can be made. > > Ketan> The approach was to use cross mem attach for IO path and a simplified network IO lib for resiliency/failover. Did not want to derail the current discussion hence the suggestion to take it up later. Why does the client have to know about failover if it's connected to a server process on the same host? I thought the server process manages networking issues (like the actual protocol to speak to other VxHS nodes and for failover). > > > There's an easier way to get even better performance: get rid of libqnio > > > and the external process. Move the code from the external process into > > > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and > > > context switching. > > > > > > Can you remind me why there needs to be an external process? > > > > > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. > > > > This sounds similar to what QEMU and Linux (file systems, LVM, RAID, > > etc) already do. It brings to mind a third architecture: > > > > 3. A Linux driver or file system. Then QEMU opens a raw block device. 
> > This is what the Ceph rbd block driver in Linux does. This > > architecture has a kernel-userspace boundary so vxhs does not have to > > trust QEMU. > > > > I suggest Architecture #2. You'll be able to deploy on existing systems > > because QEMU already supports NBD or iSCSI. Use the time you gain from > > switching to this architecture on benchmarking and optimizing NBD or > > iSCSI so performance is closer to your goal. > > > > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community. > > I thought the VxHS block driver was another network block driver like > GlusterFS or Sheepdog but you are actually proposing a new local I/O tap > with the goal of better performance. > > Ketan> The VxHS block driver is a new local IO tap with the goal of better performance specifically when used with the VxHS server. This coupled with shared mem IPC (like cross mem attach) could be a much better IO tap option for qemu users. This will also avoid context switch between qemu/network stack to service which happens today in NBD. > > > Please share fio(1) or other standard benchmark configuration files and > performance results. > > Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum. > > NBD and libqnio wire protocols have comparable performance > characteristics. There is no magic that should give either one a > fundamental edge over the other. Am I missing something? > > Ketan> I have not seen the NBD code but few things which we considered and are part of libqnio (though not exclusively) are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read etc. Again these are not exclusive to libqnio but could give one protocol the edge over the other. Also part of the “magic” is also in the VxHS storage backend which is able to ingest the IOs with lower latencies. > > The main performance difference is probably that libqnio opens 8 > simultaneous connections but that's not unique to the wire protocol. > What happens when you run 8 NBD simultaneous TCP connections? > > Ketan> Possibly. We have not benchmarked this. There must be benchmark data if you want to add a new feature or modify existing code for performance reasons. This rule is followed in QEMU so that performance changes are justified. I'm afraid that when you look into the performance you'll find that any performance difference between NBD and this VxHS patch series is due to implementation differences that can be ported across to QEMU NBD, rather than wire protocol differences. If that's the case then it would save a lot of time to use NBD over AF_UNIX for now. You could focus efforts on achieving the final architecture you've explained with cross memory attach. Please take a look at vhost-user-scsi, which folks from Nutanix are currently working on. See "[PATCH v2 0/3] Introduce vhost-user-scsi and sample application" on qemu-devel. It is a true zero-copy local I/O tap because it shares guest RAM. This is more efficient than cross memory attach's single memory copy. It does not require running the server as root. 
This is the #1 thing you should evaluate for your final architecture. vhost-user-scsi works on the virtio-scsi emulation level. That means the server must implement the virtio-scsi vring and device emulation. It is not a block driver. By hooking in at this level you can achieve the best performance but you lose all QEMU block layer functionality and need to implement your own SCSI target. You also need to consider live migration. Stefan
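As an illustration of the kind of standard benchmark configuration being asked for above, a minimal fio(1) job file could look like the sketch below. The device path, block size, queue depth and the 8-way parallelism are assumptions chosen only to roughly mirror libqnio's 8 connections; they are not taken from any VxHS or NBD measurement.

    [global]
    ioengine=libaio
    direct=1
    bs=4k
    iodepth=32
    # 8 jobs to roughly mirror libqnio's 8 parallel connections
    numjobs=8
    runtime=60
    time_based
    group_reporting

    [randread-compare]
    rw=randread
    # guest block device backed by VxHS or by NBD (assumed path)
    filename=/dev/vdb

Running the same job once against a VxHS-backed device and once against an NBD-backed device (over TCP and over AF_UNIX) would give directly comparable numbers for the discussion above.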
+ Rakesh from Veritas On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote: >> >> >> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >> >> On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote: >> > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >> > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote: >> > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >> > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote: >> > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonzini@redhat.com> wrote: >> > > > >On 23/11/2016 23:09, ashish mittal wrote: >> > > > >> On the topic of protocol security - >> > > > >> >> > > > >> Would it be enough for the first patch to implement only >> > > > >> authentication and not encryption? >> > > > > >> > > > >Yes, of course. However, as we introduce more and more QEMU-specific >> > > > >characteristics to a protocol that is already QEMU-specific (it doesn't >> > > > >do failover, etc.), I am still not sure of the actual benefit of using >> > > > >libqnio versus having an NBD server or FUSE driver. >> > > > > >> > > > >You have already mentioned performance, but the design has changed so >> > > > >much that I think one of the two things has to change: either failover >> > > > >moves back to QEMU and there is no (closed source) translator running on >> > > > >the node, or the translator needs to speak a well-known and >> > > > >already-supported protocol. >> > > > >> > > > IMO design has not changed. Implementation has changed significantly. I would propose that we keep resiliency/failover code out of QEMU driver and implement it entirely in libqnio as planned in a subsequent revision. The VxHS server does not need to understand/handle failover at all. >> > > > >> > > > Today libqnio gives us significantly better performance than any NBD/FUSE implementation. We know because we have prototyped with both. Significant improvements to libqnio are also in the pipeline which will use cross memory attach calls to further boost performance. Ofcourse a big reason for the performance is also the HyperScale storage backend but we believe this method of IO tapping/redirecting can be leveraged by other solutions as well. >> > > >> > > By "cross memory attach" do you mean >> > > process_vm_readv(2)/process_vm_writev(2)? >> > > >> > > Ketan> Yes. >> > > >> > > That puts us back to square one in terms of security. You have >> > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of >> > > another process on the same machine. That process is therefore also >> > > untrusted and may only process data for one guest so that guests stay >> > > isolated from each other. >> > > >> > > Ketan> Understood but this will be no worse than the current network based communication between qnio and vxhs server. And although we have questions around QEMU trust/vulnerability issues, we are looking to implement basic authentication scheme between libqnio and vxhs server. >> > >> > This is incorrect. >> > >> > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access. >> > It means process A reads/writes directly from/to process B memory. Both >> > processes must have the same uid/gid. There is no trust boundary >> > between them. >> > >> > Ketan> Not if vxhs server is running as root and initiating the cross mem attach. 
Which is also why we are proposing a basic authentication mechanism between qemu-vxhs. But anyway the cross memory attach is for a near future implementation. >> > >> > Network communication does not require both processes to have the same >> > uid/gid. If you want multiple QEMU processes talking to a single server >> > there must be a trust boundary between client and server. The server >> > can validate the input from the client and reject undesired operations. >> > >> > Ketan> This is what we are trying to propose. With the addition of authentication between qemu-vxhs server, we should be able to achieve this. Question is, would that be acceptable? >> > >> > Hope this makes sense now. >> > >> > Two architectures that implement the QEMU trust model correctly are: >> > >> > 1. Cross memory attach: each QEMU process has a dedicated vxhs server >> > process to prevent guests from attacking each other. This is where I >> > said you might as well put the code inside QEMU since there is no >> > isolation anyway. From what you've said it sounds like the vxhs >> > server needs a host-wide view and is responsible for all guests >> > running on the host, so I guess we have to rule out this >> > architecture. >> > >> > 2. Network communication: one vxhs server process and multiple guests. >> > Here you might as well use NBD or iSCSI because it already exists and >> > the vxhs driver doesn't add any unique functionality over existing >> > protocols. >> > >> > Ketan> NBD does not give us the performance we are trying to achieve. Besides NBD does not have any authentication support. >> >> NBD over TCP supports TLS with X.509 certificate authentication. I >> think Daniel Berrange mentioned that. >> >> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD did not have any auth as Daniel Berrange mentioned. >> >> NBD over AF_UNIX does not need authentication because it relies on file >> permissions for access control. Each guest should have its own UNIX >> domain socket that it connects to. That socket can only see exports >> that have been assigned to the guest. >> >> > There is a hybrid 2.a approach which uses both 1 & 2 but I’d keep that for a later discussion. >> >> Please discuss it now so everyone gets on the same page. I think there >> is a big gap and we need to communicate so that progress can be made. >> >> Ketan> The approach was to use cross mem attach for IO path and a simplified network IO lib for resiliency/failover. Did not want to derail the current discussion hence the suggestion to take it up later. > > Why does the client have to know about failover if it's connected to a > server process on the same host? I thought the server process manages > networking issues (like the actual protocol to speak to other VxHS nodes > and for failover). > >> > > There's an easier way to get even better performance: get rid of libqnio >> > > and the external process. Move the code from the external process into >> > > QEMU to eliminate the process_vm_readv(2)/process_vm_writev(2) and >> > > context switching. >> > > >> > > Can you remind me why there needs to be an external process? >> > > >> > > Ketan> Apart from virtualizing the available direct attached storage on the compute, vxhs storage backend (the external process) provides features such as storage QoS, resiliency, efficient use of direct attached storage, automatic storage recovery points (snapshots) etc. Implementing this in QEMU is not practical and not the purpose of proposing this driver. 
>> > >> > This sounds similar to what QEMU and Linux (file systems, LVM, RAID, >> > etc) already do. It brings to mind a third architecture: >> > >> > 3. A Linux driver or file system. Then QEMU opens a raw block device. >> > This is what the Ceph rbd block driver in Linux does. This >> > architecture has a kernel-userspace boundary so vxhs does not have to >> > trust QEMU. >> > >> > I suggest Architecture #2. You'll be able to deploy on existing systems >> > because QEMU already supports NBD or iSCSI. Use the time you gain from >> > switching to this architecture on benchmarking and optimizing NBD or >> > iSCSI so performance is closer to your goal. >> > >> > Ketan> We have made a choice to go with QEMU driver approach after serious evaluation of most if not all standard IO tapping mechanisms including NFS, NBD and FUSE. None of these has been able to deliver the performance that we have set ourselves to achieve. Hence the effort to propose this new IO tap which we believe will provide an alternate to the existing mechanisms and hopefully benefit the community. >> >> I thought the VxHS block driver was another network block driver like >> GlusterFS or Sheepdog but you are actually proposing a new local I/O tap >> with the goal of better performance. >> >> Ketan> The VxHS block driver is a new local IO tap with the goal of better performance specifically when used with the VxHS server. This coupled with shared mem IPC (like cross mem attach) could be a much better IO tap option for qemu users. This will also avoid context switch between qemu/network stack to service which happens today in NBD. >> >> >> Please share fio(1) or other standard benchmark configuration files and >> performance results. >> >> Ketan> We have fio results with the VxHS storage backend which I am not sure I can share in a public forum. >> >> NBD and libqnio wire protocols have comparable performance >> characteristics. There is no magic that should give either one a >> fundamental edge over the other. Am I missing something? >> >> Ketan> I have not seen the NBD code but few things which we considered and are part of libqnio (though not exclusively) are low protocol overhead, threading model, queueing, latencies, memory pools, zero data copies in user-land, scatter-gather write/read etc. Again these are not exclusive to libqnio but could give one protocol the edge over the other. Also part of the “magic” is also in the VxHS storage backend which is able to ingest the IOs with lower latencies. >> >> The main performance difference is probably that libqnio opens 8 >> simultaneous connections but that's not unique to the wire protocol. >> What happens when you run 8 NBD simultaneous TCP connections? >> >> Ketan> Possibly. We have not benchmarked this. > > There must be benchmark data if you want to add a new feature or modify > existing code for performance reasons. This rule is followed in QEMU so > that performance changes are justified. > > I'm afraid that when you look into the performance you'll find that any > performance difference between NBD and this VxHS patch series is due to > implementation differences that can be ported across to QEMU NBD, rather > than wire protocol differences. > > If that's the case then it would save a lot of time to use NBD over > AF_UNIX for now. You could focus efforts on achieving the final > architecture you've explained with cross memory attach. > > Please take a look at vhost-user-scsi, which folks from Nutanix are > currently working on. 
See "[PATCH v2 0/3] Introduce vhost-user-scsi and > sample application" on qemu-devel. It is a true zero-copy local I/O tap > because it shares guest RAM. This is more efficient than cross memory > attach's single memory copy. It does not require running the server as > root. This is the #1 thing you should evaluate for your final > architecture. > > vhost-user-scsi works on the virtio-scsi emulation level. That means > the server must implement the virtio-scsi vring and device emulation. > It is not a block driver. By hooking in at this level you can achieve > the best performance but you lose all QEMU block layer functionality and > need to implement your own SCSI target. You also need to consider live > migration. > > Stefan
Hello Stefan,

>>>>> Why does the client have to know about failover if it's connected to
>>>>>a server process on the same host? I thought the server process
>>>>>manages networking issues (like the actual protocol to speak to other
>>>>>VxHS nodes and for failover).

Just to comment on this, the model being followed within HyperScale is to allow application I/O continuity (resiliency) in various cases as mentioned below. It really adds value for consumer/customer and tries to avoid culprits for single points of failure.

1. HyperScale storage service failure (QNIO Server)
- Daemon managing local storage for VMs and runs on each compute node
- Daemon can run as a service on Hypervisor itself as well as within VSA (Virtual Storage Appliance or Virtual Machine running on the hypervisor), which depends on ecosystem where HyperScale is supported
- Daemon or storage service down/crash/crash-in-loop shouldn't lead to an huge impact on all the VMs running on that hypervisor or compute node hence providing service level resiliency is very useful for application I/O continuity in such case.

Solution:
- The service failure handling can be only done at the client side and not at the server side since service running as a server itself is down.
- Client detects an I/O error and depending on the logic, it does application I/O failover to another available/active QNIO server or HyperScale Storage service running on different compute node (reflection/replication node)
- Once the orig/old server comes back online, client gets/receives negotiated error (not a real application error) to do the application I/O failback to the original server or local HyperScale storage service to get better I/O performance.

2. Local physical storage or media failure
- Once server or HyperScale storage service detects the media or local disk failure, depending on the vDisk (guest disk) configuration, if another storage copy is available on different compute node then it internally handles the local fault and serves the application read and write requests otherwise application or client gets the fault.
- Client doesn't know about any I/O failure since Server or Storage service manages/handles the fault tolerance.
- In such case, in order to get some I/O performance benefit, once client gets a negotiated error (not an application error) from local server or storage service, client can initiate I/O failover and can directly send application I/O to another compute node where storage copy is available to serve the application need instead of sending it locally where media is faulted.
-Rakesh On 11/29/16, 4:45 PM, "ashish mittal" <ashmit602@gmail.com> wrote: >+ Rakesh from Veritas > >On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefanha@gmail.com> >wrote: >> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote: >>> >>> >>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefanha@gmail.com> wrote: >>> >>> On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote: >>> > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefanha@gmail.com> >>>wrote: >>> > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar >>>wrote: >>> > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" >>><stefanha@gmail.com> wrote: >>> > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan >>>Nilangekar wrote: >>> > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" >>><pbonzini@redhat.com> wrote: >>> > > > >On 23/11/2016 23:09, ashish mittal wrote: >>> > > > >> On the topic of protocol security - >>> > > > >> >>> > > > >> Would it be enough for the first patch to >>>implement only >>> > > > >> authentication and not encryption? >>> > > > > >>> > > > >Yes, of course. However, as we introduce more and >>>more QEMU-specific >>> > > > >characteristics to a protocol that is already >>>QEMU-specific (it doesn't >>> > > > >do failover, etc.), I am still not sure of the >>>actual benefit of using >>> > > > >libqnio versus having an NBD server or FUSE driver. >>> > > > > >>> > > > >You have already mentioned performance, but the >>>design has changed so >>> > > > >much that I think one of the two things has to >>>change: either failover >>> > > > >moves back to QEMU and there is no (closed source) >>>translator running on >>> > > > >the node, or the translator needs to speak a >>>well-known and >>> > > > >already-supported protocol. >>> > > > >>> > > > IMO design has not changed. Implementation has >>>changed significantly. I would propose that we keep resiliency/failover >>>code out of QEMU driver and implement it entirely in libqnio as planned >>>in a subsequent revision. The VxHS server does not need to >>>understand/handle failover at all. >>> > > > >>> > > > Today libqnio gives us significantly better >>>performance than any NBD/FUSE implementation. We know because we have >>>prototyped with both. Significant improvements to libqnio are also in >>>the pipeline which will use cross memory attach calls to further boost >>>performance. Ofcourse a big reason for the performance is also the >>>HyperScale storage backend but we believe this method of IO >>>tapping/redirecting can be leveraged by other solutions as well. >>> > > >>> > > By "cross memory attach" do you mean >>> > > process_vm_readv(2)/process_vm_writev(2)? >>> > > >>> > > Ketan> Yes. >>> > > >>> > > That puts us back to square one in terms of security. >>>You have >>> > > (untrusted) QEMU + (untrusted) libqnio directly >>>accessing the memory of >>> > > another process on the same machine. That process is >>>therefore also >>> > > untrusted and may only process data for one guest so >>>that guests stay >>> > > isolated from each other. >>> > > >>> > > Ketan> Understood but this will be no worse than the >>>current network based communication between qnio and vxhs server. And >>>although we have questions around QEMU trust/vulnerability issues, we >>>are looking to implement basic authentication scheme between libqnio >>>and vxhs server. >>> > >>> > This is incorrect. >>> > >>> > Cross memory attach is equivalent to ptrace(2) (i.e. >>>debugger) access. >>> > It means process A reads/writes directly from/to process B >>>memory. 
Both >>> > processes must have the same uid/gid. There is no trust >>>boundary >>> > between them. >>> > >>> > Ketan> Not if vxhs server is running as root and initiating the >>>cross mem attach. Which is also why we are proposing a basic >>>authentication mechanism between qemu-vxhs. But anyway the cross memory >>>attach is for a near future implementation. >>> > >>> > Network communication does not require both processes to >>>have the same >>> > uid/gid. If you want multiple QEMU processes talking to a >>>single server >>> > there must be a trust boundary between client and server. >>>The server >>> > can validate the input from the client and reject undesired >>>operations. >>> > >>> > Ketan> This is what we are trying to propose. With the addition >>>of authentication between qemu-vxhs server, we should be able to >>>achieve this. Question is, would that be acceptable? >>> > >>> > Hope this makes sense now. >>> > >>> > Two architectures that implement the QEMU trust model >>>correctly are: >>> > >>> > 1. Cross memory attach: each QEMU process has a dedicated >>>vxhs server >>> > process to prevent guests from attacking each other. >>>This is where I >>> > said you might as well put the code inside QEMU since >>>there is no >>> > isolation anyway. From what you've said it sounds like >>>the vxhs >>> > server needs a host-wide view and is responsible for all >>>guests >>> > running on the host, so I guess we have to rule out this >>> > architecture. >>> > >>> > 2. Network communication: one vxhs server process and >>>multiple guests. >>> > Here you might as well use NBD or iSCSI because it >>>already exists and >>> > the vxhs driver doesn't add any unique functionality over >>>existing >>> > protocols. >>> > >>> > Ketan> NBD does not give us the performance we are trying to >>>achieve. Besides NBD does not have any authentication support. >>> >>> NBD over TCP supports TLS with X.509 certificate authentication. I >>> think Daniel Berrange mentioned that. >>> >>> Ketan> I saw the patch to nbd that was merged in 2015. Before that NBD >>>did not have any auth as Daniel Berrange mentioned. >>> >>> NBD over AF_UNIX does not need authentication because it relies on >>>file >>> permissions for access control. Each guest should have its own >>>UNIX >>> domain socket that it connects to. That socket can only see >>>exports >>> that have been assigned to the guest. >>> >>> > There is a hybrid 2.a approach which uses both 1 & 2 but I¹d >>>keep that for a later discussion. >>> >>> Please discuss it now so everyone gets on the same page. I think >>>there >>> is a big gap and we need to communicate so that progress can be >>>made. >>> >>> Ketan> The approach was to use cross mem attach for IO path and a >>>simplified network IO lib for resiliency/failover. Did not want to >>>derail the current discussion hence the suggestion to take it up later. >> >> Why does the client have to know about failover if it's connected to a >> server process on the same host? I thought the server process manages >> networking issues (like the actual protocol to speak to other VxHS nodes >> and for failover). >> >>> > > There's an easier way to get even better performance: >>>get rid of libqnio >>> > > and the external process. Move the code from the >>>external process into >>> > > QEMU to eliminate the >>>process_vm_readv(2)/process_vm_writev(2) and >>> > > context switching. >>> > > >>> > > Can you remind me why there needs to be an external >>>process? 
>>> > > >>> > > Ketan> Apart from virtualizing the available direct >>>attached storage on the compute, vxhs storage backend (the external >>>process) provides features such as storage QoS, resiliency, efficient >>>use of direct attached storage, automatic storage recovery points >>>(snapshots) etc. Implementing this in QEMU is not practical and not the >>>purpose of proposing this driver. >>> > >>> > This sounds similar to what QEMU and Linux (file systems, >>>LVM, RAID, >>> > etc) already do. It brings to mind a third architecture: >>> > >>> > 3. A Linux driver or file system. Then QEMU opens a raw >>>block device. >>> > This is what the Ceph rbd block driver in Linux does. >>>This >>> > architecture has a kernel-userspace boundary so vxhs does >>>not have to >>> > trust QEMU. >>> > >>> > I suggest Architecture #2. You'll be able to deploy on >>>existing systems >>> > because QEMU already supports NBD or iSCSI. Use the time >>>you gain from >>> > switching to this architecture on benchmarking and >>>optimizing NBD or >>> > iSCSI so performance is closer to your goal. >>> > >>> > Ketan> We have made a choice to go with QEMU driver approach >>>after serious evaluation of most if not all standard IO tapping >>>mechanisms including NFS, NBD and FUSE. None of these has been able to >>>deliver the performance that we have set ourselves to achieve. Hence >>>the effort to propose this new IO tap which we believe will provide an >>>alternate to the existing mechanisms and hopefully benefit the >>>community. >>> >>> I thought the VxHS block driver was another network block driver >>>like >>> GlusterFS or Sheepdog but you are actually proposing a new local >>>I/O tap >>> with the goal of better performance. >>> >>> Ketan> The VxHS block driver is a new local IO tap with the goal of >>>better performance specifically when used with the VxHS server. This >>>coupled with shared mem IPC (like cross mem attach) could be a much >>>better IO tap option for qemu users. This will also avoid context >>>switch between qemu/network stack to service which happens today in NBD. >>> >>> >>> Please share fio(1) or other standard benchmark configuration >>>files and >>> performance results. >>> >>> Ketan> We have fio results with the VxHS storage backend which I am >>>not sure I can share in a public forum. >>> >>> NBD and libqnio wire protocols have comparable performance >>> characteristics. There is no magic that should give either one a >>> fundamental edge over the other. Am I missing something? >>> >>> Ketan> I have not seen the NBD code but few things which we considered >>>and are part of libqnio (though not exclusively) are low protocol >>>overhead, threading model, queueing, latencies, memory pools, zero data >>>copies in user-land, scatter-gather write/read etc. Again these are not >>>exclusive to libqnio but could give one protocol the edge over the >>>other. Also part of the ³magic² is also in the VxHS storage backend >>>which is able to ingest the IOs with lower latencies. >>> >>> The main performance difference is probably that libqnio opens 8 >>> simultaneous connections but that's not unique to the wire >>>protocol. >>> What happens when you run 8 NBD simultaneous TCP connections? >>> >>> Ketan> Possibly. We have not benchmarked this. >> >> There must be benchmark data if you want to add a new feature or modify >> existing code for performance reasons. This rule is followed in QEMU so >> that performance changes are justified. 
>> >> I'm afraid that when you look into the performance you'll find that any >> performance difference between NBD and this VxHS patch series is due to >> implementation differences that can be ported across to QEMU NBD, rather >> than wire protocol differences. >> >> If that's the case then it would save a lot of time to use NBD over >> AF_UNIX for now. You could focus efforts on achieving the final >> architecture you've explained with cross memory attach. >> >> Please take a look at vhost-user-scsi, which folks from Nutanix are >> currently working on. See "[PATCH v2 0/3] Introduce vhost-user-scsi and >> sample application" on qemu-devel. It is a true zero-copy local I/O tap >> because it shares guest RAM. This is more efficient than cross memory >> attach's single memory copy. It does not require running the server as >> root. This is the #1 thing you should evaluate for your final >> architecture. >> >> vhost-user-scsi works on the virtio-scsi emulation level. That means >> the server must implement the virtio-scsi vring and device emulation. >> It is not a block driver. By hooking in at this level you can achieve >> the best performance but you lose all QEMU block layer functionality and >> need to implement your own SCSI target. You also need to consider live >> migration. >> >> Stefan
On Wed, Nov 30, 2016 at 04:20:03AM +0000, Rakesh Ranjan wrote:
> >>>>> Why does the client have to know about failover if it's connected to
> >>>>>a server process on the same host? I thought the server process
> >>>>>manages networking issues (like the actual protocol to speak to other
> >>>>>VxHS nodes and for failover).
>
> Just to comment on this, the model being followed within HyperScale is to
> allow application I/O continuity (resiliency) in various cases as
> mentioned below. It really adds value for consumer/customer and tries to
> avoid culprits for single points of failure.
>
> 1. HyperScale storage service failure (QNIO Server)
> - Daemon managing local storage for VMs and runs on each compute node
> - Daemon can run as a service on Hypervisor itself as well as within VSA
> (Virtual Storage Appliance or Virtual Machine running on the hypervisor),
> which depends on ecosystem where HyperScale is supported
> - Daemon or storage service down/crash/crash-in-loop shouldn't lead to an
> huge impact on all the VMs running on that hypervisor or compute node
> hence providing service level resiliency is very useful for
> application I/O continuity in such case.
>
> Solution:
> - The service failure handling can be only done at the client side and
> not at the server side since service running as a server itself is down.
> - Client detects an I/O error and depending on the logic, it does
> application I/O failover to another available/active QNIO server or
> HyperScale Storage service running on different compute node
> (reflection/replication node)
> - Once the orig/old server comes back online, client gets/receives
> negotiated error (not a real application error) to do the application I/O
> failback to the original server or local HyperScale storage service to get
> better I/O performance.
>
> 2. Local physical storage or media failure
> - Once server or HyperScale storage service detects the media or local
> disk failure, depending on the vDisk (guest disk) configuration, if
> another storage copy is available
> on different compute node then it internally handles the local
> fault and serves the application read and write requests otherwise
> application or client gets the fault.
> - Client doesn't know about any I/O failure since Server or Storage
> service manages/handles the fault tolerance.
> - In such case, in order to get some I/O performance benefit, once
> client gets a negotiated error (not an application error) from local
> server or storage service,
> client can initiate I/O failover and can directly send
> application I/O to another compute node where storage copy is available to
> serve the application need instead of sending it locally where media is
> faulted.

Thanks for explaining the model. The new information for me here is that the qnio server may run in a VM instead of on the host and that the client will attempt to use a remote qnio server if the local qnio server fails. This means that although the discussion most recently focussed on local I/O tap performance, there is a requirement for a network protocol too. The local I/O tap stuff is just an optimization for when the local qnio server can be used.

Stefan
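To make the client-side half of that model concrete, here is a minimal sketch of the "prefer the local server, walk the remaining hosts on failure" loop such a client would need. It is illustrative only: the struct, the addresses and try_open_host() are stand-ins invented for the sketch, not the libqnio API and not code from this patch series.

    /*
     * Illustrative sketch only; not libqnio and not part of the VxHS patch.
     * try_open_host() stands in for whatever "open a channel and open the
     * vdisk on this host" call the real client library provides.
     */
    #include <errno.h>
    #include <stdio.h>

    #define MAX_HOSTS 4

    struct host {
        const char *addr;   /* "ip:port" of a qnio server (made-up values) */
        int fd;             /* open channel, or -1 */
    };

    /* Stand-in: always fails here so the fallback path is exercised. */
    static int try_open_host(struct host *h)
    {
        h->fd = -1;
        return -ECONNREFUSED;
    }

    /* Index 0 is the local server; remote servers follow. */
    static int failover_open(struct host *hosts, int nhosts, int *cur_idx)
    {
        int i;

        for (i = 0; i < nhosts; i++) {
            if (try_open_host(&hosts[i]) == 0) {
                *cur_idx = i;       /* remember where I/O is being shipped */
                return 0;
            }
        }
        return -ENODEV;             /* no server is ready to accept I/O */
    }

    int main(void)
    {
        struct host hosts[MAX_HOSTS] = {
            { "127.0.0.1:9999", -1 },   /* local qnio server, tried first */
            { "10.0.0.2:9999", -1 },    /* remote fallback */
        };
        int cur = -1;

        if (failover_open(hosts, 2, &cur) == 0) {
            printf("shipping I/O to host %d (%s)\n", cur, hosts[cur].addr);
        } else {
            printf("no qnio server reachable, failing I/O\n");
        }
        return 0;
    }

The point of the sketch is the observation above: if this fallback logic lives in the client, the client needs a network protocol to the remote servers, and the local I/O tap is only an optimization for the common case.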
On Mon, Nov 28, 2016 at 02:17:56PM +0000, Stefan Hajnoczi wrote:
> Please take a look at vhost-user-scsi, which folks from Nutanix are
> currently working on. See "[PATCH v2 0/3] Introduce vhost-user-scsi and
> sample application" on qemu-devel. It is a true zero-copy local I/O tap
> because it shares guest RAM. This is more efficient than cross memory
> attach's single memory copy. It does not require running the server as
> root. This is the #1 thing you should evaluate for your final
> architecture.
>
> vhost-user-scsi works on the virtio-scsi emulation level. That means
> the server must implement the virtio-scsi vring and device emulation.
> It is not a block driver. By hooking in at this level you can achieve
> the best performance but you lose all QEMU block layer functionality and
> need to implement your own SCSI target. You also need to consider live
> migration.

To clarify why I think vhost-user-scsi is best suited to your requirements for performance:

With vhost-user-scsi the qnio server would be notified by kvm.ko via eventfd when the VM submits new I/O requests to the virtio-scsi HBA. The QEMU process is completely bypassed for I/O request submission and the qnio server processes the SCSI command instead. This avoids the context switch to QEMU and then to the qnio server. With cross memory attach QEMU first needs to process the I/O request and hand it to libqnio before the qnio server can be scheduled.

The vhost-user-scsi qnio server has shared memory access to guest RAM and is therefore able to do zero-copy I/O into guest buffers. Cross memory attach always incurs a memory copy.

Using this high-performance architecture requires significant changes though. vhost-user-scsi hooks into the stack at a different layer so a QEMU block driver is not used at all. QEMU also wouldn't use libqnio. Instead everything will live in your qnio server process (not part of QEMU).

You'd have to rethink the resiliency strategy because you currently rely on the QEMU block driver connecting to a different qnio server if the local qnio server fails. In the vhost-user-scsi world it's more like having a physical SCSI adapter - redundancy and multipathing are used to achieve resiliency. For example, virtio-scsi HBA #1 would connect to the local qnio server process. virtio-scsi HBA #2 would connect to another local process called the "proxy process" which forwards requests to a remote qnio server (using libqnio?). If HBA #1 fails then I/O is sent to HBA #2 instead. The path can reset back to HBA #1 once that becomes operational again.

If the qnio server is supposed to run in a VM instead of directly in the host environment then it's worth looking at the vhost-pci work that Wei Wang <wei.w.wang@intel.com> is working on. The email thread is called "[PATCH v2 0/4] *** vhost-user spec extension for vhost-pci ***". The idea here is to allow inter-VM virtio device emulation so that instead of terminating the virtio-scsi device in the qnio server process on the host, you can terminate it inside another VM with good performance characteristics.

Stefan
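A rough sketch of the notification path being described: a vhost-user backend receives the queue's "kick" eventfd from QEMU over the vhost-user control socket during setup and then services the queue without QEMU in the I/O path. Everything below is conceptual; process_vring() is a placeholder invented for the sketch, not a real vhost-user or virtio-scsi API, and this is not code from the vhost-user-scsi series under review.

    /*
     * Conceptual sketch only. kickfd is the per-queue eventfd that QEMU
     * hands to the backend via the vhost-user protocol; kvm.ko signals it
     * when the guest notifies the queue, so request submission bypasses
     * the QEMU process entirely.
     */
    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    static void process_vring(void)
    {
        /* Placeholder: pop available descriptors from the shared-memory
         * vring, parse the virtio-scsi requests, execute the SCSI commands
         * against the storage backend, then push completions and signal
         * the "call" eventfd so the guest receives its interrupt. */
    }

    void serve_queue(int kickfd)
    {
        struct pollfd pfd = { .fd = kickfd, .events = POLLIN };
        uint64_t n;

        for (;;) {
            if (poll(&pfd, 1, -1) <= 0) {
                continue;               /* interrupted or error; retry */
            }
            if (read(kickfd, &n, sizeof(n)) == sizeof(n)) {
                process_vring();        /* n kicks accumulated since last read */
            }
        }
    }

The same split explains the resiliency point above: once the virtio-scsi device terminates in the server process, failover looks like SCSI multipathing across two HBAs rather than a retry loop inside a QEMU block driver.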
diff --git a/block/Makefile.objs b/block/Makefile.objs index 7d4031d..1861bb9 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -18,6 +18,7 @@ block-obj-$(CONFIG_LIBNFS) += nfs.o block-obj-$(CONFIG_CURL) += curl.o block-obj-$(CONFIG_RBD) += rbd.o block-obj-$(CONFIG_GLUSTERFS) += gluster.o +block-obj-$(CONFIG_VXHS) += vxhs.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o dirty-bitmap.o @@ -38,6 +39,7 @@ rbd.o-cflags := $(RBD_CFLAGS) rbd.o-libs := $(RBD_LIBS) gluster.o-cflags := $(GLUSTERFS_CFLAGS) gluster.o-libs := $(GLUSTERFS_LIBS) +vxhs.o-libs := $(VXHS_LIBS) ssh.o-cflags := $(LIBSSH2_CFLAGS) ssh.o-libs := $(LIBSSH2_LIBS) archipelago.o-libs := $(ARCHIPELAGO_LIBS) diff --git a/block/trace-events b/block/trace-events index 05fa13c..44de452 100644 --- a/block/trace-events +++ b/block/trace-events @@ -114,3 +114,50 @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu" + +# block/vxhs.c +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c" +vxhs_iio_callback(int error, int reason) "ctx is NULL: error %d, reason %d" +vxhs_setup_qnio(void *s) "Context to HyperScale IO manager = %p" +vxhs_setup_qnio_nwerror(char c) "Could not initialize the network channel. Bailing out%c" +vxhs_iio_callback_iofail(int err, int reason, void *acb, int seg) "Read/Write failed: error %d, reason %d, acb %p, segment %d" +vxhs_iio_callback_retry(char *guid, void *acb) "vDisk %s, added acb %p to retry queue (5)" +vxhs_iio_callback_chnlfail(int error) "QNIO channel failed, no i/o (%d)" +vxhs_iio_callback_fail(int r, void *acb, int seg, uint64_t size, int err) " ALERT: reason = %d , acb = %p, acb->segments = %d, acb->size = %lu Error = %d" +vxhs_fail_aio(char * guid, void *acb) "vDisk %s, failing acb %p" +vxhs_iio_callback_ready(char *vd, int err) "async vxhs_iio_callback: IRP_VDISK_CHECK_IO_FAILOVER_READY completed for vdisk %s with error %d" +vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o %d, %d" +vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno %d" +vxhs_open_fail(int ret) "Could not open the device. Error = %d" +vxhs_open_epipe(char c) "Could not create a pipe for device. Bailing out%c" +vxhs_aio_rw(char *guid, int iodir, uint64_t size, uint64_t offset) "vDisk %s, vDisk device is in failed state iodir = %d size = %lu offset = %lu" +vxhs_aio_rw_retry(char *guid, void *acb, int queue) "vDisk %s, added acb %p to retry queue(%d)" +vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d" +vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void *acb, int seg, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %lu offset = %lu ACB = %p Segments = %d. 
Error = %d, errno = %d" +vxhs_co_flush(char *guid, int ret, int err) "vDisk (%s) Flush ioctl failed ret = %d errno = %d" +vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl failed, ret = %d, errno = %d" +vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat ioctl returned size %lu" +vxhs_switch_storage_agent(char *ip, char *guid) "Query host %s for vdisk %s" +vxhs_switch_storage_agent_failed(char *ip, char *guid, int res, int err) "Query to host %s for vdisk %s failed, res = %d, errno = %d" +vxhs_check_failover_status(char *ip, char *guid) "Switched to storage server host-IP %s for vdisk %s" +vxhs_check_failover_status_retry(char *guid) "failover_ioctl_cb: keep looking for io target for vdisk %s" +vxhs_failover_io(char *vdisk) "I/O Failover starting for vDisk %s" +vxhs_qnio_iio_open(const char *ip) "Failed to connect to storage agent on host-ip %s" +vxhs_qnio_iio_devopen(const char *fname) "Failed to open vdisk device: %s" +vxhs_handle_queued_ios(void *acb, int res) "Restarted acb %p res %d" +vxhs_restart_aio(int dir, int res, int err) "IO ERROR FOR: Read/Write = %d Error = %d, errno = %d" +vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %ld" +vxhs_aio_rw_iofail(char *guid) "vDisk %s, I/O operation failed." +vxhs_aio_rw_devfail(char *guid, int dir, uint64_t size, uint64_t off) "vDisk %s, vDisk device failed iodir = %d size = %lu offset = %lu" +vxhs_parse_uri_filename(const char *filename) "URI passed via bdrv_parse_filename %s" +vxhs_qemu_init_vdisk(const char *vdisk_id) "vdisk_id from json %s" +vxhs_qemu_init_numservers(int num_servers) "Number of servers passed = %d" +vxhs_parse_uri_hostinfo(int num, char *host, int port) "Host %d: IP %s, Port %d" +vxhs_qemu_init(char *of_vsa_addr, int port) "Adding host %s:%d to BDRVVXHSState" +vxhs_qemu_init_filename(const char *filename) "Filename passed as %s" +vxhs_close(char *vdisk_guid) "Closing vdisk %s" +vxhs_convert_iovector_to_buffer(size_t len) "Could not allocate buffer for size %zu bytes" +vxhs_qnio_iio_writev(int res) "iio_writev returned %d" +vxhs_qnio_iio_writev_err(int iter, uint64_t iov_len, int err) "Error for iteration : %d, iov_len = %lu errno = %d" +vxhs_qnio_iio_readv(void *ctx, int ret, int error) "Error while issuing read to QNIO. ctx %p Error = %d, errno = %d" +vxhs_qnio_iio_ioctl(uint32_t opcode) "Error while executing IOCTL. Opcode = %u" diff --git a/block/vxhs.c b/block/vxhs.c new file mode 100644 index 0000000..90a4343 --- /dev/null +++ b/block/vxhs.c @@ -0,0 +1,1645 @@ +/* + * QEMU Block driver for Veritas HyperScale (VxHS) + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#include "qemu/osdep.h" +#include "block/block_int.h" +#include <qnio/qnio_api.h> +#include "qapi/qmp/qerror.h" +#include "qapi/qmp/qdict.h" +#include "qapi/qmp/qstring.h" +#include "trace.h" +#include "qemu/uri.h" +#include "qapi/error.h" +#include "qemu/error-report.h" + +#define QNIO_CONNECT_RETRY_SECS 5 +#define QNIO_CONNECT_TIMOUT_SECS 120 + +/* + * IO specific flags + */ +#define IIO_FLAG_ASYNC 0x00000001 +#define IIO_FLAG_DONE 0x00000010 +#define IIO_FLAG_SYNC 0 + +#define VDISK_FD_READ 0 +#define VDISK_FD_WRITE 1 +#define VXHS_MAX_HOSTS 4 + +#define VXHS_OPT_FILENAME "filename" +#define VXHS_OPT_VDISK_ID "vdisk_id" +#define VXHS_OPT_SERVER "server." 
+#define VXHS_OPT_HOST "host" +#define VXHS_OPT_PORT "port" + +/* qnio client ioapi_ctx */ +static void *global_qnio_ctx; + +/* vdisk prefix to pass to qnio */ +static const char vdisk_prefix[] = "/dev/of/vdisk"; + +typedef enum { + VXHS_IO_INPROGRESS, + VXHS_IO_COMPLETED, + VXHS_IO_ERROR +} VXHSIOState; + +typedef enum { + VDISK_AIO_READ, + VDISK_AIO_WRITE, + VDISK_STAT, + VDISK_TRUNC, + VDISK_AIO_FLUSH, + VDISK_AIO_RECLAIM, + VDISK_GET_GEOMETRY, + VDISK_CHECK_IO_FAILOVER_READY, + VDISK_AIO_LAST_CMD +} VDISKAIOCmd; + +typedef void (*qnio_callback_t)(ssize_t retval, void *arg); + +/* + * BDRVVXHSState specific flags + */ +#define OF_VDISK_FLAGS_STATE_ACTIVE 0x0000000000000001 +#define OF_VDISK_FLAGS_STATE_FAILED 0x0000000000000002 +#define OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS 0x0000000000000004 + +#define OF_VDISK_ACTIVE(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_ACTIVE) +#define OF_VDISK_SET_ACTIVE(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_ACTIVE) +#define OF_VDISK_RESET_ACTIVE(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_ACTIVE) + +#define OF_VDISK_FAILED(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_FAILED) +#define OF_VDISK_SET_FAILED(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_FAILED) +#define OF_VDISK_RESET_FAILED(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_FAILED) + +#define OF_VDISK_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) +#define OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) +#define OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) + +/* + * VXHSAIOCB specific flags + */ +#define OF_ACB_QUEUED 0x00000001 + +#define OF_AIOCB_FLAGS_QUEUED(a) \ + ((a)->flags & OF_ACB_QUEUED) +#define OF_AIOCB_FLAGS_SET_QUEUED(a) \ + ((a)->flags |= OF_ACB_QUEUED) +#define OF_AIOCB_FLAGS_RESET_QUEUED(a) \ + ((a)->flags &= ~OF_ACB_QUEUED) + +typedef struct qemu2qnio_ctx { + uint32_t qnio_flag; + uint64_t qnio_size; + char *qnio_channel; + char *target; + qnio_callback_t qnio_cb; +} qemu2qnio_ctx_t; + +typedef qemu2qnio_ctx_t qnio2qemu_ctx_t; + +typedef struct LibQNIOSymbol { + const char *name; + gpointer *addr; +} LibQNIOSymbol; + +/* + * HyperScale AIO callbacks structure + */ +typedef struct VXHSAIOCB { + BlockAIOCB common; + size_t ret; + size_t size; + QEMUBH *bh; + int aio_done; + int segments; + int flags; + size_t io_offset; + QEMUIOVector *qiov; + void *buffer; + int direction; /* IO direction (r/w) */ + QSIMPLEQ_ENTRY(VXHSAIOCB) retry_entry; +} VXHSAIOCB; + +typedef struct VXHSvDiskHostsInfo { + int qnio_cfd; /* Channel FD */ + int vdisk_rfd; /* vDisk remote FD */ + char *hostip; /* Host's IP addresses */ + int port; /* Host's port number */ +} VXHSvDiskHostsInfo; + +/* + * Structure per vDisk maintained for state + */ +typedef struct BDRVVXHSState { + int fds[2]; + int64_t vdisk_size; + int64_t vdisk_blocks; + int64_t vdisk_flags; + int vdisk_aio_count; + int event_reader_pos; + VXHSAIOCB *qnio_event_acb; + void *qnio_ctx; + QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */ + QemuSpin vdisk_acb_lock; /* Protects ACB */ + VXHSvDiskHostsInfo vdisk_hostinfo[VXHS_MAX_HOSTS]; /* Per host info */ + int vdisk_nhosts; /* Total number of hosts */ + int vdisk_cur_host_idx; /* IOs are being shipped to */ + int vdisk_ask_failover_idx; /*asking permsn to ship io*/ + QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; + int vdisk_aio_retry_qd; /* Currently for debugging */ + char *vdisk_guid; +} 
BDRVVXHSState; + +static int vxhs_restart_aio(VXHSAIOCB *acb); +static void vxhs_check_failover_status(int res, void *ctx); + +static void vxhs_inc_acb_segment_count(void *ptr, int count) +{ + VXHSAIOCB *acb = ptr; + BDRVVXHSState *s = acb->common.bs->opaque; + + qemu_spin_lock(&s->vdisk_acb_lock); + acb->segments += count; + qemu_spin_unlock(&s->vdisk_acb_lock); +} + +static void vxhs_dec_acb_segment_count(void *ptr, int count) +{ + VXHSAIOCB *acb = ptr; + BDRVVXHSState *s = acb->common.bs->opaque; + + qemu_spin_lock(&s->vdisk_acb_lock); + acb->segments -= count; + qemu_spin_unlock(&s->vdisk_acb_lock); +} + +static void vxhs_set_acb_buffer(void *ptr, void *buffer) +{ + VXHSAIOCB *acb = ptr; + + acb->buffer = buffer; +} + +static void vxhs_inc_vdisk_iocount(void *ptr, uint32_t count) +{ + BDRVVXHSState *s = ptr; + + qemu_spin_lock(&s->vdisk_lock); + s->vdisk_aio_count += count; + qemu_spin_unlock(&s->vdisk_lock); +} + +static void vxhs_dec_vdisk_iocount(void *ptr, uint32_t count) +{ + BDRVVXHSState *s = ptr; + + qemu_spin_lock(&s->vdisk_lock); + s->vdisk_aio_count -= count; + qemu_spin_unlock(&s->vdisk_lock); +} + +static int32_t +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, + void *ctx, uint32_t flags) +{ + int ret = 0; + + switch (opcode) { + case VDISK_STAT: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, + in, ctx, flags); + break; + + case VDISK_AIO_FLUSH: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, + in, ctx, flags); + break; + + case VDISK_CHECK_IO_FAILOVER_READY: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, + in, ctx, flags); + break; + + default: + ret = -ENOTSUP; + break; + } + + if (ret) { + *in = 0; + trace_vxhs_qnio_iio_ioctl(opcode); + } + + return ret; +} + +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx) +{ + /* + * Close vDisk device + */ + if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) { + iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd); + s->vdisk_hostinfo[idx].vdisk_rfd = -1; + } + + /* + * Close QNIO channel against cached channel-fd + */ + if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) { + iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd); + s->vdisk_hostinfo[idx].qnio_cfd = -1; + } +} + +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr, + int *rfd, const char *file_name) +{ + /* + * Open qnio channel to storage agent if not opened before. + */ + if (*cfd < 0) { + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); + if (*cfd < 0) { + trace_vxhs_qnio_iio_open(of_vsa_addr); + return -ENODEV; + } + } + + /* + * Open vdisk device + */ + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); + + if (*rfd < 0) { + if (*cfd >= 0) { + iio_close(global_qnio_ctx, *cfd); + *cfd = -1; + *rfd = -1; + } + + trace_vxhs_qnio_iio_devopen(file_name); + return -ENODEV; + } + + return 0; +} + +/* + * Try to reopen the vDisk on one of the available hosts + * If vDisk reopen is successful on any of the host then + * check if that node is ready to accept I/O. + */ +static int vxhs_reopen_vdisk(BDRVVXHSState *s, int index) +{ + VXHSvDiskHostsInfo hostinfo = s->vdisk_hostinfo[index]; + char *of_vsa_addr = NULL; + char *file_name = NULL; + int res = 0; + + /* + * Close stale vdisk device remote-fd and channel-fd + * since it could be invalid after a channel disconnect. + * We will reopen the vdisk later to get the new fd. 
+ */ + vxhs_qnio_iio_close(s, index); + + /* + * Build storage agent address and vdisk device name strings + */ + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); + of_vsa_addr = g_strdup_printf("of://%s:%d", + hostinfo.hostip, hostinfo.port); + + res = vxhs_qnio_iio_open(&hostinfo.qnio_cfd, of_vsa_addr, + &hostinfo.vdisk_rfd, file_name); + + g_free(of_vsa_addr); + g_free(file_name); + return res; +} + +static void vxhs_fail_aio(VXHSAIOCB *acb, int err) +{ + BDRVVXHSState *s = NULL; + int segcount = 0; + int rv = 0; + + s = acb->common.bs->opaque; + + trace_vxhs_fail_aio(s->vdisk_guid, acb); + if (!acb->ret) { + acb->ret = err; + } + qemu_spin_lock(&s->vdisk_acb_lock); + segcount = acb->segments; + qemu_spin_unlock(&s->vdisk_acb_lock); + if (segcount == 0) { + /* + * Complete the io request + */ + rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb)); + if (rv != sizeof(acb)) { + error_report("VXHS AIO completion failed: %s", + strerror(errno)); + abort(); + } + } +} + +static int vxhs_handle_queued_ios(BDRVVXHSState *s) +{ + VXHSAIOCB *acb = NULL; + int res = 0; + + qemu_spin_lock(&s->vdisk_lock); + while ((acb = QSIMPLEQ_FIRST(&s->vdisk_aio_retryq)) != NULL) { + /* + * Before we process the acb, check whether I/O failover + * started again due to failback or cascading failure. + */ + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + qemu_spin_unlock(&s->vdisk_lock); + goto out; + } + QSIMPLEQ_REMOVE_HEAD(&s->vdisk_aio_retryq, retry_entry); + s->vdisk_aio_retry_qd--; + OF_AIOCB_FLAGS_RESET_QUEUED(acb); + if (OF_VDISK_FAILED(s)) { + qemu_spin_unlock(&s->vdisk_lock); + vxhs_fail_aio(acb, EIO); + qemu_spin_lock(&s->vdisk_lock); + } else { + qemu_spin_unlock(&s->vdisk_lock); + res = vxhs_restart_aio(acb); + trace_vxhs_handle_queued_ios(acb, res); + qemu_spin_lock(&s->vdisk_lock); + if (res) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, + acb, retry_entry); + OF_AIOCB_FLAGS_SET_QUEUED(acb); + qemu_spin_unlock(&s->vdisk_lock); + goto out; + } + } + } + qemu_spin_unlock(&s->vdisk_lock); +out: + return res; +} + +/* + * If errors are consistent with storage agent failure: + * - Try to reconnect in case error is transient or storage agent restarted. + * - Currently failover is being triggered on per vDisk basis. There is + * a scope of further optimization where failover can be global (per VM). + * - In case of network (storage agent) failure, for all the vDisks, having + * no redundancy, I/Os will be failed without attempting for I/O failover + * because of stateless nature of vDisk. + * - If local or source storage agent is down then send an ioctl to remote + * storage agent to check if remote storage agent in a state to accept + * application I/Os. + * - Once remote storage agent is ready to accept I/O, start I/O shipping. + * - If I/Os cannot be serviced then vDisk will be marked failed so that + * new incoming I/Os are returned with failure immediately. + * - If vDisk I/O failover is in progress then all new/inflight I/Os will + * queued and will be restarted or failed based on failover operation + * is successful or not. + * - I/O failover can be started either in I/O forward or I/O backward + * path. + * - I/O failover will be started as soon as all the pending acb(s) + * are queued and there is no pending I/O count. + * - If I/O failover couldn't be completed within QNIO_CONNECT_TIMOUT_SECS + * then vDisk will be marked failed and all I/Os will be completed with + * error. 
+ */ + +static int vxhs_switch_storage_agent(BDRVVXHSState *s) +{ + int res = 0; + int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE); + + trace_vxhs_switch_storage_agent( + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, + s->vdisk_guid); + + res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx); + if (res == 0) { + res = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd, + VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags); + } else { + trace_vxhs_switch_storage_agent_failed( + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, + s->vdisk_guid, res, errno); + /* + * Try the next host. + * Calling vxhs_check_failover_status from here ties up the qnio + * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1) + * for all the hosts in the IO target list. + */ + + vxhs_check_failover_status(res, s); + } + return res; +} + +static void vxhs_check_failover_status(int res, void *ctx) +{ + BDRVVXHSState *s = ctx; + + if (res == 0) { + /* found failover target */ + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; + s->vdisk_ask_failover_idx = 0; + trace_vxhs_check_failover_status( + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, + s->vdisk_guid); + qemu_spin_lock(&s->vdisk_lock); + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); + qemu_spin_unlock(&s->vdisk_lock); + vxhs_handle_queued_ios(s); + } else { + /* keep looking */ + trace_vxhs_check_failover_status_retry(s->vdisk_guid); + s->vdisk_ask_failover_idx++; + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { + /* pause and cycle through list again */ + sleep(QNIO_CONNECT_RETRY_SECS); + s->vdisk_ask_failover_idx = 0; + } + res = vxhs_switch_storage_agent(s); + } +} + +static int vxhs_failover_io(BDRVVXHSState *s) +{ + int res = 0; + + trace_vxhs_failover_io(s->vdisk_guid); + + s->vdisk_ask_failover_idx = 0; + res = vxhs_switch_storage_agent(s); + + return res; +} + +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, + uint32_t error, uint32_t opcode) +{ + VXHSAIOCB *acb = NULL; + BDRVVXHSState *s = NULL; + int rv = 0; + int segcount = 0; + + switch (opcode) { + case IRP_READ_REQUEST: + case IRP_WRITE_REQUEST: + + /* + * ctx is VXHSAIOCB* + * ctx is NULL if error is QNIOERROR_CHANNEL_HUP or reason is IIO_REASON_HUP + */ + if (ctx) { + acb = ctx; + s = acb->common.bs->opaque; + } else { + trace_vxhs_iio_callback(error, reason); + goto out; + } + + if (error) { + trace_vxhs_iio_callback_iofail(error, reason, acb, acb->segments); + + if (reason == IIO_REASON_DONE || reason == IIO_REASON_EVENT) { + /* + * Storage agent failed while I/O was in progress + * Fail over only if the qnio channel dropped, indicating + * storage agent failure. Don't fail over in response to other + * I/O errors such as disk failure. + */ + if (error == QNIOERROR_RETRY_ON_SOURCE || error == QNIOERROR_HUP || + error == QNIOERROR_CHANNEL_HUP || error == -1) { + /* + * Start vDisk IO failover once callback is + * called against all the pending IOs. + * If vDisk has no redundancy enabled + * then IO failover routine will mark + * the vDisk failed and fail all the + * AIOs without retry (stateless vDisk) + */ + qemu_spin_lock(&s->vdisk_lock); + if (!OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s); + } + /* + * Check if this acb is already queued before. + * It is possible in case if I/Os are submitted + * in multiple segments (QNIO_MAX_IO_SIZE). 
+ */ + qemu_spin_lock(&s->vdisk_acb_lock); + if (!OF_AIOCB_FLAGS_QUEUED(acb)) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, + acb, retry_entry); + OF_AIOCB_FLAGS_SET_QUEUED(acb); + s->vdisk_aio_retry_qd++; + trace_vxhs_iio_callback_retry(s->vdisk_guid, acb); + } + segcount = --acb->segments; + qemu_spin_unlock(&s->vdisk_acb_lock); + /* + * Decrement AIO count only when callback is called + * against all the segments of aiocb. + */ + if (segcount == 0 && --s->vdisk_aio_count == 0) { + /* + * Start vDisk I/O failover + */ + qemu_spin_unlock(&s->vdisk_lock); + /* + * TODO: + * Need to explore further if it is possible to optimize + * the failover operation on Virtual-Machine (global) + * specific rather vDisk specific. + */ + vxhs_failover_io(s); + goto out; + } + qemu_spin_unlock(&s->vdisk_lock); + goto out; + } + } else if (reason == IIO_REASON_HUP) { + /* + * Channel failed, spontaneous notification, + * not in response to I/O + */ + trace_vxhs_iio_callback_chnlfail(error); + /* + * TODO: Start channel failover when no I/O is outstanding + */ + goto out; + } else { + trace_vxhs_iio_callback_fail(reason, acb, acb->segments, + acb->size, error); + } + } + /* + * Set error into acb if not set. In case if acb is being + * submitted in multiple segments then need to set the error + * only once. + * + * Once acb done callback is called for the last segment + * then acb->ret return status will be sent back to the + * caller. + */ + qemu_spin_lock(&s->vdisk_acb_lock); + if (error && !acb->ret) { + acb->ret = error; + } + --acb->segments; + segcount = acb->segments; + assert(segcount >= 0); + qemu_spin_unlock(&s->vdisk_acb_lock); + /* + * Check if all the outstanding I/Os are done against acb. + * If yes then send signal for AIO completion. + */ + if (segcount == 0) { + rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb)); + if (rv != sizeof(acb)) { + error_report("VXHS AIO completion failed: %s", strerror(errno)); + abort(); + } + } + break; + + case IRP_VDISK_CHECK_IO_FAILOVER_READY: + /* ctx is BDRVVXHSState* */ + assert(ctx); + trace_vxhs_iio_callback_ready(((BDRVVXHSState *)ctx)->vdisk_guid, + error); + vxhs_check_failover_status(error, ctx); + break; + + default: + if (reason == IIO_REASON_HUP) { + /* + * Channel failed, spontaneous notification, + * not in response to I/O + */ + trace_vxhs_iio_callback_chnfail(error, errno); + /* + * TODO: Start channel failover when no I/O is outstanding + */ + } else { + trace_vxhs_iio_callback_unknwn(opcode, error); + } + break; + } +out: + return; +} + +static void vxhs_complete_aio(VXHSAIOCB *acb, BDRVVXHSState *s) +{ + BlockCompletionFunc *cb = acb->common.cb; + void *opaque = acb->common.opaque; + int ret = 0; + + if (acb->ret != 0) { + trace_vxhs_complete_aio(acb, acb->ret); + /* + * We mask all the IO errors generically as EIO for upper layers + * Right now our IO Manager uses non standard error codes. Instead + * of confusing upper layers with incorrect interpretation we are + * doing this workaround. + */ + ret = (-EIO); + } + /* + * Copy back contents from stablization buffer into original iovector + * before returning the IO + */ + if (acb->buffer != NULL) { + qemu_iovec_from_buf(acb->qiov, 0, acb->buffer, acb->qiov->size); + qemu_vfree(acb->buffer); + acb->buffer = NULL; + } + vxhs_dec_vdisk_iocount(s, 1); + acb->aio_done = VXHS_IO_COMPLETED; + qemu_aio_unref(acb); + cb(opaque, ret); +} + +/* + * This is the HyperScale event handler registered to QEMU. 
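
In vxhs_complete_aio() the bounce buffer is copied back into the guest
iovec even when the request failed.  Harmless, but it would read better to
copy only on success, e.g.:

    if (acb->buffer != NULL) {
        if (ret == 0) {
            /* Unaligned read went through the bounce buffer; copy it back */
            qemu_iovec_from_buf(acb->qiov, 0, acb->buffer, acb->qiov->size);
        }
        qemu_vfree(acb->buffer);
        acb->buffer = NULL;
    }
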
+ * It is invoked when any IO gets completed and written on pipe + * by callback called from QNIO thread context. Then it marks + * the AIO as completed, and releases HyperScale AIO callbacks. + */ +static void vxhs_aio_event_reader(void *opaque) +{ + BDRVVXHSState *s = opaque; + ssize_t ret; + + do { + char *p = (char *)&s->qnio_event_acb; + + ret = read(s->fds[VDISK_FD_READ], p + s->event_reader_pos, + sizeof(s->qnio_event_acb) - s->event_reader_pos); + if (ret > 0) { + s->event_reader_pos += ret; + if (s->event_reader_pos == sizeof(s->qnio_event_acb)) { + s->event_reader_pos = 0; + vxhs_complete_aio(s->qnio_event_acb, s); + } + } + } while (ret < 0 && errno == EINTR); +} + +/* + * Call QNIO operation to create channels to do IO on vDisk. + */ + +static void *vxhs_setup_qnio(void) +{ + void *qnio_ctx = NULL; + + qnio_ctx = iio_init(vxhs_iio_callback); + + if (qnio_ctx != NULL) { + trace_vxhs_setup_qnio(qnio_ctx); + } else { + trace_vxhs_setup_qnio_nwerror('.'); + } + + return qnio_ctx; +} + +/* + * This helper function converts an array of iovectors into a flat buffer. + */ + +static void *vxhs_convert_iovector_to_buffer(QEMUIOVector *qiov) +{ + void *buf = NULL; + size_t size = 0; + + if (qiov->niov == 0) { + return buf; + } + + size = qiov->size; + buf = qemu_try_memalign(BDRV_SECTOR_SIZE, size); + if (!buf) { + trace_vxhs_convert_iovector_to_buffer(size); + errno = -ENOMEM; + return NULL; + } + return buf; +} + +/* + * This helper function iterates over the iovector and checks + * if the length of every element is an integral multiple + * of the sector size. + * Return Value: + * On Success : true + * On Failure : false + */ +static int vxhs_is_iovector_read_aligned(QEMUIOVector *qiov, size_t sector) +{ + struct iovec *iov = qiov->iov; + int niov = qiov->niov; + int i; + + for (i = 0; i < niov; i++) { + if (iov[i].iov_len % sector != 0) { + return false; + } + } + return true; +} + +static int32_t +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, + uint64_t offset, void *ctx, uint32_t flags) +{ + struct iovec cur; + uint64_t cur_offset = 0; + uint64_t cur_write_len = 0; + int segcount = 0; + int ret = 0; + int i, nsio = 0; + int iovcnt = qiov->niov; + struct iovec *iov = qiov->iov; + + errno = 0; + cur.iov_base = 0; + cur.iov_len = 0; + + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); + + if (ret == -1 && errno == EFBIG) { + trace_vxhs_qnio_iio_writev(ret); + /* + * IO size is larger than IIO_IO_BUF_SIZE hence need to + * split the I/O at IIO_IO_BUF_SIZE boundary + * There are two cases here: + * 1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE + * 2. iovcnt is greater than 1 and IO size is greater than + * IIO_IO_BUF_SIZE. + * + * Need to adjust the segment count, for that we need to compute + * the segment count and increase the segment count in one shot + * instead of setting iteratively in for loop. It is required to + * prevent any race between the splitted IO submission and IO + * completion. 
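
Two nits in the helpers above: vxhs_convert_iovector_to_buffer() sets
errno = -ENOMEM, but errno values are positive (and the caller returns
-ENOMEM itself, so the assignment can simply go away).  And since
vxhs_is_iovector_read_aligned() is documented as returning true/false, it
may as well return bool:

    static bool vxhs_is_iovector_read_aligned(QEMUIOVector *qiov, size_t sector)
    {
        int i;

        for (i = 0; i < qiov->niov; i++) {
            if (qiov->iov[i].iov_len % sector != 0) {
                return false;
            }
        }
        return true;
    }
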
+ */ + cur_offset = offset; + for (i = 0; i < iovcnt; i++) { + if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) { + cur_offset += iov[i].iov_len; + nsio++; + } else if (iov[i].iov_len > 0) { + cur.iov_base = iov[i].iov_base; + cur.iov_len = IIO_IO_BUF_SIZE; + cur_write_len = 0; + while (1) { + nsio++; + cur_write_len += cur.iov_len; + if (cur_write_len == iov[i].iov_len) { + break; + } + cur_offset += cur.iov_len; + cur.iov_base += cur.iov_len; + if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) { + cur.iov_len = IIO_IO_BUF_SIZE; + } else { + cur.iov_len = (iov[i].iov_len - cur_write_len); + } + } + } + } + + segcount = nsio - 1; + vxhs_inc_acb_segment_count(ctx, segcount); + /* + * Split the IO and submit it to QNIO. + * Reset the cur_offset before splitting the IO. + */ + cur_offset = offset; + nsio = 0; + for (i = 0; i < iovcnt; i++) { + if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) { + errno = 0; + ret = iio_writev(qnio_ctx, rfd, &iov[i], 1, cur_offset, ctx, + flags); + if (ret == -1) { + trace_vxhs_qnio_iio_writev_err(i, iov[i].iov_len, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within QEMU block driver. + */ + if (segcount > 0 && (segcount - nsio) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - nsio); + } + return ret; + } + cur_offset += iov[i].iov_len; + nsio++; + } else if (iov[i].iov_len > 0) { + /* + * This case is where one element of the io vector is > 4MB. + */ + cur.iov_base = iov[i].iov_base; + cur.iov_len = IIO_IO_BUF_SIZE; + cur_write_len = 0; + while (1) { + nsio++; + errno = 0; + ret = iio_writev(qnio_ctx, rfd, &cur, 1, cur_offset, ctx, + flags); + if (ret == -1) { + trace_vxhs_qnio_iio_writev_err(i, cur.iov_len, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within the + * QEMU block driver. + */ + if (segcount > 0 && (segcount - nsio) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - nsio); + } + return ret; + } + + cur_write_len += cur.iov_len; + if (cur_write_len == iov[i].iov_len) { + break; + } + cur_offset += cur.iov_len; + cur.iov_base += cur.iov_len; + if ((iov[i].iov_len - cur_write_len) > + IIO_IO_BUF_SIZE) { + cur.iov_len = IIO_IO_BUF_SIZE; + } else { + cur.iov_len = (iov[i].iov_len - cur_write_len); + } + } + } + } + } + return ret; +} + +/* + * Iterate over the i/o vector and send read request + * to QNIO one by one. + */ +static int32_t +vxhs_qnio_iio_readv(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov, + uint64_t offset, void *ctx, uint32_t flags) +{ + uint64_t read_offset = offset; + void *buffer = NULL; + size_t size; + int aligned, segcount; + int i, ret = 0; + int iovcnt = qiov->niov; + struct iovec *iov = qiov->iov; + + aligned = vxhs_is_iovector_read_aligned(qiov, BDRV_SECTOR_SIZE); + size = qiov->size; + + if (!aligned) { + buffer = vxhs_convert_iovector_to_buffer(qiov); + if (buffer == NULL) { + return -ENOMEM; + } + + errno = 0; + ret = iio_read(qnio_ctx, rfd, buffer, size, read_offset, ctx, flags); + if (ret != 0) { + trace_vxhs_qnio_iio_readv(ctx, ret, errno); + qemu_vfree(buffer); + return ret; + } + vxhs_set_acb_buffer(ctx, buffer); + return ret; + } + + /* + * Since read IO request is going to split based on + * number of IOvectors hence increment the segment + * count depending on the number of IOVectors before + * submitting the read request to QNIO. + * This is needed to protect the QEMU block driver + * IO completion while read request for the same IO + * is being submitted to QNIO. 
+ */ + segcount = iovcnt - 1; + if (segcount > 0) { + vxhs_inc_acb_segment_count(ctx, segcount); + } + + for (i = 0; i < iovcnt; i++) { + errno = 0; + ret = iio_read(qnio_ctx, rfd, iov[i].iov_base, iov[i].iov_len, + read_offset, ctx, flags); + if (ret != 0) { + trace_vxhs_qnio_iio_readv(ctx, ret, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within QEMU block driver. + */ + if (segcount > 0 && (segcount - i) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - i); + } + return ret; + } + read_offset += iov[i].iov_len; + } + + return ret; +} + +static int vxhs_restart_aio(VXHSAIOCB *acb) +{ + BDRVVXHSState *s = NULL; + int iio_flags = 0; + int res = 0; + + s = acb->common.bs->opaque; + + if (acb->direction == VDISK_AIO_WRITE) { + vxhs_inc_vdisk_iocount(s, 1); + vxhs_inc_acb_segment_count(acb, 1); + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + res = vxhs_qnio_iio_writev(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + acb->qiov, acb->io_offset, (void *)acb, iio_flags); + } + + if (acb->direction == VDISK_AIO_READ) { + vxhs_inc_vdisk_iocount(s, 1); + vxhs_inc_acb_segment_count(acb, 1); + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + res = vxhs_qnio_iio_readv(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + acb->qiov, acb->io_offset, (void *)acb, iio_flags); + } + + if (res != 0) { + vxhs_dec_vdisk_iocount(s, 1); + vxhs_dec_acb_segment_count(acb, 1); + trace_vxhs_restart_aio(acb->direction, res, errno); + } + + return res; +} + +static QemuOptsList runtime_opts = { + .name = "vxhs", + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), + .desc = { + { + .name = VXHS_OPT_FILENAME, + .type = QEMU_OPT_STRING, + .help = "URI to the Veritas HyperScale image", + }, + { + .name = VXHS_OPT_VDISK_ID, + .type = QEMU_OPT_STRING, + .help = "UUID of the VxHS vdisk", + }, + { /* end of list */ } + }, +}; + +static QemuOptsList runtime_tcp_opts = { + .name = "vxhs_tcp", + .head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head), + .desc = { + { + .name = VXHS_OPT_HOST, + .type = QEMU_OPT_STRING, + .help = "host address (ipv4 addresses)", + }, + { + .name = VXHS_OPT_PORT, + .type = QEMU_OPT_NUMBER, + .help = "port number on which VxHSD is listening (default 9999)", + .def_value_str = "9999" + }, + { /* end of list */ } + }, +}; + +/* + * Parse the incoming URI and populate *options with the host information. + * URI syntax has the limitation of supporting only one host info. + * To pass multiple host information, use the JSON syntax. 
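
Back in vxhs_qnio_iio_writev(): the "count first, then submit" passes
duplicate the splitting arithmetic, and if the two loops ever diverge the
segment count goes wrong silently.  Just an idea, not a must: a small
helper would let both passes share the computation.  Minimal sketch,
untested:

    static int vxhs_count_iio_requests(struct iovec *iov, int iovcnt)
    {
        int i, nsio = 0;

        for (i = 0; i < iovcnt; i++) {
            if (iov[i].iov_len > 0) {
                /* Each element is split at IIO_IO_BUF_SIZE boundaries */
                nsio += DIV_ROUND_UP(iov[i].iov_len, IIO_IO_BUF_SIZE);
            }
        }
        return nsio;
    }

Also, in vxhs_restart_aio() a request whose direction is neither read nor
write leaves res == 0, so it would neither be resubmitted nor failed; an
abort() in a default branch would catch that case early.
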
+ */ +static int vxhs_parse_uri(const char *filename, QDict *options) +{ + URI *uri = NULL; + char *hoststr, *portstr; + char *port; + int ret = 0; + + trace_vxhs_parse_uri_filename(filename); + + uri = uri_parse(filename); + if (!uri || !uri->server || !uri->path) { + uri_free(uri); + return -EINVAL; + } + + hoststr = g_strdup(VXHS_OPT_SERVER"0.host"); + qdict_put(options, hoststr, qstring_from_str(uri->server)); + g_free(hoststr); + + portstr = g_strdup(VXHS_OPT_SERVER"0.port"); + if (uri->port) { + port = g_strdup_printf("%d", uri->port); + qdict_put(options, portstr, qstring_from_str(port)); + g_free(port); + } + g_free(portstr); + + if (strstr(uri->path, "vxhs") == NULL) { + qdict_put(options, "vdisk_id", qstring_from_str(uri->path)); + } + + trace_vxhs_parse_uri_hostinfo(1, uri->server, uri->port); + uri_free(uri); + + return ret; +} + +static void vxhs_parse_filename(const char *filename, QDict *options, + Error **errp) +{ + if (qdict_haskey(options, "vdisk_id") + || qdict_haskey(options, "server")) + { + error_setg(errp, "vdisk_id/server and a file name may not be specified " + "at the same time"); + return; + } + + if (strstr(filename, "://")) { + int ret = vxhs_parse_uri(filename, options); + if (ret < 0) { + error_setg(errp, "Invalid URI. URI should be of the form " + " vxhs://<host_ip>:<port>/{<vdisk_id>}"); + } + } +} + +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, + int *cfd, int *rfd, Error **errp) +{ + QDict *backing_options = NULL; + QemuOpts *opts, *tcp_opts; + const char *vxhs_filename; + char *of_vsa_addr = NULL; + Error *local_err = NULL; + const char *vdisk_id_opt; + char *file_name = NULL; + size_t num_servers = 0; + char *str = NULL; + int ret = 0; + int i; + + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); + qemu_opts_absorb_qdict(opts, options, &local_err); + if (local_err) { + error_propagate(errp, local_err); + ret = -EINVAL; + goto out; + } + + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); + if (vxhs_filename) { + trace_vxhs_qemu_init_filename(vxhs_filename); + } + + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); + if (!vdisk_id_opt) { + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); + ret = -EINVAL; + goto out; + } + s->vdisk_guid = g_strdup(vdisk_id_opt); + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); + + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); + if (num_servers < 1) { + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); + ret = -EINVAL; + goto out; + } else if (num_servers > VXHS_MAX_HOSTS) { + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); + error_append_hint(errp, "Maximum %d servers allowed.\n", + VXHS_MAX_HOSTS); + ret = -EINVAL; + goto out; + } + trace_vxhs_qemu_init_numservers(num_servers); + + for (i = 0; i < num_servers; i++) { + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); + qdict_extract_subqdict(options, &backing_options, str); + + /* Create opts info from runtime_tcp_opts list */ + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); + if (local_err) { + qdict_del(backing_options, str); + qemu_opts_del(tcp_opts); + g_free(str); + ret = -EINVAL; + goto out; + } + + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, + VXHS_OPT_HOST)); + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, + VXHS_OPT_PORT), + NULL, 0); + + s->vdisk_hostinfo[i].qnio_cfd = -1; + s->vdisk_hostinfo[i].vdisk_rfd = -1; + 
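
Since VXHS_OPT_PORT is declared as QEMU_OPT_NUMBER with a default, I think
you can let QemuOpts do the conversion instead of g_ascii_strtoll(), which
quietly accepts garbage:

    s->vdisk_hostinfo[i].port = (int)qemu_opt_get_number(tcp_opts,
                                                         VXHS_OPT_PORT, 9999);

(adjust the cast to whatever type port actually has).  Also, in the
num_servers > VXHS_MAX_HOSTS branch the error goes into local_err while the
hint is appended to errp, which is still empty at that point; I think you
want error_append_hint(&local_err, ...) there so the hint isn't lost.
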
trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, + s->vdisk_hostinfo[i].port); + + qdict_del(backing_options, str); + qemu_opts_del(tcp_opts); + g_free(str); + } + + s->vdisk_nhosts = i; + s->vdisk_cur_host_idx = 0; + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); + of_vsa_addr = g_strdup_printf("of://%s:%d", + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); + + /* + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. + */ + if (global_qnio_ctx == NULL) { + global_qnio_ctx = vxhs_setup_qnio(); + if (global_qnio_ctx == NULL) { + error_setg(&local_err, "Failed vxhs_setup_qnio"); + ret = -EINVAL; + goto out; + } + } + + ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name); + if (!ret) { + error_setg(&local_err, "Failed qnio_iio_open"); + ret = -EIO; + } + +out: + g_free(file_name); + g_free(of_vsa_addr); + qemu_opts_del(opts); + + if (ret < 0) { + for (i = 0; i < num_servers; i++) { + g_free(s->vdisk_hostinfo[i].hostip); + } + g_free(s->vdisk_guid); + s->vdisk_guid = NULL; + errno = -ret; + } + error_propagate(errp, local_err); + + return ret; +} + +static int vxhs_open(BlockDriverState *bs, QDict *options, + int bdrv_flags, Error **errp) +{ + BDRVVXHSState *s = bs->opaque; + AioContext *aio_context; + int qemu_qnio_cfd = -1; + int device_opened = 0; + int qemu_rfd = -1; + int ret = 0; + int i; + + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); + if (ret < 0) { + trace_vxhs_open_fail(ret); + return ret; + } + + device_opened = 1; + s->qnio_ctx = global_qnio_ctx; + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; + s->vdisk_size = 0; + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); + + /* + * Create a pipe for communicating between two threads in different + * context. Set handler for read event, which gets triggered when + * IO completion is done by non-QEMU context. + */ + ret = qemu_pipe(s->fds); + if (ret < 0) { + trace_vxhs_open_epipe('.'); + ret = -errno; + goto errout; + } + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); + + aio_context = bdrv_get_aio_context(bs); + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], + false, vxhs_aio_event_reader, NULL, s); + + /* + * Initialize the spin-locks. + */ + qemu_spin_init(&s->vdisk_lock); + qemu_spin_init(&s->vdisk_acb_lock); + + return 0; + +errout: + /* + * Close remote vDisk device if it was opened earlier + */ + if (device_opened) { + for (i = 0; i < s->vdisk_nhosts; i++) { + vxhs_qnio_iio_close(s, i); + } + } + trace_vxhs_open_fail(ret); + return ret; +} + +static const AIOCBInfo vxhs_aiocb_info = { + .aiocb_size = sizeof(VXHSAIOCB) +}; + +/* + * This allocates QEMU-VXHS callback for each IO + * and is passed to QNIO. When QNIO completes the work, + * it will be passed back through the callback. + */ +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, + void *opaque, int iodir) +{ + VXHSAIOCB *acb = NULL; + BDRVVXHSState *s = bs->opaque; + size_t size; + uint64_t offset; + int iio_flags = 0; + int ret = 0; + void *qnio_ctx = s->qnio_ctx; + uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd; + + offset = sector_num * BDRV_SECTOR_SIZE; + size = nb_sectors * BDRV_SECTOR_SIZE; + + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); + /* + * Setup or initialize VXHSAIOCB. 
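
Minor, in vxhs_open(): the bare fcntl(F_SETFL, O_NONBLOCK) discards any
flags already set on the fd and its return value is ignored.
qemu_set_nonblock() takes care of both; IIRC it is what other drivers use
for their completion pipes:

    qemu_set_nonblock(s->fds[VDISK_FD_READ]);
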
+ * Every single field should be initialized since + * acb will be picked up from the slab without + * initializing with zero. + */ + acb->io_offset = offset; + acb->size = size; + acb->ret = 0; + acb->flags = 0; + acb->aio_done = VXHS_IO_INPROGRESS; + acb->segments = 0; + acb->buffer = 0; + acb->qiov = qiov; + acb->direction = iodir; + + qemu_spin_lock(&s->vdisk_lock); + if (OF_VDISK_FAILED(s)) { + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); + qemu_spin_unlock(&s->vdisk_lock); + goto errout; + } + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + qemu_spin_unlock(&s->vdisk_lock); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); + goto out; + } + s->vdisk_aio_count++; + qemu_spin_unlock(&s->vdisk_lock); + + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + + switch (iodir) { + case VDISK_AIO_WRITE: + vxhs_inc_acb_segment_count(acb, 1); + ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov, + offset, (void *)acb, iio_flags); + break; + case VDISK_AIO_READ: + vxhs_inc_acb_segment_count(acb, 1); + ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov, + offset, (void *)acb, iio_flags); + break; + default: + trace_vxhs_aio_rw_invalid(iodir); + goto errout; + } + + if (ret != 0) { + trace_vxhs_aio_rw_ioerr( + s->vdisk_guid, iodir, size, offset, + acb, acb->segments, ret, errno); + /* + * Don't retry I/Os against vDisk having no + * redundancy or stateful storage on compute + * + * TODO: Revisit this code path to see if any + * particular error needs to be handled. + * At this moment failing the I/O. + */ + qemu_spin_lock(&s->vdisk_lock); + if (s->vdisk_nhosts == 1) { + trace_vxhs_aio_rw_iofail(s->vdisk_guid); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + qemu_spin_unlock(&s->vdisk_lock); + goto errout; + } + if (OF_VDISK_FAILED(s)) { + trace_vxhs_aio_rw_devfail( + s->vdisk_guid, iodir, size, offset); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + qemu_spin_unlock(&s->vdisk_lock); + goto errout; + } + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + /* + * Queue all incoming io requests after failover starts. + * Number of requests that can arrive is limited by io queue depth + * so an app blasting independent ios will not exhaust memory. + */ + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + qemu_spin_unlock(&s->vdisk_lock); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 2); + goto out; + } + OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s); + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + vxhs_dec_acb_segment_count(acb, 1); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 3); + /* + * Start I/O failover if there is no active + * AIO within vxhs block driver. 
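
One path above leaks the in-flight counter: for an unexpected iodir you go
to errout after s->vdisk_aio_count has already been incremented, so it is
never decremented.  Something like this would fix it (or simply abort(),
since only reads and writes can reach this function):

    default:
        trace_vxhs_aio_rw_invalid(iodir);
        qemu_spin_lock(&s->vdisk_lock);
        s->vdisk_aio_count--;
        qemu_spin_unlock(&s->vdisk_lock);
        goto errout;
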
+ */ + if (--s->vdisk_aio_count == 0) { + qemu_spin_unlock(&s->vdisk_lock); + /* + * Start IO failover + */ + vxhs_failover_io(s); + goto out; + } + qemu_spin_unlock(&s->vdisk_lock); + } + +out: + return &acb->common; + +errout: + qemu_aio_unref(acb); + return NULL; +} + +static BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, void *opaque) +{ + return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, + cb, opaque, VDISK_AIO_READ); +} + +static BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, void *opaque) +{ + return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, + cb, opaque, VDISK_AIO_WRITE); +} + +static void vxhs_close(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int i; + + trace_vxhs_close(s->vdisk_guid); + close(s->fds[VDISK_FD_READ]); + close(s->fds[VDISK_FD_WRITE]); + + /* + * Clearing all the event handlers for oflame registered to QEMU + */ + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], + false, NULL, NULL, NULL); + g_free(s->vdisk_guid); + s->vdisk_guid = NULL; + + for (i = 0; i < VXHS_MAX_HOSTS; i++) { + vxhs_qnio_iio_close(s, i); + /* + * Free the dynamically allocated hostip string + */ + g_free(s->vdisk_hostinfo[i].hostip); + s->vdisk_hostinfo[i].hostip = NULL; + s->vdisk_hostinfo[i].port = 0; + } +} + +/* + * This is called by QEMU when a flush gets triggered from within + * a guest at the block layer, either for IDE or SCSI disks. + */ +static int vxhs_co_flush(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t size = 0; + int ret = 0; + + /* + * VDISK_AIO_FLUSH ioctl is a no-op at present and will + * always return success. This could change in the future. + */ + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC); + + if (ret < 0) { + trace_vxhs_co_flush(s->vdisk_guid, ret, errno); + vxhs_close(bs); + } + + return ret; +} + +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s) +{ + int64_t vdisk_size = 0; + int ret = 0; + + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + VDISK_STAT, &vdisk_size, NULL, 0); + + if (ret < 0) { + trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno); + return 0; + } + + trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size); + return vdisk_size; +} + +/* + * Returns the size of vDisk in bytes. This is required + * by QEMU block upper block layer so that it is visible + * to guest. + */ +static int64_t vxhs_getlength(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t vdisk_size = 0; + + if (s->vdisk_size > 0) { + vdisk_size = s->vdisk_size; + } else { + /* + * Fetch the vDisk size using stat ioctl + */ + vdisk_size = vxhs_get_vdisk_stat(s); + if (vdisk_size > 0) { + s->vdisk_size = vdisk_size; + } + } + + if (vdisk_size > 0) { + return vdisk_size; /* return size in bytes */ + } + + return -EIO; +} + +/* + * Returns actual blocks allocated for the vDisk. + * This is required by qemu-img utility. 
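
In vxhs_close() the pipe fds are closed before the read handler is removed,
so the event loop can briefly poll a closed (or reused) fd.  Swapping the
order avoids that:

    /* Deregister the handler first, then close the pipe */
    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
                       false, NULL, NULL, NULL);
    close(s->fds[VDISK_FD_READ]);
    close(s->fds[VDISK_FD_WRITE]);

By the way, "oflame" in the comment above looks like a leftover name.
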
+ */ +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t vdisk_size = 0; + + if (s->vdisk_size > 0) { + vdisk_size = s->vdisk_size; + } else { + /* + * TODO: + * Once HyperScale storage-virtualizer provides + * actual physical allocation of blocks then + * fetch that information and return back to the + * caller but for now just get the full size. + */ + vdisk_size = vxhs_get_vdisk_stat(s); + if (vdisk_size > 0) { + s->vdisk_size = vdisk_size; + } + } + + if (vdisk_size > 0) { + return vdisk_size; /* return size in bytes */ + } + + return -EIO; +} + +static void vxhs_detach_aio_context(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], + false, NULL, NULL, NULL); + +} + +static void vxhs_attach_aio_context(BlockDriverState *bs, + AioContext *new_context) +{ + BDRVVXHSState *s = bs->opaque; + + aio_set_fd_handler(new_context, s->fds[VDISK_FD_READ], + false, vxhs_aio_event_reader, NULL, s); +} + +static BlockDriver bdrv_vxhs = { + .format_name = "vxhs", + .protocol_name = "vxhs", + .instance_size = sizeof(BDRVVXHSState), + .bdrv_file_open = vxhs_open, + .bdrv_parse_filename = vxhs_parse_filename, + .bdrv_close = vxhs_close, + .bdrv_getlength = vxhs_getlength, + .bdrv_get_allocated_file_size = vxhs_get_allocated_blocks, + .bdrv_aio_readv = vxhs_aio_readv, + .bdrv_aio_writev = vxhs_aio_writev, + .bdrv_co_flush_to_disk = vxhs_co_flush, + .bdrv_detach_aio_context = vxhs_detach_aio_context, + .bdrv_attach_aio_context = vxhs_attach_aio_context, +}; + +static void bdrv_vxhs_init(void) +{ + trace_vxhs_bdrv_init('.'); + bdrv_register(&bdrv_vxhs); +} + +block_init(bdrv_vxhs_init); diff --git a/configure b/configure index 8fa62ad..50fe935 100755 --- a/configure +++ b/configure @@ -320,6 +320,7 @@ numa="" tcmalloc="no" jemalloc="no" replication="yes" +vxhs="" # parse CC options first for opt do @@ -1159,6 +1160,11 @@ for opt do ;; --enable-replication) replication="yes" ;; + --disable-vxhs) vxhs="no" + ;; + --enable-vxhs) vxhs="yes" + ;; + *) echo "ERROR: unknown option $opt" echo "Try '$0 --help' for more information" @@ -1388,6 +1394,7 @@ disabled with --disable-FEATURE, default is enabled if available: tcmalloc tcmalloc support jemalloc jemalloc support replication replication support + vxhs Veritas HyperScale vDisk backend support NOTE: The object files are built at the place where configure is launched EOF @@ -4513,6 +4520,33 @@ if do_cc -nostdlib -Wl,-r -Wl,--no-relax -o $TMPMO $TMPO; then fi ########################################## +# Veritas HyperScale block driver VxHS +# Check if libqnio is installed + +if test "$vxhs" != "no" ; then + cat > $TMPC <<EOF +#include <stdint.h> +#include <qnio/qnio_api.h> + +void *vxhs_callback; + +int main(void) { + iio_init(vxhs_callback); + return 0; +} +EOF + vxhs_libs="-lqnio" + if compile_prog "" "$vxhs_libs" ; then + vxhs=yes + else + if test "$vxhs" = "yes" ; then + feature_not_found "vxhs block device" "Install libqnio. 
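
vxhs_getlength() and vxhs_get_allocated_blocks() are identical right now;
until the storage layer can report real allocation they could share one
helper.  Minimal sketch (name is only a suggestion):

    static int64_t vxhs_get_cached_vdisk_size(BDRVVXHSState *s)
    {
        if (s->vdisk_size == 0) {
            int64_t size = vxhs_get_vdisk_stat(s);

            if (size > 0) {
                s->vdisk_size = size;
            }
        }
        return s->vdisk_size > 0 ? s->vdisk_size : -EIO;
    }

While there, vxhs_get_vdisk_stat() could return int64_t instead of
unsigned long, which truncates on 32-bit hosts.
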
See github" + fi + vxhs=no + fi +fi + +########################################## # End of CC checks # After here, no more $cc or $ld runs @@ -4877,6 +4911,7 @@ echo "tcmalloc support $tcmalloc" echo "jemalloc support $jemalloc" echo "avx2 optimization $avx2_opt" echo "replication support $replication" +echo "VxHS block device $vxhs" if test "$sdl_too_old" = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -5465,6 +5500,12 @@ if test "$pthread_setname_np" = "yes" ; then echo "CONFIG_PTHREAD_SETNAME_NP=y" >> $config_host_mak fi +if test "$vxhs" = "yes" ; then + echo "CONFIG_VXHS=y" >> $config_host_mak + echo "VXHS_CFLAGS=$vxhs_cflags" >> $config_host_mak + echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak +fi + if test "$tcg_interpreter" = "yes"; then QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES" elif test "$ARCH" = "sparc64" ; then
This patch adds support for a new block device type called "vxhs".
Source code for the library that this code loads can be downloaded from:
https://github.com/MittalAshish/libqnio.git

Sample command line using JSON syntax:
./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'

Sample command line using URI syntax:
qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D

Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>
---
v7 changelog:
(1) Got rid of the header file and most of function forward-declarations.
(2) Added wrappers for vxhs_qnio_iio_open() and vxhs_qnio_iio_close()
(3) Fixed a double close attempt of vdisk_rfd.
(4) Changed to pass QEMUIOVector * in a couple of functions instead of
    individual structure members.
(5) Got rid of VXHS_VECTOR_ALIGNED/NOT_ALIGNED.
(6) Got rid of vxhs_calculate_iovec_size().
(7) Changed to use qemu_try_memalign().
(8) Got rid of unnecessary "else" conditions in a couple of places.
(9) Limited the filename case to pass a single URI in vxhs_parse_uri().
    Users will have to use the host/port/vdisk_id syntax to specify
    multiple host information.
(10) Inlined couple of macros including the ones for qemu_spin_unlock.
(11) Other miscellaneous changes.

v6 changelog:
(1) Removed cJSON dependency out of the libqnioshim layer.
(2) Merged libqnioshim code into qemu vxhs driver proper. Now qemu-vxhs
    code only links with libqnio.so.
(3) Replaced use of custom spinlocks with qemu_spin_lock.

v5 changelog:
(1) Removed unused functions.
(2) Changed all qemu_ prefix for functions defined in libqnio and vxhs.c.
(3) Fixed memory leaks in vxhs_qemu_init() and on the close of vxhs device.
(4) Added upper bounds check on num_servers.
(5) Close channel fds whereever necessary.
(6) Changed vdisk_size to int64_t for 32-bit compilations.
(7) Added message to configure file to indicate if vxhs is enabled or not.

v4 changelog:
(1) Reworked QAPI/JSON parsing.
(2) Reworked URI parsing as suggested by Kevin.
(3) Fixes per review comments from Stefan on v1.
(4) Fixes per review comments from Daniel on v3.

v3 changelog:
(1) Implemented QAPI interface for passing VxHS block device parameters.

v2 changelog:
(1) Removed code to dlopen library. We now check if libqnio is installed
    during configure, and directly link with it.
(2) Changed file headers to mention GPLv2-or-later license.
(3) Removed unnecessary type casts and inlines.
(4) Removed custom tokenize function and modified code to use g_strsplit.
(5) Replaced malloc/free with g_new/g_free and removed code that checks
    for memory allocation failure conditions.
(6) Removed some block ops implementations that were place-holders only.
(7) Removed all custom debug messages. Added new messages in
    block/trace-events
(8) Other miscellaneous corrections.

v1 changelog:
(1) First patch submission for review comments.

 block/Makefile.objs |    2 +
 block/trace-events  |   47 ++
 block/vxhs.c        | 1645 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure           |   41 ++
 4 files changed, 1735 insertions(+)
 create mode 100644 block/vxhs.c