From patchwork Wed Sep 21 01:07:45 2016
From: Ashish Mittal <ashish.mittal@veritas.com>
To: qemu-devel@nongnu.org, pbonzini@redhat.com, kwolf@redhat.com, armbru@redhat.com, berrange@redhat.com, ashish.mittal@veritas.com, stefanha@gmail.com
Cc: Ketan.Nilangekar@veritas.com, Abhijit.Dey@veritas.com
Date: Tue, 20 Sep 2016 18:07:45 -0700
Message-Id: <1474420065-110145-1-git-send-email-ashish.mittal@veritas.com>
Subject: [Qemu-devel] [PATCH v6 RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support

This patch adds support for a new block device type called "vxhs".
Source code for the library that this code loads can be downloaded from:
https://github.com/MittalAshish/libqnio.git

Sample command line using JSON syntax:

./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'

Sample command line using URI syntax (%7B and %7D are the percent-encoded
braces around the vdisk UUID):

qemu-img convert -f raw -O raw -n /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D

Signed-off-by: Ashish Mittal <ashish.mittal@veritas.com>
---
v6 changelog:
(1) Removed the cJSON dependency from the libqnioshim layer.
(2) Merged the libqnioshim code into the QEMU vxhs driver proper; the vxhs
    code now links only with libqnio.so.
(3) Replaced use of custom spinlocks with qemu_spin_lock.

v5 changelog:
(1) Removed unused functions.
(2) Changed the qemu_ prefix on all functions defined in libqnio and vxhs.c.
(3) Fixed memory leaks in vxhs_qemu_init() and on close of the vxhs device.
(4) Added an upper-bounds check on num_servers.
(5) Close channel fds wherever necessary.
(6) Changed vdisk_size to int64_t for 32-bit compilations.
(7) Added a message to configure output indicating whether vxhs is enabled.

v4 changelog:
(1) Reworked QAPI/JSON parsing.
(2) Reworked URI parsing as suggested by Kevin.
(3) Fixes per review comments from Stefan on v1.
(4) Fixes per review comments from Daniel on v3.

v3 changelog:
(1) Implemented a QAPI interface for passing VxHS block device parameters.

v2 changelog:
(1) Removed code to dlopen the library. We now check for libqnio during
    configure and link with it directly.
(2) Changed file headers to mention the GPLv2-or-later license.
(3) Removed unnecessary type casts and inlines.
(4) Removed the custom tokenize function; the code now uses g_strsplit.
(5) Replaced malloc/free with g_new/g_free and removed code that checks
    for memory allocation failure conditions.
(6) Removed some block ops implementations that were placeholders only.
(7) Removed all custom debug messages; added new messages in
    block/trace-events.
(8) Other miscellaneous corrections.

v1 changelog:
(1) First patch submission for review comments.
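Note for reviewers: the driver hands I/O to libqnio, whose callback thread
runs outside the QEMU event loop. Completion is signalled by writing the
VXHSAIOCB pointer into a pipe (vxhs_iio_callback) and draining it from the
event loop (vxhs_aio_event_reader). The standalone sketch below shows only
that hand-off pattern; every name in it is invented for the demo and none
of it is libqnio API.

/*
 * Minimal sketch (not part of the patch) of the pipe-based completion
 * hand-off used by this driver: a worker thread writes an opaque
 * pointer into a pipe; the event-loop side reads pointers back out
 * and completes them. Build: cc demo.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct DemoACB {
    int id;
    int ret;
} DemoACB;

static int fds[2]; /* fds[0] = read end, fds[1] = write end */

static void *completion_thread(void *arg)
{
    DemoACB *acb = arg;

    acb->ret = 0; /* pretend the I/O succeeded */
    /* Signal completion by writing the pointer itself, exactly
     * sizeof(acb) bytes, as vxhs_iio_callback() does. */
    if (write(fds[1], &acb, sizeof(acb)) != sizeof(acb)) {
        perror("write");
        abort();
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    DemoACB *acb = malloc(sizeof(*acb));
    DemoACB *done = NULL;

    acb->id = 42;
    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }
    pthread_create(&tid, NULL, completion_thread, acb);

    /* Event-loop side: read one pointer back and complete the request.
     * vxhs_aio_event_reader() does this incrementally and non-blocking;
     * a single blocking read keeps the sketch short. */
    if (read(fds[0], &done, sizeof(done)) != sizeof(done)) {
        perror("read");
        return 1;
    }
    printf("completed acb %d, ret %d\n", done->id, done->ret);

    pthread_join(tid, NULL);
    free(acb);
    return 0;
}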
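When a write is larger than libqnio accepts in one request (iio_writev
fails with EFBIG), vxhs_qnio_iio_writev() splits the vector at
IIO_IO_BUF_SIZE granularity. It first walks the vector once to count the
sub-requests and raises the acb segment count in one shot, so completions
of already-submitted pieces cannot race the submission of the rest. That
count is simply the sum of ceil(iov_len / IIO_IO_BUF_SIZE) over non-empty
elements; a compact sketch, with a demo constant standing in for
IIO_IO_BUF_SIZE:

/* Illustrative only: compute how many sub-requests an iovec splits
 * into at a given size limit, matching the first pass of
 * vxhs_qnio_iio_writev(). */
#include <stdio.h>
#include <sys/uio.h>

#define DEMO_BUF_SIZE (4 * 1024 * 1024) /* stand-in for IIO_IO_BUF_SIZE */

static int count_subrequests(const struct iovec *iov, int iovcnt, size_t max)
{
    int i, nsio = 0;

    for (i = 0; i < iovcnt; i++) {
        size_t len = iov[i].iov_len;

        if (len > 0) {
            /* ceil(len / max): one piece per max-sized chunk, mirroring
             * the inner while (1) loop of the driver */
            nsio += (len + max - 1) / max;
        }
    }
    return nsio;
}

int main(void)
{
    struct iovec v[2] = {
        { .iov_base = NULL, .iov_len = 2 * 1024 * 1024 }, /* 1 piece  */
        { .iov_base = NULL, .iov_len = 9 * 1024 * 1024 }, /* 3 pieces */
    };
    int nsio = count_subrequests(v, 2, DEMO_BUF_SIZE);

    /* The driver then adds (nsio - 1) to acb->segments, because the
     * acb was already counted once at submission time. */
    printf("sub-requests: %d\n", nsio); /* prints 4 */
    return 0;
}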
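The failover comment block above vxhs_switch_storage_agent() describes
cycling vdisk_ask_failover_idx through the host list, pausing
QNIO_CONNECT_RETRY_SECS after each full pass. The real code drives this
asynchronously through the IRP_VDISK_CHECK_IO_FAILOVER_READY callback (and
its comments also mention a QNIO_CONNECT_TIMOUT_SECS bound, which this
sketch omits); the synchronous rendering below only shows the cycle/pause
shape of the loop:

/* Schematic, synchronous rendering of the host-selection loop.
 * try_host() is a stub standing in for vxhs_reopen_vdisk() plus the
 * failover-ready ioctl. */
#include <stdio.h>
#include <unistd.h>

#define QNIO_CONNECT_RETRY_SECS 5 /* as in block/vxhs.h */

static int attempts;

static int try_host(int idx)
{
    return ++attempts < 3 ? -1 : 0; /* stub: third probe succeeds */
}

static int pick_failover_target(int nhosts)
{
    int idx = 0;

    for (;;) {
        if (try_host(idx) == 0) {
            return idx; /* found a host ready to accept I/O */
        }
        if (++idx == nhosts) {
            /* pause, then cycle through the list again */
            sleep(QNIO_CONNECT_RETRY_SECS);
            idx = 0;
        }
    }
}

int main(void)
{
    printf("shipping I/O to host %d\n", pick_failover_target(4));
    return 0;
}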
block/Makefile.objs | 2 + block/trace-events | 47 ++ block/vxhs.c | 1602 +++++++++++++++++++++++++++++++++++++++++++++++++++ block/vxhs.h | 221 +++++++ configure | 41 ++ 5 files changed, 1913 insertions(+) create mode 100644 block/vxhs.c create mode 100644 block/vxhs.h diff --git a/block/Makefile.objs b/block/Makefile.objs index 55da626..bafb7c9 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -18,6 +18,7 @@ block-obj-$(CONFIG_LIBNFS) += nfs.o block-obj-$(CONFIG_CURL) += curl.o block-obj-$(CONFIG_RBD) += rbd.o block-obj-$(CONFIG_GLUSTERFS) += gluster.o +block-obj-$(CONFIG_VXHS) += vxhs.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o dirty-bitmap.o @@ -37,6 +38,7 @@ rbd.o-cflags := $(RBD_CFLAGS) rbd.o-libs := $(RBD_LIBS) gluster.o-cflags := $(GLUSTERFS_CFLAGS) gluster.o-libs := $(GLUSTERFS_LIBS) +vxhs.o-libs := $(VXHS_LIBS) ssh.o-cflags := $(LIBSSH2_CFLAGS) ssh.o-libs := $(LIBSSH2_LIBS) archipelago.o-libs := $(ARCHIPELAGO_LIBS) diff --git a/block/trace-events b/block/trace-events index 05fa13c..b0098a7 100644 --- a/block/trace-events +++ b/block/trace-events @@ -114,3 +114,50 @@ qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64 qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64 qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu" + +# block/vxhs.c +vxhs_bdrv_init(const char c) "Registering VxHS AIO driver%c" +vxhs_iio_callback(int error, int reason) "ctx is NULL: error %d, reason %d" +vxhs_setup_qnio(void *s) "Context to HyperScale IO manager = %p" +vxhs_setup_qnio_nwerror(char c) "Could not initialize the network channel. Bailing out%c" +vxhs_iio_callback_iofail(int err, int reason, void *acb, int seg) "Read/Write failed: error %d, reason %d, acb %p, segment %d" +vxhs_iio_callback_retry(char *guid, void *acb) "vDisk %s, added acb %p to retry queue (5)" +vxhs_iio_callback_chnlfail(int error) "QNIO channel failed, no i/o (%d)" +vxhs_iio_callback_fail(int r, void *acb, int seg, uint64_t size, int err) " ALERT: reason = %d , acb = %p, acb->segments = %d, acb->size = %lu Error = %d" +vxhs_fail_aio(char * guid, void *acb) "vDisk %s, failing acb %p" +vxhs_iio_callback_ready(char *vd, int err) "async vxhs_iio_callback: IRP_VDISK_CHECK_IO_FAILOVER_READY completed for vdisk %s with error %d" +vxhs_iio_callback_chnfail(int err, int error) "QNIO channel failed, no i/o %d, %d" +vxhs_iio_callback_unknwn(int opcode, int err) "unexpected opcode %d, errno %d" +vxhs_open_fail(int ret) "Could not open the device. Error = %d" +vxhs_open_epipe(char c) "Could not create a pipe for device. Bailing out%c" +vxhs_aio_rw(char *guid, int iodir, uint64_t size, uint64_t offset) "vDisk %s, vDisk device is in failed state iodir = %d size = %lu offset = %lu" +vxhs_aio_rw_retry(char *guid, void *acb, int queue) "vDisk %s, added acb %p to retry queue(%d)" +vxhs_aio_rw_invalid(int req) "Invalid I/O request iodir %d" +vxhs_aio_rw_ioerr(char *guid, int iodir, uint64_t size, uint64_t off, void *acb, int seg, int ret, int err) "IO ERROR (vDisk %s) FOR : Read/Write = %d size = %lu offset = %lu ACB = %p Segments = %d. 
Error = %d, errno = %d" +vxhs_co_flush(char *guid, int ret, int err) "vDisk (%s) Flush ioctl failed ret = %d errno = %d" +vxhs_get_vdisk_stat_err(char *guid, int ret, int err) "vDisk (%s) stat ioctl failed, ret = %d, errno = %d" +vxhs_get_vdisk_stat(char *vdisk_guid, uint64_t vdisk_size) "vDisk %s stat ioctl returned size %lu" +vxhs_switch_storage_agent(char *ip, char *guid) "Query host %s for vdisk %s" +vxhs_switch_storage_agent_failed(char *ip, char *guid, int res, int err) "Query to host %s for vdisk %s failed, res = %d, errno = %d" +vxhs_check_failover_status(char *ip, char *guid) "Switched to storage server host-IP %s for vdisk %s" +vxhs_check_failover_status_retry(char *guid) "failover_ioctl_cb: keep looking for io target for vdisk %s" +vxhs_failover_io(char *vdisk) "I/O Failover starting for vDisk %s" +vxhs_reopen_vdisk(char *ip) "Failed to connect to storage agent on host-ip %s" +vxhs_reopen_vdisk_openfail(char *fname) "Failed to open vdisk device: %s" +vxhs_handle_queued_ios(void *acb, int res) "Restarted acb %p res %d" +vxhs_restart_aio(int dir, int res, int err) "IO ERROR FOR: Read/Write = %d Error = %d, errno = %d" +vxhs_complete_aio(void *acb, uint64_t ret) "aio failed acb %p ret %ld" +vxhs_aio_rw_iofail(char *guid) "vDisk %s, I/O operation failed." +vxhs_aio_rw_devfail(char *guid, int dir, uint64_t size, uint64_t off) "vDisk %s, vDisk device failed iodir = %d size = %lu offset = %lu" +vxhs_parse_uri_filename(const char *filename) "URI passed via bdrv_parse_filename %s" +vxhs_qemu_init_vdisk(const char *vdisk_id) "vdisk_id from json %s" +vxhs_qemu_init_numservers(int num_servers) "Number of servers passed = %d" +vxhs_parse_uri_hostinfo(int num, char *host, int port) "Host %d: IP %s, Port %d" +vxhs_qemu_init(char *of_vsa_addr, int port) "Adding host %s:%d to BDRVVXHSState" +vxhs_qemu_init_filename(const char *filename) "Filename passed as %s" +vxhs_close(char *vdisk_guid) "Closing vdisk %s" +vxhs_convert_iovector_to_buffer(size_t len) "Could not allocate buffer for size %zu bytes" +vxhs_qnio_iio_writev(int res) "iio_writev returned %d" +vxhs_qnio_iio_writev_err(int iter, uint64_t iov_len, int err) "Error for iteration : %d, iov_len = %lu errno = %d" +vxhs_qnio_iio_readv(void *ctx, int ret, int error) "Error while issuing read to QNIO. ctx %p Error = %d, errno = %d" +vxhs_qnio_iio_ioctl(uint32_t opcode) "Error while executing IOCTL. Opcode = %u" diff --git a/block/vxhs.c b/block/vxhs.c new file mode 100644 index 0000000..888eca3 --- /dev/null +++ b/block/vxhs.c @@ -0,0 +1,1602 @@ +/* + * QEMU Block driver for Veritas HyperScale (VxHS) + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#include "vxhs.h" +#include +#include "qapi/qmp/qerror.h" +#include "qapi/qmp/qdict.h" +#include "qapi/qmp/qstring.h" +#include "trace.h" + +#define VXHS_OPT_FILENAME "filename" +#define VXHS_OPT_VDISK_ID "vdisk_id" +#define VXHS_OPT_SERVER "server." 
+#define VXHS_OPT_HOST "host" +#define VXHS_OPT_PORT "port" + +/* qnio client ioapi_ctx */ +static void *global_qnio_ctx; + +/* vdisk prefix to pass to qnio */ +static const char vdisk_prefix[] = "/dev/of/vdisk"; + +void vxhs_inc_acb_segment_count(void *ptr, int count) +{ + VXHSAIOCB *acb = ptr; + BDRVVXHSState *s = acb->common.bs->opaque; + + VXHS_SPIN_LOCK(s->vdisk_acb_lock); + acb->segments += count; + VXHS_SPIN_UNLOCK(s->vdisk_acb_lock); +} + +void vxhs_dec_acb_segment_count(void *ptr, int count) +{ + VXHSAIOCB *acb = ptr; + BDRVVXHSState *s = acb->common.bs->opaque; + + VXHS_SPIN_LOCK(s->vdisk_acb_lock); + acb->segments -= count; + VXHS_SPIN_UNLOCK(s->vdisk_acb_lock); +} + +void vxhs_set_acb_buffer(void *ptr, void *buffer) +{ + VXHSAIOCB *acb = ptr; + + acb->buffer = buffer; +} + +void vxhs_inc_vdisk_iocount(void *ptr, uint32_t count) +{ + BDRVVXHSState *s = ptr; + + VXHS_SPIN_LOCK(s->vdisk_lock); + s->vdisk_aio_count += count; + VXHS_SPIN_UNLOCK(s->vdisk_lock); +} + +void vxhs_dec_vdisk_iocount(void *ptr, uint32_t count) +{ + BDRVVXHSState *s = ptr; + + VXHS_SPIN_LOCK(s->vdisk_lock); + s->vdisk_aio_count -= count; + VXHS_SPIN_UNLOCK(s->vdisk_lock); +} + +void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, + uint32_t error, uint32_t opcode) +{ + VXHSAIOCB *acb = NULL; + BDRVVXHSState *s = NULL; + int rv = 0; + int segcount = 0; + + switch (opcode) { + case IRP_READ_REQUEST: + case IRP_WRITE_REQUEST: + + /* + * ctx is VXHSAIOCB* + * ctx is NULL if error is QNIOERROR_CHANNEL_HUP or reason is IIO_REASON_HUP + */ + if (ctx) { + acb = ctx; + s = acb->common.bs->opaque; + } else { + trace_vxhs_iio_callback(error, reason); + goto out; + } + + if (error) { + trace_vxhs_iio_callback_iofail(error, reason, acb, acb->segments); + + if (reason == IIO_REASON_DONE || reason == IIO_REASON_EVENT) { + /* + * Storage agent failed while I/O was in progress + * Fail over only if the qnio channel dropped, indicating + * storage agent failure. Don't fail over in response to other + * I/O errors such as disk failure. + */ + if (error == QNIOERROR_RETRY_ON_SOURCE || error == QNIOERROR_HUP || + error == QNIOERROR_CHANNEL_HUP || error == -1) { + /* + * Start vDisk IO failover once callback is + * called against all the pending IOs. + * If vDisk has no redundancy enabled + * then IO failover routine will mark + * the vDisk failed and fail all the + * AIOs without retry (stateless vDisk) + */ + VXHS_SPIN_LOCK(s->vdisk_lock); + if (!OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s); + } + /* + * Check if this acb is already queued before. + * It is possible in case if I/Os are submitted + * in multiple segments (QNIO_MAX_IO_SIZE). + */ + VXHS_SPIN_LOCK(s->vdisk_acb_lock); + if (!OF_AIOCB_FLAGS_QUEUED(acb)) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, + acb, retry_entry); + OF_AIOCB_FLAGS_SET_QUEUED(acb); + s->vdisk_aio_retry_qd++; + trace_vxhs_iio_callback_retry(s->vdisk_guid, acb); + } + segcount = --acb->segments; + VXHS_SPIN_UNLOCK(s->vdisk_acb_lock); + /* + * Decrement AIO count only when callback is called + * against all the segments of aiocb. + */ + if (segcount == 0 && --s->vdisk_aio_count == 0) { + /* + * Start vDisk I/O failover + */ + VXHS_SPIN_UNLOCK(s->vdisk_lock); + /* + * TODO: + * Need to explore further if it is possible to optimize + * the failover operation on Virtual-Machine (global) + * specific rather vDisk specific. 
+ */ + vxhs_failover_io(s); + goto out; + } + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto out; + } + } else if (reason == IIO_REASON_HUP) { + /* + * Channel failed, spontaneous notification, + * not in response to I/O + */ + trace_vxhs_iio_callback_chnlfail(error); + /* + * TODO: Start channel failover when no I/O is outstanding + */ + goto out; + } else { + trace_vxhs_iio_callback_fail(reason, acb, acb->segments, + acb->size, error); + } + } + /* + * Set error into acb if not set. In case if acb is being + * submitted in multiple segments then need to set the error + * only once. + * + * Once acb done callback is called for the last segment + * then acb->ret return status will be sent back to the + * caller. + */ + VXHS_SPIN_LOCK(s->vdisk_acb_lock); + if (error && !acb->ret) { + acb->ret = error; + } + --acb->segments; + segcount = acb->segments; + assert(segcount >= 0); + VXHS_SPIN_UNLOCK(s->vdisk_acb_lock); + /* + * Check if all the outstanding I/Os are done against acb. + * If yes then send signal for AIO completion. + */ + if (segcount == 0) { + rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb)); + if (rv != sizeof(acb)) { + error_report("VXHS AIO completion failed: %s", strerror(errno)); + abort(); + } + } + break; + + case IRP_VDISK_CHECK_IO_FAILOVER_READY: + /* ctx is BDRVVXHSState* */ + assert(ctx); + trace_vxhs_iio_callback_ready(((BDRVVXHSState *)ctx)->vdisk_guid, + error); + vxhs_check_failover_status(error, ctx); + break; + + default: + if (reason == IIO_REASON_HUP) { + /* + * Channel failed, spontaneous notification, + * not in response to I/O + */ + trace_vxhs_iio_callback_chnfail(error, errno); + /* + * TODO: Start channel failover when no I/O is outstanding + */ + } else { + trace_vxhs_iio_callback_unknwn(opcode, error); + } + break; + } +out: + return; +} + +void vxhs_complete_aio(VXHSAIOCB *acb, BDRVVXHSState *s) +{ + BlockCompletionFunc *cb = acb->common.cb; + void *opaque = acb->common.opaque; + int ret = 0; + + if (acb->ret != 0) { + trace_vxhs_complete_aio(acb, acb->ret); + /* + * We mask all the IO errors generically as EIO for upper layers + * Right now our IO Manager uses non standard error codes. Instead + * of confusing upper layers with incorrect interpretation we are + * doing this workaround. + */ + ret = (-EIO); + } + /* + * Copy back contents from stablization buffer into original iovector + * before returning the IO + */ + if (acb->buffer != NULL) { + qemu_iovec_from_buf(acb->qiov, 0, acb->buffer, acb->qiov->size); + qemu_vfree(acb->buffer); + acb->buffer = NULL; + } + vxhs_dec_vdisk_iocount(s, 1); + acb->aio_done = VXHS_IO_COMPLETED; + qemu_aio_unref(acb); + cb(opaque, ret); +} + +/* + * This is the HyperScale event handler registered to QEMU. + * It is invoked when any IO gets completed and written on pipe + * by callback called from QNIO thread context. Then it marks + * the AIO as completed, and releases HyperScale AIO callbacks. + */ +void vxhs_aio_event_reader(void *opaque) +{ + BDRVVXHSState *s = opaque; + ssize_t ret; + + do { + char *p = (char *)&s->qnio_event_acb; + + ret = read(s->fds[VDISK_FD_READ], p + s->event_reader_pos, + sizeof(s->qnio_event_acb) - s->event_reader_pos); + if (ret > 0) { + s->event_reader_pos += ret; + if (s->event_reader_pos == sizeof(s->qnio_event_acb)) { + s->event_reader_pos = 0; + vxhs_complete_aio(s->qnio_event_acb, s); + } + } + } while (ret < 0 && errno == EINTR); +} + +/* + * Call QNIO operation to create channels to do IO on vDisk. 
+ */ + +void *vxhs_setup_qnio(void) +{ + void *qnio_ctx = NULL; + + qnio_ctx = iio_init(vxhs_iio_callback); + + if (qnio_ctx != NULL) { + trace_vxhs_setup_qnio(qnio_ctx); + } else { + trace_vxhs_setup_qnio_nwerror('.'); + } + + return qnio_ctx; +} + +size_t vxhs_calculate_iovec_size(struct iovec *iov, int niov) +{ + int i; + size_t size = 0; + + if (!iov || niov == 0) { + return size; + } + for (i = 0; i < niov; i++) { + size += iov[i].iov_len; + } + return size; +} + +/* + * This helper function converts an array of iovectors into a flat buffer. + */ +void *vxhs_convert_iovector_to_buffer(struct iovec *iov, int niov, + size_t sector) +{ + void *buf = NULL; + size_t size = 0; + + if (!iov || niov == 0) { + return buf; + } + + size = vxhs_calculate_iovec_size(iov, niov); + buf = qemu_memalign(sector, size); + if (!buf) { + trace_vxhs_convert_iovector_to_buffer(size); + errno = -ENOMEM; + return NULL; + } + return buf; +} + +/* + * This helper function iterates over the iovector and checks + * if the length of every element is an integral multiple + * of the sector size. + * Return Value: + * On Success : return VXHS_VECTOR_ALIGNED + * On Failure : return VXHS_VECTOR_NOT_ALIGNED. + */ +int vxhs_is_iovector_read_aligned(struct iovec *iov, int niov, size_t sector) +{ + int i; + + if (!iov || niov == 0) { + return VXHS_VECTOR_ALIGNED; + } + for (i = 0; i < niov; i++) { + if (iov[i].iov_len % sector != 0) { + return VXHS_VECTOR_NOT_ALIGNED; + } + } + return VXHS_VECTOR_ALIGNED; +} + +int32_t +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, struct iovec *iov, + int iovcnt, uint64_t offset, void *ctx, uint32_t flags) +{ + struct iovec cur; + uint64_t cur_offset = 0; + uint64_t cur_write_len = 0; + int segcount = 0; + int ret = 0; + int i, nsio = 0; + + errno = 0; + cur.iov_base = 0; + cur.iov_len = 0; + + ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags); + + if (ret == -1 && errno == EFBIG) { + trace_vxhs_qnio_iio_writev(ret); + /* + * IO size is larger than IIO_IO_BUF_SIZE hence need to + * split the I/O at IIO_IO_BUF_SIZE boundary + * There are two cases here: + * 1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE + * 2. iovcnt is greater than 1 and IO size is greater than + * IIO_IO_BUF_SIZE. + * + * Need to adjust the segment count, for that we need to compute + * the segment count and increase the segment count in one shot + * instead of setting iteratively in for loop. It is required to + * prevent any race between the splitted IO submission and IO + * completion. + */ + cur_offset = offset; + for (i = 0; i < iovcnt; i++) { + if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) { + cur_offset += iov[i].iov_len; + nsio++; + } else if (iov[i].iov_len > 0) { + cur.iov_base = iov[i].iov_base; + cur.iov_len = IIO_IO_BUF_SIZE; + cur_write_len = 0; + while (1) { + nsio++; + cur_write_len += cur.iov_len; + if (cur_write_len == iov[i].iov_len) { + break; + } + cur_offset += cur.iov_len; + cur.iov_base += cur.iov_len; + if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) { + cur.iov_len = IIO_IO_BUF_SIZE; + } else { + cur.iov_len = (iov[i].iov_len - cur_write_len); + } + } + } + } + + segcount = nsio - 1; + vxhs_inc_acb_segment_count(ctx, segcount); + /* + * Split the IO and submit it to QNIO. + * Reset the cur_offset before splitting the IO. 
+ */ + cur_offset = offset; + nsio = 0; + for (i = 0; i < iovcnt; i++) { + if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) { + errno = 0; + ret = iio_writev(qnio_ctx, rfd, &iov[i], 1, cur_offset, ctx, + flags); + if (ret == -1) { + trace_vxhs_qnio_iio_writev_err(i, iov[i].iov_len, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within QEMU block driver. + */ + if (segcount > 0 && (segcount - nsio) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - nsio); + } + return ret; + } else { + cur_offset += iov[i].iov_len; + } + nsio++; + } else if (iov[i].iov_len > 0) { + /* + * This case is where one element of the io vector is > 4MB. + */ + cur.iov_base = iov[i].iov_base; + cur.iov_len = IIO_IO_BUF_SIZE; + cur_write_len = 0; + while (1) { + nsio++; + errno = 0; + ret = iio_writev(qnio_ctx, rfd, &cur, 1, cur_offset, ctx, + flags); + if (ret == -1) { + trace_vxhs_qnio_iio_writev_err(i, cur.iov_len, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within the + * QEMU block driver. + */ + if (segcount > 0 && (segcount - nsio) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - nsio); + } + return ret; + } else { + cur_write_len += cur.iov_len; + if (cur_write_len == iov[i].iov_len) { + break; + } + cur_offset += cur.iov_len; + cur.iov_base += cur.iov_len; + if ((iov[i].iov_len - cur_write_len) > + IIO_IO_BUF_SIZE) { + cur.iov_len = IIO_IO_BUF_SIZE; + } else { + cur.iov_len = (iov[i].iov_len - cur_write_len); + } + } + } + } + } + } + return ret; +} + +/* + * Iterate over the i/o vector and send read request + * to QNIO one by one. + */ +int32_t +vxhs_qnio_iio_readv(void *qnio_ctx, uint32_t rfd, struct iovec *iov, int iovcnt, + uint64_t offset, void *ctx, uint32_t flags) +{ + uint64_t read_offset = offset; + void *buffer = NULL; + size_t size; + int aligned, segcount; + int i, ret = 0; + + aligned = vxhs_is_iovector_read_aligned(iov, iovcnt, BDRV_SECTOR_SIZE); + size = vxhs_calculate_iovec_size(iov, iovcnt); + + if (aligned == VXHS_VECTOR_NOT_ALIGNED) { + buffer = vxhs_convert_iovector_to_buffer(iov, iovcnt, BDRV_SECTOR_SIZE); + if (buffer == NULL) { + return -ENOMEM; + } + + errno = 0; + ret = iio_read(qnio_ctx, rfd, buffer, size, read_offset, ctx, flags); + if (ret != 0) { + trace_vxhs_qnio_iio_readv(ctx, ret, errno); + qemu_vfree(buffer); + return ret; + } + vxhs_set_acb_buffer(ctx, buffer); + return ret; + } + + /* + * Since read IO request is going to split based on + * number of IOvectors hence increment the segment + * count depending on the number of IOVectors before + * submitting the read request to QNIO. + * This is needed to protect the QEMU block driver + * IO completion while read request for the same IO + * is being submitted to QNIO. + */ + segcount = iovcnt - 1; + if (segcount > 0) { + vxhs_inc_acb_segment_count(ctx, segcount); + } + + for (i = 0; i < iovcnt; i++) { + errno = 0; + ret = iio_read(qnio_ctx, rfd, iov[i].iov_base, iov[i].iov_len, + read_offset, ctx, flags); + if (ret != 0) { + trace_vxhs_qnio_iio_readv(ctx, ret, errno); + /* + * Need to adjust the AIOCB segment count to prevent + * blocking of AIOCB completion within QEMU block driver. 
+ */ + if (segcount > 0 && (segcount - i) > 0) { + vxhs_dec_acb_segment_count(ctx, segcount - i); + } + return ret; + } + read_offset += iov[i].iov_len; + } + + return ret; +} + +int32_t +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in, + void *ctx, uint32_t flags) +{ + int ret = 0; + + switch (opcode) { + case VDISK_STAT: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT, + in, ctx, flags); + break; + + case VDISK_AIO_FLUSH: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH, + in, ctx, flags); + break; + + case VDISK_CHECK_IO_FAILOVER_READY: + ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY, + in, ctx, flags); + break; + + default: + ret = -ENOTSUP; + break; + } + + if (ret) { + *in = 0; + trace_vxhs_qnio_iio_ioctl(opcode); + } + + return ret; +} + +static QemuOptsList runtime_opts = { + .name = "vxhs", + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), + .desc = { + { + .name = VXHS_OPT_FILENAME, + .type = QEMU_OPT_STRING, + .help = "URI to the Veritas HyperScale image", + }, + { + .name = VXHS_OPT_VDISK_ID, + .type = QEMU_OPT_STRING, + .help = "UUID of the VxHS vdisk", + }, + { /* end of list */ } + }, +}; + +static QemuOptsList runtime_tcp_opts = { + .name = "vxhs_tcp", + .head = QTAILQ_HEAD_INITIALIZER(runtime_tcp_opts.head), + .desc = { + { + .name = VXHS_OPT_HOST, + .type = QEMU_OPT_STRING, + .help = "host address (ipv4 addresses)", + }, + { + .name = VXHS_OPT_PORT, + .type = QEMU_OPT_NUMBER, + .help = "port number on which VxHSD is listening (default 9999)", + .def_value_str = "9999" + }, + { + .name = "to", + .type = QEMU_OPT_NUMBER, + .help = "max port number, not supported by VxHS", + }, + { + .name = "ipv4", + .type = QEMU_OPT_BOOL, + .help = "ipv4 bool value, not supported by VxHS", + }, + { + .name = "ipv6", + .type = QEMU_OPT_BOOL, + .help = "ipv6 bool value, not supported by VxHS", + }, + { /* end of list */ } + }, +}; + +/* + * Parse the incoming URI and populate *options with all the host(s) + * information. Host at index 0 is local storage agent. + * Remaining are the reflection target storage agents. The local storage agent + * ip is the efficient internal address in the uri, e.g. 192.168.0.2. + * The local storage agent address is stored at index 0. The reflection target + * ips, are the E-W data network addresses of the reflection node agents, also + * extracted from the uri. 
+ */ +static int vxhs_parse_uri(const char *filename, QDict *options) +{ + gchar **target_list; + URI *uri = NULL; + char *hoststr, *portstr; + char *vdisk_id = NULL; + char *port; + int ret = 0; + int i = 0; + + trace_vxhs_parse_uri_filename(filename); + target_list = g_strsplit(filename, "%7D", 0); + assert(target_list != NULL && target_list[0] != NULL); + + for (i = 0; target_list[i] != NULL && *target_list[i]; i++) { + uri = uri_parse(target_list[i]); + if (!uri || !uri->server) { + uri_free(uri); + ret = -EINVAL; + break; + } + + hoststr = g_strdup_printf(VXHS_OPT_SERVER"%d.host", i); + qdict_put(options, hoststr, qstring_from_str(uri->server)); + + portstr = g_strdup_printf(VXHS_OPT_SERVER"%d.port", i); + if (uri->port) { + port = g_strdup_printf("%d", uri->port); + qdict_put(options, portstr, qstring_from_str(port)); + g_free(port); + } + + if (i == 0 && (strstr(uri->path, "vxhs") == NULL)) { + vdisk_id = g_strdup_printf("%s%c", uri->path, '}'); + qdict_put(options, "vdisk_id", qstring_from_str(vdisk_id)); + } + + trace_vxhs_parse_uri_hostinfo(i + 1, uri->server, uri->port); + g_free(hoststr); + g_free(portstr); + g_free(vdisk_id); + uri_free(uri); + } + + g_strfreev(target_list); + return ret; +} + +static void vxhs_parse_filename(const char *filename, QDict *options, + Error **errp) +{ + if (qdict_haskey(options, "host") + || qdict_haskey(options, "port") + || qdict_haskey(options, "path")) + { + error_setg(errp, "host/port/path and a file name may not be specified " + "at the same time"); + return; + } + + if (strstr(filename, "://")) { + int ret = vxhs_parse_uri(filename, options); + if (ret < 0) { + error_setg(errp, "Invalid URI. URI should be of the form " + " vxhs://:/{}"); + } + } +} + +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s, + int *cfd, int *rfd, Error **errp) +{ + QDict *backing_options = NULL; + QemuOpts *opts, *tcp_opts; + const char *vxhs_filename; + char *of_vsa_addr = NULL; + Error *local_err = NULL; + const char *vdisk_id_opt; + char *file_name = NULL; + size_t num_servers = 0; + char *str = NULL; + int ret = 0; + int i; + + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); + qemu_opts_absorb_qdict(opts, options, &local_err); + if (local_err) { + ret = -EINVAL; + goto out; + } + + vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME); + if (vxhs_filename) { + trace_vxhs_qemu_init_filename(vxhs_filename); + } + + vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID); + if (!vdisk_id_opt) { + error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID); + ret = -EINVAL; + goto out; + } + s->vdisk_guid = g_strdup(vdisk_id_opt); + trace_vxhs_qemu_init_vdisk(vdisk_id_opt); + + num_servers = qdict_array_entries(options, VXHS_OPT_SERVER); + if (num_servers < 1) { + error_setg(&local_err, QERR_MISSING_PARAMETER, "server"); + ret = -EINVAL; + goto out; + } else if (num_servers > 4) { + error_setg(&local_err, QERR_INVALID_PARAMETER, "server"); + error_append_hint(errp, "Maximum 4 servers allowed.\n"); + ret = -EINVAL; + goto out; + } + trace_vxhs_qemu_init_numservers(num_servers); + + for (i = 0; i < num_servers; i++) { + str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i); + qdict_extract_subqdict(options, &backing_options, str); + + /* Create opts info from runtime_tcp_opts list */ + tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort); + qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err); + if (local_err) { + qdict_del(backing_options, str); + qemu_opts_del(tcp_opts); + g_free(str); + ret = -EINVAL; + goto out; + } 
+ + s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts, + VXHS_OPT_HOST)); + s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts, + VXHS_OPT_PORT), + NULL, 0); + + s->vdisk_hostinfo[i].qnio_cfd = -1; + s->vdisk_hostinfo[i].vdisk_rfd = -1; + trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip, + s->vdisk_hostinfo[i].port); + + qdict_del(backing_options, str); + qemu_opts_del(tcp_opts); + g_free(str); + } + + s->vdisk_nhosts = i; + s->vdisk_cur_host_idx = 0; + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); + of_vsa_addr = g_strdup_printf("of://%s:%d", + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].port); + + /* + * .bdrv_open() and .bdrv_create() run under the QEMU global mutex. + */ + if (global_qnio_ctx == NULL) { + global_qnio_ctx = vxhs_setup_qnio(); + if (global_qnio_ctx == NULL) { + error_setg(&local_err, "Failed vxhs_setup_qnio"); + ret = -EINVAL; + goto out; + } + } + + *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0); + if (*cfd < 0) { + error_setg(&local_err, "Failed iio_open"); + ret = -EIO; + goto out; + } + *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0); + if (*rfd < 0) { + iio_close(global_qnio_ctx, *cfd); + *cfd = -1; + error_setg(&local_err, "Failed iio_devopen"); + ret = -EIO; + goto out; + } + +out: + g_free(file_name); + g_free(of_vsa_addr); + qemu_opts_del(opts); + + if (ret < 0) { + for (i = 0; i < num_servers; i++) { + g_free(s->vdisk_hostinfo[i].hostip); + } + g_free(s->vdisk_guid); + s->vdisk_guid = NULL; + errno = -ret; + } + error_propagate(errp, local_err); + return ret; +} + +int vxhs_open(BlockDriverState *bs, QDict *options, + int bdrv_flags, Error **errp) +{ + BDRVVXHSState *s = bs->opaque; + AioContext *aio_context; + int qemu_qnio_cfd = -1; + int device_opened = 0; + int qemu_rfd = -1; + int ret = 0; + int i; + + ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp); + if (ret < 0) { + trace_vxhs_open_fail(ret); + return ret; + } else { + device_opened = 1; + } + + s->qnio_ctx = global_qnio_ctx; + s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd; + s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd; + s->vdisk_size = 0; + QSIMPLEQ_INIT(&s->vdisk_aio_retryq); + + /* + * Create a pipe for communicating between two threads in different + * context. Set handler for read event, which gets triggered when + * IO completion is done by non-QEMU context. + */ + ret = qemu_pipe(s->fds); + if (ret < 0) { + trace_vxhs_open_epipe('.'); + ret = -errno; + goto errout; + } + fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK); + + aio_context = bdrv_get_aio_context(bs); + aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ], + false, vxhs_aio_event_reader, NULL, s); + + /* + * Allocate/Initialize the spin-locks. + * + * NOTE: + * Since spin lock is being allocated + * dynamically hence moving acb struct + * specific lock to BDRVVXHSState + * struct. The reason being, + * we don't want the overhead of spin + * lock being dynamically allocated and + * freed for every AIO. 
+ */ + s->vdisk_lock = VXHS_SPIN_LOCK_ALLOC; + s->vdisk_acb_lock = VXHS_SPIN_LOCK_ALLOC; + + return 0; + +errout: + /* + * Close remote vDisk device if it was opened before + */ + if (device_opened) { + for (i = 0; i < s->vdisk_nhosts; i++) { + if (s->vdisk_hostinfo[i].vdisk_rfd >= 0) { + iio_devclose(s->qnio_ctx, 0, + s->vdisk_hostinfo[i].vdisk_rfd); + s->vdisk_hostinfo[i].vdisk_rfd = -1; + } + /* + * close QNIO channel against cached channel open-fd + */ + if (s->vdisk_hostinfo[i].qnio_cfd >= 0) { + iio_close(s->qnio_ctx, + s->vdisk_hostinfo[i].qnio_cfd); + s->vdisk_hostinfo[i].qnio_cfd = -1; + } + } + } + trace_vxhs_open_fail(ret); + return ret; +} + +static const AIOCBInfo vxhs_aiocb_info = { + .aiocb_size = sizeof(VXHSAIOCB) +}; + +/* + * This allocates QEMU-VXHS callback for each IO + * and is passed to QNIO. When QNIO completes the work, + * it will be passed back through the callback. + */ +BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, + void *opaque, int iodir) +{ + VXHSAIOCB *acb = NULL; + BDRVVXHSState *s = bs->opaque; + size_t size; + uint64_t offset; + int iio_flags = 0; + int ret = 0; + + offset = sector_num * BDRV_SECTOR_SIZE; + size = nb_sectors * BDRV_SECTOR_SIZE; + + acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque); + /* + * Setup or initialize VXHSAIOCB. + * Every single field should be initialized since + * acb will be picked up from the slab without + * initializing with zero. + */ + acb->io_offset = offset; + acb->size = size; + acb->ret = 0; + acb->flags = 0; + acb->aio_done = VXHS_IO_INPROGRESS; + acb->segments = 0; + acb->buffer = 0; + acb->qiov = qiov; + acb->direction = iodir; + + VXHS_SPIN_LOCK(s->vdisk_lock); + if (OF_VDISK_FAILED(s)) { + trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto errout; + } + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1); + goto out; + } + s->vdisk_aio_count++; + VXHS_SPIN_UNLOCK(s->vdisk_lock); + + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + + switch (iodir) { + case VDISK_AIO_WRITE: + vxhs_inc_acb_segment_count(acb, 1); + ret = vxhs_qnio_iio_writev(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + qiov->iov, qiov->niov, offset, (void *)acb, iio_flags); + break; + case VDISK_AIO_READ: + vxhs_inc_acb_segment_count(acb, 1); + ret = vxhs_qnio_iio_readv(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + qiov->iov, qiov->niov, offset, (void *)acb, iio_flags); + break; + default: + trace_vxhs_aio_rw_invalid(iodir); + goto errout; + } + + if (ret != 0) { + trace_vxhs_aio_rw_ioerr( + s->vdisk_guid, iodir, size, offset, + acb, acb->segments, ret, errno); + /* + * Don't retry I/Os against vDisk having no + * redundancy or stateful storage on compute + * + * TODO: Revisit this code path to see if any + * particular error needs to be handled. + * At this moment failing the I/O. 
+ */ + VXHS_SPIN_LOCK(s->vdisk_lock); + if (s->vdisk_nhosts == 1) { + trace_vxhs_aio_rw_iofail(s->vdisk_guid); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto errout; + } + if (OF_VDISK_FAILED(s)) { + trace_vxhs_aio_rw_devfail( + s->vdisk_guid, iodir, size, offset); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto errout; + } + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + /* + * Queue all incoming io requests after failover starts. + * Number of requests that can arrive is limited by io queue depth + * so an app blasting independent ios will not exhaust memory. + */ + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + s->vdisk_aio_count--; + vxhs_dec_acb_segment_count(acb, 1); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 2); + goto out; + } + OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s); + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry); + s->vdisk_aio_retry_qd++; + OF_AIOCB_FLAGS_SET_QUEUED(acb); + vxhs_dec_acb_segment_count(acb, 1); + trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 3); + /* + * Start I/O failover if there is no active + * AIO within vxhs block driver. + */ + if (--s->vdisk_aio_count == 0) { + VXHS_SPIN_UNLOCK(s->vdisk_lock); + /* + * Start IO failover + */ + vxhs_failover_io(s); + goto out; + } + VXHS_SPIN_UNLOCK(s->vdisk_lock); + } + +out: + return &acb->common; + +errout: + qemu_aio_unref(acb); + return NULL; +} + +BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, void *opaque) +{ + return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, + cb, opaque, VDISK_AIO_READ); +} + +BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs, + int64_t sector_num, QEMUIOVector *qiov, + int nb_sectors, + BlockCompletionFunc *cb, void *opaque) +{ + return vxhs_aio_rw(bs, sector_num, qiov, nb_sectors, + cb, opaque, VDISK_AIO_WRITE); +} + +/* + * This is called by QEMU when a flush gets triggered from within + * a guest at the block layer, either for IDE or SCSI disks. + */ +int vxhs_co_flush(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t size = 0; + int ret = 0; + + /* + * VDISK_AIO_FLUSH ioctl is a no-op at present and will + * always return success. This could change in the future. + */ + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC); + + if (ret < 0) { + trace_vxhs_co_flush(s->vdisk_guid, ret, errno); + vxhs_close(bs); + } + + return ret; +} + +unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s) +{ + void *ctx = NULL; + int flags = 0; + int64_t vdisk_size = 0; + int ret = 0; + + ret = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + VDISK_STAT, &vdisk_size, ctx, flags); + + if (ret < 0) { + trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno); + return 0; + } + + trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size); + return vdisk_size; +} + +/* + * Returns the size of vDisk in bytes. This is required + * by QEMU block upper block layer so that it is visible + * to guest. 
+ */ +int64_t vxhs_getlength(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t vdisk_size = 0; + + if (s->vdisk_size > 0) { + vdisk_size = s->vdisk_size; + } else { + /* + * Fetch the vDisk size using stat ioctl + */ + vdisk_size = vxhs_get_vdisk_stat(s); + if (vdisk_size > 0) { + s->vdisk_size = vdisk_size; + } + } + + if (vdisk_size > 0) { + return vdisk_size; /* return size in bytes */ + } else { + return -EIO; + } +} + +/* + * Returns actual blocks allocated for the vDisk. + * This is required by qemu-img utility. + */ +int64_t vxhs_get_allocated_blocks(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int64_t vdisk_size = 0; + + if (s->vdisk_size > 0) { + vdisk_size = s->vdisk_size; + } else { + /* + * TODO: + * Once HyperScale storage-virtualizer provides + * actual physical allocation of blocks then + * fetch that information and return back to the + * caller but for now just get the full size. + */ + vdisk_size = vxhs_get_vdisk_stat(s); + if (vdisk_size > 0) { + s->vdisk_size = vdisk_size; + } + } + + if (vdisk_size > 0) { + return vdisk_size; /* return size in bytes */ + } else { + return -EIO; + } +} + +void vxhs_close(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + int i; + + trace_vxhs_close(s->vdisk_guid); + close(s->fds[VDISK_FD_READ]); + close(s->fds[VDISK_FD_WRITE]); + + /* + * Clearing all the event handlers for oflame registered to QEMU + */ + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], + false, NULL, NULL, NULL); + + if (s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd >= 0) { + iio_devclose(s->qnio_ctx, 0, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd); + } + if (s->vdisk_lock) { + VXHS_SPIN_LOCK_DESTROY(s->vdisk_lock); + s->vdisk_lock = NULL; + } + if (s->vdisk_acb_lock) { + VXHS_SPIN_LOCK_DESTROY(s->vdisk_acb_lock); + s->vdisk_acb_lock = NULL; + } + + g_free(s->vdisk_guid); + s->vdisk_guid = NULL; + + for (i = 0; i < VXHS_MAX_HOSTS; i++) { + /* + * Close vDisk device + */ + if (s->vdisk_hostinfo[i].vdisk_rfd >= 0) { + iio_devclose(s->qnio_ctx, 0, + s->vdisk_hostinfo[i].vdisk_rfd); + s->vdisk_hostinfo[i].vdisk_rfd = -1; + } + + /* + * Close Iridium channel against cached channel-fd + */ + if (s->vdisk_hostinfo[i].qnio_cfd >= 0) { + iio_close(s->qnio_ctx, + s->vdisk_hostinfo[i].qnio_cfd); + s->vdisk_hostinfo[i].qnio_cfd = -1; + } + + /* + * Free hostip string which is allocated dynamically + */ + g_free(s->vdisk_hostinfo[i].hostip); + s->vdisk_hostinfo[i].hostip = NULL; + s->vdisk_hostinfo[i].port = 0; + } +} + +/* + * If errors are consistent with storage agent failure: + * - Try to reconnect in case error is transient or storage agent restarted. + * - Currently failover is being triggered on per vDisk basis. There is + * a scope of further optimization where failover can be global (per VM). + * - In case of network (storage agent) failure, for all the vDisks, having + * no redundancy, I/Os will be failed without attempting for I/O failover + * because of stateless nature of vDisk. + * - If local or source storage agent is down then send an ioctl to remote + * storage agent to check if remote storage agent in a state to accept + * application I/Os. + * - Once remote storage agent is ready to accept I/O, start I/O shipping. + * - If I/Os cannot be serviced then vDisk will be marked failed so that + * new incoming I/Os are returned with failure immediately. 
+ * - If vDisk I/O failover is in progress then all new/inflight I/Os will + * queued and will be restarted or failed based on failover operation + * is successful or not. + * - I/O failover can be started either in I/O forward or I/O backward + * path. + * - I/O failover will be started as soon as all the pending acb(s) + * are queued and there is no pending I/O count. + * - If I/O failover couldn't be completed within QNIO_CONNECT_TIMOUT_SECS + * then vDisk will be marked failed and all I/Os will be completed with + * error. + */ + +int vxhs_switch_storage_agent(BDRVVXHSState *s) +{ + int res = 0; + int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE); + + trace_vxhs_switch_storage_agent( + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, + s->vdisk_guid); + + res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx); + if (res == 0) { + res = vxhs_qnio_iio_ioctl(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd, + VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags); + } else { + trace_vxhs_switch_storage_agent_failed( + s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip, + s->vdisk_guid, res, errno); + /* + * Try the next host. + * Calling vxhs_check_failover_status from here ties up the qnio + * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1) + * for all the hosts in the IO target list. + */ + + vxhs_check_failover_status(res, s); + } + return res; +} + +void vxhs_check_failover_status(int res, void *ctx) +{ + BDRVVXHSState *s = ctx; + + if (res == 0) { + /* found failover target */ + s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx; + s->vdisk_ask_failover_idx = 0; + trace_vxhs_check_failover_status( + s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip, + s->vdisk_guid); + VXHS_SPIN_LOCK(s->vdisk_lock); + OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + vxhs_handle_queued_ios(s); + } else { + /* keep looking */ + trace_vxhs_check_failover_status_retry(s->vdisk_guid); + s->vdisk_ask_failover_idx++; + if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) { + /* pause and cycle through list again */ + sleep(QNIO_CONNECT_RETRY_SECS); + s->vdisk_ask_failover_idx = 0; + } + res = vxhs_switch_storage_agent(s); + } +} + +int vxhs_failover_io(BDRVVXHSState *s) +{ + int res = 0; + + trace_vxhs_failover_io(s->vdisk_guid); + + s->vdisk_ask_failover_idx = 0; + res = vxhs_switch_storage_agent(s); + + return res; +} + +/* + * Try to reopen the vDisk on one of the available hosts + * If vDisk reopen is successful on any of the host then + * check if that node is ready to accept I/O. + */ +int vxhs_reopen_vdisk(BDRVVXHSState *s, int index) +{ + char *of_vsa_addr = NULL; + char *file_name = NULL; + int res = 0; + + + /* + * Close stale vdisk device remote fd since + * it could be invalid fd after channel disconnect. + * Reopen the vdisk to get the new fd. + */ + if (s->vdisk_hostinfo[index].vdisk_rfd >= 0) { + iio_devclose(s->qnio_ctx, 0, + s->vdisk_hostinfo[index].vdisk_rfd); + s->vdisk_hostinfo[index].vdisk_rfd = -1; + } + + /* + * As part of vDisk reopen, close the QNIO channel + * against cached channel-fd (fd is being cached into + * vDisk hostinfo). 
+ */ + if (s->vdisk_hostinfo[index].qnio_cfd >= 0) { + iio_close(s->qnio_ctx, + s->vdisk_hostinfo[index].qnio_cfd); + s->vdisk_hostinfo[index].qnio_cfd = -1; + } + + /* + * Build storage agent address and vdisk device name strings + */ + file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid); + of_vsa_addr = g_strdup_printf("of://%s:%d", + s->vdisk_hostinfo[index].hostip, s->vdisk_hostinfo[index].port); + /* + * Open qnio channel to storage agent if not opened before. + */ + if (s->vdisk_hostinfo[index].qnio_cfd < 0) { + s->vdisk_hostinfo[index].qnio_cfd = + iio_open(global_qnio_ctx, of_vsa_addr, 0); + if (s->vdisk_hostinfo[index].qnio_cfd < 0) { + trace_vxhs_reopen_vdisk(s->vdisk_hostinfo[index].hostip); + res = ENODEV; + goto out; + } + } + + /* + * Open vdisk device + */ + s->vdisk_hostinfo[index].vdisk_rfd = + iio_devopen(global_qnio_ctx, + s->vdisk_hostinfo[index].qnio_cfd, file_name, 0); + + if (s->vdisk_hostinfo[index].vdisk_rfd < 0) { + /* + * Close QNIO channel against cached channel-fd + */ + if (s->vdisk_hostinfo[index].qnio_cfd >= 0) { + iio_close(s->qnio_ctx, + s->vdisk_hostinfo[index].qnio_cfd); + s->vdisk_hostinfo[index].qnio_cfd = -1; + } + + trace_vxhs_reopen_vdisk_openfail(file_name); + res = EIO; + goto out; + } + +out: + g_free(of_vsa_addr); + g_free(file_name); + return res; +} + +int vxhs_handle_queued_ios(BDRVVXHSState *s) +{ + VXHSAIOCB *acb = NULL; + int res = 0; + + VXHS_SPIN_LOCK(s->vdisk_lock); + while ((acb = QSIMPLEQ_FIRST(&s->vdisk_aio_retryq)) != NULL) { + /* + * Before we process the acb, check whether I/O failover + * started again due to failback or cascading failure. + */ + if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) { + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto out; + } + QSIMPLEQ_REMOVE_HEAD(&s->vdisk_aio_retryq, retry_entry); + s->vdisk_aio_retry_qd--; + OF_AIOCB_FLAGS_RESET_QUEUED(acb); + if (OF_VDISK_FAILED(s)) { + VXHS_SPIN_UNLOCK(s->vdisk_lock); + vxhs_fail_aio(acb, EIO); + VXHS_SPIN_LOCK(s->vdisk_lock); + } else { + VXHS_SPIN_UNLOCK(s->vdisk_lock); + res = vxhs_restart_aio(acb); + trace_vxhs_handle_queued_ios(acb, res); + VXHS_SPIN_LOCK(s->vdisk_lock); + if (res) { + QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, + acb, retry_entry); + OF_AIOCB_FLAGS_SET_QUEUED(acb); + VXHS_SPIN_UNLOCK(s->vdisk_lock); + goto out; + } + } + } + VXHS_SPIN_UNLOCK(s->vdisk_lock); +out: + return res; +} + +int vxhs_restart_aio(VXHSAIOCB *acb) +{ + BDRVVXHSState *s = NULL; + int iio_flags = 0; + int res = 0; + + s = acb->common.bs->opaque; + + if (acb->direction == VDISK_AIO_WRITE) { + vxhs_inc_vdisk_iocount(s, 1); + vxhs_inc_acb_segment_count(acb, 1); + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + res = vxhs_qnio_iio_writev(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + acb->qiov->iov, acb->qiov->niov, + acb->io_offset, (void *)acb, iio_flags); + } + + if (acb->direction == VDISK_AIO_READ) { + vxhs_inc_vdisk_iocount(s, 1); + vxhs_inc_acb_segment_count(acb, 1); + iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC); + res = vxhs_qnio_iio_readv(s->qnio_ctx, + s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd, + acb->qiov->iov, acb->qiov->niov, + acb->io_offset, (void *)acb, iio_flags); + } + + if (res != 0) { + vxhs_dec_vdisk_iocount(s, 1); + vxhs_dec_acb_segment_count(acb, 1); + trace_vxhs_restart_aio(acb->direction, res, errno); + } + + return res; +} + +void vxhs_fail_aio(VXHSAIOCB *acb, int err) +{ + BDRVVXHSState *s = NULL; + int segcount = 0; + int rv = 0; + + s = acb->common.bs->opaque; + + trace_vxhs_fail_aio(s->vdisk_guid, acb); + if 
(!acb->ret) { + acb->ret = err; + } + VXHS_SPIN_LOCK(s->vdisk_acb_lock); + segcount = acb->segments; + VXHS_SPIN_UNLOCK(s->vdisk_acb_lock); + if (segcount == 0) { + /* + * Complete the io request + */ + rv = qemu_write_full(s->fds[VDISK_FD_WRITE], &acb, sizeof(acb)); + if (rv != sizeof(acb)) { + error_report("VXHS AIO completion failed: %s", + strerror(errno)); + abort(); + } + } +} + +static void vxhs_detach_aio_context(BlockDriverState *bs) +{ + BDRVVXHSState *s = bs->opaque; + + aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ], + false, NULL, NULL, NULL); + +} + +static void vxhs_attach_aio_context(BlockDriverState *bs, + AioContext *new_context) +{ + BDRVVXHSState *s = bs->opaque; + + aio_set_fd_handler(new_context, s->fds[VDISK_FD_READ], + false, vxhs_aio_event_reader, NULL, s); +} + +static BlockDriver bdrv_vxhs = { + .format_name = "vxhs", + .protocol_name = "vxhs", + .instance_size = sizeof(BDRVVXHSState), + .bdrv_file_open = vxhs_open, + .bdrv_parse_filename = vxhs_parse_filename, + .bdrv_close = vxhs_close, + .bdrv_getlength = vxhs_getlength, + .bdrv_get_allocated_file_size = vxhs_get_allocated_blocks, + .bdrv_aio_readv = vxhs_aio_readv, + .bdrv_aio_writev = vxhs_aio_writev, + .bdrv_co_flush_to_disk = vxhs_co_flush, + .bdrv_detach_aio_context = vxhs_detach_aio_context, + .bdrv_attach_aio_context = vxhs_attach_aio_context, +}; + +void bdrv_vxhs_init(void) +{ + trace_vxhs_bdrv_init('.'); + bdrv_register(&bdrv_vxhs); +} + +block_init(bdrv_vxhs_init); diff --git a/block/vxhs.h b/block/vxhs.h new file mode 100644 index 0000000..d3d94bf --- /dev/null +++ b/block/vxhs.h @@ -0,0 +1,221 @@ +/* + * QEMU Block driver for Veritas HyperScale (VxHS) + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ * + */ + +#ifndef VXHSD_H +#define VXHSD_H + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qemu/error-report.h" +#include "block/block_int.h" +#include "qemu/uri.h" +#include "qemu/queue.h" + +#define QNIO_CONNECT_RETRY_SECS 5 +#define QNIO_CONNECT_TIMOUT_SECS 120 + +/* + * IO specific flags + */ +#define IIO_FLAG_ASYNC 0x00000001 +#define IIO_FLAG_DONE 0x00000010 +#define IIO_FLAG_SYNC 0 + +#define VDISK_FD_READ 0 +#define VDISK_FD_WRITE 1 +#define VXHS_MAX_HOSTS 4 + +/* Lock specific macros */ +#define VXHS_SPIN_LOCK_ALLOC \ + (g_malloc(sizeof(QemuSpin))) +#define VXHS_SPIN_LOCK(lock) \ + (qemu_spin_lock(lock)) +#define VXHS_SPIN_UNLOCK(lock) \ + (qemu_spin_unlock(lock)) +#define VXHS_SPIN_LOCK_DESTROY(lock) \ + (g_free(lock)) + +typedef enum { + VXHS_IO_INPROGRESS, + VXHS_IO_COMPLETED, + VXHS_IO_ERROR +} VXHSIOState; + +typedef enum { + VDISK_AIO_READ, + VDISK_AIO_WRITE, + VDISK_STAT, + VDISK_TRUNC, + VDISK_AIO_FLUSH, + VDISK_AIO_RECLAIM, + VDISK_GET_GEOMETRY, + VDISK_CHECK_IO_FAILOVER_READY, + VDISK_AIO_LAST_CMD +} VDISKAIOCmd; + +typedef void (*qnio_callback_t)(ssize_t retval, void *arg); + +/* + * BDRVVXHSState specific flags + */ +#define OF_VDISK_FLAGS_STATE_ACTIVE 0x0000000000000001 +#define OF_VDISK_FLAGS_STATE_FAILED 0x0000000000000002 +#define OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS 0x0000000000000004 + +#define OF_VDISK_ACTIVE(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_ACTIVE) +#define OF_VDISK_SET_ACTIVE(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_ACTIVE) +#define OF_VDISK_RESET_ACTIVE(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_ACTIVE) + +#define OF_VDISK_FAILED(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_STATE_FAILED) +#define OF_VDISK_SET_FAILED(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_STATE_FAILED) +#define OF_VDISK_RESET_FAILED(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_STATE_FAILED) + +#define OF_VDISK_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags & OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) +#define OF_VDISK_SET_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags |= OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) +#define OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s) \ + ((s)->vdisk_flags &= ~OF_VDISK_FLAGS_IOFAILOVER_IN_PROGRESS) + +/* + * VXHSAIOCB specific flags + */ +#define OF_ACB_QUEUED 0x00000001 + +#define OF_AIOCB_FLAGS_QUEUED(a) \ + ((a)->flags & OF_ACB_QUEUED) +#define OF_AIOCB_FLAGS_SET_QUEUED(a) \ + ((a)->flags |= OF_ACB_QUEUED) +#define OF_AIOCB_FLAGS_RESET_QUEUED(a) \ + ((a)->flags &= ~OF_ACB_QUEUED) + +typedef struct qemu2qnio_ctx { + uint32_t qnio_flag; + uint64_t qnio_size; + char *qnio_channel; + char *target; + qnio_callback_t qnio_cb; +} qemu2qnio_ctx_t; + +typedef qemu2qnio_ctx_t qnio2qemu_ctx_t; + +typedef struct LibQNIOSymbol { + const char *name; + gpointer *addr; +} LibQNIOSymbol; + +/* + * HyperScale AIO callbacks structure + */ +typedef struct VXHSAIOCB { + BlockAIOCB common; + size_t ret; + size_t size; + QEMUBH *bh; + int aio_done; + int segments; + int flags; + size_t io_offset; + QEMUIOVector *qiov; + void *buffer; + int direction; /* IO direction (r/w) */ + QSIMPLEQ_ENTRY(VXHSAIOCB) retry_entry; +} VXHSAIOCB; + +typedef struct VXHSvDiskHostsInfo { + int qnio_cfd; /* Channel FD */ + int vdisk_rfd; /* vDisk remote FD */ + char *hostip; /* Host's IP addresses */ + int port; /* Host's port number */ +} VXHSvDiskHostsInfo; + +/* + * Structure per vDisk maintained for state + */ +typedef struct BDRVVXHSState { + int fds[2]; + int64_t vdisk_size; + int64_t vdisk_blocks; + int64_t vdisk_flags; + int vdisk_aio_count; + int 
event_reader_pos; + VXHSAIOCB *qnio_event_acb; + void *qnio_ctx; + void *vdisk_lock; /* Lock to protect BDRVVXHSState */ + void *vdisk_acb_lock; /* Protects ACB */ + VXHSvDiskHostsInfo vdisk_hostinfo[VXHS_MAX_HOSTS]; /* Per host info */ + int vdisk_nhosts; /* Total number of hosts */ + int vdisk_cur_host_idx; /* IOs are being shipped to */ + int vdisk_ask_failover_idx; /*asking permsn to ship io*/ + QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq; + int vdisk_aio_retry_qd; /* Currently for debugging */ + char *vdisk_guid; +} BDRVVXHSState; + +void bdrv_vxhs_init(void); +void *vxhs_setup_qnio(void); +void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx, + uint32_t error, uint32_t opcode); +void vxhs_aio_event_reader(void *opaque); +void vxhs_complete_aio(VXHSAIOCB *acb, BDRVVXHSState *s); +unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s); +int vxhs_open(BlockDriverState *bs, QDict *options, + int bdrv_flags, Error **errp); +void vxhs_close(BlockDriverState *bs); +BlockAIOCB *vxhs_aio_readv(BlockDriverState *bs, int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockCompletionFunc *cb, + void *opaque); +BlockAIOCB *vxhs_aio_writev(BlockDriverState *bs, int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockCompletionFunc *cb, + void *opaque); +int64_t vxhs_get_allocated_blocks(BlockDriverState *bs); +BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs, int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockCompletionFunc *cb, + void *opaque, int write); +int32_t vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, struct iovec *iov, + int iovcnt, uint64_t offset, + void *ctx, uint32_t flags); +int32_t vxhs_qnio_iio_readv(void *qnio_ctx, uint32_t rfd, struct iovec *iov, + int iovcnt, uint64_t offset, + void *ctx, uint32_t flags); +int32_t vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, + int64_t *in, void *ctx, + uint32_t flags); +size_t vxhs_calculate_iovec_size(struct iovec *iov, int niov); +void vxhs_copy_iov_to_buffer(struct iovec *iov, int niov, void *buf); +void *vxhs_convert_iovector_to_buffer(struct iovec *iov, int niov, + size_t sector); +int vxhs_is_iovector_read_aligned(struct iovec *iov, int niov, size_t sector); + +int vxhs_aio_flush_cb(void *opaque); +int vxhs_co_flush(BlockDriverState *bs); +int64_t vxhs_getlength(BlockDriverState *bs); +void vxhs_inc_vdisk_iocount(void *ptr, uint32_t delta); +void vxhs_dec_vdisk_iocount(void *ptr, uint32_t delta); +uint32_t vxhs_get_vdisk_iocount(void *ptr); +void vxhs_inc_acb_segment_count(void *ptr, int count); +void vxhs_dec_acb_segment_count(void *ptr, int count); +void vxhs_set_acb_buffer(void *ptr, void *buffer); +int vxhs_failover_io(BDRVVXHSState *s); +int vxhs_reopen_vdisk(BDRVVXHSState *s, int hostinfo_index); +int vxhs_switch_storage_agent(BDRVVXHSState *s); +int vxhs_handle_queued_ios(BDRVVXHSState *s); +int vxhs_restart_aio(VXHSAIOCB *acb); +void vxhs_fail_aio(VXHSAIOCB *acb, int err); +void vxhs_check_failover_status(int res, void *ctx); + +#endif diff --git a/configure b/configure index 7d083bd..11d1dec 100755 --- a/configure +++ b/configure @@ -322,6 +322,7 @@ numa="" tcmalloc="no" jemalloc="no" replication="yes" +vxhs="" # parse CC options first for opt do @@ -1163,6 +1164,11 @@ for opt do ;; --enable-replication) replication="yes" ;; + --disable-vxhs) vxhs="no" + ;; + --enable-vxhs) vxhs="yes" + ;; + *) echo "ERROR: unknown option $opt" echo "Try '$0 --help' for more information" @@ -1394,6 +1400,7 @@ disabled with --disable-FEATURE, default is enabled if available: 
tcmalloc tcmalloc support jemalloc jemalloc support replication replication support + vxhs Veritas HyperScale vDisk backend support NOTE: The object files are built at the place where configure is launched EOF @@ -4560,6 +4567,33 @@ if do_cc -nostdlib -Wl,-r -Wl,--no-relax -o $TMPMO $TMPO; then fi ########################################## +# Veritas HyperScale block driver VxHS +# Check if libqnio is installed + +if test "$vxhs" != "no" ; then + cat > $TMPC < +#include + +void *vxhs_callback; + +int main(void) { + iio_init(vxhs_callback); + return 0; +} +EOF + vxhs_libs="-lqnio" + if compile_prog "" "$vxhs_libs" ; then + vxhs=yes + else + if test "$vxhs" = "yes" ; then + feature_not_found "vxhs block device" "Install libqnio. See github" + fi + vxhs=no + fi +fi + +########################################## # End of CC checks # After here, no more $cc or $ld runs @@ -4927,6 +4961,7 @@ echo "tcmalloc support $tcmalloc" echo "jemalloc support $jemalloc" echo "avx2 optimization $avx2_opt" echo "replication support $replication" +echo "VxHS block device $vxhs" if test "$sdl_too_old" = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -5521,6 +5556,12 @@ if test "$pthread_setname_np" = "yes" ; then echo "CONFIG_PTHREAD_SETNAME_NP=y" >> $config_host_mak fi +if test "$vxhs" = "yes" ; then + echo "CONFIG_VXHS=y" >> $config_host_mak + echo "VXHS_CFLAGS=$vxhs_cflags" >> $config_host_mak + echo "VXHS_LIBS=$vxhs_libs" >> $config_host_mak +fi + if test "$tcg_interpreter" = "yes"; then QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES" elif test "$ARCH" = "sparc64" ; then
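
Build note: with libqnio installed, the driver is enabled by the new
configure switch added above, e.g.:

  ./configure --enable-vxhs
  make

If the libqnio compile probe fails, --enable-vxhs aborts with the
feature_not_found() message, while the default auto-probe simply reports
"VxHS block device no" in the configure summary.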