From patchwork Thu Mar 5 17:08:06 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Stefan Hajnoczi
X-Patchwork-Id: 11422415
From: Stefan Hajnoczi
To: qemu-devel@nongnu.org
Subject: [PATCH 7/7] aio-posix: remove idle poll handlers to improve scalability
Date: Thu, 5 Mar 2020 17:08:06 +0000
Message-Id: <20200305170806.1313245-8-stefanha@redhat.com>
In-Reply-To: <20200305170806.1313245-1-stefanha@redhat.com>
References: <20200305170806.1313245-1-stefanha@redhat.com>
Cc: Fam Zheng, Kevin Wolf, qemu-block@nongnu.org, Max Reitz, Stefan Hajnoczi, Paolo Bonzini

When there are many poll handlers it's likely that some of them are idle
most of the time. Remove handlers that haven't had activity recently so
that the polling loop scales better for guests with a large number of
devices.

This feature only takes effect for the Linux io_uring fd monitoring
implementation because it is capable of combining fd monitoring with
userspace polling. The other implementations can't do that and risk
starving fds in favor of poll handlers, so don't try this optimization
when they are in use.

IOPS improves from 10k to 105k when the guest has 100
virtio-blk-pci,num-queues=32 devices and 1 virtio-blk-pci,num-queues=1
device for rw=randread,iodepth=1,bs=4k,ioengine=libaio on NVMe.

Signed-off-by: Stefan Hajnoczi
---
 include/block/aio.h |  7 ++++
 util/aio-posix.c    | 93 +++++++++++++++++++++++++++++++++++++++++----
 util/aio-posix.h    |  2 +
 util/trace-events   |  2 +
 4 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index f07ebb76b8..60779285da 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -227,6 +227,13 @@ struct AioContext {
     int64_t poll_grow; /* polling time growth factor */
     int64_t poll_shrink; /* polling time shrink factor */
 
+    /*
+     * List of handlers participating in userspace polling. Accessed almost
+     * exclusively from aio_poll() and therefore not an RCU list. Protected by
+     * ctx->list_lock.
+     */
+    AioHandlerList poll_aio_handlers;
+
     /* Are we in polling mode or monitoring file descriptors? */
     bool poll_started;
 
diff --git a/util/aio-posix.c b/util/aio-posix.c
index ede04a4bc2..ab0ed41f2a 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -22,6 +22,9 @@
 #include "trace.h"
 #include "aio-posix.h"
 
+/* Stop userspace polling on a handler if it isn't active for some time */
+#define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
+
 bool aio_poll_disabled(AioContext *ctx)
 {
     return atomic_read(&ctx->poll_disable_cnt);
@@ -78,6 +81,7 @@ static bool aio_remove_fd_handler(AioContext *ctx, AioHandler *node)
      * deleted because deleted nodes are only cleaned up while
      * no one is walking the handlers list.
      */
+    QLIST_SAFE_REMOVE(node, node_poll);
     QLIST_REMOVE(node, node);
     return true;
 }
@@ -205,7 +209,7 @@ static bool poll_set_started(AioContext *ctx, bool started)
     ctx->poll_started = started;
 
     qemu_lockcnt_inc(&ctx->list_lock);
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
         IOHandler *fn;
 
         if (QLIST_IS_INSERTED(node, node_deleted)) {
@@ -286,6 +290,7 @@ static void aio_free_deleted_handlers(AioContext *ctx)
     while ((node = QLIST_FIRST_RCU(&ctx->deleted_aio_handlers))) {
         QLIST_REMOVE(node, node);
         QLIST_REMOVE(node, node_deleted);
+        QLIST_SAFE_REMOVE(node, node_poll);
         g_free(node);
     }
 
@@ -300,6 +305,22 @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
     revents = node->pfd.revents & node->pfd.events;
     node->pfd.revents = 0;
 
+    /*
+     * Start polling AioHandlers when they become ready because activity is
+     * likely to continue. Note that starvation is theoretically possible when
+     * fdmon_supports_polling(), but only until the fd fires for the first
+     * time.
+     */
+    if (!QLIST_IS_INSERTED(node, node_deleted) &&
+        !QLIST_IS_INSERTED(node, node_poll) &&
+        node->io_poll) {
+        trace_poll_add(ctx, node, node->pfd.fd, revents);
+        if (ctx->poll_started && node->io_poll_begin) {
+            node->io_poll_begin(node->opaque);
+        }
+        QLIST_INSERT_HEAD(&ctx->poll_aio_handlers, node, node_poll);
+    }
+
     if (!QLIST_IS_INSERTED(node, node_deleted) &&
         (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
         aio_node_check(ctx, node->is_external) &&
@@ -364,15 +385,19 @@ void aio_dispatch(AioContext *ctx)
     timerlistgroup_run_timers(&ctx->tlg);
 }
 
-static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
+static bool run_poll_handlers_once(AioContext *ctx,
+                                   int64_t now,
+                                   int64_t *timeout)
 {
     bool progress = false;
     AioHandler *node;
+    AioHandler *tmp;
 
-    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
-        if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
-            aio_node_check(ctx, node->is_external) &&
+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
+        if (aio_node_check(ctx, node->is_external) &&
             node->io_poll(node->opaque)) {
+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
+
             /*
              * Polling was successful, exit try_poll_mode immediately
              * to adjust the next polling time.
@@ -389,6 +414,50 @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
     return progress;
 }
 
+static bool fdmon_supports_polling(AioContext *ctx)
+{
+    return ctx->fdmon_ops->need_wait != aio_poll_disabled;
+}
+
+static bool remove_idle_poll_handlers(AioContext *ctx, int64_t now)
+{
+    AioHandler *node;
+    AioHandler *tmp;
+    bool progress = false;
+
+    /*
+     * File descriptor monitoring implementations without userspace polling
+     * support suffer from starvation when a subset of handlers is polled
+     * because fds will not be processed in a timely fashion. Don't remove
+     * idle poll handlers.
+     */
+    if (!fdmon_supports_polling(ctx)) {
+        return false;
+    }
+
+    QLIST_FOREACH_SAFE(node, &ctx->poll_aio_handlers, node_poll, tmp) {
+        if (node->poll_idle_timeout == 0LL) {
+            node->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
+        } else if (now >= node->poll_idle_timeout) {
+            trace_poll_remove(ctx, node, node->pfd.fd);
+            node->poll_idle_timeout = 0LL;
+            QLIST_SAFE_REMOVE(node, node_poll);
+            if (ctx->poll_started && node->io_poll_end) {
+                node->io_poll_end(node->opaque);
+
+                /*
+                 * Final poll in case ->io_poll_end() races with an event.
+                 * Nevermind about re-adding the handler in the rare case where
+                 * this causes progress.
+                 */
+                progress = node->io_poll(node->opaque) || progress;
+            }
+        }
+    }
+
+    return progress;
+}
+
 /* run_poll_handlers:
  * @ctx: the AioContext
  * @max_ns: maximum time to poll for, in nanoseconds
@@ -424,12 +493,17 @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
 
     start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     do {
-        progress = run_poll_handlers_once(ctx, timeout);
+        progress = run_poll_handlers_once(ctx, start_time, timeout);
         elapsed_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;
         max_ns = qemu_soonest_timeout(*timeout, max_ns);
         assert(!(max_ns && progress));
     } while (elapsed_time < max_ns && !ctx->fdmon_ops->need_wait(ctx));
 
+    if (remove_idle_poll_handlers(ctx, start_time + elapsed_time)) {
+        *timeout = 0;
+        progress = true;
+    }
+
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
      */
@@ -454,8 +528,13 @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
  */
 static bool try_poll_mode(AioContext *ctx, int64_t *timeout)
 {
-    int64_t max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
+    int64_t max_ns;
+
+    if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
+        return false;
+    }
 
+    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, true);
 
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 55fc771327..c80c04506a 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -30,10 +30,12 @@ struct AioHandler {
     QLIST_ENTRY(AioHandler) node;
     QLIST_ENTRY(AioHandler) node_ready; /* only used during aio_poll() */
     QLIST_ENTRY(AioHandler) node_deleted;
+    QLIST_ENTRY(AioHandler) node_poll;
 #ifdef CONFIG_LINUX_IO_URING
     QSLIST_ENTRY(AioHandler) node_submitted;
     unsigned flags; /* see fdmon-io_uring.c */
 #endif
+    int64_t poll_idle_timeout; /* when to stop userspace polling */
     bool is_external;
 };
 
diff --git a/util/trace-events b/util/trace-events
index 83b6639018..0ce42822eb 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -5,6 +5,8 @@ run_poll_handlers_begin(void *ctx, int64_t max_ns, int64_t timeout) "ctx %p max_
 run_poll_handlers_end(void *ctx, bool progress, int64_t timeout) "ctx %p progress %d new timeout %"PRId64
 poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
 poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new %"PRId64
+poll_add(void *ctx, void *node, int fd, unsigned revents) "ctx %p node %p fd %d revents 0x%x"
+poll_remove(void *ctx, void *node, int fd) "ctx %p node %p fd %d"
 
 # async.c
 aio_co_schedule(void *ctx, void *co) "ctx %p co %p"
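
The idle-removal rule in remove_idle_poll_handlers() can be tried outside
QEMU with a small standalone model. The following is only a sketch under
stated assumptions, not the code from this patch: Handler, PollSet,
poll_set_add(), handler_mark_active() and remove_idle() are hypothetical
names that do not exist in QEMU, a plain singly linked list stands in for
QLIST, and the fdmon_supports_polling() check and the final poll after
->io_poll_end() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NANOSECONDS_PER_SECOND 1000000000LL
#define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)

/* Hypothetical stand-in for AioHandler: only the fields the idle rule needs */
typedef struct Handler {
    struct Handler *next;       /* poll list linkage (node_poll in the patch) */
    int64_t poll_idle_timeout;  /* 0 means the deadline is not armed yet */
    bool in_poll_list;
} Handler;

/* Hypothetical stand-in for ctx->poll_aio_handlers */
typedef struct {
    Handler *poll_handlers;
} PollSet;

/* Start polling a handler, as aio_dispatch_handler() does when an fd fires */
static void poll_set_add(PollSet *set, Handler *h)
{
    if (h->in_poll_list) {
        return;
    }
    h->next = set->poll_handlers;
    set->poll_handlers = h;
    h->in_poll_list = true;
}

/* A successful poll pushes the idle deadline into the future */
static void handler_mark_active(Handler *h, int64_t now)
{
    h->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
}

/*
 * Arm deadlines that are still 0 and unlink handlers whose deadline has
 * passed. Returns the number of handlers removed from the poll list.
 */
static size_t remove_idle(PollSet *set, int64_t now)
{
    size_t removed = 0;
    Handler **link = &set->poll_handlers;

    while (*link) {
        Handler *h = *link;

        if (h->poll_idle_timeout == 0) {
            h->poll_idle_timeout = now + POLL_IDLE_INTERVAL_NS;
            link = &h->next;
        } else if (now >= h->poll_idle_timeout) {
            *link = h->next;            /* unlink without advancing */
            h->next = NULL;
            h->poll_idle_timeout = 0;   /* re-arm lazily if added again */
            h->in_poll_list = false;
            removed++;
        } else {
            link = &h->next;
        }
    }
    return removed;
}
```

In this model a handler that never reports progress is dropped on the first
remove_idle() call made at least POLL_IDLE_INTERVAL_NS after its deadline
was armed, while a handler refreshed through handler_mark_active() stays on
the list. A removed handler can be put back with poll_set_add(), which
mirrors how the patch re-adds a handler once its fd becomes ready again.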