From patchwork Tue Jul 12 14:12:42 2016
X-Patchwork-Submitter: Roman Pen
X-Patchwork-Id: 9225389
From: Roman Penyaev
Date: Tue, 12 Jul 2016 16:12:42 +0200
To: Stefan Hajnoczi
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH] linux-aio: keep processing events if MAX_EVENTS reached
In-Reply-To: <20160628094222.GD2798@stefanha-x1.localdomain>
References: <1467045468-25709-1-git-send-email-stefanha@redhat.com>
 <20160628094222.GD2798@stefanha-x1.localdomain>

On Tue, Jun 28, 2016 at 11:42 AM, Stefan Hajnoczi wrote:
> On Mon, Jun 27, 2016 at 08:36:19PM +0200, Roman Penyaev wrote:
>> On Jun 27, 2016 6:37 PM, "Stefan Hajnoczi" wrote:
>> >
>> > Commit ccb9dc10129954d0bcd7814298ed445e684d5a2a ("linux-aio: Cancel BH
>> > if not needed") exposed a side-effect of scheduling the BH for nested
>> > event loops. When io_getevents(2) fetches MAX_EVENTS events then it's
>> > necessary to call it again. Failure to do so can lead to hung I/O
>> > because the eventfd has already been cleared and the completion BH will
>> > not run again.
>> >
>> > This patch fixes the hang by calling io_getevents(2) again if the events
>> > array was totally full.
>> >
>> > Reported-by: Roman Penyaev
>> > Signed-off-by: Stefan Hajnoczi
>> > ---
>> >  block/linux-aio.c | 5 +++++
>> >  1 file changed, 5 insertions(+)
>> >
>> > diff --git a/block/linux-aio.c b/block/linux-aio.c
>> > index e468960..af15f85 100644
>> > --- a/block/linux-aio.c
>> > +++ b/block/linux-aio.c
>> > @@ -117,6 +117,7 @@ static void qemu_laio_completion_bh(void *opaque)
>> >      LinuxAioState *s = opaque;
>> >
>> >      /* Fetch more completion events when empty */
>> > +more_events:
>> >      if (s->event_idx == s->event_max) {
>> >          do {
>> >              struct timespec ts = { 0 };
>> > @@ -146,6 +147,10 @@ static void qemu_laio_completion_bh(void *opaque)
>> >          qemu_laio_process_completion(laiocb);
>> >      }
>> >
>> > +    if (s->event_idx == MAX_EVENTS) {
>> > +        goto more_events; /* there might still be events waiting for us */
>> > +    }
>> > +
>>
>> I am on vacation and can't check anything, but it seems you could also
>> reproduce the issue and it seems you are right: what I've also noticed
>> is that I/O hangs only if the total number of requests in the guest is
>> greater than 128 (which is what is defined in linux-aio.c).
>>
>> But I do not understand why you have to repeat io_getevents in
>> case of return value == MAX_EVENTS. According to my
>> understanding you can't submit more than 128, so the first
>> io_getevents call returns you exactly what you've submitted.
>
> Hi Roman,
> There is nothing like discussions on qemu-devel from the beach. True
> vacation!
>
> The issue is:
>
> 1. s->events[] is only MAX_EVENTS large so each io_getevents() call can
>    only fetch that maximum number of events.
>
> 2. qemu_laio_completion_cb() clears the eventfd so the event loop will
>    not call us again (unless additional I/O requests complete).
>
> Therefore, returning from qemu_laio_completion_bh() without having
> processed all events causes a hang.

Hi Stefan,

The issue is clear now.
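To restate the pattern for the archive: the completion handler clears the
eventfd once and must then drain the kernel completion ring in a loop,
because events left behind will never trigger another notification. Below
is a rough standalone sketch of that loop, not the QEMU code itself: the
helper name drain_completions() and the bare libaio/eventfd usage are made
up for illustration, only MAX_EVENTS mirrors block/linux-aio.c:

#include <libaio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define MAX_EVENTS 128

/* Called when the eventfd fires: must reap *everything* outstanding. */
static void drain_completions(io_context_t ctx, int efd)
{
    struct io_event events[MAX_EVENTS];
    uint64_t counter;
    int nr;

    /* Clear the eventfd: no further wakeup until *new* completions arrive */
    ssize_t r = read(efd, &counter, sizeof(counter));
    (void)r;  /* EAGAIN on a non-blocking eventfd just means already cleared */

    do {
        struct timespec ts = { 0 };

        /* Non-blocking fetch of up to MAX_EVENTS completions */
        nr = io_getevents(ctx, 0, MAX_EVENTS, events, &ts);
        for (int i = 0; i < nr; i++) {
            /* handle events[i].obj / events[i].res here */
        }
        /* A full array means the kernel ring may still hold more events,
         * and nobody will kick the eventfd for them, so loop again. */
    } while (nr == MAX_EVENTS);
}

Stopping after a single full batch of MAX_EVENTS gives exactly the hang
described above: the eventfd has already been read, so nothing re-arms
the handler.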
The thing is that I had assumed that we never submit more than
MAX_EVENTS. Now I see that, according to ioq_submit(), this is not true,
so we can have more requests in flight than MAX_EVENTS. Frankly, that
looks like a bug to me, because we promise the kernel [when we call
io_setup(MAX_EVENTS)] not to submit more than the specified value. The
kernel allocates the ring buffer aligned to the page size and also adds
some compensation to have enough free requests for each CPU, so the final
number of io events *can be* larger than what we requested. So eventually
the kernel can "swallow" more events than MAX_EVENTS.

I did the following patch (at the bottom of this email), which restricts
submission to at most MAX_EVENTS requests in flight, and the interesting
thing is that it works faster than your current fix for this hang:

my setup: 1 iothread, VCPU=8, MQ=8

your current patch "linux-aio: keep processing events if MAX_EVENTS reached":

   READ: io=48199MB, aggrb=1606.5MB/s, minb=1606.5MB/s, maxb=1606.5MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=48056MB, aggrb=1601.8MB/s, minb=1601.8MB/s, maxb=1601.8MB/s, mint=30003msec, maxt=30003msec

my changes:

   READ: io=53294MB, aggrb=1776.3MB/s, minb=1776.3MB/s, maxb=1776.3MB/s, mint=30003msec, maxt=30003msec
  WRITE: io=53177MB, aggrb=1772.4MB/s, minb=1772.4MB/s, maxb=1772.4MB/s, mint=30003msec, maxt=30003msec

But the most important thing here is that reverting "linux-aio: Cancel BH
if not needed" brings these numbers:

   READ: io=56362MB, aggrb=1878.4MB/s, minb=1878.4MB/s, maxb=1878.4MB/s, mint=30007msec, maxt=30007msec
  WRITE: io=56255MB, aggrb=1874.8MB/s, minb=1874.8MB/s, maxb=1874.8MB/s, mint=30007msec, maxt=30007msec

So it seems to me that "linux-aio: Cancel BH if not needed" introduces a
regression.

---
Roman

---
 block/linux-aio.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index e468960..dd768cc 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -44,7 +44,8 @@ struct qemu_laiocb {
 
 typedef struct {
     int plugged;
-    unsigned int n;
+    unsigned int inqueue;
+    unsigned int inflight;
     bool blocked;
     QSIMPLEQ_HEAD(, qemu_laiocb) pending;
 } LaioQueue;
@@ -129,6 +130,7 @@ static void qemu_laio_completion_bh(void *opaque)
             s->event_max = 0;
             return; /* no more events */
         }
+        s->io_q.inflight -= s->event_max;
     }
 
     /* Reschedule so nested event loops see currently pending completions */
@@ -190,7 +192,8 @@ static void ioq_init(LaioQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->pending);
     io_q->plugged = 0;
-    io_q->n = 0;
+    io_q->inqueue = 0;
+    io_q->inflight = 0;
     io_q->blocked = false;
 }
 
@@ -203,9 +206,12 @@ static void ioq_submit(LinuxAioState *s)
 
     do {
         len = 0;
+        if (s->io_q.inflight >= MAX_EVENTS)
+            break;
         QSIMPLEQ_FOREACH(aiocb, &s->io_q.pending, next) {
             iocbs[len++] = &aiocb->iocb;
-            if (len == MAX_QUEUED_IO) {
+            if (len == MAX_QUEUED_IO ||
+                (s->io_q.inflight + len) >= MAX_EVENTS) {
                 break;
             }
         }
@@ -218,11 +224,12 @@ static void ioq_submit(LinuxAioState *s)
             abort();
         }
 
-        s->io_q.n -= ret;
+        s->io_q.inflight += ret;
+        s->io_q.inqueue -= ret;
         aiocb = container_of(iocbs[ret - 1], struct qemu_laiocb, iocb);
         QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
     } while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
-    s->io_q.blocked = (s->io_q.n > 0);
+    s->io_q.blocked = (s->io_q.inqueue > 0);
 }
 
 void laio_io_plug(BlockDriverState *bs, LinuxAioState *s)
@@ -263,9 +270,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
     io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
 
     QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
-    s->io_q.n++;
+    s->io_q.inqueue++;
     if (!s->io_q.blocked &&
-        (!s->io_q.plugged || s->io_q.n >= MAX_QUEUED_IO)) {
+        (!s->io_q.plugged || s->io_q.inqueue >= MAX_QUEUED_IO)) {
         ioq_submit(s);
     }
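
For what it's worth, the accounting above reduces to the following
standalone sketch (illustration only, not the QEMU data structures; the
names queue_state and submit_pending() are made up): keep one counter for
requests still queued in userspace and one for requests already handed to
the kernel, and never let the in-flight count exceed the MAX_EVENTS passed
to io_setup(). Then a single io_getevents() call with a MAX_EVENTS-sized
array is always enough to reap everything that can be outstanding:

#include <libaio.h>

#define MAX_EVENTS    128
#define MAX_QUEUED_IO 128

/* Hypothetical queue accounting, mirroring the inqueue/inflight split */
struct queue_state {
    unsigned int inqueue;   /* queued in userspace, not yet submitted */
    unsigned int inflight;  /* handed to the kernel, not yet reaped   */
};

/* Submit as much of 'pending' as the io_setup(MAX_EVENTS) limit allows. */
static void submit_pending(io_context_t ctx, struct queue_state *q,
                           struct iocb **pending, unsigned int npending)
{
    unsigned int done = 0;

    while (done < npending && q->inflight < MAX_EVENTS) {
        unsigned int batch = npending - done;

        if (batch > MAX_QUEUED_IO) {
            batch = MAX_QUEUED_IO;
        }
        if (batch > MAX_EVENTS - q->inflight) {
            batch = MAX_EVENTS - q->inflight;  /* never overcommit the ring */
        }

        int ret = io_submit(ctx, batch, &pending[done]);
        if (ret <= 0) {
            break;  /* real code must distinguish -EAGAIN from hard errors */
        }

        done        += ret;
        q->inflight += ret;
        q->inqueue  -= ret;

        if ((unsigned int)ret < batch) {
            break;  /* partial submission: retry the rest later */
        }
    }
}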