Message ID | 20170302130422.81380-1-pasic@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, 2 Mar 2017 14:04:22 +0100
Halil Pasic <pasic@linux.vnet.ibm.com> wrote:

> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 5556f0e..13dd14d 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>      }
>
> -    /* Drain and switch bs back to the QEMU main loop */
> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
> +     * device will not submit (nor comple) any requests until dataplane

s/comple/complete/

> +     * starts again.
> +     */
>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>
> +    /* Notify guest before the guest notifiers get cleaned up */
> +    qemu_bh_cancel(s->bh);
> +    notify_guest_bh(s);
> +

Hm... does virtio-scsi dataplane need a similar treatment? Or am I
missing something?

>      aio_context_release(s->ctx);
>
>      for (i = 0; i < nvqs; i++) {
On 02/03/2017 15:49, Cornelia Huck wrote:
> On Thu, 2 Mar 2017 14:04:22 +0100
> Halil Pasic <pasic@linux.vnet.ibm.com> wrote:
>
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 5556f0e..13dd14d 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>>      }
>>
>> -    /* Drain and switch bs back to the QEMU main loop */
>> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
>> +     * device will not submit (nor comple) any requests until dataplane
>
> s/comple/complete/
>
>> +     * starts again.
>> +     */
>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>>
>> +    /* Notify guest before the guest notifiers get cleaned up */
>> +    qemu_bh_cancel(s->bh);
>> +    notify_guest_bh(s);
>> +
>
> Hm... does virtio-scsi dataplane need a similar treatment? Or am I
> missing something?

No, the BH optimization is specific to virtio-blk. Thanks for the patch
Halil!

Paolo

>>      aio_context_release(s->ctx);
>>
>>      for (i = 0; i < nvqs; i++) {
>
On 03/02/2017 04:32 PM, Paolo Bonzini wrote:
>
>
> On 02/03/2017 15:49, Cornelia Huck wrote:
>> On Thu, 2 Mar 2017 14:04:22 +0100
>> Halil Pasic <pasic@linux.vnet.ibm.com> wrote:
>>
>>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>>> index 5556f0e..13dd14d 100644
>>> --- a/hw/block/dataplane/virtio-blk.c
>>> +++ b/hw/block/dataplane/virtio-blk.c
>>> @@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
>>>          virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
>>>      }
>>>
>>> -    /* Drain and switch bs back to the QEMU main loop */
>>> +    /* Drain and switch bs back to the QEMU main loop. After drain, the
>>> +     * device will not submit (nor comple) any requests until dataplane
>>
>> s/comple/complete/
>>

Will fix.

>>> +     * starts again.
>>> +     */
>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());

I'm wondering if synchronization is needed for batch_notify_vqs. I think
the set_bit can be from the iothread, but the notify_guest_bh below is in
the main event loop. Is it OK like this (could we miss bits set in the
other thread)? Does blk_set_aio_context include a barrier?

>>>
>>> +    /* Notify guest before the guest notifiers get cleaned up */
>>> +    qemu_bh_cancel(s->bh);
>>> +    notify_guest_bh(s);
>>> +
>>
>> Hm... does virtio-scsi dataplane need a similar treatment? Or am I
>> missing something?
>
> No, the BH optimization is specific to virtio-blk. Thanks for the patch
> Halil!
>

You are welcome ;)!

Regards,
Halil
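For context on the mechanism Halil is asking about, here is a rough sketch of the virtio-blk dataplane batch-notify path, reconstructed from memory of the 2.8-era hw/block/dataplane/virtio-blk.c (exact names, fields and helpers may differ slightly, and QEMU-internal types are assumed): a request completion sets a bit for its virtqueue and schedules a BH; the BH later turns the accumulated bits into irqfd notifications.

/* Sketch only; not the literal upstream code. */

/* Called from the request completion path, normally in the iothread that
 * owns s->ctx: remember which virtqueue wants an interrupt and defer the
 * actual notification to a bottom half. */
void virtio_blk_data_plane_notify(VirtIOBlockDataPlane *s, VirtQueue *vq)
{
    set_bit(virtio_get_queue_index(vq), s->batch_notify_vqs);
    qemu_bh_schedule(s->bh);
}

/* BH body: runs in the AioContext the BH was created in and injects the
 * batched interrupts through the irqfd-backed guest notifiers. */
static void notify_guest_bh(void *opaque)
{
    VirtIOBlockDataPlane *s = opaque;
    unsigned nvqs = s->conf->num_queues;
    unsigned long bitmap[BITS_TO_LONGS(nvqs)];
    unsigned j;

    memcpy(bitmap, s->batch_notify_vqs, sizeof(bitmap));
    memset(s->batch_notify_vqs, 0, sizeof(bitmap));

    for (j = 0; j < nvqs; j += BITS_PER_LONG) {
        unsigned long bits = bitmap[j / BITS_PER_LONG];

        while (bits != 0) {
            unsigned i = j + ctzl(bits);
            VirtQueue *vq = virtio_get_queue(s->vdev, i);

            virtio_notify_irqfd(s->vdev, vq);

            bits &= bits - 1; /* clear the right-most set bit */
        }
    }
}

The race the patch fixes is visible in this shape: if dataplane is stopped after qemu_bh_schedule() but before the BH runs, the irqfd behind the guest notifier may already be torn down, hence the forced qemu_bh_cancel() plus notify_guest_bh() in virtio_blk_data_plane_stop().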
On 02/03/2017 16:55, Halil Pasic wrote:
>>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
> I'm wondering if synchronization is needed for batch_notify_vqs. I think
> the set_bit can be from the iothread, but the notify_guest_bh below is in
> the main event loop. Is it OK like this (could we miss bits set in the
> other thread)? Does blk_set_aio_context include a barrier?

Another good question---yes, it does (via bdrv_set_aio_context's call to
aio_context_acquire).

But it's something to remember for when I get round to removing
aio_context_acquire!

Paolo
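The call chain Paolo points to looks roughly like this: blk_set_aio_context() forwards to bdrv_set_aio_context(), whose drain plus acquire/release of the destination context is where the implied barrier comes from. The following is a simplified sketch from memory of the block.c of that period, with parent and external-event handling elided; names and details may differ.

/* Sketch only; the real bdrv_set_aio_context() has more steps. */
void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
{
    bdrv_drain(bs);               /* no in-flight requests afterwards */

    bdrv_detach_aio_context(bs);  /* still executes in the old context */

    /* Per the discussion above: the acquire/release pairs involved here
     * (and inside the drain) are mutex operations, so writes done by the
     * iothread under the old context, such as the bits set in
     * batch_notify_vqs, are visible to the main loop afterwards. */
    aio_context_acquire(new_context);
    bdrv_attach_aio_context(bs, new_context);
    aio_context_release(new_context);
}

As Paolo notes, this guarantee rests on aio_context_acquire/release existing at all, which is why it is something to remember once those calls are replaced by fine-grained locks.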
On 03/02/2017 05:21 PM, Paolo Bonzini wrote:
>
>
> On 02/03/2017 16:55, Halil Pasic wrote:
>>>>>      blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
>> I'm wondering if synchronization is needed for batch_notify_vqs. I think
>> the set_bit can be from the iothread, but the notify_guest_bh below is in
>> the main event loop. Is it OK like this (could we miss bits set in the
>> other thread)? Does blk_set_aio_context include a barrier?
>
> Another good question---yes, it does (via bdrv_set_aio_context's call to
> aio_context_acquire).
>
> But it's something to remember for when I get round to removing
> aio_context_acquire!

Uh, this is complicated. I'm not out of questions, but I fear taking too
much of your precious time. I will ask again nevertheless, but please
just cut the conversation with -EBUSY if it gets too expensive.

What I understand is that batch_notify_vqs is effectively protected by
blk_get_aio_context(s->conf.conf.blk)->lock, which might or might not be
the same as qemu_get_aio_context()->lock. If it is not the same, that
means we are running with an iothread.

Now let us have a look at the drain. AFAIU, in

#define BDRV_POLL_WHILE(bs, cond) ({                       \
    bool waited_ = false;                                  \
    BlockDriverState *bs_ = (bs);                          \
    AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
    if (aio_context_in_iothread(ctx_)) {                   \
        while ((cond)) {                                   \
            aio_poll(ctx_, true);                          \
            waited_ = true;                                \
        }                                                  \
    } else {                                               \
        assert(qemu_get_current_aio_context() ==           \
               qemu_get_aio_context());                    \
        /* Ask bdrv_dec_in_flight to wake up the main      \
         * QEMU AioContext.  Extra I/O threads never take  \
         * other I/O threads' AioContexts (see for example \
         * block_job_defer_to_main_loop for how to do it). \
         */                                                \
        assert(!bs_->wakeup);                              \
        bs_->wakeup = true;                                \
        while ((cond)) {                                   \
            aio_context_release(ctx_);                     \
            aio_poll(qemu_get_aio_context(), true);        \
            aio_context_acquire(ctx_);                     \
            waited_ = true;                                \
        }                                                  \
        bs_->wakeup = false;                               \
    }                                                      \
    waited_; })

where it gets interesting (me running with 2 iothreads, one assigned to
my block device): we are going to take the "else" branch, and will end
up releasing the ctx belonging to the iothread and then acquiring it
again, basically waiting until the requests are done.

Since virtio_blk_rw_complete is acquiring/releasing the same ctx, this
makes a lot of sense. If we did not release the ctx, we would end up
with a deadlock.

What I do not understand entirely are the atomic_read/write on
bs->in_flight, which tells us when we are done waiting. In

static void blk_aio_complete(BlkAioEmAIOCB *acb)
{
    if (acb->has_returned) {
        bdrv_dec_in_flight(acb->common.bs);
        acb->common.cb(acb->common.opaque, acb->rwco.ret);
        qemu_aio_unref(acb);
    }
}

the atomic decrement is before the callback, that is, before
virtio_blk_rw_complete is called. My guess here is that this does not
matter, because we are holding the ctx anyway, so it is ensured that we
will not observe 0 until the last virtio_blk_rw_complete has completed
(the mutex is recursive). But then I do not understand what the point of
acquiring the mutex in virtio_blk_rw_complete is.

TL;DR
Is patch b9e413dd required for the correctness of this patch?
What is the role of the aio_context_acquire/release introduced by
b9e413dd in virtio_blk_rw_complete?

Regards,
Halil
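To make the in_flight accounting Halil refers to easier to follow, here is a rough sketch of the counter side, reconstructed from memory of the block/io.c of that period (names and exact placement may differ): the drain's condition polls the counter, and the decrement wakes up the main loop via a one-shot BH when bs->wakeup is set.

/* Sketch only; not the literal upstream code. */

static void dummy_bh_cb(void *opaque)
{
    /* Nothing to do: scheduling it is enough to kick the main loop's
     * aio_poll() so that BDRV_POLL_WHILE re-evaluates its condition. */
}

void bdrv_inc_in_flight(BlockDriverState *bs)
{
    atomic_inc(&bs->in_flight);
}

void bdrv_dec_in_flight(BlockDriverState *bs)
{
    atomic_dec(&bs->in_flight);
    if (bs->wakeup) {
        /* The main loop set bs->wakeup in BDRV_POLL_WHILE and is blocked
         * in aio_poll(qemu_get_aio_context(), true); wake it up. */
        aio_bh_schedule_oneshot(qemu_get_aio_context(), dummy_bh_cb, NULL);
    }
}

/* The drain then essentially boils down to: */
void bdrv_drain(BlockDriverState *bs)
{
    /* ... driver callbacks and recursion elided ... */
    BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
}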
On 03/03/2017 20:43, Halil Pasic wrote:
> Uh, this is complicated. I'm not out of questions, but I fear taking too
> much of your precious time. I will ask again nevertheless, but please
> just cut the conversation with -EBUSY if it gets too expensive.

It's the opposite! I need other people to look at it, understand it and
poke holes at my own reasoning.

>         bs_->wakeup = true;                                \
>         while ((cond)) {                                   \
>             aio_context_release(ctx_);                     \
>             aio_poll(qemu_get_aio_context(), true);        \
>             aio_context_acquire(ctx_);                     \
>             waited_ = true;                                \
>         }                                                  \
>         bs_->wakeup = false;                               \
>     }                                                      \
>     waited_; })
>
> where it gets interesting (me running with 2 iothreads, one assigned to
> my block device): we are going to take the "else" branch, and will end
> up releasing the ctx belonging to the iothread and then acquiring it
> again, basically waiting until the requests are done.
>
> Since virtio_blk_rw_complete is acquiring/releasing the same ctx, this
> makes a lot of sense. If we did not release the ctx, we would end up
> with a deadlock.

Related to this, note that the above release/wait/acquire logic is
itself new to 2.8. Before, QEMU would run aio_poll(other_aio_context)
directly in the main thread. This relied on recursive mutexes and
special callbacks to pass the lock between the I/O and main threads.
This worked but it hid some thread-unsafe idioms, so I removed it in
the commits ending at 65c1b5b ("iothread: release AioContext around
aio_poll", 2016-10-28).

BDRV_POLL_WHILE provides an abstraction that will not change once
aio_context_acquire/release disappears in favor of fine-grained locks.

> But then I do not understand what the point of acquiring
> the mutex in virtio_blk_rw_complete is.

When QEMU does I/O in the main thread, it can call
BlockBackend/BlockDriverState functions. Even though their completion
is processed in the I/O thread (via BDRV_POLL_WHILE), you still need a
lock to handle mutual exclusion between the two.

In the case of virtio_blk_rw_complete the mutex is needed because, for
example, block_acct_done is not thread safe yet.

> Is patch b9e413dd required for the correctness of this patch?
> What is the role of the aio_context_acquire/release introduced by
> b9e413dd in virtio_blk_rw_complete?

Patch b9e413dd should have no semantic change. The acquire/release
calls that it adds are all balanced by acquire/release calls removed
elsewhere, for example in block/linux-aio.c.

Paolo
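For reference, the shape of virtio_blk_rw_complete() after commit b9e413dd is roughly the following. This is a simplified sketch from memory, not the literal hw/block/virtio-blk.c code: multireq handling and error paths are abbreviated, and QEMU-internal helpers are assumed.

/* May be invoked either from the iothread or, e.g. while dataplane is
 * stopped, from the main loop; it therefore serializes on the
 * BlockBackend's current AioContext, as Paolo describes above. */
static void virtio_blk_rw_complete(void *opaque, int ret)
{
    VirtIOBlockReq *next = opaque;
    VirtIOBlock *s = next->dev;
    AioContext *ctx = blk_get_aio_context(s->blk);

    aio_context_acquire(ctx);
    while (next) {
        VirtIOBlockReq *req = next;
        next = req->mr_next;

        if (ret) {
            /* ... error handling / request retry elided ... */
        }

        /* block_acct_done() is not thread safe yet, hence the lock above */
        block_acct_done(blk_get_stats(s->blk), &req->acct);
        virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
        virtio_blk_free_request(req);
    }
    aio_context_release(ctx);
}

Commit b9e413dd moved such acquire/release pairs from the AIO backends (for example block/linux-aio.c) into the device completion callbacks, which is why, as Paolo notes, the change is balanced and has no semantic effect.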
On 03/06/2017 03:55 PM, Paolo Bonzini wrote:
>
> On 03/03/2017 20:43, Halil Pasic wrote:
>> Uh, this is complicated. I'm not out of questions, but I fear taking too
>> much of your precious time. I will ask again nevertheless, but please
>> just cut the conversation with -EBUSY if it gets too expensive.
> It's the opposite! I need other people to look at it, understand it and
> poke holes at my own reasoning.
>

Thanks for the explanation. I intend to send out a v2 (with the typo
fixed) tomorrow.

Regards,
Halil
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 5556f0e..13dd14d 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -258,9 +258,16 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
         virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
     }
 
-    /* Drain and switch bs back to the QEMU main loop */
+    /* Drain and switch bs back to the QEMU main loop. After drain, the
+     * device will not submit (nor comple) any requests until dataplane
+     * starts again.
+     */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
 
+    /* Notify guest before the guest notifiers get cleaned up */
+    qemu_bh_cancel(s->bh);
+    notify_guest_bh(s);
+
     aio_context_release(s->ctx);
 
     for (i = 0; i < nvqs; i++) {
The commits 03de2f527 "virtio-blk: do not use vring in dataplane" and
9ffe337c08 "virtio-blk: always use dataplane path if ioeventfd is
active" substantially changed how notifications are done for virtio-blk.
Due to a race condition, interrupts are lost when the irqfd behind the
guest notifier is torn down after notify_guest_bh was scheduled but
before it actually runs.

Let's fix this by forcing guest notifications before cleaning up the
irqfds. Let's also add some explanatory comments.

Cc: qemu-stable@nongnu.org
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reported-by: Michael A. Tebolt <miket@us.ibm.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
---

This patch supersedes my previous attempt to fix the same issue, which
was running under the title "virtio: fallback from irqfd to non-irqfd
notify".

Tested briefly without iothread and overnight with iothread, but more
testing would not hurt.

@Paolo: Could you please check the comments? They reflect my
understanding, which may be very wrong.
---
 hw/block/dataplane/virtio-blk.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)