diff mbox series

[1/4] vhost-user: Introduce nested event loop in vhost_user_read()

Message ID 20210308123141.26444-2-groug@kaod.org (mailing list archive)
State New, archived
Headers show
Series virtiofsd: Avoid potential deadlocks | expand

Commit Message

Greg Kurz March 8, 2021, 12:31 p.m. UTC
A deadlock condition potentially exists if a vhost-user process needs
to request something to QEMU on the slave channel while processing a
vhost-user message.

This doesn't seem to affect any vhost-user implementation so far, but
this is currently biting the upcoming enablement of DAX with virtio-fs.
The issue is being observed when the guest does an emergency reboot while
a mapping still exits in the DAX window, which is very easy to get with
a busy enough workload (e.g. as simulated by blogbench [1]) :

- QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.

- In order to complete the request, virtiofsd then asks QEMU to remove
  the mapping on the slave channel.

All these dialogs are synchronous, hence the deadlock.

As pointed out by Stefan Hajnoczi:

When QEMU's vhost-user master implementation sends a vhost-user protocol
message, vhost_user_read() does a "blocking" read during which slave_fd
is not monitored by QEMU.

As a preliminary step to address this, split vhost_user_read() into a
nested even loop and a one-shot callback that does the actual reading.
A subsequent patch will teach the loop to monitor and process messages
from the slave channel as well.

[1] https://github.com/jedisct1/Blogbench

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 hw/virtio/vhost-user.c | 59 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 5 deletions(-)

Comments

Stefan Hajnoczi March 9, 2021, 3:02 p.m. UTC | #1
On Mon, Mar 08, 2021 at 01:31:38PM +0100, Greg Kurz wrote:
> A deadlock condition potentially exists if a vhost-user process needs
> to request something to QEMU on the slave channel while processing a
> vhost-user message.
> 
> This doesn't seem to affect any vhost-user implementation so far, but
> this is currently biting the upcoming enablement of DAX with virtio-fs.
> The issue is being observed when the guest does an emergency reboot while
> a mapping still exits in the DAX window, which is very easy to get with
> a busy enough workload (e.g. as simulated by blogbench [1]) :
> 
> - QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.
> 
> - In order to complete the request, virtiofsd then asks QEMU to remove
>   the mapping on the slave channel.
> 
> All these dialogs are synchronous, hence the deadlock.
> 
> As pointed out by Stefan Hajnoczi:
> 
> When QEMU's vhost-user master implementation sends a vhost-user protocol
> message, vhost_user_read() does a "blocking" read during which slave_fd
> is not monitored by QEMU.
> 
> As a preliminary step to address this, split vhost_user_read() into a
> nested even loop and a one-shot callback that does the actual reading.

In case you respin:
s/even/event/

> +static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
> +{
> +    struct vhost_user *u = dev->opaque;
> +    CharBackend *chr = u->user->chr;
> +    GMainContext *prev_ctxt = chr->chr->gcontext;
> +    GMainContext *ctxt = g_main_context_new();
> +    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
> +    struct vhost_user_read_cb_data data = {
> +        .dev = dev,
> +        .loop = loop,
> +        .msg = msg,
> +        .ret = 0
> +    };
> +
> +    /* Switch context and add a new watch to monitor chardev activity */
> +    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
> +    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);

This comment could be expanded to explain why the nested event loop is
necessary. The goal is to monitor the slave_fd while waiting for chr
I/O so we'll need an event loop. prev_ctxt cannot be run nested since
its fd handlers may not be prepared (e.g. re-entrancy).
Greg Kurz March 9, 2021, 6:35 p.m. UTC | #2
On Tue, 9 Mar 2021 15:02:48 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Mar 08, 2021 at 01:31:38PM +0100, Greg Kurz wrote:
> > A deadlock condition potentially exists if a vhost-user process needs
> > to request something to QEMU on the slave channel while processing a
> > vhost-user message.
> > 
> > This doesn't seem to affect any vhost-user implementation so far, but
> > this is currently biting the upcoming enablement of DAX with virtio-fs.
> > The issue is being observed when the guest does an emergency reboot while
> > a mapping still exits in the DAX window, which is very easy to get with
> > a busy enough workload (e.g. as simulated by blogbench [1]) :
> > 
> > - QEMU sends VHOST_USER_GET_VRING_BASE to virtiofsd.
> > 
> > - In order to complete the request, virtiofsd then asks QEMU to remove
> >   the mapping on the slave channel.
> > 
> > All these dialogs are synchronous, hence the deadlock.
> > 
> > As pointed out by Stefan Hajnoczi:
> > 
> > When QEMU's vhost-user master implementation sends a vhost-user protocol
> > message, vhost_user_read() does a "blocking" read during which slave_fd
> > is not monitored by QEMU.
> > 
> > As a preliminary step to address this, split vhost_user_read() into a
> > nested even loop and a one-shot callback that does the actual reading.
> 
> In case you respin:
> s/even/event/
> 

Fixed.

> > +static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
> > +{
> > +    struct vhost_user *u = dev->opaque;
> > +    CharBackend *chr = u->user->chr;
> > +    GMainContext *prev_ctxt = chr->chr->gcontext;
> > +    GMainContext *ctxt = g_main_context_new();
> > +    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
> > +    struct vhost_user_read_cb_data data = {
> > +        .dev = dev,
> > +        .loop = loop,
> > +        .msg = msg,
> > +        .ret = 0
> > +    };
> > +
> > +    /* Switch context and add a new watch to monitor chardev activity */
> > +    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
> > +    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
> 
> This comment could be expanded to explain why the nested event loop is
> necessary. The goal is to monitor the slave_fd while waiting for chr
> I/O so we'll need an event loop. prev_ctxt cannot be run nested since
> its fd handlers may not be prepared (e.g. re-entrancy).

Ok, will do.
diff mbox series

Patch

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2fdd5daf74bb..8a0574d5f959 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -294,15 +294,27 @@  static int vhost_user_read_header(struct vhost_dev *dev, VhostUserMsg *msg)
     return 0;
 }
 
-static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
+struct vhost_user_read_cb_data {
+    struct vhost_dev *dev;
+    VhostUserMsg *msg;
+    GMainLoop *loop;
+    int ret;
+};
+
+static gboolean vhost_user_read_cb(GIOChannel *source, GIOCondition condition,
+                                   gpointer opaque)
 {
+    struct vhost_user_read_cb_data *data = opaque;
+    struct vhost_dev *dev = data->dev;
+    VhostUserMsg *msg = data->msg;
     struct vhost_user *u = dev->opaque;
     CharBackend *chr = u->user->chr;
     uint8_t *p = (uint8_t *) msg;
     int r, size;
 
     if (vhost_user_read_header(dev, msg) < 0) {
-        return -1;
+        data->ret = -1;
+        goto end;
     }
 
     /* validate message size is sane */
@@ -310,7 +322,8 @@  static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
         error_report("Failed to read msg header."
                 " Size %d exceeds the maximum %zu.", msg->hdr.size,
                 VHOST_USER_PAYLOAD_SIZE);
-        return -1;
+        data->ret = -1;
+        goto end;
     }
 
     if (msg->hdr.size) {
@@ -320,11 +333,47 @@  static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
         if (r != size) {
             error_report("Failed to read msg payload."
                          " Read %d instead of %d.", r, msg->hdr.size);
-            return -1;
+            data->ret = -1;
+            goto end;
         }
     }
 
-    return 0;
+end:
+    g_main_loop_quit(data->loop);
+    return G_SOURCE_REMOVE;
+}
+
+static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
+{
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->user->chr;
+    GMainContext *prev_ctxt = chr->chr->gcontext;
+    GMainContext *ctxt = g_main_context_new();
+    GMainLoop *loop = g_main_loop_new(ctxt, FALSE);
+    struct vhost_user_read_cb_data data = {
+        .dev = dev,
+        .loop = loop,
+        .msg = msg,
+        .ret = 0
+    };
+
+    /* Switch context and add a new watch to monitor chardev activity */
+    qemu_chr_be_update_read_handlers(chr->chr, ctxt);
+    qemu_chr_fe_add_watch(chr, G_IO_IN | G_IO_HUP, vhost_user_read_cb, &data);
+
+    g_main_loop_run(loop);
+
+    /*
+     * Restore the previous context. This also destroys/recreates event
+     * sources : this guarantees that all pending events in the original
+     * context that have been processed by the nested loop are purged.
+     */
+    qemu_chr_be_update_read_handlers(chr->chr, prev_ctxt);
+
+    g_main_loop_unref(loop);
+    g_main_context_unref(ctxt);
+
+    return data.ret;
 }
 
 static int process_message_reply(struct vhost_dev *dev,