Message ID | 20210129195631.1577922-1-iris@modwiz.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | chardev/char-io: Fix polling by not removing polls when buffers are full | expand |
On Fri, Jan 29, 2021 at 11:57 PM Iris Johnson <iris@modwiz.com> wrote: > > Currently, the chardev backend code will prepare for IO polling to occur > by potentially adding or removing a watch of the backing channel for the > chardev. The chardev poll is added if the fd_can_read() function reports > more than 0 byte of buffer space, if a poll handler is already setup and > the bufer is now empty, the poll handler is removed. > > This causes a bug where the device buffer becomes ready, but the poll is > blocking on a sleep (potentially forever), because the buffer is small > and fills up immediately, while the backend channel has more data. This > leads to a stall condition or potentially a deadlock in the guest. > > The guest is looping, waiting for data to be reported as ready to read, > the host sees that the buffer is ready for reading and adds the poll, > the poll returns since data is available and data is made available to > the guest. Before the guest code is able to retrieve the data and clear > the full buffer, the poll code runs again, sees that the buffer is now > full, and removes the poll. At this point only a timeout from another > polled source, or another source having it's poll complete will result > in the loop running again to see that the buffer is now ready and to > add the poll again. > > We solve this issue by removing the logic that removes the poll, keeping > the existing logic to only create the poll once there's space for the > first read. > > Buglink: https://bugs.launchpad.net/qemu/+bug/1913341 > Signed-off-by: Iris Johnson <iris@modwiz.com> > --- > chardev/char-io.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/chardev/char-io.c b/chardev/char-io.c > index 8ced184160..fa9e222f78 100644 > --- a/chardev/char-io.c > +++ b/chardev/char-io.c > @@ -50,16 +50,14 @@ static gboolean io_watch_poll_prepare(GSource *source, > return FALSE; > } > > - if (now_active) { > + if (now_active && !was_active) { > iwp->src = qio_channel_create_watch( > iwp->ioc, G_IO_IN | G_IO_ERR | G_IO_HUP | G_IO_NVAL); > g_source_set_callback(iwp->src, iwp->fd_read, iwp->opaque, NULL); > g_source_add_child_source(source, iwp->src); > g_source_unref(iwp->src); > - } else { > - g_source_remove_child_source(source, iwp->src); > - iwp->src = NULL; > } I don't think this works well enough: if the source isn't removed, but fd_can_read() returns 0, there is a potential busy loop on the next fd_read(). My understanding is that if data is read from the frontend, the loop will be re-entered and io_watch_poll_prepare will set the callback again. Could you provide a simple use-case or reproducer where we can evaluate how your patch improves the situation? thanks
> > > I don't think this works well enough: if the source isn't removed, but > fd_can_read() returns 0, there is a potential busy loop on the next > > fd_read(). > This shouldn't happen. The sources all get set to non-blocking mode, the only blocking code is the poll itself. If fd_can_read() returns 0, then the next time fd_read() is called, it will attempt to read zero bytes. The backend logic checks the results of the same method that fd_can_read() calls and sets its read size to that amount, in the case of a full buffer it will read 0 bytes and return. > My understanding is that if data is read from the frontend, the loop > > will be re-entered and io_watch_poll_prepare will set the callback > > again. > This just doesn't happen. The issue is that between the poll being added (and some but not all data being read) and the frontend code getting triggered by the guest, the IO loop runs again and the poll is removed, it then runs again with the poll removed (since the poll is removed during setup) and it's now just going to block because the input fd in question has been "temporarily removed". Except that nothing in the fd set it polls on is now connected to the guest clearing the buffer. Meanwhile the guest reads the data during what can be a potentially infinite block (if nothing else sets the timeout, in my case something in the uart peripheral sets a 1000ms timeout so I could read a byte every second or so in the guest). The guest will now be spinning until the poll is re-added, meanwhile the poll is blocking on a timeout or another fd becoming ready because the buffers are small, the fd in question has already been removed from the set by the time the guest has a chance to clear the buffer. > > Could you provide a simple use-case or reproducer where we can > > evaluate how your patch improves the situation? I can do this, but I don't have anything ready immediately, my test case isn't ideal for others to reproduce. But I can attach one later today when I have that done. Thanks, Iris Johnson
I was working on preparing this test case for the virt machine (I've been using a test machine based around the exynos UART). While looking at the uart code for the virt to find which serial peripheral it uses (and the register layout), I noticed that the pl011 uart code calls qemu_chr_fe_accept_input, in fact it looks like every uart except the exynos UART that I was using has this call. Somehow while I was debugging this issue I missed that call. (In fact adding a call like that was my backup patch to this one), In light of the discovery I'll send a follow up patch that adds this functionality into the exynos4210_uart code instead of modifying the chardev handling itself. You can safely disregard this patch. Sorry about the confusion. Iris On Sat, Jan 30, 2021 at 9:54 AM Iris Johnson <iris@modwiz.com> wrote: > > I don't think this works well enough: if the source isn't removed, but > > > fd_can_read() returns 0, there is a potential busy loop on the next >> > fd_read(). >> > > This shouldn't happen. The sources all get set to non-blocking mode, the > only > blocking code is the poll itself. If fd_can_read() returns 0, then the > next time > fd_read() is called, it will attempt to read zero bytes. The backend logic > checks the > results of the same method that fd_can_read() calls and sets its read size > to that amount, in the case of a full buffer it will read 0 bytes and > return. > > > My understanding is that if data is read from the frontend, the loop >> > will be re-entered and io_watch_poll_prepare will set the callback >> > again. >> > > This just doesn't happen. The issue is that between the poll being added > (and > some but not all data being read) and the frontend code getting triggered > by > the guest, the IO loop runs again and the poll is removed, it then runs > again > with the poll removed (since the poll is removed during setup) and it's now > just going to block because the input fd in question has been "temporarily > removed". Except that nothing in the fd set it polls on is now connected to > the guest clearing the buffer. > > Meanwhile the guest reads the data during what can be a potentially > infinite block (if nothing else sets the timeout, in my case something > in the uart peripheral sets a 1000ms timeout so I could read a byte > every second or so in the guest). The guest will now be spinning until > the poll is re-added, meanwhile the poll is blocking on a timeout or > another > fd becoming ready because the buffers are small, the fd in question has > already been removed from the set by the time the guest has a chance > to clear the buffer. > > >> > Could you provide a simple use-case or reproducer where we can >> > evaluate how your patch improves the situation? > > > I can do this, but I don't have anything ready immediately, my test case > isn't > ideal for others to reproduce. But I can attach one later today when I > have that done. > > Thanks, > Iris Johnson >
diff --git a/chardev/char-io.c b/chardev/char-io.c index 8ced184160..fa9e222f78 100644 --- a/chardev/char-io.c +++ b/chardev/char-io.c @@ -50,16 +50,14 @@ static gboolean io_watch_poll_prepare(GSource *source, return FALSE; } - if (now_active) { + if (now_active && !was_active) { iwp->src = qio_channel_create_watch( iwp->ioc, G_IO_IN | G_IO_ERR | G_IO_HUP | G_IO_NVAL); g_source_set_callback(iwp->src, iwp->fd_read, iwp->opaque, NULL); g_source_add_child_source(source, iwp->src); g_source_unref(iwp->src); - } else { - g_source_remove_child_source(source, iwp->src); - iwp->src = NULL; } + return FALSE; }
Currently, the chardev backend code will prepare for IO polling to occur by potentially adding or removing a watch of the backing channel for the chardev. The chardev poll is added if the fd_can_read() function reports more than 0 byte of buffer space, if a poll handler is already setup and the bufer is now empty, the poll handler is removed. This causes a bug where the device buffer becomes ready, but the poll is blocking on a sleep (potentially forever), because the buffer is small and fills up immediately, while the backend channel has more data. This leads to a stall condition or potentially a deadlock in the guest. The guest is looping, waiting for data to be reported as ready to read, the host sees that the buffer is ready for reading and adds the poll, the poll returns since data is available and data is made available to the guest. Before the guest code is able to retrieve the data and clear the full buffer, the poll code runs again, sees that the buffer is now full, and removes the poll. At this point only a timeout from another polled source, or another source having it's poll complete will result in the loop running again to see that the buffer is now ready and to add the poll again. We solve this issue by removing the logic that removes the poll, keeping the existing logic to only create the poll once there's space for the first read. Buglink: https://bugs.launchpad.net/qemu/+bug/1913341 Signed-off-by: Iris Johnson <iris@modwiz.com> --- chardev/char-io.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)