
[v3,00/16] chardev: refactoring & many bugfixes related to tcp_chr_wait_connected

Message ID 20190211182442.8542-1-berrange@redhat.com

Message

Daniel P. Berrangé Feb. 11, 2019, 6:24 p.m. UTC
This is a followup to

  v1: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html
  v2: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg05947.html

This series comes out of a discussion between myself & Yongji Xie in:

  https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg01881.html

I eventually understood that the problem faced was that
tcp_chr_wait_connected was racing with the previously started background
connection attempt, causing two connections to be established. This
broke things because some vhost-user servers only allow a single
connection.

After messing around with the code a lot, the final solution was in fact
very easy. We simply have to delay the first background connection
attempt until the main loop is running. It will then automatically
turn into a no-op if tcp_chr_wait_connected has already been run. This
is dealt with in the last patch in this series.
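
The shape of the fix, as a minimal GLib sketch (all names here are made
up for illustration, this is not the code from the actual patch):

  #include <glib.h>

  typedef struct {
      gboolean connected;     /* set once a connection is established */
      guint connect_idle;     /* idle source for the deferred attempt */
  } SockChardev;

  static void do_connect(SockChardev *s)
  {
      /* ... dial the socket; on success: ... */
      s->connected = TRUE;
  }

  static gboolean deferred_connect(gpointer opaque)
  {
      SockChardev *s = opaque;

      s->connect_idle = 0;
      if (!s->connected) {    /* no-op if wait_connected() already ran */
          do_connect(s);
      }
      return G_SOURCE_REMOVE;
  }

  static void chardev_open(SockChardev *s)
  {
      /* Don't connect immediately; wait until the main loop is running */
      s->connect_idle = g_idle_add(deferred_connect, s);
  }

  static void chardev_wait_connected(SockChardev *s)
  {
      /* Synchronous path, usable before the main loop starts */
      if (!s->connected) {
          do_connect(s);
      }
      /* The deferred attempt would now no-op anyway, but it can
       * also simply be removed */
      if (s->connect_idle) {
          g_source_remove(s->connect_idle);
          s->connect_idle = 0;
      }
  }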

I believe this should solve the problem Yongji Xie faced, and thus not
require us to add support for a "nowait" option with client sockets at
all. The reconnect=1 option effectively already implements nowait
semantics, and now plays nicely with tcp_chr_wait_connected.
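
For example, a client chardev like the following (the id and path are
purely illustrative) keeps retrying in the background rather than
needing a separate nowait flag:

  -chardev socket,id=char0,path=/tmp/vhost-user.sock,reconnect=1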

In investigating this I found various other bugs that needed fixing and
identified some useful refactoring to simplify / clarify the code, hence
this very long series.

This series passes make check-unit & check-qtest (for x86_64 target).

I did a test with tests/vhost-user-bridge too, which was in fact already
broken for the same reason Yongji Xie found - it only accepts a single
connection. This series fixes use of reconnect=1 with vhost-user-bridge.
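
For anyone wanting to reproduce that, something along these lines should
work (the vhost-user-bridge flags are from memory and may differ; the
paths and sizes are purely illustrative):

  # terminal 1: start the bridge listening on a unix socket
  $ ./tests/vhost-user-bridge -u /tmp/vubr.sock

  # terminal 2: QEMU connects as a client, with reconnect enabled
  $ qemu-system-x86_64 \
      -object memory-backend-memfd,id=mem,size=512M,share=on \
      -numa node,memdev=mem \
      -chardev socket,id=char0,path=/tmp/vubr.sock,reconnect=1 \
      -netdev type=vhost-user,id=net0,chardev=char0 \
      -device virtio-net-pci,netdev=net0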

Changed in v3:

 - Fix a race between qio_task_thread_worker unlocking the mutex
   and qio_task_thread_result freeing the mutex (Peter); a sketch
   of this class of race is shown after this list
 - Ensure tcp_chr_wait_connected destroys any pending
   reconnect_timer GSource
 - Use g_setenv/g_unsetenv instead of setenv/unsetenv for
   platform portability
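
To illustrate the teardown race mentioned in the first item, here is a
minimal GLib sketch with made-up names (not the actual io/task.c code).
One common way to avoid the hazard is to reference-count the shared
state so the mutex is only destroyed after both threads are done with
it:

  #include <glib.h>

  typedef struct {
      gint refs;            /* one reference per thread using this struct */
      GMutex lock;
      gboolean done;
  } TaskThreadData;

  static void task_data_unref(TaskThreadData *data)
  {
      if (g_atomic_int_dec_and_test(&data->refs)) {
          /* Last user gone: only now is it safe to destroy the mutex */
          g_mutex_clear(&data->lock);
          g_free(data);
      }
  }

  static gpointer worker(gpointer opaque)
  {
      TaskThreadData *data = opaque;

      g_mutex_lock(&data->lock);
      data->done = TRUE;
      g_mutex_unlock(&data->lock);   /* safe: we still hold a reference */
      task_data_unref(data);
      return NULL;
  }

  void start_task(void)
  {
      TaskThreadData *data = g_new0(TaskThreadData, 1);

      data->refs = 2;                /* worker + completion path */
      g_mutex_init(&data->lock);
      g_thread_unref(g_thread_new("task-worker", worker, data));

      /* ... later, in the completion path ... */
      task_data_unref(data);
  }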

Changed in v2:

 - Rewrite the way we synchronize in tcp_chr_wait_connected again
   to cope with chardev+device hotplug scenario
 - Add extensive unit test coverage
 - Resolve conflicts with git master

Daniel P. Berrangé (16):
  io: store reference to thread information in the QIOTask struct
  io: add qio_task_wait_thread to join with a background thread
  chardev: fix validation of options for QMP created chardevs
  chardev: forbid 'reconnect' option with server sockets
  chardev: forbid 'wait' option with client sockets
  chardev: remove many local variables in qemu_chr_parse_socket
  chardev: ensure qemu_chr_parse_compat reports missing driver error
  chardev: remove unused 'sioc' variable & cleanup paths
  chardev: split tcp_chr_wait_connected into two methods
  chardev: split up qmp_chardev_open_socket connection code
  chardev: use a state machine for socket connection state
  chardev: honour the reconnect setting in tcp_chr_wait_connected
  chardev: disallow TLS/telnet/websocket with tcp_chr_wait_connected
  chardev: fix race with client connections in tcp_chr_wait_connected
  tests: expand coverage of socket chardev test
  chardev: ensure termios is fully initialized

 chardev/char-serial.c          |   2 +-
 chardev/char-socket.c          | 490 ++++++++++++++++++-------
 chardev/char.c                 |   2 +
 include/io/task.h              |  29 +-
 io/task.c                      | 101 ++++--
 io/trace-events                |   2 +
 tests/ivshmem-test.c           |   2 +-
 tests/libqtest.c               |   4 +-
 tests/test-char.c              | 643 ++++++++++++++++++++++++---------
 tests/test-filter-redirector.c |   4 +-
 10 files changed, 939 insertions(+), 340 deletions(-)

Comments

Eric Blake April 22, 2019, 2:51 p.m. UTC | #1
On 2/11/19 12:24 PM, Daniel P. Berrangé wrote:
> [...]

Even with this series applied, I'm still seeing sporadic failures of
iotest 169. Max posted a hack patch a while back that tries to work
around the race:

https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg05907.html

which he originally diagnosed in iotest 147:
https://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html

but as it was a hack, he has not pursued it further, and so the symptoms
are still there, although not completely reproducible:

169 10s ... - output mismatch (see 169.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/169.out	2018-11-16
15:48:12.018526748 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/169.out.bad	2019-04-22
09:38:45.481517132 -0500
@@ -1,3 +1,5 @@
+WARNING:qemu:qemu received signal 11:
/home/eblake/qemu/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
-chardev
socket,id=mon,path=/home/eblake/qemu/tests/qemu-iotests/scratch/tmp4clmPF/qemua-26803-monitor.sock
-mon chardev=mon,mode=control -display none -vga none -qtest
unix:path=/home/eblake/qemu/tests/qemu-iotests/scratch/qemua-26803-qtest.sock
-machine accel=qtest -nodefaults -machine accel=qtest -drive
if=virtio,id=drive0,file=/home/eblake/qemu/tests/qemu-iotests/scratch/disk_a,format=qcow2,cache=writeback

Any chance you can take a look as to what a non-hack fix should be?
Daniel P. Berrangé April 23, 2019, 2:13 p.m. UTC | #2
On Mon, Apr 22, 2019 at 09:51:17AM -0500, Eric Blake wrote:
> On 2/11/19 12:24 PM, Daniel P. Berrangé wrote:
> > [...]
> 
> Even with this series applied, I'm still seeing sporadic failures of
> iotest 169. Max posted a hack patch a while back that tries to work
> around the race:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg05907.html
> 
> which he originally diagnosed in iotest 147:
> https://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html
> 
> but as it was a hack, he has not pursued it further, and so the symptoms
> are still there, although not completely reproducible:
> 
> 169 10s ... - output mismatch (see 169.out.bad)
> --- /home/eblake/qemu/tests/qemu-iotests/169.out	2018-11-16
> 15:48:12.018526748 -0600
> +++ /home/eblake/qemu/tests/qemu-iotests/169.out.bad	2019-04-22
> 09:38:45.481517132 -0500
> @@ -1,3 +1,5 @@
> +WARNING:qemu:qemu received signal 11:
> /home/eblake/qemu/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> -chardev
> socket,id=mon,path=/home/eblake/qemu/tests/qemu-iotests/scratch/tmp4clmPF/qemua-26803-monitor.sock
> -mon chardev=mon,mode=control -display none -vga none -qtest
> unix:path=/home/eblake/qemu/tests/qemu-iotests/scratch/qemua-26803-qtest.sock
> -machine accel=qtest -nodefaults -machine accel=qtest -drive
> if=virtio,id=drive0,file=/home/eblake/qemu/tests/qemu-iotests/scratch/disk_a,format=qcow2,cache=writeback
> 
> Any chance you can take a look as to what a non-hack fix should be?

Oh, it looks like we dropped the ball here. We have a fix already but
it doesn't appear to have been merged for 4.0 :-(

  https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg06174.html

Regards,
Daniel