Message ID | 20190211182442.8542-1-berrange@redhat.com
---|---
Series | chardev: refactoring & many bugfixes related tcp_chr_wait_connected
On 2/11/19 12:24 PM, Daniel P. Berrangé wrote:
> This is a followup to
>
> v1: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg03344.html
> v2: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg05947.html
>
> This series comes out of a discussion between myself & Yongji Xie in:
>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg01881.html
>
> I eventually understood that the problem faced was that
> tcp_chr_wait_connected was racing with the background connection attempt
> previously started, causing two connections to be established. This
> broke because some vhost-user servers only allow a single connection.
>
> After messing around with the code a lot, the final solution was in fact
> very easy. We simply have to delay the first background connection
> attempt until the main loop is running. It will then automatically
> turn into a no-op if tcp_chr_wait_connected has been run. This is
> dealt with in the last patch in this series.
>
> I believe this should solve the problem Yongji Xie faced, and thus not
> require us to add support for a "nowait" option with client sockets at
> all. The reconnect=1 option effectively already implements nowait
> semantics, and now plays nicely with tcp_chr_wait_connected.
>
> In investigating this I found various other bugs that needed fixing and
> identified some useful refactoring to simplify/clarify the code, hence
> this very long series.

Even with this series applied, I'm still seeing sporadic failures of
iotest 169. Max posted a hack patch a while back that tries to work
around the race:

https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg05907.html

which he originally diagnosed in iotest 147:

https://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html

but as it was a hack, he has not pursued it further, and so the symptoms
are still there, although not completely reproducible:

169 10s ... - output mismatch (see 169.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/169.out     2018-11-16 15:48:12.018526748 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/169.out.bad 2019-04-22 09:38:45.481517132 -0500
@@ -1,3 +1,5 @@
+WARNING:qemu:qemu received signal 11:
/home/eblake/qemu/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
-chardev socket,id=mon,path=/home/eblake/qemu/tests/qemu-iotests/scratch/tmp4clmPF/qemua-26803-monitor.sock
-mon chardev=mon,mode=control -display none -vga none
-qtest unix:path=/home/eblake/qemu/tests/qemu-iotests/scratch/qemua-26803-qtest.sock
-machine accel=qtest -nodefaults -machine accel=qtest
-drive if=virtio,id=drive0,file=/home/eblake/qemu/tests/qemu-iotests/scratch/disk_a,format=qcow2,cache=writeback

Any chance you can take a look at what a non-hack fix should be?
On Mon, Apr 22, 2019 at 09:51:17AM -0500, Eric Blake wrote:
> On 2/11/19 12:24 PM, Daniel P. Berrangé wrote:
> > [...]
>
> Even with this series applied, I'm still seeing sporadic failures of
> iotest 169. Max posted a hack patch a while back that tries to work
> around the race:
>
> https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg05907.html
>
> which he originally diagnosed in iotest 147:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html
>
> but as it was a hack, he has not pursued it further, and so the symptoms
> are still there, although not completely reproducible:
>
> [...]
>
> Any chance you can take a look at what a non-hack fix should be?

Oh, it looks like we dropped the ball here. We have a fix already but it
doesn't appear to have been merged for 4.0 :-(

https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg06174.html

Regards,
Daniel