
tests/qemu_iotests: Minimize usage of used ports

Message ID 20200203075955.28861-1-ldoktor@redhat.com (mailing list archive)
State New, archived
Series: tests/qemu_iotests: Minimize usage of used ports

Commit Message

Lukáš Doktor Feb. 3, 2020, 7:59 a.m. UTC
Using a range of ports from 32768 to 65536 is dangerous, as some
application might already be listening there and interfere with the
testing. There is no way to reserve ports, but let's decrease the chance
of interference by only using ports that were free at the time this
module was imported.

Without this patch the CI occasionally fails because of ports already in
use. Additionally, I tried listening on the first port to be tried via
"nc -l localhost $port", and no matter how many other ports were
available it always hung indefinitely.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
---
 tests/qemu-iotests/147        | 43 ++++++++++++++++-------
 tests/qemu-iotests/iotests.py | 64 +++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 13 deletions(-)

Comments

Eric Blake Feb. 3, 2020, 3:32 p.m. UTC | #1
On 2/3/20 1:59 AM, Lukáš Doktor wrote:
> Using a range of ports from 32768 to 65538 is dangerous as some
> application might already be listening there and interfere with the
> testing. There is no way to reserve ports, but let's decrease the chance
> of interactions by only using ports that were free at the time of
> importing this module.
> 
> Without this patch CI occasionally fails with used ports. Additionally I
> tried listening on the first port to be tried via "nc -l localhost
> $port" and no matter how many other ports were available it always
> hanged for infinity.
> 
> Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
> ---
>   tests/qemu-iotests/147        | 43 ++++++++++++++++-------
>   tests/qemu-iotests/iotests.py | 64 +++++++++++++++++++++++++++++++++++
>   2 files changed, 94 insertions(+), 13 deletions(-)

Is it worth sharing the logic already present in common.nbd's 
nbd_server_start_tcp_socket shell function?  (Oh right, that's shell, 
this is python).  It seems like we keep reinventing logic to pick a safe 
port.

> 
> diff --git a/tests/qemu-iotests/147 b/tests/qemu-iotests/147
> index 2b6f859a09..4d0e1418bb 100755
> --- a/tests/qemu-iotests/147
> +++ b/tests/qemu-iotests/147
> @@ -26,10 +26,8 @@ import time
>   import iotests
>   from iotests import cachemode, aiomode, imgfmt, qemu_img, qemu_nbd, qemu_nbd_early_pipe
>   
> -NBD_PORT_START      = 32768
> -NBD_PORT_END        = NBD_PORT_START + 1024
> -NBD_IPV6_PORT_START = NBD_PORT_END
> -NBD_IPV6_PORT_END   = NBD_IPV6_PORT_START + 1024
> +NBD_PORTS = iotests.find_free_ports(32768, 65536, 1024)
> +NBD_IPV6_PORTS = iotests.find_free_ports(NBD_PORTS[-1] + 1, 65536, 1024)

The changes here look sane...

> +++ b/tests/qemu-iotests/iotests.py
> @@ -20,6 +20,7 @@ from __future__ import print_function
>   import errno
>   import os
>   import re
> +import socket
>   import subprocess
>   import string
>   import unittest
> @@ -75,6 +76,69 @@ luks_default_secret_object = 'secret,id=keysec0,data=' + \
>   luks_default_key_secret_opt = 'key-secret=keysec0'
>   
>   
> +def is_port_free(port, address):

...and I'm glad you're adding a reusable method here.

> +    """
> +    Return True if the given port is available for use.
> +
> +    Currently we only check for TCP/UDP connections on IPv4/6
> +    Backported from avocado.utils.network
> +
> +    :param port: Port number
> +    :param address: Socket address to bind or connect
> +    """
> +    families = (socket.AF_INET, socket.AF_INET6)
> +    if address == "localhost" or not address:
> +        localhost = True
> +        protocols = (socket.SOCK_STREAM, socket.SOCK_DGRAM)
> +    else:
> +        localhost = False
> +        # sock.connect always connects for UDP
> +        protocols = (socket.SOCK_STREAM, )
> +    sock = None
> +    try:
> +        for family in families:
> +            for protocol in protocols:
> +                try:
> +                    sock = socket.socket(family, protocol)
> +                    if localhost:
> +                        sock.bind(("", port))
> +                    else:
> +                        sock.connect((address, port))
> +                        return False
> +                except socket.error as exc:
> +                    if exc.errno in (93, 94):   # Unsupported combinations

Ouch - that seems rather hard-coded (not all the world uses the same 
errno values as Linux).  Does python have symbolic names for 
EPROTONOSUPPORT and ESOCKTNOSUPPORT?
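
Python's errno module does define both names, and iotests.py already imports
errno, so the hard-coded pair could be spelled symbolically. A small sketch;
the helper name is purely illustrative:

    import errno

    # Symbolic spelling of the hard-coded (93, 94) pair; on Linux these
    # are exactly EPROTONOSUPPORT and ESOCKTNOSUPPORT.
    UNSUPPORTED_COMBINATIONS = (errno.EPROTONOSUPPORT, errno.ESOCKTNOSUPPORT)

    def is_unsupported_combination(exc):
        # usable in the except clause above as:
        #     if is_unsupported_combination(exc): continue
        return exc.errno in UNSUPPORTED_COMBINATIONS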

> +                        continue
> +                    if localhost:
> +                        return False
> +                sock.close()
> +        return True
> +    finally:
> +        if sock is not None:
> +            sock.close()
> +
> +
> +def find_free_ports(start_port, end_port, count, address="localhost"):
> +    """
> +    Return count of host free ports in the range [start_port, end_port].
> +
> +    Backported from avocado.utils.network
> +
> +    :param start_port: header of candidate port range
> +    :param end_port: ender of candidate port range

s/ender/end/

> +    :param count: Initial number of ports known to be free in the range.
> +    :param address: Socket address to bind or connect
> +    """
> +    ports = []
> +    port_range = range(start_port, end_port)
> +    for i in port_range:
> +        if is_port_free(i, address):
> +            ports.append(i)
> +        if len(ports) >= count:
> +            break
> +
> +    return ports
> +
> +
>   def qemu_img(*args):
>       '''Run qemu-img and return the exit code'''
>       devnull = open('/dev/null', 'r+')
> 

I'd like a second review on this (my python is tolerable, but not 
strong), but I'm happy to take it through my NBD tree if no one else 
speaks up first.
Max Reitz Feb. 6, 2020, 3:03 p.m. UTC | #2
On 03.02.20 08:59, Lukáš Doktor wrote:
> Using a range of ports from 32768 to 65538 is dangerous as some
> application might already be listening there and interfere with the
> testing. There is no way to reserve ports, but let's decrease the chance
> of interactions by only using ports that were free at the time of
> importing this module.
> 
> Without this patch CI occasionally fails with used ports. Additionally I
> tried listening on the first port to be tried via "nc -l localhost
> $port" and no matter how many other ports were available it always
> hanged for infinity.

I’m afraid I don’t quite understand.  The new functions check whether
the ports are available for use by creating a server on them (i.e.,
binding a socket there).  The current code just lets qemu create a
server there, and see if that works or not.

So the only difference I can see is that instead of trying out random
ports during the test and see whether they’re free to use we do this
check only once when the test is started.

And the only problem I can imagine from your description is that there
is some other tool on the system that tries to set up a server but
cannot because we already have an NBD server there (by accident).

But I don’t see how checking for free ports once at startup solves that
problem reliably.

If what I guessed above is right, the only reliable solution I can
imagine would be to allow users to specify the port range through
environment variables, and then you’d have to specify a range that you
know is free for use.
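
For instance (variable names purely illustrative), the test's constants could
be seeded from the environment, falling back to the current defaults:

    import os

    # Hypothetical knobs: let the user pin a range known to be free on
    # their machine, defaulting to the test's current values.
    NBD_PORT_START = int(os.environ.get("QEMU_NBD_TEST_PORT_START", 32768))
    NBD_PORT_END = int(os.environ.get("QEMU_NBD_TEST_PORT_END",
                                      NBD_PORT_START + 1024))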

Max
Lukáš Doktor Feb. 6, 2020, 4:27 p.m. UTC | #3
Dne 06. 02. 20 v 16:03 Max Reitz napsal(a):
> On 03.02.20 08:59, Lukáš Doktor wrote:
>> Using a range of ports from 32768 to 65538 is dangerous as some
>> application might already be listening there and interfere with the
>> testing. There is no way to reserve ports, but let's decrease the chance
>> of interactions by only using ports that were free at the time of
>> importing this module.
>>
>> Without this patch CI occasionally fails with used ports. Additionally I
>> tried listening on the first port to be tried via "nc -l localhost
>> $port" and no matter how many other ports were available it always
>> hanged for infinity.
> 
> I’m afraid I don’t quite understand.  The new functions check whether
> the ports are available for use by creating a server on them (i.e.,
> binding a socket there).  The current code just lets qemu create a
> server there, and see if that works or not.
> 
> So the only difference I can see is that instead of trying out random
> ports during the test and see whether they’re free to use we do this
> check only once when the test is started.
> 
> And the only problem I can imagine from your description is that there
> is some other tool on the system that tries to set up a server but
> cannot because we already have an NBD server there (by accident).
> 
> But I don’t see how checking for free ports once at startup solves that
> problem reliably.
> 
> If what I guessed above is right, the only reliable solution I can
> imagine would be to allow users to specify the port range through
> environment variables, and then you’d have to specify a range that you
> know is free for use.
> 
> Max
> 

Hello Max,

thank you and I am sorry for not digging deep enough. This week my CI failed with:

01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
01:24:06 DEBUG| [stdout] +----------------------------------------------------------------------
01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
01:24:06 DEBUG| [stdout] +    self.vm.launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 302, in launch
01:24:06 DEBUG| [stdout] +    self._launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 319, in _launch
01:24:06 DEBUG| [stdout] +    self._pre_launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py", line 106, in _pre_launch
01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine, self)._pre_launch()
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 270, in _pre_launch
01:24:06 DEBUG| [stdout] +    self._qmp = qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py", line 60, in __init__
01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use

I made the mistake of reproducing this on my home system using the qemu revision that I had and assuming it was caused by a used port. So I limited the port range and used nc to occupy the port. It sort of reproduced, but instead of "Address already in use" it hung until I killed the nc process. Then it failed with:

+Traceback (most recent call last):
+  File "147", line 124, in test_inet
+    flatten_sock_addr(address))
+  File "147", line 59, in client_test
+    self.assert_qmp(result, 'return', {})
+  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 821, in assert_qmp
+    result = self.dictpath(d, path)
+  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 797, in dictpath
+    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
+AssertionError: failed path traversal for "return" in "{'error': {'class': 'GenericError', 'desc': 'Failed to read initial magic: Unexpected end-of-file before all bytes were read'}}"

After a brief study I thought qemu was not doing the job well enough and wanted to add a protection. After a more thorough look, however, I came to a different conclusion: we are facing the same issue as with incoming migration about a year ago. What happened is that I started "nc -l localhost 32789", which results in:

COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)

Then we start the server via "_try_server_up", where qemu-nbd finds the port occupied on IPv6 but available on IPv4, so it claims the IPv4 side:
COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nc        26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP localhost:32789 (LISTEN)

and reports success. Then we try to connect, but the hotplugged VM first attempts to connect on the IPv6 address and hangs indefinitely.
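
A minimal illustration of this collision (a standalone sketch, not part of the test): an IPv6 listener on ::1 does not prevent a second process from binding the same port number on 127.0.0.1.

    import socket

    # What "nc -l localhost $port" ended up holding: an IPv6-only listener.
    v6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    v6.bind(("::1", 0))            # port 0: let the kernel pick a free port
    v6.listen(1)
    port = v6.getsockname()[1]

    # What qemu-nbd then did: fall back to IPv4 on the very same port number.
    v4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    v4.bind(("127.0.0.1", port))   # succeeds: different address family
    v4.listen(1)
    print("port %d is LISTENing on both ::1 and 127.0.0.1" % port)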

Now, is this expected behavior? If so, then we need the find_free_ports check (but preferably directly in _try_server_up, just before starting qemu-nbd) to leave as small a window for collision as possible. Otherwise the test is alright and qemu-nbd needs a fix to bail out in case some address is already in use (IIRC this is what incoming migration does).


My second mistake was testing this on the old code base and rebasing it only before sending the patch (without testing after the rebase). Had I tested it first, I would have found out that the real reproducer is simply running the test, as commit 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.


So basically there are 2 actions:

1. Fix the test, as on my system it fails in 100% of cases; bisect says the first bad commit is 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time to dig into this? I have already spent way too much time on it and don't really know what that commit is trying to do.
2. Decide on the behavior when the IPv4/6 port is already in use (bail out or start anyway).
2a. If it should bail out, then the test is correct and there is no need for my patch. On the other hand, qemu-nbd has to be fixed.
2b. Otherwise I can send a v2 that checks the port in _try_server_up just before starting qemu-nbd to minimize the risk of using an occupied port (or, should you decide it's not worth checking, I can simply forget about this).

Regards,
Lukáš
Max Reitz Feb. 6, 2020, 4:37 p.m. UTC | #4
On 06.02.20 17:27, Lukáš Doktor wrote:
> Dne 06. 02. 20 v 16:03 Max Reitz napsal(a):
>> On 03.02.20 08:59, Lukáš Doktor wrote:
>>> Using a range of ports from 32768 to 65538 is dangerous as some
>>> application might already be listening there and interfere with the
>>> testing. There is no way to reserve ports, but let's decrease the chance
>>> of interactions by only using ports that were free at the time of
>>> importing this module.
>>>
>>> Without this patch CI occasionally fails with used ports. Additionally I
>>> tried listening on the first port to be tried via "nc -l localhost
>>> $port" and no matter how many other ports were available it always
>>> hanged for infinity.
>>
>> I’m afraid I don’t quite understand.  The new functions check whether
>> the ports are available for use by creating a server on them (i.e.,
>> binding a socket there).  The current code just lets qemu create a
>> server there, and see if that works or not.
>>
>> So the only difference I can see is that instead of trying out random
>> ports during the test and see whether they’re free to use we do this
>> check only once when the test is started.
>>
>> And the only problem I can imagine from your description is that there
>> is some other tool on the system that tries to set up a server but
>> cannot because we already have an NBD server there (by accident).
>>
>> But I don’t see how checking for free ports once at startup solves that
>> problem reliably.
>>
>> If what I guessed above is right, the only reliable solution I can
>> imagine would be to allow users to specify the port range through
>> environment variables, and then you’d have to specify a range that you
>> know is free for use.
>>
>> Max
>>
> 
> Hello Max,
> 
> thank you and I am sorry for not digging deep enough. This week my CI failed with:
> 
> 01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
> 01:24:06 DEBUG| [stdout] +----------------------------------------------------------------------
> 01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
> 01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
> 01:24:06 DEBUG| [stdout] +    self.vm.launch()
> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 302, in launch
> 01:24:06 DEBUG| [stdout] +    self._launch()
> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 319, in _launch
> 01:24:06 DEBUG| [stdout] +    self._pre_launch()
> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py", line 106, in _pre_launch
> 01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine, self)._pre_launch()
> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 270, in _pre_launch
> 01:24:06 DEBUG| [stdout] +    self._qmp = qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py", line 60, in __init__
> 01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
> 01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use
> 
> I made the mistake of reproducing this on my home system using the qemu revision that I had and assuming it's caused by a used port. So I limited the port range and used nc to occupy the port. It sort-of reproduced but instead of Address already in use it hanged until I kill the nc process. Then it failed with:
> 
> +Traceback (most recent call last):
> +  File "147", line 124, in test_inet
> +    flatten_sock_addr(address))
> +  File "147", line 59, in client_test
> +    self.assert_qmp(result, 'return', {})
> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 821, in assert_qmp
> +    result = self.dictpath(d, path)
> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 797, in dictpath
> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
> +AssertionError: failed path traversal for "return" in "{'error': {'class': 'GenericError', 'desc': 'Failed to read initial magic: Unexpected end-of-file before all bytes were read'}}"
> 
> After a brief study I thought qemu is not doing the job well enough and wanted to add a protection. Anyway after a more thorough overview I came to a different conclusion and that is that we are facing the same issue as with incoming migration about a year ago. What happened is that I started "nc -l localhost 32789" which results in:
> 
> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
> 
> Then we start the server by "_try_server_up" where qemu-nbd detects the port is occupied on IPv6 but available on IPv4, so it claims it:
> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> nc        26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
> qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP localhost:32789 (LISTEN)
> 
> and reports success. Then we try to connect but the hotplugged VM first attempts to connect on the IPv6 address and hangs for infinity.
> 
> Now is this an expected behavior? If so then we need the find_free_address (but preferably directly in _try_server_up just before starting the qemu-nbd) to leave as little time-frame for collision as possible. Otherwise the test is alright and qemu-nbd needs a fix to bail out in case some address is already used (IIRC this is what incoming migration does).

Ah, OK.

Well, expected behavior...  It’s a shame, that’s what it is.

> My second mistake was testing this on the old code-base and rebasing it only before sending the patch (without testing after the rebase). If I were to test it first, I would have found out that the real reproducer is simply running the test as the commit 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.
> 
> 
> So basically there are 2 actions:
> 
> 1. fix the test as on my system it fails in 100% of cases, bisect says the first bad commit is 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in digging into this? I already spent way too much time on this and don't really know what that commit is trying to do.

Yep, I’ve sent a patch:

https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00294.html

> 2. decide on the behavior when IPv4/6 is already in use (bail-out or start).
> 2a. In case it should bail-out than the test is correct and there is no need for my patch. On the other hand qemu-nbd has to be fixed

I don’t think it makes much sense to let qemu’s NBD server ensure that
it claims both IPv4 and IPv6 in case the user specifies a
non-descriptive hostname.

> 2b. Otherwise I can send a v2 that will check the port in the _try_server_up just before starting qemu-nbd to minimize the risk of using a utilized port (or should you decide it's not worth checking, I can simply forget about this)

Hm.  It wouldn’t be fully reliable, but, well...  The risk would be minimal.

OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
test?  We have specific cases for IPv6, so I think it makes sense to
force IPv4 in all other cases.

Max
Eric Blake Feb. 6, 2020, 4:48 p.m. UTC | #5
On 2/6/20 10:37 AM, Max Reitz wrote:

>>
>> thank you and I am sorry for not digging deep enough. This week my CI failed with:
>>
>> 01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
>> 01:24:06 DEBUG| [stdout] +----------------------------------------------------------------------
>> 01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
>> 01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
>> 01:24:06 DEBUG| [stdout] +    self.vm.launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 302, in launch
>> 01:24:06 DEBUG| [stdout] +    self._launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 319, in _launch
>> 01:24:06 DEBUG| [stdout] +    self._pre_launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py", line 106, in _pre_launch
>> 01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine, self)._pre_launch()
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py", line 270, in _pre_launch
>> 01:24:06 DEBUG| [stdout] +    self._qmp = qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
>> 01:24:06 DEBUG| [stdout] +  File "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py", line 60, in __init__
>> 01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
>> 01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use

Was this test 147?  If so, see:
https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg01469.html

because that failure matches what I was seeing.

>>
>> I made the mistake of reproducing this on my home system using the qemu revision that I had and assuming it's caused by a used port. So I limited the port range and used nc to occupy the port. It sort-of reproduced but instead of Address already in use it hanged until I kill the nc process. Then it failed with:
>>
>> +Traceback (most recent call last):
>> +  File "147", line 124, in test_inet
>> +    flatten_sock_addr(address))
>> +  File "147", line 59, in client_test
>> +    self.assert_qmp(result, 'return', {})
>> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 821, in assert_qmp
>> +    result = self.dictpath(d, path)
>> +  File "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line 797, in dictpath
>> +    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
>> +AssertionError: failed path traversal for "return" in "{'error': {'class': 'GenericError', 'desc': 'Failed to read initial magic: Unexpected end-of-file before all bytes were read'}}"
>>

That's a secondary failure; I assume that if the initial bug is fixed we are 
less likely to hit the secondary one, but the secondary one may still be 
worth fixing.

>> After a brief study I thought qemu is not doing the job well enough and wanted to add a protection. Anyway after a more thorough overview I came to a different conclusion and that is that we are facing the same issue as with incoming migration about a year ago. What happened is that I started "nc -l localhost 32789" which results in:
>>
>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>> nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
>>
>> Then we start the server by "_try_server_up" where qemu-nbd detects the port is occupied on IPv6 but available on IPv4, so it claims it:
>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>> nc        26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789 (LISTEN)
>> qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP localhost:32789 (LISTEN)
>>
>> and reports success. Then we try to connect but the hotplugged VM first attempts to connect on the IPv6 address and hangs for infinity.
>>
>> Now is this an expected behavior? If so then we need the find_free_address (but preferably directly in _try_server_up just before starting the qemu-nbd) to leave as little time-frame for collision as possible. Otherwise the test is alright and qemu-nbd needs a fix to bail out in case some address is already used (IIRC this is what incoming migration does).
> 
> Ah, OK.
> 
> Well, expected behavior...  It’s a shame, that’s what it is.

In libnbd, we recently improved the testsuite by switching over to 
systemd-style fd passing: instead of asking qemu-nbd to open a random 
port (and hoping it is available), we instead pre-open the port (where 
failure is under our control) and then pass in that fd with environment 
variables to qemu-nbd, which in turn guarantees that qemu-nbd won't hit 
failures in trying to use the port.  Maybe we should utilize that more 
in qemu's own testsuite.
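
A rough sketch of how that could look from Python, assuming qemu-nbd's
systemd socket-activation support (LISTEN_PID/LISTEN_FDS with the first
passed socket at fd 3) and a hypothetical test.img; the fd handling is
simplified:

    import os
    import socket
    import subprocess

    # Pre-open the listening socket ourselves; binding to port 0 lets the
    # kernel pick a genuinely free port, so there is no guessing involved.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(1)
    port = sock.getsockname()[1]
    sock.set_inheritable(True)       # keep the fd open across exec

    def preexec():
        # Runs in the child between fork() and exec(): the activation
        # protocol wants the socket at fd 3 and LISTEN_PID set to the
        # child's own pid.
        if sock.fileno() != 3:
            os.dup2(sock.fileno(), 3)
        os.environ["LISTEN_FDS"] = "1"
        os.environ["LISTEN_PID"] = str(os.getpid())

    server = subprocess.Popen(["qemu-nbd", "--format=raw", "test.img"],
                              close_fds=False, preexec_fn=preexec)
    print("qemu-nbd should be serving nbd://127.0.0.1:%d" % port)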

Also, I need to revisit my proposed patches for letting qemu-nbd support 
TLS over Unix sockets, as that's another way to avoid TCP contention 
(right now, qemu has an anachronistic prohibition against the 
combination of TLS and Unix sockets).

> 
>> My second mistake was testing this on the old code-base and rebasing it only before sending the patch (without testing after the rebase). If I were to test it first, I would have found out that the real reproducer is simply running the test as the commit 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.
>>
>>
>> So basically there are 2 actions:
>>
>> 1. fix the test as on my system it fails in 100% of cases, bisect says the first bad commit is 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in digging into this? I already spent way too much time on this and don't really know what that commit is trying to do.
> 
> Yep, I’ve sent a patch:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00294.html

Ah, so we did notice the same problem.

> 
>> 2. decide on the behavior when IPv4/6 is already in use (bail-out or start).
>> 2a. In case it should bail-out than the test is correct and there is no need for my patch. On the other hand qemu-nbd has to be fixed
> 
> I don’t think it makes much sense to let qemu’s NBD server ensure that
> it claims both IPv4 and IPv6 in case the user specifies a
> non-descriptive hostname.
> 
>> 2b. Otherwise I can send a v2 that will check the port in the _try_server_up just before starting qemu-nbd to minimize the risk of using a utilized port (or should you decide it's not worth checking, I can simply forget about this)
> 
> Hm.  It wouldn’t be fully reliable, but, well...  The risk would be minimal.
> 
> OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
> test?  We have specific cases for IPv6, so I think it makes sense to
> force IPv4 in all other cases.

Except then it will fail on machines configured for IPv6-only.
Max Reitz Feb. 6, 2020, 4:59 p.m. UTC | #6
On 06.02.20 17:48, Eric Blake wrote:
> On 2/6/20 10:37 AM, Max Reitz wrote:
> 
>>>
>>> thank you and I am sorry for not digging deep enough. This week my CI
>>> failed with:
>>>
>>> 01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
>>> 01:24:06 DEBUG| [stdout]
>>> +----------------------------------------------------------------------
>>> 01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
>>> 01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
>>> 01:24:06 DEBUG| [stdout] +    self.vm.launch()
>>> 01:24:06 DEBUG| [stdout] +  File
>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>> line 302, in launch
>>> 01:24:06 DEBUG| [stdout] +    self._launch()
>>> 01:24:06 DEBUG| [stdout] +  File
>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>> line 319, in _launch
>>> 01:24:06 DEBUG| [stdout] +    self._pre_launch()
>>> 01:24:06 DEBUG| [stdout] +  File
>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py",
>>> line 106, in _pre_launch
>>> 01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine,
>>> self)._pre_launch()
>>> 01:24:06 DEBUG| [stdout] +  File
>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>> line 270, in _pre_launch
>>> 01:24:06 DEBUG| [stdout] +    self._qmp =
>>> qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
>>> 01:24:06 DEBUG| [stdout] +  File
>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py",
>>> line 60, in __init__
>>> 01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
>>> 01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use
> 
> Was this test 147?  If so, see:
> https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg01469.html
> 
> because that failure matches what I was seeing.
> 
>>>
>>> I made the mistake of reproducing this on my home system using the
>>> qemu revision that I had and assuming it's caused by a used port. So
>>> I limited the port range and used nc to occupy the port. It sort-of
>>> reproduced but instead of Address already in use it hanged until I
>>> kill the nc process. Then it failed with:
>>>
>>> +Traceback (most recent call last):
>>> +  File "147", line 124, in test_inet
>>> +    flatten_sock_addr(address))
>>> +  File "147", line 59, in client_test
>>> +    self.assert_qmp(result, 'return', {})
>>> +  File
>>> "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line
>>> 821, in assert_qmp
>>> +    result = self.dictpath(d, path)
>>> +  File
>>> "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line
>>> 797, in dictpath
>>> +    self.fail('failed path traversal for "%s" in "%s"' % (path,
>>> str(d)))
>>> +AssertionError: failed path traversal for "return" in "{'error':
>>> {'class': 'GenericError', 'desc': 'Failed to read initial magic:
>>> Unexpected end-of-file before all bytes were read'}}"
>>>
> 
> That's a secondary failure, I assume if the initial bug is fixed we are
> less likely to hit the secondary one; but the secondary one may still be
> worth fixing.
> 
>>> After a brief study I thought qemu is not doing the job well enough
>>> and wanted to add a protection. Anyway after a more thorough overview
>>> I came to a different conclusion and that is that we are facing the
>>> same issue as with incoming migration about a year ago. What happened
>>> is that I started "nc -l localhost 32789" which results in:
>>>
>>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>>> nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789
>>> (LISTEN)
>>>
>>> Then we start the server by "_try_server_up" where qemu-nbd detects
>>> the port is occupied on IPv6 but available on IPv4, so it claims it:
>>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>>> nc        26758 medic    3u  IPv6 9579487      0t0  TCP
>>> localhost:32789 (LISTEN)
>>> qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP
>>> localhost:32789 (LISTEN)
>>>
>>> and reports success. Then we try to connect but the hotplugged VM
>>> first attempts to connect on the IPv6 address and hangs for infinity.
>>>
>>> Now is this an expected behavior? If so then we need the
>>> find_free_address (but preferably directly in _try_server_up just
>>> before starting the qemu-nbd) to leave as little time-frame for
>>> collision as possible. Otherwise the test is alright and qemu-nbd
>>> needs a fix to bail out in case some address is already used (IIRC
>>> this is what incoming migration does).
>>
>> Ah, OK.
>>
>> Well, expected behavior...  It’s a shame, that’s what it is.
> 
> In libnbd, we recently improved the testsuite by switching over to
> systemd-style fd passing: instead of asking qemu-nbd to open a random
> port (and hoping it is available), we instead pre-open the port (where
> failure is under our control) and then pass in that fd with environment
> variables to qemu-nbd, which in turn guarantees that qemu-nbd won't hit
> failures in trying to use the port.  Maybe we should utilize that more
> in qemu's own testsuite.

I suppose you’re welcome, but it sounds a bit like overkill for now. :-)

> Also, I need to revisit my proposed patches for letting qemu-nbd support
> TLS over Unix sockets, as that's another way to avoid TCP contention
> (right now, qemu has an anachronistic prohibition against the
> combination of TLS and Unix sockets).

147 specifically wants to test inet, though (among other things).
Support for TLS over Unix sockets won’t change that.  (I suppose it will
reduce the number of inet NBD sockets in other tests, though.)

>>> My second mistake was testing this on the old code-base and rebasing
>>> it only before sending the patch (without testing after the rebase).
>>> If I were to test it first, I would have found out that the real
>>> reproducer is simply running the test as the commit
>>> 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.
>>>
>>>
>>> So basically there are 2 actions:
>>>
>>> 1. fix the test as on my system it fails in 100% of cases, bisect
>>> says the first bad commit is
>>> 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in
>>> digging into this? I already spent way too much time on this and
>>> don't really know what that commit is trying to do.
>>
>> Yep, I’ve sent a patch:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00294.html
> 
> Ah, so we did notice the same problem.
> 
>>
>>> 2. decide on the behavior when IPv4/6 is already in use (bail-out or
>>> start).
>>> 2a. In case it should bail-out than the test is correct and there is
>>> no need for my patch. On the other hand qemu-nbd has to be fixed
>>
>> I don’t think it makes much sense to let qemu’s NBD server ensure that
>> it claims both IPv4 and IPv6 in case the user specifies a
>> non-descriptive hostname.
>>
>>> 2b. Otherwise I can send a v2 that will check the port in the
>>> _try_server_up just before starting qemu-nbd to minimize the risk of
>>> using a utilized port (or should you decide it's not worth checking,
>>> I can simply forget about this)
>>
>> Hm.  It wouldn’t be fully reliable, but, well...  The risk would be
>> minimal.
>>
>> OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
>> test?  We have specific cases for IPv6, so I think it makes sense to
>> force IPv4 in all other cases.
> 
> Except then it will fail on machines configured for IPv6-only.

So we’ll just have to test whether IPv4 works, just like we already do
for IPv6, no?
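
A minimal sketch of such a probe, mirroring the existing "IPv6 not
available, skip" check in the test (the helper name is purely illustrative):

    import socket

    def loopback_works(family, host):
        # Can we bind a listening socket on this loopback address at all?
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.bind((host, 0))
                sock.listen(1)
            return True
        except OSError:
            return False

    have_ipv4 = loopback_works(socket.AF_INET, "127.0.0.1")
    have_ipv6 = loopback_works(socket.AF_INET6, "::1")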

Max
Lukáš Doktor Feb. 6, 2020, 6:33 p.m. UTC | #7
Dne 06. 02. 20 v 17:59 Max Reitz napsal(a):
> On 06.02.20 17:48, Eric Blake wrote:
>> On 2/6/20 10:37 AM, Max Reitz wrote:
>>
>>>>
>>>> thank you and I am sorry for not digging deep enough. This week my CI
>>>> failed with:
>>>>
>>>> 01:24:06 DEBUG| [stdout] +ERROR: test_inet (__main__.QemuNBD)
>>>> 01:24:06 DEBUG| [stdout]
>>>> +----------------------------------------------------------------------
>>>> 01:24:06 DEBUG| [stdout] +Traceback (most recent call last):
>>>> 01:24:06 DEBUG| [stdout] +  File "147", line 85, in setUp
>>>> 01:24:06 DEBUG| [stdout] +    self.vm.launch()
>>>> 01:24:06 DEBUG| [stdout] +  File
>>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>>> line 302, in launch
>>>> 01:24:06 DEBUG| [stdout] +    self._launch()
>>>> 01:24:06 DEBUG| [stdout] +  File
>>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>>> line 319, in _launch
>>>> 01:24:06 DEBUG| [stdout] +    self._pre_launch()
>>>> 01:24:06 DEBUG| [stdout] +  File
>>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qtest.py",
>>>> line 106, in _pre_launch
>>>> 01:24:06 DEBUG| [stdout] +    super(QEMUQtestMachine,
>>>> self)._pre_launch()
>>>> 01:24:06 DEBUG| [stdout] +  File
>>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/machine.py",
>>>> line 270, in _pre_launch
>>>> 01:24:06 DEBUG| [stdout] +    self._qmp =
>>>> qmp.QEMUMonitorProtocol(self._vm_monitor, server=True)
>>>> 01:24:06 DEBUG| [stdout] +  File
>>>> "/home/jenkins/ppc64le/qemu-master/tests/qemu-iotests/../../python/qemu/qmp.py",
>>>> line 60, in __init__
>>>> 01:24:06 DEBUG| [stdout] +    self.__sock.bind(self.__address)
>>>> 01:24:06 DEBUG| [stdout] +OSError: [Errno 98] Address already in use
>>
>> Was this test 147?  If so, see:
>> https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg01469.html
>>
>> because that failure matches what I was seeing.
>>
>>>>
>>>> I made the mistake of reproducing this on my home system using the
>>>> qemu revision that I had and assuming it's caused by a used port. So
>>>> I limited the port range and used nc to occupy the port. It sort-of
>>>> reproduced but instead of Address already in use it hanged until I
>>>> kill the nc process. Then it failed with:
>>>>
>>>> +Traceback (most recent call last):
>>>> +  File "147", line 124, in test_inet
>>>> +    flatten_sock_addr(address))
>>>> +  File "147", line 59, in client_test
>>>> +    self.assert_qmp(result, 'return', {})
>>>> +  File
>>>> "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line
>>>> 821, in assert_qmp
>>>> +    result = self.dictpath(d, path)
>>>> +  File
>>>> "/home/medic/Work/Projekty/qemu/tests/qemu-iotests/iotests.py", line
>>>> 797, in dictpath
>>>> +    self.fail('failed path traversal for "%s" in "%s"' % (path,
>>>> str(d)))
>>>> +AssertionError: failed path traversal for "return" in "{'error':
>>>> {'class': 'GenericError', 'desc': 'Failed to read initial magic:
>>>> Unexpected end-of-file before all bytes were read'}}"
>>>>
>>
>> That's a secondary failure, I assume if the initial bug is fixed we are
>> less likely to hit the secondary one; but the secondary one may still be
>> worth fixing.
>>
>>>> After a brief study I thought qemu is not doing the job well enough
>>>> and wanted to add a protection. Anyway after a more thorough overview
>>>> I came to a different conclusion and that is that we are facing the
>>>> same issue as with incoming migration about a year ago. What happened
>>>> is that I started "nc -l localhost 32789" which results in:
>>>>
>>>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>>>> nc      26758 medic    3u  IPv6 9579487      0t0  TCP localhost:32789
>>>> (LISTEN)
>>>>
>>>> Then we start the server by "_try_server_up" where qemu-nbd detects
>>>> the port is occupied on IPv6 but available on IPv4, so it claims it:
>>>> COMMAND   PID  USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
>>>> nc        26758 medic    3u  IPv6 9579487      0t0  TCP
>>>> localhost:32789 (LISTEN)
>>>> qemu-nbd  26927 medic    4u  IPv4 9591857      0t0  TCP
>>>> localhost:32789 (LISTEN)
>>>>
>>>> and reports success. Then we try to connect but the hotplugged VM
>>>> first attempts to connect on the IPv6 address and hangs for infinity.
>>>>
>>>> Now is this an expected behavior? If so then we need the
>>>> find_free_address (but preferably directly in _try_server_up just
>>>> before starting the qemu-nbd) to leave as little time-frame for
>>>> collision as possible. Otherwise the test is alright and qemu-nbd
>>>> needs a fix to bail out in case some address is already used (IIRC
>>>> this is what incoming migration does).
>>>
>>> Ah, OK.
>>>
>>> Well, expected behavior...  It’s a shame, that’s what it is.
>>
>> In libnbd, we recently improved the testsuite by switching over to
>> systemd-style fd passing: instead of asking qemu-nbd to open a random
>> port (and hoping it is available), we instead pre-open the port (where
>> failure is under our control) and then pass in that fd with environment
>> variables to qemu-nbd, which in turn guarantees that qemu-nbd won't hit
>> failures in trying to use the port.  Maybe we should utilize that more
>> in qemu's own testsuite.
> 
> I suppose you’re welcome, but it sounds a bit like overkill for now. :-)
> 

Still you need to test all the variants...

>> Also, I need to revisit my proposed patches for letting qemu-nbd support
>> TLS over Unix sockets, as that's another way to avoid TCP contention
>> (right now, qemu has an anachronistic prohibition against the
>> combination of TLS and Unix sockets).
> 
> 147 specifically wants to test inet, though (among other things).
> Support for TLS over Unix sockets won’t change that.  (I suppose it will
> reduce the number of inet NBD sockets in other tests, though.)
> 
>>>> My second mistake was testing this on the old code-base and rebasing
>>>> it only before sending the patch (without testing after the rebase).
>>>> If I were to test it first, I would have found out that the real
>>>> reproducer is simply running the test as the commit
>>>> 8dff69b9415b4287e900358744b732195e1ab2e2 broke it.
>>>>
>>>>
>>>> So basically there are 2 actions:
>>>>
>>>> 1. fix the test as on my system it fails in 100% of cases, bisect
>>>> says the first bad commit is
>>>> 8dff69b9415b4287e900358744b732195e1ab2e2. Would anyone have time in
>>>> digging into this? I already spent way too much time on this and
>>>> don't really know what that commit is trying to do.
>>>
>>> Yep, I’ve sent a patch:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00294.html
>>

Thank you :-)

>> Ah, so we did notice the same problem.
>>
>>>
>>>> 2. decide on the behavior when IPv4/6 is already in use (bail-out or
>>>> start).
>>>> 2a. In case it should bail-out than the test is correct and there is
>>>> no need for my patch. On the other hand qemu-nbd has to be fixed
>>>
>>> I don’t think it makes much sense to let qemu’s NBD server ensure that
>>> it claims both IPv4 and IPv6 in case the user specifies a
>>> non-descriptive hostname.
>>>
>>>> 2b. Otherwise I can send a v2 that will check the port in the
>>>> _try_server_up just before starting qemu-nbd to minimize the risk of
>>>> using a utilized port (or should you decide it's not worth checking,
>>>> I can simply forget about this)
>>>
>>> Hm.  It wouldn’t be fully reliable, but, well...  The risk would be
>>> minimal.
>>>
>>> OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
>>> test?  We have specific cases for IPv6, so I think it makes sense to
>>> force IPv4 in all other cases.
>>
>> Except then it will fail on machines configured for IPv6-only.
> 
> So we’ll just have to test whether IPv4 works, just like we already do
> for IPv6, no?
> 

Sure, using ::1 for the IPv6 cases and 127.0.0.1 for the IPv4 cases would work. The question is whether the behavior is really expected. In migration it was considered dangerous, because you can have 2 VMs starting migration at the same time. They both might attempt to get the same port, one would receive IPv4 and the other IPv6. Then, depending on the order in which the migrations are started, you might end up attempting to migrate to the other VM instead of the intended one.

The same can happen here: you might start 2 NBD exports at the same time, get the same port without any failures, and then, depending on luck, get the right or wrong export when attempting to connect. So I think bailing out in case we fail to get all interfaces would be the most appropriate (note the same situation exists for 0.0.0.0, where you might end up serving multiple different disks on different interfaces). Anyway, I leave it to you. If you decide you don't want to fail, then using ::1/127.0.0.1 sounds like a good idea. Otherwise it'd make sense to create a test that specifically uses ::1 and then localhost to make sure it bails out.

Regards,
Lukáš

> Max
>
Max Reitz Feb. 7, 2020, 8:24 a.m. UTC | #8
On 06.02.20 19:33, Lukáš Doktor wrote:
> Dne 06. 02. 20 v 17:59 Max Reitz napsal(a):
>> On 06.02.20 17:48, Eric Blake wrote:
>>> On 2/6/20 10:37 AM, Max Reitz wrote:

[...]

>>>> OTOH, would it work if we just did a %s/localhost/127.0.0.1/ in the
>>>> test?  We have specific cases for IPv6, so I think it makes sense to
>>>> force IPv4 in all other cases.
>>>
>>> Except then it will fail on machines configured for IPv6-only.
>>
>> So we’ll just have to test whether IPv4 works, just like we already do
>> for IPv6, no?
>>
> 
> Sure, using ::1 for IPv6 and 127.0.0.1 for IPv4 cases would work. The question is whether the behavior is really expected. In migration it was considered dangerous, because you can have 2 VMs starting the migration at the same time. They both might attempt to get the same port, one would receive IPv4 the other IPv6. Then depending on the order of start migrate you might end-up attempting to migrate the other VMs instead of the intended ones.
> 
> The same can happen here, you might start 2 nbd exports at the same time, get the same port without any failures and then depending on luck get the right or wrong export when attempting to connect. So I think bailing out in case we fail to get all interfaces would be the most appropriate (note the same situation is for 0.0.0.0 where you might end-up serving multiple different disks on different interfaces). Anyway I leave it to you. If you decide you don't want to fail than using ::1/127.0.0.1 sounds like a good idea. Otherwise it'd make sense to create a test that especially uses ::1 and then localhost to make sure it bails-out.

OK.  I’ll defer to Eric on that one. O:-)

Max

Patch

diff --git a/tests/qemu-iotests/147 b/tests/qemu-iotests/147
index 2b6f859a09..4d0e1418bb 100755
--- a/tests/qemu-iotests/147
+++ b/tests/qemu-iotests/147
@@ -26,10 +26,8 @@  import time
 import iotests
 from iotests import cachemode, aiomode, imgfmt, qemu_img, qemu_nbd, qemu_nbd_early_pipe
 
-NBD_PORT_START      = 32768
-NBD_PORT_END        = NBD_PORT_START + 1024
-NBD_IPV6_PORT_START = NBD_PORT_END
-NBD_IPV6_PORT_END   = NBD_IPV6_PORT_START + 1024
+NBD_PORTS = iotests.find_free_ports(32768, 65536, 1024)
+NBD_IPV6_PORTS = iotests.find_free_ports(NBD_PORTS[-1] + 1, 65536, 1024)
 
 test_img = os.path.join(iotests.test_dir, 'test.img')
 unix_socket = os.path.join(iotests.sock_dir, 'nbd.socket')
@@ -104,11 +102,15 @@  class QemuNBD(NBDBlockdevAddBase):
         self.assertTrue(self._try_server_up(*args))
 
     def test_inet(self):
-        while True:
-            nbd_port = random.randrange(NBD_PORT_START, NBD_PORT_END)
+        nbd_port = None
+        nbd_ports = NBD_PORTS[:]
+        random.shuffle(nbd_ports)
+        for nbd_port in nbd_ports:
             if self._try_server_up('-b', 'localhost', '-p', str(nbd_port)):
                 break
-
+        else:
+            raise AssertionError("NBD not listening on any port: %s"
+                                 % NBD_PORTS)
         address = { 'type': 'inet',
                     'data': {
                         'host': 'localhost',
@@ -178,8 +180,10 @@  class BuiltinNBD(NBDBlockdevAddBase):
         self.assert_qmp(result, 'return', {})
 
     def do_test_inet(self, export_name=None):
-        while True:
-            nbd_port = random.randrange(NBD_PORT_START, NBD_PORT_END)
+        nbd_port = None
+        nbd_ports = NBD_PORTS[:]
+        random.shuffle(nbd_ports)
+        for nbd_port in nbd_ports:
             address = { 'type': 'inet',
                         'data': {
                             'host': 'localhost',
@@ -187,6 +191,9 @@  class BuiltinNBD(NBDBlockdevAddBase):
                         } }
             if self._try_server_up(address, export_name):
                 break
+        else:
+            raise AssertionError("NBD not listening on any port: %s"
+                                 % NBD_PORTS)
 
         export_name = export_name or 'nbd-export'
         self.client_test('nbd://localhost:%i/%s' % (nbd_port, export_name),
@@ -203,8 +210,10 @@  class BuiltinNBD(NBDBlockdevAddBase):
         self.do_test_inet('shadow')
 
     def test_inet_two_exports(self):
-        while True:
-            nbd_port = random.randrange(NBD_PORT_START, NBD_PORT_END)
+        nbd_port = None
+        nbd_ports = NBD_PORTS[:]
+        random.shuffle(nbd_ports)
+        for nbd_port in nbd_ports:
             address = { 'type': 'inet',
                         'data': {
                             'host': 'localhost',
@@ -212,6 +221,9 @@  class BuiltinNBD(NBDBlockdevAddBase):
                         } }
             if self._try_server_up(address, 'exp1', 'exp2'):
                 break
+        else:
+            raise AssertionError("NBD not listening on any port: %s"
+                                 % NBD_PORTS)
 
         self.client_test('nbd://localhost:%i/%s' % (nbd_port, 'exp1'),
                          flatten_sock_addr(address), 'exp1', 'node1', False)
@@ -232,8 +244,10 @@  class BuiltinNBD(NBDBlockdevAddBase):
             # IPv6 not available, skip
             return
 
-        while True:
-            nbd_port = random.randrange(NBD_IPV6_PORT_START, NBD_IPV6_PORT_END)
+        nbd_port = None
+        nbd_ports = NBD_IPV6_PORTS[:]
+        random.shuffle(nbd_ports)
+        for nbd_port in nbd_ports:
             address = { 'type': 'inet',
                         'data': {
                             'host': '::1',
@@ -243,6 +257,9 @@  class BuiltinNBD(NBDBlockdevAddBase):
                         } }
             if self._try_server_up(address):
                 break
+        else:
+            raise AssertionError("NBD not listening on any port: %s"
+                                 % NBD_IPV6_PORTS)
 
         filename = { 'driver': 'raw',
                      'file': {
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 89aa2df2f3..d5a3ce2d39 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -20,6 +20,7 @@  from __future__ import print_function
 import errno
 import os
 import re
+import socket
 import subprocess
 import string
 import unittest
@@ -75,6 +76,69 @@  luks_default_secret_object = 'secret,id=keysec0,data=' + \
 luks_default_key_secret_opt = 'key-secret=keysec0'
 
 
+def is_port_free(port, address):
+    """
+    Return True if the given port is available for use.
+
+    Currently we only check for TCP/UDP connections on IPv4/6
+    Backported from avocado.utils.network
+
+    :param port: Port number
+    :param address: Socket address to bind or connect
+    """
+    families = (socket.AF_INET, socket.AF_INET6)
+    if address == "localhost" or not address:
+        localhost = True
+        protocols = (socket.SOCK_STREAM, socket.SOCK_DGRAM)
+    else:
+        localhost = False
+        # sock.connect always connects for UDP
+        protocols = (socket.SOCK_STREAM, )
+    sock = None
+    try:
+        for family in families:
+            for protocol in protocols:
+                try:
+                    sock = socket.socket(family, protocol)
+                    if localhost:
+                        sock.bind(("", port))
+                    else:
+                        sock.connect((address, port))
+                        return False
+                except socket.error as exc:
+                    if exc.errno in (93, 94):   # Unsupported combinations
+                        continue
+                    if localhost:
+                        return False
+                sock.close()
+        return True
+    finally:
+        if sock is not None:
+            sock.close()
+
+
+def find_free_ports(start_port, end_port, count, address="localhost"):
+    """
+    Return count of host free ports in the range [start_port, end_port].
+
+    Backported from avocado.utils.network
+
+    :param start_port: header of candidate port range
+    :param end_port: ender of candidate port range
+    :param count: Initial number of ports known to be free in the range.
+    :param address: Socket address to bind or connect
+    """
+    ports = []
+    port_range = range(start_port, end_port)
+    for i in port_range:
+        if is_port_free(i, address):
+            ports.append(i)
+        if len(ports) >= count:
+            break
+
+    return ports
+
+
 def qemu_img(*args):
     '''Run qemu-img and return the exit code'''
     devnull = open('/dev/null', 'r+')