diff mbox series

[2/6,for-3.1] nbd: stop waiting for a NBD response with NBD_CMD_DISC

Message ID 20181116155325.22428-3-berrange@redhat.com (mailing list archive)
State New, archived
Headers show
Series Misc fixes to NBD | expand

Commit Message

Daniel P. Berrangé Nov. 16, 2018, 3:53 p.m. UTC
When sending a NBD_CMD_DISC message there is no reply expected,
however, the nbd_read_eof() coroutine is still waiting for a reply.
In a plain NBD connection this doesn't matter as it will just get an
EOF, however, on a TLS connection it will get an interrupted TLS data
packet. The nbd_read_eof() will then print an error message on the
console due to missing reply to NBD_CMD_DISC.

This can be seen with qemu-img

  $ qemu-img info \
      --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
      --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
  qemu-img: Cannot read from TLS channel: Input/output error
  image: nbd://127.0.0.1:9000
  file format: nbd
  virtual size: 10M (10485760 bytes)
  disk size: unavailable

Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
get the coroutine to stop waiting for a reply and thus supress the error
message. This fixes a regression introduced in

  commit be41c100c0d67f6072ddd4910c4b6f7d239f025f
  Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Date:   Fri May 26 14:09:13 2017 +0300

    nbd/client.c: use errp instead of LOG

    Move to modern errp scheme from just LOGging errors.

    Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
    Message-Id: <20170526110913.89098-1-vsementsov@virtuozzo.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 block/nbd-client.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Eric Blake Nov. 16, 2018, 4:08 p.m. UTC | #1
On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:
> When sending a NBD_CMD_DISC message there is no reply expected,
> however, the nbd_read_eof() coroutine is still waiting for a reply.
> In a plain NBD connection this doesn't matter as it will just get an
> EOF, however, on a TLS connection it will get an interrupted TLS data
> packet.

Does that depend on whether the server tried to gracefully close the TLS 
connection (and flushed enough packets to do so)?  At any rate, while 
it's nice to wait for the server to cleanly end TLS, we can't control 
the behavior of all servers, and I agree with your analysis that having 
the server silently deal with an abrupt shutdown is nicer than polluting 
stderr, as it's not actually a data loss situation for either the client 
or the server.

> The nbd_read_eof() will then print an error message on the
> console due to missing reply to NBD_CMD_DISC.
> 
> This can be seen with qemu-img
> 
>    $ qemu-img info \
>        --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
>        --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
>    qemu-img: Cannot read from TLS channel: Input/output error
>    image: nbd://127.0.0.1:9000
>    file format: nbd
>    virtual size: 10M (10485760 bytes)
>    disk size: unavailable
> 
> Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
> get the coroutine to stop waiting for a reply and thus supress the error

s/supress/suppress/

> message. This fixes a regression introduced in
> 
>    commit be41c100c0d67f6072ddd4910c4b6f7d239f025f
>    Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>    Date:   Fri May 26 14:09:13 2017 +0300
> 
>      nbd/client.c: use errp instead of LOG
> 
>      Move to modern errp scheme from just LOGging errors.
> 
>      Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>      Message-Id: <20170526110913.89098-1-vsementsov@virtuozzo.com>
>      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   block/nbd-client.c | 1 +
>   1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

Agree that this is 3.1 material; will queue.
Eric Blake Nov. 18, 2018, 2:19 a.m. UTC | #2
On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:
> When sending a NBD_CMD_DISC message there is no reply expected,
> however, the nbd_read_eof() coroutine is still waiting for a reply.
> In a plain NBD connection this doesn't matter as it will just get an
> EOF, however, on a TLS connection it will get an interrupted TLS data
> packet. The nbd_read_eof() will then print an error message on the
> console due to missing reply to NBD_CMD_DISC.
> 
> This can be seen with qemu-img
> 
>    $ qemu-img info \
>        --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
>        --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
>    qemu-img: Cannot read from TLS channel: Input/output error
>    image: nbd://127.0.0.1:9000
>    file format: nbd
>    virtual size: 10M (10485760 bytes)
>    disk size: unavailable
> 
> Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
> get the coroutine to stop waiting for a reply and thus supress the error
> message.

Actually, it's not quite enough - once you actually start performing 
I/O, enough coroutines are kicked off that the error still happens:

$  qemu-io -c 'r 1m 1m' -c 'w -P 0x22 1m 1m' --image-opts \
--object tls-creds-x509,dir=scratch/tls/client1,endpoint=client,id=tls0\
  driver=nbd,host=localhost,port=10809,tls-creds=tls0
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0430 sec (23.204 MiB/sec and 23.2040 ops/sec)
wrote 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0152 sec (65.479 MiB/sec and 65.4793 ops/sec)
Cannot read from TLS channel: Input/output error

Squashing this in on top of your patch helps, though:

diff --git i/block/nbd-client.c w/block/nbd-client.c
index 5f63e4b8f15..e7916c78996 100644
--- i/block/nbd-client.c
+++ w/block/nbd-client.c
@@ -79,7 +79,14 @@ static coroutine_fn void nbd_read_reply_entry(void 
*opaque)
          assert(s->reply.handle == 0);
          ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
          if (local_err) {
-            error_report_err(local_err);
+            /* If we are already quitting, either another error has
+             * already been reported, or we requested NBD_CMD_DISC and
+             * don't need to report anything further.  */
+            if (!s->quit) {
+                error_report_err(local_err);
+            } else {
+                error_free(local_err);
+            }
          }
          if (ret <= 0) {
              break;

But I want to do more testing to make sure I'm not missing out on 
reporting an actual error if I add that.
Daniel P. Berrangé Nov. 19, 2018, 10:23 a.m. UTC | #3
On Sat, Nov 17, 2018 at 08:19:10PM -0600, Eric Blake wrote:
> On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:
> > When sending a NBD_CMD_DISC message there is no reply expected,
> > however, the nbd_read_eof() coroutine is still waiting for a reply.
> > In a plain NBD connection this doesn't matter as it will just get an
> > EOF, however, on a TLS connection it will get an interrupted TLS data
> > packet. The nbd_read_eof() will then print an error message on the
> > console due to missing reply to NBD_CMD_DISC.
> > 
> > This can be seen with qemu-img
> > 
> >    $ qemu-img info \
> >        --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
> >        --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
> >    qemu-img: Cannot read from TLS channel: Input/output error
> >    image: nbd://127.0.0.1:9000
> >    file format: nbd
> >    virtual size: 10M (10485760 bytes)
> >    disk size: unavailable
> > 
> > Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
> > get the coroutine to stop waiting for a reply and thus supress the error
> > message.
> 
> Actually, it's not quite enough - once you actually start performing I/O,
> enough coroutines are kicked off that the error still happens:

Hmm, does the client not send requests synchronously  ? I expected that
any other I/O operations would have already received their reply by the
time we sent the DISC command.

> 
> $  qemu-io -c 'r 1m 1m' -c 'w -P 0x22 1m 1m' --image-opts \
> --object tls-creds-x509,dir=scratch/tls/client1,endpoint=client,id=tls0\
>  driver=nbd,host=localhost,port=10809,tls-creds=tls0
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 0.0430 sec (23.204 MiB/sec and 23.2040 ops/sec)
> wrote 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 0.0152 sec (65.479 MiB/sec and 65.4793 ops/sec)
> Cannot read from TLS channel: Input/output error
> 
> Squashing this in on top of your patch helps, though:
> 
> diff --git i/block/nbd-client.c w/block/nbd-client.c
> index 5f63e4b8f15..e7916c78996 100644
> --- i/block/nbd-client.c
> +++ w/block/nbd-client.c
> @@ -79,7 +79,14 @@ static coroutine_fn void nbd_read_reply_entry(void
> *opaque)
>          assert(s->reply.handle == 0);
>          ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
>          if (local_err) {
> -            error_report_err(local_err);
> +            /* If we are already quitting, either another error has
> +             * already been reported, or we requested NBD_CMD_DISC and
> +             * don't need to report anything further.  */
> +            if (!s->quit) {
> +                error_report_err(local_err);
> +            } else {
> +                error_free(local_err);
> +            }
>          }
>          if (ret <= 0) {
>              break;
> 
> But I want to do more testing to make sure I'm not missing out on reporting
> an actual error if I add that.

Yes, I'd be a little concerned about missing reporting of an error from
a command other than NBD_CMD_DISC if we did this.

Regards,
Daniel
Daniel P. Berrangé Nov. 19, 2018, 1:47 p.m. UTC | #4
On Sat, Nov 17, 2018 at 08:19:10PM -0600, Eric Blake wrote:
> On 11/16/18 9:53 AM, Daniel P. Berrangé wrote:
> > When sending a NBD_CMD_DISC message there is no reply expected,
> > however, the nbd_read_eof() coroutine is still waiting for a reply.
> > In a plain NBD connection this doesn't matter as it will just get an
> > EOF, however, on a TLS connection it will get an interrupted TLS data
> > packet. The nbd_read_eof() will then print an error message on the
> > console due to missing reply to NBD_CMD_DISC.
> > 
> > This can be seen with qemu-img
> > 
> >    $ qemu-img info \
> >        --object tls-creds-x509,dir=tlsdir,id=tls0,endpoint=client \
> >        --image-opts driver=nbd,host=127.0.0.1,port=9000,tls-creds=tls0
> >    qemu-img: Cannot read from TLS channel: Input/output error
> >    image: nbd://127.0.0.1:9000
> >    file format: nbd
> >    virtual size: 10M (10485760 bytes)
> >    disk size: unavailable
> > 
> > Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
> > get the coroutine to stop waiting for a reply and thus supress the error
> > message.
> 
> Actually, it's not quite enough - once you actually start performing I/O,
> enough coroutines are kicked off that the error still happens:
> 
> $  qemu-io -c 'r 1m 1m' -c 'w -P 0x22 1m 1m' --image-opts \
> --object tls-creds-x509,dir=scratch/tls/client1,endpoint=client,id=tls0\
>  driver=nbd,host=localhost,port=10809,tls-creds=tls0
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 0.0430 sec (23.204 MiB/sec and 23.2040 ops/sec)
> wrote 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 0.0152 sec (65.479 MiB/sec and 65.4793 ops/sec)
> Cannot read from TLS channel: Input/output error
> 
> Squashing this in on top of your patch helps, though:
> 
> diff --git i/block/nbd-client.c w/block/nbd-client.c
> index 5f63e4b8f15..e7916c78996 100644
> --- i/block/nbd-client.c
> +++ w/block/nbd-client.c
> @@ -79,7 +79,14 @@ static coroutine_fn void nbd_read_reply_entry(void
> *opaque)
>          assert(s->reply.handle == 0);
>          ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
>          if (local_err) {
> -            error_report_err(local_err);
> +            /* If we are already quitting, either another error has
> +             * already been reported, or we requested NBD_CMD_DISC and
> +             * don't need to report anything further.  */
> +            if (!s->quit) {
> +                error_report_err(local_err);
> +            } else {
> +                error_free(local_err);
> +            }
>          }
>          if (ret <= 0) {
>              break;
> 
> But I want to do more testing to make sure I'm not missing out on reporting
> an actual error if I add that.

It occurred to me that it would be nice if the TLS channel behaved the same
a normal channel when seeing EOF on the socket. Unfortunately the reason why
GNUTLS returns an error in this case does in fact make sense, so it isn't
quite as simple as just ignoring the error. Fortunately the NBD client has
decided it wants to shutdown and crucially has called qio_channel_shutdown()
at this point. Thus we can safely ignore the gnutls error code when the
channel has been shutdown for read and just return 0 for EOF as with a plain
socket.

Thus I've sent a patch which solves the problem in that way:

  https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg03616.html

With this applied, both my original patch and your extra chunk here
should be redundant.

Regards,
Daniel
Eric Blake Nov. 19, 2018, 2:24 p.m. UTC | #5
On 11/19/18 4:23 AM, Daniel P. Berrangé wrote:

>>> Simply setting the 'quit' flag after sending NBD_CMD_DISC is enough to
>>> get the coroutine to stop waiting for a reply and thus supress the error
>>> message.
>>
>> Actually, it's not quite enough - once you actually start performing I/O,
>> enough coroutines are kicked off that the error still happens:
> 
> Hmm, does the client not send requests synchronously  ? I expected that
> any other I/O operations would have already received their reply by the
> time we sent the DISC command.
> 

During handshake phase (NBD_OPT_* messages), client sends requests 
synchronously (at most one pending read, in response to the most recent 
write). During transmission phase (NBD_CMD_* messages), the client is 
allowed to send messages asynchronously.  However, the NBD protocol 
recommends (but not requires) that the client never send NBD_CMD_DISC to 
disconnect until AFTER the client has collected all responses to any 
earlier in-flight messages. And if I'm reading the block/nbd-client.c 
code correctly, in general we don't try to shutdown a client while there 
are any pending commands with unread replies, except when we are 
shutting down due to detecting a protocol error (at which point we are 
already out of sync with the server and waiting for replies is not going 
to be helpful).


>> But I want to do more testing to make sure I'm not missing out on reporting
>> an actual error if I add that.
> 
> Yes, I'd be a little concerned about missing reporting of an error from
> a command other than NBD_CMD_DISC if we did this.

But it looks like your approach of making the qio code behave more 
nicely is a smarter cleanup than my idea, so I'm willing to ditch my 
code in favor of yours.  Testing your approach now :)
diff mbox series

Patch

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 76e9ca3abe..5f63e4b8f1 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -961,6 +961,7 @@  void nbd_client_close(BlockDriverState *bs)
     }
 
     nbd_send_request(client->ioc, &request);
+    client->quit = true;
 
     nbd_teardown_connection(bs);
 }