[for-4.0] nbd: Permit simple error to NBD_CMD_BLOCK_STATUS

Message ID	20190323142455.5301-1-eblake@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: Eric Blake <eblake@redhat.com> To: qemu-devel@nongnu.org Date: Sat, 23 Mar 2019 09:24:55 -0500 Message-Id: <20190323142455.5301-1-eblake@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Subject: [Qemu-devel] [PATCH for-4.0] nbd: Permit simple error to NBD_CMD_BLOCK_STATUS Precedence: list Cc: Kevin Wolf <kwolf@redhat.com>, vsementsov@virtuozzo.com, qemu-block@nongnu.org, rjones@redhat.com, Max Reitz <mreitz@redhat.com>, jsnow@redhat.com Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	[for-4.0] nbd: Permit simple error to NBD_CMD_BLOCK_STATUS \| expand [for-4.0] nbd: Permit simple error to NBD_CMD_BLOCK_STATUS

Eric Blake March 23, 2019, 2:24 p.m. UTC

The NBD spec is clear that when structured replies are active, a
simple error reply is acceptable to any command except for
NBD_CMD_READ.  However, we were mistakenly requiring structured errors
for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
simple error (since qemu does not behave as such a server, we didn't
notice the problem until now).  Broken since its introduction in
commit 78a33ab5 (v2.12).

Howeve, even if we had gotten it correct to accept simple errors back
then, we run into another problem: the client treats the server's
reply as fatal and hangs up on the connection, instead of merely
failing the block status request but being ready for the next
command. Broken in commit 7f86068d (unreleased).

Signed-off-by: Eric Blake <eblake@redhat.com>
---

I confirmed that when backporting things for qemu 2.12 through 3.1,
the first hunk is sufficient to let clients tolerate a simple error
without hanging up (the second hunk is only required for the 4.0 code
base).

Rich - if you choose to make nbdkit work around the qemu 2.12 bug
where it refuses to communicate with a server that supports
NBD_OPT_SET_META_CONTEXT but not base:allocation, you'll also want to
make sure that you send a structured error instead of a simple error
to any failures of NBD_CMD_BLOCK_STATUS.

 block/nbd-client.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Richard W.M. Jones March 23, 2019, 2:40 p.m. UTC | #1

On Sat, Mar 23, 2019 at 09:24:55AM -0500, Eric Blake wrote:
> The NBD spec is clear that when structured replies are active, a
> simple error reply is acceptable to any command except for
> NBD_CMD_READ.  However, we were mistakenly requiring structured errors
> for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
> simple error (since qemu does not behave as such a server, we didn't
> notice the problem until now).  Broken since its introduction in
> commit 78a33ab5 (v2.12).
> 
> Howeve, even if we had gotten it correct to accept simple errors back

"However"

> then, we run into another problem: the client treats the server's
> reply as fatal and hangs up on the connection, instead of merely
> failing the block status request but being ready for the next
> command. Broken in commit 7f86068d (unreleased).

This latter part fixes the silent qemu client disconnection that
happens when the server sends back a large number of extents in a
single block status reply?

> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> I confirmed that when backporting things for qemu 2.12 through 3.1,
> the first hunk is sufficient to let clients tolerate a simple error
> without hanging up (the second hunk is only required for the 4.0 code
> base).
> 
> Rich - if you choose to make nbdkit work around the qemu 2.12 bug
> where it refuses to communicate with a server that supports
> NBD_OPT_SET_META_CONTEXT but not base:allocation, you'll also want to
> make sure that you send a structured error instead of a simple error
> to any failures of NBD_CMD_BLOCK_STATUS.

OK, got it.  However I think I'd prefer to see how it goes fixing qemu
in RHEL 7 first.

>  block/nbd-client.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index bfbaf7ebe94..5fc9ea40383 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -718,9 +718,7 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>      bool received = false;
> 
>      assert(!extent->length);
> -    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
> -                            NULL, &reply, &payload)
> -    {
> +    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
>          int ret;
>          NBDStructuredReplyChunk *chunk = &reply.structured;
> 
> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>          payload = NULL;
>      }
> 
> -    if (!extent->length && !iter.err) {
> -        error_setg(&iter.err,
> -                   "Server did not reply with any status extents");
> +    if (!extent->length && !iter.request_ret) {
> +        if (!iter.err) {
> +            error_setg(&iter.err,
> +                       "Server did not reply with any status extents");
> +        }
>          if (!iter.ret) {
>              iter.ret = -EIO;
>          }

ACK

Rich.

Eric Blake March 23, 2019, 2:48 p.m. UTC | #2

On 3/23/19 9:40 AM, Richard W.M. Jones wrote:
> On Sat, Mar 23, 2019 at 09:24:55AM -0500, Eric Blake wrote:
>> The NBD spec is clear that when structured replies are active, a
>> simple error reply is acceptable to any command except for
>> NBD_CMD_READ.  However, we were mistakenly requiring structured errors
>> for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
>> simple error (since qemu does not behave as such a server, we didn't
>> notice the problem until now).  Broken since its introduction in
>> commit 78a33ab5 (v2.12).
>>
>> Howeve, even if we had gotten it correct to accept simple errors back
> 
> "However"
> 
>> then, we run into another problem: the client treats the server's
>> reply as fatal and hangs up on the connection, instead of merely
>> failing the block status request but being ready for the next
>> command. Broken in commit 7f86068d (unreleased).
> 
> This latter part fixes the silent qemu client disconnection that
> happens when the server sends back a large number of extents in a
> single block status reply?

Not sure - our mails crossed, so I have not reproduced that failure mode
yet to know if this patch helps there, or if I'll need another patch.

> 
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>
>> I confirmed that when backporting things for qemu 2.12 through 3.1,
>> the first hunk is sufficient to let clients tolerate a simple error
>> without hanging up (the second hunk is only required for the 4.0 code
>> base).
>>
>> Rich - if you choose to make nbdkit work around the qemu 2.12 bug
>> where it refuses to communicate with a server that supports
>> NBD_OPT_SET_META_CONTEXT but not base:allocation, you'll also want to
>> make sure that you send a structured error instead of a simple error
>> to any failures of NBD_CMD_BLOCK_STATUS.
> 
> OK, got it.  However I think I'd prefer to see how it goes fixing qemu
> in RHEL 7 first.

Indeed, if RHEL 7 picks up the various NBD patches from qemu 2.12
onwards, it will probably pick up this one as well, at which point
nbdkit shouldn't have to worry about workarounds.

Vladimir Sementsov-Ogievskiy March 25, 2019, 10:12 a.m. UTC | #3

23.03.2019 17:24, Eric Blake wrote:
> The NBD spec is clear that when structured replies are active, a
> simple error reply is acceptable to any command except for
> NBD_CMD_READ.  However, we were mistakenly requiring structured errors
> for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
> simple error (since qemu does not behave as such a server, we didn't
> notice the problem until now).  Broken since its introduction in
> commit 78a33ab5 (v2.12).
> 
> Howeve, even if we had gotten it correct to accept simple errors back
> then, we run into another problem: the client treats the server's
> reply as fatal and hangs up on the connection, instead of merely
> failing the block status request but being ready for the next
> command. Broken in commit 7f86068d (unreleased).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> I confirmed that when backporting things for qemu 2.12 through 3.1,
> the first hunk is sufficient to let clients tolerate a simple error
> without hanging up (the second hunk is only required for the 4.0 code
> base).
> 
> Rich - if you choose to make nbdkit work around the qemu 2.12 bug
> where it refuses to communicate with a server that supports
> NBD_OPT_SET_META_CONTEXT but not base:allocation, you'll also want to
> make sure that you send a structured error instead of a simple error
> to any failures of NBD_CMD_BLOCK_STATUS.
> 
>   block/nbd-client.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index bfbaf7ebe94..5fc9ea40383 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -718,9 +718,7 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>       bool received = false;
> 
>       assert(!extent->length);
> -    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
> -                            NULL, &reply, &payload)
> -    {
> +    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
>           int ret;
>           NBDStructuredReplyChunk *chunk = &reply.structured;
> 
> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>           payload = NULL;
>       }
> 
> -    if (!extent->length && !iter.err) {
> -        error_setg(&iter.err,
> -                   "Server did not reply with any status extents");
> +    if (!extent->length && !iter.request_ret) {

Hmm, I don't see, what is changed. You just don't set ret and err if there already
set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
and local_err will not be set and traced. But there is no connection close in this case
in current code, s->quit is not set.

> +        if (!iter.err) {
> +            error_setg(&iter.err,
> +                       "Server did not reply with any status extents");
> +        }
>           if (!iter.ret) {
>               iter.ret = -EIO;
>           }
>

Eric Blake March 25, 2019, 2:44 p.m. UTC | #4

On 3/25/19 5:12 AM, Vladimir Sementsov-Ogievskiy wrote:
> 23.03.2019 17:24, Eric Blake wrote:
>> The NBD spec is clear that when structured replies are active, a
>> simple error reply is acceptable to any command except for
>> NBD_CMD_READ.  However, we were mistakenly requiring structured errors
>> for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
>> simple error (since qemu does not behave as such a server, we didn't
>> notice the problem until now).  Broken since its introduction in
>> commit 78a33ab5 (v2.12).
>>
>> Howeve, even if we had gotten it correct to accept simple errors back
>> then, we run into another problem: the client treats the server's
>> reply as fatal and hangs up on the connection, instead of merely
>> failing the block status request but being ready for the next
>> command. Broken in commit 7f86068d (unreleased).
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>

>> +++ b/block/nbd-client.c
>> @@ -718,9 +718,7 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>       bool received = false;
>>
>>       assert(!extent->length);
>> -    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
>> -                            NULL, &reply, &payload)
>> -    {
>> +    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {

This part changes things to permit a simple reply if that reply is an
error. In turn, if the server sends a simple reply error,
iter.request_ret is set to the error returned by the server while
leaving iter.ret and iter.err unset (because we successfully got the
server's reply and can keep the connection alive), with commit 7f86068d
in play (prior to that commit, iter.ret was left 0 because the server
replied successfully, while iter.err was set to the set to the server's
error, while leaving iter.fatal unset).

>>           int ret;
>>           NBDStructuredReplyChunk *chunk = &reply.structured;
>>
>> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>           payload = NULL;
>>       }
>>
>> -    if (!extent->length && !iter.err) {
>> -        error_setg(&iter.err,
>> -                   "Server did not reply with any status extents");
>> +    if (!extent->length && !iter.request_ret) {
> 
> Hmm, I don't see, what is changed. You just don't set ret and err if there already
> set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
> and local_err will not be set and traced. But there is no connection close in this case
> in current code, s->quit is not set.

And that's okay. The whole point of this is that if the server replied
with an error message (iter.request_ret), then it is okay that the
server did not give us any extent->length, and the caller will turn
iter.request_ret into the EINVAL reply (or whatever other errno value
the server sent) to .bdrv_co_block_status() while keeping the connection
alive.  But if the server did NOT send an error reply (iter.request_ret
is 0), but also did not send us any status, then we need to turn that
into an error condition (because our caller expected us to make progress
on success, even though the server did not make progress).

> 
>> +        if (!iter.err) {
>> +            error_setg(&iter.err,
>> +                       "Server did not reply with any status extents");
>> +        }
>>           if (!iter.ret) {
>>               iter.ret = -EIO;
>>           }
>>
> 
>

Vladimir Sementsov-Ogievskiy March 25, 2019, 4:04 p.m. UTC | #5

25.03.2019 17:44, Eric Blake wrote:
> On 3/25/19 5:12 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 23.03.2019 17:24, Eric Blake wrote:
>>> The NBD spec is clear that when structured replies are active, a
>>> simple error reply is acceptable to any command except for
>>> NBD_CMD_READ.  However, we were mistakenly requiring structured errors
>>> for NBD_CMD_BLOCK_STATUS, and hanging up on a server that gave a
>>> simple error (since qemu does not behave as such a server, we didn't
>>> notice the problem until now).  Broken since its introduction in
>>> commit 78a33ab5 (v2.12).
>>>
>>> Howeve, even if we had gotten it correct to accept simple errors back
>>> then, we run into another problem: the client treats the server's
>>> reply as fatal and hangs up on the connection, instead of merely
>>> failing the block status request but being ready for the next
>>> command. Broken in commit 7f86068d (unreleased).
>>>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>>
> 
>>> +++ b/block/nbd-client.c
>>> @@ -718,9 +718,7 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>>        bool received = false;
>>>
>>>        assert(!extent->length);
>>> -    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
>>> -                            NULL, &reply, &payload)
>>> -    {
>>> +    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
> 
> This part changes things to permit a simple reply if that reply is an
> error. In turn, if the server sends a simple reply error,
> iter.request_ret is set to the error returned by the server while
> leaving iter.ret and iter.err unset (because we successfully got the
> server's reply and can keep the connection alive), with commit 7f86068d
> in play (prior to that commit, iter.ret was left 0 because the server
> replied successfully, while iter.err was set to the set to the server's
> error, while leaving iter.fatal unset).
> 
>>>            int ret;
>>>            NBDStructuredReplyChunk *chunk = &reply.structured;
>>>
>>> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>>            payload = NULL;
>>>        }
>>>
>>> -    if (!extent->length && !iter.err) {
>>> -        error_setg(&iter.err,
>>> -                   "Server did not reply with any status extents");
>>> +    if (!extent->length && !iter.request_ret) {
>>
>> Hmm, I don't see, what is changed. You just don't set ret and err if there already
>> set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
>> and local_err will not be set and traced. But there is no connection close in this case
>> in current code, s->quit is not set.
> 
> And that's okay. The whole point of this is that if the server replied
> with an error message (iter.request_ret), then it is okay that the
> server did not give us any extent->length, and the caller will turn
> iter.request_ret into the EINVAL reply (or whatever other errno value
> the server sent) to .bdrv_co_block_status() while keeping the connection
> alive.  But if the server did NOT send an error reply (iter.request_ret
> is 0), but also did not send us any status, then we need to turn that
> into an error condition (because our caller expected us to make progress
> on success, even though the server did not make progress).


I just say, that I don't see "the client treats the server's
reply as fatal and hangs up on the connection" in this place before your change.
Or what you mean?

> 
>>
>>> +        if (!iter.err) {
>>> +            error_setg(&iter.err,
>>> +                       "Server did not reply with any status extents");
>>> +        }
>>>            if (!iter.ret) {
>>>                iter.ret = -EIO;
>>>            }
>>>
>>
>>
>

Eric Blake March 25, 2019, 4:05 p.m. UTC | #6

On 3/25/19 9:44 AM, Eric Blake wrote:

>>>
>>> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>>           payload = NULL;
>>>       }
>>>
>>> -    if (!extent->length && !iter.err) {
>>> -        error_setg(&iter.err,
>>> -                   "Server did not reply with any status extents");
>>> +    if (!extent->length && !iter.request_ret) {
>>
>> Hmm, I don't see, what is changed. You just don't set ret and err if there already
>> set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
>> and local_err will not be set and traced. But there is no connection close in this case
>> in current code, s->quit is not set.
> 
> And that's okay. The whole point of this is that if the server replied
> with an error message (iter.request_ret), then it is okay that the
> server did not give us any extent->length, and the caller will turn
> iter.request_ret into the EINVAL reply (or whatever other errno value
> the server sent) to .bdrv_co_block_status() while keeping the connection
> alive.  But if the server did NOT send an error reply (iter.request_ret
> is 0), but also did not send us any status, then we need to turn that
> into an error condition (because our caller expected us to make progress
> on success, even though the server did not make progress).

More to the point, the behavior of qemu for a (structured) error reply
to NBD_CMD_BLOCK_STATUS with no extent->length was to keep the
connection alive (both before and after commit 7f86068d) - the
difference in behavior for this hunk of the patch is only visible for
simple error replies. (Here's the hacks I applied to the server, to test
forced error replies to NBD_CMD_BLOCK_STATUS, whether structured or simple:


From 22b4181f887d3b782695dbb11fcc1759281fc51e Mon Sep 17 00:00:00 2001
From: Eric Blake <eblake@redhat.com>
Date: Sat, 23 Mar 2019 07:19:33 -0500
Subject: [PATCH 1/2] nbd: Debug patch to force NBD_CMD_BLOCK_STATUS failure

Hack to test whether the client is robust to block status failure.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 nbd/server.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/nbd/server.c b/nbd/server.c
index fd013a2817a..94d0c9f4e9e 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2269,6 +2269,8 @@ static coroutine_fn int
nbd_handle_request(NBDClient *client,
                                       "discard failed", errp);

     case NBD_CMD_BLOCK_STATUS:
+        return nbd_send_generic_reply(client, request->handle, -EINVAL,
+                                      "no status for you today", errp);
         if (!request->len) {
             return nbd_send_generic_reply(client, request->handle, -EINVAL,
                                           "need non-zero length", errp);

Eric Blake March 25, 2019, 4:21 p.m. UTC | #7

On 3/25/19 11:04 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> +++ b/block/nbd-client.c
>>>> @@ -718,9 +718,7 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>>>        bool received = false;
>>>>
>>>>        assert(!extent->length);
>>>> -    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
>>>> -                            NULL, &reply, &payload)
>>>> -    {
>>>> +    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
>>
>> This part changes things to permit a simple reply if that reply is an
>> error. In turn, if the server sends a simple reply error,
>> iter.request_ret is set to the error returned by the server while
>> leaving iter.ret and iter.err unset (because we successfully got the
>> server's reply and can keep the connection alive), with commit 7f86068d
>> in play (prior to that commit, iter.ret was left 0 because the server
>> replied successfully, while iter.err was set to the set to the server's
>> error, while leaving iter.fatal unset).

If this hunk is not present, then NBD_FOREACH_REPLY_CHUNK sets iter.err
when encountering a simple error reply from the server (the server
violated the client's expectations by not sending a structured error,
give up on communicating with the server). When this hunk is added,
NBD_FOREACH_REPLY_CHUNK now sets iter.request_ret instead (the server
gave us an error reply, and we are happy with that whether it was an
simple or structured error).

>>
>>>>            int ret;
>>>>            NBDStructuredReplyChunk *chunk = &reply.structured;
>>>>
>>>> @@ -758,9 +756,11 @@ static int nbd_co_receive_blockstatus_reply(NBDClientSession *s,
>>>>            payload = NULL;
>>>>        }
>>>>
>>>> -    if (!extent->length && !iter.err) {
>>>> -        error_setg(&iter.err,
>>>> -                   "Server did not reply with any status extents");
>>>> +    if (!extent->length && !iter.request_ret) {
>>>
>>> Hmm, I don't see, what is changed. You just don't set ret and err if there already
>>> set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
>>> and local_err will not be set and traced. But there is no connection close in this case
>>> in current code, s->quit is not set.
>>
>> And that's okay. The whole point of this is that if the server replied
>> with an error message (iter.request_ret), then it is okay that the
>> server did not give us any extent->length, and the caller will turn
>> iter.request_ret into the EINVAL reply (or whatever other errno value
>> the server sent) to .bdrv_co_block_status() while keeping the connection
>> alive.  But if the server did NOT send an error reply (iter.request_ret
>> is 0), but also did not send us any status, then we need to turn that
>> into an error condition (because our caller expected us to make progress
>> on success, even though the server did not make progress).
> 
> 
> I just say, that I don't see "the client treats the server's
> reply as fatal and hangs up on the connection" in this place before your change.
> Or what you mean?

When you fix the first bug (the client setting iter.err on a simple
error reply from the server because the server's reply wasn't
structured, to now the client setting just iter.request_ret because it
successfully parsed an error out of the server's reply), this code is
now reachable with iter.err == NULL. Since the server replied with an
error, extent->length is 0. So this code (added in 7f86068d) says that
since iter.err is not set, we give up on the server and hang up (but
only for simple errors, not for structured errors). It is a regression,
because prior to 7f86068d, we did not have iter.request_ret, and instead
went by the presence or absence of iter.fatal, and iter.fatal was not
set when dealing with a simple error when a structured error was supplied.

So, the full setup of things to test:

server that always fails with structured error:
 - client before 7f86068d (qemu 3.1): connection stays alive, regardless
of this patch
 - client after 7f86068d: connection stays alive, regardless of this patch

server that always fails with simple error:
 - client before 7f86068d: (second half of patch does not apply, so only
the first half is needed):
   - without first half, connection dies because "Protocol error: simple
reply when structured reply chunk was expected"
   - with first half, connection stays alive as it does for structured error
 - client after 7f86068d, with various stages of this patch:
   - without either half (or with second half only), connection dies
because "Protocol error: simple reply when structured reply chunk was
expected"
   - with first half, connection dies because "server did not reply with
any status extents"
   - with both halves, connection stays alive

Eric Blake March 25, 2019, 4:43 p.m. UTC | #8

On 3/25/19 11:21 AM, Eric Blake wrote:

> When you fix the first bug (the client setting iter.err on a simple
> error reply from the server because the server's reply wasn't
> structured, to now the client setting just iter.request_ret because it
> successfully parsed an error out of the server's reply), this code is
> now reachable with iter.err == NULL. Since the server replied with an
> error, extent->length is 0. So this code (added in 7f86068d) says that
> since iter.err is not set, we give up on the server and hang up (but
> only for simple errors, not for structured errors). It is a regression,
> because prior to 7f86068d, we did not have iter.request_ret, and instead
> went by the presence or absence of iter.fatal, and iter.fatal was not
> set when dealing with a simple error when a structured error was supplied.
> 
> So, the full setup of things to test:
> 
> server that always fails with structured error:
>  - client before 7f86068d (qemu 3.1): connection stays alive, regardless
> of this patch
>  - client after 7f86068d: connection stays alive, regardless of this patch
> 
> server that always fails with simple error:
>  - client before 7f86068d: (second half of patch does not apply, so only
> the first half is needed):
>    - without first half, connection dies because "Protocol error: simple
> reply when structured reply chunk was expected"

What's really annoying is that iter.err does not show up anywhere except
maybe in trace messages; on stderr, the symptoms are merely:

Failed to get allocation status: Invalid argument

because nbd_co_do_receive_one_chunk returns -EINVAL for iter.ret.

>    - with first half, connection stays alive as it does for structured error
>  - client after 7f86068d, with various stages of this patch:
>    - without either half (or with second half only), connection dies
> because "Protocol error: simple reply when structured reply chunk was
> expected"

again, shows up only as

Failed to get allocation status: Invalid argument

>    - with first half, connection dies because "server did not reply with
> any status extents"

Here, shows up only as

Failed to get allocation status: Input/output error

because without the second hunk, iter.ret is set to -EIO.

>    - with both halves, connection stays alive
>

Eric Blake March 25, 2019, 4:52 p.m. UTC | #9

On 3/25/19 11:05 AM, Eric Blake wrote:

> More to the point, the behavior of qemu for a (structured) error reply
> to NBD_CMD_BLOCK_STATUS with no extent->length was to keep the
> connection alive (both before and after commit 7f86068d) - the
> difference in behavior for this hunk of the patch is only visible for
> simple error replies. (Here's the hacks I applied to the server, to test
> forced error replies to NBD_CMD_BLOCK_STATUS, whether structured or simple:
> 
> 
> From 22b4181f887d3b782695dbb11fcc1759281fc51e Mon Sep 17 00:00:00 2001
> From: Eric Blake <eblake@redhat.com>
> Date: Sat, 23 Mar 2019 07:19:33 -0500
> Subject: [PATCH 1/2] nbd: Debug patch to force NBD_CMD_BLOCK_STATUS failure
> 
> Hack to test whether the client is robust to block status failure.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  nbd/server.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index fd013a2817a..94d0c9f4e9e 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2269,6 +2269,8 @@ static coroutine_fn int
> nbd_handle_request(NBDClient *client,
>                                        "discard failed", errp);
> 
>      case NBD_CMD_BLOCK_STATUS:
> +        return nbd_send_generic_reply(client, request->handle, -EINVAL,

As long as I'm hacking things, forcing -ENOMEM here is more instructive
(as both -EINVAL and -EIO can be set in client-side code in nbd-client.c
regardless of what the server sent, but -ENOMEM is a fairly reliable
witness of the actual error message being reported by the server). That
is, when trying to decipher what clients do to a particular forced
server error, seeing the server's 'Cannot allocate memory' is a good
sign that the error message properly transferred across all the needed
layers, rather than getting replaced somewhere along the way.

Eric Blake March 25, 2019, 5:05 p.m. UTC | #10

On 3/25/19 11:21 AM, Eric Blake wrote:

>>>>> -    if (!extent->length && !iter.err) {
>>>>> -        error_setg(&iter.err,
>>>>> -                   "Server did not reply with any status extents");
>>>>> +    if (!extent->length && !iter.request_ret) {
>>>>
>>>> Hmm, I don't see, what is changed. You just don't set ret and err if there already
>>>> set request_ret.. So, finally it may lead to other return code in nbd_client_co_block_status
>>>> and local_err will not be set and traced. But there is no connection close in this case
>>>> in current code, s->quit is not set.
>>>
>>> And that's okay. The whole point of this is that if the server replied
>>> with an error message (iter.request_ret), then it is okay that the
>>> server did not give us any extent->length, and the caller will turn
>>> iter.request_ret into the EINVAL reply (or whatever other errno value
>>> the server sent) to .bdrv_co_block_status() while keeping the connection
>>> alive.  But if the server did NOT send an error reply (iter.request_ret
>>> is 0), but also did not send us any status, then we need to turn that
>>> into an error condition (because our caller expected us to make progress
>>> on success, even though the server did not make progress).
>>
>>
>> I just say, that I don't see "the client treats the server's
>> reply as fatal and hangs up on the connection" in this place before your change.
>> Or what you mean?

Okay, maybe I'm seeing what you are asking. When I reran my testing, I
note that iter.err is left unset when iter.request_ret is set.  So my
earlier message was wrong:


> So, the full setup of things to test:
> 
> server that always fails with structured error:
>  - client before 7f86068d (qemu 3.1): connection stays alive, regardless
> of this patch
>  - client after 7f86068d: connection stays alive, regardless of this patch

The connection does not die, but this patch still matters:

- without the second half of this patch, the error given to the caller
is always EIO, regardless of what the server reported
- with the second half of this patch, the error given to the caller is
the error presented by the server. (The first half of the patch doesn't
matter).

> 
> server that always fails with simple error:
>  - client before 7f86068d: (second half of patch does not apply, so only
> the first half is needed):
>    - without first half, connection dies because "Protocol error: simple
> reply when structured reply chunk was expected"
>    - with first half, connection stays alive as it does for structured error
>  - client after 7f86068d, with various stages of this patch:
>    - without either half (or with second half only), connection dies
> because "Protocol error: simple reply when structured reply chunk was
> expected"
>    - with first half, connection dies because "server did not reply with
> any status extents"

the connection does NOT die in this case, but without the second half,
the caller gets EIO regardless of what the server passed.

>    - with both halves, connection stays alive

and here, the caller gets the server's error.

Okay, with your insistence on making me explain myself, I need to submit
a v2 of this patch, split into two pieces, and with a better commit message.

[for-4.0] nbd: Permit simple error to NBD_CMD_BLOCK_STATUS

Commit Message

Comments

Patch