diff mbox series

[01/10] vhost-user-blk: reconnect on any error during realize

Message ID 20211111153354.18807-2-rvkagan@yandex-team.ru (mailing list archive)
State New, archived
Headers show
Series vhost: stick to -errno error return convention | expand

Commit Message

Roman Kagan Nov. 11, 2021, 3:33 p.m. UTC
vhost-user-blk realize only attempts to reconnect if the previous
connection attempt failed on "a problem with the connection and not an
error related to the content (which would fail again the same way in the
next attempt)".

However this distinction is very subtle, and may be inadvertently broken
if the code changes somewhere deep down the stack and a new error gets
propagated up to here.

OTOH now that the number of reconnection attempts is limited it seems
harmless to try reconnecting on any error.

So relax the condition of whether to retry connecting to check for any
error.

This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
during realize".

Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kevin Wolf Nov. 11, 2021, 5:52 p.m. UTC | #1
Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> vhost-user-blk realize only attempts to reconnect if the previous
> connection attempt failed on "a problem with the connection and not an
> error related to the content (which would fail again the same way in the
> next attempt)".
> 
> However this distinction is very subtle, and may be inadvertently broken
> if the code changes somewhere deep down the stack and a new error gets
> propagated up to here.
> 
> OTOH now that the number of reconnection attempts is limited it seems
> harmless to try reconnecting on any error.
> 
> So relax the condition of whether to retry connecting to check for any
> error.
> 
> This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> during realize".
> 
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>

It results in less than perfect error messages. With a modified export
that just crashes qemu-storage-daemon during get_features, I get:

qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1.
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused

I guess this might be tolerable. On the other hand, the patch doesn't
really fix anything either, but just gets rid of possible subtleties.

Kevin
Roman Kagan Nov. 12, 2021, 7:39 a.m. UTC | #2
On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > vhost-user-blk realize only attempts to reconnect if the previous
> > connection attempt failed on "a problem with the connection and not an
> > error related to the content (which would fail again the same way in the
> > next attempt)".
> > 
> > However this distinction is very subtle, and may be inadvertently broken
> > if the code changes somewhere deep down the stack and a new error gets
> > propagated up to here.
> > 
> > OTOH now that the number of reconnection attempts is limited it seems
> > harmless to try reconnecting on any error.
> > 
> > So relax the condition of whether to retry connecting to check for any
> > error.
> > 
> > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > during realize".
> > 
> > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
> 
> It results in less than perfect error messages. With a modified export
> that just crashes qemu-storage-daemon during get_features, I get:
> 
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1.
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused

This patch doesn't change any error messages.  Which ones specifically
became less than perfect as a result of this patch?

> I guess this might be tolerable. On the other hand, the patch doesn't
> really fix anything either, but just gets rid of possible subtleties.

The remaining patches in the series make other errors beside -EPROTO
propagate up to this point, and some (most) of them are retryable.  This
was the reason to include this patch at the beginning of the series (I
guess I should've mentioned that in the patch log).

Thanks,
Roman.
Kevin Wolf Nov. 12, 2021, 11:37 a.m. UTC | #3
Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben:
> On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > > vhost-user-blk realize only attempts to reconnect if the previous
> > > connection attempt failed on "a problem with the connection and not an
> > > error related to the content (which would fail again the same way in the
> > > next attempt)".
> > > 
> > > However this distinction is very subtle, and may be inadvertently broken
> > > if the code changes somewhere deep down the stack and a new error gets
> > > propagated up to here.
> > > 
> > > OTOH now that the number of reconnection attempts is limited it seems
> > > harmless to try reconnecting on any error.
> > > 
> > > So relax the condition of whether to retry connecting to check for any
> > > error.
> > > 
> > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > > during realize".
> > > 
> > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
> > 
> > It results in less than perfect error messages. With a modified export
> > that just crashes qemu-storage-daemon during get_features, I get:
> > 
> > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1.
> > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error
> > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused
> 
> This patch doesn't change any error messages.  Which ones specifically
> became less than perfect as a result of this patch?

But it adds error messages (for each retry), which are different from
the first error message. As I said this is not the end of the world, but
maybe a bit more confusing.

> > I guess this might be tolerable. On the other hand, the patch doesn't
> > really fix anything either, but just gets rid of possible subtleties.
> 
> The remaining patches in the series make other errors beside -EPROTO
> propagate up to this point, and some (most) of them are retryable.  This
> was the reason to include this patch at the beginning of the series (I
> guess I should've mentioned that in the patch log).

I see. I hadn't looked at the rest of the series yet because I ran out
of time, but now that I'm skimming them, I see quite a few places that
use non-EPROTO, but I wonder which of them actually should be
reconnected. So far all I saw were presumably persistent errors where a
retry won't help. Can you give me some examples?

Kevin
Roman Kagan Nov. 12, 2021, 7:59 p.m. UTC | #4
On Fri, Nov 12, 2021 at 12:37:59PM +0100, Kevin Wolf wrote:
> Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben:
> > On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> > > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > > > vhost-user-blk realize only attempts to reconnect if the previous
> > > > connection attempt failed on "a problem with the connection and not an
> > > > error related to the content (which would fail again the same way in the
> > > > next attempt)".
> > > > 
> > > > However this distinction is very subtle, and may be inadvertently broken
> > > > if the code changes somewhere deep down the stack and a new error gets
> > > > propagated up to here.
> > > > 
> > > > OTOH now that the number of reconnection attempts is limited it seems
> > > > harmless to try reconnecting on any error.
> > > > 
> > > > So relax the condition of whether to retry connecting to check for any
> > > > error.
> > > > 
> > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > > > during realize".
> > > > 
> > > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
> > > 
> > > It results in less than perfect error messages. With a modified export
> > > that just crashes qemu-storage-daemon during get_features, I get:
> > > 
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1.
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused
> > 
> > This patch doesn't change any error messages.  Which ones specifically
> > became less than perfect as a result of this patch?
> 
> But it adds error messages (for each retry), which are different from
> the first error message. As I said this is not the end of the world, but
> maybe a bit more confusing.

Ah, now I see what you mean: it adds reconnection attempts where there
used to be immediate failure return, so now every failed attempt logs
its own message.

> > > I guess this might be tolerable. On the other hand, the patch doesn't
> > > really fix anything either, but just gets rid of possible subtleties.
> > 
> > The remaining patches in the series make other errors beside -EPROTO
> > propagate up to this point, and some (most) of them are retryable.  This
> > was the reason to include this patch at the beginning of the series (I
> > guess I should've mentioned that in the patch log).
> 
> I see. I hadn't looked at the rest of the series yet because I ran out
> of time, but now that I'm skimming them, I see quite a few places that
> use non-EPROTO, but I wonder which of them actually should be
> reconnected. So far all I saw were presumably persistent errors where a
> retry won't help. Can you give me some examples?

E.g. the particular case you mention earlier, -ECONNREFUSED, is not
unlikely to happen due to the vhost-user server restart for maintenance;
in this case retying looks like a reasonable thing to do, doesn't it?

Thanks,
Roman.
Raphael Norwitz Nov. 29, 2021, 10:15 p.m. UTC | #5
> > 
> > I see. I hadn't looked at the rest of the series yet because I ran out
> > of time, but now that I'm skimming them, I see quite a few places that
> > use non-EPROTO, but I wonder which of them actually should be
> > reconnected. So far all I saw were presumably persistent errors where a
> > retry won't help. Can you give me some examples?
> 
> E.g. the particular case you mention earlier, -ECONNREFUSED, is not
> unlikely to happen due to the vhost-user server restart for maintenance;
> in this case retying looks like a reasonable thing to do, doesn't it?
>

Seems like a net-positive to me, expecially with the cleanups in the
rest of the series, but I don't feel strongly.

> Thanks,
> Roman.
>
Raphael Norwitz Nov. 29, 2021, 10:17 p.m. UTC | #6
As mst said, not for 6.2.

On Thu, Nov 11, 2021 at 06:33:45PM +0300, Roman Kagan wrote:
> vhost-user-blk realize only attempts to reconnect if the previous
> connection attempt failed on "a problem with the connection and not an
> error related to the content (which would fail again the same way in the
> next attempt)".
> 
> However this distinction is very subtle, and may be inadvertently broken
> if the code changes somewhere deep down the stack and a new error gets
> propagated up to here.
> 
> OTOH now that the number of reconnection attempts is limited it seems
> harmless to try reconnecting on any error.
> 
> So relax the condition of whether to retry connecting to check for any
> error.
> 
> This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> during realize".
> 
> Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>

Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>

> ---
>  hw/block/vhost-user-blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..f9b17f6813 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -511,7 +511,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>              *errp = NULL;
>          }
>          ret = vhost_user_blk_realize_connect(s, errp);
> -    } while (ret == -EPROTO && retries--);
> +    } while (ret < 0 && retries--);
>  
>      if (ret < 0) {
>          goto virtio_err;
> -- 
> 2.33.1
>
diff mbox series

Patch

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb87e5..f9b17f6813 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -511,7 +511,7 @@  static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
             *errp = NULL;
         }
         ret = vhost_user_blk_realize_connect(s, errp);
-    } while (ret == -EPROTO && retries--);
+    } while (ret < 0 && retries--);
 
     if (ret < 0) {
         goto virtio_err;