Message ID | 20211111153354.18807-2-rvkagan@yandex-team.ru (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vhost: stick to -errno error return convention | expand |
Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben: > vhost-user-blk realize only attempts to reconnect if the previous > connection attempt failed on "a problem with the connection and not an > error related to the content (which would fail again the same way in the > next attempt)". > > However this distinction is very subtle, and may be inadvertently broken > if the code changes somewhere deep down the stack and a new error gets > propagated up to here. > > OTOH now that the number of reconnection attempts is limited it seems > harmless to try reconnecting on any error. > > So relax the condition of whether to retry connecting to check for any > error. > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection > during realize". > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> It results in less than perfect error messages. With a modified export that just crashes qemu-storage-daemon during get_features, I get: qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1. qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused I guess this might be tolerable. On the other hand, the patch doesn't really fix anything either, but just gets rid of possible subtleties. Kevin
On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote: > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben: > > vhost-user-blk realize only attempts to reconnect if the previous > > connection attempt failed on "a problem with the connection and not an > > error related to the content (which would fail again the same way in the > > next attempt)". > > > > However this distinction is very subtle, and may be inadvertently broken > > if the code changes somewhere deep down the stack and a new error gets > > propagated up to here. > > > > OTOH now that the number of reconnection attempts is limited it seems > > harmless to try reconnecting on any error. > > > > So relax the condition of whether to retry connecting to check for any > > error. > > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection > > during realize". > > > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> > > It results in less than perfect error messages. With a modified export > that just crashes qemu-storage-daemon during get_features, I get: > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1. > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused This patch doesn't change any error messages. Which ones specifically became less than perfect as a result of this patch? > I guess this might be tolerable. On the other hand, the patch doesn't > really fix anything either, but just gets rid of possible subtleties. The remaining patches in the series make other errors beside -EPROTO propagate up to this point, and some (most) of them are retryable. This was the reason to include this patch at the beginning of the series (I guess I should've mentioned that in the patch log). Thanks, Roman.
Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben: > On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote: > > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben: > > > vhost-user-blk realize only attempts to reconnect if the previous > > > connection attempt failed on "a problem with the connection and not an > > > error related to the content (which would fail again the same way in the > > > next attempt)". > > > > > > However this distinction is very subtle, and may be inadvertently broken > > > if the code changes somewhere deep down the stack and a new error gets > > > propagated up to here. > > > > > > OTOH now that the number of reconnection attempts is limited it seems > > > harmless to try reconnecting on any error. > > > > > > So relax the condition of whether to retry connecting to check for any > > > error. > > > > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection > > > during realize". > > > > > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> > > > > It results in less than perfect error messages. With a modified export > > that just crashes qemu-storage-daemon during get_features, I get: > > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1. > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused > > This patch doesn't change any error messages. Which ones specifically > became less than perfect as a result of this patch? But it adds error messages (for each retry), which are different from the first error message. As I said this is not the end of the world, but maybe a bit more confusing. > > I guess this might be tolerable. On the other hand, the patch doesn't > > really fix anything either, but just gets rid of possible subtleties. > > The remaining patches in the series make other errors beside -EPROTO > propagate up to this point, and some (most) of them are retryable. This > was the reason to include this patch at the beginning of the series (I > guess I should've mentioned that in the patch log). I see. I hadn't looked at the rest of the series yet because I ran out of time, but now that I'm skimming them, I see quite a few places that use non-EPROTO, but I wonder which of them actually should be reconnected. So far all I saw were presumably persistent errors where a retry won't help. Can you give me some examples? Kevin
On Fri, Nov 12, 2021 at 12:37:59PM +0100, Kevin Wolf wrote: > Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben: > > On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote: > > > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben: > > > > vhost-user-blk realize only attempts to reconnect if the previous > > > > connection attempt failed on "a problem with the connection and not an > > > > error related to the content (which would fail again the same way in the > > > > next attempt)". > > > > > > > > However this distinction is very subtle, and may be inadvertently broken > > > > if the code changes somewhere deep down the stack and a new error gets > > > > propagated up to here. > > > > > > > > OTOH now that the number of reconnection attempts is limited it seems > > > > harmless to try reconnecting on any error. > > > > > > > > So relax the condition of whether to retry connecting to check for any > > > > error. > > > > > > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection > > > > during realize". > > > > > > > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> > > > > > > It results in less than perfect error messages. With a modified export > > > that just crashes qemu-storage-daemon during get_features, I get: > > > > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg header. Read 0 instead of 12. Original request 1. > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: vhost_backend_init failed: Protocol error > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after error: Failed to connect to '/tmp/vsock': Connection refused > > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect to '/tmp/vsock': Connection refused > > > > This patch doesn't change any error messages. Which ones specifically > > became less than perfect as a result of this patch? > > But it adds error messages (for each retry), which are different from > the first error message. As I said this is not the end of the world, but > maybe a bit more confusing. Ah, now I see what you mean: it adds reconnection attempts where there used to be immediate failure return, so now every failed attempt logs its own message. > > > I guess this might be tolerable. On the other hand, the patch doesn't > > > really fix anything either, but just gets rid of possible subtleties. > > > > The remaining patches in the series make other errors beside -EPROTO > > propagate up to this point, and some (most) of them are retryable. This > > was the reason to include this patch at the beginning of the series (I > > guess I should've mentioned that in the patch log). > > I see. I hadn't looked at the rest of the series yet because I ran out > of time, but now that I'm skimming them, I see quite a few places that > use non-EPROTO, but I wonder which of them actually should be > reconnected. So far all I saw were presumably persistent errors where a > retry won't help. Can you give me some examples? E.g. the particular case you mention earlier, -ECONNREFUSED, is not unlikely to happen due to the vhost-user server restart for maintenance; in this case retying looks like a reasonable thing to do, doesn't it? Thanks, Roman.
> > > > I see. I hadn't looked at the rest of the series yet because I ran out > > of time, but now that I'm skimming them, I see quite a few places that > > use non-EPROTO, but I wonder which of them actually should be > > reconnected. So far all I saw were presumably persistent errors where a > > retry won't help. Can you give me some examples? > > E.g. the particular case you mention earlier, -ECONNREFUSED, is not > unlikely to happen due to the vhost-user server restart for maintenance; > in this case retying looks like a reasonable thing to do, doesn't it? > Seems like a net-positive to me, expecially with the cleanups in the rest of the series, but I don't feel strongly. > Thanks, > Roman. >
As mst said, not for 6.2. On Thu, Nov 11, 2021 at 06:33:45PM +0300, Roman Kagan wrote: > vhost-user-blk realize only attempts to reconnect if the previous > connection attempt failed on "a problem with the connection and not an > error related to the content (which would fail again the same way in the > next attempt)". > > However this distinction is very subtle, and may be inadvertently broken > if the code changes somewhere deep down the stack and a new error gets > propagated up to here. > > OTOH now that the number of reconnection attempts is limited it seems > harmless to try reconnecting on any error. > > So relax the condition of whether to retry connecting to check for any > error. > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection > during realize". > > Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> > --- > hw/block/vhost-user-blk.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c > index ba13cb87e5..f9b17f6813 100644 > --- a/hw/block/vhost-user-blk.c > +++ b/hw/block/vhost-user-blk.c > @@ -511,7 +511,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp) > *errp = NULL; > } > ret = vhost_user_blk_realize_connect(s, errp); > - } while (ret == -EPROTO && retries--); > + } while (ret < 0 && retries--); > > if (ret < 0) { > goto virtio_err; > -- > 2.33.1 >
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index ba13cb87e5..f9b17f6813 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -511,7 +511,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp) *errp = NULL; } ret = vhost_user_blk_realize_connect(s, errp); - } while (ret == -EPROTO && retries--); + } while (ret < 0 && retries--); if (ret < 0) { goto virtio_err;
vhost-user-blk realize only attempts to reconnect if the previous connection attempt failed on "a problem with the connection and not an error related to the content (which would fail again the same way in the next attempt)". However this distinction is very subtle, and may be inadvertently broken if the code changes somewhere deep down the stack and a new error gets propagated up to here. OTOH now that the number of reconnection attempts is limited it seems harmless to try reconnecting on any error. So relax the condition of whether to retry connecting to check for any error. This patch amends a527e312b5 "vhost-user-blk: Implement reconnection during realize". Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> --- hw/block/vhost-user-blk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)