[2/2] util/qemu-sockets: make keep-alive enabled by default

Message ID 20200708191540.28455-3-vsementsov@virtuozzo.com
State New
Series
  • keepalive default

Commit Message

Vladimir Sementsov-Ogievskiy July 8, 2020, 7:15 p.m. UTC
Keep-alive won't hurt, let's try to enable it even if not requested by
user.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 util/qemu-sockets.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

Comments

Daniel P. Berrangé July 9, 2020, 8:29 a.m. UTC | #1
On Wed, Jul 08, 2020 at 10:15:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Keep-alive won't hurt, let's try to enable it even if not requested by
> user.

Keep-alive intentionally breaks TCP connections earlier than normal
in face of transient networking problems.

The question is more about which type of pain is more desirable. A
stall in the network connection (for a potentially very long time),
or an intentionally broken socket.

I'm not at all convinced it is a good idea to intentionally break
/all/ QEMU sockets in the face of transient problems, even if the
problems last for 2 hours or more. 

I could see keep-alives being ok on some QEMU socket. For example
VNC/SPICE clients, as there is no downside to proactively culling
them as they can trivially reconnect. Migration too is quite
reasonable to use keep alives, as you generally want migration to
run to completion in a short amount of time, and aborting migration
needs to be safe no matter what.

Breaking chardevs or block devices or network devices that use
QEMU sockets though will be disruptive. The only solution once
those backends have a dead socket is going to be to kill QEMU
and cold-boot the VM again.


> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  util/qemu-sockets.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> index b961963472..f6851376f5 100644
> --- a/util/qemu-sockets.c
> +++ b/util/qemu-sockets.c
> @@ -438,7 +438,8 @@ static struct addrinfo *inet_parse_connect_saddr(InetSocketAddress *saddr,
>   *
>   * Handle keep_alive settings. If user specified settings explicitly, fail if
>   * can't set the settings. If user just enabled keep-alive, not specifying the
> - * settings, try to set defaults but ignore failures.
> + * settings, try to set defaults but ignore failures. If keep-alive option is
> + * not specified, try to set it but ignore failures.
>   */
>  static int inet_set_keepalive(int sock, bool has_keep_alive,
>                                KeepAliveField *keep_alive, Error **errp)
> @@ -447,8 +448,8 @@ static int inet_set_keepalive(int sock, bool has_keep_alive,
>      int val;
>      bool has_settings = has_keep_alive &&  keep_alive->type == QTYPE_QDICT;
>  
> -    if (!has_keep_alive || (keep_alive->type == QTYPE_QBOOL &&
> -                            !keep_alive->u.enabled))
> +    if (has_keep_alive &&
> +        keep_alive->type == QTYPE_QBOOL && !keep_alive->u.enabled)
>      {
>          return 0;
>      }
> @@ -456,8 +457,12 @@ static int inet_set_keepalive(int sock, bool has_keep_alive,
>      val = 1;
>      ret = qemu_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val));
>      if (ret < 0) {
> -        error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
> -        return -1;
> +        if (has_keep_alive) {
> +            error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
> +            return -1;
> +        } else {
> +            return 0;
> +        }
>      }
>  
>      val = has_settings ? keep_alive->u.settings.idle : 30;
> -- 
> 2.21.0
> 

Regards,
Daniel
Vladimir Sementsov-Ogievskiy July 9, 2020, 8:49 a.m. UTC | #2
09.07.2020 11:29, Daniel P. Berrangé wrote:
> On Wed, Jul 08, 2020 at 10:15:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Keep-alive won't hurt, let's try to enable it even if not requested by
>> user.
> 
> Keep-alive intentionally breaks TCP connections earlier than normal
> in face of transient networking problems.
> 
> The question is more about which type of pain is more desirable. A
> stall in the network connection (for a potentially very long time),
> or an intentionally broken socket.
> 
> I'm not at all convinced it is a good idea to intentionally break
> /all/ QEMU sockets in the face of transient problems, even if the
> problems last for 2 hours or more.
> 
> I could see keep-alives being ok on some QEMU socket. For example
> VNC/SPICE clients, as there is no downside to proactively culling
> them as they can trivially reconnect. Migration too is quite
> reasonable to use keep alives, as you generally want migration to
> run to completion in a short amount of time, and aborting migration
> needs to be safe no matter what.
> 
> Breaking chardevs or block devices or network devices that use
> QEMU sockets though will be disruptive. The only solution once
> those backends have a dead socket is going to be to kill QEMU
> and cold-boot the VM again.
> 

Reasonable, thanks for the explanation.

We are mostly interested in keep-alive for migration and NBD connections
(the NBD driver has the ability to reconnect). What do you think about
setting keep-alive (with a KEEPIDLE smaller than 2 hours) by default for
migration and NBD (at least when NBD reconnect is enabled)? Would that
be valid?

> 
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   util/qemu-sockets.c | 15 ++++++++++-----
>>   1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
>> index b961963472..f6851376f5 100644
>> --- a/util/qemu-sockets.c
>> +++ b/util/qemu-sockets.c
>> @@ -438,7 +438,8 @@ static struct addrinfo *inet_parse_connect_saddr(InetSocketAddress *saddr,
>>    *
>>    * Handle keep_alive settings. If user specified settings explicitly, fail if
>>    * can't set the settings. If user just enabled keep-alive, not specifying the
>> - * settings, try to set defaults but ignore failures.
>> + * settings, try to set defaults but ignore failures. If keep-alive option is
>> + * not specified, try to set it but ignore failures.
>>    */
>>   static int inet_set_keepalive(int sock, bool has_keep_alive,
>>                                 KeepAliveField *keep_alive, Error **errp)
>> @@ -447,8 +448,8 @@ static int inet_set_keepalive(int sock, bool has_keep_alive,
>>       int val;
>>       bool has_settings = has_keep_alive &&  keep_alive->type == QTYPE_QDICT;
>>   
>> -    if (!has_keep_alive || (keep_alive->type == QTYPE_QBOOL &&
>> -                            !keep_alive->u.enabled))
>> +    if (has_keep_alive &&
>> +        keep_alive->type == QTYPE_QBOOL && !keep_alive->u.enabled)
>>       {
>>           return 0;
>>       }
>> @@ -456,8 +457,12 @@ static int inet_set_keepalive(int sock, bool has_keep_alive,
>>       val = 1;
>>       ret = qemu_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val));
>>       if (ret < 0) {
>> -        error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
>> -        return -1;
>> +        if (has_keep_alive) {
>> +            error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
>> +            return -1;
>> +        } else {
>> +            return 0;
>> +        }
>>       }
>>   
>>       val = has_settings ? keep_alive->u.settings.idle : 30;
>> -- 
>> 2.21.0
>>
> 
> Regards,
> Daniel
>
Denis V. Lunev July 9, 2020, 8:54 a.m. UTC | #3
On 7/9/20 11:29 AM, Daniel P. Berrangé wrote:
> On Wed, Jul 08, 2020 at 10:15:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Keep-alive won't hurt, let's try to enable it even if not requested by
>> user.
> Keep-alive intentionally breaks TCP connections earlier than normal
> in face of transient networking problems.
>
> The question is more about which type of pain is more desirable. A
> stall in the network connection (for a potentially very long time),
> or an intentionally broken socket.
>
> I'm not at all convinced it is a good idea to intentionally break
> /all/ QEMU sockets in the face of transient problems, even if the
> problems last for 2 hours or more. 
>
> I could see keep-alives being ok on some QEMU socket. For example
> VNC/SPICE clients, as there is no downside to proactively culling
> them as they can trivially reconnect. Migration too is quite
> reasonable to use keep alives, as you generally want migration to
> run to completion in a short amount of time, and aborting migration
> needs to be safe no matter what.
>
> Breaking chardevs or block devices or network devices that use
> QEMU sockets though will be disruptive. The only solution once
> those backends have a dead socket is going to be to kill QEMU
> and cold-boot the VM again.

Nope, and this is exactly what we are trying to achieve.

Let us assume that QEMU NBD is connected to the
outside world, e.g. to some HA service running in
another virtual machine. Once that remote VM
dies, it is restarted on some other host with the
same IP.

QEMU NBD has the ability to reconnect to this same
endpoint, and this process is transparent to the guest.

This is the workflow we are trying to improve.

Anyway, sitting on a dead socket is not productive.
This is like NFS hard and soft mounts: in the
hypervisor world, hard mounts (the default before
this patch) lead to various undetectable deadlocks,
which is why we are proposing soft behaviour with
these defaults.

It should also be noted that this is more consistent:
even today a write to a dead socket can fail, or we
can hang forever, so the problem is already possible
with the current state. With the new settings we
would observe the problem consistently.

Den
Markus Armbruster July 9, 2020, 11:40 a.m. UTC | #4
"Denis V. Lunev" <den@openvz.org> writes:

> On 7/9/20 11:29 AM, Daniel P. Berrangé wrote:
>> On Wed, Jul 08, 2020 at 10:15:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> Keep-alive won't hurt, let's try to enable it even if not requested by
>>> user.
>> Keep-alive intentionally breaks TCP connections earlier than normal
>> in face of transient networking problems.
>>
>> The question is more about which type of pain is more desirable. A
>> stall in the network connection (for a potentially very long time),
>> or an intentionally broken socket.
>>
>> I'm not at all convinced it is a good idea to intentionally break
>> /all/ QEMU sockets in the face of transient problems, even if the
>> problems last for 2 hours or more. 
>>
>> I could see keep-alives being ok on some QEMU socket. For example
>> VNC/SPICE clients, as there is no downside to proactively culling
>> them as they can trivially reconnect. Migration too is quite
>> reasonable to use keep alives, as you generally want migration to
>> run to completion in a short amount of time, and aborting migration
>> needs to be safe no matter what.
>>
>> Breaking chardevs or block devices or network devices that use
>> QEMU sockets though will be disruptive. The only solution once
>> those backends have a dead socket is going to be to kill QEMU
>> and cold-boot the VM again.
>
> nope, and this is exactly what we are trying to achive.
>
> Let us assume that QEMU NBD is connected to the
> outside world, f.e. to some HA service running in
> other virtual machine. Once that far away VM is
> becoming dead, it is re-started on some other host
> with the same IP.
>
> QEMU NBD has an ability to reconnect to this same
> endpoint and this process is transient for the guest.
>
> This is the workflow we are trying to improve.
>
> Anyway, sitting over dead socket is somewhat
> which is not productive. This is like NFS hard and
> soft mounts. In hypervisor world using hard mounts
> (defaults before the patch) leads to various non
> detectable deadlocks, that is why we are proposing
> soft with such defaults.
>
> It should also be noted that this is more consistent
> as we could face the problem if we perform write
> to the dead socket OR we could hang forever, thus
> the problem with the current state is still possible.
> With new settings we would consistently observe
> the problem.

Daniel's point remains valid: keep-alive makes sense only for sockets
where we can recover from connection breakage.  When graceful recovery
is impossible, we shouldn't aggressively break unresponsive connections,
throwing away the chance (however slim) of them becoming responsive
again.
Daniel P. Berrangé July 9, 2020, 11:52 a.m. UTC | #5
On Thu, Jul 09, 2020 at 11:49:17AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.07.2020 11:29, Daniel P. Berrangé wrote:
> > On Wed, Jul 08, 2020 at 10:15:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > Keep-alive won't hurt, let's try to enable it even if not requested by
> > > user.
> > 
> > Keep-alive intentionally breaks TCP connections earlier than normal
> > in face of transient networking problems.
> > 
> > The question is more about which type of pain is more desirable. A
> > stall in the network connection (for a potentially very long time),
> > or an intentionally broken socket.
> > 
> > I'm not at all convinced it is a good idea to intentionally break
> > /all/ QEMU sockets in the face of transient problems, even if the
> > problems last for 2 hours or more.
> > 
> > I could see keep-alives being ok on some QEMU socket. For example
> > VNC/SPICE clients, as there is no downside to proactively culling
> > them as they can trivially reconnect. Migration too is quite
> > reasonable to use keep alives, as you generally want migration to
> > run to completion in a short amount of time, and aborting migration
> > needs to be safe no matter what.
> > 
> > Breaking chardevs or block devices or network devices that use
> > QEMU sockets though will be disruptive. The only solution once
> > those backends have a dead socket is going to be to kill QEMU
> > and cold-boot the VM again.
> > 
> 
> Reasonable, thanks for explanation.
> 
> We are mostly interested in keep-alive for migration and NBD connections.
> (NBD driver has ability to reconnect). What do you think about setting
> keep-alive (with some KEEPIDLE smaller than 2 hours) by default for
> migration and NBD (at least when NBD reconnect is enabled), would it be
> valid?

I think it should be reasonable to set it by default for those particular
scenarios, as both expect failures and are ready to take action when they
occur.

Regards,
Daniel

Patch

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index b961963472..f6851376f5 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -438,7 +438,8 @@  static struct addrinfo *inet_parse_connect_saddr(InetSocketAddress *saddr,
  *
  * Handle keep_alive settings. If user specified settings explicitly, fail if
  * can't set the settings. If user just enabled keep-alive, not specifying the
- * settings, try to set defaults but ignore failures.
+ * settings, try to set defaults but ignore failures. If keep-alive option is
+ * not specified, try to set it but ignore failures.
  */
 static int inet_set_keepalive(int sock, bool has_keep_alive,
                               KeepAliveField *keep_alive, Error **errp)
@@ -447,8 +448,8 @@  static int inet_set_keepalive(int sock, bool has_keep_alive,
     int val;
     bool has_settings = has_keep_alive &&  keep_alive->type == QTYPE_QDICT;
 
-    if (!has_keep_alive || (keep_alive->type == QTYPE_QBOOL &&
-                            !keep_alive->u.enabled))
+    if (has_keep_alive &&
+        keep_alive->type == QTYPE_QBOOL && !keep_alive->u.enabled)
     {
         return 0;
     }
@@ -456,8 +457,12 @@  static int inet_set_keepalive(int sock, bool has_keep_alive,
     val = 1;
     ret = qemu_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &val, sizeof(val));
     if (ret < 0) {
-        error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
-        return -1;
+        if (has_keep_alive) {
+            error_setg_errno(errp, errno, "Unable to set KEEPALIVE");
+            return -1;
+        } else {
+            return 0;
+        }
     }
 
     val = has_settings ? keep_alive->u.settings.idle : 30;