diff mbox series

[v11,2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

Message ID 20220504191835.791580-3-leobras@redhat.com (mailing list archive)
State New, archived
Headers show
Series MSG_ZEROCOPY + multifd | expand

Commit Message

Leonardo Bras May 4, 2022, 7:18 p.m. UTC
For CONFIG_LINUX, implement the new zero copy flag and the optional callback
io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
feature is available in the host kernel, which is checked on
qio_channel_socket_connect_sync()

qio_channel_socket_flush() was implemented by counting how many times
sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue, in order to find how many of them finished sending.
Flush will loop until those counters are the same, or until some error occurs.

Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happen, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush() before freeing
or re-using the buffer.

2: Locked memory
- When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may require
a larger amount of locked memory than usually available to non-root user.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration,
- Because of this, when an user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so it can still be accessible to less
privileged users.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/io/channel-socket.h |   2 +
 io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
 2 files changed, 118 insertions(+), 4 deletions(-)

Comments

Peter Xu May 4, 2022, 7:53 p.m. UTC | #1
On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> +/*
> + * Zero-copy defines bellow are included to avoid breaking builds on systems
> + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> + * (without a lot of ifdefs).
> + */
> +#ifndef MSG_ZEROCOPY
> +#define MSG_ZEROCOPY 0x4000000
> +#endif
> +#ifndef SO_ZEROCOPY
> +#define SO_ZEROCOPY 60
> +#endif

So this will define these two values on e.g. FreeBSD, while they do not
make sense at all there because these numbers are pure magics and
meaningless outside Linux..

I don't think it's anything dangerous, but IMHO it's another way of being
not clean comparing of using some "#ifdef"s.  Comparing to this approach
the "use #ifdef" approach is actually slightly more cleaner to me. :)

Let's wait for some other inputs.
Leonardo Bras May 5, 2022, 4:20 a.m. UTC | #2
On Wed, May 4, 2022 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > +/*
> > + * Zero-copy defines bellow are included to avoid breaking builds on systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> So this will define these two values on e.g. FreeBSD, while they do not
> make sense at all there because these numbers are pure magics and
> meaningless outside Linux..

Correct.
But since only in Linux it's possible to set the
QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag, sflags will always be zero and
it would never try using MSG_ZEROCOPY outside Linux.

> I don't think it's anything dangerous, but IMHO it's another way of being
> not clean comparing of using some "#ifdef"s.  Comparing to this approach
> the "use #ifdef" approach is actually slightly more cleaner to me. :)
>

This requires:
- Creating a define such as 'QEMU_MSG_ZEROCOPY', that needs to include
<sys/socket.h> to get some flags:
#define QEMU_MSG_ZEROCOPY defined(CONFIG_LINUX) &&
defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY)
- Making it available for all code in this patchset that does "ifdef
CONFIG_LINUX'
(migration/migration.c/h, qapi/migration.json, monitor/hmp-cmds.c,
io/channel-socket.c)
- Replace current usage of CONFIG_LINUX in this patchset for QEMU_MSG_ZEROCOPY
- Change qio_channel_socket_writev() so the current 2 usages of
MSG_ZEROCOPY are surrounded by ifdef QEMU_MSG_ZEROCOPY.

Pros of above approach (1):
- Smaller binary: The whole MSG_ZEROCOPY code is compiled out if the
building system does not support it.
- Since it's compiled out, there is a couple lines of less code
running if the building system does not support it
- It's not even possible to set this option in MigrationSetParams,
which will return an error.

Pros of current approach (2):
- Define is local to file (I am not sure if it's ok to create a
'global' define for above approach, including <sys/socket.h> bits)
- A build system that does not support MSG_ZEROCOPY can produce a
binary that can use MSG_ZEROCOPY if the target system supports it.
- There are no #ifdefs on qio_channel_socket_writev()

(2) is already implemented in v11, but I have no issue implementing
(1) for v12 if it's ok to create this 'global' define.

> Let's wait for some other inputs.

Agree.
Having the pros of each approach clear, I would like some input on
what is better for the project.

Best regards,
Leo
Daniel P. Berrangé May 5, 2022, 8:05 a.m. UTC | #3
On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> feature is available in the host kernel, which is checked on
> qio_channel_socket_connect_sync()
> 
> qio_channel_socket_flush() was implemented by counting how many times
> sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> socket's error queue, in order to find how many of them finished sending.
> Flush will loop until those counters are the same, or until some error occurs.
> 
> Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> 1: Buffer
> - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
> some caution is necessary to avoid overwriting any buffer before it's sent.
> If something like this happen, a newer version of the buffer may be sent instead.
> - If this is a problem, it's recommended to call qio_channel_flush() before freeing
> or re-using the buffer.
> 
> 2: Locked memory
> - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
> unlocked after it's sent.
> - Depending on the size of each buffer, and how often it's sent, it may require
> a larger amount of locked memory than usually available to non-root user.
> - If the required amount of locked memory is not available, writev_zero_copy
> will return an error, which can abort an operation like migration,
> - Because of this, when an user code wants to add zero copy as a feature, it
> requires a mechanism to disable it, so it can still be accessible to less
> privileged users.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  include/io/channel-socket.h |   2 +
>  io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 118 insertions(+), 4 deletions(-)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..513c428fe4 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +    ssize_t zero_copy_queued;
> +    ssize_t zero_copy_sent;
>  };
>  
>  
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 696a04dc9c..ae756ce166 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -25,9 +25,25 @@
>  #include "io/channel-watch.h"
>  #include "trace.h"
>  #include "qapi/clone-visitor.h"
> +#ifdef CONFIG_LINUX
> +#include <linux/errqueue.h>
> +#include <sys/socket.h>
> +#endif
>  
>  #define SOCKET_MAX_FDS 16
>  
> +/*
> + * Zero-copy defines bellow are included to avoid breaking builds on systems
> + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> + * (without a lot of ifdefs).
> + */
> +#ifndef MSG_ZEROCOPY
> +#define MSG_ZEROCOPY 0x4000000
> +#endif
> +#ifndef SO_ZEROCOPY
> +#define SO_ZEROCOPY 60
> +#endif

Please put these behind CONFIG_LINUX to make it clear to readers that
this is entirely Linux specific


> +
>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>                                       Error **errp)
> @@ -54,6 +70,8 @@ qio_channel_socket_new(void)
>  
>      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
>      sioc->fd = -1;
> +    sioc->zero_copy_queued = 0;
> +    sioc->zero_copy_sent = 0;
>  
>      ioc = QIO_CHANNEL(sioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
>          return -1;
>      }
>  
> +#ifdef CONFIG_LINUX
> +    int ret, v = 1;
> +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret == 0) {
> +        /* Zero copy available on host */
> +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> +    }
> +#endif
> +
>      return 0;
>  }
>  
> @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>      char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
>      size_t fdsize = sizeof(int) * nfds;
>      struct cmsghdr *cmsg;
> +    int sflags = 0;
>  
>      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>  
> @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>          memcpy(CMSG_DATA(cmsg), fds, fdsize);
>      }
>  
> +    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> +        sflags = MSG_ZEROCOPY;
> +    }

Also should be behind CONFIG_LINUX

> +
>   retry:
> -    ret = sendmsg(sioc->fd, &msg, 0);
> +    ret = sendmsg(sioc->fd, &msg, sflags);
>      if (ret <= 0) {
> -        if (errno == EAGAIN) {
> +        switch (errno) {
> +        case EAGAIN:
>              return QIO_CHANNEL_ERR_BLOCK;
> -        }
> -        if (errno == EINTR) {
> +        case EINTR:
>              goto retry;
> +        case ENOBUFS:
> +            if (sflags & MSG_ZEROCOPY) {
> +                error_setg_errno(errp, errno,
> +                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> +                return -1;
> +            }
> +            break;

And this ENOBUGS case behind CONFIG_LINUX

>          }
> +
>          error_setg_errno(errp, errno,
>                           "Unable to write to socket");
>          return -1;
> @@ -659,6 +700,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>  }
>  #endif /* WIN32 */
>  
> +
> +#ifdef CONFIG_LINUX
> +static int qio_channel_socket_flush(QIOChannel *ioc,
> +                                    Error **errp)
> +{
> +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> +    struct msghdr msg = {};
> +    struct sock_extended_err *serr;
> +    struct cmsghdr *cm;
> +    char control[CMSG_SPACE(sizeof(*serr))];
> +    int received;
> +    int ret = 1;
> +
> +    msg.msg_control = control;
> +    msg.msg_controllen = sizeof(control);
> +    memset(control, 0, sizeof(control));
> +
> +    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
> +        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
> +        if (received < 0) {
> +            switch (errno) {
> +            case EAGAIN:
> +                /* Nothing on errqueue, wait until something is available */
> +                qio_channel_wait(ioc, G_IO_ERR);
> +                continue;
> +            case EINTR:
> +                continue;
> +            default:
> +                error_setg_errno(errp, errno,
> +                                 "Unable to read errqueue");
> +                return -1;
> +            }
> +        }
> +
> +        cm = CMSG_FIRSTHDR(&msg);
> +        if (cm->cmsg_level != SOL_IP &&
> +            cm->cmsg_type != IP_RECVERR) {
> +            error_setg_errno(errp, EPROTOTYPE,
> +                             "Wrong cmsg in errqueue");
> +            return -1;
> +        }
> +
> +        serr = (void *) CMSG_DATA(cm);
> +        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
> +            error_setg_errno(errp, serr->ee_errno,
> +                             "Error on socket");
> +            return -1;
> +        }
> +        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> +            error_setg_errno(errp, serr->ee_origin,
> +                             "Error not from zero copy");
> +            return -1;
> +        }
> +
> +        /* No errors, count successfully finished sendmsg()*/
> +        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
> +
> +        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
> +        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
> +            ret = 0;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +#endif /* CONFIG_LINUX */
> +
>  static int
>  qio_channel_socket_set_blocking(QIOChannel *ioc,
>                                  bool enabled,
> @@ -789,6 +898,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
>      ioc_klass->io_set_delay = qio_channel_socket_set_delay;
>      ioc_klass->io_create_watch = qio_channel_socket_create_watch;
>      ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
> +#ifdef CONFIG_LINUX
> +    ioc_klass->io_flush = qio_channel_socket_flush;
> +#endif
>  }
>  
>  static const TypeInfo qio_channel_socket_info = {
> -- 
> 2.36.0
> 

With regards,
Daniel
Peter Xu May 5, 2022, 12:41 p.m. UTC | #4
On Thu, May 05, 2022 at 01:20:04AM -0300, Leonardo Bras Soares Passos wrote:
> (2) is already implemented in v11, but I have no issue implementing
> (1) for v12 if it's ok to create this 'global' define.

Dan's suggestion in the other thread sounds good to me with current
approach, on having CONFIG_LINUX to wrap the defines.

But note that it's still prone to change in the future, e.g., when other
qemu modules start to use MSG_ZEROCOPY, we'll probably at least move the
defines into some other headers.  We could probably need to clean things up
when it happens.

But I won't strongly ask for that - we can leave that for later.

Thanks,
Leonardo Bras May 5, 2022, 3:42 p.m. UTC | #5
On Thu, May 5, 2022 at 5:05 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> >
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error occurs.
> >
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happen, a newer version of the buffer may be sent instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before freeing
> > or re-using the buffer.
> >
> > 2: Locked memory
> > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when an user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > ---
> >  include/io/channel-socket.h |   2 +
> >  io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
> >  2 files changed, 118 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..513c428fe4 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -47,6 +47,8 @@ struct QIOChannelSocket {
> >      socklen_t localAddrLen;
> >      struct sockaddr_storage remoteAddr;
> >      socklen_t remoteAddrLen;
> > +    ssize_t zero_copy_queued;
> > +    ssize_t zero_copy_sent;
> >  };
> >
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index 696a04dc9c..ae756ce166 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -25,9 +25,25 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#ifdef CONFIG_LINUX
> > +#include <linux/errqueue.h>
> > +#include <sys/socket.h>
> > +#endif
> >
> >  #define SOCKET_MAX_FDS 16
> >
> > +/*
> > + * Zero-copy defines bellow are included to avoid breaking builds on systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> Please put these behind CONFIG_LINUX to make it clear to readers that
> this is entirely Linux specific
>
>
> > +
> >  SocketAddress *
> >  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> >                                       Error **errp)
> > @@ -54,6 +70,8 @@ qio_channel_socket_new(void)
> >
> >      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
> >      sioc->fd = -1;
> > +    sioc->zero_copy_queued = 0;
> > +    sioc->zero_copy_sent = 0;
> >
> >      ioc = QIO_CHANNEL(sioc);
> >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> > @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
> >          return -1;
> >      }
> >
> > +#ifdef CONFIG_LINUX
> > +    int ret, v = 1;
> > +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret == 0) {
> > +        /* Zero copy available on host */
> > +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +    }
> > +#endif
> > +
> >      return 0;
> >  }
> >
> > @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >      char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
> >      size_t fdsize = sizeof(int) * nfds;
> >      struct cmsghdr *cmsg;
> > +    int sflags = 0;
> >
> >      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >
> > @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >          memcpy(CMSG_DATA(cmsg), fds, fdsize);
> >      }
> >
> > +    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +        sflags = MSG_ZEROCOPY;
> > +    }
>
> Also should be behind CONFIG_LINUX


Hello Daniel,

But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
As qapi will have zero-copy-send as an option we could have this scenario:

- User request migration using zero-copy-send
- multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
- In qio_channel_socket_connect_sync(): setsockopt() part will be
compiled-out, so no error here
- above part in qio_channel_socket_writev() will be commented-out,
which means write_flags will be ignored
- sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
- migration will succeed

In the above case, the user has all the reason to think migration is
using MSG_ZEROCOPY, but in fact it's quietly falling back to
copy-mode.

That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
of even offering zero-copy-send if we don't have it.

Another local option is to do implement your suggestions, and also
change qio_channel_socket_connect_sync() so it returns an error if
MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:

+#ifdef CONFIG_LINUX
+#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
+    int ret, v = 1;
+    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#else
+    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
+    return -1;
+#endif
+#endif

What do you think?

Best regards,
Leo

>
> > +
> >   retry:
> > -    ret = sendmsg(sioc->fd, &msg, 0);
> > +    ret = sendmsg(sioc->fd, &msg, sflags);
> >      if (ret <= 0) {
> > -        if (errno == EAGAIN) {
> > +        switch (errno) {
> > +        case EAGAIN:
> >              return QIO_CHANNEL_ERR_BLOCK;
> > -        }
> > -        if (errno == EINTR) {
> > +        case EINTR:
> >              goto retry;
> > +        case ENOBUFS:
> > +            if (sflags & MSG_ZEROCOPY) {
> > +                error_setg_errno(errp, errno,
> > +                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> > +                return -1;
> > +            }
> > +            break;
>
> And this ENOBUGS case behind CONFIG_LINUX
>


[...]
Daniel P. Berrangé May 5, 2022, 3:55 p.m. UTC | #6
On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote:
> 
> Hello Daniel,
> 
> But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
> As qapi will have zero-copy-send as an option we could have this scenario:
> 
> - User request migration using zero-copy-send
> - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
> - In qio_channel_socket_connect_sync(): setsockopt() part will be
> compiled-out, so no error here
> - above part in qio_channel_socket_writev() will be commented-out,
> which means write_flags will be ignored
> - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
> - migration will succeed
> 
> In the above case, the user has all the reason to think migration is
> using MSG_ZEROCOPY, but in fact it's quietly falling back to
> copy-mode.

I think we're ok because qio_channel_writev_full() does

    if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
        error_setg_errno(errp, EINVAL,
                         "Requested Zero Copy feature is not available");
        return -1;
    }

and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to
get set when MSG_ZEROCOPY is compiled out, we'll trigger the error
condition.

> That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
> MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
> of even offering zero-copy-send if we don't have it.
> 
> Another local option is to do implement your suggestions, and also
> change qio_channel_socket_connect_sync() so it returns an error if
> MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:
> 
> +#ifdef CONFIG_LINUX
> +#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
> +    int ret, v = 1;
> +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret == 0) {
> +        /* Zero copy available on host */
> +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> +    }
> +#else
> +    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
> +    return -1;
> +#endif
> +#endif

Do we actually need the ifdef CONFIG_LINUX bit at all ?

Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY,
which will fail on non-Linux anyway.


With regards,
Daniel
Leonardo Bras May 5, 2022, 5:01 p.m. UTC | #7
On Thu, May 5, 2022 at 12:55 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote:
> >
> > Hello Daniel,
> >
> > But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
> > As qapi will have zero-copy-send as an option we could have this scenario:
> >
> > - User request migration using zero-copy-send
> > - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
> > - In qio_channel_socket_connect_sync(): setsockopt() part will be
> > compiled-out, so no error here
> > - above part in qio_channel_socket_writev() will be commented-out,
> > which means write_flags will be ignored
> > - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
> > - migration will succeed
> >
> > In the above case, the user has all the reason to think migration is
> > using MSG_ZEROCOPY, but in fact it's quietly falling back to
> > copy-mode.
>
> I think we're ok because qio_channel_writev_full() does
>
>     if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
>         !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
>         error_setg_errno(errp, EINVAL,
>                          "Requested Zero Copy feature is not available");
>         return -1;
>     }
>
> and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to
> get set when MSG_ZEROCOPY is compiled out, we'll trigger the error
> condition.

Oh, that's right. It will fail in the first writev(), I was just
considering failing during setup.

>
> > That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
> > MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
> > of even offering zero-copy-send if we don't have it.
> >
> > Another local option is to do implement your suggestions, and also
> > change qio_channel_socket_connect_sync() so it returns an error if
> > MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:
> >
> > +#ifdef CONFIG_LINUX
> > +#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
> > +    int ret, v = 1;
> > +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret == 0) {
> > +        /* Zero copy available on host */
> > +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +    }
> > +#else
> > +    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
> > +    return -1;
> > +#endif
> > +#endif
>
> Do we actually need the ifdef CONFIG_LINUX bit at all ?
>
> Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY,
> which will fail on non-Linux anyway.

By some include issue, or by future implementations we can have
MSG_ZEROCOPY or SO_ZEROCOPY getting defined in OS other than Linux,
which would introduce some headaches.

Since you pointed out that migration will fail on writev, the above
piece of code is not necessary.
We could have a local define that equals to (MSG_ZEROCOPY &&
SO_ZEROCOPY && CONFIG_LINUX) so that we can make the code simpler
where needed.

I will work on a v12 and send it here.

Best regards,
Leo
diff mbox series

Patch

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@  struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    ssize_t zero_copy_queued;
+    ssize_t zero_copy_sent;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 696a04dc9c..ae756ce166 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -25,9 +25,25 @@ 
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include <linux/errqueue.h>
+#include <sys/socket.h>
+#endif
 
 #define SOCKET_MAX_FDS 16
 
+/*
+ * Zero-copy defines bellow are included to avoid breaking builds on systems
+ * that don't support MSG_ZEROCOPY, while keeping the functions more readable
+ * (without a lot of ifdefs).
+ */
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY 0x4000000
+#endif
+#ifndef SO_ZEROCOPY
+#define SO_ZEROCOPY 60
+#endif
+
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
                                      Error **errp)
@@ -54,6 +70,8 @@  qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zero_copy_queued = 0;
+    sioc->zero_copy_sent = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -153,6 +171,16 @@  int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
         return -1;
     }
 
+#ifdef CONFIG_LINUX
+    int ret, v = 1;
+    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#endif
+
     return 0;
 }
 
@@ -533,6 +561,7 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
     size_t fdsize = sizeof(int) * nfds;
     struct cmsghdr *cmsg;
+    int sflags = 0;
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -557,15 +586,27 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
         memcpy(CMSG_DATA(cmsg), fds, fdsize);
     }
 
+    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+        sflags = MSG_ZEROCOPY;
+    }
+
  retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, sflags);
     if (ret <= 0) {
-        if (errno == EAGAIN) {
+        switch (errno) {
+        case EAGAIN:
             return QIO_CHANNEL_ERR_BLOCK;
-        }
-        if (errno == EINTR) {
+        case EINTR:
             goto retry;
+        case ENOBUFS:
+            if (sflags & MSG_ZEROCOPY) {
+                error_setg_errno(errp, errno,
+                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
+                return -1;
+            }
+            break;
         }
+
         error_setg_errno(errp, errno,
                          "Unable to write to socket");
         return -1;
@@ -659,6 +700,74 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+
+#ifdef CONFIG_LINUX
+static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+    char control[CMSG_SPACE(sizeof(*serr))];
+    int received;
+    int ret = 1;
+
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    memset(control, 0, sizeof(control));
+
+    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
+        if (received < 0) {
+            switch (errno) {
+            case EAGAIN:
+                /* Nothing on errqueue, wait until something is available */
+                qio_channel_wait(ioc, G_IO_ERR);
+                continue;
+            case EINTR:
+                continue;
+            default:
+                error_setg_errno(errp, errno,
+                                 "Unable to read errqueue");
+                return -1;
+            }
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            return -1;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            return -1;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zero copy");
+            return -1;
+        }
+
+        /* No errors, count successfully finished sendmsg()*/
+        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
+
+        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
+            ret = 0;
+        }
+    }
+
+    return ret;
+}
+
+#endif /* CONFIG_LINUX */
+
 static int
 qio_channel_socket_set_blocking(QIOChannel *ioc,
                                 bool enabled,
@@ -789,6 +898,9 @@  static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+#ifdef CONFIG_LINUX
+    ioc_klass->io_flush = qio_channel_socket_flush;
+#endif
 }
 
 static const TypeInfo qio_channel_socket_info = {