[v13,3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

Message ID 20220513062836.965425-4-leobras@redhat.com (mailing list archive)
State New, archived
Series MSG_ZEROCOPY + multifd

Commit Message

Leonardo Bras May 13, 2022, 6:28 a.m. UTC
For CONFIG_LINUX, implement the new zero copy flag and the optional callback
io_flush on QIOChannelSocket, but enable them only when the MSG_ZEROCOPY
feature is available in the host kernel, which is checked in
qio_channel_socket_connect_sync().

qio_channel_socket_flush() is implemented by counting how many times
sendmsg(..., MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue to find how many of those sends have finished.
Flush loops until the two counters match, or until an error occurs.

Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happens, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush() before freeing
or re-using the buffer.

2: Locked memory
- When using MSG_ZEROCOPY, the buffer memory will be locked after being queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may require
a larger amount of locked memory than is usually available to non-root users.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration.
- Because of this, when user code wants to add zero copy as a feature, it
needs a mechanism to disable it, so the feature remains accessible to less
privileged users.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/io/channel-socket.h |   2 +
 io/channel-socket.c         | 116 ++++++++++++++++++++++++++++++++++--
 2 files changed, 114 insertions(+), 4 deletions(-)
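
For context, the kernel-level protocol the patch builds on looks roughly
like this (a minimal sketch following the kernel's msg_zerocopy
documentation, error handling trimmed; not QEMU code):

#include <sys/socket.h>
#include <linux/errqueue.h>

/* Sketch: opt in, send without copying, then reap one completion. */
static int zerocopy_roundtrip(int fd, const void *buf, size_t len)
{
    int one = 1;

    /* Opt in once per socket; fails on kernels without SO_ZEROCOPY. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        return -1;
    }

    /* The pages of 'buf' stay pinned until the kernel acknowledges. */
    if (send(fd, buf, len, MSG_ZEROCOPY) < 0) {
        return -1;
    }

    /* Completions arrive on the error queue; recvmsg(MSG_ERRQUEUE)
     * returns EAGAIN while none is ready, hence the loop in io_flush. */
    struct msghdr msg = {0};
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return -1;
    }

    struct sock_extended_err *serr =
        (struct sock_extended_err *)CMSG_DATA(CMSG_FIRSTHDR(&msg));
    /* ee_info..ee_data is the inclusive range of completed sends;
     * SO_EE_CODE_ZEROCOPY_COPIED means the kernel copied after all. */
    return serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED ? 1 : 0;
}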

Comments

Chuang Xu June 1, 2022, 9:37 a.m. UTC | #1
On 2022/5/13 2:28 p.m., Leonardo Bras wrote:
> [...]
>
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..513c428fe4 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>       socklen_t localAddrLen;
>       struct sockaddr_storage remoteAddr;
>       socklen_t remoteAddrLen;
> +    ssize_t zero_copy_queued;
> +    ssize_t zero_copy_sent;
>   };
>   
Hi, Leonardo. I'm also paying attention to the use of MSG_ZEROCOPY in
live migration recently. I noticed that you defined a member
`zero_copy_queued` in struct QIOChannelSocket, but I can't find where
the value of this member is ever changed in your patch. Could you
clarify this for me?
Peter Xu June 1, 2022, 1:58 p.m. UTC | #2
On Wed, Jun 01, 2022 at 05:37:10PM +0800, 徐闯 wrote:
> [...]
>
> Hi, Leonardo. I'm also paying attention to the use of MSG_ZEROCOPY in
> live migration recently. I noticed that you defined a member
> `zero_copy_queued` in struct QIOChannelSocket, but I can't find where
> the value of this member is ever changed in your patch. Could you
> clarify this for me?
> 

Good point.. it should probably be incremented when queuing the pages. We'd
better fix it up, or flush() will literally be a no-op (zero_copy_queued
stays at zero, so its loop never runs)..

Two things in qio_channel_socket_flush() we can do to make sure it'll work
as expected, imo:

  1) make ret=-1 as initial value, rather than 1 - we only check negative
     errors in the caller so we could have missed a positive "1"

  2) add a tracepoint into the loop of updating zero_copy_sent

Leo, what's your take?

Thanks,
Leonardo Bras June 8, 2022, 5:24 a.m. UTC | #3
Hello 徐闯,

Thanks for reviewing!

On Wed, Jun 1, 2022 at 6:37 AM 徐闯 <xuchuangxclwt@bytedance.com> wrote:
[...]
> Hi, Leonardo. I'm also paying attention to the use of MSG_ZEROCOPY in
> live migration recently. I noticed that you defined a member
> `zero_copy_queued` in struct QIOChannelSocket, but I can't find where
> the value of this member is ever changed in your patch. Could you
> clarify this for me?
>

You are right.
This was being correctly implemented until v6, and then the increment
just vanished.

Since v6 there were a lot of changes both in the patch and in the
base repository, so I think I dropped it somewhere along one of those
iterations.

I will send a fix shortly.
Is it ok if I include a "Reported-by: 徐闯
<xuchuangxclwt@bytedance.com>" in the patch?

Best regards,
Leo
Leonardo Bras Soares Passos June 8, 2022, 5:37 a.m. UTC | #4
Hello Peter,

On Wed, Jun 1, 2022 at 10:58 AM Peter Xu <peterx@redhat.com> wrote:
>
[...]
> > [...]
>
> Good point.. it should probably be incremented when queuing the pages. We'd
> better fix it up, or flush() will literally be a no-op (zero_copy_queued
> stays at zero, so its loop never runs)..

That's correct.
I am working on a fix right now.
The idea is to increment it in qio_channel_socket_writev() if sendmsg succeeds.
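
A minimal sketch of that idea, reusing the names from the patch below
(illustrative only, not the actual follow-up fix):

 retry:
    ret = sendmsg(sioc->fd, &msg, sflags);
    if (ret <= 0) {
        /* ... existing EAGAIN/EINTR/ENOBUFS handling ... */
    }

#ifdef QEMU_MSG_ZEROCOPY
    /* The missing accounting: each successful zero-copy send must bump
     * the counter that qio_channel_socket_flush() waits on. */
    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
        sioc->zero_copy_queued++;
    }
#endif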

>
> Two things in qio_channel_socket_flush() we can do to make sure it'll work
> as expected, imo:
>
>   1) make ret=-1 as initial value, rather than 1 - we only check negative
>      errors in the caller so we could have missed a positive "1"
>
>   2) add a tracepoint into the loop of updating zero_copy_sent
>
> Leo, what's your take?

(1) is not an option, as the interface currently uses ret=1 to report
whether MSG_ZEROCOPY is actually taking effect. I added that so the user
of qio_channel can switch off zero-copy if it's not taking effect, and
save some cpu.

(2) is not a problem, but I fail to see how useful that would be. Is
the idea to manually keep track of flushes happening?

Best regards,
Leo
Chuang Xu June 8, 2022, 6:48 a.m. UTC | #5
On 2022/6/8 1:24 p.m., Leonardo Bras Soares Passos wrote:
> I will send a fix shortly.
> Is it ok if I include a "Reported-by: 徐闯
> <xuchuangxclwt@bytedance.com>" in the patch?

okay.

Best Regards,

chuang xu
Peter Xu June 8, 2022, 11:41 a.m. UTC | #6
On Wed, Jun 08, 2022 at 02:37:28AM -0300, Leonardo Bras Soares Passos wrote:
> (1) is not an option, as the interface currently uses ret=1 to report
> whether MSG_ZEROCOPY is actually taking effect. I added that so the user
> of qio_channel can switch off zero-copy if it's not taking effect, and
> save some cpu.

Yes (1) is not, but could you explain what you mean by making sure
MSG_ZEROCOPY is being used?  Why is it relevant to the retval here?

I just figured it's a bit weird to return >0 here in flush().

> 
> (2) is not a problem, but I fail to see how useful that would be. Is
> the idea to manually keep track of flushes happening?

Yes, if we can check this up it'll be good enough for me.  The trace point
could help in some cases in the future too, to monitor the behavior of kernel
MSG_ERRQUEUE, but if you don't like it then it's okay.
Leonardo Bras Soares Passos June 8, 2022, 6:14 p.m. UTC | #7
On Wed, Jun 8, 2022 at 8:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 08, 2022 at 02:37:28AM -0300, Leonardo Bras Soares Passos wrote:
> > (1) is not an option, as the interface currently uses ret=1 to report
> > whether MSG_ZEROCOPY is actually taking effect. I added that so the user
> > of qio_channel can switch off zero-copy if it's not taking effect, and
> > save some cpu.
>
> Yes (1) is not, but could you explain what you mean by making sure
> MSG_ZEROCOPY is being used?  Why is it relevant to the retval here?

If sendmsg() is called with MSG_ZEROCOPY, and everything is configured
correctly, the kernel will attempt to send the buffer using zero-copy.

Even with the right configuration on a recent enough kernel, there are
factors that can prevent zero-copy from happening, and the kernel will
fall back to the copying mechanism.
One example is the net device not supporting the 'Scatter-Gather'
feature (NETIF_F_SG).

When this happens, there is an overhead for 'trying zero-copy first',
instead of just opting for the copying mechanism.

In a previous iteration of the patchset, it was made clear that it's
desirable to detect when the kernel falls back to copying mechanism,
so the user of 'QIOChannelSocket' can switch to copying and avoid the
overhead. This was done by the return value of flush(), which is 1 if
that occurs.
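
For illustration, a hypothetical caller using that convention could look
like this (write_flags and the surrounding logic are made-up names):

    /* After a batch of writes with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY: */
    int ret = qio_channel_flush(ioc, errp);
    if (ret < 0) {
        return -1;              /* hard error, abort the operation */
    }
    if (ret == 1) {
        /* Every send fell back to copying: stop paying the zero-copy
         * overhead and write without the flag from now on. */
        write_flags &= ~QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
    }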

>
> I just figured it's a bit weird to return >0 here in flush().
>
> >
> > (2) is not a problem, but I fail to see how useful that would be. Is
> > the idea to manually keep track of flushes happening?
>
> Yes, if we can check this up it'll be good enough for me.  The trace point
> could help in some cases in the future too, to monitor the behavior of kernel
> MSG_ERRQUEUE, but if you don't like it then it's okay.
>

TBH I am not sure how those traces work yet, and I am afraid they could
introduce some overhead in flush.
In any case, we can introduce this trace in a separate patch, since
fixing zero-copy flush seems more urgent right now.

Best regards,
Leo

> --
> Peter Xu
>
Peter Xu June 8, 2022, 8:23 p.m. UTC | #8
On Wed, Jun 08, 2022 at 03:14:36PM -0300, Leonardo Bras Soares Passos wrote:
> > [...]
> >
> > Yes (1) is not, but could you explain what you mean by making sure
> > MSG_ZEROCOPY is being used?  Why is it relevant to the retval here?
> 
> If sendmsg() is called with MSG_ZEROCOPY, and everything is configured
> correctly, the kernel will attempt to send the buffer using zero-copy.
> 
> Even with the right configuration on a recent enough kernel, there are
> factors that can prevent zero-copy from happening, and the kernel will
> fall back to the copying mechanism.
> One example is the net device not supporting the 'Scatter-Gather'
> feature (NETIF_F_SG).
> 
> When this happens, there is an overhead for 'trying zero-copy first',
> instead of just opting for the copying mechanism.
> 
> In a previous iteration of the patchset, it was made clear that it's
> desirable to detect when the kernel falls back to copying mechanism,
> so the user of 'QIOChannelSocket' can switch to copying and avoid the
> overhead. This was done by the return value of flush(), which is 1 if
> that occurs.

Two questions..

  1) When that happens, will MSG_ERRQUEUE keep working just as if zerocopy
     were functional?

     If the answer is yes, I don't see how ret=1 will ever be
     returned.. because we'll also go into the same loop in
     qio_channel_socket_flush() anyway.

     If the answer is no, then since we'll have non-zero zero_copy_queued,
     will the loop in qio_channel_socket_flush() go into a dead one?  How
     could it return?

  2) Even if we have the correct ret=1 returned when that happens, which
     caller is detecting that ret==1 and warn the admin?

Thanks,
Leonardo Bras Soares Passos June 13, 2022, 8:58 p.m. UTC | #9
Hello Peter,

On Wed, Jun 8, 2022 at 5:23 PM Peter Xu <peterx@redhat.com> wrote:
[...]
> > In a previous iteration of the patchset, it was made clear that it's
> > desirable to detect when the kernel falls back to copying mechanism,
> > so the user of 'QIOChannelSocket' can switch to copying and avoid the
> > overhead. This was done by the return value of flush(), which is 1 if
> > that occurs.
>
> Two questions..
>
>   1) When that happens, will MSG_ERRQUEUE keep working just as if zerocopy
>      were functional?

I am not sure exactly what you meant by 'like zerocopy is functional',
but the idea is that reading from MSG_ERRQUEUE should return a msg for
each sendmsg syscall with MSG_ZEROCOPY that previously happened. This
does not depend on the outcome (like falling back to the copying
mechanism).
btw, most of those messages may be batched to reduce overhead.

At some point, zero-copy may fail and fall back to copying, so in those
messages the error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having
only such messages in a flush will make the flush function return 1.
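
Concretely, each errqueue notification carries the two fields the flush
loop relies on (a sketch; the real code is in the patch at the end of
the page):

    struct sock_extended_err *serr = (void *) CMSG_DATA(cm);

    /* One notification can acknowledge a batch: ee_info..ee_data is an
     * inclusive range of sendmsg() calls, hence the "+ 1" in flush(). */
    uint32_t acked = serr->ee_data - serr->ee_info + 1;

    /* Falling back to copy is reported per notification via ee_code,
     * not as an error. */
    bool copied = (serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED);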

>
>      If the answer is yes, I don't see how ret=1 will ever be
>      returned.. because we'll also go into the same loop in
>      qio_channel_socket_flush() anyway.


We set ret to 1 at function entry, and then for each message in the
MSG_ERRQUEUE we test whether it has an error code different from
SO_EE_CODE_ZEROCOPY_COPIED. If it ever has a different code, we set ret=0.

So, in our previous example, if we have a net device not supporting the
'Scatter-Gather' feature (NETIF_F_SG), every error message will be
SO_EE_CODE_ZEROCOPY_COPIED, and it will return 1.


>
>      If the answer is no, then since we'll have non-zero zero_copy_queued,
>      will the loop in qio_channel_socket_flush() go into a dead one?  How
>      could it return?

No, because it will go through all packets sent with MSG_ZEROCOPY,
including the ones that fell back to copying, so the counter should be
fine. If any code disables zero-copy, it will stop both sending stuff
with MSG_ZEROCOPY and flushing, so it should be fine.

>
>   2) Even if we have the correct ret=1 returned when that happens, which
>      caller is detecting that ret==1 and warn the admin?
>

No caller is using that right now.
It's supposed to be a QIOChannel interface feature, and any
user/implementation could use that information to warn if zero-copy is
not being used, fall back to copying directly (to avoid the overhead of
testing zero-copy), or even cancel the sending if wanted.

It was a suggestion of Daniel on top of [PATCH v5 1/6] IIRC.

Best regards,
Leo
Peter Xu June 13, 2022, 10:53 p.m. UTC | #10
On Mon, Jun 13, 2022 at 05:58:44PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter,
> [...]
>
> At some point, zero-copy may fail and fall back to copying, so in those
> messages the error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having
> only such messages in a flush will make the flush function return 1.

Ah I think I missed the "reset ret==0 when !SO_EE_CODE_ZEROCOPY_COPIED"
path..  Sorry.

> [...]
>
> No caller is using that right now.
> It's supposed to be a QIOChannel interface feature, and any
> user/implementation could use that information to warn if zero-copy is
> not being used, fall back to copying directly (to avoid the overhead of
> testing zero-copy), or even cancel the sending if wanted.
> 
> It was a suggestion of Daniel on top of [PATCH v5 1/6] IIRC.

OK the detection makes sense, thanks for the details.

Now I'm wondering whether we should have warned the admin already if
zero-copy send is not fully enabled in live migration.  Should we add an
error_report_once() somewhere for the ret==1 case already?  After all, the
user specified zero_copy_send=true explicitly.  Did I miss something again?

Thanks,
Leonardo Bras Soares Passos June 14, 2022, 3:14 a.m. UTC | #11
On Mon, Jun 13, 2022 at 7:53 PM Peter Xu <peterx@redhat.com> wrote:
> [...]
>
> OK the detection makes sense, thanks for the details.
>
> Now I'm wondering whether we should have warned the admin already if
> zero-copy send is not fully enabled in live migration.  Should we add an
> error_report_once() somewhere for the ret==1 case already?  After all, the
> user specified zero_copy_send=true explicitly.  Did I miss something again?
>

You are correct, I think warning the user is the right thing to do here.
At the end of the first iteration, where the first flush happens, I think
it's too late to fail the migration, since a huge part of the data has
already been sent.

Best regards,
Leo
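
A sketch of what such a warning could look like in the migration flush
path (the p->c channel name and the exact placement are assumptions, not
an actual patch):

    ret = qio_channel_flush(p->c, &err);
    if (ret < 0) {
        error_report_err(err);
        return -1;
    }
    if (ret == 1) {
        /* Too late to fail the migration cheaply; warn once instead. */
        error_report_once("zero-copy send enabled, but the kernel fell "
                          "back to copying (e.g. no NETIF_F_SG on the NIC)");
    }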
Chuang Xu June 14, 2022, 1:09 p.m. UTC | #12
On 2022/5/13 2:28 p.m., Leonardo Bras wrote:
> @@ -557,15 +578,31 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>           memcpy(CMSG_DATA(cmsg), fds, fdsize);
>       }
>   
> +#ifdef QEMU_MSG_ZEROCOPY
> +    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> +        sflags = MSG_ZEROCOPY;
> +    }
> +#endif
> +
>    retry:
> -    ret = sendmsg(sioc->fd, &msg, 0);
> +    ret = sendmsg(sioc->fd, &msg, sflags);
>       if (ret <= 0) {
> -        if (errno == EAGAIN) {
> +        switch (errno) {
> +        case EAGAIN:
>               return QIO_CHANNEL_ERR_BLOCK;
> -        }
> -        if (errno == EINTR) {
> +        case EINTR:
>               goto retry;
> +#ifdef QEMU_MSG_ZEROCOPY
> +        case ENOBUFS:
> +            if (sflags & MSG_ZEROCOPY) {
> +                error_setg_errno(errp, errno,
> +                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> +                return -1;
> +            }
> +            break;
> +#endif
>           }
> +
>           error_setg_errno(errp, errno,
>                            "Unable to write to socket");
>           return -1;

Hi, Leo.

There are some other questions I would like to discuss with you.

I tested the multifd zero_copy migration and found that sometimes, even
if the max locked memory of qemu was set to 16GB (much greater than
`MULTIFD_PACKET_SIZE`), the error "Process can't lock enough memory for
using MSG_ZEROCOPY" would still be reported.

I noticed that the
doc (https://www.kernel.org/doc/html/v5.12/networking/msg_zerocopy.html)
says "A zerocopy failure will return -1 with errno ENOBUFS. This happens
if the socket option was not set, the socket exceeds its optmem limit,
or the user exceeds its ulimit on locked pages."

I also found that the RFC (https://lwn.net/Articles/715279/) says "The
change to allocate notification skbuffs from optmem requires ensuring
that net.core.optmem is at least a few 100KB."

On my host, optmem was initially set to 20KB. I tried to change it to
100KB (echo 102400 > /proc/sys/net/core/optmem_max) as the RFC says.
Then I tested the multifd zero_copy migration repeatedly, and the error
disappeared.

So when sendmsg returns -1 with errno ENOBUFS, should we distinguish
between the error "socket exceeds optmem limit" and the error "user
exceeds ulimit on locked pages"? Or is there any better way to avoid
this problem?
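
One conceivable mitigation (hypothetical, not part of this patch) is a
pre-flight check of net.core.optmem_max, since ENOBUFS alone cannot tell
the two limits apart:

#include <stdio.h>

/* Hypothetical helper: returns net.core.optmem_max, or -1 on failure. */
static long read_optmem_max(void)
{
    long val = -1;
    FILE *f = fopen("/proc/sys/net/core/optmem_max", "r");

    if (f) {
        if (fscanf(f, "%ld", &val) != 1) {
            val = -1;
        }
        fclose(f);
    }
    return val;
}

/* e.g. warn before enabling zero-copy if it's below the RFC's
 * "few 100KB": if (read_optmem_max() < 100 * 1024) { ... } */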

Best Regards,

chuang xu
Dr. David Alan Gilbert June 14, 2022, 2:14 p.m. UTC | #13
* chuang xu (xuchuangxclwt@bytedance.com) wrote:
> [...]
>
> Hi, Leo.
> 
> There are some other questions I would like to discuss with you.
> 
> I tested the multifd zero_copy migration and found that sometimes, even
> if the max locked memory of qemu was set to 16GB (much greater than
> `MULTIFD_PACKET_SIZE`), the error "Process can't lock enough memory for
> using MSG_ZEROCOPY" would still be reported.
> 
> I noticed that the
> doc (https://www.kernel.org/doc/html/v5.12/networking/msg_zerocopy.html)
> says "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> if the socket option was not set, the socket exceeds its optmem limit,
> or the user exceeds its ulimit on locked pages."
> 
> I also found that the RFC (https://lwn.net/Articles/715279/) says "The
> change to allocate notification skbuffs from optmem requires ensuring
> that net.core.optmem is at least a few 100KB."

Interesting.

> On my host, optmem was initially set to 20KB. I tried to change it to
> 100KB (echo 102400 > /proc/sys/net/core/optmem_max) as the RFC says.
> Then I tested the multifd zero_copy migration repeatedly, and the error
> disappeared.
> 
> So when sendmsg returns -1 with errno ENOBUFS, should we distinguish
> between the error "socket exceeds optmem limit" and the error "user
> exceeds ulimit on locked pages"? Or is there any better way to avoid
> this problem?

I don't think we can tell which one of them triggered the error; so the
only thing I can suggest is that we document the need for optmem_max
setting; I wonder how we get a better answer than 'a few 100KB'?
I guess it's something like the number of packets inflight *
sizeof(cmsghdr) ?
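Taking that guess at face value, a rough worked example (all numbers
assumed purely for illustration): 16 multifd channels x 256 in-flight
sends x 16 bytes (sizeof(struct cmsghdr)) = 64KB, which lands in the
RFC's 'few 100KB' ballpark once per-notification skb overhead is added.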

Dave

> Best Regards,
> 
> chuang xu
Chuang Xu June 15, 2022, 2:44 p.m. UTC | #14
On 2022/6/14 10:14 p.m., Dr. David Alan Gilbert wrote:
> I don't think we can tell which one of them triggered the error; so the
> only thing I can suggest is that we document the need for optmem_max
> setting; I wonder how we get a better answer than 'a few 100KB'?
> I guess it's something like the number of packets inflight *
> sizeof(cmsghdr) ?
>
> Dave

Three cases with errno ENOBUFS are described in the official 
doc(https://www.kernel.org/doc/html/v5.12/networking/msg_zerocopy.html):

1.The socket option was not set

2.The socket exceeds its optmem limit

3.The user exceeds its ulimit on locked pages

For case 1, if the code logic is correct, this possibility can be ignored.

For case 2, I asked a kernel developer about the reason for "a few
100KB". He said that the recommended value is meant to improve the
performance of zero_copy send. If the NIC sends data slower than the
data generation speed, even if optmem is set to 100KB, there is a
probability that sendmsg returns with errno ENOBUFS.

For case 3, if I do not set max locked memory for qemu, the max locked
memory will be unlimited. I set the max locked memory for qemu and found
that once the memory usage exceeds the max locked memory, OOM will
occur. Does this mean that sendmsg cannot return with errno ENOBUFS at
all when the user exceeds the ulimit on locked pages?

If the above is true, can we treat the errno as case 2?

I modified the code logic to call sendmsg again when the errno is
ENOBUFS, and set optmem back to the initial 20KB (echo 20480 >
/proc/sys/net/core/optmem_max); now the multifd zero_copy migration goes
well.

Here are the changes I made to the code:


Signed-off-by: chuang xu <xuchuangxclwt@bytedance.com>
---
  io/channel-socket.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index dc9c165de1..9267f55a1d 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -595,9 +595,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 #ifdef QEMU_MSG_ZEROCOPY
         case ENOBUFS:
             if (sflags & MSG_ZEROCOPY) {
-                error_setg_errno(errp, errno,
-                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
-                return -1;
+                goto retry;
             }
             break;
 #endif

Patch

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@  struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    ssize_t zero_copy_queued;
+    ssize_t zero_copy_sent;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 05c425abb8..dc9c165de1 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -25,6 +25,14 @@ 
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include <linux/errqueue.h>
+#include <sys/socket.h>
+
+#if (defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY))
+#define QEMU_MSG_ZEROCOPY
+#endif
+#endif
 
 #define SOCKET_MAX_FDS 16
 
@@ -54,6 +62,8 @@  qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zero_copy_queued = 0;
+    sioc->zero_copy_sent = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -153,6 +163,16 @@  int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
         return -1;
     }
 
+#ifdef QEMU_MSG_ZEROCOPY
+    int ret, v = 1;
+    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#endif
+
     return 0;
 }
 
@@ -533,6 +553,7 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
     size_t fdsize = sizeof(int) * nfds;
     struct cmsghdr *cmsg;
+    int sflags = 0;
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -557,15 +578,31 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
         memcpy(CMSG_DATA(cmsg), fds, fdsize);
     }
 
+#ifdef QEMU_MSG_ZEROCOPY
+    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+        sflags = MSG_ZEROCOPY;
+    }
+#endif
+
  retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, sflags);
     if (ret <= 0) {
-        if (errno == EAGAIN) {
+        switch (errno) {
+        case EAGAIN:
             return QIO_CHANNEL_ERR_BLOCK;
-        }
-        if (errno == EINTR) {
+        case EINTR:
             goto retry;
+#ifdef QEMU_MSG_ZEROCOPY
+        case ENOBUFS:
+            if (sflags & MSG_ZEROCOPY) {
+                error_setg_errno(errp, errno,
+                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
+                return -1;
+            }
+            break;
+#endif
         }
+
         error_setg_errno(errp, errno,
                          "Unable to write to socket");
         return -1;
@@ -659,6 +696,74 @@  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+
+#ifdef QEMU_MSG_ZEROCOPY
+static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+    char control[CMSG_SPACE(sizeof(*serr))];
+    int received;
+    int ret = 1;
+
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    memset(control, 0, sizeof(control));
+
+    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
+        if (received < 0) {
+            switch (errno) {
+            case EAGAIN:
+                /* Nothing on errqueue, wait until something is available */
+                qio_channel_wait(ioc, G_IO_ERR);
+                continue;
+            case EINTR:
+                continue;
+            default:
+                error_setg_errno(errp, errno,
+                                 "Unable to read errqueue");
+                return -1;
+            }
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            return -1;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            return -1;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zero copy");
+            return -1;
+        }
+
+        /* No errors, count successfully finished sendmsg()*/
+        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
+
+        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
+            ret = 0;
+        }
+    }
+
+    return ret;
+}
+
+#endif /* QEMU_MSG_ZEROCOPY */
+
 static int
 qio_channel_socket_set_blocking(QIOChannel *ioc,
                                 bool enabled,
@@ -789,6 +894,9 @@  static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+#ifdef QEMU_MSG_ZEROCOPY
+    ioc_klass->io_flush = qio_channel_socket_flush;
+#endif
 }
 
 static const TypeInfo qio_channel_socket_info = {