mbox

[PULL,00/16] migration queue

Message ID 20220510083355.92738-1-dgilbert@redhat.com (mailing list archive)
State New, archived
Headers show

Pull-request

https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220510a

Message

Dr. David Alan Gilbert May 10, 2022, 8:33 a.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The following changes since commit 178bacb66d98d9ee7a702b9f2a4dfcd88b72a9ab:

  Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-05-09 11:07:04 -0700)

are available in the Git repository at:

  https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220510a

for you to fetch changes up to 511f4a0506af1d380115a61f3362149953646871:

  multifd: Implement zero copy write in multifd migration (multifd-zero-copy) (2022-05-10 09:15:06 +0100)

----------------------------------------------------------------
Migration pull 2022-05-10

This replaces yesterdays and the pull originally sent on 28th April;
compared to yesterdays this fixes an accidental change to skiboot.

It contains:
  TLS test fixes from Dan
  Zerocopy migration feature from Leo

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

----------------------------------------------------------------
Daniel P. Berrangé (9):
      tests: fix encoding of IP addresses in x509 certs
      tests: add more helper macros for creating TLS x509 certs
      tests: add migration tests of TLS with PSK credentials
      tests: add migration tests of TLS with x509 credentials
      tests: convert XBZRLE migration test to use common helper
      tests: convert multifd migration tests to use common helper
      tests: add multifd migration tests of TLS with PSK credentials
      tests: add multifd migration tests of TLS with x509 credentials
      tests: ensure migration status isn't reported as failed

Leonardo Bras (7):
      QIOChannel: Add flags on io_writev and introduce io_flush callback
      QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
      migration: Add zero-copy-send parameter for QMP/HMP for Linux
      migration: Add migrate_use_tls() helper
      multifd: multifd_send_sync_main now returns negative on error
      multifd: Send header packet without flags if zero-copy-send is enabled
      multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

 chardev/char-io.c                    |   2 +-
 hw/remote/mpqemu-link.c              |   2 +-
 include/io/channel-socket.h          |   2 +
 include/io/channel.h                 |  38 +-
 io/channel-buffer.c                  |   1 +
 io/channel-command.c                 |   1 +
 io/channel-file.c                    |   1 +
 io/channel-socket.c                  | 118 ++++-
 io/channel-tls.c                     |   1 +
 io/channel-websock.c                 |   1 +
 io/channel.c                         |  49 +-
 meson.build                          |   1 +
 migration/channel.c                  |   3 +-
 migration/migration.c                |  52 ++-
 migration/migration.h                |   6 +
 migration/multifd.c                  |  74 ++-
 migration/multifd.h                  |   4 +-
 migration/ram.c                      |  29 +-
 migration/rdma.c                     |   1 +
 migration/socket.c                   |  12 +-
 monitor/hmp-cmds.c                   |   6 +
 qapi/migration.json                  |  24 +
 scsi/pr-manager-helper.c             |   2 +-
 tests/qtest/meson.build              |  12 +-
 tests/qtest/migration-helpers.c      |  13 +
 tests/qtest/migration-helpers.h      |   1 +
 tests/qtest/migration-test.c         | 867 +++++++++++++++++++++++++++++++----
 tests/unit/crypto-tls-psk-helpers.c  |  18 +-
 tests/unit/crypto-tls-psk-helpers.h  |   1 +
 tests/unit/crypto-tls-x509-helpers.c |  16 +-
 tests/unit/crypto-tls-x509-helpers.h |  53 +++
 tests/unit/test-crypto-tlssession.c  |  11 +-
 tests/unit/test-io-channel-socket.c  |   1 +
 33 files changed, 1284 insertions(+), 139 deletions(-)

Comments

Dr. David Alan Gilbert May 10, 2022, 9:58 a.m. UTC | #1
* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> The following changes since commit 178bacb66d98d9ee7a702b9f2a4dfcd88b72a9ab:
> 
>   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-05-09 11:07:04 -0700)
> 
> are available in the Git repository at:
> 
>   https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220510a
> 
> for you to fetch changes up to 511f4a0506af1d380115a61f3362149953646871:
> 
>   multifd: Implement zero copy write in multifd migration (multifd-zero-copy) (2022-05-10 09:15:06 +0100)

Nack
This is still failing the Alpine build test:

ninja: job failed: cc -m64 -mcx16 -Ilibio.fa.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/p11-kit-1 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/dagrh/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/dagrh/qemu -iquote /builds/dagrh/qemu/include -iquote /builds/dagrh/qemu/disas/libvixl -iquote /builds/dagrh/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -MD -MQ libio.fa.p/io_channel-socket.c.o -MF libio.fa.p/io_channel-socket.c.o.d -o libio.fa.p/io_channel-socket.c.o -c ../io/channel-socket.c
In file included from /usr/include/linux/errqueue.h:6,
                 from ../io/channel-socket.c:29:
/usr/include/linux/time_types.h:7:8: error: redefinition of 'struct __kernel_timespec'
    7 | struct __kernel_timespec {
      |        ^~~~~~~~~~~~~~~~~
In file included from /usr/include/liburing.h:19,
                 from /builds/dagrh/qemu/include/block/aio.h:18,
                 from /builds/dagrh/qemu/include/io/channel.h:26,
                 from /builds/dagrh/qemu/include/io/channel-socket.h:24,
                 from ../io/channel-socket.c:24:
/usr/include/liburing/compat.h:9:8: note: originally defined here
    9 | struct __kernel_timespec {
      |        ^~~~~~~~~~~~~~~~~
ninja: subcommand failed
make: *** [Makefile:163: run-ninja] Error 1

> ----------------------------------------------------------------
> Migration pull 2022-05-10
> 
> This replaces yesterdays and the pull originally sent on 28th April;
> compared to yesterdays this fixes an accidental change to skiboot.
> 
> It contains:
>   TLS test fixes from Dan
>   Zerocopy migration feature from Leo
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> ----------------------------------------------------------------
> Daniel P. Berrangé (9):
>       tests: fix encoding of IP addresses in x509 certs
>       tests: add more helper macros for creating TLS x509 certs
>       tests: add migration tests of TLS with PSK credentials
>       tests: add migration tests of TLS with x509 credentials
>       tests: convert XBZRLE migration test to use common helper
>       tests: convert multifd migration tests to use common helper
>       tests: add multifd migration tests of TLS with PSK credentials
>       tests: add multifd migration tests of TLS with x509 credentials
>       tests: ensure migration status isn't reported as failed
> 
> Leonardo Bras (7):
>       QIOChannel: Add flags on io_writev and introduce io_flush callback
>       QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
>       migration: Add zero-copy-send parameter for QMP/HMP for Linux
>       migration: Add migrate_use_tls() helper
>       multifd: multifd_send_sync_main now returns negative on error
>       multifd: Send header packet without flags if zero-copy-send is enabled
>       multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
> 
>  chardev/char-io.c                    |   2 +-
>  hw/remote/mpqemu-link.c              |   2 +-
>  include/io/channel-socket.h          |   2 +
>  include/io/channel.h                 |  38 +-
>  io/channel-buffer.c                  |   1 +
>  io/channel-command.c                 |   1 +
>  io/channel-file.c                    |   1 +
>  io/channel-socket.c                  | 118 ++++-
>  io/channel-tls.c                     |   1 +
>  io/channel-websock.c                 |   1 +
>  io/channel.c                         |  49 +-
>  meson.build                          |   1 +
>  migration/channel.c                  |   3 +-
>  migration/migration.c                |  52 ++-
>  migration/migration.h                |   6 +
>  migration/multifd.c                  |  74 ++-
>  migration/multifd.h                  |   4 +-
>  migration/ram.c                      |  29 +-
>  migration/rdma.c                     |   1 +
>  migration/socket.c                   |  12 +-
>  monitor/hmp-cmds.c                   |   6 +
>  qapi/migration.json                  |  24 +
>  scsi/pr-manager-helper.c             |   2 +-
>  tests/qtest/meson.build              |  12 +-
>  tests/qtest/migration-helpers.c      |  13 +
>  tests/qtest/migration-helpers.h      |   1 +
>  tests/qtest/migration-test.c         | 867 +++++++++++++++++++++++++++++++----
>  tests/unit/crypto-tls-psk-helpers.c  |  18 +-
>  tests/unit/crypto-tls-psk-helpers.h  |   1 +
>  tests/unit/crypto-tls-x509-helpers.c |  16 +-
>  tests/unit/crypto-tls-x509-helpers.h |  53 +++
>  tests/unit/test-crypto-tlssession.c  |  11 +-
>  tests/unit/test-io-channel-socket.c  |   1 +
>  33 files changed, 1284 insertions(+), 139 deletions(-)
> 
>
Daniel P. Berrangé May 10, 2022, 10:19 a.m. UTC | #2
On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > The following changes since commit 178bacb66d98d9ee7a702b9f2a4dfcd88b72a9ab:
> > 
> >   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-05-09 11:07:04 -0700)
> > 
> > are available in the Git repository at:
> > 
> >   https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220510a
> > 
> > for you to fetch changes up to 511f4a0506af1d380115a61f3362149953646871:
> > 
> >   multifd: Implement zero copy write in multifd migration (multifd-zero-copy) (2022-05-10 09:15:06 +0100)
> 
> Nack
> This is still failing the Alpine build test:
> 
> ninja: job failed: cc -m64 -mcx16 -Ilibio.fa.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/p11-kit-1 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/dagrh/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/dagrh/qemu -iquote /builds/dagrh/qemu/include -iquote /builds/dagrh/qemu/disas/libvixl -iquote /builds/dagrh/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -MD -MQ libio.fa.p/io_channel-socket.c.o -MF libio.fa.p/io_channel-socket.c.o.d -o libio.fa.p/io_channel-socket.c.o -c ../io/channel-socket.c
> In file included from /usr/include/linux/errqueue.h:6,
>                  from ../io/channel-socket.c:29:
> /usr/include/linux/time_types.h:7:8: error: redefinition of 'struct __kernel_timespec'
>     7 | struct __kernel_timespec {
>       |        ^~~~~~~~~~~~~~~~~
> In file included from /usr/include/liburing.h:19,
>                  from /builds/dagrh/qemu/include/block/aio.h:18,
>                  from /builds/dagrh/qemu/include/io/channel.h:26,
>                  from /builds/dagrh/qemu/include/io/channel-socket.h:24,
>                  from ../io/channel-socket.c:24:
> /usr/include/liburing/compat.h:9:8: note: originally defined here
>     9 | struct __kernel_timespec {
>       |        ^~~~~~~~~~~~~~~~~
> ninja: subcommand failed
> make: *** [Makefile:163: run-ninja] Error 1

Yuk. That very much looks like a bug in liburing itself to me.


We've exposed the latent bug by including linux/errqueue.h  

With regards,
Daniel
Dr. David Alan Gilbert May 10, 2022, 10:43 a.m. UTC | #3
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > The following changes since commit 178bacb66d98d9ee7a702b9f2a4dfcd88b72a9ab:
> > > 
> > >   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2022-05-09 11:07:04 -0700)
> > > 
> > > are available in the Git repository at:
> > > 
> > >   https://gitlab.com/dagrh/qemu.git tags/pull-migration-20220510a
> > > 
> > > for you to fetch changes up to 511f4a0506af1d380115a61f3362149953646871:
> > > 
> > >   multifd: Implement zero copy write in multifd migration (multifd-zero-copy) (2022-05-10 09:15:06 +0100)
> > 
> > Nack
> > This is still failing the Alpine build test:
> > 
> > ninja: job failed: cc -m64 -mcx16 -Ilibio.fa.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/p11-kit-1 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -isystem /builds/dagrh/qemu/linux-headers -isystem linux-headers -iquote . -iquote /builds/dagrh/qemu -iquote /builds/dagrh/qemu/include -iquote /builds/dagrh/qemu/disas/libvixl -iquote /builds/dagrh/qemu/tcg/i386 -pthread -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wno-missing-include-dirs -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -MD -MQ libio.fa.p/io_channel-socket.c.o -MF libio.fa.p/io_channel-socket.c.o.d -o libio.fa.p/io_channel-socket.c.o -c ../io/channel-socket.c
> > In file included from /usr/include/linux/errqueue.h:6,
> >                  from ../io/channel-socket.c:29:
> > /usr/include/linux/time_types.h:7:8: error: redefinition of 'struct __kernel_timespec'
> >     7 | struct __kernel_timespec {
> >       |        ^~~~~~~~~~~~~~~~~
> > In file included from /usr/include/liburing.h:19,
> >                  from /builds/dagrh/qemu/include/block/aio.h:18,
> >                  from /builds/dagrh/qemu/include/io/channel.h:26,
> >                  from /builds/dagrh/qemu/include/io/channel-socket.h:24,
> >                  from ../io/channel-socket.c:24:
> > /usr/include/liburing/compat.h:9:8: note: originally defined here
> >     9 | struct __kernel_timespec {
> >       |        ^~~~~~~~~~~~~~~~~
> > ninja: subcommand failed
> > make: *** [Makefile:163: run-ninja] Error 1
> 
> Yuk. That very much looks like a bug in liburing itself to me.
> 
> 
> We've exposed the latent bug by including linux/errqueue.h  

Yes, I think there was a thread after the 1st pull where Leo identified
the patch that fixed it; but it's not in that image.

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Leonardo Bras May 11, 2022, 3:40 a.m. UTC | #4
From a previous thread:

On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> Leo:
>   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> one I guess is the simpler one; I think Stefanha managed to find the
> liburing fix for the __kernel_timespec case, but that looks like a bit
> more fun!
>
> Dave

I thought Stefanha had fixed this bug, and we were just waiting for a
new alpine rootfs/image with that fixed.
Is that correct?

On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
[...]
> >
> > Yuk. That very much looks like a bug in liburing itself to me.
> >
> >
> > We've exposed the latent bug by including linux/errqueue.h
>
> Yes, I think there was a thread after the 1st pull where Leo identified
> the patch that fixed it; but it's not in that image.

I only fixed the MSG_ZEROCOPY missing define bug, as I got that
Stefanha had already fixed the issue in liburing/alpine.

questions:
- Has Stefanha really fixed that, and we are just waiting for a new
image, or have I got that wrong?
- How should I proceed with that?
- If we proceed with fixing this up in alpine, will that require this
patchset to be on pause until it's fixed there?
- If so, is there any suggestion on how to fix that in qemu code?
(this header is needed because of SO_EE_* defines)

Thank you all!

Best regards,
Leo

>
> Dave
>
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Dr. David Alan Gilbert May 11, 2022, 8:55 a.m. UTC | #5
* Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> From a previous thread:
> 
> On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > Leo:
> >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > one I guess is the simpler one; I think Stefanha managed to find the
> > liburing fix for the __kernel_timespec case, but that looks like a bit
> > more fun!
> >
> > Dave
> 
> I thought Stefanha had fixed this bug, and we were just waiting for a
> new alpine rootfs/image with that fixed.
> Is that correct?
> 
> On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> [...]
> > >
> > > Yuk. That very much looks like a bug in liburing itself to me.
> > >
> > >
> > > We've exposed the latent bug by including linux/errqueue.h
> >
> > Yes, I think there was a thread after the 1st pull where Leo identified
> > the patch that fixed it; but it's not in that image.
> 
> I only fixed the MSG_ZEROCOPY missing define bug, as I got that
> Stefanha had already fixed the issue in liburing/alpine.
> 
> questions:
> - Has Stefanha really fixed that, and we are just waiting for a new
> image, or have I got that wrong?
> - How should I proceed with that?
>
> - If we proceed with fixing this up in alpine, will that require this
> patchset to be on pause until it's fixed there?

It needs to pass in CI; so yes.

> - If so, is there any suggestion on how to fix that in qemu code?
> (this header is needed because of SO_EE_* defines)

I've not actually looked at the detail of the failure; but yes I think
we need a qemu workaround here.

If there's no simple fix, then adding a test to meson.build to
conditionally disable liburing might be best; like the test code for
libcap_ng I guess (search in meson.build for libcap_ng.found  at around
line 540.

Dave

> Thank you all!
> 
> Best regards,
> Leo
> 
> >
> > Dave
> >
> > > With regards,
> > > Daniel
> > > --
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
Leonardo Bras May 13, 2022, 6:28 a.m. UTC | #6
On Wed, May 11, 2022 at 5:55 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > From a previous thread:
> >
> > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > Leo:
> > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > one I guess is the simpler one; I think Stefanha managed to find the
> > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > more fun!
> > >
> > > Dave
> >
> > I thought Stefanha had fixed this bug, and we were just waiting for a
> > new alpine rootfs/image with that fixed.
> > Is that correct?
> >
> > On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> > [...]
> > > >
> > > > Yuk. That very much looks like a bug in liburing itself to me.
> > > >
> > > >
> > > > We've exposed the latent bug by including linux/errqueue.h
> > >
> > > Yes, I think there was a thread after the 1st pull where Leo identified
> > > the patch that fixed it; but it's not in that image.
> >
> > I only fixed the MSG_ZEROCOPY missing define bug, as I got that
> > Stefanha had already fixed the issue in liburing/alpine.
> >
> > questions:
> > - Has Stefanha really fixed that, and we are just waiting for a new
> > image, or have I got that wrong?
> > - How should I proceed with that?
> >
> > - If we proceed with fixing this up in alpine, will that require this
> > patchset to be on pause until it's fixed there?
>
> It needs to pass in CI; so yes.
>
> > - If so, is there any suggestion on how to fix that in qemu code?
> > (this header is needed because of SO_EE_* defines)
>
> I've not actually looked at the detail of the failure; but yes I think
> we need a qemu workaround here.
>
> If there's no simple fix, then adding a test to meson.build to
> conditionally disable liburing might be best; like the test code for
> libcap_ng I guess (search in meson.build for libcap_ng.found  at around
> line 540.

Hello Dave,

I solved this issue by doing this:

+linux_io_uring_test = '''
+  #include <liburing.h>
+  #include <linux/errqueue.h>
+
+  int main(void) { return 0; }'''
+
 linux_io_uring = not_found
 if not get_option('linux_io_uring').auto() or have_block
   linux_io_uring = dependency('liburing', version: '>=0.3',
                               required: get_option('linux_io_uring'),
                               method: 'pkg-config', kwargs: static_kwargs)
+  if not cc.links(linux_io_uring_test)
+    linux_io_uring = not_found
+  endif
 endif
+

Seems to work fine in CI, and now Alpine does not fail anymore.
(See pipeline https://gitlab.com/LeoBras/qemu/-/pipelines/538123933
for reference)

I am not sure if this is the right thing to do, but I will be sending
it as a full new patchset (v13), with the first patch being the one
with the above change and the rest just carrying the recommended
fixes.

I was also thinking I could instead send the single "fix" patch, and
recommend adding it before my v12. If that is the correct approach for
this case, please let me know so I can improve in the future. (I am
trying to figure out what is simpler/best for maintainers)

Best regards,
Leo







>
> Dave
>
> > Thank you all!
> >
> > Best regards,
> > Leo
> >
> > >
> > > Dave
> > >
> > > > With regards,
> > > > Daniel
> > > > --
> > > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > >
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Dr. David Alan Gilbert May 16, 2022, 8:18 a.m. UTC | #7
* Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> On Wed, May 11, 2022 at 5:55 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Leonardo Bras Soares Passos (leobras@redhat.com) wrote:
> > > From a previous thread:
> > >
> > > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > Leo:
> > > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > > one I guess is the simpler one; I think Stefanha managed to find the
> > > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > > more fun!
> > > >
> > > > Dave
> > >
> > > I thought Stefanha had fixed this bug, and we were just waiting for a
> > > new alpine rootfs/image with that fixed.
> > > Is that correct?
> > >
> > > On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> > > [...]
> > > > >
> > > > > Yuk. That very much looks like a bug in liburing itself to me.
> > > > >
> > > > >
> > > > > We've exposed the latent bug by including linux/errqueue.h
> > > >
> > > > Yes, I think there was a thread after the 1st pull where Leo identified
> > > > the patch that fixed it; but it's not in that image.
> > >
> > > I only fixed the MSG_ZEROCOPY missing define bug, as I got that
> > > Stefanha had already fixed the issue in liburing/alpine.
> > >
> > > questions:
> > > - Has Stefanha really fixed that, and we are just waiting for a new
> > > image, or have I got that wrong?
> > > - How should I proceed with that?
> > >
> > > - If we proceed with fixing this up in alpine, will that require this
> > > patchset to be on pause until it's fixed there?
> >
> > It needs to pass in CI; so yes.
> >
> > > - If so, is there any suggestion on how to fix that in qemu code?
> > > (this header is needed because of SO_EE_* defines)
> >
> > I've not actually looked at the detail of the failure; but yes I think
> > we need a qemu workaround here.
> >
> > If there's no simple fix, then adding a test to meson.build to
> > conditionally disable liburing might be best; like the test code for
> > libcap_ng I guess (search in meson.build for libcap_ng.found  at around
> > line 540.
> 
> Hello Dave,
> 
> I solved this issue by doing this:
> 
> +linux_io_uring_test = '''
> +  #include <liburing.h>
> +  #include <linux/errqueue.h>
> +
> +  int main(void) { return 0; }'''
> +
>  linux_io_uring = not_found
>  if not get_option('linux_io_uring').auto() or have_block
>    linux_io_uring = dependency('liburing', version: '>=0.3',
>                                required: get_option('linux_io_uring'),
>                                method: 'pkg-config', kwargs: static_kwargs)
> +  if not cc.links(linux_io_uring_test)
> +    linux_io_uring = not_found
> +  endif
>  endif
> +
> 
> Seems to work fine in CI, and now Alpine does not fail anymore.
> (See pipeline https://gitlab.com/LeoBras/qemu/-/pipelines/538123933
> for reference)
> 
> I am not sure if this is the right thing to do, but I will be sending
> it as a full new patchset (v13), with the first patch being the one
> with the above change and the rest just carrying the recommended
> fixes.

Thanks! That looks promising.  I'll cook a new pull.

> I was also thinking I could instead send the single "fix" patch, and
> recommend adding it before my v12. If that is the correct approach for
> this case, please let me know so I can improve in the future. (I am
> trying to figure out what is simpler/best for maintainers)

Either way would be fine; the full series is probably better.

Dave

> Best regards,
> Leo
> 
> 
> 
> 
> 
> 
> 
> >
> > Dave
> >
> > > Thank you all!
> > >
> > > Best regards,
> > > Leo
> > >
> > > >
> > > > Dave
> > > >
> > > > > With regards,
> > > > > Daniel
> > > > > --
> > > > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > > >
> > > > --
> > > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > > >
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
Stefan Hajnoczi May 16, 2022, 8:51 a.m. UTC | #8
On Wed, May 11, 2022 at 12:40:07AM -0300, Leonardo Bras Soares Passos wrote:
> From a previous thread:
> 
> On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > Leo:
> >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > one I guess is the simpler one; I think Stefanha managed to find the
> > liburing fix for the __kernel_timespec case, but that looks like a bit
> > more fun!
> >
> > Dave
> 
> I thought Stefanha had fixed this bug, and we were just waiting for a
> new alpine rootfs/image with that fixed.
> Is that correct?

I didn't fix the liburing __kernel_timespec issue on alpine Linux. It's
probably a packaging bug and I gave Dave the email address of the
package maintainer. I'm not sure if the package maintainer has been
contacted or released a fixed liburing package.

Stefan
Daniel P. Berrangé May 16, 2022, 9:40 a.m. UTC | #9
On Mon, May 16, 2022 at 09:51:12AM +0100, Stefan Hajnoczi wrote:
> On Wed, May 11, 2022 at 12:40:07AM -0300, Leonardo Bras Soares Passos wrote:
> > From a previous thread:
> > 
> > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > >
> > > Leo:
> > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > one I guess is the simpler one; I think Stefanha managed to find the
> > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > more fun!
> > >
> > > Dave
> > 
> > I thought Stefanha had fixed this bug, and we were just waiting for a
> > new alpine rootfs/image with that fixed.
> > Is that correct?
> 
> I didn't fix the liburing __kernel_timespec issue on alpine Linux. It's
> probably a packaging bug and I gave Dave the email address of the
> package maintainer. I'm not sure if the package maintainer has been
> contacted or released a fixed liburing package.

It was prety easy to bisect the problem with liburing so I filed a
bug pointing to the fix needed:

  https://gitlab.alpinelinux.org/alpine/aports/-/issues/13813

With regards,
Daniel
Daniel P. Berrangé May 16, 2022, 10:09 a.m. UTC | #10
On Mon, May 16, 2022 at 10:40:15AM +0100, Daniel P. Berrangé wrote:
> On Mon, May 16, 2022 at 09:51:12AM +0100, Stefan Hajnoczi wrote:
> > On Wed, May 11, 2022 at 12:40:07AM -0300, Leonardo Bras Soares Passos wrote:
> > > From a previous thread:
> > > 
> > > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > >
> > > > Leo:
> > > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > > one I guess is the simpler one; I think Stefanha managed to find the
> > > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > > more fun!
> > > >
> > > > Dave
> > > 
> > > I thought Stefanha had fixed this bug, and we were just waiting for a
> > > new alpine rootfs/image with that fixed.
> > > Is that correct?
> > 
> > I didn't fix the liburing __kernel_timespec issue on alpine Linux. It's
> > probably a packaging bug and I gave Dave the email address of the
> > package maintainer. I'm not sure if the package maintainer has been
> > contacted or released a fixed liburing package.
> 
> It was prety easy to bisect the problem with liburing so I filed a
> bug pointing to the fix needed:
> 
>   https://gitlab.alpinelinux.org/alpine/aports/-/issues/13813

Alpine win the prize for quickest distro bug fix response. The liburing
problem is now fixed in all current Alpine release streams, just minutes
after I filed the above bug.

With regards,
Daniel
Stefan Hajnoczi May 16, 2022, 3:30 p.m. UTC | #11
On Mon, May 16, 2022 at 11:09:23AM +0100, Daniel P. Berrangé wrote:
> On Mon, May 16, 2022 at 10:40:15AM +0100, Daniel P. Berrangé wrote:
> > On Mon, May 16, 2022 at 09:51:12AM +0100, Stefan Hajnoczi wrote:
> > > On Wed, May 11, 2022 at 12:40:07AM -0300, Leonardo Bras Soares Passos wrote:
> > > > From a previous thread:
> > > > 
> > > > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> > > > <dgilbert@redhat.com> wrote:
> > > > >
> > > > > Leo:
> > > > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > > > one I guess is the simpler one; I think Stefanha managed to find the
> > > > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > > > more fun!
> > > > >
> > > > > Dave
> > > > 
> > > > I thought Stefanha had fixed this bug, and we were just waiting for a
> > > > new alpine rootfs/image with that fixed.
> > > > Is that correct?
> > > 
> > > I didn't fix the liburing __kernel_timespec issue on alpine Linux. It's
> > > probably a packaging bug and I gave Dave the email address of the
> > > package maintainer. I'm not sure if the package maintainer has been
> > > contacted or released a fixed liburing package.
> > 
> > It was prety easy to bisect the problem with liburing so I filed a
> > bug pointing to the fix needed:
> > 
> >   https://gitlab.alpinelinux.org/alpine/aports/-/issues/13813
> 
> Alpine win the prize for quickest distro bug fix response. The liburing
> problem is now fixed in all current Alpine release streams, just minutes
> after I filed the above bug.

Awesome, thank you for identifying the issue and filing the issue!

Stefan