mbox

[PULL,00/19] Migration patches

Message ID 20190712143207.4214-1-quintela@redhat.com (mailing list archive)
State New, archived
Headers show

Pull-request

https://github.com/juanquintela/qemu.git tags/migration-pull-request

Message

Juan Quintela July 12, 2019, 2:31 p.m. UTC
The following changes since commit a2a9d4adabe340617a24eb73a8b2a116d28a6b38:

  Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.1-20190712' into staging (2019-07-12 11:06:48 +0100)

are available in the Git repository at:

  https://github.com/juanquintela/qemu.git tags/migration-pull-request

for you to fetch changes up to a48ad5602f496236b4e1955d9e2e8228a7d0ad56:

  migration: allow private destination ram with x-ignore-shared (2019-07-12 16:25:59 +0200)

----------------------------------------------------------------
Migration pull request

Fix the issues with the previous pull request and 32 bits.

Please apply.

----------------------------------------------------------------

Juan Quintela (3):
  migration: fix multifd_recv event typo
  migration-test: rename parameter to parameter_int
  migration-test: Add migration multifd test

Peng Tao (1):
  migration: allow private destination ram with x-ignore-shared

Peter Xu (10):
  migration: No need to take rcu during sync_dirty_bitmap
  memory: Don't set migration bitmap when without migration
  bitmap: Add bitmap_copy_with_{src|dst}_offset()
  memory: Pass mr into snapshot_and_clear_dirty
  memory: Introduce memory listener hook log_clear()
  kvm: Update comments for sync_dirty_bitmap
  kvm: Persistent per kvmslot dirty bitmap
  kvm: Introduce slots lock for memory listener
  kvm: Support KVM_CLEAR_DIRTY_LOG
  migration: Split log_clear() into smaller chunks

Wei Yang (5):
  migration/multifd: call multifd_send_sync_main when sending
    RAM_SAVE_FLAG_EOS
  migration/xbzrle: update cache and current_data in one place
  cutils: remove one unnecessary pointer operation
  migration/multifd: sync packet_num after all thread are done
  migration/ram.c: reset complete_round when we gets a queued page

 accel/kvm/kvm-all.c      | 260 ++++++++++++++++++++++++++++++++++++---
 accel/kvm/trace-events   |   1 +
 exec.c                   |  15 ++-
 include/exec/memory.h    |  19 +++
 include/exec/ram_addr.h  |  92 +++++++++++++-
 include/qemu/bitmap.h    |   9 ++
 include/sysemu/kvm_int.h |   4 +
 memory.c                 |  56 ++++++++-
 migration/migration.c    |   4 +
 migration/migration.h    |  27 ++++
 migration/ram.c          |  93 ++++++++++----
 migration/trace-events   |   3 +-
 tests/Makefile.include   |   2 +
 tests/migration-test.c   | 103 ++++++++++++----
 tests/test-bitmap.c      |  72 +++++++++++
 util/bitmap.c            |  85 +++++++++++++
 util/cutils.c            |   8 +-
 17 files changed, 770 insertions(+), 83 deletions(-)
 create mode 100644 tests/test-bitmap.c

Comments

Peter Maydell July 12, 2019, 4:33 p.m. UTC | #1
On Fri, 12 Jul 2019 at 15:32, Juan Quintela <quintela@redhat.com> wrote:
>
> The following changes since commit a2a9d4adabe340617a24eb73a8b2a116d28a6b38:
>
>   Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.1-20190712' into staging (2019-07-12 11:06:48 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/juanquintela/qemu.git tags/migration-pull-request
>
> for you to fetch changes up to a48ad5602f496236b4e1955d9e2e8228a7d0ad56:
>
>   migration: allow private destination ram with x-ignore-shared (2019-07-12 16:25:59 +0200)
>
> ----------------------------------------------------------------
> Migration pull request
>
> Fix the issues with the previous pull request and 32 bits.
>
> Please apply.
>

Still fails on aarch32 host, I'm afraid:

MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
QTEST_QEMU_BINARY=aarch64-softmmu/qemu-system-aarch64
QTEST_QEMU_IMG=qemu-img tests/migration-test -m=quick -k --tap <
/dev/null | ./scripts/tap-driver.pl --test-name="migration-test"
PASS 1 migration-test /aarch64/migration/deprecated
PASS 2 migration-test /aarch64/migration/bad_dest
PASS 3 migration-test /aarch64/migration/fd_proto
PASS 4 migration-test /aarch64/migration/postcopy/unix
PASS 5 migration-test /aarch64/migration/postcopy/recovery
PASS 6 migration-test /aarch64/migration/precopy/unix
PASS 7 migration-test /aarch64/migration/precopy/tcp
PASS 8 migration-test /aarch64/migration/xbzrle/unix
malloc(): memory corruption
Broken pipe
qemu-system-aarch64: load of migration failed: Invalid argument
/home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
terminate QEMU process but encountered exit status 1
Aborted
ERROR - too few tests run (expected 9, got 8)
/home/peter.maydell/qemu/tests/Makefile.include:899: recipe for target
'check-qtest-aarch64' failed


thanks
-- PMM
Dr. David Alan Gilbert July 12, 2019, 5:54 p.m. UTC | #2
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Fri, 12 Jul 2019 at 15:32, Juan Quintela <quintela@redhat.com> wrote:
> >
> > The following changes since commit a2a9d4adabe340617a24eb73a8b2a116d28a6b38:
> >
> >   Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.1-20190712' into staging (2019-07-12 11:06:48 +0100)
> >
> > are available in the Git repository at:
> >
> >   https://github.com/juanquintela/qemu.git tags/migration-pull-request
> >
> > for you to fetch changes up to a48ad5602f496236b4e1955d9e2e8228a7d0ad56:
> >
> >   migration: allow private destination ram with x-ignore-shared (2019-07-12 16:25:59 +0200)
> >
> > ----------------------------------------------------------------
> > Migration pull request
> >
> > Fix the issues with the previous pull request and 32 bits.
> >
> > Please apply.
> >
> 
> Still fails on aarch32 host, I'm afraid:
> 
> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> QTEST_QEMU_BINARY=aarch64-softmmu/qemu-system-aarch64
> QTEST_QEMU_IMG=qemu-img tests/migration-test -m=quick -k --tap <
> /dev/null | ./scripts/tap-driver.pl --test-name="migration-test"
> PASS 1 migration-test /aarch64/migration/deprecated
> PASS 2 migration-test /aarch64/migration/bad_dest
> PASS 3 migration-test /aarch64/migration/fd_proto
> PASS 4 migration-test /aarch64/migration/postcopy/unix
> PASS 5 migration-test /aarch64/migration/postcopy/recovery
> PASS 6 migration-test /aarch64/migration/precopy/unix
> PASS 7 migration-test /aarch64/migration/precopy/tcp
> PASS 8 migration-test /aarch64/migration/xbzrle/unix
> malloc(): memory corruption
> Broken pipe
> qemu-system-aarch64: load of migration failed: Invalid argument
> /home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
> terminate QEMU process but encountered exit status 1
> Aborted
> ERROR - too few tests run (expected 9, got 8)
> /home/peter.maydell/qemu/tests/Makefile.include:899: recipe for target
> 'check-qtest-aarch64' failed

Hmm, I think the only one to go near xbzrle specifically is
'migration/xbzrle: update cache and current_data in one place'

it did look right to me; but that is a subtle function so hmmmm.

Dave

> 
> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Maydell July 15, 2019, 11:16 a.m. UTC | #3
On Fri, 12 Jul 2019 at 17:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> Still fails on aarch32 host, I'm afraid:
>
> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> QTEST_QEMU_BINARY=aarch64-softmmu/qemu-system-aarch64
> QTEST_QEMU_IMG=qemu-img tests/migration-test -m=quick -k --tap <
> /dev/null | ./scripts/tap-driver.pl --test-name="migration-test"
> PASS 1 migration-test /aarch64/migration/deprecated
> PASS 2 migration-test /aarch64/migration/bad_dest
> PASS 3 migration-test /aarch64/migration/fd_proto
> PASS 4 migration-test /aarch64/migration/postcopy/unix
> PASS 5 migration-test /aarch64/migration/postcopy/recovery
> PASS 6 migration-test /aarch64/migration/precopy/unix
> PASS 7 migration-test /aarch64/migration/precopy/tcp
> PASS 8 migration-test /aarch64/migration/xbzrle/unix
> malloc(): memory corruption
> Broken pipe
> qemu-system-aarch64: load of migration failed: Invalid argument
> /home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
> terminate QEMU process but encountered exit status 1
> Aborted
> ERROR - too few tests run (expected 9, got 8)
> /home/peter.maydell/qemu/tests/Makefile.include:899: recipe for target
> 'check-qtest-aarch64' failed

A run with valgrind:

(armhf)pmaydell@mustang-maydell:~/qemu/build/all-a32$
QTEST_QEMU_BINARY='valgrind aarch64-softmmu/qemu-system-aarch64'
tests/migration-test -v -p '/aarch64/migration/multifd/tcp'
/aarch64/migration/multifd/tcp: ==4034== Memcheck, a memory error detector
==4034== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4034== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==4034== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
chardev=char0,mode=control -machine accel=qtest -display none -machine
virt,accel=kvm:tcg,gic-version=max -name vmsource,debug-threads=on
-cpu max -m 150M -serial file:/tmp/migration-test-mSLr4A/src_serial
-kernel /tmp/migration-test-mSLr4A/bootsect
==4034==
==4040== Memcheck, a memory error detector
==4040== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4040== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==4040== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
chardev=char0,mode=control -machine accel=qtest -display none -machine
virt,accel=kvm:tcg,gic-version=max -name vmdest,debug-threads=on -cpu
max -m 150M -serial file:/tmp/migration-test-mSLr4A/dest_serial
-kernel /tmp/migration-test-mSLr4A/bootsect -incoming tcp:127.0.0.1:0
==4040==
==4034== Thread 5 multifdsend_0:
==4034== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
==4034==    at 0x5299F06: __libc_do_syscall (libc-do-syscall.S:47)
==4034==    by 0x5298FCB: sendmsg (sendmsg.c:28)
==4034==    by 0x60135D: qio_channel_socket_writev (channel-socket.c:544)
==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
==4034==    by 0x26BA73: multifd_send_initial_packet (ram.c:711)
==4034==    by 0x26BA73: multifd_send_thread (ram.c:1085)
==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
==4034==    by 0x5290613: start_thread (pthread_create.c:463)
==4034==    by 0x53487FB: ??? (clone.S:73)
==4034==  Address 0x2320048d is on thread 5's stack
==4034==  in frame #5, created by multifd_send_thread (ram.c:1077)
==4034==
==4034== Thread 6 multifdsend_1:
==4034== Invalid write of size 4
==4034==    at 0x26BB7C: multifd_send_fill_packet (ram.c:806)
==4034==    by 0x26BB7C: multifd_send_thread (ram.c:1101)
==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
==4034==    by 0x5290613: start_thread (pthread_create.c:463)
==4034==    by 0x53487FB: ??? (clone.S:73)
==4034==  Address 0x224ed668 is 0 bytes after a block of size 832 alloc'd
==4034==    at 0x4841BC4: calloc (vg_replace_malloc.c:711)
==4034==    by 0x5018269: g_malloc0 (in
/usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0.5600.4)
==4034==
==4034== Invalid write of size 4
==4034==    at 0x26BB82: multifd_send_fill_packet (ram.c:806)
==4034==    by 0x26BB82: multifd_send_thread (ram.c:1101)
==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
==4034==    by 0x5290613: start_thread (pthread_create.c:463)
==4034==    by 0x53487FB: ??? (clone.S:73)
==4034==  Address 0x224ed66c is 4 bytes after a block of size 832 alloc'd
==4034==    at 0x4841BC4: calloc (vg_replace_malloc.c:711)
==4034==    by 0x5018269: g_malloc0 (in
/usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0.5600.4)
==4034==
==4034== Invalid read of size 4
==4034==    at 0x5FF1DA: qio_channel_writev_full (channel.c:86)
==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
==4034==    by 0x26BBD9: multifd_send_thread (ram.c:1111)
==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
==4034==    by 0x5290613: start_thread (pthread_create.c:463)
==4034==    by 0x53487FB: ??? (clone.S:73)
==4034==  Address 0x30 is not stack'd, malloc'd or (recently) free'd
==4034==
==4034==
==4034== Process terminating with default action of signal 11 (SIGSEGV)
==4034==  Access not within mapped region at address 0x30
==4034==    at 0x5FF1DA: qio_channel_writev_full (channel.c:86)
==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
==4034==    by 0x26BBD9: multifd_send_thread (ram.c:1111)
==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
==4034==    by 0x5290613: start_thread (pthread_create.c:463)
==4034==    by 0x53487FB: ??? (clone.S:73)
==4034==  If you believe this happened as a result of a stack
==4034==  overflow in your program's main thread (unlikely but
==4034==  possible), you can try to increase the size of the
==4034==  main thread stack using the --main-stacksize= flag.
==4034==  The main thread stack size used in this run was 8388608.
==4034==
==4034== HEAP SUMMARY:
==4034==     in use at exit: 5,994,911 bytes in 23,588 blocks
==4034==   total heap usage: 87,487 allocs, 63,899 frees, 17,732,188
bytes allocated
==4034==
==4034== LEAK SUMMARY:
==4034==    definitely lost: 56 bytes in 1 blocks
==4034==    indirectly lost: 64 bytes in 2 blocks
==4034==      possibly lost: 1,620 bytes in 26 blocks
==4034==    still reachable: 5,993,171 bytes in 23,559 blocks
==4034==         suppressed: 0 bytes in 0 blocks
==4034== Rerun with --leak-check=full to see details of leaked memory
==4034==
==4034== For counts of detected and suppressed errors, rerun with: -v
==4034== Use --track-origins=yes to see where uninitialised values come from
==4034== ERROR SUMMARY: 66 errors from 4 contexts (suppressed: 6 from 3)
Broken pipe
qemu-system-aarch64: load of migration failed: Input/output error
==4040==
==4040== HEAP SUMMARY:
==4040==     in use at exit: 4,893,269 bytes in 19,702 blocks
==4040==   total heap usage: 86,196 allocs, 66,494 frees, 17,438,183
bytes allocated
==4040==
==4040== LEAK SUMMARY:
==4040==    definitely lost: 0 bytes in 0 blocks
==4040==    indirectly lost: 0 bytes in 0 blocks
==4040==      possibly lost: 1,160 bytes in 5 blocks
==4040==    still reachable: 4,892,109 bytes in 19,697 blocks
==4040==         suppressed: 0 bytes in 0 blocks
==4040== Rerun with --leak-check=full to see details of leaked memory
==4040==
==4040== For counts of detected and suppressed errors, rerun with: -v
==4040== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 3)
/home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
terminate QEMU process but encountered exit status 1
Aborted

thanks
-- PMM
Juan Quintela July 15, 2019, 1:44 p.m. UTC | #4
Peter Maydell <peter.maydell@linaro.org> wrote:
> On Fri, 12 Jul 2019 at 17:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Still fails on aarch32 host, I'm afraid:

Hi

dropping the multifd test patch from now.  For "some" reason, having a
packed struct and 32bits is getting ugly, not sure yet _why_.

Resending the pull request.


>>
>> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
>> QTEST_QEMU_BINARY=aarch64-softmmu/qemu-system-aarch64
>> QTEST_QEMU_IMG=qemu-img tests/migration-test -m=quick -k --tap <
>> /dev/null | ./scripts/tap-driver.pl --test-name="migration-test"
>> PASS 1 migration-test /aarch64/migration/deprecated
>> PASS 2 migration-test /aarch64/migration/bad_dest
>> PASS 3 migration-test /aarch64/migration/fd_proto
>> PASS 4 migration-test /aarch64/migration/postcopy/unix
>> PASS 5 migration-test /aarch64/migration/postcopy/recovery
>> PASS 6 migration-test /aarch64/migration/precopy/unix
>> PASS 7 migration-test /aarch64/migration/precopy/tcp
>> PASS 8 migration-test /aarch64/migration/xbzrle/unix
>> malloc(): memory corruption
>> Broken pipe
>> qemu-system-aarch64: load of migration failed: Invalid argument
>> /home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
>> terminate QEMU process but encountered exit status 1
>> Aborted
>> ERROR - too few tests run (expected 9, got 8)
>> /home/peter.maydell/qemu/tests/Makefile.include:899: recipe for target
>> 'check-qtest-aarch64' failed
>
> A run with valgrind:
>
> (armhf)pmaydell@mustang-maydell:~/qemu/build/all-a32$
> QTEST_QEMU_BINARY='valgrind aarch64-softmmu/qemu-system-aarch64'
> tests/migration-test -v -p '/aarch64/migration/multifd/tcp'
> /aarch64/migration/multifd/tcp: ==4034== Memcheck, a memory error detector
> ==4034== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==4034== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==4034== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
> unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
> chardev=char0,mode=control -machine accel=qtest -display none -machine
> virt,accel=kvm:tcg,gic-version=max -name vmsource,debug-threads=on
> -cpu max -m 150M -serial file:/tmp/migration-test-mSLr4A/src_serial
> -kernel /tmp/migration-test-mSLr4A/bootsect
> ==4034==
> ==4040== Memcheck, a memory error detector
> ==4040== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==4040== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==4040== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
> unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
> chardev=char0,mode=control -machine accel=qtest -display none -machine
> virt,accel=kvm:tcg,gic-version=max -name vmdest,debug-threads=on -cpu
> max -m 150M -serial file:/tmp/migration-test-mSLr4A/dest_serial
> -kernel /tmp/migration-test-mSLr4A/bootsect -incoming tcp:127.0.0.1:0
> ==4040==
> ==4034== Thread 5 multifdsend_0:
> ==4034== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
> ==4034==    at 0x5299F06: __libc_do_syscall (libc-do-syscall.S:47)
> ==4034==    by 0x5298FCB: sendmsg (sendmsg.c:28)
> ==4034==    by 0x60135D: qio_channel_socket_writev (channel-socket.c:544)
> ==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
> ==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
> ==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
> ==4034==    by 0x26BA73: multifd_send_initial_packet (ram.c:711)
> ==4034==    by 0x26BA73: multifd_send_thread (ram.c:1085)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x2320048d is on thread 5's stack
> ==4034==  in frame #5, created by multifd_send_thread (ram.c:1077)
> ==4034==
> ==4034== Thread 6 multifdsend_1:
> ==4034== Invalid write of size 4
> ==4034==    at 0x26BB7C: multifd_send_fill_packet (ram.c:806)
> ==4034==    by 0x26BB7C: multifd_send_thread (ram.c:1101)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x224ed668 is 0 bytes after a block of size 832 alloc'd
> ==4034==    at 0x4841BC4: calloc (vg_replace_malloc.c:711)
> ==4034==    by 0x5018269: g_malloc0 (in
> /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0.5600.4)
> ==4034==
> ==4034== Invalid write of size 4
> ==4034==    at 0x26BB82: multifd_send_fill_packet (ram.c:806)
> ==4034==    by 0x26BB82: multifd_send_thread (ram.c:1101)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x224ed66c is 4 bytes after a block of size 832 alloc'd
> ==4034==    at 0x4841BC4: calloc (vg_replace_malloc.c:711)
> ==4034==    by 0x5018269: g_malloc0 (in
> /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0.5600.4)
> ==4034==
> ==4034== Invalid read of size 4
> ==4034==    at 0x5FF1DA: qio_channel_writev_full (channel.c:86)
> ==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
> ==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
> ==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
> ==4034==    by 0x26BBD9: multifd_send_thread (ram.c:1111)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x30 is not stack'd, malloc'd or (recently) free'd
> ==4034==
> ==4034==
> ==4034== Process terminating with default action of signal 11 (SIGSEGV)
> ==4034==  Access not within mapped region at address 0x30
> ==4034==    at 0x5FF1DA: qio_channel_writev_full (channel.c:86)
> ==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
> ==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
> ==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
> ==4034==    by 0x26BBD9: multifd_send_thread (ram.c:1111)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  If you believe this happened as a result of a stack
> ==4034==  overflow in your program's main thread (unlikely but
> ==4034==  possible), you can try to increase the size of the
> ==4034==  main thread stack using the --main-stacksize= flag.
> ==4034==  The main thread stack size used in this run was 8388608.
> ==4034==
> ==4034== HEAP SUMMARY:
> ==4034==     in use at exit: 5,994,911 bytes in 23,588 blocks
> ==4034==   total heap usage: 87,487 allocs, 63,899 frees, 17,732,188
> bytes allocated
> ==4034==
> ==4034== LEAK SUMMARY:
> ==4034==    definitely lost: 56 bytes in 1 blocks
> ==4034==    indirectly lost: 64 bytes in 2 blocks
> ==4034==      possibly lost: 1,620 bytes in 26 blocks
> ==4034==    still reachable: 5,993,171 bytes in 23,559 blocks
> ==4034==         suppressed: 0 bytes in 0 blocks
> ==4034== Rerun with --leak-check=full to see details of leaked memory
> ==4034==
> ==4034== For counts of detected and suppressed errors, rerun with: -v
> ==4034== Use --track-origins=yes to see where uninitialised values come from
> ==4034== ERROR SUMMARY: 66 errors from 4 contexts (suppressed: 6 from 3)
> Broken pipe
> qemu-system-aarch64: load of migration failed: Input/output error
> ==4040==
> ==4040== HEAP SUMMARY:
> ==4040==     in use at exit: 4,893,269 bytes in 19,702 blocks
> ==4040==   total heap usage: 86,196 allocs, 66,494 frees, 17,438,183
> bytes allocated
> ==4040==
> ==4040== LEAK SUMMARY:
> ==4040==    definitely lost: 0 bytes in 0 blocks
> ==4040==    indirectly lost: 0 bytes in 0 blocks
> ==4040==      possibly lost: 1,160 bytes in 5 blocks
> ==4040==    still reachable: 4,892,109 bytes in 19,697 blocks
> ==4040==         suppressed: 0 bytes in 0 blocks
> ==4040== Rerun with --leak-check=full to see details of leaked memory
> ==4040==
> ==4040== For counts of detected and suppressed errors, rerun with: -v
> ==4040== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 3)
> /home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
> terminate QEMU process but encountered exit status 1
> Aborted
>
> thanks
> -- PMM
Peter Maydell July 15, 2019, 1:48 p.m. UTC | #5
On Mon, 15 Jul 2019 at 14:44, Juan Quintela <quintela@redhat.com> wrote:
>
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > On Fri, 12 Jul 2019 at 17:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> Still fails on aarch32 host, I'm afraid:
>
> Hi
>
> dropping the multifd test patch from now.  For "some" reason, having a
> packed struct and 32bits is getting ugly, not sure yet _why_.

IMHO 'packed' structs are usually a bad idea. They have a bunch
of behaviours you may not be expecting (for instance they're
also not naturally aligned, and arrays of them won't be the
size you expect).

thanks
-- PMM
Daniel P. Berrangé July 15, 2019, 2:04 p.m. UTC | #6
On Mon, Jul 15, 2019 at 12:16:57PM +0100, Peter Maydell wrote:
> On Fri, 12 Jul 2019 at 17:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> > Still fails on aarch32 host, I'm afraid:
> >
> > MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> > QTEST_QEMU_BINARY=aarch64-softmmu/qemu-system-aarch64
> > QTEST_QEMU_IMG=qemu-img tests/migration-test -m=quick -k --tap <
> > /dev/null | ./scripts/tap-driver.pl --test-name="migration-test"
> > PASS 1 migration-test /aarch64/migration/deprecated
> > PASS 2 migration-test /aarch64/migration/bad_dest
> > PASS 3 migration-test /aarch64/migration/fd_proto
> > PASS 4 migration-test /aarch64/migration/postcopy/unix
> > PASS 5 migration-test /aarch64/migration/postcopy/recovery
> > PASS 6 migration-test /aarch64/migration/precopy/unix
> > PASS 7 migration-test /aarch64/migration/precopy/tcp
> > PASS 8 migration-test /aarch64/migration/xbzrle/unix
> > malloc(): memory corruption
> > Broken pipe
> > qemu-system-aarch64: load of migration failed: Invalid argument
> > /home/peter.maydell/qemu/tests/libqtest.c:137: kill_qemu() tried to
> > terminate QEMU process but encountered exit status 1
> > Aborted
> > ERROR - too few tests run (expected 9, got 8)
> > /home/peter.maydell/qemu/tests/Makefile.include:899: recipe for target
> > 'check-qtest-aarch64' failed
> 
> A run with valgrind:
> 
> (armhf)pmaydell@mustang-maydell:~/qemu/build/all-a32$
> QTEST_QEMU_BINARY='valgrind aarch64-softmmu/qemu-system-aarch64'
> tests/migration-test -v -p '/aarch64/migration/multifd/tcp'
> /aarch64/migration/multifd/tcp: ==4034== Memcheck, a memory error detector
> ==4034== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==4034== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==4034== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
> unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
> chardev=char0,mode=control -machine accel=qtest -display none -machine
> virt,accel=kvm:tcg,gic-version=max -name vmsource,debug-threads=on
> -cpu max -m 150M -serial file:/tmp/migration-test-mSLr4A/src_serial
> -kernel /tmp/migration-test-mSLr4A/bootsect
> ==4034==
> ==4040== Memcheck, a memory error detector
> ==4040== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
> ==4040== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
> ==4040== Command: aarch64-softmmu/qemu-system-aarch64 -qtest
> unix:/tmp/qtest-4033.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-4033.qmp,id=char0 -mon
> chardev=char0,mode=control -machine accel=qtest -display none -machine
> virt,accel=kvm:tcg,gic-version=max -name vmdest,debug-threads=on -cpu
> max -m 150M -serial file:/tmp/migration-test-mSLr4A/dest_serial
> -kernel /tmp/migration-test-mSLr4A/bootsect -incoming tcp:127.0.0.1:0
> ==4040==
> ==4034== Thread 5 multifdsend_0:
> ==4034== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
> ==4034==    at 0x5299F06: __libc_do_syscall (libc-do-syscall.S:47)
> ==4034==    by 0x5298FCB: sendmsg (sendmsg.c:28)
> ==4034==    by 0x60135D: qio_channel_socket_writev (channel-socket.c:544)
> ==4034==    by 0x5FF995: qio_channel_writev (channel.c:207)
> ==4034==    by 0x5FF995: qio_channel_writev_all (channel.c:171)
> ==4034==    by 0x5FFA0F: qio_channel_write_all (channel.c:257)
> ==4034==    by 0x26BA73: multifd_send_initial_packet (ram.c:711)
> ==4034==    by 0x26BA73: multifd_send_thread (ram.c:1085)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x2320048d is on thread 5's stack
> ==4034==  in frame #5, created by multifd_send_thread (ram.c:1077)

This is a simple missing initialization

multifd_send_initial_packet has a local variable:

    MultiFDInit_t msg;

the code initializes 4 fields, but does *not* initialize the 2
padding fields, so we're writing random data. Harmless as the
receiving end will ignore padding too, but we should fill with
zeros really. so

    MultiFDInit_t msg = {0};

should fix it.

> ==4034==
> ==4034== Thread 6 multifdsend_1:
> ==4034== Invalid write of size 4
> ==4034==    at 0x26BB7C: multifd_send_fill_packet (ram.c:806)
> ==4034==    by 0x26BB7C: multifd_send_thread (ram.c:1101)
> ==4034==    by 0x63C0B1: qemu_thread_start (qemu-thread-posix.c:502)
> ==4034==    by 0x5290613: start_thread (pthread_create.c:463)
> ==4034==    by 0x53487FB: ??? (clone.S:73)
> ==4034==  Address 0x224ed668 is 0 bytes after a block of size 832 alloc'd
> ==4034==    at 0x4841BC4: calloc (vg_replace_malloc.c:711)
> ==4034==    by 0x5018269: g_malloc0 (in
> /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0.5600.4)

multifd_send_fill_packet is getting the oob write in:

    for (i = 0; i < p->pages->used; i++) {
        packet->offset[i] = cpu_to_be64(p->pages->offset[i]);
    }

offset is a variable length struct field at the end of MultiFDPacket_t:

  typedef struct {
     ...snip...
    char ramblock[256];
    uint64_t offset[];
  } __attribute__((packed)) MultiFDPacket_t;

but the packet data is allocated back in multifd_save_setup using:

        p->packet_len = sizeof(MultiFDPacket_t)
                      + sizeof(ram_addr_t) * page_count;
        p->packet = g_malloc0(p->packet_len);


Notice the field in the struct is "uint64_t" but the length we're
allocating is "ram_addr_t".

Since this is a 32-bit build, I'm guessing ram_addr_t is a 32-bit
integer and thus we're under-allocating the variable length offset
field by half

Regards,
Daniel
Juan Quintela July 15, 2019, 2:10 p.m. UTC | #7
Peter Maydell <peter.maydell@linaro.org> wrote:
> On Mon, 15 Jul 2019 at 14:44, Juan Quintela <quintela@redhat.com> wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On Fri, 12 Jul 2019 at 17:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> >> Still fails on aarch32 host, I'm afraid:
>>
>> Hi
>>
>> dropping the multifd test patch from now.  For "some" reason, having a
>> packed struct and 32bits is getting ugly, not sure yet _why_.
>
> IMHO 'packed' structs are usually a bad idea. They have a bunch
> of behaviours you may not be expecting (for instance they're
> also not naturally aligned, and arrays of them won't be the
> size you expect).

I can't get everything happy O:-)
For the multifd initial packet, I used to have that I wrote the fields
by hand.  Then danp asked that I used a packed struct, and converted the
values inside it.  So ..... Imposible to have everybody happy.

Anyways, the struct is packed, both sides are i386 32bits, and it should
be exactly the same, but it appears that there is where your valgrind
problems appear.  Still investigating _where_ the problem is.  What is
even weirder is that there is no error at all on 64bits.

Thanks, Juan.

PS.  BTW, did you launched by hand the guests with valgrind, or there is
     a trick that I am missing for launching a qtest with valgrind?
Peter Maydell July 15, 2019, 2:15 p.m. UTC | #8
On Mon, 15 Jul 2019 at 15:10, Juan Quintela <quintela@redhat.com> wrote:
> PS.  BTW, did you launched by hand the guests with valgrind, or there is
>      a trick that I am missing for launching a qtest with valgrind?

I quoted the command line I used:

QTEST_QEMU_BINARY='valgrind aarch64-softmmu/qemu-system-aarch64'
tests/migration-test -v -p '/aarch64/migration/multifd/tcp'

(https://wiki.qemu.org/Features/QTest lists this and some other
things you might want to do with qtest tests.)

thanks
-- PMM
Peter Maydell July 15, 2019, 2:17 p.m. UTC | #9
On Mon, 15 Jul 2019 at 15:04, Daniel P. Berrangé <berrange@redhat.com> wrote:
>     MultiFDInit_t msg = {0};
>
> should fix it.

A minor nit, but "= {}" is our standard struct-zero-initializer
so we should prefer that, I think. (I know it is not the C-spec
recommended version but some C compilers incorrectly warn about
"= {0}" whereas no compiler we care about warns about the
gnu-extension "= {}".)

thanks
-- PMM