
ping Re: [PATCH v7 03/16] migration: split common postcopy out of ram postcopy

Message ID 20170925132312.GA7999@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Kevin Wolf Sept. 25, 2017, 1:23 p.m. UTC
Am 20.09.2017 um 13:45 hat Juan Quintela geschrieben:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> >> ping for 1-3
> >> Can we merge them?
> >
> > I see all of them have R-b's; so lets try and put them in the next
> > migration merge.
> >
> > Quintela: Sound good?
> 
> Yeap.

This patch broke qemu-iotests 181 ('Test postcopy live migration with
shared storage'):

Comments

Vladimir Sementsov-Ogievskiy Sept. 25, 2017, 2:31 p.m. UTC | #1
25.09.2017 16:23, Kevin Wolf wrote:
> Am 20.09.2017 um 13:45 hat Juan Quintela geschrieben:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>> ping for 1-3
>>>> Can we merge them?
>>> I see all of them have R-b's; so lets try and put them in the next
>>> migration merge.
>>>
>>> Quintela: Sound good?
>> Yeap.
> This patch broke qemu-iotests 181 ('Test postcopy live migration with
> shared storage'):
>
> --- /home/kwolf/source/qemu/tests/qemu-iotests/181.out  2017-06-16 19:19:53.000000000 +0200
> +++ 181.out.bad 2017-09-25 15:20:40.787582000 +0200
> @@ -21,18 +21,16 @@
>   === Do some I/O on the destination ===
>   
>   QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) qemu-io disk "read -P 0x55 0 64k"
> +(qemu) QEMU_PROG: Expected vmdescription section, but got 0
> +QEMU_PROG: Failed to get "write" lock
> +Is another process using the image?
> +qemu-io disk "read -P 0x55 0 64k"
>   read 65536/65536 bytes at offset 0
>   64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   (qemu)
>   (qemu) qemu-io disk "write -P 0x66 1M 64k"
> -wrote 65536/65536 bytes at offset 1048576
> -64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -
> -=== Shut down and check image ===
> -
> -(qemu) quit
> -(qemu)
> -(qemu) quit
> -No errors were found on the image.
> -*** done
> +QEMU_PROG: block/io.c:1359: bdrv_aligned_pwritev: Assertion `child->perm & BLK_PERM_WRITE' failed.
> +./common.config: Aborted                 (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
> +Timeout waiting for ops/sec on handle 1

I'm not sure about the locking error (I don't see it on my old kernel without
OFD locking), but it looks like test 181 should be fixed to set the
postcopy-ram capability on the target too (which was considered the correct
way on the list).
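
For illustration, setting the capability on both sides could look like the
minimal sketch below. It assumes QMP's migrate-set-capabilities command (the
HMP form is "migrate_set_capability postcopy-ram on"). The helper names
set_capability_cmd and postcopy_setup are hypothetical, and the code only
builds the command payloads; it does not talk to a running QEMU.

```python
import json


def set_capability_cmd(capability: str, enabled: bool) -> dict:
    """Build a QMP migrate-set-capabilities payload for one capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": capability, "state": enabled}]
        },
    }


def postcopy_setup(endpoints=("source", "destination")) -> dict:
    """Return the same postcopy-ram payload for every endpoint.

    The endpoint labels are illustrative; the point is that the destination
    gets the capability too, not only the source.
    """
    return {name: set_capability_cmd("postcopy-ram", True) for name in endpoints}


if __name__ == "__main__":
    for name, cmd in postcopy_setup().items():
        # Each line is what would be sent to that side's QMP monitor.
        print(name, json.dumps(cmd))
```

Both payloads are identical on purpose: the destination should not depend on
the source's setting to know that postcopy is expected.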
Kevin Wolf Sept. 25, 2017, 2:58 p.m. UTC | #2
Am 25.09.2017 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 25.09.2017 16:23, Kevin Wolf wrote:
> > Am 20.09.2017 um 13:45 hat Juan Quintela geschrieben:
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > ping for 1-3
> > > > > Can we merge them?
> > > > I see all of them have R-b's; so lets try and put them in the next
> > > > migration merge.
> > > > 
> > > > Quintela: Sound good?
> > > Yeap.
> > This patch broke qemu-iotests 181 ('Test postcopy live migration with
> > shared storage'):
> > 
> > --- /home/kwolf/source/qemu/tests/qemu-iotests/181.out  2017-06-16 19:19:53.000000000 +0200
> > +++ 181.out.bad 2017-09-25 15:20:40.787582000 +0200
> > @@ -21,18 +21,16 @@
> >   === Do some I/O on the destination ===
> >   QEMU X.Y.Z monitor - type 'help' for more information
> > -(qemu) qemu-io disk "read -P 0x55 0 64k"
> > +(qemu) QEMU_PROG: Expected vmdescription section, but got 0
> > +QEMU_PROG: Failed to get "write" lock
> > +Is another process using the image?
> > +qemu-io disk "read -P 0x55 0 64k"
> >   read 65536/65536 bytes at offset 0
> >   64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >   (qemu)
> >   (qemu) qemu-io disk "write -P 0x66 1M 64k"
> > -wrote 65536/65536 bytes at offset 1048576
> > -64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > -
> > -=== Shut down and check image ===
> > -
> > -(qemu) quit
> > -(qemu)
> > -(qemu) quit
> > -No errors were found on the image.
> > -*** done
> > +QEMU_PROG: block/io.c:1359: bdrv_aligned_pwritev: Assertion `child->perm & BLK_PERM_WRITE' failed.
> > +./common.config: Aborted                 (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> > +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> > +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
> > +Timeout waiting for ops/sec on handle 1
> 
> Not sure about locking (don't see this error on my old kernel without OFD
> locking), but it looks like that
> 181 test should be fixed to set postcopy-ram capability on target too (which
> was considered as correct way on list)

Whatever you think the preferred way to set up postcopy migration is: If
something worked before this patch and doesn't after it, that's a
regression and breaks backwards compatibility.

If we were talking about a graceful failure, maybe we could discuss
whether carefully and deliberately breaking compatibility could be
justified in this specific case. But the breakage is neither mentioned
in the commit message nor is it graceful, so I can only call it a bug.

Kevin
Vladimir Sementsov-Ogievskiy Sept. 25, 2017, 3:07 p.m. UTC | #3
25.09.2017 17:58, Kevin Wolf wrote:
> Am 25.09.2017 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 25.09.2017 16:23, Kevin Wolf wrote:
>>> Am 20.09.2017 um 13:45 hat Juan Quintela geschrieben:
>>>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>>> ping for 1-3
>>>>>> Can we merge them?
>>>>> I see all of them have R-b's; so lets try and put them in the next
>>>>> migration merge.
>>>>>
>>>>> Quintela: Sound good?
>>>> Yeap.
>>> This patch broke qemu-iotests 181 ('Test postcopy live migration with
>>> shared storage'):
>>>
>>> --- /home/kwolf/source/qemu/tests/qemu-iotests/181.out  2017-06-16 19:19:53.000000000 +0200
>>> +++ 181.out.bad 2017-09-25 15:20:40.787582000 +0200
>>> @@ -21,18 +21,16 @@
>>>    === Do some I/O on the destination ===
>>>    QEMU X.Y.Z monitor - type 'help' for more information
>>> -(qemu) qemu-io disk "read -P 0x55 0 64k"
>>> +(qemu) QEMU_PROG: Expected vmdescription section, but got 0
>>> +QEMU_PROG: Failed to get "write" lock
>>> +Is another process using the image?
>>> +qemu-io disk "read -P 0x55 0 64k"
>>>    read 65536/65536 bytes at offset 0
>>>    64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>    (qemu)
>>>    (qemu) qemu-io disk "write -P 0x66 1M 64k"
>>> -wrote 65536/65536 bytes at offset 1048576
>>> -64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>> -
>>> -=== Shut down and check image ===
>>> -
>>> -(qemu) quit
>>> -(qemu)
>>> -(qemu) quit
>>> -No errors were found on the image.
>>> -*** done
>>> +QEMU_PROG: block/io.c:1359: bdrv_aligned_pwritev: Assertion `child->perm & BLK_PERM_WRITE' failed.
>>> +./common.config: Aborted                 (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
>>> +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
>>> +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
>>> +Timeout waiting for ops/sec on handle 1
>> Not sure about locking (don't see this error on my old kernel without OFD
>> locking), but it looks like that
>> 181 test should be fixed to set postcopy-ram capability on target too (which
>> was considered as correct way on list)
> Whatever you think the preferred way to set up postcopy migration is: If
> something worked before this patch and doesn't after it, that's a
> regression and breaks backwards compatibility.
>
> If we were talking about a graceful failure, maybe we could discuss
> whether carefully and deliberately breaking compatibility could be
> justified in this specific case. But the breakage is neither mentioned
> in the commit message nor is it graceful, so I can only call it a bug.
>
> Kevin

It's of course my fault; I don't mean "it's the wrong test, so it's not my
problem". And I've already sent a patch.
Dr. David Alan Gilbert Sept. 25, 2017, 3:27 p.m. UTC | #4
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> 25.09.2017 17:58, Kevin Wolf wrote:
> > Am 25.09.2017 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 25.09.2017 16:23, Kevin Wolf wrote:
> > > > Am 20.09.2017 um 13:45 hat Juan Quintela geschrieben:
> > > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > > > ping for 1-3
> > > > > > > Can we merge them?
> > > > > > I see all of them have R-b's; so lets try and put them in the next
> > > > > > migration merge.
> > > > > > 
> > > > > > Quintela: Sound good?
> > > > > Yeap.
> > > > This patch broke qemu-iotests 181 ('Test postcopy live migration with
> > > > shared storage'):
> > > > 
> > > > --- /home/kwolf/source/qemu/tests/qemu-iotests/181.out  2017-06-16 19:19:53.000000000 +0200
> > > > +++ 181.out.bad 2017-09-25 15:20:40.787582000 +0200
> > > > @@ -21,18 +21,16 @@
> > > >    === Do some I/O on the destination ===
> > > >    QEMU X.Y.Z monitor - type 'help' for more information
> > > > -(qemu) qemu-io disk "read -P 0x55 0 64k"
> > > > +(qemu) QEMU_PROG: Expected vmdescription section, but got 0
> > > > +QEMU_PROG: Failed to get "write" lock
> > > > +Is another process using the image?
> > > > +qemu-io disk "read -P 0x55 0 64k"
> > > >    read 65536/65536 bytes at offset 0
> > > >    64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > >    (qemu)
> > > >    (qemu) qemu-io disk "write -P 0x66 1M 64k"
> > > > -wrote 65536/65536 bytes at offset 1048576
> > > > -64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > > -
> > > > -=== Shut down and check image ===
> > > > -
> > > > -(qemu) quit
> > > > -(qemu)
> > > > -(qemu) quit
> > > > -No errors were found on the image.
> > > > -*** done
> > > > +QEMU_PROG: block/io.c:1359: bdrv_aligned_pwritev: Assertion `child->perm & BLK_PERM_WRITE' failed.
> > > > +./common.config: Aborted                 (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> > > > +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> > > > +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
> > > > +Timeout waiting for ops/sec on handle 1
> > > Not sure about locking (don't see this error on my old kernel without OFD
> > > locking), but it looks like that
> > > 181 test should be fixed to set postcopy-ram capability on target too (which
> > > was considered as correct way on list)
> > Whatever you think the preferred way to set up postcopy migration is: If
> > something worked before this patch and doesn't after it, that's a
> > regression and breaks backwards compatibility.
> > 
> > If we were talking about a graceful failure, maybe we could discuss
> > whether carefully and deliberately breaking compatibility could be
> > justified in this specific case. But the breakage is neither mentioned
> > in the commit message nor is it graceful, so I can only call it a bug.
> > 
> > Kevin
> 
> It's of course my fault, I don't mean "it's wrong test, so it's not my
> problem") And I've already sent a patch.

Why does this fail so badly (asserts etc.)? I was hoping for something
a bit more obvious from the migration code.

Postcopy did originally work without the destination having the flag on,
but setting the flag on the destination was always good practice because
it detected early on whether the host support was there.

Dave

> -- 
> Best regards,
> Vladimir
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kevin Wolf Sept. 26, 2017, 10:12 a.m. UTC | #5
Am 25.09.2017 um 17:27 hat Dr. David Alan Gilbert geschrieben:
> > > Whatever you think the preferred way to set up postcopy migration is: If
> > > something worked before this patch and doesn't after it, that's a
> > > regression and breaks backwards compatibility.
> > > 
> > > If we were talking about a graceful failure, maybe we could discuss
> > > whether carefully and deliberately breaking compatibility could be
> > > justified in this specific case. But the breakage is neither mentioned
> > > in the commit message nor is it graceful, so I can only call it a bug.
> > > 
> > > Kevin
> > 
> > It's of course my fault, I don't mean "it's wrong test, so it's not my
> > problem") And I've already sent a patch.
> 
> Why does this fail so badly, asserts etc - I was hoping for something
> a bit more obvious from the migration code.
> 
> postcopy did originally work without the destination having the flag on
> but setting the flag on the destination was always good practice because
> it detected whether the host support was there early on.

So what does this mean for 2.11? Do you think it is acceptable to break
cases where the flag isn't set on the destination?

If so, just changing the test case is enough. But if not, I'd rather
keep the test case as it is and fix only the migration code.

Kevin
Dr. David Alan Gilbert Sept. 26, 2017, 10:21 a.m. UTC | #6
* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 25.09.2017 um 17:27 hat Dr. David Alan Gilbert geschrieben:
> > > > Whatever you think the preferred way to set up postcopy migration is: If
> > > > something worked before this patch and doesn't after it, that's a
> > > > regression and breaks backwards compatibility.
> > > > 
> > > > If we were talking about a graceful failure, maybe we could discuss
> > > > whether carefully and deliberately breaking compatibility could be
> > > > justified in this specific case. But the breakage is neither mentioned
> > > > in the commit message nor is it graceful, so I can only call it a bug.
> > > > 
> > > > Kevin
> > > 
> > > It's of course my fault, I don't mean "it's wrong test, so it's not my
> > > problem") And I've already sent a patch.
> > 
> > Why does this fail so badly, asserts etc - I was hoping for something
> > a bit more obvious from the migration code.
> > 
> > postcopy did originally work without the destination having the flag on
> > but setting the flag on the destination was always good practice because
> > it detected whether the host support was there early on.
> 
> So what does this mean for 2.11? Do you think it is acceptable breaking
> cases where the flag isn't set on the destination?

I think so, because we've always recommended setting it on the
destination for the early detection.

> If so, just changing the test case is enough. But if not, I'd rather
> keep the test case as it is and fix only the migration code.

I'd take the test case fix, but I also want to dig into why it fails so
badly; it would be nice just to have a clean failure telling you
that postcopy was expected.
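
As a rough idea of what such a clean failure could look like, here is a toy
model, not QEMU code. Every name in it (MigrationError, POSTCOPY_COMMANDS,
handle_command) is hypothetical, and the command names are placeholders. It
only shows the shape of the check: reject a postcopy command with a clear
message when the destination has not enabled the capability, instead of
continuing and hitting an assertion later.

```python
class MigrationError(Exception):
    """Raised when the incoming stream cannot be handled cleanly."""


# Placeholder command names for this sketch, not the real wire protocol.
POSTCOPY_COMMANDS = {"postcopy-advise", "postcopy-listen", "postcopy-run"}


def handle_command(command: str, capabilities: set) -> str:
    """Handle one incoming command, failing early on a capability mismatch."""
    if command in POSTCOPY_COMMANDS and "postcopy-ram" not in capabilities:
        raise MigrationError(
            f"received '{command}' but postcopy-ram is not enabled on the "
            "destination; enable the capability before the incoming migration"
        )
    return f"handled {command}"


if __name__ == "__main__":
    print(handle_command("postcopy-advise", {"postcopy-ram"}))
    try:
        handle_command("postcopy-advise", set())
    except MigrationError as err:
        print("clean failure:", err)
```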

Dave

> 
> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kevin Wolf Sept. 26, 2017, 12:32 p.m. UTC | #7
Am 26.09.2017 um 12:21 hat Dr. David Alan Gilbert geschrieben:
> * Kevin Wolf (kwolf@redhat.com) wrote:
> > Am 25.09.2017 um 17:27 hat Dr. David Alan Gilbert geschrieben:
> > > > > Whatever you think the preferred way to set up postcopy migration is: If
> > > > > something worked before this patch and doesn't after it, that's a
> > > > > regression and breaks backwards compatibility.
> > > > > 
> > > > > If we were talking about a graceful failure, maybe we could discuss
> > > > > whether carefully and deliberately breaking compatibility could be
> > > > > justified in this specific case. But the breakage is neither mentioned
> > > > > in the commit message nor is it graceful, so I can only call it a bug.
> > > > > 
> > > > > Kevin
> > > > 
> > > > It's of course my fault, I don't mean "it's wrong test, so it's not my
> > > > problem") And I've already sent a patch.
> > > 
> > > Why does this fail so badly, asserts etc - I was hoping for something
> > > a bit more obvious from the migration code.
> > > 
> > > postcopy did originally work without the destination having the flag on
> > > but setting the flag on the destination was always good practice because
> > > it detected whether the host support was there early on.
> > 
> > So what does this mean for 2.11? Do you think it is acceptable breaking
> > cases where the flag isn't set on the destination?
> 
> I think so, because we've always recommended setting it on the
> destination for the early detection.

Okay, I'll include the test case patch in my pull request today then.

> > If so, just changing the test case is enough. But if not, I'd rather
> > keep the test case as it is and fix only the migration code.
> 
> I'd take the test case fix, but I also want to dig why it fails so
> badly; it would be nice just to have a clean failure telling you
> that postcopy was expected.

Yes, that would be nice.

Kevin

Patch

--- /home/kwolf/source/qemu/tests/qemu-iotests/181.out  2017-06-16 19:19:53.000000000 +0200
+++ 181.out.bad 2017-09-25 15:20:40.787582000 +0200
@@ -21,18 +21,16 @@ 
 === Do some I/O on the destination ===
 
 QEMU X.Y.Z monitor - type 'help' for more information
-(qemu) qemu-io disk "read -P 0x55 0 64k"
+(qemu) QEMU_PROG: Expected vmdescription section, but got 0
+QEMU_PROG: Failed to get "write" lock
+Is another process using the image?
+qemu-io disk "read -P 0x55 0 64k"
 read 65536/65536 bytes at offset 0
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 (qemu) 
 (qemu) qemu-io disk "write -P 0x66 1M 64k"
-wrote 65536/65536 bytes at offset 1048576
-64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-
-=== Shut down and check image ===
-
-(qemu) quit
-(qemu) 
-(qemu) quit
-No errors were found on the image.
-*** done
+QEMU_PROG: block/io.c:1359: bdrv_aligned_pwritev: Assertion `child->perm & BLK_PERM_WRITE' failed.
+./common.config: Aborted                 (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
+echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
+fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
+Timeout waiting for ops/sec on handle 1