[v3,8/9] tests/qtest: make more migration pre-copy scenarios run non-live

Message ID 20230531132400.1129576-9-berrange@redhat.com (mailing list archive)
State New, archived
Series tests/qtest: make migration-test massively faster

Commit Message

Daniel P. Berrangé May 31, 2023, 1:23 p.m. UTC
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergance and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At 3 mbps bandwidth limit the first iteration takes a very
long time (~30 seconds).

While it is important to test the migration passes and convergance
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run
non-live (ie guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage the following scenarios are unchanged

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with UNIX compress
 * Precopy with UNIX compress (nowait)
 * Precopy with multifd

On a test machine this reduces execution time from 13 minutes to
8 minutes.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 81 +++++++++++++++++++++++++++++-------
 1 file changed, 66 insertions(+), 15 deletions(-)
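
In code terms, the core of the change is a stop / full-speed / cont sequence
in test_precopy_common(), condensed here from the diff at the bottom of this
page (just the relevant branches, not a standalone function):

    if (!args->live && args->result == MIG_TEST_SUCCEED) {
        /* Pause the guest CPUs so no RAM is re-dirtied during transfer */
        qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
        /* ... and let the migration run at full speed */
        migrate_ensure_converge(from);
    }

    /* ... the migration itself is then started as before ... */

    if (!args->live) {
        wait_for_migration_complete(from);
        /*
         * The destination must finish reading all incoming data before
         * 'cont' is issued, otherwise the command would be ignored.
         */
        wait_for_migration_complete(to);
        qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
    }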

Comments

Thomas Huth June 1, 2023, 9:47 a.m. UTC | #1
On 31/05/2023 15.23, Daniel P. Berrangé wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
> 
> While it is important to test the migration passes and convergance

s/convergance/convergence/ (also in the first paragraph)

> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
> 
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
> 
> For test coverage the following scenarios are unchanged
> 
>   * Precopy with UNIX sockets
>   * Precopy with UNIX sockets and dirty ring tracking
>   * Precopy with XBZRLE
>   * Precopy with UNIX compress
>   * Precopy with UNIX compress (nowait)
>   * Precopy with multifd
> 
> On a test machine this reduces execution time from 13 minutes to
> 8 minutes.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 81 +++++++++++++++++++++++++++++-------
>   1 file changed, 66 insertions(+), 15 deletions(-)

Tested-by: Thomas Huth <thuth@redhat.com>
Juan Quintela June 1, 2023, 12:33 p.m. UTC | #2
Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage the following scenarios are unchanged
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with UNIX compress
>  * Precopy with UNIX compress (nowait)
>  * Precopy with multifd
>
> On a test machine this reduces execution time from 13 minutes to
> 8 minutes.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Hi

I have a different idea to make the migration tests faster.  Namely, just
stop letting migration enter the completion stage until we are ready for it.

I need that functionality for vfio migration as well, and since I have to
create the patches anyway, can we put the last two patches on hold and
give me a couple of days?

Thomas, do you mind if I take the whole set of patches through the
migration tree?  Almost everything is migration related, and I am
changing the same files as Daniel, so it is easier if I can take them
through my tree.

Later, Juan.
Daniel P. Berrangé June 1, 2023, 12:38 p.m. UTC | #3
On Thu, Jun 01, 2023 at 02:33:56PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage the following scenarios are unchanged
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with UNIX compress
> >  * Precopy with UNIX compress (nowait)
> >  * Precopy with multifd
> >
> > On a test machine this reduces execution time from 13 minutes to
> > 8 minutes.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> Hi
> 
> I have a different idea to get the migration tests faster.  Namely, just
> stop letting enter the completation stage until we are ready to use it.

That is still going to be slower than running the migration non-live,
because the guest will be dirtying memory that needs to be transferred.
Even with an elevated bandwidth limit, that's going to have a CPU cost,
which will hurt running time in CI where a lot is running in parallel.

> I need that functionality also for vfio migration, so as I have to
> create the patches, can we put on hold the last two patches and give me
> a couple of days?

IMHO this patch is beneficial regardless.

The next patch might be redundant though.


With regards,
Daniel
Peter Xu June 1, 2023, 3:30 p.m. UTC | #4
Thanks for looking into this.. definitely worthwhile.

On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergance and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At 3 mbps bandwidth limit the first iteration takes a very
> long time (~30 seconds).
> 
> While it is important to test the migration passes and convergance
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
> 
> To optimize time taken, switch most of the test scenarios to run
> non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
> 
> For test coverage the following scenarios are unchanged

Curious how the ones below were chosen?  I assume..

> 
>  * Precopy with UNIX sockets

this one verifies dirty log.

>  * Precopy with UNIX sockets and dirty ring tracking

... dirty ring...

>  * Precopy with XBZRLE

... xbzrle I think needs a diff on old/new, makes sense.

>  * Precopy with UNIX compress
>  * Precopy with UNIX compress (nowait)
>  * Precopy with multifd

What about the other three?  Especially the two compression tests.

> 
> On a test machine this reduces execution time from 13 minutes to
> 8 minutes.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Thanks,
Daniel P. Berrangé June 1, 2023, 3:39 p.m. UTC | #5
On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
> Thanks for looking into this.. definitely worthwhile.
> 
> On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergance and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> > 
> > While it is important to test the migration passes and convergance
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> > 
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> > 
> > For test coverage the following scenarios are unchanged
> 
> Curious how are below chosen?  I assume..

Chosen based on whether they exercise code paths that are unique
and interesting during the RAM transfer phase.

Essentially the goal is that if we have N% code coverage before this
patch, then we should still have the same N% code coverage after this
patch.

The TLS tests exercise code paths that are unique during the migration
establishment phase. Once established they don't exercise anything
"interesting" during the RAM transfer phase. Thus we don't lose code
coverage by running TLS tests non-live.

> 
> > 
> >  * Precopy with UNIX sockets
> 
> this one verifies dirty log.
> 
> >  * Precopy with UNIX sockets and dirty ring tracking
> 
> ... dirty ring...
> 
> >  * Precopy with XBZRLE
> 
> ... xbzrle I think needs a diff on old/new, makes sense.
> 
> >  * Precopy with UNIX compress
> >  * Precopy with UNIX compress (nowait)
> >  * Precopy with multifd
> 
> What about the rest three?  Especially for two compression tests.

The compress thread logic is unique/interesting during RAM transfer,
so it benefits from running live. The wait vs non-wait scenarios each
test a distinct codepath/logic.

multifd has been a massive source of bugs and has very different
codepaths from non-multifd, so it is needed for coverage.


With regards,
Daniel
Peter Xu June 1, 2023, 3:53 p.m. UTC | #6
On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
> > Thanks for looking into this.. definitely worthwhile.
> > 
> > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > these we force non-convergance and run for one iteration, then let it
> > > converge and wait for completion during the second (or following)
> > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > long time (~30 seconds).
> > > 
> > > While it is important to test the migration passes and convergance
> > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > TLS migration scenarios in particular are merely exercising different
> > > code paths during connection establishment.
> > > 
> > > To optimize time taken, switch most of the test scenarios to run
> > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > a massive speed up for most of the test scenarios.
> > > 
> > > For test coverage the following scenarios are unchanged
> > 
> > Curious how are below chosen?  I assume..
> 
> Chosen based on whether they exercise code paths that are unique
> and interesting during the RAM transfer phase.
> 
> Essentially the goal is that if we have N% code coverage before this
> patch, then we should still have the same N% code coverage after this
> patch.
> 
> The TLS tests exercise code paths that are unique during the migration
> establishment phase. Once establishd they don't exercise anything
> "interesting" during RAM transfer phase. Thus we don't loose code coverage
> by runing TLS tests non-live.
> 
> > 
> > > 
> > >  * Precopy with UNIX sockets
> > 
> > this one verifies dirty log.
> > 
> > >  * Precopy with UNIX sockets and dirty ring tracking
> > 
> > ... dirty ring...
> > 
> > >  * Precopy with XBZRLE
> > 
> > ... xbzrle I think needs a diff on old/new, makes sense.
> > 
> > >  * Precopy with UNIX compress
> > >  * Precopy with UNIX compress (nowait)
> > >  * Precopy with multifd
> > 
> > What about the rest three?  Especially for two compression tests.
> 
> The compress thread logic is unique/interesting during RAM transfer
> so benefits from running live. The wait vs non-wait scenario tests
> a distinct codepath/logic.

I assume you mean e.g. compressing while the guest page is being modified,
where we should survive rather than crashing the compressor?

Okay to me.

> 
> multifd has been a massive source of bugs and has very different
> codepaths from non-multifd, so is needed for codeage.

:(

I think it makes sense to keep multifd in that list, especially if we want
to some day make it more or less the default.

Would you mind adding a comment to each of the .live=true cases right above?
Then it'll be fairly clear that's not the default, and any new test case
should justify being live=true.
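
Something along these lines, e.g. for the xbzrle case (the surrounding
fields and the exact wording are just a sketch):

    MigrateCommon args = {
        .listen_uri = uri,
        .connect_uri = uri,
        .start_hook = test_migrate_xbzrle_start,

        .iterations = 2,
        /*
         * Needs to run live: xbzrle only does useful work when a page
         * is dirtied again and sent a second time, so the guest must
         * keep running and modifying memory between iterations.
         */
        .live = true,
    };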

Feel free to add:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!
Daniel P. Berrangé June 1, 2023, 3:55 p.m. UTC | #7
On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
> On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
> > > Thanks for looking into this.. definitely worthwhile.
> > > 
> > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> > > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > > these we force non-convergance and run for one iteration, then let it
> > > > converge and wait for completion during the second (or following)
> > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > > long time (~30 seconds).
> > > > 
> > > > While it is important to test the migration passes and convergance
> > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > > TLS migration scenarios in particular are merely exercising different
> > > > code paths during connection establishment.
> > > > 
> > > > To optimize time taken, switch most of the test scenarios to run
> > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > > a massive speed up for most of the test scenarios.
> > > > 
> > > > For test coverage the following scenarios are unchanged
> > > 
> > > Curious how are below chosen?  I assume..
> > 
> > Chosen based on whether they exercise code paths that are unique
> > and interesting during the RAM transfer phase.
> > 
> > Essentially the goal is that if we have N% code coverage before this
> > patch, then we should still have the same N% code coverage after this
> > patch.
> > 
> > The TLS tests exercise code paths that are unique during the migration
> > establishment phase. Once establishd they don't exercise anything
> > "interesting" during RAM transfer phase. Thus we don't loose code coverage
> > by runing TLS tests non-live.
> > 
> > > 
> > > > 
> > > >  * Precopy with UNIX sockets
> > > 
> > > this one verifies dirty log.
> > > 
> > > >  * Precopy with UNIX sockets and dirty ring tracking
> > > 
> > > ... dirty ring...
> > > 
> > > >  * Precopy with XBZRLE
> > > 
> > > ... xbzrle I think needs a diff on old/new, makes sense.
> > > 
> > > >  * Precopy with UNIX compress
> > > >  * Precopy with UNIX compress (nowait)
> > > >  * Precopy with multifd
> > > 
> > > What about the rest three?  Especially for two compression tests.
> > 
> > The compress thread logic is unique/interesting during RAM transfer
> > so benefits from running live. The wait vs non-wait scenario tests
> > a distinct codepath/logic.
> 
> I assume you mean e.g. when compressing with guest page being modified and
> we should survive that rather than crashing the compressor?

No, I mean the compression code has a significant behaviour difference
between its two tests, because they toggle:

 @compress-wait-thread: Controls behavior when all compression
     threads are currently busy.  If true (default), wait for a free
     compression thread to become available; otherwise, send the page
     uncompressed.  (Since 3.1)

so we need to exercise the code path that falls back to sending
uncompressed, as well as the code path that waits for free threads.
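
The nowait test's start hook just flips that parameter off before the
migration starts. Roughly (the parameter names are the QMP ones quoted
above; the helper names and values are from memory, so treat this as a
sketch of migration-test.c rather than the exact code):

    static void *
    test_migrate_compress_nowait_start(QTestState *from, QTestState *to)
    {
        migrate_set_parameter_int(from, "compress-level", 1);
        migrate_set_parameter_int(from, "compress-threads", 1);
        migrate_set_parameter_int(to, "decompress-threads", 1);
        /*
         * Force the "all compression threads busy -> send the page
         * uncompressed" fallback path to be exercised.
         */
        migrate_set_parameter_bool(from, "compress-wait-thread", false);

        return NULL;
    }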


With regards,
Daniel
Thomas Huth June 1, 2023, 4:09 p.m. UTC | #8
On 01/06/2023 14.33, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
...
>> On a test machine this reduces execution time from 13 minutes to
>> 8 minutes.
>>
>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> Hi
> 
> I have a different idea to get the migration tests faster.  Namely, just
> stop letting enter the completation stage until we are ready to use it.
> 
> I need that functionality also for vfio migration, so as I have to
> create the patches, can we put on hold the last two patches and give me
> a couple of days?

A couple of days are certainly OK ... but could we please avoid being stuck
in the current situation for many more weeks? The slowness of the migration
tests is really slowing down my turnaround times, and I think it contributes
to the timeouts of the Travis CI tests that I'm currently facing. It likely
also contributed to the QEMU project running out of Gitlab CI minutes again
last month (which is thankfully mostly hidden by the new private Linux
runners, though some things like the Windows runners were no longer
available). So I'd appreciate it if we could get a speed-up here merged
sooner rather than later.

> Thomas, do you care if I get the whole sets of patches through the
> migration tree? 

That's fine, please take them through your tree!

  Thomas
Daniel P. Berrangé June 1, 2023, 4:17 p.m. UTC | #9
On Thu, Jun 01, 2023 at 06:09:44PM +0200, Thomas Huth wrote:
> On 01/06/2023 14.33, Juan Quintela wrote:
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> ...
> > > On a test machine this reduces execution time from 13 minutes to
> > > 8 minutes.
> > > 
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > 
> > Hi
> > 
> > I have a different idea to get the migration tests faster.  Namely, just
> > stop letting enter the completation stage until we are ready to use it.
> > 
> > I need that functionality also for vfio migration, so as I have to
> > create the patches, can we put on hold the last two patches and give me
> > a couple of days?
> 
> A couple of days are certainly OK ... but could we please avoid being stuck
> in the current situation for many more weeks?

Also, as I mentioned in my reply to Peter, I'm wary of an approach that
puts a synchronization point between the test suite and the migration code,
because that makes migration run in a manner which doesn't actually
match how it'll run in the real world. So I think even if there were a
feature to add a sync before the final iteration, we'd still want to test
without using that feature so we get proper coverage.

Could we merge this series as-is, and simply revisit the last patch
afterwards, once we have time to debate the feature for completion
phase synchronization?

>                                               The slowness of the migration
> tests are really slowing down my turnaround times, and I think they
> contribute to the timeouts of the Travis CI tests that I'm currently facing,
> and likely also contributed to the issue that the QEMU project ran out of
> Gitlab CI minutes again last month (which is thankfully mostly hidden by the
> new private Linux runners, but some things like the Windows runners were not
> available anymore). So I'd appreciate if we'd rather get a speed up here
> merged rather sooner than later.
> 
> > Thomas, do you care if I get the whole sets of patches through the
> > migration tree?
> 
> That's fine, please take them through your tree!

With regards,
Daniel
Peter Xu June 1, 2023, 4:17 p.m. UTC | #10
On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
> > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
> > > > Thanks for looking into this.. definitely worthwhile.
> > > > 
> > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> > > > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > > > these we force non-convergance and run for one iteration, then let it
> > > > > converge and wait for completion during the second (or following)
> > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > > > long time (~30 seconds).
> > > > > 
> > > > > While it is important to test the migration passes and convergance
> > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > > > TLS migration scenarios in particular are merely exercising different
> > > > > code paths during connection establishment.
> > > > > 
> > > > > To optimize time taken, switch most of the test scenarios to run
> > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > > > a massive speed up for most of the test scenarios.
> > > > > 
> > > > > For test coverage the following scenarios are unchanged
> > > > 
> > > > Curious how are below chosen?  I assume..
> > > 
> > > Chosen based on whether they exercise code paths that are unique
> > > and interesting during the RAM transfer phase.
> > > 
> > > Essentially the goal is that if we have N% code coverage before this
> > > patch, then we should still have the same N% code coverage after this
> > > patch.
> > > 
> > > The TLS tests exercise code paths that are unique during the migration
> > > establishment phase. Once establishd they don't exercise anything
> > > "interesting" during RAM transfer phase. Thus we don't loose code coverage
> > > by runing TLS tests non-live.
> > > 
> > > > 
> > > > > 
> > > > >  * Precopy with UNIX sockets
> > > > 
> > > > this one verifies dirty log.
> > > > 
> > > > >  * Precopy with UNIX sockets and dirty ring tracking
> > > > 
> > > > ... dirty ring...
> > > > 
> > > > >  * Precopy with XBZRLE
> > > > 
> > > > ... xbzrle I think needs a diff on old/new, makes sense.
> > > > 
> > > > >  * Precopy with UNIX compress
> > > > >  * Precopy with UNIX compress (nowait)
> > > > >  * Precopy with multifd
> > > > 
> > > > What about the rest three?  Especially for two compression tests.
> > > 
> > > The compress thread logic is unique/interesting during RAM transfer
> > > so benefits from running live. The wait vs non-wait scenario tests
> > > a distinct codepath/logic.
> > 
> > I assume you mean e.g. when compressing with guest page being modified and
> > we should survive that rather than crashing the compressor?
> 
> No, i mean the compression code has a significant behaviour difference
> between its two tests, because they toggle:
> 
>  @compress-wait-thread: Controls behavior when all compression
>      threads are currently busy.  If true (default), wait for a free
>      compression thread to become available; otherwise, send the page
>      uncompressed.  (Since 3.1)
> 
> so we need to exercise the code path that falls back to sending
> uncompressed, as well as the code path that waits for free threads.

But then the question is: why is live needed?

IIUC whether the wait path triggers has nothing directly to do with whether
the VM is live, but with whether all compress threads are busy.  IOW, IIUC
all compress paths will be tested even non-live, as long as we feed enough
pages to the compressor threads.
Peter Xu June 1, 2023, 4:26 p.m. UTC | #11
On Thu, Jun 01, 2023 at 05:17:11PM +0100, Daniel P. Berrangé wrote:
> Could we merge this series as-is, and simply re-visit the last patch
> afterwards, once we have time to debate the feature about completion
> phase synchronization.

This sounds good to me too.
Daniel P. Berrangé June 1, 2023, 4:35 p.m. UTC | #12
On Thu, Jun 01, 2023 at 12:17:38PM -0400, Peter Xu wrote:
> On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
> > > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
> > > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
> > > > > Thanks for looking into this.. definitely worthwhile.
> > > > > 
> > > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
> > > > > > There are 27 pre-copy live migration scenarios being tested. In all of
> > > > > > these we force non-convergance and run for one iteration, then let it
> > > > > > converge and wait for completion during the second (or following)
> > > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
> > > > > > long time (~30 seconds).
> > > > > > 
> > > > > > While it is important to test the migration passes and convergance
> > > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > > > > > TLS migration scenarios in particular are merely exercising different
> > > > > > code paths during connection establishment.
> > > > > > 
> > > > > > To optimize time taken, switch most of the test scenarios to run
> > > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
> > > > > > a massive speed up for most of the test scenarios.
> > > > > > 
> > > > > > For test coverage the following scenarios are unchanged
> > > > > 
> > > > > Curious how are below chosen?  I assume..
> > > > 
> > > > Chosen based on whether they exercise code paths that are unique
> > > > and interesting during the RAM transfer phase.
> > > > 
> > > > Essentially the goal is that if we have N% code coverage before this
> > > > patch, then we should still have the same N% code coverage after this
> > > > patch.
> > > > 
> > > > The TLS tests exercise code paths that are unique during the migration
> > > > establishment phase. Once establishd they don't exercise anything
> > > > "interesting" during RAM transfer phase. Thus we don't loose code coverage
> > > > by runing TLS tests non-live.
> > > > 
> > > > > 
> > > > > > 
> > > > > >  * Precopy with UNIX sockets
> > > > > 
> > > > > this one verifies dirty log.
> > > > > 
> > > > > >  * Precopy with UNIX sockets and dirty ring tracking
> > > > > 
> > > > > ... dirty ring...
> > > > > 
> > > > > >  * Precopy with XBZRLE
> > > > > 
> > > > > ... xbzrle I think needs a diff on old/new, makes sense.
> > > > > 
> > > > > >  * Precopy with UNIX compress
> > > > > >  * Precopy with UNIX compress (nowait)
> > > > > >  * Precopy with multifd
> > > > > 
> > > > > What about the rest three?  Especially for two compression tests.
> > > > 
> > > > The compress thread logic is unique/interesting during RAM transfer
> > > > so benefits from running live. The wait vs non-wait scenario tests
> > > > a distinct codepath/logic.
> > > 
> > > I assume you mean e.g. when compressing with guest page being modified and
> > > we should survive that rather than crashing the compressor?
> > 
> > No, i mean the compression code has a significant behaviour difference
> > between its two tests, because they toggle:
> > 
> >  @compress-wait-thread: Controls behavior when all compression
> >      threads are currently busy.  If true (default), wait for a free
> >      compression thread to become available; otherwise, send the page
> >      uncompressed.  (Since 3.1)
> > 
> > so we need to exercise the code path that falls back to sending
> > uncompressed, as well as the code path that waits for free threads.
> 
> But then the question is why live is needed?
>
> IIUC whether the wait thing triggers have nothing directly related to VM is
> live or not, but whether all compress thread busy.  IOW, IIUC all compress
> paths will be tested even if non-live as long as we feed enough pages to
> the compressor threads.

If non-live, the guest won't have dirtied any pages, so I wasn't
confident there would be a sufficient amount of dirty RAM to send to
trigger this. Possibly that's being too paranoid, though.

With regards,
Daniel
Peter Xu June 1, 2023, 4:59 p.m. UTC | #13
On Thu, Jun 01, 2023 at 05:35:07PM +0100, Daniel P. Berrangé wrote:
> If non-live the guest won't have dirtied any pages, so I wasn't
> confident there would be sufficient amount of dirty ram to send
> to trigger this. Possibly that's being too paranoid though

I think it makes sense to keep the compression tests in the whitelist; as I
mentioned, the compressor does have an issue with the buffer being modified
during compression.  IIRC we used to hit that, so a live test makes sense
even just for code path coverage.  See the comment in do_compress_ram_page(),
and also 34ab9e9743 ("migration: detect compression and decompression
errors").

If we want to start doing this, imho we should make it strict: any live
use case needs to be very well justified, or otherwise should be non-live
in the test, so we don't easily go back to square one.

I see that in the latest post that part is still missing.  I can prepare a
patch on top to document why each use case needs to be live, and emphasize
that.
Juan Quintela June 1, 2023, 10:55 p.m. UTC | #14
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
>> On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
>> > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
>> > > Thanks for looking into this.. definitely worthwhile.
>> > > 
>> > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
>> > > > There are 27 pre-copy live migration scenarios being tested. In all of
>> > > > these we force non-convergance and run for one iteration, then let it
>> > > > converge and wait for completion during the second (or following)
>> > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
>> > > > long time (~30 seconds).
>> > > > 
>> > > > While it is important to test the migration passes and convergance
>> > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
>> > > > TLS migration scenarios in particular are merely exercising different
>> > > > code paths during connection establishment.
>> > > > 
>> > > > To optimize time taken, switch most of the test scenarios to run
>> > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
>> > > > a massive speed up for most of the test scenarios.
>> > > > 
>> > > > For test coverage the following scenarios are unchanged
>> > > 
>> > > Curious how are below chosen?  I assume..
>> > 
>> > Chosen based on whether they exercise code paths that are unique
>> > and interesting during the RAM transfer phase.
>> > 
>> > Essentially the goal is that if we have N% code coverage before this
>> > patch, then we should still have the same N% code coverage after this
>> > patch.
>> > 
>> > The TLS tests exercise code paths that are unique during the migration
>> > establishment phase. Once establishd they don't exercise anything
>> > "interesting" during RAM transfer phase. Thus we don't loose code coverage
>> > by runing TLS tests non-live.
>> > 
>> > > 
>> > > > 
>> > > >  * Precopy with UNIX sockets
>> > > 
>> > > this one verifies dirty log.
>> > > 
>> > > >  * Precopy with UNIX sockets and dirty ring tracking
>> > > 
>> > > ... dirty ring...
>> > > 
>> > > >  * Precopy with XBZRLE
>> > > 
>> > > ... xbzrle I think needs a diff on old/new, makes sense.
>> > > 
>> > > >  * Precopy with UNIX compress
>> > > >  * Precopy with UNIX compress (nowait)
>> > > >  * Precopy with multifd
>> > > 
>> > > What about the rest three?  Especially for two compression tests.
>> > 
>> > The compress thread logic is unique/interesting during RAM transfer
>> > so benefits from running live. The wait vs non-wait scenario tests
>> > a distinct codepath/logic.
>> 
>> I assume you mean e.g. when compressing with guest page being modified and
>> we should survive that rather than crashing the compressor?
>
> No, i mean the compression code has a significant behaviour difference
> between its two tests, because they toggle:
>
>  @compress-wait-thread: Controls behavior when all compression
>      threads are currently busy.  If true (default), wait for a free
>      compression thread to become available; otherwise, send the page
>      uncompressed.  (Since 3.1)
>
> so we need to exercise the code path that falls back to sending
> uncompressed, as well as the code path that waits for free threads.

It doesn't work.
I think that I am going to just drop it for this iteration.

I tried 2 or 3 years ago to get a test to exercise compression and was not
able to get it to work.

I moved compression on top of multifd: much, much faster and much cleaner
(each compression method is around 50 lines of code).

Lukas tried this time and he was not able to get it working either.

So I have no hope at all for this code.

To add insult to injury, it copies things so many times that it is just not
worth it.
Later, Juan.
Juan Quintela June 1, 2023, 10:58 p.m. UTC | #15
Peter Xu <peterx@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 04:55:25PM +0100, Daniel P. Berrangé wrote:
>> On Thu, Jun 01, 2023 at 11:53:17AM -0400, Peter Xu wrote:
>> > On Thu, Jun 01, 2023 at 04:39:48PM +0100, Daniel P. Berrangé wrote:
>> > > On Thu, Jun 01, 2023 at 11:30:10AM -0400, Peter Xu wrote:
>> > > > Thanks for looking into this.. definitely worthwhile.
>> > > > 
>> > > > On Wed, May 31, 2023 at 02:23:59PM +0100, Daniel P. Berrangé wrote:
>> > > > > There are 27 pre-copy live migration scenarios being tested. In all of
>> > > > > these we force non-convergance and run for one iteration, then let it
>> > > > > converge and wait for completion during the second (or following)
>> > > > > iterations. At 3 mbps bandwidth limit the first iteration takes a very
>> > > > > long time (~30 seconds).
>> > > > > 
>> > > > > While it is important to test the migration passes and convergance
>> > > > > logic, it is overkill to do this for all 27 pre-copy scenarios. The
>> > > > > TLS migration scenarios in particular are merely exercising different
>> > > > > code paths during connection establishment.
>> > > > > 
>> > > > > To optimize time taken, switch most of the test scenarios to run
>> > > > > non-live (ie guest CPUs paused) with no bandwidth limits. This gives
>> > > > > a massive speed up for most of the test scenarios.
>> > > > > 
>> > > > > For test coverage the following scenarios are unchanged
>> > > > 
>> > > > Curious how are below chosen?  I assume..
>> > > 
>> > > Chosen based on whether they exercise code paths that are unique
>> > > and interesting during the RAM transfer phase.
>> > > 
>> > > Essentially the goal is that if we have N% code coverage before this
>> > > patch, then we should still have the same N% code coverage after this
>> > > patch.
>> > > 
>> > > The TLS tests exercise code paths that are unique during the migration
>> > > establishment phase. Once establishd they don't exercise anything
>> > > "interesting" during RAM transfer phase. Thus we don't loose code coverage
>> > > by runing TLS tests non-live.
>> > > 
>> > > > 
>> > > > > 
>> > > > >  * Precopy with UNIX sockets
>> > > > 
>> > > > this one verifies dirty log.
>> > > > 
>> > > > >  * Precopy with UNIX sockets and dirty ring tracking
>> > > > 
>> > > > ... dirty ring...
>> > > > 
>> > > > >  * Precopy with XBZRLE
>> > > > 
>> > > > ... xbzrle I think needs a diff on old/new, makes sense.
>> > > > 
>> > > > >  * Precopy with UNIX compress
>> > > > >  * Precopy with UNIX compress (nowait)
>> > > > >  * Precopy with multifd
>> > > > 
>> > > > What about the rest three?  Especially for two compression tests.
>> > > 
>> > > The compress thread logic is unique/interesting during RAM transfer
>> > > so benefits from running live. The wait vs non-wait scenario tests
>> > > a distinct codepath/logic.
>> > 
>> > I assume you mean e.g. when compressing with guest page being modified and
>> > we should survive that rather than crashing the compressor?
>> 
>> No, i mean the compression code has a significant behaviour difference
>> between its two tests, because they toggle:
>> 
>>  @compress-wait-thread: Controls behavior when all compression
>>      threads are currently busy.  If true (default), wait for a free
>>      compression thread to become available; otherwise, send the page
>>      uncompressed.  (Since 3.1)
>> 
>> so we need to exercise the code path that falls back to sending
>> uncompressed, as well as the code path that waits for free threads.
>
> But then the question is why live is needed?
>
> IIUC whether the wait thing triggers have nothing directly related to VM is
> live or not, but whether all compress thread busy.  IOW, IIUC all compress
> paths will be tested even if non-live as long as we feed enough pages to
> the compressor threads.

It is even wrong.

We didn't fix this for compression:

commit 007e179ef0e97eafda4c9ff2a9d665a1947c7c6d
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Tue Jul 5 22:35:59 2022 +0200

    multifd: Copy pages before compressing them with zlib
    
    zlib_send_prepare() compresses pages of a running VM. zlib does not
    make any thread-safety guarantees with respect to changing deflate()
    input concurrently with deflate() [1].
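
For context, the fix in that commit is essentially "snapshot the page into
a private buffer before handing it to deflate()", since zlib must not see
its input change underneath it while the guest keeps dirtying the page.
A rough illustration (not the actual QEMU code):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    #define PAGE_SIZE 4096

    /* Compress one guest page via a stable copy of its contents */
    static int compress_page_copy(z_stream *zs, const uint8_t *guest_page,
                                  uint8_t *out, unsigned int out_len)
    {
        uint8_t scratch[PAGE_SIZE];

        /* The guest may modify guest_page concurrently; zlib must not see that */
        memcpy(scratch, guest_page, PAGE_SIZE);

        zs->next_in = scratch;
        zs->avail_in = PAGE_SIZE;
        zs->next_out = out;
        zs->avail_out = out_len;

        /* Z_SYNC_FLUSH so the peer can decode this page immediately */
        return deflate(zs, Z_SYNC_FLUSH);
    }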


Not that anyone is going to use an accelerator to run zlib when we are
compressing just 4k at a time.

The Intel AT engine also had to move to 64 pages at a time to make a
difference.  As said, I can't think of a single scenario where
compression is a good option.

Later, Juan.

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index aee25e1c4f..e033d9f0c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -577,9 +577,12 @@  typedef struct {
         MIG_TEST_FAIL_DEST_QUIT_ERR,
     } result;
 
-    /* Optional: set number of migration passes to wait for */
+    /* Optional: set number of migration passes to wait for, if live==true */
     unsigned int iterations;
 
+    /* Optional: whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1385,8 +1388,6 @@  static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
-
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
     }
@@ -1396,6 +1397,31 @@  static void test_precopy_common(MigrateCommon *args)
         wait_for_serial("src_serial");
     }
 
+    if (args->live) {
+        /*
+         * Testing live migration, we want to ensure that some
+         * memory is re-dirtied after being transferred, so that
+         * we exercise logic for dirty page handling. We achieve
+         * this with a ridiculously low bandwidth that guarantees
+         * non-convergence.
+         */
+        migrate_ensure_non_converge(from);
+    } else {
+        /*
+         * Testing non-live migration, we allow it to run at
+         * full speed to ensure short test case duration.
+         * For tests expected to fail, we don't need to
+         * change anything.
+         */
+        if (args->result == MIG_TEST_SUCCEED) {
+            qtest_qmp_assert_success(from, "{ 'execute' : 'stop'}");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+            migrate_ensure_converge(from);
+        }
+    }
+
     if (!args->connect_uri) {
         g_autofree char *local_connect_uri =
             migrate_get_socket_address(to, "socket-address");
@@ -1413,25 +1439,41 @@  static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
-        } else {
-            wait_for_migration_pass(from);
-        }
 
-        migrate_ensure_converge(from);
+            migrate_ensure_converge(from);
 
-        /* We do this first, as it has a timeout to stop us
-         * hanging forever if migration didn't converge */
-        wait_for_migration_complete(from);
+            /*
+             * We do this first, as it has a timeout to stop us
+             * hanging forever if migration didn't converge
+             */
+            wait_for_migration_complete(from);
 
-        if (!got_stop) {
-            qtest_qmp_eventwait(from, "STOP");
+            if (!got_stop) {
+                qtest_qmp_eventwait(from, "STOP");
+            }
+        } else {
+            wait_for_migration_complete(from);
+            /*
+             * Must wait for dst to finish reading all incoming
+             * data on the socket before issuing 'cont' otherwise
+             * it'll be ignored
+             */
+            wait_for_migration_complete(to);
+
+            qtest_qmp_assert_success(to, "{ 'execute' : 'cont'}");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1449,6 +1491,8 @@  static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1464,6 +1508,8 @@  static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1575,6 +1621,7 @@  static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1592,6 +1639,7 @@  static void test_precopy_unix_compress(void)
          * the previous iteration.
          */
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1609,6 +1657,7 @@  static void test_precopy_unix_compress_nowait(void)
          * the previous iteration.
          */
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -2017,6 +2066,8 @@  static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+
+        .live = true,
     };
     test_precopy_common(&args);
 }