diff mbox series

[4/4] ci: Add check-migration-quick to the clang job

Message ID 20241017143211.17771-5-farosas@suse.de (mailing list archive)
State New
Headers show
Series tests/qtest: Move the bulk of migration tests into a separate target | expand

Commit Message

Fabiano Rosas Oct. 17, 2024, 2:32 p.m. UTC
Recent changes to how we invoke the migration tests have
(intentionally) caused them to not be part of the check-qtest target
anymore. Add the check-migration-quick target so we don't lose
migration code testing in this job.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 .gitlab-ci.d/buildtest.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Daniel P. Berrangé Oct. 17, 2024, 2:57 p.m. UTC | #1
On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> Recent changes to how we invoke the migration tests have
> (intentionally) caused them to not be part of the check-qtest target
> anymore. Add the check-migration-quick target so we don't lose
> migration code testing in this job.

But 'check-migration-quick' is only the subset of migration tests,
'check-migration' is all of the migration tests. So surely this is
a massive regressions in covage in CI pipelines.

Experience shows us that relying on humans to run tests periodically
doesn't work well, and they'll slowly bit rot. Migration maintainers
don't have a way to run this as gating test for every pull request
that merges, and pull requests coming from non-migration maintainers
can still break migration code.

Any tests in tree need to be exercised by CI as the minimum bar
to prevent bit rot from merges.

> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  .gitlab-ci.d/buildtest.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index 34d3f4e9ab..37247dc5fa 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -442,7 +442,7 @@ clang-system:
>      CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
>        --extra-cflags=-fno-sanitize-recover=undefined
>      TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
> -    MAKE_CHECK_ARGS: check-qtest check-tcg
> +    MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
>  
>  clang-user:
>    extends: .native_build_job_template
> -- 
> 2.35.3
> 
> 

With regards,
Daniel
Fabiano Rosas Oct. 17, 2024, 4:29 p.m. UTC | #2
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> Recent changes to how we invoke the migration tests have
>> (intentionally) caused them to not be part of the check-qtest target
>> anymore. Add the check-migration-quick target so we don't lose
>> migration code testing in this job.
>
> But 'check-migration-quick' is only the subset of migration tests,
> 'check-migration' is all of the migration tests. So surely this is
> a massive regressions in covage in CI pipelines.

I'm not sure it is. There are tests there already for all the major
parts of the code: precopy, postcopy, multifd, socket. Besides, we can
tweak migration-quick to cover spots where we think we're losing
coverage.

Since our CI offers nothing in terms of reproducibility or
debuggability, I don't think it's productive to have an increasing
amount of tests running in CI if that means we'll be dealing with
timeouts and intermittent crashes constantly.

>
> Experience shows us that relying on humans to run tests periodically
> doesn't work well, and they'll slowly bit rot. Migration maintainers
> don't have a way to run this as gating test for every pull request
> that merges, and pull requests coming from non-migration maintainers
> can still break migration code.

Right, but migration code would still be tested with migration-quick
which is executed at every make check. Do we really need the full set in
every pull request? We must draw a line somewhere, otherwise make check
will just balloon in duration.

>
> Any tests in tree need to be exercised by CI as the minimum bar
> to prevent bit rot from merges.
>

No disagreement here. But then I'm going to need advice on what to do
when other maintainers ask us to stop writing migration tests because
they take too long. I cannot send contributors away nor merge code
without tests.

Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
allocation issue there. Addressing problems with timeouts/races in our
CI is not something any random person can do.

>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  .gitlab-ci.d/buildtest.yml | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
>> index 34d3f4e9ab..37247dc5fa 100644
>> --- a/.gitlab-ci.d/buildtest.yml
>> +++ b/.gitlab-ci.d/buildtest.yml
>> @@ -442,7 +442,7 @@ clang-system:
>>      CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
>>        --extra-cflags=-fno-sanitize-recover=undefined
>>      TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
>> -    MAKE_CHECK_ARGS: check-qtest check-tcg
>> +    MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
>>  
>>  clang-user:
>>    extends: .native_build_job_template
>> -- 
>> 2.35.3
>> 
>> 
>
> With regards,
> Daniel
Daniel P. Berrangé Oct. 18, 2024, 9:01 a.m. UTC | #3
On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> Recent changes to how we invoke the migration tests have
> >> (intentionally) caused them to not be part of the check-qtest target
> >> anymore. Add the check-migration-quick target so we don't lose
> >> migration code testing in this job.
> >
> > But 'check-migration-quick' is only the subset of migration tests,
> > 'check-migration' is all of the migration tests. So surely this is
> > a massive regressions in covage in CI pipelines.
> 
> I'm not sure it is. There are tests there already for all the major
> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> tweak migration-quick to cover spots where we think we're losing
> coverage.

Each of the tests in migration-test  were added for a good reason,
generally to address testing gaps where we had functional regressions
in the past. I don't think its a good idea to stop running such tests
in CI as gating on new contributions. Any time we've had optional
tests in QEMU, we've seen repeated regressions in the area in question.

> Since our CI offers nothing in terms of reproducibility or
> debuggability, I don't think it's productive to have an increasing
> amount of tests running in CI if that means we'll be dealing with
> timeouts and intermittent crashes constantly.

Test reliability is a different thing. If a particular test is
flaky, it needs to either be fixed or disabled. Splitting into
a fast & slow grouping doesn't address reliability, just hides
the problem from view.

> > Experience shows us that relying on humans to run tests periodically
> > doesn't work well, and they'll slowly bit rot. Migration maintainers
> > don't have a way to run this as gating test for every pull request
> > that merges, and pull requests coming from non-migration maintainers
> > can still break migration code.
> 
> Right, but migration code would still be tested with migration-quick
> which is executed at every make check. Do we really need the full set in
> every pull request? We must draw a line somewhere, otherwise make check
> will just balloon in duration.

Again, the tests all exist because migration code is incredibly
complicated, with a lot of permutations, with a history of being
very bug / regression prone. With that in mind, it is unavoidable
that we're going to have a significant testing overhead for
migration code.

Looking at its execution time right now, I'd say migration test
is pretty good, considering the permutations we have to target.

It gets a bad reputation because historically it has been as
much as x20 slower than it is today, and has also struggled
with reliability. The latter is a reflection of the complexity
of migration and and IMHO actually justifies greater testing,
as long as we put in time to address bugs.

Also we've got one single test program, covering an entire
subsystem in one go, rather than lots of short individual
test programs, so migration unfairly gets blamed for being
slow, when it simply covers alot of functionality in one
program.

> > Any tests in tree need to be exercised by CI as the minimum bar
> > to prevent bit rot from merges.
> >
> 
> No disagreement here. But then I'm going to need advice on what to do
> when other maintainers ask us to stop writing migration tests because
> they take too long. I cannot send contributors away nor merge code
> without tests.

In general, I think it is unreasonable for other maintainers to
tell us to stop adding test coverage for migration, and would
push back against such a request. 

We should, however, continue to optimize how we add further test
coverage, where practical, overload testing of multiple features
onto a single test case helps.

We've already massively optimized the migration-test compared to
its historical behaviour.

A potentially bigger win could be seen if we change how we exercise
the migration functionality. Since we had the migration qtest that
runs a full migration operation, we've tended to expand testing by
adding new qtest functions. ie we've added a functional test for
everything we want covered. This is nice & simple, but also expensive.
We've ignored unit testing, which I think is a mistake.

If i look at the test list:

# /x86_64/migration/bad_dest
# /x86_64/migration/analyze-script
# /x86_64/migration/validate_uuid
# /x86_64/migration/validate_uuid_error
# /x86_64/migration/validate_uuid_src_not_set
# /x86_64/migration/validate_uuid_dst_not_set
# /x86_64/migration/dirty_ring
# /x86_64/migration/precopy/file
# /x86_64/migration/precopy/unix/plain
# /x86_64/migration/precopy/unix/suspend/live
# /x86_64/migration/precopy/unix/suspend/notlive
# /x86_64/migration/precopy/unix/tls/psk
# /x86_64/migration/precopy/unix/tls/x509/default-host
# /x86_64/migration/precopy/unix/tls/x509/override-host
# /x86_64/migration/precopy/file/offset
# /x86_64/migration/precopy/file/mapped-ram
# /x86_64/migration/precopy/file/offset/fdset
# /x86_64/migration/precopy/file/offset/bad
# /x86_64/migration/precopy/file/mapped-ram/live
# /x86_64/migration/precopy/tcp/plain
# /x86_64/migration/precopy/tcp/plain/switchover-ack
# /x86_64/migration/precopy/tcp/tls/psk/match
# /x86_64/migration/precopy/tcp/tls/psk/mismatch
# /x86_64/migration/precopy/tcp/tls/x509/default-host
# /x86_64/migration/precopy/tcp/tls/x509/override-host
# /x86_64/migration/precopy/tcp/tls/x509/mismatch-host
# /x86_64/migration/precopy/tcp/tls/x509/friendly-client
# /x86_64/migration/precopy/tcp/tls/x509/hostile-client
# /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
# /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
# /x86_64/migration/precopy/fd/tcp
# /x86_64/migration/precopy/fd/file
# /x86_64/migration/multifd/file/mapped-ram
# /x86_64/migration/multifd/file/mapped-ram/live
# /x86_64/migration/multifd/file/mapped-ram/dio
# /x86_64/migration/multifd/file/mapped-ram/fdset
# /x86_64/migration/multifd/file/mapped-ram/fdset/dio
# /x86_64/migration/multifd/tcp/uri/plain/none
# /x86_64/migration/multifd/tcp/channels/plain/none
# /x86_64/migration/multifd/tcp/plain/cancel
# /x86_64/migration/multifd/tcp/plain/zlib
# /x86_64/migration/multifd/tcp/plain/zstd
# /x86_64/migration/multifd/tcp/plain/zero-page/legacy
# /x86_64/migration/multifd/tcp/plain/zero-page/none
# /x86_64/migration/multifd/tcp/tls/psk/match
# /x86_64/migration/multifd/tcp/tls/psk/mismatch
# /x86_64/migration/multifd/tcp/tls/x509/default-host
# /x86_64/migration/multifd/tcp/tls/x509/override-host
# /x86_64/migration/multifd/tcp/tls/x509/mismatch-host
# /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
# /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
# /x86_64/migration/validate_uri/channels/both_set
# /x86_64/migration/validate_uri/channels/none_set

Individually none of those is very slow on its own - 10 are in
the 2-3 second range,  35 are 1-2 secs,  and 6 are less than
1 second.

A very large portion of those are validating different ways to
establish migration. Hardly any of them actually need to run
a migration to completion. Even without running to completion
though, we have the overheads of spawning 2 QEMUs.

This feels like something that should be amenable to unit testing.
Might need a little re-factoring of migration code to make it
easier to run a subset of its logic in isolation, but that'd
probably be a win anyway, as such work usually makes code cleaner.

> Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
> allocation issue there. Addressing problems with timeouts/races in our
> CI is not something any random person can do.

In terms of running time, I think migration-test is acceptable as it is
to run in 'make check' by default and doesn't justify dropping test
coverage. We should still look to optimize & move to unit testing more
code, and any reliability issues are something that needs to be addressed
too.

With regards,
Daniel
Peter Maydell Oct. 18, 2024, 9:46 a.m. UTC | #4
On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> > >> Recent changes to how we invoke the migration tests have
> > >> (intentionally) caused them to not be part of the check-qtest target
> > >> anymore. Add the check-migration-quick target so we don't lose
> > >> migration code testing in this job.
> > >
> > > But 'check-migration-quick' is only the subset of migration tests,
> > > 'check-migration' is all of the migration tests. So surely this is
> > > a massive regressions in covage in CI pipelines.
> >
> > I'm not sure it is. There are tests there already for all the major
> > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> > tweak migration-quick to cover spots where we think we're losing
> > coverage.
>
> Each of the tests in migration-test  were added for a good reason,
> generally to address testing gaps where we had functional regressions
> in the past. I don't think its a good idea to stop running such tests
> in CI as gating on new contributions. Any time we've had optional
> tests in QEMU, we've seen repeated regressions in the area in question.
>
> > Since our CI offers nothing in terms of reproducibility or
> > debuggability, I don't think it's productive to have an increasing
> > amount of tests running in CI if that means we'll be dealing with
> > timeouts and intermittent crashes constantly.
>
> Test reliability is a different thing. If a particular test is
> flaky, it needs to either be fixed or disabled. Splitting into
> a fast & slow grouping doesn't address reliability, just hides
> the problem from view.

A lot of the current reliability issue is timeouts -- sometimes
our CI runners just run really slow (I have seen an example where
between a normal and a slow run on the same commit both the
compile and test times were 10x different...) So any test
that is not a fast-to-complete is much much more likely to
hit its timeout if the runner is running slowly. When I am
doing CI testing for merges "migration test timed out again"
is really really common.

> > No disagreement here. But then I'm going to need advice on what to do
> > when other maintainers ask us to stop writing migration tests because
> > they take too long. I cannot send contributors away nor merge code
> > without tests.
>
> In general, I think it is unreasonable for other maintainers to
> tell us to stop adding test coverage for migration, and would
> push back against such a request.

We do not have infinite CI resources, unfortunately. Migration
is competing with everything else for time on CI. You have to
find a balance between "what do we run every time" and "what
do we only run when specifically testing a migration pullreq".
Similarly, there's a lot of iotests but we don't run all of them
for every block backend for every CI job via "make check".

Long test times for tests run under "make check" are also bad
for individual developers -- if I'm running "make check" to
test a target/arm change I've made I don't really want that
to then spend 15 minutes testing the migration code that
I haven't touched and that is vanishingly unlikely to be
affected by my patches.

thanks
-- PMM
Daniel P. Berrangé Oct. 18, 2024, 10 a.m. UTC | #5
On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > >
> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> > > >> Recent changes to how we invoke the migration tests have
> > > >> (intentionally) caused them to not be part of the check-qtest target
> > > >> anymore. Add the check-migration-quick target so we don't lose
> > > >> migration code testing in this job.
> > > >
> > > > But 'check-migration-quick' is only the subset of migration tests,
> > > > 'check-migration' is all of the migration tests. So surely this is
> > > > a massive regressions in covage in CI pipelines.
> > >
> > > I'm not sure it is. There are tests there already for all the major
> > > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> > > tweak migration-quick to cover spots where we think we're losing
> > > coverage.
> >
> > Each of the tests in migration-test  were added for a good reason,
> > generally to address testing gaps where we had functional regressions
> > in the past. I don't think its a good idea to stop running such tests
> > in CI as gating on new contributions. Any time we've had optional
> > tests in QEMU, we've seen repeated regressions in the area in question.
> >
> > > Since our CI offers nothing in terms of reproducibility or
> > > debuggability, I don't think it's productive to have an increasing
> > > amount of tests running in CI if that means we'll be dealing with
> > > timeouts and intermittent crashes constantly.
> >
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled. Splitting into
> > a fast & slow grouping doesn't address reliability, just hides
> > the problem from view.
> 
> A lot of the current reliability issue is timeouts -- sometimes
> our CI runners just run really slow (I have seen an example where
> between a normal and a slow run on the same commit both the
> compile and test times were 10x different...) So any test
> that is not a fast-to-complete is much much more likely to
> hit its timeout if the runner is running slowly. When I am
> doing CI testing for merges "migration test timed out again"
> is really really common.

If its frequently timing out, then we've got the timeouts
wrong, or we have some genuine bugs in there to be fixed.

> > > No disagreement here. But then I'm going to need advice on what to do
> > > when other maintainers ask us to stop writing migration tests because
> > > they take too long. I cannot send contributors away nor merge code
> > > without tests.
> >
> > In general, I think it is unreasonable for other maintainers to
> > tell us to stop adding test coverage for migration, and would
> > push back against such a request.
> 
> We do not have infinite CI resources, unfortunately. Migration
> is competing with everything else for time on CI. You have to
> find a balance between "what do we run every time" and "what
> do we only run when specifically testing a migration pullreq".
> Similarly, there's a lot of iotests but we don't run all of them
> for every block backend for every CI job via "make check".

The combos we don't run for iotests are a good source of
regressions too :-(

> Long test times for tests run under "make check" are also bad
> for individual developers -- if I'm running "make check" to
> test a target/arm change I've made I don't really want that
> to then spend 15 minutes testing the migration code that
> I haven't touched and that is vanishingly unlikely to be
> affected by my patches.

Migration-test *used* to take 15 minutes to run, but that was a
very long time ago. A run of it today is around 1m20.

That said, if you are building multiple system emulators, we
run the same test multiple times, and with the number of
targets we have, that will be painful.

That could be a good reason to split the migration-test into
two distinct programs. One program that runs for every target,
and one that is only run once, for some arbitrary "primary"
target ?  Or could we make use of glib's g_test_thorough
for this - a primary target runs with "SPEED=through" and
all other targets with normal settings. That would give us
a way to optimize any of the qtests to reduce redundant
testing where appropriate.


If we move alot of testing out into a migration unit test,
this also solves the redundancy problem.


With regards,
Daniel
Peter Maydell Oct. 18, 2024, 10:09 a.m. UTC | #6
On Fri, 18 Oct 2024 at 11:00, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
> > We do not have infinite CI resources, unfortunately. Migration
> > is competing with everything else for time on CI. You have to
> > find a balance between "what do we run every time" and "what
> > do we only run when specifically testing a migration pullreq".
> > Similarly, there's a lot of iotests but we don't run all of them
> > for every block backend for every CI job via "make check".
>
> The combos we don't run for iotests are a good source of
> regressions too :-(
>
> > Long test times for tests run under "make check" are also bad
> > for individual developers -- if I'm running "make check" to
> > test a target/arm change I've made I don't really want that
> > to then spend 15 minutes testing the migration code that
> > I haven't touched and that is vanishingly unlikely to be
> > affected by my patches.
>
> Migration-test *used* to take 15 minutes to run, but that was a
> very long time ago. A run of it today is around 1m20.
>
> That said, if you are building multiple system emulators, we
> run the same test multiple times, and with the number of
> targets we have, that will be painful.

Yeah. Here's a recent s390 job, and not one that was running
slow: https://gitlab.com/qemu-project/qemu/-/jobs/8112195449

The migration tests it ran were:

 95/954 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test
           OK             196.95s   50 subtests passed
 96/954 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
           OK             202.47s   50 subtests passed
 99/954 qemu:qtest+qtest-i386 / qtest-i386/migration-test
           OK             203.54s   52 subtests passed
107/954 qemu:qtest+qtest-s390x / qtest-s390x/migration-test
           OK             193.65s   50 subtests passed
156/954 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
           OK             164.44s   52 subtests passed

So that's 192s (over 3 minutes) average, and over 16 minutes
total spent on migration testing on this one CI job. If
the s390 VM has a noisy-neighbour problem then the
migration tests can hit their 8 minute timeout, implying
40 minutes spent on migration testing alone...

thanks
-- PMM
diff mbox series

Patch

diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 34d3f4e9ab..37247dc5fa 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -442,7 +442,7 @@  clang-system:
     CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
       --extra-cflags=-fno-sanitize-recover=undefined
     TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
-    MAKE_CHECK_ARGS: check-qtest check-tcg
+    MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
 
 clang-user:
   extends: .native_build_job_template