gitlab: remove unreliable avocado CI jobs

Message ID	20230912150611.70676-1-stefanha@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Stefan Hajnoczi <stefanha@redhat.com> To: qemu-devel@nongnu.org Cc: Wainer dos Santos Moschetta <wainersm@redhat.com>, Thomas Huth <thuth@redhat.com>, =?utf-8?q?Philippe_Mathieu-Daud=C3=A9?= <philmd@linaro.org>, Beraldo Leal <bleal@redhat.com>, =?utf-8?q?Alex_Benn=C3=A9e?= <alex.bennee@linaro.org>, Stefan Hajnoczi <stefanha@redhat.com> Subject: [PATCH] gitlab: remove unreliable avocado CI jobs Date: Tue, 12 Sep 2023 11:06:11 -0400 Message-ID: <20230912150611.70676-1-stefanha@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=170.10.133.124; envelope-from=stefanha@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Series	gitlab: remove unreliable avocado CI jobs \| expand gitlab: remove unreliable avocado CI jobs

Stefan Hajnoczi Sept. 12, 2023, 3:06 p.m. UTC

The avocado-system-alpine, avocado-system-fedora, and
avocado-system-ubuntu jobs are unreliable. I identified them while
looking over CI failures from the past week:
https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
https://gitlab.com/qemu-project/qemu/-/jobs/5030428571

Thomas Huth suggest on IRC today that there may be a legitimate failure
in there:

  th_huth: f4bug, yes, seems like it does not start at all correctly on
  alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
  noticed this by now, this is worrying

It crept in because the jobs were already unreliable.

I don't know how to interpret the job output, so all I can do is to
propose removing these jobs. A useful CI job has two outcomes: pass or
fail. Timeouts and other in-between states are not useful because they
require constant triaging by someone who understands the details of the
tests and they can occur when run against pull requests that have
nothing to do with the area covered by the test.

Hopefully test owners will be able to identify the root causes and solve
them so that these jobs can stay. In their current state the jobs are
not useful since I cannot cannot tell whether job failures are real or
just intermittent when merging qemu.git pull requests.

If you are a test owner, please take a look.

It is likely that other avocado-system-* CI jobs have similar failures
from time to time, but I'll leave them as long as they are passing.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 .gitlab-ci.d/buildtest.yml | 27 ---------------------------
 1 file changed, 27 deletions(-)

Daniel P. Berrangé Sept. 12, 2023, 3:20 p.m. UTC | #1

On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
> The avocado-system-alpine, avocado-system-fedora, and
> avocado-system-ubuntu jobs are unreliable. I identified them while
> looking over CI failures from the past week:
> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> 
> Thomas Huth suggest on IRC today that there may be a legitimate failure
> in there:
> 
>   th_huth: f4bug, yes, seems like it does not start at all correctly on
>   alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
>   noticed this by now, this is worrying
> 
> It crept in because the jobs were already unreliable.
> 
> I don't know how to interpret the job output, so all I can do is to
> propose removing these jobs. A useful CI job has two outcomes: pass or
> fail. Timeouts and other in-between states are not useful because they
> require constant triaging by someone who understands the details of the
> tests and they can occur when run against pull requests that have
> nothing to do with the area covered by the test.
> 
> Hopefully test owners will be able to identify the root causes and solve
> them so that these jobs can stay. In their current state the jobs are
> not useful since I cannot cannot tell whether job failures are real or
> just intermittent when merging qemu.git pull requests.
> 
> If you are a test owner, please take a look.
> 
> It is likely that other avocado-system-* CI jobs have similar failures
> from time to time, but I'll leave them as long as they are passing.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
>  1 file changed, 27 deletions(-)
> 
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index aee9101507..83ce448c4d 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -22,15 +22,6 @@ check-system-alpine:
>      IMAGE: alpine
>      MAKE_CHECK_ARGS: check-unit check-qtest
>  
> -avocado-system-alpine:
> -  extends: .avocado_test_job_template
> -  needs:
> -    - job: build-system-alpine
> -      artifacts: true
> -  variables:
> -    IMAGE: alpine
> -    MAKE_CHECK_ARGS: check-avocado

Instead of entirely deleting, I'd suggest adding

   # Disabled due to frequent random failures
   # https://gitlab.com/qemu-project/qemu/-/issues/1884
   when: manual

See example: https://docs.gitlab.com/ee/ci/yaml/#when

This disables the job from running unless someone explicitly
tells it to run

> -
>  build-system-ubuntu:
>    extends:
>      - .native_build_job_template
> @@ -53,15 +44,6 @@ check-system-ubuntu:
>      IMAGE: ubuntu2204
>      MAKE_CHECK_ARGS: check
>  
> -avocado-system-ubuntu:
> -  extends: .avocado_test_job_template
> -  needs:
> -    - job: build-system-ubuntu
> -      artifacts: true
> -  variables:
> -    IMAGE: ubuntu2204
> -    MAKE_CHECK_ARGS: check-avocado
> -
>  build-system-debian:
>    extends:
>      - .native_build_job_template
> @@ -127,15 +109,6 @@ check-system-fedora:
>      IMAGE: fedora
>      MAKE_CHECK_ARGS: check
>  
> -avocado-system-fedora:
> -  extends: .avocado_test_job_template
> -  needs:
> -    - job: build-system-fedora
> -      artifacts: true
> -  variables:
> -    IMAGE: fedora
> -    MAKE_CHECK_ARGS: check-avocado
> -
>  crash-test-fedora:
>    extends: .native_test_job_template
>    needs:
> -- 
> 2.41.0
> 
> 

With regards,
Daniel

Alex Bennée Sept. 12, 2023, 4:01 p.m. UTC | #2

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
>> The avocado-system-alpine, avocado-system-fedora, and
>> avocado-system-ubuntu jobs are unreliable. I identified them while
>> looking over CI failures from the past week:
>> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
>> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
>> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
>> 
>> Thomas Huth suggest on IRC today that there may be a legitimate failure
>> in there:
>> 
>>   th_huth: f4bug, yes, seems like it does not start at all correctly on
>>   alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
>>   noticed this by now, this is worrying
>> 
>> It crept in because the jobs were already unreliable.
>> 
>> I don't know how to interpret the job output, so all I can do is to
>> propose removing these jobs. A useful CI job has two outcomes: pass or
>> fail. Timeouts and other in-between states are not useful because they
>> require constant triaging by someone who understands the details of the
>> tests and they can occur when run against pull requests that have
>> nothing to do with the area covered by the test.
>> 
>> Hopefully test owners will be able to identify the root causes and solve
>> them so that these jobs can stay. In their current state the jobs are
>> not useful since I cannot cannot tell whether job failures are real or
>> just intermittent when merging qemu.git pull requests.
>> 
>> If you are a test owner, please take a look.
>> 
>> It is likely that other avocado-system-* CI jobs have similar failures
>> from time to time, but I'll leave them as long as they are passing.
>> 
>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
>>  1 file changed, 27 deletions(-)
>> 
>> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
>> index aee9101507..83ce448c4d 100644
>> --- a/.gitlab-ci.d/buildtest.yml
>> +++ b/.gitlab-ci.d/buildtest.yml
>> @@ -22,15 +22,6 @@ check-system-alpine:
>>      IMAGE: alpine
>>      MAKE_CHECK_ARGS: check-unit check-qtest
>>  
>> -avocado-system-alpine:
>> -  extends: .avocado_test_job_template
>> -  needs:
>> -    - job: build-system-alpine
>> -      artifacts: true
>> -  variables:
>> -    IMAGE: alpine
>> -    MAKE_CHECK_ARGS: check-avocado
>
> Instead of entirely deleting, I'd suggest adding
>
>    # Disabled due to frequent random failures
>    # https://gitlab.com/qemu-project/qemu/-/issues/1884
>    when: manual
>
> See example: https://docs.gitlab.com/ee/ci/yaml/#when
>
> This disables the job from running unless someone explicitly
> tells it to run

What I don't understand is why we didn't gate the release back when they
first tripped. We should have noticed between:

  https://gitlab.com/qemu-project/qemu/-/pipelines/956543770

and

  https://gitlab.com/qemu-project/qemu/-/pipelines/957154381

that the system tests where regressing. Yet we merged the changes
anyway.

>
>> -
>>  build-system-ubuntu:
>>    extends:
>>      - .native_build_job_template
>> @@ -53,15 +44,6 @@ check-system-ubuntu:
>>      IMAGE: ubuntu2204
>>      MAKE_CHECK_ARGS: check
>>  
>> -avocado-system-ubuntu:
>> -  extends: .avocado_test_job_template
>> -  needs:
>> -    - job: build-system-ubuntu
>> -      artifacts: true
>> -  variables:
>> -    IMAGE: ubuntu2204
>> -    MAKE_CHECK_ARGS: check-avocado
>> -
>>  build-system-debian:
>>    extends:
>>      - .native_build_job_template
>> @@ -127,15 +109,6 @@ check-system-fedora:
>>      IMAGE: fedora
>>      MAKE_CHECK_ARGS: check
>>  
>> -avocado-system-fedora:
>> -  extends: .avocado_test_job_template
>> -  needs:
>> -    - job: build-system-fedora
>> -      artifacts: true
>> -  variables:
>> -    IMAGE: fedora
>> -    MAKE_CHECK_ARGS: check-avocado
>> -
>>  crash-test-fedora:
>>    extends: .native_test_job_template
>>    needs:
>> -- 
>> 2.41.0
>> 
>> 
>
> With regards,
> Daniel

Daniel P. Berrangé Sept. 12, 2023, 4:14 p.m. UTC | #3

On Tue, Sep 12, 2023 at 05:01:26PM +0100, Alex Bennée wrote:
> 
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
> >> The avocado-system-alpine, avocado-system-fedora, and
> >> avocado-system-ubuntu jobs are unreliable. I identified them while
> >> looking over CI failures from the past week:
> >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> >> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> >> 
> >> Thomas Huth suggest on IRC today that there may be a legitimate failure
> >> in there:
> >> 
> >>   th_huth: f4bug, yes, seems like it does not start at all correctly on
> >>   alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
> >>   noticed this by now, this is worrying
> >> 
> >> It crept in because the jobs were already unreliable.
> >> 
> >> I don't know how to interpret the job output, so all I can do is to
> >> propose removing these jobs. A useful CI job has two outcomes: pass or
> >> fail. Timeouts and other in-between states are not useful because they
> >> require constant triaging by someone who understands the details of the
> >> tests and they can occur when run against pull requests that have
> >> nothing to do with the area covered by the test.
> >> 
> >> Hopefully test owners will be able to identify the root causes and solve
> >> them so that these jobs can stay. In their current state the jobs are
> >> not useful since I cannot cannot tell whether job failures are real or
> >> just intermittent when merging qemu.git pull requests.
> >> 
> >> If you are a test owner, please take a look.
> >> 
> >> It is likely that other avocado-system-* CI jobs have similar failures
> >> from time to time, but I'll leave them as long as they are passing.
> >> 
> >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> >> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> ---
> >>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
> >>  1 file changed, 27 deletions(-)
> >> 
> >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> >> index aee9101507..83ce448c4d 100644
> >> --- a/.gitlab-ci.d/buildtest.yml
> >> +++ b/.gitlab-ci.d/buildtest.yml
> >> @@ -22,15 +22,6 @@ check-system-alpine:
> >>      IMAGE: alpine
> >>      MAKE_CHECK_ARGS: check-unit check-qtest
> >>  
> >> -avocado-system-alpine:
> >> -  extends: .avocado_test_job_template
> >> -  needs:
> >> -    - job: build-system-alpine
> >> -      artifacts: true
> >> -  variables:
> >> -    IMAGE: alpine
> >> -    MAKE_CHECK_ARGS: check-avocado
> >
> > Instead of entirely deleting, I'd suggest adding
> >
> >    # Disabled due to frequent random failures
> >    # https://gitlab.com/qemu-project/qemu/-/issues/1884
> >    when: manual
> >
> > See example: https://docs.gitlab.com/ee/ci/yaml/#when
> >
> > This disables the job from running unless someone explicitly
> > tells it to run
> 
> What I don't understand is why we didn't gate the release back when they
> first tripped. We should have noticed between:
> 
>   https://gitlab.com/qemu-project/qemu/-/pipelines/956543770
> 
> and
> 
>   https://gitlab.com/qemu-project/qemu/-/pipelines/957154381
> 
> that the system tests where regressing. Yet we merged the changes
> anyway.

I think that green series is misleading, based on Richard's
mail on list wrt the TCG pull series:

  https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg04014.html

  "It's some sort of timing issue, which sometimes goes away
   when re-run. I was re-running tests *a lot* in order to
   get them to go green while running the 8.1 release. "


Essentially I'd put this down to the tests being soo non-deterministic
that we've given up trusting them.

With regards,
Daniel

Stefan Hajnoczi Sept. 12, 2023, 4:19 p.m. UTC | #4

On Tue, Sep 12, 2023, 12:14 Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Sep 12, 2023 at 05:01:26PM +0100, Alex Bennée wrote:
> >
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > > On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
> > >> The avocado-system-alpine, avocado-system-fedora, and
> > >> avocado-system-ubuntu jobs are unreliable. I identified them while
> > >> looking over CI failures from the past week:
> > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> > >> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> > >>
> > >> Thomas Huth suggest on IRC today that there may be a legitimate
> failure
> > >> in there:
> > >>
> > >>   th_huth: f4bug, yes, seems like it does not start at all correctly
> on
> > >>   alpine anymore ... and it's broken since ~ 2 weeks already, so if
> nobody
> > >>   noticed this by now, this is worrying
> > >>
> > >> It crept in because the jobs were already unreliable.
> > >>
> > >> I don't know how to interpret the job output, so all I can do is to
> > >> propose removing these jobs. A useful CI job has two outcomes: pass or
> > >> fail. Timeouts and other in-between states are not useful because they
> > >> require constant triaging by someone who understands the details of
> the
> > >> tests and they can occur when run against pull requests that have
> > >> nothing to do with the area covered by the test.
> > >>
> > >> Hopefully test owners will be able to identify the root causes and
> solve
> > >> them so that these jobs can stay. In their current state the jobs are
> > >> not useful since I cannot cannot tell whether job failures are real or
> > >> just intermittent when merging qemu.git pull requests.
> > >>
> > >> If you are a test owner, please take a look.
> > >>
> > >> It is likely that other avocado-system-* CI jobs have similar failures
> > >> from time to time, but I'll leave them as long as they are passing.
> > >>
> > >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> > >> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > >> ---
> > >>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
> > >>  1 file changed, 27 deletions(-)
> > >>
> > >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> > >> index aee9101507..83ce448c4d 100644
> > >> --- a/.gitlab-ci.d/buildtest.yml
> > >> +++ b/.gitlab-ci.d/buildtest.yml
> > >> @@ -22,15 +22,6 @@ check-system-alpine:
> > >>      IMAGE: alpine
> > >>      MAKE_CHECK_ARGS: check-unit check-qtest
> > >>
> > >> -avocado-system-alpine:
> > >> -  extends: .avocado_test_job_template
> > >> -  needs:
> > >> -    - job: build-system-alpine
> > >> -      artifacts: true
> > >> -  variables:
> > >> -    IMAGE: alpine
> > >> -    MAKE_CHECK_ARGS: check-avocado
> > >
> > > Instead of entirely deleting, I'd suggest adding
> > >
> > >    # Disabled due to frequent random failures
> > >    # https://gitlab.com/qemu-project/qemu/-/issues/1884
> > >    when: manual
> > >
> > > See example: https://docs.gitlab.com/ee/ci/yaml/#when
> > >
> > > This disables the job from running unless someone explicitly
> > > tells it to run
> >
> > What I don't understand is why we didn't gate the release back when they
> > first tripped. We should have noticed between:
> >
> >   https://gitlab.com/qemu-project/qemu/-/pipelines/956543770
> >
> > and
> >
> >   https://gitlab.com/qemu-project/qemu/-/pipelines/957154381
> >
> > that the system tests where regressing. Yet we merged the changes
> > anyway.
>
> I think that green series is misleading, based on Richard's
> mail on list wrt the TCG pull series:
>
>   https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg04014.html
>
>   "It's some sort of timing issue, which sometimes goes away
>    when re-run. I was re-running tests *a lot* in order to
>    get them to go green while running the 8.1 release. "
>
>
> Essentially I'd put this down to the tests being soo non-deterministic
> that we've given up trusting them.
>

Yes.

Stefan


> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|
>
>
>

Alex Bennée Sept. 12, 2023, 5:39 p.m. UTC | #5

Stefan Hajnoczi <stefanha@gmail.com> writes:

> On Tue, Sep 12, 2023, 12:14 Daniel P. Berrangé <berrange@redhat.com> wrote:
>
>  On Tue, Sep 12, 2023 at 05:01:26PM +0100, Alex Bennée wrote:
>  > 
>  > Daniel P. Berrangé <berrange@redhat.com> writes:
>  > 
>  > > On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
>  > >> The avocado-system-alpine, avocado-system-fedora, and
>  > >> avocado-system-ubuntu jobs are unreliable. I identified them while
>  > >> looking over CI failures from the past week:
>  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
>  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
>  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
>  > >> 
>  > >> Thomas Huth suggest on IRC today that there may be a legitimate failure
>  > >> in there:
>  > >> 
>  > >>   th_huth: f4bug, yes, seems like it does not start at all correctly on
>  > >>   alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
>  > >>   noticed this by now, this is worrying
>  > >> 
>  > >> It crept in because the jobs were already unreliable.
>  > >> 
>  > >> I don't know how to interpret the job output, so all I can do is to
>  > >> propose removing these jobs. A useful CI job has two outcomes: pass or
>  > >> fail. Timeouts and other in-between states are not useful because they
>  > >> require constant triaging by someone who understands the details of the
>  > >> tests and they can occur when run against pull requests that have
>  > >> nothing to do with the area covered by the test.
>  > >> 
>  > >> Hopefully test owners will be able to identify the root causes and solve
>  > >> them so that these jobs can stay. In their current state the jobs are
>  > >> not useful since I cannot cannot tell whether job failures are real or
>  > >> just intermittent when merging qemu.git pull requests.
>  > >> 
>  > >> If you are a test owner, please take a look.
>  > >> 
>  > >> It is likely that other avocado-system-* CI jobs have similar failures
>  > >> from time to time, but I'll leave them as long as they are passing.
>  > >> 
>  > >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
>  > >> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>  > >> ---
>  > >>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
>  > >>  1 file changed, 27 deletions(-)
>  > >> 
>  > >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
>  > >> index aee9101507..83ce448c4d 100644
>  > >> --- a/.gitlab-ci.d/buildtest.yml
>  > >> +++ b/.gitlab-ci.d/buildtest.yml
>  > >> @@ -22,15 +22,6 @@ check-system-alpine:
>  > >>      IMAGE: alpine
>  > >>      MAKE_CHECK_ARGS: check-unit check-qtest
>  > >>  
>  > >> -avocado-system-alpine:
>  > >> -  extends: .avocado_test_job_template
>  > >> -  needs:
>  > >> -    - job: build-system-alpine
>  > >> -      artifacts: true
>  > >> -  variables:
>  > >> -    IMAGE: alpine
>  > >> -    MAKE_CHECK_ARGS: check-avocado
>  > >
>  > > Instead of entirely deleting, I'd suggest adding
>  > >
>  > >    # Disabled due to frequent random failures
>  > >    # https://gitlab.com/qemu-project/qemu/-/issues/1884
>  > >    when: manual
>  > >
>  > > See example: https://docs.gitlab.com/ee/ci/yaml/#when
>  > >
>  > > This disables the job from running unless someone explicitly
>  > > tells it to run
>  > 
>  > What I don't understand is why we didn't gate the release back when they
>  > first tripped. We should have noticed between:
>  > 
>  >   https://gitlab.com/qemu-project/qemu/-/pipelines/956543770
>  > 
>  > and
>  > 
>  >   https://gitlab.com/qemu-project/qemu/-/pipelines/957154381
>  > 
>  > that the system tests where regressing. Yet we merged the changes
>  > anyway.
>
>  I think that green series is misleading, based on Richard's
>  mail on list wrt the TCG pull series:
>
>    https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg04014.html
>
>    "It's some sort of timing issue, which sometimes goes away
>     when re-run. I was re-running tests *a lot* in order to
>     get them to go green while running the 8.1 release. "

But I think in that actual case a change exposed a race condition which
has only recently been fixed - however we've had additional regresssions
since.

Rather than kill the system tests we can disable the flaky individual
tests in avocado. 

>
>  Essentially I'd put this down to the tests being soo non-deterministic
>  that we've given up trusting them.
>
> Yes.
>
> Stefan
>
>  With regards,
>  Daniel
>  -- 
>  |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>  |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>  |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Stefan Hajnoczi Sept. 12, 2023, 6:52 p.m. UTC | #6

On Tue, 12 Sept 2023 at 14:36, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Stefan Hajnoczi <stefanha@gmail.com> writes:
>
> > On Tue, Sep 12, 2023, 12:14 Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> >  On Tue, Sep 12, 2023 at 05:01:26PM +0100, Alex Bennée wrote:
> >  >
> >  > Daniel P. Berrangé <berrange@redhat.com> writes:
> >  >
> >  > > On Tue, Sep 12, 2023 at 11:06:11AM -0400, Stefan Hajnoczi wrote:
> >  > >> The avocado-system-alpine, avocado-system-fedora, and
> >  > >> avocado-system-ubuntu jobs are unreliable. I identified them while
> >  > >> looking over CI failures from the past week:
> >  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> >  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> >  > >> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> >  > >>
> >  > >> Thomas Huth suggest on IRC today that there may be a legitimate failure
> >  > >> in there:
> >  > >>
> >  > >>   th_huth: f4bug, yes, seems like it does not start at all correctly on
> >  > >>   alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
> >  > >>   noticed this by now, this is worrying
> >  > >>
> >  > >> It crept in because the jobs were already unreliable.
> >  > >>
> >  > >> I don't know how to interpret the job output, so all I can do is to
> >  > >> propose removing these jobs. A useful CI job has two outcomes: pass or
> >  > >> fail. Timeouts and other in-between states are not useful because they
> >  > >> require constant triaging by someone who understands the details of the
> >  > >> tests and they can occur when run against pull requests that have
> >  > >> nothing to do with the area covered by the test.
> >  > >>
> >  > >> Hopefully test owners will be able to identify the root causes and solve
> >  > >> them so that these jobs can stay. In their current state the jobs are
> >  > >> not useful since I cannot cannot tell whether job failures are real or
> >  > >> just intermittent when merging qemu.git pull requests.
> >  > >>
> >  > >> If you are a test owner, please take a look.
> >  > >>
> >  > >> It is likely that other avocado-system-* CI jobs have similar failures
> >  > >> from time to time, but I'll leave them as long as they are passing.
> >  > >>
> >  > >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> >  > >> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >  > >> ---
> >  > >>  .gitlab-ci.d/buildtest.yml | 27 ---------------------------
> >  > >>  1 file changed, 27 deletions(-)
> >  > >>
> >  > >> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> >  > >> index aee9101507..83ce448c4d 100644
> >  > >> --- a/.gitlab-ci.d/buildtest.yml
> >  > >> +++ b/.gitlab-ci.d/buildtest.yml
> >  > >> @@ -22,15 +22,6 @@ check-system-alpine:
> >  > >>      IMAGE: alpine
> >  > >>      MAKE_CHECK_ARGS: check-unit check-qtest
> >  > >>
> >  > >> -avocado-system-alpine:
> >  > >> -  extends: .avocado_test_job_template
> >  > >> -  needs:
> >  > >> -    - job: build-system-alpine
> >  > >> -      artifacts: true
> >  > >> -  variables:
> >  > >> -    IMAGE: alpine
> >  > >> -    MAKE_CHECK_ARGS: check-avocado
> >  > >
> >  > > Instead of entirely deleting, I'd suggest adding
> >  > >
> >  > >    # Disabled due to frequent random failures
> >  > >    # https://gitlab.com/qemu-project/qemu/-/issues/1884
> >  > >    when: manual
> >  > >
> >  > > See example: https://docs.gitlab.com/ee/ci/yaml/#when
> >  > >
> >  > > This disables the job from running unless someone explicitly
> >  > > tells it to run
> >  >
> >  > What I don't understand is why we didn't gate the release back when they
> >  > first tripped. We should have noticed between:
> >  >
> >  >   https://gitlab.com/qemu-project/qemu/-/pipelines/956543770
> >  >
> >  > and
> >  >
> >  >   https://gitlab.com/qemu-project/qemu/-/pipelines/957154381
> >  >
> >  > that the system tests where regressing. Yet we merged the changes
> >  > anyway.
> >
> >  I think that green series is misleading, based on Richard's
> >  mail on list wrt the TCG pull series:
> >
> >    https://lists.gnu.org/archive/html/qemu-devel/2023-08/msg04014.html
> >
> >    "It's some sort of timing issue, which sometimes goes away
> >     when re-run. I was re-running tests *a lot* in order to
> >     get them to go green while running the 8.1 release. "
>
> But I think in that actual case a change exposed a race condition which
> has only recently been fixed - however we've had additional regresssions
> since.
>
> Rather than kill the system tests we can disable the flaky individual
> tests in avocado.

That would be nice, please send an alternative patch.

I can't do that myself because there are a bunch of test cases with
suspicious output and I don't know which ones are legitimate failures,
intermittent problems, or expected failures.

Stefan

>
> >
> >  Essentially I'd put this down to the tests being soo non-deterministic
> >  that we've given up trusting them.
> >
> > Yes.
> >
> > Stefan
> >
> >  With regards,
> >  Daniel
> >  --
> >  |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> >  |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> >  |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro

Thomas Huth Sept. 12, 2023, 7:58 p.m. UTC | #7

On 12/09/2023 17.06, Stefan Hajnoczi wrote:
> The avocado-system-alpine, avocado-system-fedora, and
> avocado-system-ubuntu jobs are unreliable. I identified them while
> looking over CI failures from the past week:
> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
> 
> Thomas Huth suggest on IRC today that there may be a legitimate failure
> in there:
> 
>    th_huth: f4bug, yes, seems like it does not start at all correctly on
>    alpine anymore ... and it's broken since ~ 2 weeks already, so if nobody
>    noticed this by now, this is worrying
> 
> It crept in because the jobs were already unreliable.
> 
> I don't know how to interpret the job output, so all I can do is to
> propose removing these jobs. A useful CI job has two outcomes: pass or
> fail. Timeouts and other in-between states are not useful because they
> require constant triaging by someone who understands the details of the
> tests and they can occur when run against pull requests that have
> nothing to do with the area covered by the test.
> 
> Hopefully test owners will be able to identify the root causes and solve
> them so that these jobs can stay. In their current state the jobs are
> not useful since I cannot cannot tell whether job failures are real or
> just intermittent when merging qemu.git pull requests.
> 
> If you are a test owner, please take a look.
> 
> It is likely that other avocado-system-* CI jobs have similar failures
> from time to time, but I'll leave them as long as they are passing.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   .gitlab-ci.d/buildtest.yml | 27 ---------------------------
>   1 file changed, 27 deletions(-)
> 
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index aee9101507..83ce448c4d 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -22,15 +22,6 @@ check-system-alpine:
>       IMAGE: alpine
>       MAKE_CHECK_ARGS: check-unit check-qtest
>   
> -avocado-system-alpine:
> -  extends: .avocado_test_job_template
> -  needs:
> -    - job: build-system-alpine
> -      artifacts: true
> -  variables:
> -    IMAGE: alpine
> -    MAKE_CHECK_ARGS: check-avocado

Please don't remove the whole job! Just disable the failing tests within the job, e.g.:

diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
--- a/tests/avocado/replay_kernel.py
+++ b/tests/avocado/replay_kernel.py
@@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, kernel_path_xz):
          console_pattern = 'Kernel command line: %s' % kernel_command_line
          self.run_rr(kernel_path, kernel_command_line, console_pattern, shift=5)
  
+    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
      def test_mips_malta32el_nanomips_4k(self):
          """
          :avocado: tags=arch:mipsel
@@ -517,6 +518,7 @@ def test_mips_malta32el_nanomips_4k(self):
          kernel_path_xz = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
          self.do_test_mips_malta32el_nanomips(kernel_path_xz)
  
+    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
      def test_mips_malta32el_nanomips_16k_up(self):
          """
          :avocado: tags=arch:mipsel
@@ -531,6 +533,7 @@ def test_mips_malta32el_nanomips_16k_up(self):
          kernel_path_xz = self.fetch_asset(kernel_url, asset_hash=kernel_hash)
          self.do_test_mips_malta32el_nanomips(kernel_path_xz)
  
+    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
      def test_mips_malta32el_nanomips_64k_dbg(self):
          """
          :avocado: tags=arch:mipsel

Philippe Mathieu-Daudé Sept. 13, 2023, 6:43 a.m. UTC | #8

On 12/9/23 21:58, Thomas Huth wrote:
> On 12/09/2023 17.06, Stefan Hajnoczi wrote:
>> The avocado-system-alpine, avocado-system-fedora, and
>> avocado-system-ubuntu jobs are unreliable. I identified them while
>> looking over CI failures from the past week:
>> https://gitlab.com/qemu-project/qemu/-/jobs/5058610614
>> https://gitlab.com/qemu-project/qemu/-/jobs/5058610654
>> https://gitlab.com/qemu-project/qemu/-/jobs/5030428571
>>
>> Thomas Huth suggest on IRC today that there may be a legitimate failure
>> in there:
>>
>>    th_huth: f4bug, yes, seems like it does not start at all correctly on
>>    alpine anymore ... and it's broken since ~ 2 weeks already, so if 
>> nobody
>>    noticed this by now, this is worrying
>>
>> It crept in because the jobs were already unreliable.
>>
>> I don't know how to interpret the job output, so all I can do is to
>> propose removing these jobs. A useful CI job has two outcomes: pass or
>> fail. Timeouts and other in-between states are not useful because they
>> require constant triaging by someone who understands the details of the
>> tests and they can occur when run against pull requests that have
>> nothing to do with the area covered by the test.
>>
>> Hopefully test owners will be able to identify the root causes and solve
>> them so that these jobs can stay. In their current state the jobs are
>> not useful since I cannot cannot tell whether job failures are real or
>> just intermittent when merging qemu.git pull requests.
>>
>> If you are a test owner, please take a look.
>>
>> It is likely that other avocado-system-* CI jobs have similar failures
>> from time to time, but I'll leave them as long as they are passing.
>>
>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/1884
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   .gitlab-ci.d/buildtest.yml | 27 ---------------------------
>>   1 file changed, 27 deletions(-)
>>
>> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
>> index aee9101507..83ce448c4d 100644
>> --- a/.gitlab-ci.d/buildtest.yml
>> +++ b/.gitlab-ci.d/buildtest.yml
>> @@ -22,15 +22,6 @@ check-system-alpine:
>>       IMAGE: alpine
>>       MAKE_CHECK_ARGS: check-unit check-qtest
>> -avocado-system-alpine:
>> -  extends: .avocado_test_job_template
>> -  needs:
>> -    - job: build-system-alpine
>> -      artifacts: true
>> -  variables:
>> -    IMAGE: alpine
>> -    MAKE_CHECK_ARGS: check-avocado
> 
> Please don't remove the whole job! Just disable the failing tests within 
> the job, e.g.:
> 
> diff --git a/tests/avocado/replay_kernel.py 
> b/tests/avocado/replay_kernel.py
> --- a/tests/avocado/replay_kernel.py
> +++ b/tests/avocado/replay_kernel.py
> @@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, 
> kernel_path_xz):
>           console_pattern = 'Kernel command line: %s' % kernel_command_line
>           self.run_rr(kernel_path, kernel_command_line, console_pattern, 
> shift=5)
> 
> +    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')

Better:

        @skipUntil(gitlab_issue_resolved(1884))

(joking)

I'll send a patch, smth like:

         @skipIf(os.getenv('GITLAB_CI'), 'Pending 
https://gitlab.com/qemu-project/qemu/-/issues/1884')

>       def test_mips_malta32el_nanomips_4k(self):
>           """
>           :avocado: tags=arch:mipsel

Peter Maydell Sept. 13, 2023, 9:18 a.m. UTC | #9

On Tue, 12 Sept 2023 at 21:00, Thomas Huth <thuth@redhat.com> wrote:
> Please don't remove the whole job! Just disable the failing tests within the job, e.g.:
>
> diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
> --- a/tests/avocado/replay_kernel.py
> +++ b/tests/avocado/replay_kernel.py
> @@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, kernel_path_xz):
>           console_pattern = 'Kernel command line: %s' % kernel_command_line
>           self.run_rr(kernel_path, kernel_command_line, console_pattern, shift=5)
>
> +    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
>       def test_mips_malta32el_nanomips_4k(self):
>           """
>           :avocado: tags=arch:mipsel

Please don't skip unstable tests on gitlab only. If they're
unstable, then nobody wants to be running them and wondering
if these are flaky tests or real issues, whether theyr'e doing
it on gitlab or locally. (I know we already have a lot of these,
but the effect is that instead of saying 'make check-avocado'
you have to say 'GITLAB_CI=1 make check-avocado'.)

thanks
-- PMM

Philippe Mathieu-Daudé Sept. 13, 2023, 9:45 a.m. UTC | #10

On 13/9/23 11:18, Peter Maydell wrote:
> On Tue, 12 Sept 2023 at 21:00, Thomas Huth <thuth@redhat.com> wrote:
>> Please don't remove the whole job! Just disable the failing tests within the job, e.g.:
>>
>> diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
>> --- a/tests/avocado/replay_kernel.py
>> +++ b/tests/avocado/replay_kernel.py
>> @@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, kernel_path_xz):
>>            console_pattern = 'Kernel command line: %s' % kernel_command_line
>>            self.run_rr(kernel_path, kernel_command_line, console_pattern, shift=5)
>>
>> +    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
>>        def test_mips_malta32el_nanomips_4k(self):
>>            """
>>            :avocado: tags=arch:mipsel
> 
> Please don't skip unstable tests on gitlab only. If they're
> unstable, then nobody wants to be running them and wondering
> if these are flaky tests or real issues, whether theyr'e doing
> it on gitlab or locally. (I know we already have a lot of these,
> but the effect is that instead of saying 'make check-avocado'
> you have to say 'GITLAB_CI=1 make check-avocado'.)

Good point, I'll simply use:

         @skip('Pending https://gitlab.com/qemu-project/qemu/-/issues/1884')

Looking at other ones:

$ git grep -w @skip tests/avocado/
tests/avocado/machine_sparc_leon3.py:17:    @skip("Test currently broken")
tests/avocado/netdev-ethtool.py:89:    @skip("Incomplete reg 0x00178 
support")
tests/avocado/netdev-ethtool.py:96:    @skip("Incomplete reg 0x00178 
support")
tests/avocado/replay_kernel.py:333:    @skip("Test currently broken") # 
Console stuck as of 5.2-rc1
tests/avocado/replay_kernel.py:368:    @skip("nios2 emulation is buggy 
under record/replay")
tests/avocado/virtio_check_params.py:119:    @skip("break multi-arch CI")

Looking at the first one:

commit 5baecf58ad9fb3ce24d331978526909d0beca482
Author: Philippe Mathieu-Daudé <f4bug@amsat.org>
Date:   Tue Mar 31 12:50:42 2020 +0200

     tests/acceptance/machine_sparc_leon3: Disable HelenOS test

     This test was written/tested around beginning of 2019, but was
     extracted from a bigger series and posted end of June 2019 [*].
     Unfortunately I did not notice commit 162abf1a8 was merged by
     then, which implements the AHB and APB plug and play devices.

     HelenOS 0.6 is expecting the PnP registers to be not implemented
     by QEMU, then forces the discovered AMBA devices (see [2]).

     Before 162abf1a8, the console was displaying:

       HelenOS bootloader, release 0.6.0 (Elastic Horse)
       Built on 2014-12-21 20:17:42 for sparc32
       Copyright (c) 2001-2014 HelenOS project
        0x4000bf20|0x4000bf20: kernel image (496640/128466 bytes)
        0x4002b4f2|0x4002b4f2: ns image (154195/66444 bytes)
        0x4003b87e|0x4003b87e: loader image (153182/66437 bytes)
        0x4004bc03|0x4004bc03: init image (155339/66834 bytes)
        0x4005c115|0x4005c115: locsrv image (162063/70267 bytes)
        0x4006d390|0x4006d390: rd image (152678/65889 bytes)
        0x4007d4f1|0x4007d4f1: vfs image (168480/73394 bytes)
        0x4008f3a3|0x4008f3a3: logger image (158034/68368 bytes)
        0x4009feb3|0x4009feb3: ext4fs image (234510/100301 bytes)
        0x400b8680|0x400b8680: initrd image (8388608/1668901 bytes)
       ABMA devices:
       <1:00c> at 0x80000100 irq 3
       <1:00d> at 0x80000200
       <1:011> at 0x80000300 irq 8
       Memory size: 64 MB

     As of this commit, it is now confused:

       ABMA devices:
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       <1:3000> at 0x00000000 irq 0
       ...

     As this test is not working as expected, simply disable it (by
     skipping it) for now.

More than 3 years passed already, what a disappointment.
Offending commit is 4 years old.

commit 162abf1a83ddd06ce1618666f84f88ba4dbffe10
Author: KONRAD Frederic <frederic.konrad@adacore.com>
Date:   Wed May 15 14:31:32 2019 +0200

     leon3: introduce the plug and play mechanism

     This adds the AHB and APB plug and play devices.
     They are scanned during the linux boot to discover the various
     peripheral.

I'm not complaining about that particular commit, I wonder about
usefulness of disabling tests from unmaintained areas.

Maybe we can commit a date when disabling a test, having a disabled
test failing _after_ that date, so if it isn't fixed we remove it.
Smth like,

   @SkipBroken(date='2023-11-15',
               desc='Pending 
https://gitlab.com/qemu-project/qemu/-/issues/1884') # Will fail if run 
after 2023-11-15 and this test isn't fixed

Thoughts?

Alex Bennée Sept. 13, 2023, 10:35 a.m. UTC | #11

Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> On 13/9/23 11:18, Peter Maydell wrote:
>> On Tue, 12 Sept 2023 at 21:00, Thomas Huth <thuth@redhat.com> wrote:
>>> Please don't remove the whole job! Just disable the failing tests within the job, e.g.:
>>>
>>> diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
>>> --- a/tests/avocado/replay_kernel.py
>>> +++ b/tests/avocado/replay_kernel.py
>>> @@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, kernel_path_xz):
>>>            console_pattern = 'Kernel command line: %s' % kernel_command_line
>>>            self.run_rr(kernel_path, kernel_command_line, console_pattern, shift=5)
>>>
>>> +    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
>>>        def test_mips_malta32el_nanomips_4k(self):
>>>            """
>>>            :avocado: tags=arch:mipsel
>> Please don't skip unstable tests on gitlab only. If they're
>> unstable, then nobody wants to be running them and wondering
>> if these are flaky tests or real issues, whether theyr'e doing
>> it on gitlab or locally. (I know we already have a lot of these,
>> but the effect is that instead of saying 'make check-avocado'
>> you have to say 'GITLAB_CI=1 make check-avocado'.)
>
> Good point, I'll simply use:
>
>         @skip('Pending https://gitlab.com/qemu-project/qemu/-/issues/1884')
>
> Looking at other ones:
>
> $ git grep -w @skip tests/avocado/
> tests/avocado/machine_sparc_leon3.py:17:    @skip("Test currently broken")
> tests/avocado/netdev-ethtool.py:89:    @skip("Incomplete reg 0x00178
> support")
> tests/avocado/netdev-ethtool.py:96:    @skip("Incomplete reg 0x00178
> support")
> tests/avocado/replay_kernel.py:333:    @skip("Test currently broken")
> # Console stuck as of 5.2-rc1
> tests/avocado/replay_kernel.py:368:    @skip("nios2 emulation is buggy
> under record/replay")
> tests/avocado/virtio_check_params.py:119:    @skip("break multi-arch CI")
>
> Looking at the first one:
>
> commit 5baecf58ad9fb3ce24d331978526909d0beca482
> Author: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Date:   Tue Mar 31 12:50:42 2020 +0200
>
>     tests/acceptance/machine_sparc_leon3: Disable HelenOS test
>
>     This test was written/tested around beginning of 2019, but was
>     extracted from a bigger series and posted end of June 2019 [*].
>     Unfortunately I did not notice commit 162abf1a8 was merged by
>     then, which implements the AHB and APB plug and play devices.
>
>     HelenOS 0.6 is expecting the PnP registers to be not implemented
>     by QEMU, then forces the discovered AMBA devices (see [2]).
>
>     Before 162abf1a8, the console was displaying:
>
>       HelenOS bootloader, release 0.6.0 (Elastic Horse)
>       Built on 2014-12-21 20:17:42 for sparc32
>       Copyright (c) 2001-2014 HelenOS project
>        0x4000bf20|0x4000bf20: kernel image (496640/128466 bytes)
>        0x4002b4f2|0x4002b4f2: ns image (154195/66444 bytes)
>        0x4003b87e|0x4003b87e: loader image (153182/66437 bytes)
>        0x4004bc03|0x4004bc03: init image (155339/66834 bytes)
>        0x4005c115|0x4005c115: locsrv image (162063/70267 bytes)
>        0x4006d390|0x4006d390: rd image (152678/65889 bytes)
>        0x4007d4f1|0x4007d4f1: vfs image (168480/73394 bytes)
>        0x4008f3a3|0x4008f3a3: logger image (158034/68368 bytes)
>        0x4009feb3|0x4009feb3: ext4fs image (234510/100301 bytes)
>        0x400b8680|0x400b8680: initrd image (8388608/1668901 bytes)
>       ABMA devices:
>       <1:00c> at 0x80000100 irq 3
>       <1:00d> at 0x80000200
>       <1:011> at 0x80000300 irq 8
>       Memory size: 64 MB
>
>     As of this commit, it is now confused:
>
>       ABMA devices:
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       <1:3000> at 0x00000000 irq 0
>       ...
>
>     As this test is not working as expected, simply disable it (by
>     skipping it) for now.
>
> More than 3 years passed already, what a disappointment.
> Offending commit is 4 years old.
>
> commit 162abf1a83ddd06ce1618666f84f88ba4dbffe10
> Author: KONRAD Frederic <frederic.konrad@adacore.com>
> Date:   Wed May 15 14:31:32 2019 +0200
>
>     leon3: introduce the plug and play mechanism
>
>     This adds the AHB and APB plug and play devices.
>     They are scanned during the linux boot to discover the various
>     peripheral.
>
> I'm not complaining about that particular commit, I wonder about
> usefulness of disabling tests from unmaintained areas.
>
> Maybe we can commit a date when disabling a test, having a disabled
> test failing _after_ that date, so if it isn't fixed we remove it.
> Smth like,
>
>   @SkipBroken(date='2023-11-15',
>               desc='Pending
>               https://gitlab.com/qemu-project/qemu/-/issues/1884') #
>               Will fail if run after 2023-11-15 and this test isn't
>              fixed
>
> Thoughts?

I like the:

  @skip('Pending https://gitlab.com/qemu-project/qemu/-/issues/1884')

I think trying to do anything more fancy is just going to lead to
frustration later on.

Daniel P. Berrangé Sept. 13, 2023, 10:39 a.m. UTC | #12

On Wed, Sep 13, 2023 at 11:35:12AM +0100, Alex Bennée wrote:
> 
> Philippe Mathieu-Daudé <philmd@linaro.org> writes:
> 
> > On 13/9/23 11:18, Peter Maydell wrote:
> >> On Tue, 12 Sept 2023 at 21:00, Thomas Huth <thuth@redhat.com> wrote:
> >>> Please don't remove the whole job! Just disable the failing tests within the job, e.g.:
> >>>
> >>> diff --git a/tests/avocado/replay_kernel.py b/tests/avocado/replay_kernel.py
> >>> --- a/tests/avocado/replay_kernel.py
> >>> +++ b/tests/avocado/replay_kernel.py
> >>> @@ -503,6 +503,7 @@ def do_test_mips_malta32el_nanomips(self, kernel_path_xz):
> >>>            console_pattern = 'Kernel command line: %s' % kernel_command_line
> >>>            self.run_rr(kernel_path, kernel_command_line, console_pattern, shift=5)
> >>>
> >>> +    @skipIf(os.getenv('GITLAB_CI'), 'Skipping unstable test on GitLab')
> >>>        def test_mips_malta32el_nanomips_4k(self):
> >>>            """
> >>>            :avocado: tags=arch:mipsel
> >> Please don't skip unstable tests on gitlab only. If they're
> >> unstable, then nobody wants to be running them and wondering
> >> if these are flaky tests or real issues, whether theyr'e doing
> >> it on gitlab or locally. (I know we already have a lot of these,
> >> but the effect is that instead of saying 'make check-avocado'
> >> you have to say 'GITLAB_CI=1 make check-avocado'.)
> >
> > Good point, I'll simply use:
> >
> >         @skip('Pending https://gitlab.com/qemu-project/qemu/-/issues/1884')
> >
> > Looking at other ones:
> >
> > $ git grep -w @skip tests/avocado/
> > tests/avocado/machine_sparc_leon3.py:17:    @skip("Test currently broken")
> > tests/avocado/netdev-ethtool.py:89:    @skip("Incomplete reg 0x00178
> > support")
> > tests/avocado/netdev-ethtool.py:96:    @skip("Incomplete reg 0x00178
> > support")
> > tests/avocado/replay_kernel.py:333:    @skip("Test currently broken")
> > # Console stuck as of 5.2-rc1
> > tests/avocado/replay_kernel.py:368:    @skip("nios2 emulation is buggy
> > under record/replay")
> > tests/avocado/virtio_check_params.py:119:    @skip("break multi-arch CI")
> >
> > Looking at the first one:
> >
> > commit 5baecf58ad9fb3ce24d331978526909d0beca482
> > Author: Philippe Mathieu-Daudé <f4bug@amsat.org>
> > Date:   Tue Mar 31 12:50:42 2020 +0200
> >
> >     tests/acceptance/machine_sparc_leon3: Disable HelenOS test
> >
> >     This test was written/tested around beginning of 2019, but was
> >     extracted from a bigger series and posted end of June 2019 [*].
> >     Unfortunately I did not notice commit 162abf1a8 was merged by
> >     then, which implements the AHB and APB plug and play devices.
> >
> >     HelenOS 0.6 is expecting the PnP registers to be not implemented
> >     by QEMU, then forces the discovered AMBA devices (see [2]).
> >
> >     Before 162abf1a8, the console was displaying:
> >
> >       HelenOS bootloader, release 0.6.0 (Elastic Horse)
> >       Built on 2014-12-21 20:17:42 for sparc32
> >       Copyright (c) 2001-2014 HelenOS project
> >        0x4000bf20|0x4000bf20: kernel image (496640/128466 bytes)
> >        0x4002b4f2|0x4002b4f2: ns image (154195/66444 bytes)
> >        0x4003b87e|0x4003b87e: loader image (153182/66437 bytes)
> >        0x4004bc03|0x4004bc03: init image (155339/66834 bytes)
> >        0x4005c115|0x4005c115: locsrv image (162063/70267 bytes)
> >        0x4006d390|0x4006d390: rd image (152678/65889 bytes)
> >        0x4007d4f1|0x4007d4f1: vfs image (168480/73394 bytes)
> >        0x4008f3a3|0x4008f3a3: logger image (158034/68368 bytes)
> >        0x4009feb3|0x4009feb3: ext4fs image (234510/100301 bytes)
> >        0x400b8680|0x400b8680: initrd image (8388608/1668901 bytes)
> >       ABMA devices:
> >       <1:00c> at 0x80000100 irq 3
> >       <1:00d> at 0x80000200
> >       <1:011> at 0x80000300 irq 8
> >       Memory size: 64 MB
> >
> >     As of this commit, it is now confused:
> >
> >       ABMA devices:
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       <1:3000> at 0x00000000 irq 0
> >       ...
> >
> >     As this test is not working as expected, simply disable it (by
> >     skipping it) for now.
> >
> > More than 3 years passed already, what a disappointment.
> > Offending commit is 4 years old.
> >
> > commit 162abf1a83ddd06ce1618666f84f88ba4dbffe10
> > Author: KONRAD Frederic <frederic.konrad@adacore.com>
> > Date:   Wed May 15 14:31:32 2019 +0200
> >
> >     leon3: introduce the plug and play mechanism
> >
> >     This adds the AHB and APB plug and play devices.
> >     They are scanned during the linux boot to discover the various
> >     peripheral.
> >
> > I'm not complaining about that particular commit, I wonder about
> > usefulness of disabling tests from unmaintained areas.
> >
> > Maybe we can commit a date when disabling a test, having a disabled
> > test failing _after_ that date, so if it isn't fixed we remove it.
> > Smth like,
> >
> >   @SkipBroken(date='2023-11-15',
> >               desc='Pending
> >               https://gitlab.com/qemu-project/qemu/-/issues/1884') #
> >               Will fail if run after 2023-11-15 and this test isn't
> >              fixed
> >
> > Thoughts?
> 
> I like the:
> 
>   @skip('Pending https://gitlab.com/qemu-project/qemu/-/issues/1884')

I'd remove "Pending " too - the URL alone is sufficient info
including the record of dates, to guide us when to finally
delete a skipped test for good.


With regards,
Daniel

gitlab: remove unreliable avocado CI jobs

Commit Message

Comments

Patch