gitlab-ci/cirrus: Increase timeout to 80 minutes

Message ID	20211116163309.246602-1-thuth@redhat.com (mailing list archive)
State	New, archived
From: Thomas Huth <thuth@redhat.com>
To: qemu-devel@nongnu.org, Philippe Mathieu-Daudé <f4bug@amsat.org>, Alex Bennée <alex.bennee@linaro.org>
Cc: Willian Rampazzo <willianr@redhat.com>, Daniel P. Berrangé <berrange@redhat.com>, Wainer dos Santos Moschetta <wainersm@redhat.com>
Subject: [PATCH] gitlab-ci/cirrus: Increase timeout to 80 minutes
Date: Tue, 16 Nov 2021 17:33:09 +0100

Thomas Huth Nov. 16, 2021, 4:33 p.m. UTC

The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
be scheduled, so while the build test itself finishes within 60 minutes,
the total run time of the jobs can be longer due to this waiting time.
Thus let's increase the timeout on the gitlab side a little bit, so
that these jobs are not marked as failing just because of the delay.

Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 .gitlab-ci.d/cirrus.yml | 1 +
 1 file changed, 1 insertion(+)

Daniel P. Berrangé Nov. 16, 2021, 4:49 p.m. UTC | #1

On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> be scheduled, so while the build test itself finishes within 60 minutes,
> the total run time of the jobs can be longer due to this waiting time.
> Thus let's increase the timeout on the gitlab side a little bit, so
> that these jobs are not marked as failing just because of the delay.

On a successful pipeline I see

 freebsd-11  - 28 minutes
 freebsd-12  - 57 minutes
 macos       - 30 minutes

We know cirrus allows 2 concurrent jobs, so from that I infer
that the freebsd-12 job was queued for ~30 minutes waiting for
either the freebsd-11 or macos job to finish, and then it
ran in 30 minutes, giving the ~60 minute total.

That's too close to the 60 minute gitlab default job timeout
for comfort - it can easily slip over 60 minutes by just a
small amount.

80 minutes will certainly help in the case where we
randomly take a little longer than 30 minutes to build,
and have 1 of the 3 jobs queued.

When we're running jobs on both master + staging, we can
have 2 jobs running and 4 more queued - 2 of those queued
might just finish in time, but 2 will definitely fail.
My patch will cut these extra jobs on master, so in the common
case we only ever get 1 queued, which should work well in
combination with your patch here. That should be good enough
for the qemu-project namespace, unless someone is triggering
pipelines for stable branch staging at the same time as
the master branch staging.

If we do want to worry about more than 2 queued jobs
again for that reason, we might consider putting
it up to 100 minutes. That would give us enough slack to
have 4 queued jobs behind two running jobs and have
them all succeed.
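
As a rough check of the arithmetic above, here is a minimal sketch in Python. It assumes that every job is submitted at the same moment, that Cirrus runs at most 2 jobs at once in FIFO order, that each build takes a fixed number of minutes, and that the gitlab timer starts at submission. The function names are hypothetical and only serve this illustration.

```python
import heapq


def finish_times(num_jobs, slots, build_minutes):
    """Return the elapsed minutes (queue wait + build) for each job.

    Assumption: all jobs are submitted at t=0 and are started in FIFO
    order as soon as one of the `slots` concurrent slots becomes free.
    """
    free_at = [0] * slots              # time at which each slot is free
    heapq.heapify(free_at)
    done = []
    for _ in range(num_jobs):
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + build_minutes
        heapq.heappush(free_at, end)
        done.append(end)
    return done


def timed_out(num_jobs, timeout_minutes, slots=2, build_minutes=30):
    """Count jobs whose total elapsed time exceeds the gitlab timeout."""
    times = finish_times(num_jobs, slots, build_minutes)
    return sum(1 for t in times if t > timeout_minutes)


if __name__ == "__main__":
    # 6 jobs = master + staging pipelines submitted together.
    for timeout in (60, 80, 100):
        print(timeout, "minute timeout:", timed_out(6, timeout), "jobs over")
```

Under these assumptions, six jobs finish after 30, 30, 60, 60, 90 and 90 minutes, so the last two exceed both a 60 and an 80 minute timeout and only fit within 100 minutes. With three jobs that each take slightly more than 30 minutes, the queued job exceeds 60 minutes but stays within 80.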

> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  .gitlab-ci.d/cirrus.yml | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
> index e7b25e7427..22d42585e4 100644
> --- a/.gitlab-ci.d/cirrus.yml
> +++ b/.gitlab-ci.d/cirrus.yml
> @@ -14,6 +14,7 @@
>    stage: build
>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>    needs: []
> +  timeout: 80m
>    allow_failure: true
>    script:
>      - source .gitlab-ci.d/cirrus/$NAME.vars

Whether 80 or 100 minutes, consider it

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel

Philippe Mathieu-Daudé Nov. 16, 2021, 5:09 p.m. UTC | #2

On 11/16/21 17:49, Daniel P. Berrangé wrote:
> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>> be scheduled, so while the build test itself finishes within 60 minutes,
>> the total run time of the jobs can be longer due to this waiting time.
>> Thus let's increase the timeout on the gitlab side a little bit, so
>> that these jobs are not marked as failing just because of the delay.
> 
> On a successful pipeline I see
> 
>  freebsd-11  - 28 minutes
>  freebsd-12  - 57 minutes
>  macos       - 30 minutes
> 
> We know cirrus allows 2 concurrent jobs, so from that I infer
> that the freebsd-12 job was queued for ~30 minutes waiting for
> either the freebsd-11 or macos job to finish, and then it
> ran in 30 minutes, giving the ~60 minute total.
> 
> That's too close to the 60 minute gitlab default job timeout
> for comfort - it can easily slip over 60 minutes by just a
> small amount.
> 
> 80 minutes will certainly help in the case where we
> randomly take a little longer than 30 minutes to build,
> and have 1 of the 3 jobs queued.
> 
> When we're running jobs on both master + staging, we can
> have 2 jobs running and 4 more queued - 2 of those queued
> might just finish in time, but 2 will definitely fail.
> My patch will cut these extra jobs on master, so in common
> case we only ever get 1 queued, which should work well in
> combo with your patch here. That should be good enough
> for the qemu-project namespace, unless someone is triggering
> pipelines for stable branch staging at the same time as
> the master branch staging.
> 
> If we do want to worry about more than 2 queued jobs
> again for that reason, we might consider putting
> it upto 100 minutes. That would give us enough slack to
> have 4 queued jobs behind two running jobs and have
> them all succeed
> 
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>  .gitlab-ci.d/cirrus.yml | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>> index e7b25e7427..22d42585e4 100644
>> --- a/.gitlab-ci.d/cirrus.yml
>> +++ b/.gitlab-ci.d/cirrus.yml
>> @@ -14,6 +14,7 @@
>>    stage: build
>>    image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>    needs: []
>> +  timeout: 80m
>>    allow_failure: true
>>    script:
>>      - source .gitlab-ci.d/cirrus/$NAME.vars
> 
> Whether 80 or 100 minute, consider it
> 
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

This pipeline took 1h51m09s:
https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
But Richard restarted unstable jobs, which probably added time
to the total.

IIRC from a maintainer perspective 1h15 is the upper limit.
80m fits, 100m is over. Up to the project maintainers
(personally I don't have any objection, in particular if
this reduces the failure rate).

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

Willian Rampazzo Nov. 16, 2021, 5:17 p.m. UTC | #3

On Tue, Nov 16, 2021 at 1:33 PM Thomas Huth <thuth@redhat.com> wrote:
>
> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> be scheduled, so while the build test itself finishes within 60 minutes,
> the total run time of the jobs can be longer due to this waiting time.
> Thus let's increase the timeout on the gitlab side a little bit, so
> that these jobs are not marked as failing just because of the delay.
>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  .gitlab-ci.d/cirrus.yml | 1 +
>  1 file changed, 1 insertion(+)
>

Reviewed-by: Willian Rampazzo <willianr@redhat.com>

Thomas Huth Nov. 16, 2021, 5:22 p.m. UTC | #4

On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>> the total run time of the jobs can be longer due to this waiting time.
>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>> that these jobs are not marked as failing just because of the delay.
...
>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>> index e7b25e7427..22d42585e4 100644
>>> --- a/.gitlab-ci.d/cirrus.yml
>>> +++ b/.gitlab-ci.d/cirrus.yml
>>> @@ -14,6 +14,7 @@
>>>     stage: build
>>>     image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>     needs: []
>>> +  timeout: 80m
>>>     allow_failure: true
>>>     script:
>>>       - source .gitlab-ci.d/cirrus/$NAME.vars
>>
>> Whether 80 or 100 minute, consider it
>>
>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> This pipeline took 1h51m09s:
> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
> But Richard restarted unstable jobs, which probably added time
> to the total.
> 
> IIRC from a maintainer perspective 1h15 is the upper limit.
> 80m fits, 100m is over.

I think I agree ... I normally don't want to wait more than a little bit 
more than one hour, so 100 minutes feels too long already. We already have 
some 70m timeouts in other jobs, and one 80 minute timeout in 
.gitlab-ci.d/crossbuild-template.yml, so I'd say 80 minutes are really the 
upper boundary that we should use.

> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

Thanks to all for your reviews!

  Thomas

Richard Henderson Nov. 16, 2021, 5:36 p.m. UTC | #5

On 11/16/21 6:22 PM, Thomas Huth wrote:
> On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
>> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>>> the total run time of the jobs can be longer due to this waiting time.
>>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>>> that these jobs are not marked as failing just because of the delay.
> ...
>>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>>> index e7b25e7427..22d42585e4 100644
>>>> --- a/.gitlab-ci.d/cirrus.yml
>>>> +++ b/.gitlab-ci.d/cirrus.yml
>>>> @@ -14,6 +14,7 @@
>>>>     stage: build
>>>>     image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>>     needs: []
>>>> +  timeout: 80m
>>>>     allow_failure: true
>>>>     script:
>>>>       - source .gitlab-ci.d/cirrus/$NAME.vars
>>>
>>> Whether 80 or 100 minute, consider it
>>>
>>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>>
>> This pipeline took 1h51m09s:
>> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
>> But Richard restarted unstable jobs, which probably added time
>> to the total.
>>
>> IIRC from a maintainer perspective 1h15 is the upper limit.
>> 80m fits, 100m is over.
> 
> I think I agree ... I normally don't want to wait more than a little bit more than one 
> hour, so 100 minutes feels too long already. We already have some 70m timeouts in other 
> jobs, and one 80 minute timeout in .gitlab-ci.d/crossbuild-template.yml, so I'd say 80 
> minutes are really the upper boundary that we should use.

We are also talking apples and oranges:
Gitlab timeouts are on the amount of time the job runs.
Cirrus timeouts appear to be on the amount of time the job is queued.

If cirrus would just not start accounting until the thing runs we'd be fine.


r~

Daniel P. Berrangé Nov. 16, 2021, 6:20 p.m. UTC | #6

On Tue, Nov 16, 2021 at 06:36:50PM +0100, Richard Henderson wrote:
> On 11/16/21 6:22 PM, Thomas Huth wrote:
> > On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
> > > On 11/16/21 17:49, Daniel P. Berrangé wrote:
> > > > On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
> > > > > The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
> > > > > be scheduled, so while the build test itself finishes within 60 minutes,
> > > > > the total run time of the jobs can be longer due to this waiting time.
> > > > > Thus let's increase the timeout on the gitlab side a little bit, so
> > > > > that these jobs are not marked as failing just because of the delay.
> > ...
> > > > > diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
> > > > > index e7b25e7427..22d42585e4 100644
> > > > > --- a/.gitlab-ci.d/cirrus.yml
> > > > > +++ b/.gitlab-ci.d/cirrus.yml
> > > > > @@ -14,6 +14,7 @@
> > > > >     stage: build
> > > > >     image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
> > > > >     needs: []
> > > > > +  timeout: 80m
> > > > >     allow_failure: true
> > > > >     script:
> > > > >       - source .gitlab-ci.d/cirrus/$NAME.vars
> > > > 
> > > > Whether 80 or 100 minute, consider it
> > > > 
> > > > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> > > 
> > > This pipeline took 1h51m09s:
> > > https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
> > > But Richard restarted unstable jobs, which probably added time
> > > to the total.
> > > 
> > > IIRC from a maintainer perspective 1h15 is the upper limit.
> > > 80m fits, 100m is over.
> > 
> > I think I agree ... I normally don't want to wait more than a little bit
> > more than one hour, so 100 minutes feels too long already. We already
> > have some 70m timeouts in other jobs, and one 80 minute timeout in
> > .gitlab-ci.d/crossbuild-template.yml, so I'd say 80 minutes are really
> > the upper boundary that we should use.
> 
> We are also talking apples and oranges:
> Gitlab timeouts are on the amount of time the job runs.
> Cirrus timeouts appear to be on the amount of time the job is queued.
> 
> If cirrus would just not start accounting until the thing runs we'd be fine.

Unfortunately it isn't that easy. Our cirrus CI jobs are launched using
the cirrus-run tool, from a gitlab job. The timeouts we're usually
hitting are from the gitlab job which is sitting around waiting for
the cirrus job it launched to finish, so it can report back stats.

Cirrus CI does itself have a job timeout, but I'm not aware of us
hitting that typically, unless I'm misinterpreting something.
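
To illustrate why the queue time counts against the gitlab timeout, here is a minimal sketch in Python of a wrapper job that polls a remote job until it finishes. This is not how cirrus-run is implemented, and in reality the GitLab runner enforces the timeout from outside the script; the sketch models it as a deadline inside the loop. The status names, `FakeClock` and `make_remote` are hypothetical and exist only for this example.

```python
class FakeClock:
    """Simulated clock in minutes, so the example runs instantly."""

    def __init__(self):
        self.t = 0

    def now(self):
        return self.t

    def sleep(self, minutes):
        self.t += minutes


def make_remote(clock, queued_minutes, build_minutes):
    """Return a status function for a hypothetical remote job."""
    def status():
        if clock.t < queued_minutes:
            return "QUEUED"
        if clock.t < queued_minutes + build_minutes:
            return "RUNNING"
        return "SUCCESS"
    return status


def wait_for_remote(get_status, clock, timeout_minutes, poll_minutes=1):
    """Poll until the remote job ends or the wrapper's deadline passes.

    The deadline is measured from when the wrapper starts, so time the
    remote job spends queued is charged to the wrapper as well.
    """
    started = clock.now()
    while True:
        state = get_status()
        if state in ("SUCCESS", "FAILED"):
            return state
        if clock.now() - started >= timeout_minutes:
            return "WRAPPER_TIMEOUT"
        clock.sleep(poll_minutes)


def run(timeout_minutes, queued_minutes=35, build_minutes=30):
    clock = FakeClock()
    remote = make_remote(clock, queued_minutes, build_minutes)
    return wait_for_remote(remote, clock, timeout_minutes)
```

With a remote job that is queued for 35 minutes and then builds for 30, `run(60)` gives up while the build is still running, whereas `run(80)` sees it succeed, even though the build itself never took longer than 30 minutes.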

Regards,
Daniel

Thomas Huth Nov. 17, 2021, 7:03 a.m. UTC | #7

On 16/11/2021 19.20, Daniel P. Berrangé wrote:
> On Tue, Nov 16, 2021 at 06:36:50PM +0100, Richard Henderson wrote:
>> On 11/16/21 6:22 PM, Thomas Huth wrote:
>>> On 16/11/2021 18.09, Philippe Mathieu-Daudé wrote:
>>>> On 11/16/21 17:49, Daniel P. Berrangé wrote:
>>>>> On Tue, Nov 16, 2021 at 05:33:09PM +0100, Thomas Huth wrote:
>>>>>> The jobs on Cirrus-CI sometimes get delayed quite a bit, waiting to
>>>>>> be scheduled, so while the build test itself finishes within 60 minutes,
>>>>>> the total run time of the jobs can be longer due to this waiting time.
>>>>>> Thus let's increase the timeout on the gitlab side a little bit, so
>>>>>> that these jobs are not marked as failing just because of the delay.
>>> ...
>>>>>> diff --git a/.gitlab-ci.d/cirrus.yml b/.gitlab-ci.d/cirrus.yml
>>>>>> index e7b25e7427..22d42585e4 100644
>>>>>> --- a/.gitlab-ci.d/cirrus.yml
>>>>>> +++ b/.gitlab-ci.d/cirrus.yml
>>>>>> @@ -14,6 +14,7 @@
>>>>>>      stage: build
>>>>>>      image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
>>>>>>      needs: []
>>>>>> +  timeout: 80m
>>>>>>      allow_failure: true
>>>>>>      script:
>>>>>>        - source .gitlab-ci.d/cirrus/$NAME.vars
>>>>>
>>>>> Whether 80 or 100 minute, consider it
>>>>>
>>>>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>>>>
>>>> This pipeline took 1h51m09s:
>>>> https://gitlab.com/qemu-project/qemu/-/pipelines/409666733/builds
>>>> But Richard restarted unstable jobs, which probably added time
>>>> to the total.
>>>>
>>>> IIRC from a maintainer perspective 1h15 is the upper limit.
>>>> 80m fits, 100m is over.
>>>
>>> I think I agree ... I normally don't want to wait more than a little bit
>>> more than one hour, so 100 minutes feels too long already. We already
>>> have some 70m timeouts in other jobs, and one 80 minute timeout in
>>> .gitlab-ci.d/crossbuild-template.yml, so I'd say 80 minutes are really
>>> the upper boundary that we should use.
>>
>> We are also talking apples and oranges:
>> Gitlab timeouts are on the amount of time the job runs.
>> Cirrus timeouts appear to be on the amount of time the job is queued.
>>
>> If cirrus would just not start accounting until the thing runs we'd be fine.
> 
> Unfortunately it isn't that easy. Our cirrus CI jobs are launched using
> the cirrus-run tool, from a gitlab job. The timeouts we're usually
> hitting are from the gitlab job which is sitting around waiting for
> the cirrus job it launched to finish, so it can report back stats.
> 
> Cirrus CI does itself have a job timeout, but I'm not aware of us
> hitting that typically, unless i'm misinterpreting something.

Right, the problem is the timeout on the gitlab-CI side, not the timeout on 
the Cirrus-CI side. I've never seen us hitting the timeout on the Cirrus 
side either.

  Thomas
