cirrus: don't run full qtest on macOS

Message ID	20210106114159.981538-1-berrange@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=d9zC=GJ=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 98CB822EBD From: =?utf-8?q?Daniel_P=2E_Berrang=C3=A9?= <berrange@redhat.com> To: qemu-devel@nongnu.org Subject: [PATCH] cirrus: don't run full qtest on macOS Date: Wed, 6 Jan 2021 11:41:59 +0000 Message-Id: <20210106114159.981538-1-berrange@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=63.128.21.124; envelope-from=berrange@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -29 X-Spam_score: -3.0 X-Spam_bar: --- X-Spam_report: (-3.0 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.252, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: Thomas Huth <thuth@redhat.com>, Ed Maste <emaste@freebsd.org>, =?utf-8?q?Philippe_Mathieu-Daud=C3=A9?= <philmd@redhat.com>, =?utf-8?q?Dani?= =?utf-8?q?el_P=2E_Berrang=C3=A9?= <berrange@redhat.com>, =?utf-8?q?Alex_Ben?= =?utf-8?q?n=C3=A9e?= <alex.bennee@linaro.org>, Li-Wen Hsu <lwhsu@freebsd.org> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	cirrus: don't run full qtest on macOS \| expand cirrus: don't run full qtest on macOS

Daniel P. Berrangé Jan. 6, 2021, 11:41 a.m. UTC

The Cirrus CI macOS build hosts have exhibited a serious performance
degradation in recent months. For example the "qom-test" qtest takes
over an hour for only the qemu-system-aarch64 binary. This is as much
20-40 times slower than other environments. The other qtests all show
similar performance degradation, as do many of the unit tests.

This does not appear related to QEMU code changes, since older git
commits which were known to fully complete in less than 1 hour on
Cirrus CI now also show similar bad performance. Either Cirrus CI
performance has degraded, or an change in its environment has exposed
a latent bug widely affecting all of QEMU. Debugging the qom-test
showed no easily identified large bottleneck - every step of the
test execution was simply slower.

macOS builds/tests run outside Cirrus CI show normal performance.

With an inability to identify any obvious problem, the only viable
way to get a reliably completing Cirrus CI macOS job is to cut out
almost all of the qtests. We choose to run the x86_64 target only,
since that has very few machine types and thus is least badly
impacted in the qom-test execution.

With this change, the macOS jobs complete in approx 35 minutes.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 .cirrus.yml | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Thomas Huth Jan. 6, 2021, 12:30 p.m. UTC | #1

On 06/01/2021 12.41, Daniel P. Berrangé wrote:
> The Cirrus CI macOS build hosts have exhibited a serious performance
> degradation in recent months. For example the "qom-test" qtest takes
> over an hour for only the qemu-system-aarch64 binary. This is as much
> 20-40 times slower than other environments. The other qtests all show
> similar performance degradation, as do many of the unit tests.
> 
> This does not appear related to QEMU code changes, since older git
> commits which were known to fully complete in less than 1 hour on
> Cirrus CI now also show similar bad performance. Either Cirrus CI
> performance has degraded, or an change in its environment has exposed
> a latent bug widely affecting all of QEMU. Debugging the qom-test
> showed no easily identified large bottleneck - every step of the
> test execution was simply slower.
> 
> macOS builds/tests run outside Cirrus CI show normal performance.
> 
> With an inability to identify any obvious problem, the only viable
> way to get a reliably completing Cirrus CI macOS job is to cut out
> almost all of the qtests. We choose to run the x86_64 target only,
> since that has very few machine types and thus is least badly
> impacted in the qom-test execution.

Sad that we have to do this, but as long as we still can run the x86 qtests, 
that's certainly better than nothing.

Reviewed-by: Thomas Huth <thuth@redhat.com>

Willian Rampazzo Jan. 6, 2021, 12:44 p.m. UTC | #2

On Wed, Jan 6, 2021 at 8:44 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> The Cirrus CI macOS build hosts have exhibited a serious performance
> degradation in recent months. For example the "qom-test" qtest takes
> over an hour for only the qemu-system-aarch64 binary. This is as much
> 20-40 times slower than other environments. The other qtests all show
> similar performance degradation, as do many of the unit tests.
>
> This does not appear related to QEMU code changes, since older git
> commits which were known to fully complete in less than 1 hour on
> Cirrus CI now also show similar bad performance. Either Cirrus CI
> performance has degraded, or an change in its environment has exposed
> a latent bug widely affecting all of QEMU. Debugging the qom-test
> showed no easily identified large bottleneck - every step of the
> test execution was simply slower.
>
> macOS builds/tests run outside Cirrus CI show normal performance.
>
> With an inability to identify any obvious problem, the only viable
> way to get a reliably completing Cirrus CI macOS job is to cut out
> almost all of the qtests. We choose to run the x86_64 target only,
> since that has very few machine types and thus is least badly
> impacted in the qom-test execution.
>
> With this change, the macOS jobs complete in approx 35 minutes.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  .cirrus.yml | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>

Reviewed-by: Willian Rampazzo <willianr@redhat.com>

Daniel P. Berrangé Jan. 6, 2021, 1:36 p.m. UTC | #3

On Wed, Jan 06, 2021 at 11:41:59AM +0000, Daniel P. Berrangé wrote:
> The Cirrus CI macOS build hosts have exhibited a serious performance
> degradation in recent months. For example the "qom-test" qtest takes
> over an hour for only the qemu-system-aarch64 binary. This is as much
> 20-40 times slower than other environments. The other qtests all show
> similar performance degradation, as do many of the unit tests.
> 
> This does not appear related to QEMU code changes, since older git
> commits which were known to fully complete in less than 1 hour on
> Cirrus CI now also show similar bad performance. Either Cirrus CI
> performance has degraded, or an change in its environment has exposed
> a latent bug widely affecting all of QEMU. Debugging the qom-test
> showed no easily identified large bottleneck - every step of the
> test execution was simply slower.

It appears I might be mistaken here. On IRC it was reported that
going back furrther to v5.1.0 shows good performance in Cirrus
still.

I had only gone back as far as 2a5a79d1b57280edd72193f6031de3feb682154e
which I thought was fast originally.

So somewhere between v5.1.0 and 2a5a79 we apparently regressed.

Regards,
Daniel

Alex Bennée Jan. 6, 2021, 1:51 p.m. UTC | #4

Daniel P. Berrangé <berrange@redhat.com> writes:

> The Cirrus CI macOS build hosts have exhibited a serious performance
> degradation in recent months. For example the "qom-test" qtest takes
> over an hour for only the qemu-system-aarch64 binary. This is as much
> 20-40 times slower than other environments. The other qtests all show
> similar performance degradation, as do many of the unit tests.
>
> This does not appear related to QEMU code changes, since older git
> commits which were known to fully complete in less than 1 hour on
> Cirrus CI now also show similar bad performance. Either Cirrus CI
> performance has degraded, or an change in its environment has exposed
> a latent bug widely affecting all of QEMU. Debugging the qom-test
> showed no easily identified large bottleneck - every step of the
> test execution was simply slower.
>
> macOS builds/tests run outside Cirrus CI show normal performance.
>
> With an inability to identify any obvious problem, the only viable
> way to get a reliably completing Cirrus CI macOS job is to cut out
> almost all of the qtests. We choose to run the x86_64 target only,
> since that has very few machine types and thus is least badly
> impacted in the qom-test execution.
>
> With this change, the macOS jobs complete in approx 35 minutes.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Queued to testing/next, thanks.

Mark Cave-Ayland Jan. 7, 2021, 12:23 p.m. UTC | #5

On 06/01/2021 13:36, Daniel P. Berrangé wrote:

>> The Cirrus CI macOS build hosts have exhibited a serious performance
>> degradation in recent months. For example the "qom-test" qtest takes
>> over an hour for only the qemu-system-aarch64 binary. This is as much
>> 20-40 times slower than other environments. The other qtests all show
>> similar performance degradation, as do many of the unit tests.
>>
>> This does not appear related to QEMU code changes, since older git
>> commits which were known to fully complete in less than 1 hour on
>> Cirrus CI now also show similar bad performance. Either Cirrus CI
>> performance has degraded, or an change in its environment has exposed
>> a latent bug widely affecting all of QEMU. Debugging the qom-test
>> showed no easily identified large bottleneck - every step of the
>> test execution was simply slower.
> 
> It appears I might be mistaken here. On IRC it was reported that
> going back furrther to v5.1.0 shows good performance in Cirrus
> still.
> 
> I had only gone back as far as 2a5a79d1b57280edd72193f6031de3feb682154e
> which I thought was fast originally.
> 
> So somewhere between v5.1.0 and 2a5a79 we apparently regressed.

I tested a few macos cirrus-ci builds after the meson conversion and found that they 
were working fine, so whatever is affecting the macos build must be related to a QEMU 
change.

A full bisect proved to be too tricky due to the instability of the tree at that 
point in time, however reading through "git log" and attempting some builds at merges 
I thought might be related I discovered that the slowness was introduced by this PR:


commit b7092cda1b36ce687e65ab1831346f9529b781b8
Merge: 497d415d76 eb94b81a94
Author: Peter Maydell <peter.maydell@linaro.org>
Date:   Fri Oct 9 13:20:46 2020 +0100

     Merge remote-tracking branch 'remotes/armbru/tags/pull-monitor-2020-10-09' into 
staging

     Monitor patches for 2020-10-09

     # gpg: Signature made Fri 09 Oct 2020 06:16:51 BST
     # gpg:                using RSA key 354BC8B3D7EB2A6B68674E5F3870B400EB918653
     # gpg:                issuer "armbru@redhat.com"
     # gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" [full]
     # gpg:                 aka "Markus Armbruster <armbru@pond.sub.org>" [full]
     # Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867  4E5F 3870 B400 EB91 8653

     * remotes/armbru/tags/pull-monitor-2020-10-09:
       block: Convert 'block_resize' to coroutine
       block: Add bdrv_lock()/unlock()
       block: Add bdrv_co_enter()/leave()
       util/async: Add aio_co_reschedule_self()
       hmp: Add support for coroutine command handlers
       qmp: Move dispatcher to a coroutine
       qapi: Add a 'coroutine' flag for commands
       monitor: Make current monitor a per-coroutine property
       qmp: Call monitor_set_cur() only in qmp_dispatch()
       qmp: Assert that no other monitor is active
       hmp: Update current monitor only in handle_hmp_command()
       monitor: Use getter/setter functions for cur_mon
       monitor: Add Monitor parameter to monitor_get_cpu_index()
       monitor: Add Monitor parameter to monitor_set_cpu()

     Signed-off-by: Peter Maydell <peter.maydell@linaro.org>


Fortunately that PR could be bisected and that led me this commit:


9ce44e2ce267caf5559904a201aa1986b0a8326b is the first bad commit
commit 9ce44e2ce267caf5559904a201aa1986b0a8326b
Author: Kevin Wolf <kwolf@redhat.com>
Date:   Mon Oct 5 17:58:50 2020 +0200

     qmp: Move dispatcher to a coroutine

     This moves the QMP dispatcher to a coroutine and runs all QMP command
     handlers that declare 'coroutine': true in coroutine context so they
     can avoid blocking the main loop while doing I/O or waiting for other
     events.

     For commands that are not declared safe to run in a coroutine, the
     dispatcher drops out of coroutine context by calling the QMP command
     handler from a bottom half.

     Signed-off-by: Kevin Wolf <kwolf@redhat.com>
     Reviewed-by: Markus Armbruster <armbru@redhat.com>
     Message-Id: <20201005155855.256490-10-kwolf@redhat.com>
     Reviewed-by: Markus Armbruster <armbru@redhat.com>
     Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
     Signed-off-by: Markus Armbruster <armbru@redhat.com>


Given that Peter can run the tests manually, I'm not exactly sure why the cirrus-ci 
environment is different - all I can think of is that it could be related to running 
in a headless terminal.

For reference running cirrus-ci on the last good commit 04f22362f1 "qapi: Add a 
'coroutine' flag for commands" gave me a total runtime of 35 mins.


ATB,

Mark.

Markus Armbruster Jan. 15, 2021, 1:10 p.m. UTC | #6

Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 06/01/2021 13:36, Daniel P. Berrangé wrote:
>
>>> The Cirrus CI macOS build hosts have exhibited a serious performance
>>> degradation in recent months. For example the "qom-test" qtest takes
>>> over an hour for only the qemu-system-aarch64 binary. This is as much
>>> 20-40 times slower than other environments. The other qtests all show
>>> similar performance degradation, as do many of the unit tests.
>>>
>>> This does not appear related to QEMU code changes, since older git
>>> commits which were known to fully complete in less than 1 hour on
>>> Cirrus CI now also show similar bad performance. Either Cirrus CI
>>> performance has degraded, or an change in its environment has exposed
>>> a latent bug widely affecting all of QEMU. Debugging the qom-test
>>> showed no easily identified large bottleneck - every step of the
>>> test execution was simply slower.
>> It appears I might be mistaken here. On IRC it was reported that
>> going back furrther to v5.1.0 shows good performance in Cirrus
>> still.
>> I had only gone back as far as
>> 2a5a79d1b57280edd72193f6031de3feb682154e
>> which I thought was fast originally.
>> So somewhere between v5.1.0 and 2a5a79 we apparently regressed.
>
> I tested a few macos cirrus-ci builds after the meson conversion and
> found that they were working fine, so whatever is affecting the macos
> build must be related to a QEMU change.
>
> A full bisect proved to be too tricky due to the instability of the
> tree at that point in time, however reading through "git log" and
> attempting some builds at merges I thought might be related I
> discovered that the slowness was introduced by this PR:
>
>
> commit b7092cda1b36ce687e65ab1831346f9529b781b8
> Merge: 497d415d76 eb94b81a94
> Author: Peter Maydell <peter.maydell@linaro.org>
> Date:   Fri Oct 9 13:20:46 2020 +0100
>
>     Merge remote-tracking branch
>     'remotes/armbru/tags/pull-monitor-2020-10-09' into staging
>
>     Monitor patches for 2020-10-09
[...]
>
> Fortunately that PR could be bisected and that led me this commit:
>
>
> 9ce44e2ce267caf5559904a201aa1986b0a8326b is the first bad commit
> commit 9ce44e2ce267caf5559904a201aa1986b0a8326b
> Author: Kevin Wolf <kwolf@redhat.com>
> Date:   Mon Oct 5 17:58:50 2020 +0200
>
>     qmp: Move dispatcher to a coroutine
>
>     This moves the QMP dispatcher to a coroutine and runs all QMP command
>     handlers that declare 'coroutine': true in coroutine context so they
>     can avoid blocking the main loop while doing I/O or waiting for other
>     events.
>
>     For commands that are not declared safe to run in a coroutine, the
>     dispatcher drops out of coroutine context by calling the QMP command
>     handler from a bottom half.
>
>     Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>     Reviewed-by: Markus Armbruster <armbru@redhat.com>
>     Message-Id: <20201005155855.256490-10-kwolf@redhat.com>
>     Reviewed-by: Markus Armbruster <armbru@redhat.com>
>     Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>     Signed-off-by: Markus Armbruster <armbru@redhat.com>
>
>
> Given that Peter can run the tests manually, I'm not exactly sure why
> the cirrus-ci environment is different - all I can think of is that it
> could be related to running in a headless terminal.
>
> For reference running cirrus-ci on the last good commit 04f22362f1
> "qapi: Add a 'coroutine' flag for commands" gave me a total runtime of
> 35 mins.

Let me explain what the patch does, in the hope of helping somebody else
find out why it behaves badly on MacOS.  Unfortunately, a proper
explanation requires quite some context.  Bear with me.

In the beginning, there was just one monitor, and it ran entirely in the
main loop (callback when input is available).  To keep the main loop
going, monitor commands better complete quickly.

Then we got multiple monitors.  Same story, just multiple input streams,
each with a callback.

We also got additional threads.  When I say "main loop", I mean the main
thread's main loop.

"Must complete quickly" means no blocking I/O and such.  Writing code
that way is somewhere between hard and impractical.  Much code called by
monitor commands wasn't written that way.

When a monitor command blocks, the main loop blocks, and that means no
more monitor commands can run, not even on other monitors.

"Doctor, doctor, running code in the main loop hurts".  Sadly, the
doctor's recommended remedy "don't do that then" is really hard to
apply: a lot of code has been written assuming "running in the main
loop, with the big QEMU lock held".

The first small step towards it was taken to enable the "out-of-band"
feature.  We moved the QMP monitor proper out of the main loop into a
monitor I/O thread.  The monitor commands get funneled to the main loop.
Instead of the main loop calling the monitor when a file descriptor has
input, it now calls the command dispatcher when a funnel queue has a
command.

Why bother?  Because now we can thread execute special "out-of-band"
commands right away, in the I/O thread, regardless of how badly the main
loop is.  Peter Xu wanted this badly enough for postcopy recovery to
code it up.  It was hard.  It's not generally useful, as the restriction
on what OOB commands can do are severe.

The next step was the coroutine feature.  Quite a few of the problematic
monitor commands are actually running coroutine-capable code: when
running in coroutine context, the code yields instead of blocking.
Running such commands in monitor context improves things from "blocks
the main loop" to "blocks all monitor commands".

Sadly, code exists that falls apart in coroutine context.  So we had to
make running in coroutine context opt-in.  Right now only two commands
opt in: block_resize and screendump.  Hopefully, we can get to the point
where most or all do.

Until all do, the dispatcher needs to run some commands coroutine
context, and others outside coroutine context.  How?

We've finally reached commit 9ce44e2ce2 "qmp: Move dispatcher to a
coroutine".

The QMP command dispatcher runs in a coroutine in the main loop (HMP
works differently, but let's ignore it here).

If the command can be executed in coroutine context, the dispatcher
calls its handler as before.  Right now, we take this path just for
block_resize and screendump.

Else, we create a bottom half that calls the handler, and schedule it to
run in the main loop.  Right now, we take this path for all the other
commands.

Hypothesis: on MacOS, taking this path is s-l-o-w.  Perhaps creating
bottom halves takes ages.  Perhaps the delay until a scheduled bottom
half actually runs is huge.

Possibly useful experiment: find out which QMP commands the slow test
case runs a lot, and mark them 'coroutine': true in the QAPI schema.

Another hypothesis: entering and leaving the QMP dispatcher coroutine is
slow.  If yes, all coroutines are probably slow, which hurts a lot more
than just QMP.

Hope this helps.

cirrus: don't run full qtest on macOS

Commit Message

Comments

Patch