Message ID | 20210106114159.981538-1-berrange@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cirrus: don't run full qtest on macOS | expand |
On 06/01/2021 12.41, Daniel P. Berrangé wrote: > The Cirrus CI macOS build hosts have exhibited a serious performance > degradation in recent months. For example the "qom-test" qtest takes > over an hour for only the qemu-system-aarch64 binary. This is as much > 20-40 times slower than other environments. The other qtests all show > similar performance degradation, as do many of the unit tests. > > This does not appear related to QEMU code changes, since older git > commits which were known to fully complete in less than 1 hour on > Cirrus CI now also show similar bad performance. Either Cirrus CI > performance has degraded, or an change in its environment has exposed > a latent bug widely affecting all of QEMU. Debugging the qom-test > showed no easily identified large bottleneck - every step of the > test execution was simply slower. > > macOS builds/tests run outside Cirrus CI show normal performance. > > With an inability to identify any obvious problem, the only viable > way to get a reliably completing Cirrus CI macOS job is to cut out > almost all of the qtests. We choose to run the x86_64 target only, > since that has very few machine types and thus is least badly > impacted in the qom-test execution. Sad that we have to do this, but as long as we still can run the x86 qtests, that's certainly better than nothing. Reviewed-by: Thomas Huth <thuth@redhat.com>
On Wed, Jan 6, 2021 at 8:44 AM Daniel P. Berrangé <berrange@redhat.com> wrote: > > The Cirrus CI macOS build hosts have exhibited a serious performance > degradation in recent months. For example the "qom-test" qtest takes > over an hour for only the qemu-system-aarch64 binary. This is as much > 20-40 times slower than other environments. The other qtests all show > similar performance degradation, as do many of the unit tests. > > This does not appear related to QEMU code changes, since older git > commits which were known to fully complete in less than 1 hour on > Cirrus CI now also show similar bad performance. Either Cirrus CI > performance has degraded, or an change in its environment has exposed > a latent bug widely affecting all of QEMU. Debugging the qom-test > showed no easily identified large bottleneck - every step of the > test execution was simply slower. > > macOS builds/tests run outside Cirrus CI show normal performance. > > With an inability to identify any obvious problem, the only viable > way to get a reliably completing Cirrus CI macOS job is to cut out > almost all of the qtests. We choose to run the x86_64 target only, > since that has very few machine types and thus is least badly > impacted in the qom-test execution. > > With this change, the macOS jobs complete in approx 35 minutes. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> > --- > .cirrus.yml | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > Reviewed-by: Willian Rampazzo <willianr@redhat.com>
On Wed, Jan 06, 2021 at 11:41:59AM +0000, Daniel P. Berrangé wrote: > The Cirrus CI macOS build hosts have exhibited a serious performance > degradation in recent months. For example the "qom-test" qtest takes > over an hour for only the qemu-system-aarch64 binary. This is as much > 20-40 times slower than other environments. The other qtests all show > similar performance degradation, as do many of the unit tests. > > This does not appear related to QEMU code changes, since older git > commits which were known to fully complete in less than 1 hour on > Cirrus CI now also show similar bad performance. Either Cirrus CI > performance has degraded, or an change in its environment has exposed > a latent bug widely affecting all of QEMU. Debugging the qom-test > showed no easily identified large bottleneck - every step of the > test execution was simply slower. It appears I might be mistaken here. On IRC it was reported that going back furrther to v5.1.0 shows good performance in Cirrus still. I had only gone back as far as 2a5a79d1b57280edd72193f6031de3feb682154e which I thought was fast originally. So somewhere between v5.1.0 and 2a5a79 we apparently regressed. Regards, Daniel
Daniel P. Berrangé <berrange@redhat.com> writes: > The Cirrus CI macOS build hosts have exhibited a serious performance > degradation in recent months. For example the "qom-test" qtest takes > over an hour for only the qemu-system-aarch64 binary. This is as much > 20-40 times slower than other environments. The other qtests all show > similar performance degradation, as do many of the unit tests. > > This does not appear related to QEMU code changes, since older git > commits which were known to fully complete in less than 1 hour on > Cirrus CI now also show similar bad performance. Either Cirrus CI > performance has degraded, or an change in its environment has exposed > a latent bug widely affecting all of QEMU. Debugging the qom-test > showed no easily identified large bottleneck - every step of the > test execution was simply slower. > > macOS builds/tests run outside Cirrus CI show normal performance. > > With an inability to identify any obvious problem, the only viable > way to get a reliably completing Cirrus CI macOS job is to cut out > almost all of the qtests. We choose to run the x86_64 target only, > since that has very few machine types and thus is least badly > impacted in the qom-test execution. > > With this change, the macOS jobs complete in approx 35 minutes. > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Queued to testing/next, thanks.
On 06/01/2021 13:36, Daniel P. Berrangé wrote: >> The Cirrus CI macOS build hosts have exhibited a serious performance >> degradation in recent months. For example the "qom-test" qtest takes >> over an hour for only the qemu-system-aarch64 binary. This is as much >> 20-40 times slower than other environments. The other qtests all show >> similar performance degradation, as do many of the unit tests. >> >> This does not appear related to QEMU code changes, since older git >> commits which were known to fully complete in less than 1 hour on >> Cirrus CI now also show similar bad performance. Either Cirrus CI >> performance has degraded, or an change in its environment has exposed >> a latent bug widely affecting all of QEMU. Debugging the qom-test >> showed no easily identified large bottleneck - every step of the >> test execution was simply slower. > > It appears I might be mistaken here. On IRC it was reported that > going back furrther to v5.1.0 shows good performance in Cirrus > still. > > I had only gone back as far as 2a5a79d1b57280edd72193f6031de3feb682154e > which I thought was fast originally. > > So somewhere between v5.1.0 and 2a5a79 we apparently regressed. I tested a few macos cirrus-ci builds after the meson conversion and found that they were working fine, so whatever is affecting the macos build must be related to a QEMU change. A full bisect proved to be too tricky due to the instability of the tree at that point in time, however reading through "git log" and attempting some builds at merges I thought might be related I discovered that the slowness was introduced by this PR: commit b7092cda1b36ce687e65ab1831346f9529b781b8 Merge: 497d415d76 eb94b81a94 Author: Peter Maydell <peter.maydell@linaro.org> Date: Fri Oct 9 13:20:46 2020 +0100 Merge remote-tracking branch 'remotes/armbru/tags/pull-monitor-2020-10-09' into staging Monitor patches for 2020-10-09 # gpg: Signature made Fri 09 Oct 2020 06:16:51 BST # gpg: using RSA key 354BC8B3D7EB2A6B68674E5F3870B400EB918653 # gpg: issuer "armbru@redhat.com" # gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" [full] # gpg: aka "Markus Armbruster <armbru@pond.sub.org>" [full] # Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867 4E5F 3870 B400 EB91 8653 * remotes/armbru/tags/pull-monitor-2020-10-09: block: Convert 'block_resize' to coroutine block: Add bdrv_lock()/unlock() block: Add bdrv_co_enter()/leave() util/async: Add aio_co_reschedule_self() hmp: Add support for coroutine command handlers qmp: Move dispatcher to a coroutine qapi: Add a 'coroutine' flag for commands monitor: Make current monitor a per-coroutine property qmp: Call monitor_set_cur() only in qmp_dispatch() qmp: Assert that no other monitor is active hmp: Update current monitor only in handle_hmp_command() monitor: Use getter/setter functions for cur_mon monitor: Add Monitor parameter to monitor_get_cpu_index() monitor: Add Monitor parameter to monitor_set_cpu() Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Fortunately that PR could be bisected and that led me this commit: 9ce44e2ce267caf5559904a201aa1986b0a8326b is the first bad commit commit 9ce44e2ce267caf5559904a201aa1986b0a8326b Author: Kevin Wolf <kwolf@redhat.com> Date: Mon Oct 5 17:58:50 2020 +0200 qmp: Move dispatcher to a coroutine This moves the QMP dispatcher to a coroutine and runs all QMP command handlers that declare 'coroutine': true in coroutine context so they can avoid blocking the main loop while doing I/O or waiting for other events. For commands that are not declared safe to run in a coroutine, the dispatcher drops out of coroutine context by calling the QMP command handler from a bottom half. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20201005155855.256490-10-kwolf@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Given that Peter can run the tests manually, I'm not exactly sure why the cirrus-ci environment is different - all I can think of is that it could be related to running in a headless terminal. For reference running cirrus-ci on the last good commit 04f22362f1 "qapi: Add a 'coroutine' flag for commands" gave me a total runtime of 35 mins. ATB, Mark.
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes: > On 06/01/2021 13:36, Daniel P. Berrangé wrote: > >>> The Cirrus CI macOS build hosts have exhibited a serious performance >>> degradation in recent months. For example the "qom-test" qtest takes >>> over an hour for only the qemu-system-aarch64 binary. This is as much >>> 20-40 times slower than other environments. The other qtests all show >>> similar performance degradation, as do many of the unit tests. >>> >>> This does not appear related to QEMU code changes, since older git >>> commits which were known to fully complete in less than 1 hour on >>> Cirrus CI now also show similar bad performance. Either Cirrus CI >>> performance has degraded, or an change in its environment has exposed >>> a latent bug widely affecting all of QEMU. Debugging the qom-test >>> showed no easily identified large bottleneck - every step of the >>> test execution was simply slower. >> It appears I might be mistaken here. On IRC it was reported that >> going back furrther to v5.1.0 shows good performance in Cirrus >> still. >> I had only gone back as far as >> 2a5a79d1b57280edd72193f6031de3feb682154e >> which I thought was fast originally. >> So somewhere between v5.1.0 and 2a5a79 we apparently regressed. > > I tested a few macos cirrus-ci builds after the meson conversion and > found that they were working fine, so whatever is affecting the macos > build must be related to a QEMU change. > > A full bisect proved to be too tricky due to the instability of the > tree at that point in time, however reading through "git log" and > attempting some builds at merges I thought might be related I > discovered that the slowness was introduced by this PR: > > > commit b7092cda1b36ce687e65ab1831346f9529b781b8 > Merge: 497d415d76 eb94b81a94 > Author: Peter Maydell <peter.maydell@linaro.org> > Date: Fri Oct 9 13:20:46 2020 +0100 > > Merge remote-tracking branch > 'remotes/armbru/tags/pull-monitor-2020-10-09' into staging > > Monitor patches for 2020-10-09 [...] > > Fortunately that PR could be bisected and that led me this commit: > > > 9ce44e2ce267caf5559904a201aa1986b0a8326b is the first bad commit > commit 9ce44e2ce267caf5559904a201aa1986b0a8326b > Author: Kevin Wolf <kwolf@redhat.com> > Date: Mon Oct 5 17:58:50 2020 +0200 > > qmp: Move dispatcher to a coroutine > > This moves the QMP dispatcher to a coroutine and runs all QMP command > handlers that declare 'coroutine': true in coroutine context so they > can avoid blocking the main loop while doing I/O or waiting for other > events. > > For commands that are not declared safe to run in a coroutine, the > dispatcher drops out of coroutine context by calling the QMP command > handler from a bottom half. > > Signed-off-by: Kevin Wolf <kwolf@redhat.com> > Reviewed-by: Markus Armbruster <armbru@redhat.com> > Message-Id: <20201005155855.256490-10-kwolf@redhat.com> > Reviewed-by: Markus Armbruster <armbru@redhat.com> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Markus Armbruster <armbru@redhat.com> > > > Given that Peter can run the tests manually, I'm not exactly sure why > the cirrus-ci environment is different - all I can think of is that it > could be related to running in a headless terminal. > > For reference running cirrus-ci on the last good commit 04f22362f1 > "qapi: Add a 'coroutine' flag for commands" gave me a total runtime of > 35 mins. Let me explain what the patch does, in the hope of helping somebody else find out why it behaves badly on MacOS. Unfortunately, a proper explanation requires quite some context. Bear with me. In the beginning, there was just one monitor, and it ran entirely in the main loop (callback when input is available). To keep the main loop going, monitor commands better complete quickly. Then we got multiple monitors. Same story, just multiple input streams, each with a callback. We also got additional threads. When I say "main loop", I mean the main thread's main loop. "Must complete quickly" means no blocking I/O and such. Writing code that way is somewhere between hard and impractical. Much code called by monitor commands wasn't written that way. When a monitor command blocks, the main loop blocks, and that means no more monitor commands can run, not even on other monitors. "Doctor, doctor, running code in the main loop hurts". Sadly, the doctor's recommended remedy "don't do that then" is really hard to apply: a lot of code has been written assuming "running in the main loop, with the big QEMU lock held". The first small step towards it was taken to enable the "out-of-band" feature. We moved the QMP monitor proper out of the main loop into a monitor I/O thread. The monitor commands get funneled to the main loop. Instead of the main loop calling the monitor when a file descriptor has input, it now calls the command dispatcher when a funnel queue has a command. Why bother? Because now we can thread execute special "out-of-band" commands right away, in the I/O thread, regardless of how badly the main loop is. Peter Xu wanted this badly enough for postcopy recovery to code it up. It was hard. It's not generally useful, as the restriction on what OOB commands can do are severe. The next step was the coroutine feature. Quite a few of the problematic monitor commands are actually running coroutine-capable code: when running in coroutine context, the code yields instead of blocking. Running such commands in monitor context improves things from "blocks the main loop" to "blocks all monitor commands". Sadly, code exists that falls apart in coroutine context. So we had to make running in coroutine context opt-in. Right now only two commands opt in: block_resize and screendump. Hopefully, we can get to the point where most or all do. Until all do, the dispatcher needs to run some commands coroutine context, and others outside coroutine context. How? We've finally reached commit 9ce44e2ce2 "qmp: Move dispatcher to a coroutine". The QMP command dispatcher runs in a coroutine in the main loop (HMP works differently, but let's ignore it here). If the command can be executed in coroutine context, the dispatcher calls its handler as before. Right now, we take this path just for block_resize and screendump. Else, we create a bottom half that calls the handler, and schedule it to run in the main loop. Right now, we take this path for all the other commands. Hypothesis: on MacOS, taking this path is s-l-o-w. Perhaps creating bottom halves takes ages. Perhaps the delay until a scheduled bottom half actually runs is huge. Possibly useful experiment: find out which QMP commands the slow test case runs a lot, and mark them 'coroutine': true in the QAPI schema. Another hypothesis: entering and leaving the QMP dispatcher coroutine is slow. If yes, all coroutines are probably slow, which hurts a lot more than just QMP. Hope this helps.
diff --git a/.cirrus.yml b/.cirrus.yml index 62a9b57530..3907e036da 100644 --- a/.cirrus.yml +++ b/.cirrus.yml @@ -18,7 +18,6 @@ freebsd_12_task: - gmake -j$(sysctl -n hw.ncpu) check V=1 macos_task: - timeout_in: 90m osx_instance: image: catalina-base install_script: @@ -30,10 +29,13 @@ macos_task: --extra-cflags='-Wno-error=deprecated-declarations' || { cat config.log meson-logs/meson-log.txt; exit 1; } - gmake -j$(sysctl -n hw.ncpu) - - gmake check V=1 + - gmake check-unit V=1 + - gmake check-block V=1 + - gmake check-qapi-schema V=1 + - gmake check-softfloat V=1 + - gmake check-qtest-x86_64 V=1 macos_xcode_task: - timeout_in: 90m osx_instance: # this is an alias for the latest Xcode image: catalina-xcode @@ -45,7 +47,11 @@ macos_xcode_task: - ../configure --extra-cflags='-Wno-error=deprecated-declarations' --enable-modules --enable-werror --cc=clang || { cat config.log meson-logs/meson-log.txt; exit 1; } - gmake -j$(sysctl -n hw.ncpu) - - gmake check V=1 + - gmake check-unit V=1 + - gmake check-block V=1 + - gmake check-qapi-schema V=1 + - gmake check-softfloat V=1 + - gmake check-qtest-x86_64 V=1 windows_msys2_task: timeout_in: 90m
The Cirrus CI macOS build hosts have exhibited a serious performance degradation in recent months. For example the "qom-test" qtest takes over an hour for only the qemu-system-aarch64 binary. This is as much 20-40 times slower than other environments. The other qtests all show similar performance degradation, as do many of the unit tests. This does not appear related to QEMU code changes, since older git commits which were known to fully complete in less than 1 hour on Cirrus CI now also show similar bad performance. Either Cirrus CI performance has degraded, or an change in its environment has exposed a latent bug widely affecting all of QEMU. Debugging the qom-test showed no easily identified large bottleneck - every step of the test execution was simply slower. macOS builds/tests run outside Cirrus CI show normal performance. With an inability to identify any obvious problem, the only viable way to get a reliably completing Cirrus CI macOS job is to cut out almost all of the qtests. We choose to run the x86_64 target only, since that has very few machine types and thus is least badly impacted in the qom-test execution. With this change, the macOS jobs complete in approx 35 minutes. Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> --- .cirrus.yml | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-)