Message ID | 20240305141214.707046-1-nrb@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [kvm-unit-tests,v1] arch-run: Wait for incoming socket being removed | expand |
On Tue, Mar 05, 2024 at 03:11 PM +0100, Nico Boehr <nrb@linux.ibm.com> wrote: > Sometimes, QEMU needs a bit longer to remove the incoming migration > socket. This happens in some environments on s390x for the > migration-skey-sequential test. > > Instead of directly erroring out, wait for the removal of the socket. > > Signed-off-by: Nico Boehr <nrb@linux.ibm.com> > --- > scripts/arch-run.bash | 8 ++------ > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash > index 2214d940cf7d..413f3eda8cb8 100644 > --- a/scripts/arch-run.bash > +++ b/scripts/arch-run.bash > @@ -237,12 +237,8 @@ do_migration () > echo > ${dst_infifo} > rm ${dst_infifo} > > - # Ensure the incoming socket is removed, ready for next destination > - if [ -S ${dst_incoming} ] ; then > - echo "ERROR: Incoming migration socket not removed after migration." >& 2 > - qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null > - return 2 > - fi > + # Wait for the incoming socket being removed, ready for next destination > + while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done But now, you have removed the erroring out path completely. Maybe wait max. 3s and then bail out? […snip]
Quoting Marc Hartmayer (2024-03-05 19:12:16) [...] > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash > > index 2214d940cf7d..413f3eda8cb8 100644 > > --- a/scripts/arch-run.bash > > +++ b/scripts/arch-run.bash > > @@ -237,12 +237,8 @@ do_migration () > > echo > ${dst_infifo} > > rm ${dst_infifo} > > > > - # Ensure the incoming socket is removed, ready for next destination > > - if [ -S ${dst_incoming} ] ; then > > - echo "ERROR: Incoming migration socket not removed after migration." >& 2 > > - qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null > > - return 2 > > - fi > > + # Wait for the incoming socket being removed, ready for next destination > > + while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done > > But now, you have removed the erroring out path completely. Maybe wait > max. 3s and then bail out? Well, I was considering that, but: - I'm not a huge fan of fine-grained timeouts. Fine-tuning a gazillion timeouts is not a fun task, I think you know what I'm talking about :) - a number of other places that can potentially get stuck also don't have proper timeouts (like waiting for the QMP socket or the migration socket), so for a proper solution we'd need to touch a lot of other places... What I think we really want is a migration timeout. That isn't quite simple since we can't easily pull $(timeout_cmd) before $(panic_cmd) and $(migration_cmd) in run-scripts... My suggestion: let's fix this issue and work on the timeout as a seperate fix.
On Wed Mar 6, 2024 at 11:03 PM AEST, Nico Boehr wrote: > Quoting Marc Hartmayer (2024-03-05 19:12:16) > [...] > > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash > > > index 2214d940cf7d..413f3eda8cb8 100644 > > > --- a/scripts/arch-run.bash > > > +++ b/scripts/arch-run.bash > > > @@ -237,12 +237,8 @@ do_migration () > > > echo > ${dst_infifo} > > > rm ${dst_infifo} > > > > > > - # Ensure the incoming socket is removed, ready for next destination > > > - if [ -S ${dst_incoming} ] ; then > > > - echo "ERROR: Incoming migration socket not removed after migration." >& 2 > > > - qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null > > > - return 2 > > > - fi > > > + # Wait for the incoming socket being removed, ready for next destination > > > + while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done > > > > But now, you have removed the erroring out path completely. Maybe wait > > max. 3s and then bail out? > > Well, I was considering that, but: > - I'm not a huge fan of fine-grained timeouts. Fine-tuning a gazillion > timeouts is not a fun task, I think you know what I'm talking about :) > - a number of other places that can potentially get stuck also don't have > proper timeouts (like waiting for the QMP socket or the migration > socket), so for a proper solution we'd need to touch a lot of other > places... > > What I think we really want is a migration timeout. That isn't quite simple > since we can't easily pull $(timeout_cmd) before $(panic_cmd) and > $(migration_cmd) in run-scripts... > > My suggestion: let's fix this issue and work on the timeout as a seperate > fix. The migration tests as a whole have big trouble with timeouts already. The problem is timeouts are implemented with the 'timeout' command but that is specific to the QEMU process so especially the migration harness with lots of loops can easily hang. I tried a few ways to address this like starting a background 'sleep ; kill' shell, but that gets very complicated to handle interrupts properly that kill the stuck bits and having the harness report the error sanely. I'm thinking a subshell that runs the entire test case and start *that* with 'timeout' might be a better approach. So I agree, let's take patches that fix behaviour when there are no timeouts, and address the timeout problem as a whole as a separate effort rather than worrying too much about individual loops just yet. (It's a fair review comment to ask though). Thanks, Nick
On Wed Mar 6, 2024 at 12:11 AM AEST, Nico Boehr wrote: > Sometimes, QEMU needs a bit longer to remove the incoming migration > socket. This happens in some environments on s390x for the > migration-skey-sequential test. > > Instead of directly erroring out, wait for the removal of the socket. > > Signed-off-by: Nico Boehr <nrb@linux.ibm.com> It's interesting that the incoming socket is not removed after QMP says migration complete. I guess that's by design, but have you checked the QEMU code whether it's intentional? I guess it's code like this - in migration/migration.c /* * This must happen after any state changes since as soon as an external * observer sees this event they might start to prod at the VM assuming * it's ready to use. */ migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE, MIGRATION_STATUS_COMPLETED); migration_incoming_state_destroy(); So, it looks like a good fix. And probably not just s390x specific it might be just unlucky timing. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> > --- > scripts/arch-run.bash | 8 ++------ > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash > index 2214d940cf7d..413f3eda8cb8 100644 > --- a/scripts/arch-run.bash > +++ b/scripts/arch-run.bash > @@ -237,12 +237,8 @@ do_migration () > echo > ${dst_infifo} > rm ${dst_infifo} > > - # Ensure the incoming socket is removed, ready for next destination > - if [ -S ${dst_incoming} ] ; then > - echo "ERROR: Incoming migration socket not removed after migration." >& 2 > - qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null > - return 2 > - fi > + # Wait for the incoming socket being removed, ready for next destination > + while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done > > wait ${live_pid} > ret=$?
On 12/03/2024 07.37, Nicholas Piggin wrote: > On Wed Mar 6, 2024 at 12:11 AM AEST, Nico Boehr wrote: >> Sometimes, QEMU needs a bit longer to remove the incoming migration >> socket. This happens in some environments on s390x for the >> migration-skey-sequential test. >> >> Instead of directly erroring out, wait for the removal of the socket. >> >> Signed-off-by: Nico Boehr <nrb@linux.ibm.com> > > It's interesting that the incoming socket is not removed after > QMP says migration complete. I guess that's by design, but have > you checked the QEMU code whether it's intentional? > > I guess it's code like this - in migration/migration.c > > /* > * This must happen after any state changes since as soon as an external > * observer sees this event they might start to prod at the VM assuming > * it's ready to use. > */ > migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE, > MIGRATION_STATUS_COMPLETED); > migration_incoming_state_destroy(); > > So, it looks like a good fix. And probably not just s390x specific > it might be just unlucky timing. > > Reviewed-by: Nicholas Piggin <npiggin@gmail.com> All right, I pushed this as a fix now to the repository. Thomas
diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash index 2214d940cf7d..413f3eda8cb8 100644 --- a/scripts/arch-run.bash +++ b/scripts/arch-run.bash @@ -237,12 +237,8 @@ do_migration () echo > ${dst_infifo} rm ${dst_infifo} - # Ensure the incoming socket is removed, ready for next destination - if [ -S ${dst_incoming} ] ; then - echo "ERROR: Incoming migration socket not removed after migration." >& 2 - qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null - return 2 - fi + # Wait for the incoming socket being removed, ready for next destination + while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done wait ${live_pid} ret=$?
Sometimes, QEMU needs a bit longer to remove the incoming migration socket. This happens in some environments on s390x for the migration-skey-sequential test. Instead of directly erroring out, wait for the removal of the socket. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> --- scripts/arch-run.bash | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)