diff mbox series

[kvm-unit-tests,v1] arch-run: Wait for incoming socket being removed

Message ID 20240305141214.707046-1-nrb@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series [kvm-unit-tests,v1] arch-run: Wait for incoming socket being removed | expand

Commit Message

Nico Boehr March 5, 2024, 2:11 p.m. UTC
Sometimes, QEMU needs a bit longer to remove the incoming migration
socket. This happens in some environments on s390x for the
migration-skey-sequential test.

Instead of directly erroring out, wait for the removal of the socket.

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
---
 scripts/arch-run.bash | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Comments

Marc Hartmayer March 5, 2024, 6:12 p.m. UTC | #1
On Tue, Mar 05, 2024 at 03:11 PM +0100, Nico Boehr <nrb@linux.ibm.com> wrote:
> Sometimes, QEMU needs a bit longer to remove the incoming migration
> socket. This happens in some environments on s390x for the
> migration-skey-sequential test.
>
> Instead of directly erroring out, wait for the removal of the socket.
>
> Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
> ---
>  scripts/arch-run.bash | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> index 2214d940cf7d..413f3eda8cb8 100644
> --- a/scripts/arch-run.bash
> +++ b/scripts/arch-run.bash
> @@ -237,12 +237,8 @@ do_migration ()
>  	echo > ${dst_infifo}
>  	rm ${dst_infifo}
>  
> -	# Ensure the incoming socket is removed, ready for next destination
> -	if [ -S ${dst_incoming} ] ; then
> -		echo "ERROR: Incoming migration socket not removed after migration." >& 2
> -		qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> -		return 2
> -	fi
> +	# Wait for the incoming socket being removed, ready for next destination
> +	while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done

But now, you have removed the erroring out path completely. Maybe wait
max. 3s and then bail out?

[…snip]
Nico Boehr March 6, 2024, 1:03 p.m. UTC | #2
Quoting Marc Hartmayer (2024-03-05 19:12:16)
[...]
> > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> > index 2214d940cf7d..413f3eda8cb8 100644
> > --- a/scripts/arch-run.bash
> > +++ b/scripts/arch-run.bash
> > @@ -237,12 +237,8 @@ do_migration ()
> >       echo > ${dst_infifo}
> >       rm ${dst_infifo}
> >  
> > -     # Ensure the incoming socket is removed, ready for next destination
> > -     if [ -S ${dst_incoming} ] ; then
> > -             echo "ERROR: Incoming migration socket not removed after migration." >& 2
> > -             qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> > -             return 2
> > -     fi
> > +     # Wait for the incoming socket being removed, ready for next destination
> > +     while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done
> 
> But now, you have removed the erroring out path completely. Maybe wait
> max. 3s and then bail out?

Well, I was considering that, but:
- I'm not a huge fan of fine-grained timeouts. Fine-tuning a gazillion
  timeouts is not a fun task, I think you know what I'm talking about :)
- a number of other places that can potentially get stuck also don't have
  proper timeouts (like waiting for the QMP socket or the migration
  socket), so for a proper solution we'd need to touch a lot of other
  places...

What I think we really want is a migration timeout. That isn't quite simple
since we can't easily pull $(timeout_cmd) before $(panic_cmd) and
$(migration_cmd) in run-scripts...

My suggestion: let's fix this issue and work on the timeout as a seperate
fix.
Nicholas Piggin March 12, 2024, 5:39 a.m. UTC | #3
On Wed Mar 6, 2024 at 11:03 PM AEST, Nico Boehr wrote:
> Quoting Marc Hartmayer (2024-03-05 19:12:16)
> [...]
> > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> > > index 2214d940cf7d..413f3eda8cb8 100644
> > > --- a/scripts/arch-run.bash
> > > +++ b/scripts/arch-run.bash
> > > @@ -237,12 +237,8 @@ do_migration ()
> > >       echo > ${dst_infifo}
> > >       rm ${dst_infifo}
> > >  
> > > -     # Ensure the incoming socket is removed, ready for next destination
> > > -     if [ -S ${dst_incoming} ] ; then
> > > -             echo "ERROR: Incoming migration socket not removed after migration." >& 2
> > > -             qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> > > -             return 2
> > > -     fi
> > > +     # Wait for the incoming socket being removed, ready for next destination
> > > +     while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done
> > 
> > But now, you have removed the erroring out path completely. Maybe wait
> > max. 3s and then bail out?
>
> Well, I was considering that, but:
> - I'm not a huge fan of fine-grained timeouts. Fine-tuning a gazillion
>   timeouts is not a fun task, I think you know what I'm talking about :)
> - a number of other places that can potentially get stuck also don't have
>   proper timeouts (like waiting for the QMP socket or the migration
>   socket), so for a proper solution we'd need to touch a lot of other
>   places...
>
> What I think we really want is a migration timeout. That isn't quite simple
> since we can't easily pull $(timeout_cmd) before $(panic_cmd) and
> $(migration_cmd) in run-scripts...
>
> My suggestion: let's fix this issue and work on the timeout as a seperate
> fix.

The migration tests as a whole have big trouble with timeouts already.
The problem is timeouts are implemented with the 'timeout' command but
that is specific to the QEMU process so especially the migration harness
with lots of loops can easily hang.

I tried a few ways to address this like starting a background 'sleep ;
kill' shell, but that gets very complicated to handle interrupts properly
that kill the stuck bits and having the harness report the error
sanely. I'm thinking a subshell that runs the entire test case and start
*that* with 'timeout' might be a better approach.

So I agree, let's take patches that fix behaviour when there are no
timeouts, and address the timeout problem as a whole as a separate
effort rather than worrying too much about individual loops just yet.
(It's a fair review comment to ask though).

Thanks,
Nick
Nicholas Piggin March 12, 2024, 6:37 a.m. UTC | #4
On Wed Mar 6, 2024 at 12:11 AM AEST, Nico Boehr wrote:
> Sometimes, QEMU needs a bit longer to remove the incoming migration
> socket. This happens in some environments on s390x for the
> migration-skey-sequential test.
>
> Instead of directly erroring out, wait for the removal of the socket.
>
> Signed-off-by: Nico Boehr <nrb@linux.ibm.com>

It's interesting that the incoming socket is not removed after
QMP says migration complete. I guess that's by design, but have
you checked the QEMU code whether it's intentional?

I guess it's code like this - in migration/migration.c

    /*
     * This must happen after any state changes since as soon as an external
     * observer sees this event they might start to prod at the VM assuming
     * it's ready to use.
     */
    migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
                      MIGRATION_STATUS_COMPLETED);
    migration_incoming_state_destroy();

So, it looks like a good fix. And probably not just s390x specific
it might be just unlucky timing.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  scripts/arch-run.bash | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> index 2214d940cf7d..413f3eda8cb8 100644
> --- a/scripts/arch-run.bash
> +++ b/scripts/arch-run.bash
> @@ -237,12 +237,8 @@ do_migration ()
>  	echo > ${dst_infifo}
>  	rm ${dst_infifo}
>  
> -	# Ensure the incoming socket is removed, ready for next destination
> -	if [ -S ${dst_incoming} ] ; then
> -		echo "ERROR: Incoming migration socket not removed after migration." >& 2
> -		qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> -		return 2
> -	fi
> +	# Wait for the incoming socket being removed, ready for next destination
> +	while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done
>  
>  	wait ${live_pid}
>  	ret=$?
Thomas Huth March 12, 2024, 6:45 a.m. UTC | #5
On 12/03/2024 07.37, Nicholas Piggin wrote:
> On Wed Mar 6, 2024 at 12:11 AM AEST, Nico Boehr wrote:
>> Sometimes, QEMU needs a bit longer to remove the incoming migration
>> socket. This happens in some environments on s390x for the
>> migration-skey-sequential test.
>>
>> Instead of directly erroring out, wait for the removal of the socket.
>>
>> Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
> 
> It's interesting that the incoming socket is not removed after
> QMP says migration complete. I guess that's by design, but have
> you checked the QEMU code whether it's intentional?
> 
> I guess it's code like this - in migration/migration.c
> 
>      /*
>       * This must happen after any state changes since as soon as an external
>       * observer sees this event they might start to prod at the VM assuming
>       * it's ready to use.
>       */
>      migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>                        MIGRATION_STATUS_COMPLETED);
>      migration_incoming_state_destroy();
> 
> So, it looks like a good fix. And probably not just s390x specific
> it might be just unlucky timing.
> 
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

All right, I pushed this as a fix now to the repository.

  Thomas
diff mbox series

Patch

diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index 2214d940cf7d..413f3eda8cb8 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -237,12 +237,8 @@  do_migration ()
 	echo > ${dst_infifo}
 	rm ${dst_infifo}
 
-	# Ensure the incoming socket is removed, ready for next destination
-	if [ -S ${dst_incoming} ] ; then
-		echo "ERROR: Incoming migration socket not removed after migration." >& 2
-		qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
-		return 2
-	fi
+	# Wait for the incoming socket being removed, ready for next destination
+	while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done
 
 	wait ${live_pid}
 	ret=$?