diff mbox series

[2/2] migration: failover: continue to wait card unplug on error

Message ID 20210629155007.629086-3-lvivier@redhat.com (mailing list archive)
State New, archived
Headers show
Series migration: failover: continue to wait card unplug on error | expand

Commit Message

Laurent Vivier June 29, 2021, 3:50 p.m. UTC
If the user cancels the migration in the unplug-wait state,
QEMU will try to plug back the card and this fails because the card
is partially unplugged.
To avoid the problem, continue to wait the card unplug, but to
allow the migration to be canceled if the card never finishes to unplug
use a timeout.

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 migration/migration.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Juan Quintela June 29, 2021, 5:50 p.m. UTC | #1
Laurent Vivier <lvivier@redhat.com> wrote:
> If the user cancels the migration in the unplug-wait state,
> QEMU will try to plug back the card and this fails because the card
> is partially unplugged.
> To avoid the problem, continue to wait the card unplug, but to
> allow the migration to be canceled if the card never finishes to unplug
> use a timeout.
>
> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  migration/migration.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 3e92c405a2b6..3b06d43a7f42 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3679,6 +3679,17 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>                 qemu_savevm_state_guest_unplug_pending()) {
>              qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>          }
> +        if (s->state != MIGRATION_STATUS_WAIT_UNPLUG) {
> +            int timeout = 120; /* 30 seconds */
> +            /*
> +             * migration has been canceled
> +             * but as we have started an unplug we must wait the end
> +             * to be able to plug back the card
> +             */
> +            while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
> +                qemu_sem_timedwait(&s->wait_unplug_sem, 250);
> +            }
> +        }
>  
>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>      } else {

I agree with the idea.  But if we are getting out due to timeout == 0,
shouldn't we return some error, warning, whatever?

Later, Juan.
Laurent Vivier June 30, 2021, 9:04 a.m. UTC | #2
On 29/06/2021 19:50, Juan Quintela wrote:
> Laurent Vivier <lvivier@redhat.com> wrote:
>> If the user cancels the migration in the unplug-wait state,
>> QEMU will try to plug back the card and this fails because the card
>> is partially unplugged.
>> To avoid the problem, continue to wait the card unplug, but to
>> allow the migration to be canceled if the card never finishes to unplug
>> use a timeout.
>>
>> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>> ---
>>  migration/migration.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3e92c405a2b6..3b06d43a7f42 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3679,6 +3679,17 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>>                 qemu_savevm_state_guest_unplug_pending()) {
>>              qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>>          }
>> +        if (s->state != MIGRATION_STATUS_WAIT_UNPLUG) {
>> +            int timeout = 120; /* 30 seconds */
>> +            /*
>> +             * migration has been canceled
>> +             * but as we have started an unplug we must wait the end
>> +             * to be able to plug back the card
>> +             */
>> +            while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
>> +                qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>> +            }
>> +        }
>>  
>>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>>      } else {
> I agree with the idea.  But if we are getting out due to timeout == 0,
> shouldn't we return some error, warning, whatever?

In that case, we keep the current behaviour: guest kernel will report an error when it
will try to plug back the card that has not been unplugged. This is a corner case: if it
happens we have something really wrong with the machine. Perhaps we can remove the
timeout, but I don't like to block the user, or increase it to be sure.

Thanks,

Laurent
Juan Quintela June 30, 2021, 9:13 a.m. UTC | #3
Laurent Vivier <lvivier@redhat.com> wrote:
> On 29/06/2021 19:50, Juan Quintela wrote:
>> Laurent Vivier <lvivier@redhat.com> wrote:
>>> If the user cancels the migration in the unplug-wait state,
>>> QEMU will try to plug back the card and this fails because the card
>>> is partially unplugged.
>>> To avoid the problem, continue to wait the card unplug, but to
>>> allow the migration to be canceled if the card never finishes to unplug
>>> use a timeout.
>>>
>>> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
>>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>>> ---
>>>  migration/migration.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 3e92c405a2b6..3b06d43a7f42 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3679,6 +3679,17 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>>>                 qemu_savevm_state_guest_unplug_pending()) {
>>>              qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>>>          }
>>> +        if (s->state != MIGRATION_STATUS_WAIT_UNPLUG) {
>>> +            int timeout = 120; /* 30 seconds */
>>> +            /*
>>> +             * migration has been canceled
>>> +             * but as we have started an unplug we must wait the end
>>> +             * to be able to plug back the card
>>> +             */
>>> +            while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
>>> +                qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>>> +            }
>>> +        }
>>>  
>>>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>>>      } else {
>> I agree with the idea.  But if we are getting out due to timeout == 0,
>> shouldn't we return some error, warning, whatever?
>
> In that case, we keep the current behaviour: guest kernel will report
> an error when it
> will try to plug back the card that has not been unplugged. This is a
> corner case: if it
> happens we have something really wrong with the machine. Perhaps we can remove the
> timeout, but I don't like to block the user, or increase it to be sure.

Oh, I whole agree that it is a corner case, and that it shouldn't
happen.

But if it happens, we don't log it anywhere.  That was my complaint.

Later, Juan.
Dr. David Alan Gilbert (git) June 30, 2021, 5:48 p.m. UTC | #4
* Laurent Vivier (lvivier@redhat.com) wrote:
> If the user cancels the migration in the unplug-wait state,
> QEMU will try to plug back the card and this fails because the card
> is partially unplugged.
> To avoid the problem, continue to wait the card unplug, but to
> allow the migration to be canceled if the card never finishes to unplug
> use a timeout.
> 
> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

I'll take this for now, but as Juan says, we could really do with some
diags when this happens, so when someone comes and tells us that
the hotplug has failed we can see.  Please send something to add it.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 3e92c405a2b6..3b06d43a7f42 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3679,6 +3679,17 @@ static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
>                 qemu_savevm_state_guest_unplug_pending()) {
>              qemu_sem_timedwait(&s->wait_unplug_sem, 250);
>          }
> +        if (s->state != MIGRATION_STATUS_WAIT_UNPLUG) {
> +            int timeout = 120; /* 30 seconds */
> +            /*
> +             * migration has been canceled
> +             * but as we have started an unplug we must wait the end
> +             * to be able to plug back the card
> +             */
> +            while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
> +                qemu_sem_timedwait(&s->wait_unplug_sem, 250);
> +            }
> +        }
>  
>          migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
>      } else {
> -- 
> 2.31.1
> 
>
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 3e92c405a2b6..3b06d43a7f42 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3679,6 +3679,17 @@  static void qemu_savevm_wait_unplug(MigrationState *s, int old_state,
                qemu_savevm_state_guest_unplug_pending()) {
             qemu_sem_timedwait(&s->wait_unplug_sem, 250);
         }
+        if (s->state != MIGRATION_STATUS_WAIT_UNPLUG) {
+            int timeout = 120; /* 30 seconds */
+            /*
+             * migration has been canceled
+             * but as we have started an unplug we must wait the end
+             * to be able to plug back the card
+             */
+            while (timeout-- && qemu_savevm_state_guest_unplug_pending()) {
+                qemu_sem_timedwait(&s->wait_unplug_sem, 250);
+            }
+        }
 
         migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG, new_state);
     } else {