diff mbox series

[5/5] migration: Print expected-downtime on completion

Message ID 20230926161841.98464-6-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Headers show
Series migration: Downtime observability improvements | expand

Commit Message

Joao Martins Sept. 26, 2023, 4:18 p.m. UTC
Right now, migration statistics either print downtime or expected
downtime depending on migration completing of in progress. Also in the
beginning of migration by printing the downtime limit as expected
downtime, when estimation is not available.

The pending_size is private in migration iteration and not necessarily
accessible outside. Given the non-determinism of the switchover cost, it
can be useful to understand if the downtime was far off from the one
detected by the migration algoritm, thus print the resultant downtime
alongside its estimation.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 migration/migration.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Peter Xu Oct. 4, 2023, 7:33 p.m. UTC | #1
On Tue, Sep 26, 2023 at 05:18:41PM +0100, Joao Martins wrote:
> Right now, migration statistics either print downtime or expected
> downtime depending on migration completing of in progress. Also in the
> beginning of migration by printing the downtime limit as expected
> downtime, when estimation is not available.
> 
> The pending_size is private in migration iteration and not necessarily
> accessible outside. Given the non-determinism of the switchover cost, it
> can be useful to understand if the downtime was far off from the one
> detected by the migration algoritm, thus print the resultant downtime
> alongside its estimation.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  migration/migration.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index dec6c88fbff9..f08f65b4b1c3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -943,6 +943,10 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
>      if (s->state == MIGRATION_STATUS_COMPLETED) {
>          info->has_total_time = true;
>          info->total_time = s->total_time;
> +        if (s->expected_downtime) {
> +            info->has_expected_downtime = true;
> +            info->expected_downtime = s->expected_downtime;
> +        }

There's another chunk right below that will also show
expected_downtime.. How about we merge them to be clear?

IIUC the current patch will not display expected_downtime during postcopy,
which makes sense.  But it'll pop up again after postcopy completes... so
not ideal either. If so sounds easier to just show it as long as we have a
value, and the user can ignore it.

@@ -913,7 +913,9 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
     if (migrate_show_downtime(s)) {
         info->has_downtime = true;
         info->downtime = s->downtime;
-    } else {
+    }
+
+    if (s->expected_downtime) {
         info->has_expected_downtime = true;
         info->expected_downtime = s->expected_downtime;
     }

IIUC currently expected_downtime for postcopy makes less sense.  Maybe one
day we can make it reflect reality, by taking more things into account
(besides dirty RAM rate).

>      } else {
>          info->has_total_time = true;
>          info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
> @@ -2844,6 +2848,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  
>      if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>          trace_migration_thread_low_pending(pending_size);
> +        if (s->threshold_size) {
> +            s->expected_downtime = (pending_size * s->parameters.downtime_limit) /
> +                                   s->threshold_size;
> +        }

I had a feeling that you did the calculation to avoid accessing ->mbps. :)

I'd suggest we move this into migration_completion(), and use ->mbps
(before the other avail-switchover-bandwidth patch lands).  It's just that
using the bandwidth value seems more straightforward.  Or maybe I missed
something tricky?

>          migration_completion(s);
>          return MIG_ITERATE_BREAK;
>      }
> -- 
> 2.39.3
> 

Thanks,
Joao Martins Oct. 6, 2023, 11:45 a.m. UTC | #2
On 04/10/2023 20:33, Peter Xu wrote:
> On Tue, Sep 26, 2023 at 05:18:41PM +0100, Joao Martins wrote:
>> Right now, migration statistics either print downtime or expected
>> downtime depending on migration completing of in progress. Also in the
>> beginning of migration by printing the downtime limit as expected
>> downtime, when estimation is not available.
>>
>> The pending_size is private in migration iteration and not necessarily
>> accessible outside. Given the non-determinism of the switchover cost, it
>> can be useful to understand if the downtime was far off from the one
>> detected by the migration algoritm, thus print the resultant downtime
>> alongside its estimation.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  migration/migration.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index dec6c88fbff9..f08f65b4b1c3 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -943,6 +943,10 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
>>      if (s->state == MIGRATION_STATUS_COMPLETED) {
>>          info->has_total_time = true;
>>          info->total_time = s->total_time;
>> +        if (s->expected_downtime) {
>> +            info->has_expected_downtime = true;
>> +            info->expected_downtime = s->expected_downtime;
>> +        }
> 
> There's another chunk right below that will also show
> expected_downtime.. How about we merge them to be clear?
>
Definitly

> IIUC the current patch will not display expected_downtime during postcopy,
> which makes sense.  But it'll pop up again after postcopy completes... so
> not ideal either. If so sounds easier to just show it as long as we have a
> value, and the user can ignore it.
> 
Yes.


> @@ -913,7 +913,9 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
>      if (migrate_show_downtime(s)) {
>          info->has_downtime = true;
>          info->downtime = s->downtime;
> -    } else {
> +    }
> +
> +    if (s->expected_downtime) {
>          info->has_expected_downtime = true;
>          info->expected_downtime = s->expected_downtime;
>      }
> 
> IIUC currently expected_downtime for postcopy makes less sense. 

I think it makes sense, you still need to switchover to destination. Knowing how
much you miss is useful? Albeit compared to the rest of the algorithm is less
critical than say in precopy.

> Maybe one
> day we can make it reflect reality, by taking more things into account
> (besides dirty RAM rate).
> 
>>      } else {
>>          info->has_total_time = true;
>>          info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
>> @@ -2844,6 +2848,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>  
>>      if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>>          trace_migration_thread_low_pending(pending_size);
>> +        if (s->threshold_size) {
>> +            s->expected_downtime = (pending_size * s->parameters.downtime_limit) /
>> +                                   s->threshold_size;
>> +        }
> 
> I had a feeling that you did the calculation to avoid accessing ->mbps. :)
> 
It was oversight on my end

> I'd suggest we move this into migration_completion(), and use ->mbps
> (before the other avail-switchover-bandwidth patch lands).  It's just that
> using the bandwidth value seems more straightforward. 

It is better that way, and things get consolidated.


> Or maybe I missed
> something tricky?
> 

Nah, just oversight :)
Juan Quintela Oct. 31, 2023, 1:14 p.m. UTC | #3
Joao Martins <joao.m.martins@oracle.com> wrote:
> Right now, migration statistics either print downtime or expected
> downtime depending on migration completing of in progress. Also in the
> beginning of migration by printing the downtime limit as expected
> downtime, when estimation is not available.
>
> The pending_size is private in migration iteration and not necessarily
> accessible outside. Given the non-determinism of the switchover cost, it
> can be useful to understand if the downtime was far off from the one
> detected by the migration algoritm, thus print the resultant downtime
> alongside its estimation.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

I see that "part" of this series is on the downtime series by Peter.
I have merged them (they are tracepoints, we can change them when
needed).

But this one is not on that series.

Should we continue and send a patch for it?

Later, Juan.


> ---
>  migration/migration.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index dec6c88fbff9..f08f65b4b1c3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -943,6 +943,10 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
>      if (s->state == MIGRATION_STATUS_COMPLETED) {
>          info->has_total_time = true;
>          info->total_time = s->total_time;
> +        if (s->expected_downtime) {
> +            info->has_expected_downtime = true;
> +            info->expected_downtime = s->expected_downtime;
> +        }
>      } else {
>          info->has_total_time = true;
>          info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
> @@ -2844,6 +2848,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  
>      if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>          trace_migration_thread_low_pending(pending_size);
> +        if (s->threshold_size) {
> +            s->expected_downtime = (pending_size * s->parameters.downtime_limit) /
> +                                   s->threshold_size;
> +        }
>          migration_completion(s);
>          return MIG_ITERATE_BREAK;
>      }
Joao Martins Nov. 2, 2023, 10:22 a.m. UTC | #4
On 31/10/2023 13:14, Juan Quintela wrote:
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> Right now, migration statistics either print downtime or expected
>> downtime depending on migration completing of in progress. Also in the
>> beginning of migration by printing the downtime limit as expected
>> downtime, when estimation is not available.
>>
>> The pending_size is private in migration iteration and not necessarily
>> accessible outside. Given the non-determinism of the switchover cost, it
>> can be useful to understand if the downtime was far off from the one
>> detected by the migration algoritm, thus print the resultant downtime
>> alongside its estimation.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> I see that "part" of this series is on the downtime series by Peter.
> I have merged them (they are tracepoints, we can change them when
> needed).
> 
> But this one is not on that series.
> 
> Should we continue and send a patch for it?
> 

Yeap, I will follow up (taking into account Peter's comments)
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index dec6c88fbff9..f08f65b4b1c3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -943,6 +943,10 @@  static void populate_time_info(MigrationInfo *info, MigrationState *s)
     if (s->state == MIGRATION_STATUS_COMPLETED) {
         info->has_total_time = true;
         info->total_time = s->total_time;
+        if (s->expected_downtime) {
+            info->has_expected_downtime = true;
+            info->expected_downtime = s->expected_downtime;
+        }
     } else {
         info->has_total_time = true;
         info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
@@ -2844,6 +2848,10 @@  static MigIterateState migration_iteration_run(MigrationState *s)
 
     if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
         trace_migration_thread_low_pending(pending_size);
+        if (s->threshold_size) {
+            s->expected_downtime = (pending_size * s->parameters.downtime_limit) /
+                                   s->threshold_size;
+        }
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }