
[RFC,v4,3/3] migration: reduce time of loading non-iterable vmstate

Message ID 20221223142307.1614945-4-xuchuangxclwt@bytedance.com (mailing list archive)
State New, archived
Series migration: reduce time of loading non-iterable vmstate

Commit Message

Chuang Xu Dec. 23, 2022, 2:23 p.m. UTC
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (measured from the timestamp at which the source QEMU
stops to the timestamp at which the target QEMU starts). Most of this time
is spent repeatedly committing memory region changes.

This patch packs all the memory region changes made while loading
non-iterable vmstate into a single memory transaction. As the number of
devices grows, this patch greatly improves performance.
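
The shape of the change is roughly the sketch below. Only
memory_region_transaction_begin() and memory_region_transaction_commit()
are the real calls added by the patch; load_all_device_state(),
more_sections() and load_one_section() are hypothetical placeholders
standing in for the section-parsing loop in qemu_loadvm_state_main().

/*
 * Illustrative sketch of the batching pattern (placeholders, see above):
 * all memory region updates made while sections are loaded are deferred
 * and then applied by one final commit.
 */
static int load_all_device_state(QEMUFile *f, MigrationIncomingState *mis)
{
    int ret = 0;

    /* Defer flatview rebuilds while device state is loaded. */
    memory_region_transaction_begin();

    while (more_sections(f)) {            /* placeholder loop condition */
        ret = load_one_section(f, mis);   /* may touch many memory regions */
        if (ret < 0) {
            break;
        }
    }

    /* One commit applies all accumulated memory region changes. */
    memory_region_transaction_commit();
    return ret;
}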

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net devices
  - 16 4-queue vhost-user-blk devices.

	time of loading non-iterable vmstate     downtime
before		about 150 ms			  740+ ms
after		about 30 ms			  630+ ms

In test2, we keep the number of devices the same as in test1 and reduce the
number of queues per device:

Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net devices
  - 16 1-queue vhost-user-blk devices.

	time of loading non-iterable vmstate     downtime
before		about 90 ms			 about 250 ms
after		about 25 ms			 about 160 ms

In test3, we keep the number of queues per device the same as in test1 and
reduce the number of devices:

Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

	time of loading non-iterable vmstate     downtime
before		about 20 ms			 about 70 ms
after		about 11 ms			 about 60 ms

As the test results above show, both the number of queues and the number of
devices have a great impact on the time of loading non-iterable vmstate. As
the number of devices and queues grows, more memory region commits are
issued, and the time spent on flatview reconstruction increases accordingly
(see the toy model after the diffstat for an illustration of why batching
the commits helps).

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)
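
To illustrate why the number of memory region commits dominates, here is a
self-contained toy model (an editor's sketch, not QEMU code; the names
transaction_begin/commit, region_changed and rebuild_flatviews are made up)
of a depth-counter/pending-update scheme. Outside a transaction every region
change pays for a flatview rebuild; inside a transaction all changes collapse
into the single rebuild performed by the outermost commit.

#include <stdio.h>

/* Toy model of transactional memory region updates (not QEMU code). */
static int depth;            /* nesting level of open transactions        */
static int update_pending;   /* a region changed since the last rebuild   */
static int rebuilds;         /* how many expensive rebuilds were performed */

static void rebuild_flatviews(void) { rebuilds++; }

static void transaction_begin(void) { depth++; }

static void transaction_commit(void)
{
    if (--depth == 0 && update_pending) {
        rebuild_flatviews();          /* the expensive step */
        update_pending = 0;
    }
}

static void region_changed(void)
{
    update_pending = 1;
    if (depth == 0) {                 /* no transaction: commit at once */
        rebuild_flatviews();
        update_pending = 0;
    }
}

int main(void)
{
    /* Unbatched: every device/queue update pays for a rebuild. */
    for (int i = 0; i < 100; i++) {
        region_changed();
    }
    printf("unbatched rebuilds: %d\n", rebuilds);   /* prints 100 */

    /* Batched: one rebuild for the whole load phase. */
    rebuilds = 0;
    transaction_begin();
    for (int i = 0; i < 100; i++) {
        region_changed();
    }
    transaction_commit();
    printf("batched rebuilds:   %d\n", rebuilds);   /* prints 1 */
    return 0;
}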

Comments

David Hildenbrand Dec. 23, 2022, 4:06 p.m. UTC | #1
On 23.12.22 15:23, Chuang Xu wrote:
> [...]
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index a0cdb714f7..19785e5a54 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
>       uint8_t section_type;
>       int ret = 0;
>   
> +    /* call memory_region_transaction_begin() before loading vmstate */


I'd suggest extending the comment *why* you are doing that, that it's a 
pure performance optimization, and how it achieves that.
Chuang Xu Jan. 4, 2023, 7:31 a.m. UTC | #2
On 2022/12/24 12:06 AM, David Hildenbrand wrote:
> On 23.12.22 15:23, Chuang Xu wrote:
>> [...]
>>
>> +    /* call memory_region_transaction_begin() before loading vmstate */
>
> I'd suggest extending the comment *why* you are doing that, that it's a
> pure performance optimization, and how it achieves that.

Thanks! I'll extend the comment in v5.
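
For illustration only, an extended comment along the lines suggested above
might read something like this sketch (an editor's suggestion, not the
actual v5 wording):

    /*
     * Performance optimization: hold a single memory transaction open
     * while the non-iterable vmstate is loaded.  Each device's load
     * callback may change memory regions; without the transaction every
     * change would trigger its own commit and flatview rebuild, which
     * dominates downtime when many devices/queues are present.  With
     * the transaction, all changes are applied by one commit at the end.
     */
    memory_region_transaction_begin();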

Patch

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@  int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,10 @@  out:
             goto retry;
         }
     }
+
+    /* call memory_region_transaction_commit() after loading vmstate */
+    memory_region_transaction_commit();
+
     return ret;
 }