Message ID | 20240705182901.48948-1-yichen.wang@bytedance.com (mailing list archive) |
---|---|
Series | Implement using Intel QAT to offload ZLIB |
> -----Original Message-----
> From: Yichen Wang <yichen.wang@bytedance.com>
> Sent: Saturday, July 6, 2024 2:29 AM
> To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>;
> qemu-devel@nongnu.org
> Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
>
> v4:
> - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> - Move the IOV initialization to qatzip implementation
> - Only use qatzip to compress normal pages
>
> v3:
> - Rebase changes on top of master
> - Merge two patches per Fabiano Rosas's comment
> - Add versions into comments and documentations
>
> v2:
> - Rebase changes on top of recent multifd code changes.
> - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> - Remove parameter tuning and use QATzip's defaults for better
>   performance.
> - Add parameter to enable QAT software fallback.
>
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
>
> * Performance
>
> We present updated performance results. For circumstantial reasons, v1
> presented performance on a low-bandwidth (1Gbps) network.
>
> Here, we present updated results with a similar setup as before but with
> two main differences:
>
> 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> 2. We had a bug in our memory allocation causing us to only use ~1/2 of
>    the VM's RAM. Now we properly allocate and fill nearly all of the VM's
>    RAM.
>
> Thus, the test setup is as follows:
>
> We perform multifd live migration over TCP using a VM with 64GB memory.
> We prepare the machine's memory by powering it on, allocating a large
> amount of memory (60GB) as a single buffer, and filling the buffer with
> the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> realistic memory snapshot, which proved troublesome to acquire.
>
> We analyze CPU usage by averaging the output of 'top' every second
> during migration. This is admittedly imprecise, but we feel that it
> accurately portrays the different degrees of CPU usage of varying
> compression methods.
>
> We present the latency, throughput, and CPU usage results for all of the
> compression methods, with varying numbers of multifd threads (4, 8, and
> 16).
>
> [0] The Silesia corpus can be accessed here:
> https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
>
> ** Results
>
> 4 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip         | 23.13         | 8749.94        |117.50   |186.49   |
> |---------------|---------------|----------------|---------|---------|
> |zlib           |254.35         |  771.87        |388.20   |144.40   |
> |---------------|---------------|----------------|---------|---------|
> |zstd           | 54.52         | 3442.59        |414.59   |149.77   |
> |---------------|---------------|----------------|---------|---------|
> |none           | 12.45         |43739.60        |159.71   |204.96   |
> |---------------|---------------|----------------|---------|---------|
>
> 8 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip         | 16.91         |12306.52        |186.37   |391.84   |
> |---------------|---------------|----------------|---------|---------|
> |zlib           |130.11         | 1508.89        |753.86   |289.35   |
> |---------------|---------------|----------------|---------|---------|
> |zstd           | 27.57         | 6823.23        |786.83   |303.80   |
> |---------------|---------------|----------------|---------|---------|
> |none           | 11.82         |46072.63        |163.74   |238.56   |
> |---------------|---------------|----------------|---------|---------|
>
> 16 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method         |time(sec)      |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip         |18.64          |11044.52        | 573.61  |437.65   |
> |---------------|---------------|----------------|---------|---------|
> |zlib           |66.43          | 2955.79        |1469.68  |567.47   |
> |---------------|---------------|----------------|---------|---------|
> |zstd           |14.17          |13290.66        |1504.08  |615.33   |
> |---------------|---------------|----------------|---------|---------|
> |none           |16.82          |32363.26        | 180.74  |217.17   |
> |---------------|---------------|----------------|---------|---------|
>
> ** Observations
>
> - In general, not using compression outperforms using compression in a
>   non-network-bound environment.
> - 'qatzip' outperforms other compression workers with 4 and 8 workers,
>   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
>   ~58% latency reduction over 'zstd' with 4 workers.
> - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
>   showing a ~32% increase in latency. This performance difference
>   becomes more noticeable with more workers, as CPU compression is highly
>   parallelizable.
> - 'qatzip' compression uses considerably less CPU than other compression
>   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
>   compression CPU usage compared to 'zstd' and 'zlib'.
> - 'qatzip' decompression CPU usage is less impressive, and is even
>   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.

Hi Peter & Yichen

I have a test based on the V4 patch set
VM configuration: 16 vCPU, 64G memory,
VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
QAT Devices: 4

Sender migration parameters
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2/4/8
migrate_set_parameter max-bandwidth 1G/10G
migrate_set_parameter multifd-compression qatzip/zstd

Receiver migration parameters
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2
migrate_set_parameter multifd-compression qatzip/zstd

max-bandwidth: 1GBps
|-----------|--------|---------|----------|------|------|
|2 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   21607|       77|      8051|    88|   125|
|-----------|--------|---------|----------|------|------|
|zstd       |   78351|       96|      2199|   204|    80|
|-----------|--------|---------|----------|------|------|

|-----------|--------|---------|----------|------|------|
|4 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   20336|       25|      8557|   110|   190|
|-----------|--------|---------|----------|------|------|
|zstd       |   39324|       31|      4389|   406|   160|
|-----------|--------|---------|----------|------|------|

|-----------|--------|---------|----------|------|------|
|8 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   20208|       22|      8613|   125|   300|
|-----------|--------|---------|----------|------|------|
|zstd       |   20515|       22|      8438|   800|   340|
|-----------|--------|---------|----------|------|------|

max-bandwidth: 10GBps
|-----------|--------|---------|----------|------|------|
|2 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   22450|       77|      7748|    80|   125|
|-----------|--------|---------|----------|------|------|
|zstd       |   78339|       76|      2199|   204|    80|
|-----------|--------|---------|----------|------|------|

|-----------|--------|---------|----------|------|------|
|4 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   13017|       24|     13401|   180|   285|
|-----------|--------|---------|----------|------|------|
|zstd       |   39466|       21|      4373|   406|   160|
|-----------|--------|---------|----------|------|------|

|-----------|--------|---------|----------|------|------|
|8 Channels |Total   |down     |throughput| send | recv |
|           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip     |   10255|       22|     17037|   280|   590|
|-----------|--------|---------|----------|------|------|
|zstd       |   20126|       77|      8595|   810|   340|
|-----------|--------|---------|----------|------|------|

If the user has enabled compression in live migration, using QAT
can save the host CPU resources.

When compression is enabled, the bottleneck of migration is usually
the compression throughput on the sender side, since CPU decompression
throughput is higher than compression throughput (some reference data:
https://github.com/inikep/lzbench), so more CPU resources need to be
allocated to the sender side.

Summary:
1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
   but ZSTD needs 800%.
2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps,
   but ZSTD still cannot reach 10Gbps even if it uses 810%.
3. The QAT decompression CPU utilization is higher than that of QAT
   compression and of ZSTD. From my analysis:
   3.1 When using QAT compression, the data needs to be copied to the QAT
       memory (for DMA operations), and the same applies to decompression.
       However, do_user_addr_fault is triggered during decompression because
       the QAT decompressed data is copied to the VM address space for the
       first time. In addition, both compression and decompression are
       processed by QAT and do not consume CPU resources, so the CPU
       utilization of the receiver is slightly higher than that of the sender.

   3.2 Since zstd decompression decompresses data directly into the VM
       address space, there is one less memory copy than with QAT, so the
       CPU utilization on the receiver is better than with QAT. For the
       1GBps case, the receiver CPU utilization is 125%, and the memory
       copy occupies ~80% of CPU utilization.

I think this is acceptable. Considering the overall CPU usage of the sender
and receiver, the QAT benefit is good.

> Bryan Zhang (4):
>   meson: Introduce 'qatzip' feature to the build system
>   migration: Add migration parameters for QATzip
>   migration: Introduce 'qatzip' compression method
>   tests/migration: Add integration test for 'qatzip' compression method
>
>  hw/core/qdev-properties-system.c |   6 +-
>  meson.build                      |  10 +
>  meson_options.txt                |   2 +
>  migration/meson.build            |   1 +
>  migration/migration-hmp-cmds.c   |   8 +
>  migration/multifd-qatzip.c       | 391 +++++++++++++++++++++++++++++++
>  migration/multifd.h              |   5 +-
>  migration/options.c              |  57 +++++
>  migration/options.h              |   2 +
>  qapi/migration.json              |  38 +++
>  scripts/meson-buildoptions.sh    |   3 +
>  tests/qtest/meson.build          |   4 +
>  tests/qtest/migration-test.c     |  35 +++
>  13 files changed, 559 insertions(+), 3 deletions(-)
>  create mode 100644 migration/multifd-qatzip.c
>
> --
> Yichen Wang
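The percentages quoted in the cover letter's observations, and the CPU comparison in the summary, can be re-derived from the tables. The following is only a small Python sketch of that arithmetic; it is not part of the patch series, the helper names are made up for illustration, and every number is copied from the tables in this thread.

```python
# Re-derive the ratios quoted in this thread from the posted table values.
# Helper names are hypothetical; the numbers are copied from the tables.

def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (1.0 - value / baseline) * 100.0


def increase_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is higher than `baseline`."""
    return (value / baseline - 1.0) * 100.0


# Cover letter: total migration time in seconds.
time_4_threads = {"qatzip": 23.13, "zlib": 254.35, "zstd": 54.52}
time_16_threads = {"qatzip": 18.64, "zstd": 14.17}

# Cover letter: sender CPU% with 8 multifd threads.
send_cpu_8_threads = {"qatzip": 186.37, "zlib": 753.86, "zstd": 786.83}

# Second test, max-bandwidth 1G, 8 channels: sender CPU%.
send_cpu_1g_8ch = {"qatzip": 125, "zstd": 800}

if __name__ == "__main__":
    q4 = time_4_threads["qatzip"]
    print(f"4 threads, time vs zlib: -{reduction_pct(time_4_threads['zlib'], q4):.1f}%")
    print(f"4 threads, time vs zstd: -{reduction_pct(time_4_threads['zstd'], q4):.1f}%")
    print(
        "16 threads, time vs zstd: "
        f"+{increase_pct(time_16_threads['zstd'], time_16_threads['qatzip']):.1f}%"
    )
    q8 = send_cpu_8_threads["qatzip"]
    print(f"8 threads, send CPU vs zstd: -{reduction_pct(send_cpu_8_threads['zstd'], q8):.1f}%")
    print(f"8 threads, send CPU vs zlib: -{reduction_pct(send_cpu_8_threads['zlib'], q8):.1f}%")
    ratio = send_cpu_1g_8ch["zstd"] / send_cpu_1g_8ch["qatzip"]
    print(f"1G cap, 8 channels: zstd uses {ratio:.1f}x the sender CPU of qatzip")
```

The rounded results line up with the figures quoted in the observations (~91%, ~58%, ~32%, ~75%). Since the CPU numbers come from averaging 'top' output, they are best read as rough indicators rather than precise measurements.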
On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote:
[...]
> Hi Peter & Yichen
>
> I have a test based on the V4 patch set
> VM configuration: 16 vCPU, 64G memory,
> VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
> QAT Devices: 4
>
> Sender migration parameters
> migrate_set_capability multifd on
> migrate_set_parameter multifd-channels 2/4/8
> migrate_set_parameter max-bandwidth 1G/10G

Ah, I think this means GBps... not Gbps, then.

> migrate_set_parameter multifd-compression qatzip/zstd
[...]
> max-bandwidth: 10GBps
[...]
> |-----------|--------|---------|----------|------|------|
> |8 Channels |Total   |down     |throughput| send | recv |
> |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip     |   10255|       22|     17037|   280|   590|
> |-----------|--------|---------|----------|------|------|
> |zstd       |   20126|       77|      8595|   810|   340|
> |-----------|--------|---------|----------|------|------|

PS: this 77ms downtime smells like it hits some spikes during save/load.
It doesn't look reproducible compared to the rest of the data.

>
> If the user has enabled compression in live migration, using QAT
> can save the host CPU resources.
>
> When compression is enabled, the bottleneck of migration is usually
> the compression throughput on the sender side, since CPU decompression
> throughput is higher than compression, some reference data
> https://github.com/inikep/lzbench, so more CPU resources need to be
> allocated to the sender side.

Thank you, Yuan.

>
> Summary:
> 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
>    but ZSTD needs 800%.
> 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
>    But ZSTD still cannot reach 10Gbps even if it uses 810%.

So I assumed you always meant GBps across all the test results, as only
that matches the max-bandwidth parameter.

Then in this case 10GBps is actually 80Gbps, which was not a low-bandwidth
test.

And I think the most interesting one, which I am curious about, is nocomp in
low-bandwidth network tests. Would you mind running one more test with the
same workload, but with: no-comp, 8 channels, 10Gbps (or 1GBps)?

I think in this case multifd shouldn't matter a great deal, but let's still
enable it, on the assumption that it's the baseline / default setup. I would
expect the result to obviously show a win for using compressors, but
just to check.

> 3. The QAT decompression CPU utilization is higher than compression and ZSTD,
>    from my analysis
>    3.1 when using QAT compression, the data needs to be copied to the QAT
>        memory (for DMA operations), and the same for decompression. However,
>        do_user_addr_fault will be triggered during decompression because the
>        QAT decompressed data is copied to the VM address space for the first time,
>        in addition, both compression and decompression are processed by QAT and
>        do not consume CPU resources, so the CPU utilization of the receiver is
>        slightly higher than the sender.

I thought you hit this same issue when working on QPL, and I remember you
used -mem-prealloc. Why not use it here?

>
>    3.2 Since zstd decompression decompresses data directly into the VM address space,
>        there is one less memory copy than QAT, so the CPU utilization on the receiver
>        is better than QAT. For the 1GBps case, the receiver CPU utilization is 125%,
>        and the memory copy occupies ~80% of CPU utilization.

Hmm, yes, I read that part in the code and I thought it was a design decision
to do the copy; the comment said "it is faster". So it's not?

I think we can definitely submit compression tasks per-page rather than
buffering, if that would be better.

>
> I think this is acceptable. Considering the overall CPU usage of the sender and receiver,
> the QAT benefit is good.

Yes, I don't think there's any major issue blocking this from being
supported; it's more that while we are at it, we'd better figure everything
out.

For example, I think we used to discuss the use case where there's a 100G*2
network deployed, but the admin may still want to move some control plane
VMs around using a very limited network for QoS. In that case, I wonder
whether any of you have thought about using postcopy? I assume the control
plane workload isn't super critical in this case, or it wouldn't be
provisioned with a slow network for migrations; in that case maybe it would
also be fine to postcopy after one round of precopy on the low-bandwidth
network.

Again, I don't think the answer blocks this feature in any form for whoever
simply wants to use a compressor; I'm just asking.

Thanks,
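The unit question raised in this reply comes down to a factor of eight. Below is a hedged Python sketch of that conversion, plus a rough lower bound on how long an uncompressed transfer of the 54G test payload would take on the low-bandwidth links mentioned above. The function names are hypothetical, decimal units (1 GB = 10^9 bytes) are an assumption since the thread does not say how the 'G' suffix is interpreted, and the transfer-time figures are estimates that ignore zero pages, page dirtying, and protocol overhead; they are not measurements.

```python
# Illustrative arithmetic only; function names are hypothetical and the
# unit interpretation (decimal GB) is an assumption, not stated in the thread.

GB = 1e9  # decimal gigabyte, assumed


def bytes_per_sec_to_mbps(bytes_per_sec: float) -> float:
    """Convert a byte rate into megabits per second (10^6 bits)."""
    return bytes_per_sec * 8 / 1e6


def min_transfer_seconds(payload_bytes: float, link_bits_per_sec: float) -> float:
    """Lower bound on the time to send payload_bytes uncompressed."""
    return payload_bytes * 8 / link_bits_per_sec


if __name__ == "__main__":
    # If max-bandwidth 1G/10G means bytes per second, the caps in bits are:
    for label, gigabytes in (("1G", 1), ("10G", 10)):
        cap_mbps = bytes_per_sec_to_mbps(gigabytes * GB)
        print(f"max-bandwidth {label} = {cap_mbps:.0f} Mbps")

    # Rough floor for the requested no-compression run on a 54G payload.
    payload = 54 * GB
    for label, bits_per_sec in (("10 Gbps", 10e9), ("1 GBps", 8e9)):
        seconds = min_transfer_seconds(payload, bits_per_sec)
        print(f"uncompressed 54G over {label}: at least {seconds:.1f} s")
```

Under these assumptions a 1 GBps cap is 8000 Mbps, which is close to the 8051 to 8613 mbps reported for qatzip in the 1GBps tables, and a 10 GBps cap is 80000 Mbps (80 Gbps). The estimated floor for an uncompressed run (roughly 43 s to 54 s) is well above the ~20 s totals reported for both compressors at 8 channels, which is in line with the expectation that compression should win on a slow link; only the requested test can confirm it.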
> -----Original Message----- > From: Peter Xu <peterx@redhat.com> > Sent: Wednesday, July 10, 2024 2:43 AM > To: Liu, Yuan1 <yuan1.liu@intel.com> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > <pbonzini@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; Eduardo > Habkost <eduardo@habkost.net>; Marc-André Lureau > <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; Philippe > Mathieu-Daudé <philmd@linaro.org>; Fabiano Rosas <farosas@suse.de>; Eric > Blake <eblake@redhat.com>; Markus Armbruster <armbru@redhat.com>; Laurent > Vivier <lvivier@redhat.com>; qemu-devel@nongnu.org; Hao Xiang > <hao.xiang@linux.dev>; Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) > Chuang <horenchuang@bytedance.com> > Subject: Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB > > On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote: > > > -----Original Message----- > > > From: Yichen Wang <yichen.wang@bytedance.com> > > > Sent: Saturday, July 6, 2024 2:29 AM > > > To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. 
Berrangé > > > <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc- > André > > > Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > <peterx@redhat.com>; > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; > Markus > > > Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>; > qemu- > > > devel@nongnu.org > > > Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>; > > > Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang > > > <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com> > > > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB > > > > > > v4: > > > - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768 > > > - Move the IOV initialization to qatzip implementation > > > - Only use qatzip to compress normal pages > > > > > > v3: > > > - Rebase changes on top of master > > > - Merge two patches per Fabiano Rosas's comment > > > - Add versions into comments and documentations > > > > > > v2: > > > - Rebase changes on top of recent multifd code changes. > > > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers. > > > - Remove parameter tuning and use QATzip's defaults for better > > > performance. > > > - Add parameter to enable QAT software fallback. > > > > > > v1: > > > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html > > > > > > * Performance > > > > > > We present updated performance results. For circumstantial reasons, v1 > > > presented performance on a low-bandwidth (1Gbps) network. > > > > > > Here, we present updated results with a similar setup as before but > with > > > two main differences: > > > > > > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'. > > > 2. We had a bug in our memory allocation causing us to only use ~1/2 > of > > > the VM's RAM. 
Now we properly allocate and fill nearly all of the VM's > > > RAM. > > > > > > Thus, the test setup is as follows: > > > > > > We perform multifd live migration over TCP using a VM with 64GB > memory. > > > We prepare the machine's memory by powering it on, allocating a large > > > amount of memory (60GB) as a single buffer, and filling the buffer > with > > > the repeated contents of the Silesia corpus[0]. This is in lieu of a > more > > > realistic memory snapshot, which proved troublesome to acquire. > > > > > > We analyze CPU usage by averaging the output of 'top' every second > > > during migration. This is admittedly imprecise, but we feel that it > > > accurately portrays the different degrees of CPU usage of varying > > > compression methods. > > > > > > We present the latency, throughput, and CPU usage results for all of > the > > > compression methods, with varying numbers of multifd threads (4, 8, > and > > > 16). > > > > > > [0] The Silesia corpus can be accessed here: > > > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia > > > > > > ** Results > > > > > > 4 multifd threads: > > > > > > |---------------|---------------|----------------|---------|------ > ---| > > > |method |time(sec) |throughput(mbps)|send cpu%|recv > cpu%| > > > |---------------|---------------|----------------|---------|------ > ---| > > > |qatzip | 23.13 | 8749.94 |117.50 |186.49 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zlib |254.35 | 771.87 |388.20 |144.40 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zstd | 54.52 | 3442.59 |414.59 |149.77 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |none | 12.45 |43739.60 |159.71 |204.96 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > > > > 8 multifd threads: > > > > > > |---------------|---------------|----------------|---------|------ > ---| > > > |method 
|time(sec) |throughput(mbps)|send cpu%|recv > cpu%| > > > |---------------|---------------|----------------|---------|------ > ---| > > > |qatzip | 16.91 |12306.52 |186.37 |391.84 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zlib |130.11 | 1508.89 |753.86 |289.35 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zstd | 27.57 | 6823.23 |786.83 |303.80 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |none | 11.82 |46072.63 |163.74 |238.56 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > > > > 16 multifd threads: > > > > > > |---------------|---------------|----------------|---------|------ > ---| > > > |method |time(sec) |throughput(mbps)|send cpu%|recv > cpu%| > > > |---------------|---------------|----------------|---------|------ > ---| > > > |qatzip |18.64 |11044.52 | 573.61 |437.65 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zlib |66.43 | 2955.79 |1469.68 |567.47 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |zstd |14.17 |13290.66 |1504.08 |615.33 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > |none |16.82 |32363.26 | 180.74 |217.17 > | > > > |---------------|---------------|----------------|---------|------ > ---| > > > > > > ** Observations > > > > > > - In general, not using compression outperforms using compression in a > > > non-network-bound environment. > > > - 'qatzip' outperforms other compression workers with 4 and 8 workers, > > > achieving a ~91% latency reduction over 'zlib' with 4 workers, and a > > > ~58% latency reduction over 'zstd' with 4 workers. > > > - 'qatzip' maintains comparable performance with 'zstd' at 16 workers, > > > showing a ~32% increase in latency. 
> > >   This performance difference becomes more noticeable with more workers,
> > >   as CPU compression is highly parallelizable.
> > > - 'qatzip' compression uses considerably less CPU than other compression
> > >   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
> > >   compression CPU usage compared to 'zstd' and 'zlib'.
> > > - 'qatzip' decompression CPU usage is less impressive, and is even
> > >   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> >
> > Hi Peter & Yichen
> >
> > I have a test based on the V4 patch set
> > VM configuration: 16 vCPU, 64G memory,
> > VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
> > QAT Devices: 4
> >
> > Sender migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2/4/8
> > migrate_set_parameter max-bandwidth 1G/10G
>
> Ah, I think this means GBps... not Gbps, then.
>
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > Receiver migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > max-bandwidth: 1GBps
> > |-----------|--------|---------|----------|------|------|
> > |2 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   21607|       77|      8051|    88|   125|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   78351|       96|      2199|   204|    80|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |4 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   20336|       25|      8557|   110|   190|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   39324|       31|      4389|   406|   160|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |8 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   20208|       22|      8613|   125|   300|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   20515|       22|      8438|   800|   340|
> > |-----------|--------|---------|----------|------|------|
> >
> > max-bandwidth: 10GBps
> > |-----------|--------|---------|----------|------|------|
> > |2 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   22450|       77|      7748|    80|   125|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   78339|       76|      2199|   204|    80|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |4 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   13017|       24|     13401|   180|   285|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   39466|       21|      4373|   406|   160|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |8 Channels |Total   |down     |throughput| send | recv |
> > |           |time(ms)|time(ms) |(mbps)    | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip     |   10255|       22|     17037|   280|   590|
> > |-----------|--------|---------|----------|------|------|
> > |zstd       |   20126|       77|      8595|   810|   340|
> > |-----------|--------|---------|----------|------|------|
>
> PS: this 77ms downtime smells like it hits some spikes during save/load.
> Doesn't look reproducible compared to the rest of the data.

I agree with this.
> >
> > If the user has enabled compression in live migration, using QAT
> > can save the host CPU resources.
> >
> > When compression is enabled, the bottleneck of migration is usually
> > the compression throughput on the sender side, since CPU decompression
> > throughput is higher than compression throughput (some reference data:
> > https://github.com/inikep/lzbench), so more CPU resources need to be
> > allocated to the sender side.
>
> Thank you, Yuan.
>
> >
> > Summary:
> > 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
> >    but ZSTD needs 800%.
> > 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
> >    But ZSTD still cannot reach 10Gbps even if it uses 810%.
>
> So I assumed you always meant GBps across all the test results, as only
> that matches the max-bandwidth parameter.
>
> Then in this case 10GBps is actually 80Gbps, which was not a low bandwidth
> test.
>
> And I think the most interesting case, the one I'm curious about, is nocomp
> in low-bandwidth network tests. Would you mind running one more test with the
> same workload, but with: no-comp, 8 channels, 10Gbps (or 1GBps)?
>
> I think in this case multifd shouldn't matter a huge deal, but let's still
> enable it and just assume that's the baseline / default setup. I would
> expect this result to obviously show a win for using compressors, but
> just to check.
migrate_set_parameter max-bandwidth 1250M
|-----------|--------|---------|----------|----------|------|------|
|8 Channels |Total   |down     |throughput|pages per | send | recv |
|           |time(ms)|time(ms) |(mbps)    |second    | cpu %| cpu% |
|-----------|--------|---------|----------|----------|------|------|
|qatzip     |   16630|       28|     10467|   2940235|   160|   360|
|-----------|--------|---------|----------|----------|------|------|
|zstd       |   20165|       24|      8579|   2391465|   810|   340|
|-----------|--------|---------|----------|----------|------|------|
|none       |   46063|       40|     10848|    330240|    45|    85|
|-----------|--------|---------|----------|----------|------|------|

QATzip's dirty page processing throughput is much higher than that of no compression.
In this test, the vCPUs are in idle state, so the migration can be successful even
without compression.

> > 3. The QAT decompression CPU utilization is higher than compression and ZSTD,
> >    from my analysis
> >    3.1 When using QAT compression, the data needs to be copied to the QAT
> >        memory (for DMA operations), and the same for decompression. However,
> >        do_user_addr_fault will be triggered during decompression because the
> >        QAT decompressed data is copied to the VM address space for the first
> >        time. In addition, both compression and decompression are processed by
> >        QAT and do not consume CPU resources, so the CPU utilization of the
> >        receiver is slightly higher than the sender.
>
> I thought you hit this same issue when working on QPL and I remember you
> used -mem-prealloc. Why not use it here?
>
> >
> >    3.2 Since zstd decompression decompresses data directly into the VM
> >        address space, there is one less memory copy than QAT, so the CPU
> >        utilization on the receiver is better than QAT. For the 1GBps case,
> >        the receiver CPU utilization is 125%, and the memory copy occupies
> >        ~80% of CPU utilization.
>
> Hmm, yes I read that part in code and I thought it was a design decision to
> do the copy, the comment said "it is faster". So it's not?
>
> I think we can definitely submit compression tasks per-page rather than
> buffering, if that would be better.

I think "faster" here probably refers to QAT throughput: QAT is more friendly to
large-block data compression (e.g. 32K). And QATzip doesn't support batching
compression tasks, so copying multiple small pieces of data into a buffer for
compression is a common practice.

> > I think this is acceptable. Considering the overall CPU usage of the sender
> > and receiver, the QAT benefit is good.
>
> Yes, I don't think there's any major issue to block this from being supported,
> it's more that while we are at it we'd better figure all things out.
>
> For example, I think we used to discuss the use case where there's 100G*2
> network deployed, but the admin may still want to have some control plane
> VMs moving around using very limited network for QoS. In that case, I
> wonder whether any of you thought about using postcopy? I assume the control
> plane workload isn't super critical in this case or it won't get provisioned
> with low network for migrations, in that case maybe it'll also be fine to
> post-copy after one round of precopy on the slow-bandwidth network.
>
> Again, I don't think the answer blocks such a feature in any form for whoever
> simply wants to use a compressor, just to ask.

I don’t have much experience with postcopy; here are some of my thoughts:
1. For write-intensive VMs, this solution can improve the migration success rate.
   In a limited-bandwidth network scenario, the dirty page processing throughput
   is significantly reduced without compression (the previous data includes this
   as pages_per_second). This means that in precopy without compression, the
   workload generates dirty pages faster than migration can process them,
   resulting in migration failure.

2. If the VM is read-intensive or has low vCPU utilization (for example, in my
   current test scenario the vCPUs are all idle), I think no compression +
   precopy + postcopy also cannot improve the migration performance, and may also
   cause timeout failure due to long migration time, same as no-compression precopy.

3. In my opinion, postcopy is a good solution in this scenario (low network
   bandwidth, VM is not critical), because even if compression is turned on, the
   migration may still fail (pages_per_second may still be less than the rate of
   new dirty pages), and it is hard to predict whether VM memory is
   compression-friendly.

> Thanks,
>
> --
> Peter Xu
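
The GBps versus Gbps question raised in this message comes down to a factor of
eight. A minimal Python sketch of the conversion follows. It assumes that
max-bandwidth is given in bytes per second, and it shows both a decimal
(1000-based) and a binary (1024-based) reading of the M/G suffix, because the
thread does not say which one the monitor command applies. The helper name
bandwidth_gbps and the suffix table are illustrative only, not QEMU code.

```python
# Hypothetical helper, not part of QEMU: convert a bytes-per-second
# bandwidth setting such as "1250M" or "10G" into gigabits per second.
SUFFIXES = {"K": 1, "M": 2, "G": 3}


def bandwidth_gbps(setting: str, base: int = 1000) -> float:
    """Return the bandwidth in Gbit/s for a bytes/s setting.

    base=1000 reads the suffix as decimal (MB, GB); base=1024 reads it
    as binary (MiB, GiB). Which one applies is an assumption here.
    """
    suffix = setting[-1].upper()
    if suffix in SUFFIXES:
        value = float(setting[:-1]) * base ** SUFFIXES[suffix]
    else:
        value = float(setting)
    return value * 8 / 1e9  # bytes/s -> bits/s -> Gbit/s


# The three settings that appear in the thread.
for s in ("1G", "10G", "1250M"):
    print(s, round(bandwidth_gbps(s), 2), round(bandwidth_gbps(s, 1024), 2))
```

With the decimal reading, 10G corresponds to 80 Gbit/s and 1250M to 10 Gbit/s,
which matches the figures used in the discussion above.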
On Wed, Jul 10, 2024 at 01:55:23PM +0000, Liu, Yuan1 wrote:

[...]

> migrate_set_parameter max-bandwidth 1250M
> |-----------|--------|---------|----------|----------|------|------|
> |8 Channels |Total   |down     |throughput|pages per | send | recv |
> |           |time(ms)|time(ms) |(mbps)    |second    | cpu %| cpu% |
> |-----------|--------|---------|----------|----------|------|------|
> |qatzip     |   16630|       28|     10467|   2940235|   160|   360|
> |-----------|--------|---------|----------|----------|------|------|
> |zstd       |   20165|       24|      8579|   2391465|   810|   340|
> |-----------|--------|---------|----------|----------|------|------|
> |none       |   46063|       40|     10848|    330240|    45|    85|
> |-----------|--------|---------|----------|----------|------|------|
>
> QATzip's dirty page processing throughput is much higher than that of no compression.
> In this test, the vCPUs are in idle state, so the migration can be successful even
> without compression.

Thanks!  Maybe good material to be put into the docs/ too, if Yichen's going to
pick up your doc patch when reposting.

[...]

> I don’t have much experience with postcopy; here are some of my thoughts:
> 1. For write-intensive VMs, this solution can improve the migration success rate.
>    In a limited-bandwidth network scenario, the dirty page processing throughput
>    is significantly reduced without compression (the previous data includes this
>    as pages_per_second). This means that in precopy without compression, the
>    workload generates dirty pages faster than migration can process them,
>    resulting in migration failure.

Yes.

>
> 2. If the VM is read-intensive or has low vCPU utilization (for example, in my
>    current test scenario the vCPUs are all idle), I think no compression +
>    precopy + postcopy also cannot improve the migration performance, and may also
>    cause timeout failure due to long migration time, same as no-compression precopy.
I don't think postcopy will trigger timeout failures - postcopy should use
constant time to complete a migration, that is guest memsize / bw.

The challenge is normally that the delay of page requests is higher than in
precopy, but in this case it might not be a big deal. And I wonder if on
100G*2 cards it can also perform pretty well, as the delay might be minimal
even if bandwidth is throttled.

>
> 3. In my opinion, postcopy is a good solution in this scenario (low network
>    bandwidth, VM is not critical), because even if compression is turned on, the
>    migration may still fail (pages_per_second may still be less than the rate of
>    new dirty pages), and it is hard to predict whether VM memory is
>    compression-friendly.

Yes.

Thanks,
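
The "guest memsize / bw" estimate mentioned above can be written out as a small
Python sketch. The numbers are only illustrative: the 64G guest size is taken
from the test VM described earlier in the thread and is read here as 64 GiB,
and the bandwidth is the 1250M setting read as 1250 * 10^6 bytes per second.
Both readings are assumptions, and the function name is hypothetical.

```python
def postcopy_transfer_seconds(guest_mem_bytes: float, bw_bytes_per_s: float) -> float:
    """Upper-bound style estimate: every guest page is sent exactly once."""
    return guest_mem_bytes / bw_bytes_per_s


GIB = 1024 ** 3
guest_mem = 64 * GIB        # assumption: "64G memory" means 64 GiB
bandwidth = 1250 * 10 ** 6  # assumption: "1250M" means 1250 MB/s (decimal)

seconds = postcopy_transfer_seconds(guest_mem, bandwidth)
print(f"estimated transfer time: {seconds:.1f} s")
```

The estimate does not depend on the guest's dirty rate, which is the point of
calling the postcopy time constant; it says nothing about page-request latency
seen by the guest while postcopy is running.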
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, July 10, 2024 11:19 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; Eduardo
> Habkost <eduardo@habkost.net>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; Philippe
> Mathieu-Daudé <philmd@linaro.org>; Fabiano Rosas <farosas@suse.de>; Eric
> Blake <eblake@redhat.com>; Markus Armbruster <armbru@redhat.com>; Laurent
> Vivier <lvivier@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xiang@linux.dev>; Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack)
> Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
>
> On Wed, Jul 10, 2024 at 01:55:23PM +0000, Liu, Yuan1 wrote:
>
> [...]
>
> > migrate_set_parameter max-bandwidth 1250M
> > |-----------|--------|---------|----------|----------|------|------|
> > |8 Channels |Total   |down     |throughput|pages per | send | recv |
> > |           |time(ms)|time(ms) |(mbps)    |second    | cpu %| cpu% |
> > |-----------|--------|---------|----------|----------|------|------|
> > |qatzip     |   16630|       28|     10467|   2940235|   160|   360|
> > |-----------|--------|---------|----------|----------|------|------|
> > |zstd       |   20165|       24|      8579|   2391465|   810|   340|
> > |-----------|--------|---------|----------|----------|------|------|
> > |none       |   46063|       40|     10848|    330240|    45|    85|
> > |-----------|--------|---------|----------|----------|------|------|
> >
> > QATzip's dirty page processing throughput is much higher than that of no
> > compression.
> > In this test, the vCPUs are in idle state, so the migration can be
> > successful even without compression.
>
> Thanks!  Maybe good material to be put into the docs/ too, if Yichen's
> going to pick up your doc patch when reposting.

Sure, Yichen will add my doc patch; if he doesn't add this part in the next
version, I will add it later.
> [...]
>
> > I don’t have much experience with postcopy; here are some of my thoughts:
> > 1. For write-intensive VMs, this solution can improve the migration success
> >    rate. In a limited-bandwidth network scenario, the dirty page processing
> >    throughput is significantly reduced without compression (the previous
> >    data includes this as pages_per_second). This means that in precopy
> >    without compression, the workload generates dirty pages faster than
> >    migration can process them, resulting in migration failure.
>
> Yes.
>
> >
> > 2. If the VM is read-intensive or has low vCPU utilization (for example, in
> >    my current test scenario the vCPUs are all idle), I think no compression +
> >    precopy + postcopy also cannot improve the migration performance, and may
> >    also cause timeout failure due to long migration time, same as
> >    no-compression precopy.
>
> I don't think postcopy will trigger timeout failures - postcopy should use
> constant time to complete a migration, that is guest memsize / bw.

Yes, the total migration time is predictable; "failure due to timeout" is
incorrect, "migration taking a long time" may be more accurate.

> The challenge is normally that the delay of page requests is higher than in
> precopy, but in this case it might not be a big deal. And I wonder if on
> 100G*2 cards it can also perform pretty well, as the delay might be minimal
> even if bandwidth is throttled.

I got your point, I don't have much experience in this area.
So you mean to reserve a small amount of bandwidth on a NIC for postcopy
migration, and compare the migration performance with and without traffic
on the NIC? Will data plane traffic affect page request delays in postcopy?

> >
> > 3. In my opinion, postcopy is a good solution in this scenario (low network
> >    bandwidth, VM is not critical), because even if compression is turned on,
> >    the migration may still fail (pages_per_second may still be less than the
> >    rate of new dirty pages), and it is hard to predict whether VM memory is
> >    compression-friendly.
>
> Yes.
>
> Thanks,
>
> --
> Peter Xu
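
The convergence argument in points 1 and 3 above can be sketched as a simple
rate comparison in Python. The pages-per-second figures are copied from the
8-channel, max-bandwidth 1250M table quoted in this message; the dirty rate of
500,000 pages per second is a made-up value for illustration, and the function
name is hypothetical. Real pages_per_second values depend on how compressible
the guest memory is, so this is only a rough check, not a prediction.

```python
# Pages processed per second, from the 8-channel / 1250M table in the thread.
PAGES_PER_SECOND = {"qatzip": 2940235, "zstd": 2391465, "none": 330240}


def can_converge(method: str, dirty_pages_per_second: int) -> bool:
    """Precopy can only catch up if pages are sent faster than they are dirtied."""
    return PAGES_PER_SECOND[method] > dirty_pages_per_second


dirty_rate = 500_000  # hypothetical workload, not measured in the thread
for method in PAGES_PER_SECOND:
    print(method, can_converge(method, dirty_rate))
```

Under this assumed dirty rate only the uncompressed case falls behind, which is
the situation point 1 describes for write-intensive VMs on a limited network.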
On Wed, Jul 10, 2024 at 03:39:43PM +0000, Liu, Yuan1 wrote:
> > I don't think postcopy will trigger timeout failures - postcopy should use
> > constant time to complete a migration, that is guest memsize / bw.
>
> Yes, the total migration time is predictable; "failure due to timeout" is
> incorrect, "migration taking a long time" may be more accurate.

It shouldn't: postcopy is always run together with precopy, so if you start
postcopy after one round of precopy, the total migration time should always
be smaller than if you run precopy for two rounds. With postcopy, the
migration completes after that; with precopy, two rounds of migration will be
followed by a dirty sync which may say "there's unfortunately more dirty
pages, let's move on with the 3rd round and more".

>
> > The challenge is normally that the delay of page requests is higher than in
> > precopy, but in this case it might not be a big deal. And I wonder if on
> > 100G*2 cards it can also perform pretty well, as the delay might be
> > minimal even if bandwidth is throttled.
>
> I got your point, I don't have much experience in this area.
> So you mean to reserve a small amount of bandwidth on a NIC for postcopy
> migration, and compare the migration performance with and without traffic
> on the NIC? Will data plane traffic affect page request delays in postcopy?

I'm not sure what the "data plane" you're describing here is, but logically
VMs should be migrated using mgmt networks, which should be somehow separate
from IOs within the VMs.

I'm not really asking for another test, sorry for causing confusion; this is
purely for discussion. I just feel like postcopy wasn't really seriously
considered even for many valid cases; in some of them postcopy can work
pretty well even without requiring any modern hardware. There's no need to
prove which is better for this series.

Thanks,
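
The comparison between "one precopy round, then postcopy" and "two or more
precopy rounds" can be illustrated with a toy model in Python. This is a
sketch under strong assumptions, not a description of QEMU's implementation:
the send rate and dirty rate are constant, each round resends exactly the
pages dirtied during the previous round, and page-request latency, downtime
limits and throttling are ignored. All names and the example numbers other
than the 330240 pages per second figure (the "none" row in the thread) are
hypothetical.

```python
def precopy(mem_pages, send_rate, dirty_rate, rounds):
    """Return (elapsed seconds, pages still dirty) after `rounds` precopy rounds."""
    elapsed, pending = 0.0, float(mem_pages)
    for _ in range(rounds):
        t = pending / send_rate          # time to send what is pending now
        elapsed += t
        # Pages dirtied while this round was running (capped at guest size).
        pending = min(dirty_rate * t, mem_pages)
    return elapsed, pending


def precopy_then_postcopy(mem_pages, send_rate, dirty_rate):
    """One precopy round, then postcopy sends each remaining page exactly once."""
    elapsed, pending = precopy(mem_pages, send_rate, dirty_rate, rounds=1)
    return elapsed + pending / send_rate


mem_pages = 16_000_000   # hypothetical guest size in 4 KiB pages
send_rate = 330_240      # pages/s, the "none" row from the thread's table
dirty_rate = 200_000     # hypothetical dirty rate in pages/s

two_rounds, still_dirty = precopy(mem_pages, send_rate, dirty_rate, rounds=2)
mixed = precopy_then_postcopy(mem_pages, send_rate, dirty_rate)
print(f"precopy x2: {two_rounds:.1f} s, {still_dirty:.0f} pages still dirty")
print(f"precopy + postcopy: {mixed:.1f} s, migration complete")
```

In this model the two variants have transferred the same amount of data at the
same point in time, but the postcopy variant is finished while the precopy
variant still has dirty pages left and needs further rounds.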