Message ID | 20190404105838.101559-1-sgarzare@redhat.com (mailing list archive) |
---|---|
Series | vsock/virtio: optimizations to increase the throughput |
On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>  - patch 1/4: reduces the number of credit update messages sent to the
>               transmitter
>  - patch 2/4: allows the host to split packets on multiple buffers,
>               in this way, we can remove the packet size limit to
>               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>               allowed
>  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
>
> RFC:
>  - maybe patch 4 can be replaced with multiple queues with different
>    buffer sizes or using EWMA to adapt the buffer size to the traffic
>
>  - as Jason suggested in a previous thread [1] I'll evaluate to use
>    virtio-net as transport, but I need to understand better how to
>    interface with it, maybe introducing sk_buff in virtio-vsock.
>
> Any suggestions?

Great performance results, nice job!

Please include efficiency numbers (bandwidth / CPU utilization) in the
future.  Due to the nature of these optimizations it's unlikely that
efficiency has decreased, so I'm not too worried about it this time.

On Thu, Apr 04, 2019 at 03:14:10PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >  - patch 1/4: reduces the number of credit update messages sent to the
> >               transmitter
> >  - patch 2/4: allows the host to split packets on multiple buffers,
> >               in this way, we can remove the packet size limit to
> >               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >               allowed
> >  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> >
> > RFC:
> >  - maybe patch 4 can be replaced with multiple queues with different
> >    buffer sizes or using EWMA to adapt the buffer size to the traffic
> >
> >  - as Jason suggested in a previous thread [1] I'll evaluate to use
> >    virtio-net as transport, but I need to understand better how to
> >    interface with it, maybe introducing sk_buff in virtio-vsock.
> >
> > Any suggestions?
>
> Great performance results, nice job! :)
>
> Please include efficiency numbers (bandwidth / CPU utilization) in the
> future.  Due to the nature of these optimizations it's unlikely that
> efficiency has decreased, so I'm not too worried about it this time.

Thanks for the suggestion! I'll also measure the efficiency for future
optimizations.

Cheers,
Stefano

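The efficiency figure Stefan asks for (bandwidth / CPU utilization) could be computed roughly as below. This is only a sketch: the throughput values are the 64K host -> guest numbers from the series' benchmark table (before vs. patch 4), while the CPU utilization values and the `efficiency` helper are hypothetical placeholders, not measurements from this thread.

```python
def efficiency(gbps: float, cpu_utilization: float) -> float:
    """Return throughput per unit of CPU utilization (Gbps per busy core)."""
    if cpu_utilization <= 0:
        raise ValueError("cpu_utilization must be positive")
    return gbps / cpu_utilization


# Throughput: 64K host -> guest, before optimizations vs. patch 4.
# CPU utilization (in busy cores): hypothetical values for illustration only.
before = efficiency(13.5, cpu_utilization=1.0)
after = efficiency(21.4, cpu_utilization=1.2)
print(f"before: {before:.2f} Gbps/core, after: {after:.2f} Gbps/core")
```

A higher throughput only counts as an efficiency win if this ratio does not drop, which is the point of reporting it next to the raw Gbps.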
On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>  - patch 1/4: reduces the number of credit update messages sent to the
>               transmitter
>  - patch 2/4: allows the host to split packets on multiple buffers,
>               in this way, we can remove the packet size limit to
>               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>               allowed
>  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
>
> RFC:
>  - maybe patch 4 can be replaced with multiple queues with different
>    buffer sizes or using EWMA to adapt the buffer size to the traffic
>
>  - as Jason suggested in a previous thread [1] I'll evaluate to use
>    virtio-net as transport, but I need to understand better how to
>    interface with it, maybe introducing sk_buff in virtio-vsock.
>
> Any suggestions?
>
> Here some benchmarks step by step. I used iperf3 [2] modified with VSOCK
> support:
>
> host -> guest [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3   patch 4
> 64          0.060         0.102     0.102         0.096
> 256         0.22          0.40      0.40          0.36
> 512         0.42          0.82      0.85          0.74
> 1K          0.7           1.6       1.6           1.5
> 2K          1.5           3.0       3.1           2.9
> 4K          2.5           5.2       5.3           5.3
> 8K          3.9           8.4       8.6           8.8
> 16K         6.6           11.1      11.3          12.8
> 32K         9.9           15.8      15.8          18.1
> 64K         13.5          17.4      17.7          21.4
> 128K        17.9          19.0      19.0          23.6
> 256K        18.0          19.4      19.8          24.4
> 512K        18.4          19.6      20.1          25.3
>
> guest -> host [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3
> 64          0.088         0.100     0.101
> 256         0.35          0.36      0.41
> 512         0.70          0.74      0.73
> 1K          1.1           1.3       1.3
> 2K          2.4           2.4       2.6
> 4K          4.3           4.3       4.5
> 8K          7.3           7.4       7.6
> 16K         9.2           9.6       11.1
> 32K         8.3           8.9       18.1
> 64K         8.3           8.9       25.4
> 128K        7.2           8.7       26.7
> 256K        7.7           8.4       24.9
> 512K        7.7           8.5       25.0
>
> Thanks,
> Stefano

I simply love it that you have analysed the individual impact of
each patch! Great job!

For comparison's sake, it could be IMHO beneficial to add a column
with virtio-net+vhost-net performance.

This will both give us an idea about whether the vsock layer introduces
inefficiencies, and whether the virtio-net idea has merit.

One other comment: it makes sense to test with disabling smap
mitigations (boot host and guest with nosmap).  No problem with also
testing the default smap path, but I think you will discover that the
performance impact of smap hardening being enabled is often severe for
such benchmarks.

> [1] https://www.spinics.net/lists/netdev/msg531783.html
> [2] https://github.com/stefano-garzarella/iperf/
>
> Stefano Garzarella (4):
>   vsock/virtio: reduce credit update messages
>   vhost/vsock: split packets to send using multiple buffers
>   vsock/virtio: change the maximum packet size allowed
>   vsock/virtio: increase RX buffer size to 64 KiB
>
>  drivers/vhost/vsock.c                   | 35 ++++++++++++++++++++-----
>  include/linux/virtio_vsock.h            |  3 ++-
>  net/vmw_vsock/virtio_transport_common.c | 18 +++++++++----
>  3 files changed, 44 insertions(+), 12 deletions(-)
>
> --
> 2.20.1

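To make the per-patch impact in the tables above easier to read, the speedup of the full series over the baseline can be computed for a few packet sizes. The sketch below copies a handful of rows from the benchmark tables (host -> guest: "before opt." vs. "patch 4"; guest -> host: "before opt." vs. "patches 2+3"); the dictionary and function names are illustrative only.

```python
# pkt_size -> (before opt., full series) in Gbps, copied from the tables above.
HOST_TO_GUEST = {
    "64": (0.060, 0.096),
    "4K": (2.5, 5.3),
    "64K": (13.5, 21.4),
    "512K": (18.4, 25.3),
}
GUEST_TO_HOST = {
    "64": (0.088, 0.101),
    "4K": (4.3, 4.5),
    "64K": (8.3, 25.4),
    "512K": (7.7, 25.0),
}


def speedups(table):
    """Map each packet size to the ratio (after / before)."""
    return {size: after / before for size, (before, after) in table.items()}


for name, table in (("host -> guest", HOST_TO_GUEST),
                    ("guest -> host", GUEST_TO_HOST)):
    for size, ratio in speedups(table).items():
        print(f"{name:14s} {size:>5s}: x{ratio:.2f}")
```

The ratios suggest the guest -> host gain is concentrated at large packet sizes, where patches 2+3 lift the packet size limit, while host -> guest improves across the whole range.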
On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> I simply love it that you have analysed the individual impact of
> each patch! Great job!

Thanks! I followed Stefan's suggestions!

>
> For comparison's sake, it could be IMHO beneficial to add a column
> with virtio-net+vhost-net performance.
>
> This will both give us an idea about whether the vsock layer introduces
> inefficiencies, and whether the virtio-net idea has merit.
>

Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
this way:
  $ qemu-system-x86_64 ... \
      -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
      -device virtio-net-pci,netdev=net0

I also did a test using TCP_NODELAY, just to be fair, because VSOCK
doesn't implement something like this.
In both cases I set the MTU to the maximum allowed (65520).

            VSOCK                                           TCP + virtio-net + vhost
            host -> guest [Gbps]                            host -> guest [Gbps]
pkt_size    before opt.   patch 1   patches 2+3   patch 4           TCP_NODELAY
64          0.060         0.102     0.102         0.096     0.16    0.15
256         0.22          0.40      0.40          0.36      0.32    0.57
512         0.42          0.82      0.85          0.74      1.2     1.2
1K          0.7           1.6       1.6           1.5       2.1     2.1
2K          1.5           3.0       3.1           2.9       3.5     3.4
4K          2.5           5.2       5.3           5.3       5.5     5.3
8K          3.9           8.4       8.6           8.8       8.0     7.9
16K         6.6           11.1      11.3          12.8      9.8     10.2
32K         9.9           15.8      15.8          18.1      11.8    10.7
64K         13.5          17.4      17.7          21.4      11.4    11.3
128K        17.9          19.0      19.0          23.6      11.2    11.0
256K        18.0          19.4      19.8          24.4      11.1    11.0
512K        18.4          19.6      20.1          25.3      10.1    10.7

For small packet sizes (< 4K) I think we should implement some kind of
batching/merging, which could come for free if we use virtio-net as a
transport.

Note: Maybe I have something misconfigured, because TCP on virtio-net
for the host -> guest case doesn't exceed 11 Gbps.

            VSOCK                                 TCP + virtio-net + vhost
            guest -> host [Gbps]                  guest -> host [Gbps]
pkt_size    before opt.   patch 1   patches 2+3           TCP_NODELAY
64          0.088         0.100     0.101         0.24    0.24
256         0.35          0.36      0.41          0.36    1.03
512         0.70          0.74      0.73          0.69    1.6
1K          1.1           1.3       1.3           1.1     3.0
2K          2.4           2.4       2.6           2.1     5.5
4K          4.3           4.3       4.5           3.8     8.8
8K          7.3           7.4       7.6           6.6     20.0
16K         9.2           9.6       11.1          12.3    29.4
32K         8.3           8.9       18.1          19.3    28.2
64K         8.3           8.9       25.4          20.6    28.7
128K        7.2           8.7       26.7          23.1    27.9
256K        7.7           8.4       24.9          28.5    29.4
512K        7.7           8.5       25.0          28.3    29.3

For guest -> host I think the TCP_NODELAY test is important, because TCP
buffering increases the throughput a lot.

> One other comment: it makes sense to test with disabling smap
> mitigations (boot host and guest with nosmap).  No problem with also
> testing the default smap path, but I think you will discover that the
> performance impact of smap hardening being enabled is often severe for
> such benchmarks.

Thanks for this valuable suggestion, I'll redo all the tests with nosmap!

Cheers,
Stefano

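For readers unfamiliar with the TCP_NODELAY column above: it is a standard per-socket TCP option that disables Nagle's algorithm, so small writes are sent immediately instead of being coalesced. A minimal sketch of toggling it with Python's `socket` module follows; the helper name `make_tcp_socket` is illustrative, and this is not how the benchmarks in this thread were driven.

```python
import socket


def make_tcp_socket(nodelay: bool) -> socket.socket:
    """Create a TCP socket with Nagle's algorithm enabled or disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 disables Nagle's algorithm (no coalescing of small writes).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
    return sock


with make_tcp_socket(nodelay=True) as s:
    enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    print("TCP_NODELAY enabled:", bool(enabled))
```

VSOCK has no equivalent option at this point in the thread, which is why the comparison includes both TCP variants.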
On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > I simply love it that you have analysed the individual impact of
> > each patch! Great job!
>
> Thanks! I followed Stefan's suggestions!
>
> >
> > For comparison's sake, it could be IMHO beneficial to add a column
> > with virtio-net+vhost-net performance.
> >
> > This will both give us an idea about whether the vsock layer introduces
> > inefficiencies, and whether the virtio-net idea has merit.
> >
>
> Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> this way:
>   $ qemu-system-x86_64 ... \
>       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
>       -device virtio-net-pci,netdev=net0
>
> I also did a test using TCP_NODELAY, just to be fair, because VSOCK
> doesn't implement something like this.

Why not?

> In both cases I set the MTU to the maximum allowed (65520).
>
>             VSOCK                                           TCP + virtio-net + vhost
>             host -> guest [Gbps]                            host -> guest [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3   patch 4           TCP_NODELAY
> 64          0.060         0.102     0.102         0.096     0.16    0.15
> 256         0.22          0.40      0.40          0.36      0.32    0.57
> 512         0.42          0.82      0.85          0.74      1.2     1.2
> 1K          0.7           1.6       1.6           1.5       2.1     2.1
> 2K          1.5           3.0       3.1           2.9       3.5     3.4
> 4K          2.5           5.2       5.3           5.3       5.5     5.3
> 8K          3.9           8.4       8.6           8.8       8.0     7.9
> 16K         6.6           11.1      11.3          12.8      9.8     10.2
> 32K         9.9           15.8      15.8          18.1      11.8    10.7
> 64K         13.5          17.4      17.7          21.4      11.4    11.3
> 128K        17.9          19.0      19.0          23.6      11.2    11.0
> 256K        18.0          19.4      19.8          24.4      11.1    11.0
> 512K        18.4          19.6      20.1          25.3      10.1    10.7
>
> For small packet sizes (< 4K) I think we should implement some kind of
> batching/merging, which could come for free if we use virtio-net as a
> transport.
>
> Note: Maybe I have something misconfigured, because TCP on virtio-net
> for the host -> guest case doesn't exceed 11 Gbps.
>
>             VSOCK                                 TCP + virtio-net + vhost
>             guest -> host [Gbps]                  guest -> host [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3           TCP_NODELAY
> 64          0.088         0.100     0.101         0.24    0.24
> 256         0.35          0.36      0.41          0.36    1.03
> 512         0.70          0.74      0.73          0.69    1.6
> 1K          1.1           1.3       1.3           1.1     3.0
> 2K          2.4           2.4       2.6           2.1     5.5
> 4K          4.3           4.3       4.5           3.8     8.8
> 8K          7.3           7.4       7.6           6.6     20.0
> 16K         9.2           9.6       11.1          12.3    29.4
> 32K         8.3           8.9       18.1          19.3    28.2
> 64K         8.3           8.9       25.4          20.6    28.7
> 128K        7.2           8.7       26.7          23.1    27.9
> 256K        7.7           8.4       24.9          28.5    29.4
> 512K        7.7           8.5       25.0          28.3    29.3
>
> For guest -> host I think the TCP_NODELAY test is important, because TCP
> buffering increases the throughput a lot.
>
> > One other comment: it makes sense to test with disabling smap
> > mitigations (boot host and guest with nosmap).  No problem with also
> > testing the default smap path, but I think you will discover that the
> > performance impact of smap hardening being enabled is often severe for
> > such benchmarks.
>
> Thanks for this valuable suggestion, I'll redo all the tests with nosmap!
>
> Cheers,
> Stefano

On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > I simply love it that you have analysed the individual impact of
> > > each patch! Great job!
> >
> > Thanks! I followed Stefan's suggestions!
> >
> > >
> > > For comparison's sake, it could be IMHO beneficial to add a column
> > > with virtio-net+vhost-net performance.
> > >
> > > This will both give us an idea about whether the vsock layer introduces
> > > inefficiencies, and whether the virtio-net idea has merit.
> > >
> >
> > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > this way:
> >   $ qemu-system-x86_64 ... \
> >       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> >       -device virtio-net-pci,netdev=net0
> >
> > I also did a test using TCP_NODELAY, just to be fair, because VSOCK
> > doesn't implement something like this.
>
> Why not?
>

I think because VSOCK was originally designed to be simple and
low-latency, but of course we can introduce something like that.

The current implementation directly copies the buffer from user space
into a virtio_vsock_pkt and enqueues it to be transmitted.

Maybe we can introduce a per-socket buffer where we accumulate bytes and
send it when it is full or when a timer fires. We can also introduce
a VSOCK_NODELAY (maybe using the same value as TCP_NODELAY for
compatibility) to send the buffer immediately for low-latency use cases.

What do you think?

Thanks,
Stefano

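The per-socket buffering idea above (accumulate bytes, send when the buffer is full or a timer fires, with a no-delay switch for low latency) can be sketched in user-space Python. Everything here is hypothetical: `CoalescingSender`, its parameters, and the `poll()` stand-in for a kernel timer are illustration only, not the virtio-vsock implementation or any proposed patch.

```python
import time
from typing import Callable, Optional


class CoalescingSender:
    """Accumulate small writes and emit them as larger packets."""

    def __init__(self, send_pkt: Callable[[bytes], None], capacity: int = 4096,
                 max_delay: float = 0.001, nodelay: bool = False,
                 clock: Callable[[], float] = time.monotonic):
        self._send_pkt = send_pkt      # callback that transmits one packet
        self._capacity = capacity      # flush threshold in bytes (must be > 0)
        self._max_delay = max_delay    # longest time data may sit in the buffer
        self.nodelay = nodelay         # hypothetical VSOCK_NODELAY-like switch
        self._clock = clock
        self._buf = bytearray()
        self._deadline: Optional[float] = None

    def write(self, data: bytes) -> None:
        view = memoryview(data)
        while len(view):
            if not self._buf:
                # First byte into an empty buffer arms the flush deadline.
                self._deadline = self._clock() + self._max_delay
            room = self._capacity - len(self._buf)
            chunk = view[:room]
            self._buf += chunk
            view = view[len(chunk):]
            if len(self._buf) >= self._capacity:
                self.flush()           # buffer full: send immediately
        if self.nodelay:
            self.flush()               # low-latency mode: never hold data back

    def poll(self) -> None:
        """Stand-in for the timer callback: flush once the deadline passed."""
        if self._buf and self._clock() >= self._deadline:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            self._send_pkt(bytes(self._buf))
            self._buf.clear()
            self._deadline = None
```

With `nodelay=False`, three small writes that together fill the buffer result in a single packet; with `nodelay=True`, each write is sent on its own, which mirrors the latency/throughput trade-off discussed above.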
On 2019/4/4 6:58 PM, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>  - patch 1/4: reduces the number of credit update messages sent to the
>               transmitter
>  - patch 2/4: allows the host to split packets on multiple buffers,
>               in this way, we can remove the packet size limit to
>               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>               allowed
>  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
>
> RFC:
>  - maybe patch 4 can be replaced with multiple queues with different
>    buffer sizes or using EWMA to adapt the buffer size to the traffic

Or EWMA + mergeable rx buffer, but if we decide to unify the datapath with
virtio-net, we can reuse their code.

>
>  - as Jason suggested in a previous thread [1] I'll evaluate to use
>    virtio-net as transport, but I need to understand better how to
>    interface with it, maybe introducing sk_buff in virtio-vsock.
>
> Any suggestions?

My understanding is this is not a must, but if it makes things easier, we
can do this.

Another thing that may help is to implement sendpage(), which will greatly
improve the performance.

Thanks

>
> Here some benchmarks step by step. I used iperf3 [2] modified with VSOCK
> support:
>
> host -> guest [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3   patch 4
> 64          0.060         0.102     0.102         0.096
> 256         0.22          0.40      0.40          0.36
> 512         0.42          0.82      0.85          0.74
> 1K          0.7           1.6       1.6           1.5
> 2K          1.5           3.0       3.1           2.9
> 4K          2.5           5.2       5.3           5.3
> 8K          3.9           8.4       8.6           8.8
> 16K         6.6           11.1      11.3          12.8
> 32K         9.9           15.8      15.8          18.1
> 64K         13.5          17.4      17.7          21.4
> 128K        17.9          19.0      19.0          23.6
> 256K        18.0          19.4      19.8          24.4
> 512K        18.4          19.6      20.1          25.3
>
> guest -> host [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3
> 64          0.088         0.100     0.101
> 256         0.35          0.36      0.41
> 512         0.70          0.74      0.73
> 1K          1.1           1.3       1.3
> 2K          2.4           2.4       2.6
> 4K          4.3           4.3       4.5
> 8K          7.3           7.4       7.6
> 16K         9.2           9.6       11.1
> 32K         8.3           8.9       18.1
> 64K         8.3           8.9       25.4
> 128K        7.2           8.7       26.7
> 256K        7.7           8.4       24.9
> 512K        7.7           8.5       25.0
>
> Thanks,
> Stefano
>
> [1] https://www.spinics.net/lists/netdev/msg531783.html
> [2] https://github.com/stefano-garzarella/iperf/
>
> Stefano Garzarella (4):
>   vsock/virtio: reduce credit update messages
>   vhost/vsock: split packets to send using multiple buffers
>   vsock/virtio: change the maximum packet size allowed
>   vsock/virtio: increase RX buffer size to 64 KiB
>
>  drivers/vhost/vsock.c                   | 35 ++++++++++++++++++++-----
>  include/linux/virtio_vsock.h            |  3 ++-
>  net/vmw_vsock/virtio_transport_common.c | 18 +++++++++----
>  3 files changed, 44 insertions(+), 12 deletions(-)
>

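The EWMA idea mentioned in the RFC and in Jason's reply (adapt the RX buffer size to the observed traffic) can be illustrated with a small sketch. The class name, the weight of 1/8, the page-size rounding, and the 4 KiB lower bound are assumptions made for this example; only the 64 KiB upper bound corresponds to the RX buffer size named in patch 4.

```python
import math

PAGE_SIZE = 4096  # assumed page size used for rounding


class EwmaBufferSizer:
    """Track an exponentially weighted moving average of packet lengths."""

    def __init__(self, initial: int = 4096, weight: float = 0.125,
                 min_size: int = 4096, max_size: int = 65536):
        self.avg = float(initial)
        self.weight = weight        # how strongly a new sample moves the average
        self.min_size = min_size    # assumed lower bound (4 KiB)
        self.max_size = max_size    # 64 KiB, the size used in patch 4

    def update(self, pkt_len: int) -> int:
        """Feed one received packet length; return the suggested buffer size."""
        self.avg += self.weight * (pkt_len - self.avg)
        return self.buffer_size()

    def buffer_size(self) -> int:
        # Round the average up to a whole number of pages, then clamp.
        pages = math.ceil(self.avg / PAGE_SIZE)
        return max(self.min_size, min(self.max_size, pages * PAGE_SIZE))


sizer = EwmaBufferSizer()
for length in (65536, 65536, 65536, 512, 512):
    print(length, "->", sizer.update(length))
```

A stream of large packets gradually grows the suggested size toward 64 KiB, while a return to small packets shrinks it again, so memory is not pinned in oversized buffers for mostly-small traffic.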
On Fri, Apr 05, 2019 at 09:49:17AM +0200, Stefano Garzarella wrote:
> On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > > I simply love it that you have analysed the individual impact of
> > > > each patch! Great job!
> > >
> > > Thanks! I followed Stefan's suggestions!
> > >
> > > >
> > > > For comparison's sake, it could be IMHO beneficial to add a column
> > > > with virtio-net+vhost-net performance.
> > > >
> > > > This will both give us an idea about whether the vsock layer introduces
> > > > inefficiencies, and whether the virtio-net idea has merit.
> > > >
> > >
> > > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > > this way:
> > >   $ qemu-system-x86_64 ... \
> > >       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> > >       -device virtio-net-pci,netdev=net0
> > >
> > > I also did a test using TCP_NODELAY, just to be fair, because VSOCK
> > > doesn't implement something like this.
> >
> > Why not?
> >
>
> I think because VSOCK was originally designed to be simple and
> low-latency, but of course we can introduce something like that.
>
> The current implementation directly copies the buffer from user space
> into a virtio_vsock_pkt and enqueues it to be transmitted.
>
> Maybe we can introduce a per-socket buffer where we accumulate bytes and
> send it when it is full or when a timer fires. We can also introduce
> a VSOCK_NODELAY (maybe using the same value as TCP_NODELAY for
> compatibility) to send the buffer immediately for low-latency use cases.
>
> What do you think?

Today virtio-vsock implements a 1:1 sendmsg():packet relationship
because it's simple.  But there's no need for the guest to enqueue
multiple VIRTIO_VSOCK_OP_RW packets when a single large packet could
combine all payloads for a connection.  This is not the same as
TCP_NODELAY but related.

I think it's worth exploring TCP_NODELAY and send_pkt_list merging.
Hopefully it won't make the code much more complicated.

Stefan

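The send_pkt_list merging Stefan describes (combine queued payloads for one connection into a single larger packet instead of one packet per sendmsg()) could look roughly like the sketch below. The `Packet` type, the connection key, and the 64 KiB cap are hypothetical and chosen for illustration; the real queue holds virtio_vsock_pkt structures in the kernel. Only adjacent packets of the same connection are merged, so the order seen by each receiver is preserved.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Packet:
    conn: Tuple[int, int]   # hypothetical connection key (src_port, dst_port)
    payload: bytes


def merge_rw_packets(queue: Iterable[Packet],
                     max_len: int = 65536) -> List[Packet]:
    """Merge adjacent packets of the same connection, up to max_len bytes."""
    merged: List[Packet] = []
    for pkt in queue:
        last = merged[-1] if merged else None
        if (last is not None and last.conn == pkt.conn
                and len(last.payload) + len(pkt.payload) <= max_len):
            last.payload += pkt.payload      # extend the previous packet
        else:
            # Copy so the caller's queue entries are left untouched.
            merged.append(Packet(pkt.conn, pkt.payload))
    return merged


a, b = (1000, 2000), (1001, 2000)
queue = [Packet(a, b"x" * 64), Packet(a, b"y" * 64),
         Packet(b, b"z"), Packet(a, b"w")]
for pkt in merge_rw_packets(queue):
    print(pkt.conn, len(pkt.payload))
```

For the small-packet cases in the benchmark tables, this kind of merging would reduce the number of packets and notifications per byte transferred, at the cost of some extra copying.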
On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
> Another thing that may help is to implement sendpage(), which will greatly
> improve the performance.

I can't find documentation for ->sendpage().  Is the idea that you get a
struct page for the payload and can do zero-copy tx?  (And can userspace
still write to the page, invalidating checksums in the header?)

Stefan

On 2019/4/8 5:44 PM, Stefan Hajnoczi wrote:
> On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
>> Another thing that may help is to implement sendpage(), which will greatly
>> improve the performance.
> I can't find documentation for ->sendpage().  Is the idea that you get a
> struct page for the payload and can do zero-copy tx?

Yes.

> (And can userspace
> still write to the page, invalidating checksums in the header?)
>
> Stefan

Userspace can still write to the page, but for correctness (e.g. in the
case of SPLICE_F_GIFT described by vmsplice(2)), it should not do this.
For vmsplice, it may be hard to detect when the page can be reused. Maybe
MSG_ZEROCOPY [1] is better. Anyway, sendpage() could still be useful
for sendfile() or splice().

Thanks

[1] https://netdevconf.org/2.1/papers/netdev.pdf

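From user space, the sendfile() path Jason mentions is what benefits when a socket family implements ->sendpage(): file pages are handed to the socket without passing through a user-space buffer. The sketch below shows only the user-space side, using Python's `socket.sendfile()` over a local socket pair rather than AF_VSOCK (an assumption made so the example runs without a VM); `send_file` and `demo` are illustrative names.

```python
import socket
import tempfile


def send_file(sock: socket.socket, path: str) -> int:
    """Send a whole file over a stream socket; returns the bytes sent."""
    with open(path, "rb") as f:
        # Uses os.sendfile() where available, else falls back to send().
        return sock.sendfile(f)


def demo(payload: bytes = b"x" * 4096):
    """Round-trip a payload through a temporary file and a socket pair."""
    tx, rx = socket.socketpair()
    with tx, rx, tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        sent = send_file(tx, tmp.name)
        tx.shutdown(socket.SHUT_WR)     # signal end of stream to the reader
        received = bytearray()
        while True:
            chunk = rx.recv(65536)
            if not chunk:
                break
            received += chunk
    return sent, bytes(received)


sent, data = demo()
print("sent", sent, "bytes, received", len(data), "bytes")
```

Whether the kernel can actually avoid the copy depends on the socket family; for VSOCK that is exactly the support being discussed here.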
On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
>
> On 2019/4/4 6:58 PM, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >  - patch 1/4: reduces the number of credit update messages sent to the
> >               transmitter
> >  - patch 2/4: allows the host to split packets on multiple buffers,
> >               in this way, we can remove the packet size limit to
> >               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >               allowed
> >  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> >
> > RFC:
> >  - maybe patch 4 can be replaced with multiple queues with different
> >    buffer sizes or using EWMA to adapt the buffer size to the traffic
>
>
> Or EWMA + mergeable rx buffer, but if we decide to unify the datapath with
> virtio-net, we can reuse their code.
>
>
> >
> >  - as Jason suggested in a previous thread [1] I'll evaluate to use
> >    virtio-net as transport, but I need to understand better how to
> >    interface with it, maybe introducing sk_buff in virtio-vsock.
> >
> > Any suggestions?
>
>
> My understanding is this is not a must, but if it makes things easier, we
> can do this.

Hopefully it should simplify maintenance and avoid duplicated code.

>
> Another thing that may help is to implement sendpage(), which will greatly
> improve the performance.

Thanks for your suggestions!
I'll try to implement sendpage() in VSOCK to measure the improvement.

Cheers,
Stefano