[RFC,0/4] vsock/virtio: optimizations to increase the throughput

Message-ID: <20190404105838.101559-1-sgarzare@redhat.com>
From: Stefano Garzarella <sgarzare@redhat.com>
To: netdev@vger.kernel.org
Cc: Jason Wang <jasowang@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>,
    Stefan Hajnoczi <stefanha@redhat.com>, kvm@vger.kernel.org,
    virtualization@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
    "David S. Miller" <davem@davemloft.net>
Subject: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput
Date: Thu, 4 Apr 2019 12:58:34 +0200

Stefano Garzarella April 4, 2019, 10:58 a.m. UTC

This series tries to increase the throughput of virtio-vsock with slight
changes:
 - patch 1/4: reduces the number of credit update messages sent to the
              transmitter
 - patch 2/4: allows the host to split packets across multiple buffers;
              this lets us remove the limit that caps the packet size at
              VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
 - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
              allowed
 - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
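
As a rough illustration of the splitting idea behind patch 2/4, here is a
userspace sketch (not the vhost code from the series). RX_BUF_SIZE and
split_payload() are hypothetical names, and the 4 KiB value is only an
assumed receive-buffer size:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed receive-buffer size, for illustration only. */
#define RX_BUF_SIZE 4096u

/*
 * Split a payload of `len` bytes into chunks no larger than `buf_size`.
 * Each chunk length is stored in `chunks` (capacity `max_chunks`).
 * Returns the number of chunks, or 0 if they would not fit (or len is 0).
 */
static size_t split_payload(size_t len, size_t buf_size,
                            size_t *chunks, size_t max_chunks)
{
    size_t n = 0;

    assert(buf_size > 0);
    while (len > 0) {
        size_t part = len < buf_size ? len : buf_size;

        if (n == max_chunks)
            return 0;
        chunks[n++] = part;
        len -= part;
    }
    return n;
}
```

With a split like this, a 10000-byte payload occupies three 4 KiB buffers
instead of being rejected for exceeding a single buffer.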

RFC:
 - maybe patch 4 can be replaced with multiple queues with different
   buffer sizes or using EWMA to adapt the buffer size to the traffic

 - as Jason suggested in a previous thread [1], I'll evaluate using
   virtio-net as the transport, but I need to better understand how to
   interface with it, perhaps by introducing sk_buff in virtio-vsock.
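
To make the EWMA idea above concrete, here is a minimal userspace sketch of
how an exponentially weighted moving average of packet lengths could drive
the choice of buffer size. All names and constants (EWMA_WEIGHT,
MIN_BUF_SIZE, MAX_BUF_SIZE, struct pkt_len_ewma) are assumptions made for
illustration; nothing here is taken from the series or from virtio-net:

```c
#include <stdint.h>

/* Hypothetical tuning values. */
#define EWMA_WEIGHT   8u      /* each new sample contributes 1/8 */
#define MIN_BUF_SIZE  4096u
#define MAX_BUF_SIZE  65536u

struct pkt_len_ewma {
    uint32_t avg;             /* running average packet length, in bytes */
};

/* Fold one observed packet length into the running average. */
static void ewma_add(struct pkt_len_ewma *e, uint32_t sample)
{
    if (e->avg == 0) {
        e->avg = sample;      /* the first sample seeds the average */
        return;
    }
    e->avg = (uint32_t)(((uint64_t)e->avg * (EWMA_WEIGHT - 1) + sample) /
                        EWMA_WEIGHT);
}

/* Pick the smallest power-of-two buffer size that covers the average. */
static uint32_t next_buf_size(const struct pkt_len_ewma *e)
{
    uint32_t size = MIN_BUF_SIZE;

    while (size < e->avg && size < MAX_BUF_SIZE)
        size *= 2;
    return size;
}
```

Small-packet traffic keeps the buffers at the minimum size, while a
sustained stream of large packets gradually moves them up to the maximum.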

Any suggestions?

Here are some benchmarks, step by step. I used iperf3 [2] modified with VSOCK
support:

                        host -> guest [Gbps]
pkt_size    before opt.   patch 1   patches 2+3   patch 4
  64            0.060       0.102       0.102       0.096
  256           0.22        0.40        0.40        0.36
  512           0.42        0.82        0.85        0.74
  1K            0.7         1.6         1.6         1.5
  2K            1.5         3.0         3.1         2.9
  4K            2.5         5.2         5.3         5.3
  8K            3.9         8.4         8.6         8.8
  16K           6.6        11.1        11.3        12.8
  32K           9.9        15.8        15.8        18.1
  64K          13.5        17.4        17.7        21.4
  128K         17.9        19.0        19.0        23.6
  256K         18.0        19.4        19.8        24.4
  512K         18.4        19.6        20.1        25.3

                        guest -> host [Gbps]
pkt_size    before opt.   patch 1   patches 2+3
  64            0.088       0.100       0.101
  256           0.35        0.36        0.41
  512           0.70        0.74        0.73
  1K            1.1         1.3         1.3
  2K            2.4         2.4         2.6
  4K            4.3         4.3         4.5
  8K            7.3         7.4         7.6
  16K           9.2         9.6        11.1
  32K           8.3         8.9        18.1
  64K           8.3         8.9        25.4
  128K          7.2         8.7        26.7
  256K          7.7         8.4        24.9
  512K          7.7         8.5        25.0

Thanks,
Stefano

[1] https://www.spinics.net/lists/netdev/msg531783.html
[2] https://github.com/stefano-garzarella/iperf/

Stefano Garzarella (4):
  vsock/virtio: reduce credit update messages
  vhost/vsock: split packets to send using multiple buffers
  vsock/virtio: change the maximum packet size allowed
  vsock/virtio: increase RX buffer size to 64 KiB

 drivers/vhost/vsock.c                   | 35 ++++++++++++++++++++-----
 include/linux/virtio_vsock.h            |  3 ++-
 net/vmw_vsock/virtio_transport_common.c | 18 +++++++++----
 3 files changed, 44 insertions(+), 12 deletions(-)

Stefan Hajnoczi April 4, 2019, 2:14 p.m. UTC | #1

On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>  - patch 1/4: reduces the number of credit update messages sent to the
>               transmitter
>  - patch 2/4: allows the host to split packets on multiple buffers,
>               in this way, we can remove the packet size limit to
>               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>               allowed
>  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> 
> RFC:
>  - maybe patch 4 can be replaced with multiple queues with different
>    buffer sizes or using EWMA to adapt the buffer size to the traffic
> 
>  - as Jason suggested in a previous thread [1] I'll evaluate to use
>    virtio-net as transport, but I need to understand better how to
>    interface with it, maybe introducing sk_buff in virtio-vsock.
> 
> Any suggestions?

Great performance results, nice job!

Please include efficiency numbers (bandwidth / CPU utilization) in the
future.  Due to the nature of these optimizations it's unlikely that
efficiency has decreased, so I'm not too worried about it this time.

Stefano Garzarella April 4, 2019, 3:44 p.m. UTC | #2

On Thu, Apr 04, 2019 at 03:14:10PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >  - patch 1/4: reduces the number of credit update messages sent to the
> >               transmitter
> >  - patch 2/4: allows the host to split packets on multiple buffers,
> >               in this way, we can remove the packet size limit to
> >               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >               allowed
> >  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> > 
> > RFC:
> >  - maybe patch 4 can be replaced with multiple queues with different
> >    buffer sizes or using EWMA to adapt the buffer size to the traffic
> > 
> >  - as Jason suggested in a previous thread [1] I'll evaluate to use
> >    virtio-net as transport, but I need to understand better how to
> >    interface with it, maybe introducing sk_buff in virtio-vsock.
> > 
> > Any suggestions?
> 
> Great performance results, nice job!

:)

> 
> Please include efficiency numbers (bandwidth / CPU utilization) in the
> future.  Due to the nature of these optimizations it's unlikely that
> efficiency has decreased, so I'm not too worried about it this time.

Thanks for the suggestion! I'll also measure the efficiency for future
optimizations.

Cheers,
Stefano

Michael S. Tsirkin April 4, 2019, 3:52 p.m. UTC | #3

On Thu, Apr 04, 2019 at 12:58:34PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>  - patch 1/4: reduces the number of credit update messages sent to the
>               transmitter
>  - patch 2/4: allows the host to split packets on multiple buffers,
>               in this way, we can remove the packet size limit to
>               VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>  - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>               allowed
>  - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> 
> RFC:
>  - maybe patch 4 can be replaced with multiple queues with different
>    buffer sizes or using EWMA to adapt the buffer size to the traffic
> 
>  - as Jason suggested in a previous thread [1] I'll evaluate to use
>    virtio-net as transport, but I need to understand better how to
>    interface with it, maybe introducing sk_buff in virtio-vsock.
> 
> Any suggestions?
> 
> Here some benchmarks step by step. I used iperf3 [2] modified with VSOCK
> support:
> 
>                         host -> guest [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3   patch 4
>   64            0.060       0.102       0.102       0.096
>   256           0.22        0.40        0.40        0.36
>   512           0.42        0.82        0.85        0.74
>   1K            0.7         1.6         1.6         1.5
>   2K            1.5         3.0         3.1         2.9
>   4K            2.5         5.2         5.3         5.3
>   8K            3.9         8.4         8.6         8.8
>   16K           6.6        11.1        11.3        12.8
>   32K           9.9        15.8        15.8        18.1
>   64K          13.5        17.4        17.7        21.4
>   128K         17.9        19.0        19.0        23.6
>   256K         18.0        19.4        19.8        24.4
>   512K         18.4        19.6        20.1        25.3
> 
>                         guest -> host [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3
>   64            0.088       0.100       0.101
>   256           0.35        0.36        0.41
>   512           0.70        0.74        0.73
>   1K            1.1         1.3         1.3
>   2K            2.4         2.4         2.6
>   4K            4.3         4.3         4.5
>   8K            7.3         7.4         7.6
>   16K           9.2         9.6        11.1
>   32K           8.3         8.9        18.1
>   64K           8.3         8.9        25.4
>   128K          7.2         8.7        26.7
>   256K          7.7         8.4        24.9
>   512K          7.7         8.5        25.0
> 
> Thanks,
> Stefano

I simply love it that you have analysed the individual impact of
each patch! Great job!

For comparison's sake, it could be IMHO beneficial to add a column
with virtio-net+vhost-net performance.

This will both give us an idea about whether the vsock layer introduces
inefficiencies, and whether the virtio-net idea has merit.

One other comment: it makes sense to test with disabling smap
mitigations (boot host and guest with nosmap).  No problem with also
testing the default smap path, but I think you will discover that the
performance impact of smap hardening being enabled is often severe for
such benchmarks.


> [1] https://www.spinics.net/lists/netdev/msg531783.html
> [2] https://github.com/stefano-garzarella/iperf/
> 
> Stefano Garzarella (4):
>   vsock/virtio: reduce credit update messages
>   vhost/vsock: split packets to send using multiple buffers
>   vsock/virtio: change the maximum packet size allowed
>   vsock/virtio: increase RX buffer size to 64 KiB
> 
>  drivers/vhost/vsock.c                   | 35 ++++++++++++++++++++-----
>  include/linux/virtio_vsock.h            |  3 ++-
>  net/vmw_vsock/virtio_transport_common.c | 18 +++++++++----
>  3 files changed, 44 insertions(+), 12 deletions(-)
> 
> -- 
> 2.20.1

Stefano Garzarella April 4, 2019, 4:47 p.m. UTC | #4

On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> I simply love it that you have analysed the individual impact of
> each patch! Great job!

Thanks! I followed Stefan's suggestions!

> 
> For comparison's sake, it could be IMHO beneficial to add a column
> with virtio-net+vhost-net performance.
> 
> This will both give us an idea about whether the vsock layer introduces
> inefficiencies, and whether the virtio-net idea has merit.
> 

Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
this way:
  $ qemu-system-x86_64 ... \
      -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
      -device virtio-net-pci,netdev=net0

I also ran a test with TCP_NODELAY, just to be fair, because VSOCK
doesn't implement anything like this.
In both cases I set the MTU to the maximum allowed (65520).
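
For reference, TCP_NODELAY is the standard socket option that disables
Nagle's algorithm on a TCP socket. The sketch below shows how an application
sets and reads it; the helper names are hypothetical, and this is not how
the benchmark itself was configured:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with Nagle's algorithm disabled; -1 on error. */
static int open_nodelay_socket(void)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return 1 if TCP_NODELAY is set on fd, 0 if not, -1 on error. */
static int get_tcp_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```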

                        VSOCK                        TCP + virtio-net + vhost
                  host -> guest [Gbps]                 host -> guest [Gbps]
pkt_size  before opt. patch 1 patches 2+3 patch 4     TCP_NODELAY
  64          0.060     0.102     0.102     0.096         0.16        0.15
  256         0.22      0.40      0.40      0.36          0.32        0.57
  512         0.42      0.82      0.85      0.74          1.2         1.2
  1K          0.7       1.6       1.6       1.5           2.1         2.1
  2K          1.5       3.0       3.1       2.9           3.5         3.4
  4K          2.5       5.2       5.3       5.3           5.5         5.3
  8K          3.9       8.4       8.6       8.8           8.0         7.9
  16K         6.6      11.1      11.3      12.8           9.8        10.2
  32K         9.9      15.8      15.8      18.1          11.8        10.7
  64K        13.5      17.4      17.7      21.4          11.4        11.3
  128K       17.9      19.0      19.0      23.6          11.2        11.0
  256K       18.0      19.4      19.8      24.4          11.1        11.0
  512K       18.4      19.6      20.1      25.3          10.1        10.7

For small packet sizes (< 4K), I think we should implement some kind of
batching/merging, which could come for free if we use virtio-net as a
transport.

Note: Maybe I have something misconfigured, because TCP on virtio-net
in the host -> guest case tops out at around 11 Gbps.

                        VSOCK                        TCP + virtio-net + vhost
                  guest -> host [Gbps]                 guest -> host [Gbps]
pkt_size  before opt. patch 1 patches 2+3             TCP_NODELAY
  64          0.088     0.100     0.101                   0.24        0.24
  256         0.35      0.36      0.41                    0.36        1.03
  512         0.70      0.74      0.73                    0.69        1.6
  1K          1.1       1.3       1.3                     1.1         3.0
  2K          2.4       2.4       2.6                     2.1         5.5
  4K          4.3       4.3       4.5                     3.8         8.8
  8K          7.3       7.4       7.6                     6.6        20.0
  16K         9.2       9.6      11.1                    12.3        29.4
  32K         8.3       8.9      18.1                    19.3        28.2
  64K         8.3       8.9      25.4                    20.6        28.7
  128K        7.2       8.7      26.7                    23.1        27.9
  256K        7.7       8.4      24.9                    28.5        29.4
  512K        7.7       8.5      25.0                    28.3        29.3

For guest -> host, I think the TCP_NODELAY test is important, because TCP
buffering increases the throughput a lot.

> One other comment: it makes sense to test with disabling smap
> mitigations (boot host and guest with nosmap).  No problem with also
> testing the default smap path, but I think you will discover that the
> performance impact of smap hardening being enabled is often severe for
> such benchmarks.

Thanks for this valuable suggestion, I'll redo all the tests with nosmap!

Cheers,
Stefano

Michael S. Tsirkin April 4, 2019, 6:04 p.m. UTC | #5

On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > I simply love it that you have analysed the individual impact of
> > each patch! Great job!
> 
> Thanks! I followed Stefan's suggestions!
> 
> > 
> > For comparison's sake, it could be IMHO beneficial to add a column
> > with virtio-net+vhost-net performance.
> > 
> > This will both give us an idea about whether the vsock layer introduces
> > inefficiencies, and whether the virtio-net idea has merit.
> > 
> 
> Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> this way:
>   $ qemu-system-x86_64 ... \
>       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
>       -device virtio-net-pci,netdev=net0
> 
> I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> doesn't implement something like this.

Why not?

> In both cases I set the MTU to the maximum allowed (65520).
> 
>                         VSOCK                        TCP + virtio-net + vhost
>                   host -> guest [Gbps]                 host -> guest [Gbps]
> pkt_size  before opt. patch 1 patches 2+3 patch 4     TCP_NODELAY
>   64          0.060     0.102     0.102     0.096         0.16        0.15
>   256         0.22      0.40      0.40      0.36          0.32        0.57
>   512         0.42      0.82      0.85      0.74          1.2         1.2
>   1K          0.7       1.6       1.6       1.5           2.1         2.1
>   2K          1.5       3.0       3.1       2.9           3.5         3.4
>   4K          2.5       5.2       5.3       5.3           5.5         5.3
>   8K          3.9       8.4       8.6       8.8           8.0         7.9
>   16K         6.6      11.1      11.3      12.8           9.8        10.2
>   32K         9.9      15.8      15.8      18.1          11.8        10.7
>   64K        13.5      17.4      17.7      21.4          11.4        11.3
>   128K       17.9      19.0      19.0      23.6          11.2        11.0
>   256K       18.0      19.4      19.8      24.4          11.1        11.0
>   512K       18.4      19.6      20.1      25.3          10.1        10.7
> 
> For small packet size (< 4K) I think we should implement some kind of
> batching/merging, that could be for free if we use virtio-net as a transport.
> 
> Note: Maybe I have something misconfigured because TCP on virtio-net
> for host -> guest case doesn't exceed 11 Gbps.
> 
>                         VSOCK                        TCP + virtio-net + vhost
>                   guest -> host [Gbps]                 guest -> host [Gbps]
> pkt_size  before opt. patch 1 patches 2+3             TCP_NODELAY
>   64          0.088     0.100     0.101                   0.24        0.24
>   256         0.35      0.36      0.41                    0.36        1.03
>   512         0.70      0.74      0.73                    0.69        1.6
>   1K          1.1       1.3       1.3                     1.1         3.0
>   2K          2.4       2.4       2.6                     2.1         5.5
>   4K          4.3       4.3       4.5                     3.8         8.8
>   8K          7.3       7.4       7.6                     6.6        20.0
>   16K         9.2       9.6      11.1                    12.3        29.4
>   32K         8.3       8.9      18.1                    19.3        28.2
>   64K         8.3       8.9      25.4                    20.6        28.7
>   128K        7.2       8.7      26.7                    23.1        27.9
>   256K        7.7       8.4      24.9                    28.5        29.4
>   512K        7.7       8.5      25.0                    28.3        29.3
> 
> For guest -> host I think is important the TCP_NODELAY test, because TCP
> buffering increases a lot the throughput.
> 
> > One other comment: it makes sense to test with disabling smap
> > mitigations (boot host and guest with nosmap).  No problem with also
> > testing the default smap path, but I think you will discover that the
> > performance impact of smap hardening being enabled is often severe for
> > such benchmarks.
> 
> Thanks for this valuable suggestion, I'll redo all the tests with nosmap!
> 
> Cheers,
> Stefano

Stefano Garzarella April 5, 2019, 7:49 a.m. UTC | #6

On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > I simply love it that you have analysed the individual impact of
> > > each patch! Great job!
> > 
> > Thanks! I followed Stefan's suggestions!
> > 
> > > 
> > > For comparison's sake, it could be IMHO beneficial to add a column
> > > with virtio-net+vhost-net performance.
> > > 
> > > This will both give us an idea about whether the vsock layer introduces
> > > inefficiencies, and whether the virtio-net idea has merit.
> > > 
> > 
> > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > this way:
> >   $ qemu-system-x86_64 ... \
> >       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> >       -device virtio-net-pci,netdev=net0
> > 
> > I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> > doesn't implement something like this.
> 
> Why not?
> 

I think because originally VSOCK was designed to be simple and
low-latency, but of course we can introduce something like that.

The current implementation copies the buffer directly from user space into
a virtio_vsock_pkt and enqueues it for transmission.

Maybe we can introduce a per-socket buffer in which to accumulate bytes,
and send it when it is full or when a timer fires. We can also introduce
a VSOCK_NODELAY option (maybe using the same value as TCP_NODELAY for
compatibility) to send the buffer immediately for low-latency use cases.
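
A minimal userspace sketch of this proposal, under stated assumptions:
struct tx_buf, TX_BUF_SIZE and the `nodelay` flag are hypothetical names,
the timer is represented by an explicit tx_flush() call, and "sending" is
reduced to counters. It only shows the buffering logic, not kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TX_BUF_SIZE 4096u        /* hypothetical per-socket buffer size */

struct tx_buf {
    unsigned char data[TX_BUF_SIZE];
    size_t used;                 /* bytes accumulated so far */
    bool nodelay;                /* hypothetical VSOCK_NODELAY-style flag */
    size_t pkts_sent;            /* stands in for enqueueing a packet */
    size_t bytes_sent;
};

/* Send whatever has accumulated; also what a timer callback would do. */
static void tx_flush(struct tx_buf *b)
{
    if (b->used == 0)
        return;
    b->pkts_sent++;
    b->bytes_sent += b->used;
    b->used = 0;
}

/* Accumulate bytes; flush when the buffer fills or nodelay is set. */
static void tx_write(struct tx_buf *b, const void *src, size_t len)
{
    const unsigned char *p = src;

    while (len > 0) {
        size_t room = TX_BUF_SIZE - b->used;
        size_t n = len < room ? len : room;

        memcpy(b->data + b->used, p, n);
        b->used += n;
        p += n;
        len -= n;
        if (b->used == TX_BUF_SIZE || b->nodelay)
            tx_flush(b);
    }
}
```

With `nodelay` clear, ten 64-byte writes leave one 640-byte packet pending
until the flush; with it set, every write goes out immediately.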

What do you think?

Thanks,
Stefano

Jason Wang April 8, 2019, 6:43 a.m. UTC | #7

On 2019/4/4 6:58 PM, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes:
>   - patch 1/4: reduces the number of credit update messages sent to the
>                transmitter
>   - patch 2/4: allows the host to split packets on multiple buffers,
>                in this way, we can remove the packet size limit to
>                VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
>   - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
>                allowed
>   - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
>
> RFC:
>   - maybe patch 4 can be replaced with multiple queues with different
>     buffer sizes or using EWMA to adapt the buffer size to the traffic


Or EWMA + mergeable rx buffer, but if we decide to unify the datapath
with virtio-net, we can reuse its code.


>
>   - as Jason suggested in a previous thread [1] I'll evaluate to use
>     virtio-net as transport, but I need to understand better how to
>     interface with it, maybe introducing sk_buff in virtio-vsock.
>
> Any suggestions?


My understanding is this is not a must, but if it makes things easier, 
we can do this.

Another thing that may help is to implement sendpage(), which will 
greatly improve the performance.

Thanks


>
> Here some benchmarks step by step. I used iperf3 [2] modified with VSOCK
> support:
>
>                          host -> guest [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3   patch 4
>    64            0.060       0.102       0.102       0.096
>    256           0.22        0.40        0.40        0.36
>    512           0.42        0.82        0.85        0.74
>    1K            0.7         1.6         1.6         1.5
>    2K            1.5         3.0         3.1         2.9
>    4K            2.5         5.2         5.3         5.3
>    8K            3.9         8.4         8.6         8.8
>    16K           6.6        11.1        11.3        12.8
>    32K           9.9        15.8        15.8        18.1
>    64K          13.5        17.4        17.7        21.4
>    128K         17.9        19.0        19.0        23.6
>    256K         18.0        19.4        19.8        24.4
>    512K         18.4        19.6        20.1        25.3
>
>                          guest -> host [Gbps]
> pkt_size    before opt.   patch 1   patches 2+3
>    64            0.088       0.100       0.101
>    256           0.35        0.36        0.41
>    512           0.70        0.74        0.73
>    1K            1.1         1.3         1.3
>    2K            2.4         2.4         2.6
>    4K            4.3         4.3         4.5
>    8K            7.3         7.4         7.6
>    16K           9.2         9.6        11.1
>    32K           8.3         8.9        18.1
>    64K           8.3         8.9        25.4
>    128K          7.2         8.7        26.7
>    256K          7.7         8.4        24.9
>    512K          7.7         8.5        25.0
>
> Thanks,
> Stefano
>
> [1] https://www.spinics.net/lists/netdev/msg531783.html
> [2] https://github.com/stefano-garzarella/iperf/
>
> Stefano Garzarella (4):
>    vsock/virtio: reduce credit update messages
>    vhost/vsock: split packets to send using multiple buffers
>    vsock/virtio: change the maximum packet size allowed
>    vsock/virtio: increase RX buffer size to 64 KiB
>
>   drivers/vhost/vsock.c                   | 35 ++++++++++++++++++++-----
>   include/linux/virtio_vsock.h            |  3 ++-
>   net/vmw_vsock/virtio_transport_common.c | 18 +++++++++----
>   3 files changed, 44 insertions(+), 12 deletions(-)
>

Stefan Hajnoczi April 8, 2019, 9:23 a.m. UTC | #8

On Fri, Apr 05, 2019 at 09:49:17AM +0200, Stefano Garzarella wrote:
> On Thu, Apr 04, 2019 at 02:04:10PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Apr 04, 2019 at 06:47:15PM +0200, Stefano Garzarella wrote:
> > > On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> > > > I simply love it that you have analysed the individual impact of
> > > > each patch! Great job!
> > > 
> > > Thanks! I followed Stefan's suggestions!
> > > 
> > > > 
> > > > For comparison's sake, it could be IMHO beneficial to add a column
> > > > with virtio-net+vhost-net performance.
> > > > 
> > > > This will both give us an idea about whether the vsock layer introduces
> > > > inefficiencies, and whether the virtio-net idea has merit.
> > > > 
> > > 
> > > Sure, I already did TCP tests on virtio-net + vhost, starting qemu in
> > > this way:
> > >   $ qemu-system-x86_64 ... \
> > >       -netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
> > >       -device virtio-net-pci,netdev=net0
> > > 
> > > I did also a test using TCP_NODELAY, just to be fair, because VSOCK
> > > doesn't implement something like this.
> > 
> > Why not?
> > 
> 
> I think because originally VSOCK was designed to be simple and
> low-latency, but of course we can introduce something like that.
> 
> Current implementation directly copy the buffer from the user-space in a
> virtio_vsock_pkt and enqueue it to be transmitted.
> 
> Maybe we can introduce a buffer per socket where accumulate bytes and
> send it when it is full or when a timer is fired . We can also introduce
> a VSOCK_NODELAY (maybe using the same value of TCP_NODELAY for
> compatibility) to send the buffer immediately for low-latency use cases.
> 
> What do you think?

Today virtio-vsock implements a 1:1 sendmsg():packet relationship
because it's simple.  But there's no need for the guest to enqueue
multiple VIRTIO_VSOCK_OP_RW packets when a single large packet could
combine all payloads for a connection.  This is not the same as
TCP_NODELAY but related.

I think it's worth exploring TCP_NODELAY and send_pkt_list merging.
Hopefully it won't make the code much more complicated.
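
The merging idea can be sketched as a greedy pass over the pending payload
lengths of one connection. This is only an illustration under assumptions:
merge_payloads() is a hypothetical helper, payloads are never split, and
`max_pkt` stands for whatever maximum packet size the transport allows:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Merge adjacent payloads so that each resulting packet carries at most
 * `max_pkt` bytes. `out` must have room for `n` entries.
 * Returns the number of packets after merging.
 */
static size_t merge_payloads(const size_t *lens, size_t n,
                             size_t max_pkt, size_t *out)
{
    size_t n_out = 0;

    for (size_t i = 0; i < n; i++) {
        assert(lens[i] <= max_pkt);
        if (n_out > 0 && out[n_out - 1] + lens[i] <= max_pkt)
            out[n_out - 1] += lens[i];   /* append to the last packet */
        else
            out[n_out++] = lens[i];      /* start a new packet */
    }
    return n_out;
}
```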

Stefan

Stefan Hajnoczi April 8, 2019, 9:44 a.m. UTC | #9

On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
> Another thing that may help is to implement sendpage(), which will greatly
> improve the performance.

I can't find documentation for ->sendpage().  Is the idea that you get a
struct page for the payload and can do zero-copy tx?  (And can userspace
still write to the page, invalidating checksums in the header?)

Stefan

Jason Wang April 9, 2019, 8:36 a.m. UTC | #10

On 2019/4/8 5:44 PM, Stefan Hajnoczi wrote:
> On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
>> Another thing that may help is to implement sendpage(), which will greatly
>> improve the performance.
> I can't find documentation for ->sendpage().  Is the idea that you get a
> struct page for the payload and can do zero-copy tx?

Yes.

>   (And can userspace
> still write to the page, invalidating checksums in the header?)
>
> Stefan

Userspace can still write to the page, but for correctness (e.g. in the
case of SPLICE_F_GIFT described in vmsplice(2)), it should not do this.
For vmsplice, it may be hard to detect when the page can be reused. Maybe
MSG_ZEROCOPY [1] is better.

Anyway, sendpage() could still be useful for sendfile() or splice().
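
For context, this is roughly what the userspace side of that path looks
like on Linux: sendfile(2) moves data between two descriptors without a
copy through user space. send_whole_file() is a hypothetical helper written
for this sketch, and whether a given socket family benefits depends on its
kernel-side support:

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/*
 * Send the first `count` bytes of in_fd to out_fd using sendfile(2).
 * Returns 0 on success, -1 on error or if in_fd ends early.
 */
static int send_whole_file(int out_fd, int in_fd, size_t count)
{
    off_t off = 0;   /* read position in in_fd, advanced by sendfile() */

    while (count > 0) {
        ssize_t n = sendfile(out_fd, in_fd, &off, count);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;           /* unexpected end of file */
        count -= (size_t)n;
    }
    return 0;
}
```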

Thanks

[1] https://netdevconf.org/2.1/papers/netdev.pdf

Stefano Garzarella April 9, 2019, 9:13 a.m. UTC | #11

On Mon, Apr 08, 2019 at 02:43:28PM +0800, Jason Wang wrote:
> 
> On 2019/4/4 6:58 PM, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes:
> >   - patch 1/4: reduces the number of credit update messages sent to the
> >                transmitter
> >   - patch 2/4: allows the host to split packets on multiple buffers,
> >                in this way, we can remove the packet size limit to
> >                VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE
> >   - patch 3/4: uses VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size
> >                allowed
> >   - patch 4/4: increases RX buffer size to 64 KiB (affects only host->guest)
> > 
> > RFC:
> >   - maybe patch 4 can be replaced with multiple queues with different
> >     buffer sizes or using EWMA to adapt the buffer size to the traffic
> 
> 
> Or EWMA + mergeable rx buffer, but if we decide to unify the datapath with
> virtio-net, we can reuse their codes.
> 
> 
> > 
> >   - as Jason suggested in a previous thread [1] I'll evaluate to use
> >     virtio-net as transport, but I need to understand better how to
> >     interface with it, maybe introducing sk_buff in virtio-vsock.
> > 
> > Any suggestions?
> 
> 
> My understanding is this is not a must, but if it makes things easier, we
> can do this.

Hopefully it will simplify maintenance and avoid duplicated code.

> 
> Another thing that may help is to implement sendpage(), which will greatly
> improve the performance.

Thanks for your suggestions!
I'll try to implement sendpage() in VSOCK to measure the improvement.

Cheers,
Stefano
