Message ID | 20240711-pinna-v3-2-697d4164fe80@outlook.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | vsock: avoid queuing on workqueue if possible | expand |
On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote: >From: Luigi Leonardi <luigi.leonardi@outlook.com> > >Introduce an optimization in virtio_transport_send_pkt: >when the work queue (send_pkt_queue) is empty the packet is Note: send_pkt_queue is just a queue of sk_buff, is not really a work queue. >put directly in the virtqueue increasing the throughput. Why? I'd write something like this, but feel free to change it: When the driver needs to send new packets to the device, it always queues the new sk_buffs into an intermediate queue (send_pkt_queue) and schedules a worker (send_pkt_work) to then queue them into the virtqueue exposed to the device. This increases the chance of batching, but also introduces a lot of latency into the communication. So we can optimize this path by adding a fast path to be taken when there is no element in the intermediate queue, there is space available in the virtqueue, and no other process that is sending packets (tx_lock held). > >In the following benchmark (pingpong mode) the host sends "fio benchmark" >a payload to the guest and waits for the same payload back. > >All vCPUs pinned individually to pCPUs. >vhost process pinned to a pCPU >fio process pinned both inside the host and the guest system. > >Host CPU: Intel i7-10700KF CPU @ 3.80GHz >Tool: Fio version 3.37-56 >Env: Phys host + L1 Guest >Runtime-per-test: 50s >Mode: pingpong (h-g-h) >Test runs: 50 >Type: SOCK_STREAM > >Before: Linux 6.9.7 > >Payload 512B: > > 1st perc. overall 99th perc. >Before 370 810.15 8656 ns >After 374 780.29 8741 ns > >Payload 4K: > > 1st perc. overall 99th perc. >Before 460 1720.23 42752 ns >After 460 1520.84 36096 ns > >The performance improvement is related to this optimization, >I used ebpf to check that each packet was sent directly to the >virtqueue. > >Throughput: iperf-vsock I would reorganize the description for a moment because it's a little confusing. For example like this: The following benchmarks were run to check improvements in latency and throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz and L1 guest running on QEMU/KVM. - Latency Tool: ... - Throughput Tool: ... >The size represents the buffer length (-l) to read/write >P represents the number parallel streams > >P=1 > 4K 64K 128K >Before 6.87 29.3 29.5 Gb/s >After 10.5 39.4 39.9 Gb/s > >P=2 > 4K 64K 128K >Before 10.5 32.8 33.2 Gb/s >After 17.8 47.7 48.5 Gb/s > >P=4 > 4K 64K 128K >Before 12.7 33.6 34.2 Gb/s >After 16.9 48.1 50.5 Gb/s Wow, great! I'm a little surprised that the latency is not much affected, but the throughput benefits so much with that kind of optimization. Maybe we can check the latency with smaller payloads like 64 bytes or even smaller. > >Co-developed-by: Marco Pinna <marco.pinn95@gmail.com> >Signed-off-by: Marco Pinna <marco.pinn95@gmail.com> >Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com> >--- > net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++---- > 1 file changed, 34 insertions(+), 4 deletions(-) > >diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c >index c4205c22f40b..d75727fdc35f 100644 >--- a/net/vmw_vsock/virtio_transport.c >+++ b/net/vmw_vsock/virtio_transport.c >@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work) > queue_work(virtio_vsock_workqueue, &vsock->rx_work); > } > >+/* Caller need to hold RCU for vsock. >+ * Returns 0 if the packet is successfully put on the vq. >+ */ >+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb) >+{ >+ struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX]; >+ int ret; >+ >+ /* Inside RCU, can't sleep! */ >+ ret = mutex_trylock(&vsock->tx_lock); >+ if (unlikely(ret == 0)) >+ return -EBUSY; >+ >+ ret = virtio_transport_send_skb(skb, vq, vsock); >+ >+ mutex_unlock(&vsock->tx_lock); >+ >+ /* Kick if virtio_transport_send_skb succeeded */ Superfluous comment, we can remove it. >+ if (ret == 0) >+ virtqueue_kick(vq); nit: I'd add a blank line here after the if block to highlight that the return is out. >+ return ret; >+} >+ > static int > virtio_transport_send_pkt(struct sk_buff *skb) > { >@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb) > goto out_rcu; > } > >- if (virtio_vsock_skb_reply(skb)) >- atomic_inc(&vsock->queued_replies); >+ /* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet. Again, send_pkt_queue is not a workqueue. Here I would explain more why there is no need, the fact that we are not doing this is clear. >+ * Just put it on the virtqueue using virtio_transport_send_skb_fast_path. >+ */ > nit: here I would instead remove the blank line to make it clear that the comment is about the code below. >- virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); >- queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work); >+ if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) || >+ virtio_transport_send_skb_fast_path(vsock, skb)) { >+ /* Packet must be queued */ Please, include it in the comment before the if where you can explain the whole logic of the optimization. >+ if (virtio_vsock_skb_reply(skb)) >+ atomic_inc(&vsock->queued_replies); nit: blank line, how it was before this patch: if (virtio_vsock_skb_reply(skb)) atomic_inc(&vsock->queued_replies); virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work); >+ virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); >+ queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work); >+ } > > out_rcu: > rcu_read_unlock(); > >-- >2.45.2 > > I tested the patch and everything seems to be fine, all my comments are minor and style, the code should be fine! Thanks, Stefano
Hi Stefano, Thanks for your review! > On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote: > >From: Luigi Leonardi <luigi.leonardi@outlook.com> > > > >Introduce an optimization in virtio_transport_send_pkt: > >when the work queue (send_pkt_queue) is empty the packet is > > Note: send_pkt_queue is just a queue of sk_buff, is not really a work > queue. > > >put directly in the virtqueue increasing the throughput. > > Why? My guess is that is due to the hotpath being faster, there is (potentially) one less step! > > I tested the patch and everything seems to be fine, all my comments are > minor and style, the code should be fine! Great, I'll send a v4 addressing all your comments :) Thanks, Luigi
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c index c4205c22f40b..d75727fdc35f 100644 --- a/net/vmw_vsock/virtio_transport.c +++ b/net/vmw_vsock/virtio_transport.c @@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work) queue_work(virtio_vsock_workqueue, &vsock->rx_work); } +/* Caller need to hold RCU for vsock. + * Returns 0 if the packet is successfully put on the vq. + */ +static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb) +{ + struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX]; + int ret; + + /* Inside RCU, can't sleep! */ + ret = mutex_trylock(&vsock->tx_lock); + if (unlikely(ret == 0)) + return -EBUSY; + + ret = virtio_transport_send_skb(skb, vq, vsock); + + mutex_unlock(&vsock->tx_lock); + + /* Kick if virtio_transport_send_skb succeeded */ + if (ret == 0) + virtqueue_kick(vq); + return ret; +} + static int virtio_transport_send_pkt(struct sk_buff *skb) { @@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb) goto out_rcu; } - if (virtio_vsock_skb_reply(skb)) - atomic_inc(&vsock->queued_replies); + /* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet. + * Just put it on the virtqueue using virtio_transport_send_skb_fast_path. + */ - virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); - queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work); + if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) || + virtio_transport_send_skb_fast_path(vsock, skb)) { + /* Packet must be queued */ + if (virtio_vsock_skb_reply(skb)) + atomic_inc(&vsock->queued_replies); + virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); + queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work); + } out_rcu: rcu_read_unlock();