diff mbox series

[net-next,v2,2/2] vsock/virtio: avoid enqueue packets when work queue is empty

Message ID 20240701-pinna-v2-2-ac396d181f59@outlook.com (mailing list archive)
State New
Series vsock: avoid queuing on workqueue if possible

Commit Message

Luigi Leonardi via B4 Relay July 1, 2024, 2:28 p.m. UTC
From: Marco Pinna <marco.pinn95@gmail.com>

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is
put directly in the virtqueue reducing latency.

In the following benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Payload: 512
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before (Linux 6.8.11)
------
mean(1st percentile):    380.56 ns
mean(overall):           780.83 ns
mean(99th percentile):  8300.24 ns

After
------
mean(1st percentile):   370.59 ns
mean(overall):          720.66 ns
mean(99th percentile): 7600.27 ns

Same setup, using 4K payload:

Before (Linux 6.8.11)
------
mean(1st percentile):    458.84 ns
mean(overall):          1650.17 ns
mean(99th percentile): 42240.68 ns

After
------
mean(1st percentile):    450.12 ns
mean(overall):          1460.84 ns
mean(99th percentile): 37632.45 ns

virtqueue.

Throughput: iperf-vsock

Before (Linux 6.8.11)
G2H 28.7 Gb/s

After
G2H 40.8 Gb/s

The performance improvement is related to this optimization,
I checked that each packet was put directly on the vq
avoiding the work queue.

Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
---
 net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)
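
For readers who want the relative change rather than the raw figures, the numbers in the commit message can be turned into percentages with a few lines of C. This is only a sketch: the helper names (`pct_reduction`, `pct_increase`, `print_summary`) are made up for illustration and are not part of the patch; the inputs are the means quoted above. With these inputs the overall mean latency drops by about 7.7% (512 B) and 11.5% (4K), and G2H throughput rises by about 42%.

```c
#include <assert.h>
#include <stdio.h>

/* Relative reduction in percent; positive means "after" is lower. */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Relative increase in percent; positive means "after" is higher. */
static double pct_increase(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print the relative changes for the figures in the commit message. */
static void print_summary(void)
{
	printf("512B mean(overall) latency: -%.1f%%\n",
	       pct_reduction(780.83, 720.66));
	printf("512B mean(p99) latency:     -%.1f%%\n",
	       pct_reduction(8300.24, 7600.27));
	printf("4K   mean(overall) latency: -%.1f%%\n",
	       pct_reduction(1650.17, 1460.84));
	printf("4K   mean(p99) latency:     -%.1f%%\n",
	       pct_reduction(42240.68, 37632.45));
	printf("G2H throughput:             +%.1f%%\n",
	       pct_increase(28.7, 40.8));
}
```

Calling `print_summary()` from a `main()` prints one line per metric. The 1st-percentile means move by only a few percent, so most of the gain is in the average and tail latency.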

Comments

Luigi Leonardi July 1, 2024, 2:49 p.m. UTC | #1
Hi all,

> +		/* Inside RCU, can't sleep! */
> +		ret = mutex_trylock(&vsock->tx_lock);
> +		if (unlikely(ret == 0))
> +			goto out_worker;

I just realized that here I don't release the tx_lock and 
that the email subject is "PATCH PATCH".
I will fix this in the next version.
Any feedback is welcome!

Thanks,
Luigi
Stefano Garzarella July 2, 2024, 9:53 a.m. UTC | #2
On Mon, Jul 01, 2024 at 04:49:41PM GMT, Luigi Leonardi wrote:
>Hi all,
>
>> +		/* Inside RCU, can't sleep! */
>> +		ret = mutex_trylock(&vsock->tx_lock);
>> +		if (unlikely(ret == 0))
>> +			goto out_worker;
>
>I just realized that here I don't release the tx_lock and
>that the email subject is "PATCH PATCH".
>I will fix this in the next version.

What about adding a function to handle all these steps?
That way we can better handle the error path in this block of code.

IMHO to simplify the code, you can just return true or false if you 
queued it. Then if the driver is disappearing and we are still queuing 
it, it will be the release that will clean up all the queues, so we 
might not worry about this edge case.

Thanks,
Stefano

>Any feedback is welcome!
>
>Thanks,
>Luigi
>
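
The helper suggested in comment #2 could be shaped roughly like the sketch below. It is a userspace model, not kernel code and not the authors' next revision: `model_vsock`, `model_trylock`, `model_unlock`, `model_send_pkt_no_work`, `model_send_pkt` and `VQ_CAPACITY` are hypothetical names, a C11 `atomic_flag` stands in for `tx_lock`, and plain counters stand in for the virtqueue and `send_pkt_queue`. The point it illustrates is the single exit path: once the trylock succeeds, every return goes through the unlock, and the helper only reports whether the packet was queued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define VQ_CAPACITY 4 /* hypothetical TX virtqueue size */

struct model_vsock {
	atomic_flag tx_lock; /* stands in for vsock->tx_lock */
	bool tx_run;         /* stands in for vsock->tx_run */
	size_t vq_used;      /* buffers currently on the TX virtqueue */
	size_t backlog;      /* packets waiting on send_pkt_queue */
};

/* Non-blocking lock attempt: true if the lock was taken. */
static bool model_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set(lock);
}

static void model_unlock(atomic_flag *lock)
{
	atomic_flag_clear(lock);
}

/* Try to put one packet directly on the virtqueue.
 * Returns true if it was queued, false if the caller must fall back
 * to the worker. The lock is released on every path that took it.
 */
static bool model_send_pkt_no_work(struct model_vsock *vsock)
{
	bool queued = false;

	/* The caller cannot sleep, so only try the lock. */
	if (!model_trylock(&vsock->tx_lock))
		return false;

	/* Device going away or virtqueue full: report "not queued". */
	if (vsock->tx_run && vsock->vq_used < VQ_CAPACITY) {
		vsock->vq_used++;
		queued = true;
	}

	model_unlock(&vsock->tx_lock);
	return queued;
}

/* Send path: use the fast path only when nothing is already waiting,
 * so packets cannot overtake the backlog.
 */
static void model_send_pkt(struct model_vsock *vsock)
{
	assert(vsock != NULL);

	if (vsock->backlog == 0 && model_send_pkt_no_work(vsock))
		return;

	vsock->backlog++; /* worker path */
}
```

In this model a packet sent while the device is being removed simply lands on the backlog, which matches the suggestion that the release path cleans up all the queues.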
Stefano Garzarella July 2, 2024, 10 a.m. UTC | #3
On Mon, Jul 01, 2024 at 04:28:03PM GMT, Luigi Leonardi via B4 Relay wrote:
>From: Marco Pinna <marco.pinn95@gmail.com>
>
>Introduce an optimization in virtio_transport_send_pkt:
>when the work queue (send_pkt_queue) is empty the packet is
>put directly in the virtqueue reducing latency.
>
>In the following benchmark (pingpong mode) the host sends
>a payload to the guest and waits for the same payload back.
>
>All vCPUs pinned individually to pCPUs.
>vhost process pinned to a pCPU
>fio process pinned both inside the host and the guest system.
>
>Host CPU: Intel i7-10700KF CPU @ 3.80GHz
>Tool: Fio version 3.37-56
>Env: Phys host + L1 Guest
>Payload: 512
>Runtime-per-test: 50s
>Mode: pingpong (h-g-h)
>Test runs: 50
>Type: SOCK_STREAM
>
>Before (Linux 6.8.11)
>------
>mean(1st percentile):    380.56 ns
>mean(overall):           780.83 ns
>mean(99th percentile):  8300.24 ns
>
>After
>------
>mean(1st percentile):   370.59 ns
>mean(overall):          720.66 ns
>mean(99th percentile): 7600.27 ns
>
>Same setup, using 4K payload:
>
>Before (Linux 6.8.11)
>------
>mean(1st percentile):    458.84 ns
>mean(overall):          1650.17 ns
>mean(99th percentile): 42240.68 ns
>
>After
>------
>mean(1st percentile):    450.12 ns
>mean(overall):          1460.84 ns
>mean(99th percentile): 37632.45 ns
>
>virtqueue.
>
>Throughput: iperf-vsock
>
>Before (Linux 6.8.11)
>G2H 28.7 Gb/s
>
>After
>G2H 40.8 Gb/s

Cool!

I'd suggest adding the buffer length (-l param) used, and also
checking more lengths, e.g. at least 4k, 64k, 128k.

>
>The performance improvement is related to this optimization,
>I checked that each packet was put directly on the vq
>avoiding the work queue.

How?

>
>Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
>Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
>Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>

I think you might want to change the author of this patch, since it's 
changed a lot from Marco's original one. Obviously if you both agree on 
this.

Thanks,
Stefano

>---
> net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index a74083d28120..3815aa8d956b 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -213,6 +213,7 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> {
> 	struct virtio_vsock_hdr *hdr;
> 	struct virtio_vsock *vsock;
>+	bool use_worker = true;
> 	int len = skb->len;
>
> 	hdr = virtio_vsock_hdr(skb);
>@@ -234,8 +235,41 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> 	if (virtio_vsock_skb_reply(skb))
> 		atomic_inc(&vsock->queued_replies);
>
>-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
>+	 * Just put it on the virtqueue using virtio_transport_send_skb.
>+	 */
>+	if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
>+		bool restart_rx = false;
>+		struct virtqueue *vq;
>+		int ret;
>+
>+		/* Inside RCU, can't sleep! */
>+		ret = mutex_trylock(&vsock->tx_lock);
>+		if (unlikely(ret == 0))
>+			goto out_worker;
>+
>+		/* Driver is being removed, no need to enqueue the packet */
>+		if (!vsock->tx_run)
>+			goto out_rcu;
>+
>+		vq = vsock->vqs[VSOCK_VQ_TX];
>+
>+		if (!virtio_transport_send_skb(skb, vq, vsock, &restart_rx)) {
>+			use_worker = false;
>+			virtqueue_kick(vq);
>+		}
>+
>+		mutex_unlock(&vsock->tx_lock);
>+
>+		if (restart_rx)
>+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>+	}
>+
>+out_worker:
>+	if (use_worker) {
>+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	}
>
> out_rcu:
> 	rcu_read_unlock();
>
>-- 2.45.2
>
>

Patch

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index a74083d28120..3815aa8d956b 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -213,6 +213,7 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 {
 	struct virtio_vsock_hdr *hdr;
 	struct virtio_vsock *vsock;
+	bool use_worker = true;
 	int len = skb->len;
 
 	hdr = virtio_vsock_hdr(skb);
@@ -234,8 +235,41 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 	if (virtio_vsock_skb_reply(skb))
 		atomic_inc(&vsock->queued_replies);
 
-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
+	 * Just put it on the virtqueue using virtio_transport_send_skb.
+	 */
+	if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
+		bool restart_rx = false;
+		struct virtqueue *vq;
+		int ret;
+
+		/* Inside RCU, can't sleep! */
+		ret = mutex_trylock(&vsock->tx_lock);
+		if (unlikely(ret == 0))
+			goto out_worker;
+
+		/* Driver is being removed, no need to enqueue the packet */
+		if (!vsock->tx_run)
+			goto out_rcu;
+
+		vq = vsock->vqs[VSOCK_VQ_TX];
+
+		if (!virtio_transport_send_skb(skb, vq, vsock, &restart_rx)) {
+			use_worker = false;
+			virtqueue_kick(vq);
+		}
+
+		mutex_unlock(&vsock->tx_lock);
+
+		if (restart_rx)
+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+	}
+
+out_worker:
+	if (use_worker) {
+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}
 
 out_rcu:
 	rcu_read_unlock();