diff mbox series

[net-next,v3,2/2] vsock/virtio: avoid queuing packets when work queue is empty

Message ID 20240711-pinna-v3-2-697d4164fe80@outlook.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series vsock: avoid queuing on workqueue if possible | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 816 this patch: 816
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 821 this patch: 821
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 821 this patch: 821
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-07-11--18-00 (tests: 696)

Commit Message

Luigi Leonardi via B4 Relay July 11, 2024, 2:58 p.m. UTC
From: Luigi Leonardi <luigi.leonardi@outlook.com>

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is
put directly in the virtqueue increasing the throughput.

In the following benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before: Linux 6.9.7

Payload 512B:

	1st perc.	overall		99th perc.
Before	370		810.15		8656		ns
After	374		780.29		8741		ns

Payload 4K:

	1st perc.	overall		99th perc.
Before	460		1720.23		42752		ns
After	460		1520.84		36096		ns

The performance improvement is related to this optimization,
I used ebpf to check that each packet was sent directly to the
virtqueue.

Throughput: iperf-vsock
The size represents the buffer length (-l) to read/write
P represents the number parallel streams

P=1
	4K	64K	128K
Before	6.87	29.3	29.5 Gb/s
After	10.5	39.4	39.9 Gb/s

P=2
	4K	64K	128K
Before	10.5	32.8	33.2 Gb/s
After	17.8	47.7	48.5 Gb/s

P=4
	4K	64K	128K
Before	12.7	33.6	34.2 Gb/s
After	16.9	48.1	50.5 Gb/s

Co-developed-by: Marco Pinna <marco.pinn95@gmail.com>
Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
---
 net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

Comments

Stefano Garzarella July 12, 2024, 2:58 p.m. UTC | #1
On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote:
>From: Luigi Leonardi <luigi.leonardi@outlook.com>
>
>Introduce an optimization in virtio_transport_send_pkt:
>when the work queue (send_pkt_queue) is empty the packet is

Note: send_pkt_queue is just a queue of sk_buff, is not really a work 
queue.

>put directly in the virtqueue increasing the throughput.

Why?

I'd write something like this, but feel free to change it:

When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.

This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).


>
>In the following benchmark (pingpong mode) the host sends

"fio benchmark"

>a payload to the guest and waits for the same payload back.
>
>All vCPUs pinned individually to pCPUs.
>vhost process pinned to a pCPU
>fio process pinned both inside the host and the guest system.
>
>Host CPU: Intel i7-10700KF CPU @ 3.80GHz
>Tool: Fio version 3.37-56
>Env: Phys host + L1 Guest
>Runtime-per-test: 50s
>Mode: pingpong (h-g-h)
>Test runs: 50
>Type: SOCK_STREAM
>
>Before: Linux 6.9.7
>
>Payload 512B:
>
>	1st perc.	overall		99th perc.
>Before	370		810.15		8656		ns
>After	374		780.29		8741		ns
>
>Payload 4K:
>
>	1st perc.	overall		99th perc.
>Before	460		1720.23		42752		ns
>After	460		1520.84		36096		ns
>
>The performance improvement is related to this optimization,
>I used ebpf to check that each packet was sent directly to the
>virtqueue.
>
>Throughput: iperf-vsock

I would reorganize the description for a moment because it's a little 
confusing. For example like this:

The following benchmarks were run to check improvements in latency and 
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz 
and L1 guest running on QEMU/KVM.

- Latency
   Tool: ...

- Throughput
   Tool: ...

>The size represents the buffer length (-l) to read/write
>P represents the number parallel streams
>
>P=1
>	4K	64K	128K
>Before	6.87	29.3	29.5 Gb/s
>After	10.5	39.4	39.9 Gb/s
>
>P=2
>	4K	64K	128K
>Before	10.5	32.8	33.2 Gb/s
>After	17.8	47.7	48.5 Gb/s
>
>P=4
>	4K	64K	128K
>Before	12.7	33.6	34.2 Gb/s
>After	16.9	48.1	50.5 Gb/s

Wow, great! I'm a little surprised that the latency is not much 
affected, but the throughput benefits so much with that kind of 
optimization.

Maybe we can check the latency with smaller payloads like 64 bytes or 
even smaller.

>
>Co-developed-by: Marco Pinna <marco.pinn95@gmail.com>
>Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
>Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
>---
> net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++----
> 1 file changed, 34 insertions(+), 4 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index c4205c22f40b..d75727fdc35f 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> 		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
>+/* Caller need to hold RCU for vsock.
>+ * Returns 0 if the packet is successfully put on the vq.
>+ */
>+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
>+{
>+	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
>+	int ret;
>+
>+	/* Inside RCU, can't sleep! */
>+	ret = mutex_trylock(&vsock->tx_lock);
>+	if (unlikely(ret == 0))
>+		return -EBUSY;
>+
>+	ret = virtio_transport_send_skb(skb, vq, vsock);
>+
>+	mutex_unlock(&vsock->tx_lock);
>+
>+	/* Kick if virtio_transport_send_skb succeeded */

Superfluous comment, we can remove it.

>+	if (ret == 0)
>+		virtqueue_kick(vq);

nit: I'd add a blank line here after the if block to highlight that the 
return is out.

>+	return ret;
>+}
>+
> static int
> virtio_transport_send_pkt(struct sk_buff *skb)
> {
>@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> 		goto out_rcu;
> 	}
>
>-	if (virtio_vsock_skb_reply(skb))
>-		atomic_inc(&vsock->queued_replies);
>+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.

Again, send_pkt_queue is not a workqueue.

Here I would explain more why there is no need, the fact that we are not 
doing this is clear.

>+	 * Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
>+	 */
>

nit: here I would instead remove the blank line to make it clear that 
the comment is about the code below.

>-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
>+	    virtio_transport_send_skb_fast_path(vsock, skb)) {
>+		/* Packet must be queued */

Please, include it in the comment before the if where you can explain 
the whole logic of the optimization.

>+		if (virtio_vsock_skb_reply(skb))
>+			atomic_inc(&vsock->queued_replies);

nit: blank line, how it was before this patch:

	if (virtio_vsock_skb_reply(skb))
		atomic_inc(&vsock->queued_replies);

	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);


>+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	}
>
> out_rcu:
> 	rcu_read_unlock();
>
>-- 
>2.45.2
>
>

I tested the patch and everything seems to be fine, all my comments are 
minor and style, the code should be fine!

Thanks,
Stefano
Luigi Leonardi July 15, 2024, 10:44 a.m. UTC | #2
Hi Stefano,

Thanks for your review!

> On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote:
> >From: Luigi Leonardi <luigi.leonardi@outlook.com>
> >
> >Introduce an optimization in virtio_transport_send_pkt:
> >when the work queue (send_pkt_queue) is empty the packet is
>
> Note: send_pkt_queue is just a queue of sk_buff, is not really a work
> queue.
>
> >put directly in the virtqueue increasing the throughput.
>
> Why?
My guess is that is due to the hotpath being faster, there is (potentially) one less
step!
>
> I tested the patch and everything seems to be fine, all my comments are
> minor and style, the code should be fine!
Great, I'll send a v4 addressing all your comments :)

Thanks,
Luigi
diff mbox series

Patch

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c4205c22f40b..d75727fdc35f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -208,6 +208,29 @@  virtio_transport_send_pkt_work(struct work_struct *work)
 		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+/* Caller need to hold RCU for vsock.
+ * Returns 0 if the packet is successfully put on the vq.
+ */
+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
+{
+	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+	int ret;
+
+	/* Inside RCU, can't sleep! */
+	ret = mutex_trylock(&vsock->tx_lock);
+	if (unlikely(ret == 0))
+		return -EBUSY;
+
+	ret = virtio_transport_send_skb(skb, vq, vsock);
+
+	mutex_unlock(&vsock->tx_lock);
+
+	/* Kick if virtio_transport_send_skb succeeded */
+	if (ret == 0)
+		virtqueue_kick(vq);
+	return ret;
+}
+
 static int
 virtio_transport_send_pkt(struct sk_buff *skb)
 {
@@ -231,11 +254,18 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 		goto out_rcu;
 	}
 
-	if (virtio_vsock_skb_reply(skb))
-		atomic_inc(&vsock->queued_replies);
+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
+	 * Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
+	 */
 
-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
+	    virtio_transport_send_skb_fast_path(vsock, skb)) {
+		/* Packet must be queued */
+		if (virtio_vsock_skb_reply(skb))
+			atomic_inc(&vsock->queued_replies);
+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}
 
 out_rcu:
 	rcu_read_unlock();