[RFC,net-next] tcp: add support for read with offset when using MSG_PEEK

Message ID 20240111222252.221693-1-jmaloy@redhat.com (mailing list archive)
State Superseded, archived
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1094 this patch: 1094
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 1108 this patch: 1108
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1109 this patch: 1109
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jon Maloy Jan. 11, 2024, 10:22 p.m. UTC
From: Jon Maloy <jmaloy@redhat.com>

When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes
we really want. This is clearly non-optimal.
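
To illustrate the cost described above, here is a minimal user-space sketch of what a caller has to do today to look at bytes that sit some way into the receive queue. The helper name peek_skip_copy() and its scratch-buffer strategy are assumptions made for illustration; they are not taken from passt or from this patch.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: peek up to `len` bytes that start `offset` bytes
 * into the receive queue, using only what recv() offers today.
 * Returns the number of bytes placed in `buf`, 0 if nothing is queued
 * past the offset yet, or -1 on error.
 */
static ssize_t peek_skip_copy(int fd, size_t offset, void *buf, size_t len)
{
	char *scratch;
	ssize_t n;

	if (len == 0)
		return 0;

	scratch = malloc(offset + len);
	if (!scratch)
		return -1;

	/*
	 * MSG_PEEK always starts at the head of the queue, so the leading
	 * `offset` bytes are copied out again on every call.
	 */
	n = recv(fd, scratch, offset + len, MSG_PEEK);
	if (n > (ssize_t)offset) {
		n -= (ssize_t)offset;
		memcpy(buf, scratch + offset, (size_t)n);
	} else if (n >= 0) {
		n = 0;
	}

	free(scratch);
	return n;
}
```

Every call copies offset + len bytes out of the kernel in order to use len of them, which is the repeated work the commit message refers to.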

What we would want is something similar to pread/preadv(), but working
even for tcp sockets. At the same time, we don't want to add any new
arguments to the recv/recvmsg() calls.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.
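
As a sketch of how a caller might use the convention described above: the first iovec entry carries a NULL base and the offset in iov_len, and the second entry is the real destination buffer. The helper names peek_offset_msg() and peek_at_offset() are hypothetical, and the behaviour of the call depends on a kernel carrying this RFC. On a kernel without it, the recvmsg() call is expected to fail (likely with EFAULT) rather than skip anything.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: fill in a msghdr that asks for `len` bytes
 * starting `offset` bytes into the stream, using the layout this RFC
 * proposes. `iov` must stay valid for as long as `msg` is in use.
 */
static void peek_offset_msg(struct msghdr *msg, struct iovec iov[2],
			    size_t offset, void *buf, size_t len)
{
	iov[0].iov_base = NULL;		/* marks this entry as "skip" */
	iov[0].iov_len = offset;	/* number of bytes to skip */
	iov[1].iov_base = buf;		/* where the wanted bytes land */
	iov[1].iov_len = len;

	memset(msg, 0, sizeof(*msg));
	msg->msg_iov = iov;
	msg->msg_iovlen = 2;
}

/*
 * Hypothetical wrapper: peek at `offset` without consuming data.
 * Only meaningful on a kernel that implements the proposed semantics.
 */
static ssize_t peek_at_offset(int fd, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr msg;

	peek_offset_msg(&msg, iov, offset, buf, len);
	return recvmsg(fd, &msg, MSG_PEEK);
}
```

Going by the patch below, the return value would count only the bytes copied into buf, not the skipped offset. A caller that must also run on older kernels would need to probe for support once and fall back to a plain MSG_PEEK read from the head of the queue.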

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~20% (21.0 -> 25.4 Gbits/sec averaged over the run) in
the direction host->namespace when using the protocol splicer 'passt'.
This is a consistent result.

$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset not supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 60344
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
[  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
[  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
[  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
[  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
[  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec receiver
-----------------------------------------------------------

[jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 46362
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
[  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
[  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
[  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
[  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec receiver
-----------------------------------------------------------

Passt is used to support VMs in containers, such as KubeVirt, and
is also generally supported in libvirt and QEMU since releases 9.2
and 7.2, respectively.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jon Paul Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..e9d3b5bf2f66 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,20 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			if (msg->msg_iter.nr_segs <= 1)
+				goto out;
+			msg->msg_iter.nr_segs -= 1;
+			if (msg->msg_iter.count <= peek_offset)
+				goto out;
+			msg->msg_iter.count -= peek_offset;
+			if (len <= peek_offset)
+				goto out;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);