diff mbox series

[v3] tcp: add support for SO_PEEK_OFF

Message ID 20240209221233.3150253-1-jmaloy@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v3] tcp: add support for SO_PEEK_OFF | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 992 this patch: 992
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: edumazet@google.com dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 1007 this patch: 1007
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1009 this patch: 1009
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-13--00-00 (tests: 1436)

Commit Message

Jon Maloy Feb. 9, 2024, 10:12 p.m. UTC
From: Jon Maloy <jmaloy@redhat.com>

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
similar to the way it is done for Unix domain sockets.
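
For reference, user-space usage looks about like this (a minimal sketch,
not taken from the patch; error handling trimmed, and each recv() is
assumed to return the full 100 bytes):

#include <sys/socket.h>

static int peek_demo(int fd)
{
	char buf[256];
	int off = 0;

	/* fails (EOPNOTSUPP) on kernels without .set_peek_off */
	if (setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)))
		return -1;

	recv(fd, buf, 100, MSG_PEEK);	/* peeks bytes 0..99, offset -> 100 */
	recv(fd, buf, 100, MSG_PEEK);	/* peeks bytes 100..199, offset -> 200 */
	recv(fd, buf, 100, 0);		/* consumes bytes 0..99, offset -> 100 */
	return 0;
}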

In the iperf3 log examples shown below, we observe a consistent 15-20%
throughput improvement in the host->namespace direction when using the
protocol splicer 'pasta' (https://passt.top).

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between a Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data for the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from the socket, skipping data that was already sent.

At the moment this is implemented by passing a dummy buffer to
recvmsg() to absorb the skipped data. With this change, the dummy
buffer and the related buffer copy (copy_to_user()) are no longer needed.
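
Schematically, the splicer's receive path changes as follows (a sketch
with illustrative names, not passt's actual code; 'sent' is assumed to
fit in the discard buffer):

#include <sys/socket.h>
#include <sys/uio.h>

static char discard[128 * 1024];	/* only absorbs the skipped bytes */

/* old: peek 'sent' already-forwarded bytes into a discard buffer,
 * paying a copy_to_user() for data we throw away
 */
static ssize_t peek_old(int s, size_t sent, void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = discard, .iov_len = sent },
		{ .iov_base = buf,     .iov_len = len  },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };

	return recvmsg(s, &mh, MSG_PEEK);
}

/* new: tell the kernel where to start peeking; no wasted copy */
static ssize_t peek_new(int s, size_t sent, void *buf, size_t len)
{
	int off = sent;

	setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	return recv(s, buf, len, MSG_PEEK);
}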

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf recordings confirm this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when offset support is available.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
---
 net/ipv4/af_inet.c |  1 +
 net/ipv4/tcp.c     | 16 ++++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

Comments

Stefano Brivio Feb. 11, 2024, 11:17 p.m. UTC | #1
On Fri,  9 Feb 2024 17:12:33 -0500
jmaloy@redhat.com wrote:

> [...]

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Paolo Abeni Feb. 13, 2024, 10:49 a.m. UTC | #2
Oops, 

I just noticed Eric is missing from the recipients list, adding him
now.

On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
> [...]
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 4e635dd3d3c8..5f0e5d10c416 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
>  #endif
>  	.splice_eof	   = inet_splice_eof,
>  	.splice_read	   = tcp_splice_read,
> +	.set_peek_off      = sk_set_peek_off,
>  	.read_sock	   = tcp_read_sock,
>  	.read_skb	   = tcp_read_skb,
>  	.sendmsg_locked    = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 7e2481b9eae1..1c8cab14a32c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
>  	struct sk_buff *skb;
>  	int copied = 0, err = 0;
>  
> -	/* XXX -- need to support SO_PEEK_OFF */
> -
>  	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
>  		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
>  		if (err)
> @@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	int target;		/* Read at least this many bytes */
>  	long timeo;
>  	struct sk_buff *skb, *last;
> +	u32 peek_offset = 0;
>  	u32 urg_hole = 0;
>  
>  	err = -ENOTCONN;
> @@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  
>  	seq = &tp->copied_seq;
>  	if (flags & MSG_PEEK) {
> -		peek_seq = tp->copied_seq;
> +		peek_offset = max(sk_peek_offset(sk, flags), 0);
> +		peek_seq = tp->copied_seq + peek_offset;
>  		seq = &peek_seq;
>  	}
>  
> @@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  		}
>  
>  		if ((flags & MSG_PEEK) &&
> -		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
> +		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
>  			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
>  					    current->comm,
>  					    task_pid_nr(current));
> -			peek_seq = tp->copied_seq;
> +			peek_seq = tp->copied_seq + peek_offset;
>  		}
>  		continue;
>  
> @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  		WRITE_ONCE(*seq, *seq + used);
>  		copied += used;
>  		len -= used;
> -
> +		if (flags & MSG_PEEK)
> +			sk_peek_offset_fwd(sk, used);
> +		else
> +			sk_peek_offset_bwd(sk, used);
>  		tcp_rcv_space_adjust(sk);
>  
>  skip_copy:
> @@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
>  	__skb_queue_purge(&sk->sk_receive_queue);
>  	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
>  	WRITE_ONCE(tp->urg_data, 0);
> +	sk_set_peek_off(sk, -1);
>  	tcp_write_queue_purge(sk);
>  	tcp_fastopen_active_disable_ofo_check(sk);
>  	skb_rbtree_purge(&tp->out_of_order_queue);
Eric Dumazet Feb. 13, 2024, 12:24 p.m. UTC | #3
On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Oops,
>
> I just noticed Eric is missing from the recipients list, adding him
> now.
>

Hmmm thanks.

> On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> > [...]
> > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >               WRITE_ONCE(*seq, *seq + used);
> >               copied += used;
> >               len -= used;
> > -
> > +             if (flags & MSG_PEEK)
> > +                     sk_peek_offset_fwd(sk, used);
> > +             else
> > +                     sk_peek_offset_bwd(sk, used);

Yet another cache miss in TCP fast path...

We need to move sk_peek_off in a better location before we accept this patch.

I always thought MSG_PEEK was very inefficient; I am surprised we
allow arbitrary loops in recvmsg().
Paolo Abeni Feb. 13, 2024, 1:02 p.m. UTC | #4
On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> > >               WRITE_ONCE(*seq, *seq + used);
> > >               copied += used;
> > >               len -= used;
> > > -
> > > +             if (flags & MSG_PEEK)
> > > +                     sk_peek_offset_fwd(sk, used);
> > > +             else
> > > +                     sk_peek_offset_bwd(sk, used);
> 
> Yet another cache miss in TCP fast path...
> 
> We need to move sk_peek_off in a better location before we accept this patch.
>
> I always thought MSG_PEEK was very inefficient; I am surprised we
> allow arbitrary loops in recvmsg().

Let me double check I read the above correctly: are you concerned by
the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
touch a lot of skbs/cachelines before reaching the relevant skb?

The end goal here is allowing a user-space application to read the
received data incrementally/sequentially while leaving it in the
receive buffer.

I don't see a better option than MSG_PEEK, am I missing something?

Thanks,

Paolo
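
For reference, the walk in question, abridged from tcp_recvmsg_locked()
(the full loop also handles urgent data and the FIN check):

	/* find the first skb containing sequence *seq; with a large peek
	 * offset, *seq is far past the queue head, so every MSG_PEEK call
	 * re-walks (and pulls into cache) all the already-peeked skbs
	 */
	skb_queue_walk(&sk->sk_receive_queue, skb) {
		last = skb;
		offset = *seq - TCP_SKB_CB(skb)->seq;
		if (offset < skb->len)
			goto found_ok_skb;
	}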
Eric Dumazet Feb. 13, 2024, 1:34 p.m. UTC | #5
On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> [...]
>
> Let me double check I read the above correctly: are you concerned by
> the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
> touch a lot of skbs/cachelines before reaching the relevant skb?
>
> The end goal here is allowing a user-space application to read the
> received data incrementally/sequentially while leaving it in the
> receive buffer.
>
> I don't see a better option than MSG_PEEK, am I missing something?


This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the
non-MSG_PEEK case, is very strange IMO.

Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
is used by the caller.

That would only touch non fast paths.

Since the API is mono-threaded anyway, the caller should not rely on
the fact that a normal recvmsg() call would 'consume' sk_peek_offset.
Paolo Abeni Feb. 13, 2024, 3:28 p.m. UTC | #6
On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > [...]
> 
> This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the
> non-MSG_PEEK case, is very strange IMO.
> 
> Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
> is used by the caller.
> 
> That would only touch non fast paths.
> 
> Since the API is mono-threaded anyway, the caller should not rely on
> the fact that a normal recvmsg() call would 'consume' sk_peek_offset.

Storing in sk_peek_seq the tcp next sequence number to be peeked should
avoid changes in the non MSG_PEEK cases. 

AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
(in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
that be acceptable?

Thanks!

Paolo
Eric Dumazet Feb. 13, 2024, 3:49 p.m. UTC | #7
On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> [...]
>
> Storing in sk_peek_seq the tcp next sequence number to be peeked should
> avoid changes in the non MSG_PEEK cases.
>
> AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
> (in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
> that be acceptable?

We could have a parallel SO_PEEK_OFFSET option, reusing the same socket field.

The new semantic would be: supported by TCP (so far), and tcp
recvmsg() only reads/writes this field when MSG_PEEK is used.
Applications would have to clear the values themselves.
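
For a caller mixing peeks and plain reads, that would look roughly like
this (a sketch; SO_PEEK_OFFSET is only the name proposed above, not an
existing constant):

/* the kernel would no longer rewind the offset on plain reads, so the
 * application has to re-sync it after every consuming recv()
 */
static ssize_t consume(int s, char *buf, size_t len, int *off)
{
	ssize_t n = recv(s, buf, len, MSG_TRUNC);

	if (n > 0) {
		*off -= n;	/* manual rewind */
		setsockopt(s, SOL_SOCKET, SO_PEEK_OFFSET, off, sizeof(*off));
	}
	return n;
}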

BTW I see the man pages say SO_PEEK_OFF "is currently supported
only for unix(7) sockets".
Paolo Abeni Feb. 13, 2024, 6:39 p.m. UTC | #8
On Tue, 2024-02-13 at 16:49 +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > [...]
> 
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket field.
> 
> The new semantic would be: supported by TCP (so far), and tcp
> recvmsg() only reads/writes this field when MSG_PEEK is used.
> Applications would have to clear the values themselves.

I feel like there is some misunderstanding, or at least I can't follow.
Let me be more verbose, to try to clarify my reasoning.

Two consecutive recvmsg(MSG_PEEK) calls for TCP after SO_PEEK_OFF will
return adjacent data. AFAICS this is the same semantic currently
implemented by UDP and unix sockets.

Currently 'sk_peek_off' maintains the next offset to be peeked into the
current receive queue. To implement the above behaviour, tcp_recvmsg()
has to update 'sk_peek_off' after MSG_PEEK, to move the offset to the
next data, and after a plain read, to account for the data removed from
the receive queue.
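
For instance, with illustrative byte counts:

	/* SO_PEEK_OFF set to 0, 300 bytes queued:
	 * recv(fd, buf, 100, MSG_PEEK) -> bytes   0..99,  sk_peek_off 0 -> 100
	 * recv(fd, buf, 100, MSG_PEEK) -> bytes 100..199, sk_peek_off 100 -> 200
	 * recv(fd, buf, 150, 0)        -> bytes   0..149 consumed,
	 *                                 sk_peek_off 200 -> 50
	 * recv(fd, buf, 100, MSG_PEEK) -> bytes 150..249, i.e. 50 past the
	 *                                 new queue head
	 */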

I proposed to introduce a tcp-specific set_peek_off doing something
like:

	WRITE_ONCE(sk->sk_peek_off, tcp_sk(sk)->copied_seq + val);

so that recvmsg() will need to update sk_peek_off only for MSG_PEEK,
while retaining the semantics described above.

To keep the userspace interface unchanged, that will need a paired
tcp_get_peek_off(), so that getsockopt(SO_PEEK_OFF) can return a plain
offset to the user. An additional bit flag will be needed to store the
information "the user-space enabled peek with offset".

I don't understand how a setsockopt(SO_PEEK_OFFSET) variant would help
avoid touching sk->sk_peek_off?

Thanks!

Paolo
Eric Dumazet Feb. 13, 2024, 7:31 p.m. UTC | #9
On Tue, Feb 13, 2024 at 7:39 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> [...]

I was trying to avoid using extra storage; I was not trying to
implement the alternative myself :0)

If recvmsg(MSG_PEEK) is supposed to auto-advance the peek_offset,
we probably need more than a mere 32-bit field.


David Gibson Feb. 13, 2024, 11:34 p.m. UTC | #10
On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> [...]
> 
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket field.
> 
> The new semantic would be: supported by TCP (so far), and tcp
> recvmsg() only reads/writes this field when MSG_PEEK is used.
> Applications would have to clear the values themselves.

Those semantics would likely defeat the purpose of using SO_PEEK_OFF
for our use case, since we'd need an additional setsockopt() for every
non-PEEK recv() (which are all MSG_TRUNC in our case).

> BTW I see the man pages say SO_PEEK_OFF "is currently supported
> only for unix(7) sockets".

Yes, this patch is explicitly aiming to change that.
Eric Dumazet Feb. 14, 2024, 3:41 a.m. UTC | #11
On Wed, Feb 14, 2024 at 12:34 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
>
> > BTW I see the man pages say SO_PEEK_OFF "is currently supported
> > only for unix(7) sockets".
>
> Yes, this patch is explicitly aiming to change that.

I was pointing out that UDP is supposed to support it already, since 2016;
the manpage should already mention UDP.

Thanks.
David Gibson Feb. 15, 2024, 3:16 a.m. UTC | #12
On Wed, Feb 14, 2024 at 04:41:14AM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 12:34 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> >
> > > BTW I see the man pages say SO_PEEK_OFF "is currently supported
> > > only for unix(7) sockets".
> >
> > Yes, this patch is explicitly aiming to change that.
> 
> I was pointing out that UDP is supposed to support it already, since 2016;
> the manpage should already mention UDP.

Oh, sorry, I misunderstood. I tend to forget about this being applied to
datagram sockets, since we have no use for that.
David Gibson Feb. 15, 2024, 3:21 a.m. UTC | #13
On Wed, Feb 14, 2024 at 10:34:50AM +1100, David Gibson wrote:
> On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> > [...]
> > 
> > We could have a parallel SO_PEEK_OFFSET option, reusing the same socket field.
> > 
> > The new semantic would be: supported by TCP (so far), and tcp
> > recvmsg() only reads/writes this field when MSG_PEEK is used.
> > Applications would have to clear the values themselves.
> 
> Those semantics would likely defeat the purpose of using SO_PEEK_OFF
> for our use case, since we'd need an additional setsockopt() for every
> non-PEEK recv() (which are all MSG_TRUNC in our case).

Btw, Eric,

If you're concerned about the extra access added to the "regular" TCP
path, would you be happier with the original approach Jon proposed,
which allowed a user to essentially supply an offset to an individual
MSG_PEEK recvmsg() by inserting a dummy entry as msg_iov[0] with a
NULL pointer and the length to skip?

It did the job for us, although I admit it's a little ugly, which I
presume is why Paolo suggested we investigate SO_PEEK_OFF instead.  I
think the SO_PEEK_OFF approach is more elegant, but maybe the
performance impact outweighs that.
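
For reference, that earlier interface looked roughly like this from
userspace (a sketch; it was never merged):

/* v1 idea: a NULL msg_iov[0] whose length tells the kernel how many
 * bytes to skip before copying, instead of a socket-level peek offset
 */
static ssize_t peek_at(int s, size_t skip, void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = NULL, .iov_len = skip },
		{ .iov_base = buf,  .iov_len = len  },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };

	return recvmsg(s, &mh, MSG_PEEK);
}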
Eric Dumazet Feb. 15, 2024, 9:15 a.m. UTC | #14
On Thu, Feb 15, 2024 at 4:21 AM David Gibson
<david@gibson.dropbear.id.au> wrote:

> [...]

Sorry, this was too ugly.

SO_PEEK_OFF is way better/standard; we have to polish the
implementation so that it is zero-cost for the 99.9999% of users not
using MSG_PEEK.
Paolo Abeni Feb. 15, 2024, 5:41 p.m. UTC | #15
Note: please send text-only email to netdev.

On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> I wonder if the following could be acceptable:
> 
>  if (flags & MSG_PEEK)
>         sk_peek_offset_fwd(sk, used);
>  else if (peek_offset > 0)
>        sk_peek_offset_bwd(sk, used);
> 
>  peek_offset is already present in the data cache, and if it has the value
>  zero it means either that sk->sk_peek_off is unused (-1) or that it is
>  actually zero. Either way, no rewind is needed in that case.

I agree the above should avoid touching cold cachelines in the
fastpath, and it looks functionally correct to me.

The last word is up to Eric :)

Cheers,

Paolo
Eric Dumazet Feb. 15, 2024, 5:46 p.m. UTC | #16
On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> [...]
>
> I agree the above should avoid touching cold cachelines in the
> fastpath, and looks functionally correct to me.
>
> The last word is up to Eric :)
>

An actual patch seems needed.

In the current form, local variable peek_offset is 0 when !MSG_PEEK.

So the "else if (peek_offset > 0)" would always be false.
Jon Maloy Feb. 15, 2024, 10:24 p.m. UTC | #17
On 2024-02-15 12:46, Eric Dumazet wrote:
> On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@redhat.com> wrote:
>> Note: please send text-only email to netdev.
>>
>> On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
>>> I wonder if the following could be acceptable:
>>>
>>>   if (flags & MSG_PEEK)
>>>          sk_peek_offset_fwd(sk, used);
>>>   else if (peek_offset > 0)
>>>         sk_peek_offset_bwd(sk, used);
>>>
>>>   peek_offset is already present in the data cache, and if it has the value
>>>   zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero.
>>>   Either way, no rewind is needed in that case.
>> I agree the above should avoid touching cold cachelines in the
>> fastpath, and looks functionally correct to me.
>>
>> The last word is up to Eric :)
>>
> An actual patch seems needed.
>
> In the current form, local variable peek_offset is 0 when !MSG_PEEK.
>
> So the "else if (peek_offset > 0)" would always be false.
>
Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the 
beginning of the function.
I will look at the other suggestions.

///jon
Paolo Abeni Feb. 16, 2024, 9:14 a.m. UTC | #18
On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> 
> On 2024-02-15 12:46, Eric Dumazet wrote:
> > [...]
> > An actual patch seems needed.
> > 
> > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > 
> > So the "else if (peek_offset > 0)" would always be false.
> > 
> Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the 
> beginning of the function.
> I will look at the other suggestions.

I *think* that moving sk_peek_off this way:

---
diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
 	unsigned int		sk_napi_id;
 #endif
 	int			sk_rcvbuf;
-	int			sk_disconnects;
+	int			sk_peek_off;
 
 	struct sk_filter __rcu	*sk_filter;
 	union {
@@ -439,7 +439,7 @@ struct sock {
 		struct rb_root	tcp_rtx_queue;
 	};
 	struct sk_buff_head	sk_write_queue;
-	__s32			sk_peek_off;
+	int			sk_disconnects;
 	int			sk_write_pending;
 	__u32			sk_dst_pending_confirm;
 	u32			sk_pacing_status; /* see enum sk_pacing */
---

should avoid problematic accesses.

The relevant cacheline layout is as follows:

	                /* --- cacheline 4 boundary (256 bytes) --- */
                struct sk_buff *   tail;                 /*   256     8 */
        } sk_backlog;                                    /*   240    24 */
        int                        sk_forward_alloc;     /*   264     4 */
        u32                        sk_reserved_mem;      /*   268     4 */
        unsigned int               sk_ll_usec;           /*   272     4 */
        unsigned int               sk_napi_id;           /*   276     4 */
        int                        sk_rcvbuf;            /*   280     4 */
        int                        sk_disconnects;       /*   284     4 */
				// will become sk_peek_off
        struct sk_filter *         sk_filter;            /*   288     8 */
        union {
                struct socket_wq * sk_wq;                /*   296     8 */
                struct socket_wq * sk_wq_raw;            /*   296     8 */
        };                                               /*   296     8 */
        struct xfrm_policy *       sk_policy[2];         /*   304    16 */
        /* --- cacheline 5 boundary (320 bytes) --- */

	//  ...
	
        /* --- cacheline 6 boundary (384 bytes) --- */
        __s32                      sk_peek_off;          /*   384     4 */
				// will become sk_disconnects
        int                        sk_write_pending;     /*   388     4 */
        __u32                      sk_dst_pending_confirm; /*   392     4 */
        u32                        sk_pacing_status;     /*   396     4 */
        long int                   sk_sndtimeo;          /*   400     8 */
        struct timer_list          sk_timer;             /*   408    40 */

        /* XXX last struct has 4 bytes of padding */

        /* --- cacheline 7 boundary (448 bytes) --- */

sk_peek_off will be in the same cacheline as sk_forward_alloc /
sk_reserved_mem / backlog tail, which are already touched by the
tcp_recvmsg_locked() main loop.

WDYT?

thanks!

Paolo
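
One way to make the intended placement explicit is a build-time check on the
field offsets; a minimal sketch, assuming the pahole offsets quoted above and
SMP_CACHE_BYTES-sized cachelines (an illustration, not part of the proposed
patch):

---
	/* After the swap, sk_peek_off (offset 284) should share the
	 * cacheline starting at 256 with sk_forward_alloc (264) and
	 * sk_rcvbuf (280), which the recvmsg fast path already touches.
	 */
	static_assert(offsetof(struct sock, sk_peek_off) / SMP_CACHE_BYTES ==
		      offsetof(struct sock, sk_forward_alloc) / SMP_CACHE_BYTES,
		      "sk_peek_off not on the recvmsg hot cacheline");
---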
Eric Dumazet Feb. 16, 2024, 9:21 a.m. UTC | #19
On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> >
> > On 2024-02-15 12:46, Eric Dumazet wrote:
> > > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > Note: please send text-only email to netdev.
> > > >
> > > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > > I wonder if the following could be acceptable:
> > > > >
> > > > >   if (flags & MSG_PEEK)
> > > > >          sk_peek_offset_fwd(sk, used);
> > > > >   else if (peek_offset > 0)
> > > > >         sk_peek_offset_bwd(sk, used);
> > > > >
> > > > >   peek_offset is already present in the data cache, and if it has the value
> > > > >  zero it means either that sk->sk_peek_off is unused (-1) or that it actually is zero.
> > > > >   Either way, no rewind is needed in that case.
> > > > I agree the above should avoid touching cold cachelines in the
> > > > fastpath, and looks functionally correct to me.
> > > >
> > > > The last word is up to Eric :)
> > > >
> > > An actual patch seems needed.
> > >
> > > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > >
> > > So the "else if (peek_offset > 0)" would always be false.
> > >
> > Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
> > beginning of the function.
> > I will look at the other suggestions.
>
> I *think* that moving sk_peek_off this way:
>
> ---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a9d99a9c583f..576a6a6abb03 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -413,7 +413,7 @@ struct sock {
>         unsigned int            sk_napi_id;
>  #endif
>         int                     sk_rcvbuf;
> -       int                     sk_disconnects;
> +       int                     sk_peek_off;
>
>         struct sk_filter __rcu  *sk_filter;
>         union {
> @@ -439,7 +439,7 @@ struct sock {
>                 struct rb_root  tcp_rtx_queue;
>         };
>         struct sk_buff_head     sk_write_queue;
> -       __s32                   sk_peek_off;
> +       int                     sk_disconnects;
>         int                     sk_write_pending;
>         __u32                   sk_dst_pending_confirm;
>         u32                     sk_pacing_status; /* see enum sk_pacing */
> ---
>
> should avoid problematic accesses.
>
> The relevant cacheline layout is as follows:
>
>                         /* --- cacheline 4 boundary (256 bytes) --- */
>                 struct sk_buff *   tail;                 /*   256     8 */
>         } sk_backlog;                                    /*   240    24 */
>         int                        sk_forward_alloc;     /*   264     4 */
>         u32                        sk_reserved_mem;      /*   268     4 */
>         unsigned int               sk_ll_usec;           /*   272     4 */
>         unsigned int               sk_napi_id;           /*   276     4 */
>         int                        sk_rcvbuf;            /*   280     4 */
>         int                        sk_disconnects;       /*   284     4 */
>                                 // will become sk_peek_off
>         struct sk_filter *         sk_filter;            /*   288     8 */
>         union {
>                 struct socket_wq * sk_wq;                /*   296     8 */
>                 struct socket_wq * sk_wq_raw;            /*   296     8 */
>         };                                               /*   296     8 */
>         struct xfrm_policy *       sk_policy[2];         /*   304    16 */
>         /* --- cacheline 5 boundary (320 bytes) --- */
>
>         //  ...
>
>         /* --- cacheline 6 boundary (384 bytes) --- */
>         __s32                      sk_peek_off;          /*   384     4 */
>                                 // will become sk_disconnects
>         int                        sk_write_pending;     /*   388     4 */
>         __u32                      sk_dst_pending_confirm; /*   392     4 */
>         u32                        sk_pacing_status;     /*   396     4 */
>         long int                   sk_sndtimeo;          /*   400     8 */
>         struct timer_list          sk_timer;             /*   408    40 */
>
>         /* XXX last struct has 4 bytes of padding */
>
>         /* --- cacheline 7 boundary (448 bytes) --- */
>
> sk_peek_off will be in the same cacheline as sk_forward_alloc /
> sk_reserved_mem / backlog tail, which are already touched by the
> tcp_recvmsg_locked() main loop.
>
> WDYT?

I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end
annotations.

I can finish this today.
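
For context, __cacheline_group_begin()/__cacheline_group_end() are zero-size
marker fields (used e.g. in struct net_device) that can be paired with
build-time assertions. A rough sketch of how such annotations might group the
receive-path fields; the group name and membership here are hypothetical, not
the layout Eric eventually posted:

---
struct sock {
	...
	__cacheline_group_begin(sock_read_rx);
	int			sk_rcvbuf;
	int			sk_peek_off;
	...
	__cacheline_group_end(sock_read_rx);
	...
};

/* Membership can then be asserted at build time, e.g.: */
CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_read_rx, sk_peek_off);
---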
Eric Dumazet Feb. 16, 2024, 10:55 a.m. UTC | #20
On Fri, Feb 16, 2024 at 11:13 AM Jon Maloy <jmaloy@redhat.com> wrote:

>
> There is also the following alternative:
>
> if (flags & MSG_PEEK)
>        sk_peek_offset_fwd(sk, used);
> else if (flags & MSG_TRUNC)
>        sk_peek_offset_bwd(sk, used);
>
> This is the way we use it, and probably the typical usage.
> It would force a user to drain the receive queue with MSG_TRUNC whenever
> SO_PEEK_OFF is in use, but I don't really see that as a limitation.
>
> Anyway, if Paolo's suggestion solves the problem this shouldn't be necessary.

I think the suggestion to move sk_peek_off came from my first message
on this thread ;)
 "We need to move sk_peek_off in a better location before we accept this patch."

Anyway, a complete reorg of 'struct sock' was overdue, I am working on it.
David Gibson Feb. 19, 2024, 2:02 a.m. UTC | #21
On Fri, Feb 16, 2024 at 05:13:34AM -0500, Jon Maloy wrote:
> 
> 
> On 2024-02-16 04:21, Eric Dumazet wrote:
> > On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni<pabeni@redhat.com>  wrote:
> > > On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> > > > On 2024-02-15 12:46, Eric Dumazet wrote:
> > > > > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni<pabeni@redhat.com>  wrote:
> > > > > > Note: please send text-only email to netdev.
> > > > > > 
> > > > > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > > > > I wonder if the following could be acceptable:
> > > > > > > 
> > > > > > >    if (flags & MSG_PEEK)
> > > > > > >           sk_peek_offset_fwd(sk, used);
> > > > > > >    else if (peek_offset > 0)
> > > > > > >          sk_peek_offset_bwd(sk, used);
> > > > > > > 
> > > > > > >    peek_offset is already present in the data cache, and if it has the value
> > > > > > >    zero it means either that sk->sk_peek_off is unused (-1) or that it actually is zero.
> > > > > > >    Either way, no rewind is needed in that case.
> > > > > > I agree the above should avoid touching cold cachelines in the
> > > > > > fastpath, and looks functionally correct to me.
> > > > > > 
> > > > > > The last word is up to Eric :)
> > > > > > 
> > > > > An actual patch seems needed.
> > > > > 
> > > > > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > > > > 
> > > > > So the "else if (peek_offset > 0)" would always be false.
> > > > > 
> > > > Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
> > > > beginning of the function.
> > > > I will look at the other suggestions.
> > > I *think* that moving sk_peek_off this way:
> > > 
> > > ---
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index a9d99a9c583f..576a6a6abb03 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -413,7 +413,7 @@ struct sock {
> > >          unsigned int            sk_napi_id;
> > >   #endif
> > >          int                     sk_rcvbuf;
> > > -       int                     sk_disconnects;
> > > +       int                     sk_peek_off;
> > > 
> > >          struct sk_filter __rcu  *sk_filter;
> > >          union {
> > > @@ -439,7 +439,7 @@ struct sock {
> > >                  struct rb_root  tcp_rtx_queue;
> > >          };
> > >          struct sk_buff_head     sk_write_queue;
> > > -       __s32                   sk_peek_off;
> > > +       int                     sk_disconnects;
> > >          int                     sk_write_pending;
> > >          __u32                   sk_dst_pending_confirm;
> > >          u32                     sk_pacing_status; /* see enum sk_pacing */
> > > ---
> > > 
> > > should avoid problematic accesses.
> > >
> > > The relevant cacheline layout is as follows:
> > > 
> > >                          /* --- cacheline 4 boundary (256 bytes) --- */
> > >                  struct sk_buff *   tail;                 /*   256     8 */
> > >          } sk_backlog;                                    /*   240    24 */
> > >          int                        sk_forward_alloc;     /*   264     4 */
> > >          u32                        sk_reserved_mem;      /*   268     4 */
> > >          unsigned int               sk_ll_usec;           /*   272     4 */
> > >          unsigned int               sk_napi_id;           /*   276     4 */
> > >          int                        sk_rcvbuf;            /*   280     4 */
> > >          int                        sk_disconnects;       /*   284     4 */
> > >                                  // will become sk_peek_off
> > >          struct sk_filter *         sk_filter;            /*   288     8 */
> > >          union {
> > >                  struct socket_wq * sk_wq;                /*   296     8 */
> > >                  struct socket_wq * sk_wq_raw;            /*   296     8 */
> > >          };                                               /*   296     8 */
> > >          struct xfrm_policy *       sk_policy[2];         /*   304    16 */
> > >          /* --- cacheline 5 boundary (320 bytes) --- */
> > > 
> > >          //  ...
> > > 
> > >          /* --- cacheline 6 boundary (384 bytes) --- */
> > >          __s32                      sk_peek_off;          /*   384     4 */
> > >                                  // will become sk_disconnects
> > >          int                        sk_write_pending;     /*   388     4 */
> > >          __u32                      sk_dst_pending_confirm; /*   392     4 */
> > >          u32                        sk_pacing_status;     /*   396     4 */
> > >          long int                   sk_sndtimeo;          /*   400     8 */
> > >          struct timer_list          sk_timer;             /*   408    40 */
> > > 
> > >          /* XXX last struct has 4 bytes of padding */
> > > 
> > >          /* --- cacheline 7 boundary (448 bytes) --- */
> > > 
> > > sk_peek_off will be in the same cacheline as sk_forward_alloc /
> > > sk_reserved_mem / backlog tail, which are already touched by the
> > > tcp_recvmsg_locked() main loop.
> > > 
> > > WDYT?
> > I was about to send a similar change, also moving sk_rcvtimeo, and
> > adding __cacheline_group_begin()/__cacheline_group_end
> > annotations.
> > 
> > I can finish this today.
> > 
> There is also the following alternative:
> 
> if (flags & MSG_PEEK)
>        sk_peek_offset_fwd(sk, used);
> else if (flags & MSG_TRUNC)
>        sk_peek_offset_bwd(sk, used);
> 
> This is the way we use it, and probably the typical usage.
> It would force a user to drain the receive queue with MSG_TRUNC whenever
> SO_PEEK_OFF is in use, but I don't really see that as a limitation.

I really don't like this, although it would certainly do what we need
for passt/pasta.  SO_PEEK_OFF has established semantics for Unix
sockets, which include regular recv() adjusting the offset.  Having
it behave subtly differently for TCP seems like a very bad idea.

> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.
> 
> ///jon
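
For reference, the established semantics David refers to, as documented in
socket(7) for Unix sockets: with SO_PEEK_OFF enabled, peeks start at the
offset and advance it, while ordinary reads rewind it by the amount consumed.
A minimal user-space sketch (error handling omitted):

---
	int off = 0;
	char buf[1024];

	/* Enable the peek offset; it starts at 0. */
	setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

	recv(fd, buf, sizeof(buf), MSG_PEEK);	/* peek from offset 0, offset += n */
	recv(fd, buf, sizeof(buf), MSG_PEEK);	/* continue where the last peek ended */
	recv(fd, buf, 512, 0);			/* consume 512 bytes, offset -= 512 */
---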
diff mbox series

Patch

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 4e635dd3d3c8..5f0e5d10c416 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@  const struct proto_ops inet_stream_ops = {
 #endif
 	.splice_eof	   = inet_splice_eof,
 	.splice_read	   = tcp_splice_read,
+	.set_peek_off      = sk_set_peek_off,
 	.read_sock	   = tcp_read_sock,
 	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7e2481b9eae1..1c8cab14a32c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1415,8 +1415,6 @@  static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2327,6 +2325,7 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2360,7 +2359,8 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2463,11 +2463,11 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2508,7 +2508,10 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3007,6 +3010,7 @@  int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);
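
With the patch applied, the splicer pattern discussed in the thread (peek only
not-yet-forwarded data at a kernel-tracked offset, then discard acknowledged
data without a dummy buffer) could look like this minimal user-space sketch;
tcp_fd, buf and acked are hypothetical, and error handling is omitted:

---
	int zero = 0;

	setsockopt(tcp_fd, SOL_SOCKET, SO_PEEK_OFF, &zero, sizeof(zero));

	/* Peek only data not yet sent towards the guest; the kernel
	 * advances the peek offset past what was already peeked. */
	ssize_t n = recv(tcp_fd, buf, sizeof(buf), MSG_PEEK);

	/* ... forward n bytes to the container/guest ... */

	/* Once the peer acknowledges 'acked' bytes, drop them from the
	 * receive queue without copying them out; consuming data rewinds
	 * the peek offset accordingly. */
	recv(tcp_fd, NULL, acked, MSG_TRUNC);
---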