| Message ID | 20240209221233.3150253-1-jmaloy@redhat.com |
|---|---|
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | [v3] tcp: add support for SO_PEEK_OFF |
On Fri, 9 Feb 2024 17:12:33 -0500 jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>                       -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>                       -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
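For readers who have not used the option before, here is a minimal
user-space sketch of the peek-with-offset pattern the commit message
describes, following the unix(7) SO_PEEK_OFF semantics that the patch
extends to TCP. It is illustrative only (not code from the thread or
from passt); the function name is a placeholder and error handling is
trimmed.

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Enable peek-with-offset on a connected TCP socket, then read the
     * receive queue twice without consuming it.  Per SO_PEEK_OFF
     * semantics, each MSG_PEEK advances the kernel-held offset by the
     * bytes peeked, and a later non-PEEK read rewinds it by the bytes
     * consumed. */
    static ssize_t peek_twice(int fd, char *buf, size_t len)
    {
            int off = 0;
            ssize_t n1, n2;

            if (setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) < 0)
                    return -1;      /* e.g. kernel without TCP support for it */

            n1 = recv(fd, buf, len, MSG_PEEK);      /* bytes 0..n1-1 */
            n2 = recv(fd, buf, len, MSG_PEEK);      /* continues at byte n1 */

            return n1 < 0 ? n1 : n2;
    }

Setting the option to a non-negative value enables the feature; setting
it to -1 disables it again.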
Oops,

I just noticed Eric is missing from the recipients list, adding him
now.

On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> [...]
>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> ---
>  net/ipv4/af_inet.c |  1 +
>  net/ipv4/tcp.c     | 16 ++++++++++------
>  2 files changed, 11 insertions(+), 6 deletions(-)
>
> [patch quoted in full; see the diff at the end of this page]
On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Oops,
>
> I just noticed Eric is missing from the recipients list, adding him
> now.

Hmmm thanks.

> On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > When reading received messages from a socket with MSG_PEEK, we may want
> > to read the contents with an offset, like we can do with pread/preadv()
> > when reading files. Currently, it is not possible to do that.
> >
> > [...]
> >
> > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >  		WRITE_ONCE(*seq, *seq + used);
> >  		copied += used;
> >  		len -= used;
> > -
> > +		if (flags & MSG_PEEK)
> > +			sk_peek_offset_fwd(sk, used);
> > +		else
> > +			sk_peek_offset_bwd(sk, used);

Yet another cache miss in TCP fast path...

We need to move sk_peek_off in a better location before we accept this
patch.

I always thought MSG_PEEK was very inefficient, I am surprised we
allow arbitrary loops in recvmsg().
On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > [...]
>
> Yet another cache miss in TCP fast path...
>
> We need to move sk_peek_off in a better location before we accept this
> patch.
>
> I always thought MSG_PEEK was very inefficient, I am surprised we
> allow arbitrary loops in recvmsg().

Let me double check I read the above correctly: are you concerned by
the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
touch a lot of skbs/cachelines before reaching the relevant skb?

The end goal here is allowing a user-space application to read the
received data incrementally/sequentially while leaving it in the
receive buffer.

I don't see a better option than MSG_PEEK, am I missing something?

Thanks,

Paolo
On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Let me double check I read the above correctly: are you concerned by
> the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
> touch a lot of skbs/cachelines before reaching the relevant skb?
>
> The end goal here is allowing a user-space application to read the
> received data incrementally/sequentially while leaving it in the
> receive buffer.
>
> I don't see a better option than MSG_PEEK, am I missing something?

This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the
non-MSG_PEEK case, is very strange IMO.

Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
is used by the caller.

That would only touch non fast paths.

Since the API is mono-threaded anyway, the caller should not rely on
the fact that a normal recvmsg() call would 'consume' sk_peek_offset.
On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote:
> This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the
> non-MSG_PEEK case, is very strange IMO.
>
> Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
> is used by the caller.
>
> That would only touch non fast paths.
>
> Since the API is mono-threaded anyway, the caller should not rely on
> the fact that a normal recvmsg() call would 'consume' sk_peek_offset.

Storing in sk_peek_seq the tcp next sequence number to be peeked should
avoid changes in the non MSG_PEEK cases.

AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
(in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
that be acceptable?

Thanks!

Paolo
On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Storing in sk_peek_seq the tcp next sequence number to be peeked should
> avoid changes in the non MSG_PEEK cases.
>
> AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
> (in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
> that be acceptable?

We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
field.

The new semantic would be: supported by TCP (so far), and tcp recvmsg()
only reads/writes this field when MSG_PEEK is used. Applications would
have to clear the value themselves.

BTW I see the man pages say SO_PEEK_OFF "is currently supported only
for unix(7) sockets".
On Tue, 2024-02-13 at 16:49 +0100, Eric Dumazet wrote:
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
> field.
>
> The new semantic would be: supported by TCP (so far), and tcp recvmsg()
> only reads/writes this field when MSG_PEEK is used. Applications would
> have to clear the value themselves.

I feel like there is some misunderstanding, or at least I can't follow.
Let me be more verbose, to try to clarify my reasoning.

Two consecutive recvmsg(MSG_PEEK) calls for TCP after SO_PEEK_OFF will
return adjacent data. AFAICS this is the same semantic currently
implemented by UDP and unix sockets.

Currently 'sk_peek_off' maintains the next offset to be peeked into the
current receive queue. To implement the above behaviour, tcp_recvmsg()
has to update 'sk_peek_off' after MSG_PEEK, to move the offset to the
next data, and after a plain read, to account for the data removed from
the receive queue.

I proposed to introduce a tcp-specific set_peek_off doing something
alike:

  WRITE_ONCE(sk->sk_peek_off, tcp_sk(sk)->copied_seq + val);

so that recvmsg will need to update sk_peek_off only for MSG_PEEK,
while retaining the semantic described above.

To keep the userspace interface unchanged that will need a paired
tcp_get_peek_off(), so that getsockopt(SO_PEEK_OFF) could return to the
user a plain offset. An additional bit flag will be needed to store the
information "the user-space enabled peek with offset".

I don't understand how a setsockopt(PEEK_OFFSET) variant would help
avoid touching sk->sk_peek_offset?

Thanks!

Paolo
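For concreteness, a rough sketch of the sock_op pair Paolo outlines
above. This is illustrative only, not a tested patch from the thread:
the function names follow his description, and the bit flag recording
"user space enabled peek with offset" is omitted.

    /* Store the peek *sequence number* rather than an offset, so plain
     * (non-PEEK) reads never need to touch the field. */
    static int tcp_set_peek_off(struct sock *sk, int val)
    {
            /* sk_peek_off now holds an absolute sequence number */
            WRITE_ONCE(sk->sk_peek_off, tcp_sk(sk)->copied_seq + val);
            return 0;
    }

    static int tcp_get_peek_off(struct sock *sk)
    {
            /* translate back to a plain offset for getsockopt(SO_PEEK_OFF) */
            return READ_ONCE(sk->sk_peek_off) - tcp_sk(sk)->copied_seq;
    }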
On Tue, Feb 13, 2024 at 7:39 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> I feel like there is some misunderstanding, or at least I can't follow.
> Let me be more verbose, to try to clarify my reasoning.
>
> [...]
>
> I don't understand how a setsockopt(PEEK_OFFSET) variant would help
> avoid touching sk->sk_peek_offset?

I was trying to avoid using extra storage, I was not trying to
implement the alternative myself :0)

If the recvmsg(MSG_PEEK) is supposed to auto-advance the peek_offset,
we probably need more than a mere 32bit field.
On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > [...]
>
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
> field.
>
> The new semantic would be: supported by TCP (so far), and tcp recvmsg()
> only reads/writes this field when MSG_PEEK is used. Applications would
> have to clear the value themselves.

Those semantics would likely defeat the purpose of using SO_PEEK_OFF
for our use case, since we'd need an additional setsockopt() for every
non-PEEK recv() (which are all MSG_TRUNC in our case).

> BTW I see the man pages say SO_PEEK_OFF "is currently supported only
> for unix(7) sockets".

Yes, this patch is explicitly aiming to change that.
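For context, a sketch of the discard pattern David refers to. This is
an assumed shape of the passt/pasta code, not a quote from it; per
tcp(7), MSG_TRUNC on a TCP socket discards the received bytes instead
of copying them out, so a NULL buffer works.

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Drop 'acked' bytes that the guest has acknowledged, without a
     * bounce buffer.  With the patch's semantics the kernel rewinds the
     * peek offset on this non-PEEK read; under a "touch sk_peek_off
     * only on MSG_PEEK" rule, the application would need an extra
     * setsockopt(SO_PEEK_OFF) after every such call. */
    static ssize_t discard_acked(int fd, size_t acked)
    {
            return recv(fd, NULL, acked, MSG_TRUNC);
    }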
On Wed, Feb 14, 2024 at 12:34 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> > BTW I see the man pages say SO_PEEK_OFF "is currently supported only
> > for unix(7) sockets".
>
> Yes, this patch is explicitly aiming to change that.

I was pointing out that UDP is supposed to support it already, since
2016; the manpage should mention UDP already.

Thanks.
On Wed, Feb 14, 2024 at 04:41:14AM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 12:34 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > Yes, this patch is explicitly aiming to change that.
>
> I was pointing out that UDP is supposed to support it already, since
> 2016; the manpage should mention UDP already.

Oh, sorry, I misunderstood. I tend to forget this also applies to
datagram sockets, since we have no use for that.
On Wed, Feb 14, 2024 at 10:34:50AM +1100, David Gibson wrote:
> On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> > [...]
> >
> > We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
> > field.
> >
> > The new semantic would be: supported by TCP (so far), and tcp recvmsg()
> > only reads/writes this field when MSG_PEEK is used. Applications would
> > have to clear the value themselves.
>
> Those semantics would likely defeat the purpose of using SO_PEEK_OFF
> for our use case, since we'd need an additional setsockopt() for every
> non-PEEK recv() (which are all MSG_TRUNC in our case).

Btw, Eric,

If you're concerned about the extra access added to the "regular" TCP
path, would you be happier with the original approach Jon proposed:
that allowed a user to essentially supply an offset to an individual
MSG_PEEK recvmsg() by inserting a dummy entry as msg_iov[0] with a
NULL pointer and length to skip.

It did the job for us, although I admit it's a little ugly, which I
presume is why Paolo suggested we investigate SO_PEEK_OFF instead. I
think the SO_PEEK_OFF approach is more elegant, but maybe the
performance impact outweighs that.
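As a reference, here is what that earlier interface would have looked
like from user space, reconstructed from the description above rather
than from the original series; the function name is a placeholder.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Peek 'len' bytes starting 'skip' bytes into the receive queue by
     * passing a NULL first iovec entry.  Only meaningful with the
     * earlier, unmerged kernel change; mainline recvmsg() would fault
     * on the NULL base instead of skipping. */
    static ssize_t peek_skipping(int fd, size_t skip, void *buf, size_t len)
    {
            struct iovec iov[2] = {
                    { .iov_base = NULL, .iov_len = skip },  /* bytes to step over */
                    { .iov_base = buf,  .iov_len = len  },  /* real destination */
            };
            struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

            return recvmsg(fd, &msg, MSG_PEEK);
    }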
On Thu, Feb 15, 2024 at 4:21 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
> Btw, Eric,
>
> If you're concerned about the extra access added to the "regular" TCP
> path, would you be happier with the original approach Jon proposed:
> that allowed a user to essentially supply an offset to an individual
> MSG_PEEK recvmsg() by inserting a dummy entry as msg_iov[0] with a
> NULL pointer and length to skip.
>
> It did the job for us, although I admit it's a little ugly, which I
> presume is why Paolo suggested we investigate SO_PEEK_OFF instead. I
> think the SO_PEEK_OFF approach is more elegant, but maybe the
> performance impact outweighs that.

Sorry, this was too ugly. SO_PEEK_OFF is way better/standard, we have
to polish the implementation so that it is zero-cost for 99.9999 % of
the users not using MSG_PEEK.
Note: please send text-only email to netdev.

On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> I wonder if the following could be acceptable:
>
>         if (flags & MSG_PEEK)
>                 sk_peek_offset_fwd(sk, used);
>         else if (peek_offset > 0)
>                 sk_peek_offset_bwd(sk, used);
>
> peek_offset is already present in the data cache, and if it has the
> value zero it means either that sk->sk_peek_off is unused (-1) or
> actually is zero. Either way, no rewind is needed in that case.

I agree the above should avoid touching cold cachelines in the
fastpath, and looks functionally correct to me.

The last word is up to Eric :)

Cheers,

Paolo
On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > I wonder if the following could be acceptable:
> >
> >         if (flags & MSG_PEEK)
> >                 sk_peek_offset_fwd(sk, used);
> >         else if (peek_offset > 0)
> >                 sk_peek_offset_bwd(sk, used);
>
> I agree the above should avoid touching cold cachelines in the
> fastpath, and looks functionally correct to me.
>
> The last word is up to Eric :)

An actual patch seems needed.

In the current form, local variable peek_offset is 0 when !MSG_PEEK.

So the "else if (peek_offset > 0)" would always be false.
On 2024-02-15 12:46, Eric Dumazet wrote:
> An actual patch seems needed.
>
> In the current form, local variable peek_offset is 0 when !MSG_PEEK.
>
> So the "else if (peek_offset > 0)" would always be false.

Yes, of course. This wouldn't work unless we read sk->sk_peek_off at
the beginning of the function.
I will look at the other suggestions.

///jon
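For concreteness, the shape such a fix might take. This is a sketch of
the correction Jon alludes to, not the patch that was eventually
posted; the variable name is a placeholder.

    /* Early in tcp_recvmsg_locked(): snapshot the peek offset once, so
     * the rewind test below does not depend on a local that is only
     * meaningful under MSG_PEEK. */
    s32 start_off = READ_ONCE(sk->sk_peek_off);  /* -1 when unused */

    /* ... later, in the copy loop, after 'used' bytes were handled: */
    if (flags & MSG_PEEK)
            sk_peek_offset_fwd(sk, used);
    else if (start_off > 0)
            sk_peek_offset_bwd(sk, used);

Note that this still reads sk_peek_off in the non-PEEK fast path, which
is what motivates the field-placement discussion below.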
On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> Yes, of course. This wouldn't work unless we read sk->sk_peek_off at
> the beginning of the function.
> I will look at the other suggestions.

I *think* that moving sk_peek_off this way:

---
diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
 	unsigned int		sk_napi_id;
 #endif
 	int			sk_rcvbuf;
-	int			sk_disconnects;
+	int			sk_peek_off;
 
 	struct sk_filter __rcu	*sk_filter;
 	union {
@@ -439,7 +439,7 @@ struct sock {
 		struct rb_root	tcp_rtx_queue;
 	};
 	struct sk_buff_head	sk_write_queue;
-	__s32			sk_peek_off;
+	int			sk_disconnects;
 	int			sk_write_pending;
 	__u32			sk_dst_pending_confirm;
 	u32			sk_pacing_status; /* see enum sk_pacing */
---

should avoid problematic accesses.

The relevant cachelines layout is as follows:

	/* --- cacheline 4 boundary (256 bytes) --- */
		struct sk_buff *   tail;                 /*   256     8 */
	} sk_backlog;                                    /*   240    24 */
	int                        sk_forward_alloc;     /*   264     4 */
	u32                        sk_reserved_mem;      /*   268     4 */
	unsigned int               sk_ll_usec;           /*   272     4 */
	unsigned int               sk_napi_id;           /*   276     4 */
	int                        sk_rcvbuf;            /*   280     4 */
	int                        sk_disconnects;       /*   284     4 */ // will become sk_peek_off
	struct sk_filter *         sk_filter;            /*   288     8 */
	union {
		struct socket_wq * sk_wq;                /*   296     8 */
		struct socket_wq * sk_wq_raw;            /*   296     8 */
	};                                               /*   296     8 */
	struct xfrm_policy *       sk_policy[2];         /*   304    16 */
	/* --- cacheline 5 boundary (320 bytes) --- */

	// ...

	/* --- cacheline 6 boundary (384 bytes) --- */
	__s32                      sk_peek_off;          /*   384     4 */ // will become sk_disconnects
	int                        sk_write_pending;     /*   388     4 */
	__u32                      sk_dst_pending_confirm; /* 392     4 */
	u32                        sk_pacing_status;     /*   396     4 */
	long int                   sk_sndtimeo;          /*   400     8 */
	struct timer_list          sk_timer;             /*   408    40 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 7 boundary (448 bytes) --- */

sk_peek_off will be in the same cacheline as sk_forward_alloc /
sk_reserved_mem / backlog tail, which are already touched by the
tcp_recvmsg_locked() main loop.

WDYT?

Thanks!

Paolo
On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> [...]
>
> sk_peek_off will be in the same cacheline as sk_forward_alloc /
> sk_reserved_mem / backlog tail, which are already touched by the
> tcp_recvmsg_locked() main loop.
>
> WDYT?

I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end() annotations.

I can finish this today.
On Fri, Feb 16, 2024 at 11:13 AM Jon Maloy <jmaloy@redhat.com> wrote:
>
> There is also the following alternative:
>
>         if (flags & MSG_PEEK)
>                 sk_peek_offset_fwd(sk, used);
>         else if (flags & MSG_TRUNC)
>                 sk_peek_offset_bwd(sk, used);
>
> This is the way we use it, and probably the typical usage.
> It would force a user to drain the receive queue with MSG_TRUNC
> whenever he is using SO_PEEK_OFF, but I don't really see that as a
> limitation.
>
> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.

I think the suggestion to move sk_peek_off came from my first message
on this thread ;)

"We need to move sk_peek_off in a better location before we accept this
patch."

Anyway, a complete reorg of 'struct sock' was overdue, I am working on
it.
On Fri, Feb 16, 2024 at 05:13:34AM -0500, Jon Maloy wrote:
> On 2024-02-16 04:21, Eric Dumazet wrote:
> > On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > [...]
> > >
> > > sk_peek_off will be in the same cacheline as sk_forward_alloc /
> > > sk_reserved_mem / backlog tail, which are already touched by the
> > > tcp_recvmsg_locked() main loop.
> > >
> > > WDYT?
> >
> > I was about to send a similar change, also moving sk_rcvtimeo, and
> > adding __cacheline_group_begin()/__cacheline_group_end() annotations.
> >
> > I can finish this today.
>
> There is also the following alternative:
>
>         if (flags & MSG_PEEK)
>                 sk_peek_offset_fwd(sk, used);
>         else if (flags & MSG_TRUNC)
>                 sk_peek_offset_bwd(sk, used);
>
> This is the way we use it, and probably the typical usage.
> It would force a user to drain the receive queue with MSG_TRUNC
> whenever he is using SO_PEEK_OFF, but I don't really see that as a
> limitation.

I really don't like this, although it would certainly do what we need
for passt/pasta. SO_PEEK_OFF has established semantics for Unix
sockets, which include regular recv() adjusting the offset. Having it
behave subtly differently for TCP seems like a very bad idea.

> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.
>
> ///jon
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 4e635dd3d3c8..5f0e5d10c416 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
 #endif
 	.splice_eof	   = inet_splice_eof,
 	.splice_read	   = tcp_splice_read,
+	.set_peek_off	   = sk_set_peek_off,
 	.read_sock	   = tcp_read_sock,
 	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7e2481b9eae1..1c8cab14a32c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);