Message ID | 20240201213201.1228681-1-jmaloy@redhat.com |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [RFC,net-next] tcp: add support for SO_PEEK_OFF |
On Thu, 1 Feb 2024 16:32:01 -0500 jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

I guess this was Suggested-by: Paolo Abeni <pabeni@redhat.com> ?
:)

> ---
> include/net/tcp.h | 1 +
> net/ipv4/af_inet.c | 1 +
> net/ipv4/tcp.c | 25 +++++++++++++++++++------
> 3 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87f0e6c2e1f2..7eca7f2ac102 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
> ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
> struct pipe_inode_info *pipe, size_t len,
> unsigned int flags);
> +int tcp_set_peek_offset(struct sock *sk, int val);
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule);
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index fb81de10d332..7a8b3a91257f 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
> #endif
> .splice_eof = inet_splice_eof,
> .splice_read = tcp_splice_read,
> + .set_peek_off = tcp_set_peek_offset,
> .read_sock = tcp_read_sock,
> .read_skb = tcp_read_skb,
> .sendmsg_locked = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index fce5668a6a3d..33ade88633de 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> }
> EXPORT_SYMBOL(tcp_splice_read);
>
> +int tcp_set_peek_offset(struct sock *sk, int val)
> +{
> + WRITE_ONCE(sk->sk_peek_off, val);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(tcp_set_peek_offset);
> +
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule)
> {
> @@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
> * tricks with *seq access order and skb->users are not required.
> * Probably, code can be easily improved even more.
> */
> -

Stray change.

> static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int flags, struct scm_timestamping_internal *tss,
> int *cmsg_flags)
> @@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> long timeo;
> struct sk_buff *skb, *last;
> u32 urg_hole = 0;
> + u32 peek_offset = 0;
>
> err = -ENOTCONN;
> if (sk->sk_state == TCP_LISTEN)
> @@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>
> seq = &tp->copied_seq;
> if (flags & MSG_PEEK) {
> - peek_seq = tp->copied_seq;
> + peek_offset = max(sk_peek_offset(sk, flags), 0);
> + peek_seq = tp->copied_seq + peek_offset;
> seq = &peek_seq;
> }
>

And with this, explicit support in tcp_peek_sndq() is not actually needed,
but this comment in that function:

  /* XXX -- need to support SO_PEEK_OFF */

should be removed now I guess.

> @@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if ((flags & MSG_PEEK) &&
> - (peek_seq - copied - urg_hole != tp->copied_seq)) {
> + (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> current->comm,
> task_pid_nr(current));
> - peek_seq = tp->copied_seq;
> + peek_seq = tp->copied_seq + peek_offset;
> }
> continue;
>
> @@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> WRITE_ONCE(*seq, *seq + used);
> copied += used;
> len -= used;
> -
> + if (flags & MSG_PEEK)
> + sk_peek_offset_fwd(sk, used);
> + else
> + sk_peek_offset_bwd(sk, used);
> tcp_rcv_space_adjust(sk);
>
> skip_copy:
> @@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
> data_was_unread += len;
> __kfree_skb(skb);
> }
> + sk_set_peek_off(sk, -1);
>
> /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
> if (sk->sk_state == TCP_CLOSE)
> @@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
> reqsk_fastopen_remove(sk, req, false);
>
> WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
> -
> + sk_set_peek_off(sk, -1);
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
> else
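For context, the usage pattern the commit message describes looks roughly like the sketch below: a userspace splicer sets SO_PEEK_OFF once and then peeks with MSG_PEEK at a growing offset, instead of draining already-forwarded but not yet acknowledged data into a dummy buffer. This is an illustrative sketch only, not code from the patch or from passt/pasta; `peek_from()` and `fd` are hypothetical names, and error handling is reduced to the essentials.

```c
#include <sys/socket.h>

/* Sketch: peek new TCP data starting 'already_peeked' bytes into the
 * receive queue, without consuming anything. Assumes 'fd' is a connected
 * TCP socket on a kernel where SO_PEEK_OFF is supported for TCP.
 */
static int peek_from(int fd, size_t already_peeked, char *buf, size_t len)
{
	int off = (int)already_peeked;

	/* Start the next MSG_PEEK at 'off' instead of re-copying data that
	 * was already forwarded. On kernels without TCP SO_PEEK_OFF support
	 * this setsockopt() fails (EOPNOTSUPP) and the caller must fall back
	 * to the dummy-buffer approach.
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) < 0)
		return -1;

	/* The kernel advances the peek offset by the number of bytes peeked
	 * and winds it back when data is later consumed without MSG_PEEK.
	 */
	return (int)recv(fd, buf, len, MSG_PEEK);
}
```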
On Thu, Feb 01, 2024 at 04:32:01PM -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

Apart from one trivial nit noted below,
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> include/net/tcp.h | 1 +
> net/ipv4/af_inet.c | 1 +
> net/ipv4/tcp.c | 25 +++++++++++++++++++------
> 3 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87f0e6c2e1f2..7eca7f2ac102 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
> ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
> struct pipe_inode_info *pipe, size_t len,
> unsigned int flags);
> +int tcp_set_peek_offset(struct sock *sk, int val);
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule);
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index fb81de10d332..7a8b3a91257f 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
> #endif
> .splice_eof = inet_splice_eof,
> .splice_read = tcp_splice_read,
> + .set_peek_off = tcp_set_peek_offset,
> .read_sock = tcp_read_sock,
> .read_skb = tcp_read_skb,
> .sendmsg_locked = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index fce5668a6a3d..33ade88633de 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> }
> EXPORT_SYMBOL(tcp_splice_read);
>
> +int tcp_set_peek_offset(struct sock *sk, int val)
> +{
> + WRITE_ONCE(sk->sk_peek_off, val);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(tcp_set_peek_offset);
> +
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule)
> {
> @@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
> * tricks with *seq access order and skb->users are not required.
> * Probably, code can be easily improved even more.
> */
> -

Extraneous whitespace change here.

> static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int flags, struct scm_timestamping_internal *tss,
> int *cmsg_flags)
> @@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> long timeo;
> struct sk_buff *skb, *last;
> u32 urg_hole = 0;
> + u32 peek_offset = 0;
>
> err = -ENOTCONN;
> if (sk->sk_state == TCP_LISTEN)
> @@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>
> seq = &tp->copied_seq;
> if (flags & MSG_PEEK) {
> - peek_seq = tp->copied_seq;
> + peek_offset = max(sk_peek_offset(sk, flags), 0);
> + peek_seq = tp->copied_seq + peek_offset;
> seq = &peek_seq;
> }
>
> @@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if ((flags & MSG_PEEK) &&
> - (peek_seq - copied - urg_hole != tp->copied_seq)) {
> + (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> current->comm,
> task_pid_nr(current));
> - peek_seq = tp->copied_seq;
> + peek_seq = tp->copied_seq + peek_offset;
> }
> continue;
>
> @@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> WRITE_ONCE(*seq, *seq + used);
> copied += used;
> len -= used;
> -
> + if (flags & MSG_PEEK)
> + sk_peek_offset_fwd(sk, used);
> + else
> + sk_peek_offset_bwd(sk, used);
> tcp_rcv_space_adjust(sk);
>
> skip_copy:
> @@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
> data_was_unread += len;
> __kfree_skb(skb);
> }
> + sk_set_peek_off(sk, -1);
>
> /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
> if (sk->sk_state == TCP_CLOSE)
> @@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
> reqsk_fastopen_remove(sk, req, false);
>
> WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
> -
> + sk_set_peek_off(sk, -1);
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
> else
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 87f0e6c2e1f2..7eca7f2ac102 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
 ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
                         struct pipe_inode_info *pipe, size_t len,
                         unsigned int flags);
+int tcp_set_peek_offset(struct sock *sk, int val);
 struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
                                      bool force_schedule);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index fb81de10d332..7a8b3a91257f 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
 #endif
         .splice_eof = inet_splice_eof,
         .splice_read = tcp_splice_read,
+        .set_peek_off = tcp_set_peek_offset,
         .read_sock = tcp_read_sock,
         .read_skb = tcp_read_skb,
         .sendmsg_locked = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fce5668a6a3d..33ade88633de 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 }
 EXPORT_SYMBOL(tcp_splice_read);
 
+int tcp_set_peek_offset(struct sock *sk, int val)
+{
+        WRITE_ONCE(sk->sk_peek_off, val);
+
+        return 0;
+}
+EXPORT_SYMBOL(tcp_set_peek_offset);
+
 struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
                                      bool force_schedule)
 {
@@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
  * tricks with *seq access order and skb->users are not required.
  * Probably, code can be easily improved even more.
  */
-
 static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
                               int flags, struct scm_timestamping_internal *tss,
                               int *cmsg_flags)
@@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
         long timeo;
         struct sk_buff *skb, *last;
         u32 urg_hole = 0;
+        u32 peek_offset = 0;
 
         err = -ENOTCONN;
         if (sk->sk_state == TCP_LISTEN)
@@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
         seq = &tp->copied_seq;
         if (flags & MSG_PEEK) {
-                peek_seq = tp->copied_seq;
+                peek_offset = max(sk_peek_offset(sk, flags), 0);
+                peek_seq = tp->copied_seq + peek_offset;
                 seq = &peek_seq;
         }
 
@@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
                 }
 
                 if ((flags & MSG_PEEK) &&
-                    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+                    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
                         net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
                                             current->comm,
                                             task_pid_nr(current));
-                        peek_seq = tp->copied_seq;
+                        peek_seq = tp->copied_seq + peek_offset;
                 }
                 continue;
 
@@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
                 WRITE_ONCE(*seq, *seq + used);
                 copied += used;
                 len -= used;
-
+                if (flags & MSG_PEEK)
+                        sk_peek_offset_fwd(sk, used);
+                else
+                        sk_peek_offset_bwd(sk, used);
                 tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
                 data_was_unread += len;
                 __kfree_skb(skb);
         }
+        sk_set_peek_off(sk, -1);
 
         /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
         if (sk->sk_state == TCP_CLOSE)
@@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
                 reqsk_fastopen_remove(sk, req, false);
 
         WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
-
+        sk_set_peek_off(sk, -1);
         if (!sock_flag(sk, SOCK_DEAD))
                 sk->sk_state_change(sk);
         else
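Taken together, the tcp_recvmsg_locked() changes mean the peek offset is advanced by sk_peek_offset_fwd() for every MSG_PEEK read and wound back by sk_peek_offset_bwd() when data is actually consumed, mirroring the existing SO_PEEK_OFF behaviour of Unix domain sockets. Below is a hypothetical walk-through of what an application should then observe on a TCP socket with this patch applied; the byte counts are made up, and each recv() is assumed to return the full amount requested.

```c
#include <sys/socket.h>

/* Sketch of the expected peek-offset bookkeeping on a connected TCP
 * socket 'fd'. Not part of the patch; values are purely illustrative.
 */
static void peek_off_demo(int fd)
{
	char buf[2048];
	int off = 0;

	/* Enable offset tracking; without this patch the setsockopt()
	 * fails with EOPNOTSUPP for TCP sockets.
	 */
	setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

	recv(fd, buf, 1000, MSG_PEEK); /* peeks bytes 0..999; offset -> 1000 */
	recv(fd, buf, 500, MSG_PEEK);  /* peeks bytes 1000..1499; offset -> 1500 */
	recv(fd, buf, 600, 0);         /* consumes bytes 0..599; the offset is
	                                * rewound to 900 (sk_peek_offset_bwd) */
	recv(fd, buf, 100, MSG_PEEK);  /* next peek starts 900 bytes into the
	                                * not-yet-consumed receive queue */
}
```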