diff mbox series

[bpf,1/3] bpf: tcp_read_skb needs to pop skb regardless of seq

Message ID 20230920232706.498747-2-john.fastabend@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf, sockmap complete fixes for avail bytes | expand

Checks

Context Check Description
bpf/vmtest-bpf-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-10 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-PR fail PR summary
bpf/vmtest-bpf-VM_Test-7 success Logs for test_maps on s390x with gcc
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1342 this patch: 1342
netdev/cc_maintainers fail 2 blamed authors not CCed: edumazet@google.com cong.wang@bytedance.com; 6 maintainers not CCed: pabeni@redhat.com davem@davemloft.net edumazet@google.com dsahern@kernel.org cong.wang@bytedance.com kuba@kernel.org
netdev/build_clang fail Errors and warnings before: 1364 this patch: 1366
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1365 this patch: 1365
netdev/checkpatch warning WARNING: 'splitted' may be misspelled - perhaps 'split'? WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")' WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")'
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

John Fastabend Sept. 20, 2023, 11:27 p.m. UTC
Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq
value. This (as described in the commit) would cause an error for apps
because once that is incremented the application might believe there is no
data to be read. Then some apps would stall or abort believing no data is
available.

However, the fix is incomplete because it introduces another issue in
the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume
as many skbs as possible. The problem is the call is,

  tcp_recv_skb(sk, seq, &offset)

Where 'seq' is

  u32 seq = tp->copied_seq;

Now we can hit a case where we've yet incremented copied_seq from BPF side,
but then tcp_recv_skb() fails this test,

 if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))

so that instead of returning the skb we call tcp_eat_recv_skb() which frees
the skb. This is because the routine believes the SKB has been collapsed
per comment,

 /* This looks weird, but this can happen if TCP collapsing
  * splitted a fat GRO packet, while we released socket lock
  * in skb_splice_bits()
  */

This can't happen here we've unlinked the full SKB and orphaned it. Anyways
it would confuse any BPF programs if the data were suddenly moved underneath
it.

To fix this situation do simpler operation and just skb_peek() the data
of the queue followed by the unlink. It shouldn't need to check this
condition and tcp_read_skb() reads entire skbs so there is no need to
handle the 'offset!=0' case as we would see in tcp_read_sock().

Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Simon Horman Sept. 21, 2023, 9:08 p.m. UTC | #1
On Wed, Sep 20, 2023 at 04:27:04PM -0700, John Fastabend wrote:
> Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq
> value. This (as described in the commit) would cause an error for apps
> because once that is incremented the application might believe there is no
> data to be read. Then some apps would stall or abort believing no data is
> available.
> 
> However, the fix is incomplete because it introduces another issue in
> the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume
> as many skbs as possible. The problem is the call is,
> 
>   tcp_recv_skb(sk, seq, &offset)
> 
> Where 'seq' is
> 
>   u32 seq = tp->copied_seq;
> 
> Now we can hit a case where we've yet incremented copied_seq from BPF side,
> but then tcp_recv_skb() fails this test,
> 
>  if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
> 
> so that instead of returning the skb we call tcp_eat_recv_skb() which frees
> the skb. This is because the routine believes the SKB has been collapsed
> per comment,
> 
>  /* This looks weird, but this can happen if TCP collapsing
>   * splitted a fat GRO packet, while we released socket lock
>   * in skb_splice_bits()
>   */
> 
> This can't happen here we've unlinked the full SKB and orphaned it. Anyways
> it would confuse any BPF programs if the data were suddenly moved underneath
> it.
> 
> To fix this situation do simpler operation and just skb_peek() the data
> of the queue followed by the unlink. It shouldn't need to check this
> condition and tcp_read_skb() reads entire skbs so there is no need to
> handle the 'offset!=0' case as we would see in tcp_read_sock().
> 
> Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/ipv4/tcp.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0c3040a63ebd..45e7f39e67bc 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1625,12 +1625,11 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
>  	u32 seq = tp->copied_seq;

Hi John,

according to clang-16, with this change seq is now set but unused.
I guess seq can simply be removed as part of this change.

>  	struct sk_buff *skb;
>  	int copied = 0;
> -	u32 offset;
>  
>  	if (sk->sk_state == TCP_LISTEN)
>  		return -ENOTCONN;
>  
> -	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> +	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
>  		u8 tcp_flags;
>  		int used;
>  
> -- 
> 2.33.0
> 
>
John Fastabend Sept. 21, 2023, 9:23 p.m. UTC | #2
Simon Horman wrote:
> On Wed, Sep 20, 2023 at 04:27:04PM -0700, John Fastabend wrote:
> > Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq
> > value. This (as described in the commit) would cause an error for apps
> > because once that is incremented the application might believe there is no
> > data to be read. Then some apps would stall or abort believing no data is
> > available.
> > 
> > However, the fix is incomplete because it introduces another issue in
> > the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume
> > as many skbs as possible. The problem is the call is,
> > 
> >   tcp_recv_skb(sk, seq, &offset)
> > 
> > Where 'seq' is
> > 
> >   u32 seq = tp->copied_seq;
> > 
> > Now we can hit a case where we've yet incremented copied_seq from BPF side,
> > but then tcp_recv_skb() fails this test,
> > 
> >  if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
> > 
> > so that instead of returning the skb we call tcp_eat_recv_skb() which frees
> > the skb. This is because the routine believes the SKB has been collapsed
> > per comment,
> > 
> >  /* This looks weird, but this can happen if TCP collapsing
> >   * splitted a fat GRO packet, while we released socket lock
> >   * in skb_splice_bits()
> >   */
> > 
> > This can't happen here we've unlinked the full SKB and orphaned it. Anyways
> > it would confuse any BPF programs if the data were suddenly moved underneath
> > it.
> > 
> > To fix this situation do simpler operation and just skb_peek() the data
> > of the queue followed by the unlink. It shouldn't need to check this
> > condition and tcp_read_skb() reads entire skbs so there is no need to
> > handle the 'offset!=0' case as we would see in tcp_read_sock().
> > 
> > Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  net/ipv4/tcp.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 0c3040a63ebd..45e7f39e67bc 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1625,12 +1625,11 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> >  	u32 seq = tp->copied_seq;
> 
> Hi John,
> 
> according to clang-16, with this change seq is now set but unused.
> I guess seq can simply be removed as part of this change.

Yeah I'll send a v2. Thanks.

> 
> >  	struct sk_buff *skb;
> >  	int copied = 0;
> > -	u32 offset;
> >  
> >  	if (sk->sk_state == TCP_LISTEN)
> >  		return -ENOTCONN;
> >  
> > -	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> > +	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
> >  		u8 tcp_flags;
> >  		int used;
> >  
> > -- 
> > 2.33.0
> > 
> >
kernel test robot Sept. 23, 2023, 2:37 p.m. UTC | #3
Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf/master]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Fastabend/bpf-tcp_read_skb-needs-to-pop-skb-regardless-of-seq/20230921-073209
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230920232706.498747-2-john.fastabend%40gmail.com
patch subject: [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230923/202309232236.36lvZlKR-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309232236.36lvZlKR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309232236.36lvZlKR-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from net/ipv4/tcp.c:252:
   In file included from include/linux/inet_diag.h:5:
   In file included from include/net/netlink.h:6:
   In file included from include/linux/netlink.h:7:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from net/ipv4/tcp.c:252:
   In file included from include/linux/inet_diag.h:5:
   In file included from include/net/netlink.h:6:
   In file included from include/linux/netlink.h:7:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from net/ipv4/tcp.c:252:
   In file included from include/linux/inet_diag.h:5:
   In file included from include/net/netlink.h:6:
   In file included from include/linux/netlink.h:7:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     692 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     700 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     708 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     717 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     726 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     735 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> net/ipv4/tcp.c:1625:6: warning: variable 'seq' set but not used [-Wunused-but-set-variable]
    1625 |         u32 seq = tp->copied_seq;
         |             ^
   13 warnings generated.


vim +/seq +1625 net/ipv4/tcp.c

^1da177e4c3f41 Linus Torvalds 2005-04-16  1621  
965b57b469a589 Cong Wang      2022-06-15  1622  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
04919bed948dc2 Cong Wang      2022-06-15  1623  {
04919bed948dc2 Cong Wang      2022-06-15  1624  	struct tcp_sock *tp = tcp_sk(sk);
04919bed948dc2 Cong Wang      2022-06-15 @1625  	u32 seq = tp->copied_seq;
04919bed948dc2 Cong Wang      2022-06-15  1626  	struct sk_buff *skb;
04919bed948dc2 Cong Wang      2022-06-15  1627  	int copied = 0;
04919bed948dc2 Cong Wang      2022-06-15  1628  
04919bed948dc2 Cong Wang      2022-06-15  1629  	if (sk->sk_state == TCP_LISTEN)
04919bed948dc2 Cong Wang      2022-06-15  1630  		return -ENOTCONN;
04919bed948dc2 Cong Wang      2022-06-15  1631  
98edede7d6d1b6 John Fastabend 2023-09-20  1632  	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
db4192a754ebd5 Cong Wang      2022-09-12  1633  		u8 tcp_flags;
db4192a754ebd5 Cong Wang      2022-09-12  1634  		int used;
04919bed948dc2 Cong Wang      2022-06-15  1635  
04919bed948dc2 Cong Wang      2022-06-15  1636  		__skb_unlink(skb, &sk->sk_receive_queue);
96628951869c0d Peilin Ye      2022-09-08  1637  		WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
db4192a754ebd5 Cong Wang      2022-09-12  1638  		tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
db4192a754ebd5 Cong Wang      2022-09-12  1639  		used = recv_actor(sk, skb);
db4192a754ebd5 Cong Wang      2022-09-12  1640  		if (used < 0) {
db4192a754ebd5 Cong Wang      2022-09-12  1641  			if (!copied)
db4192a754ebd5 Cong Wang      2022-09-12  1642  				copied = used;
db4192a754ebd5 Cong Wang      2022-09-12  1643  			break;
db4192a754ebd5 Cong Wang      2022-09-12  1644  		}
db4192a754ebd5 Cong Wang      2022-09-12  1645  		seq += used;
db4192a754ebd5 Cong Wang      2022-09-12  1646  		copied += used;
db4192a754ebd5 Cong Wang      2022-09-12  1647  
db4192a754ebd5 Cong Wang      2022-09-12  1648  		if (tcp_flags & TCPHDR_FIN) {
04919bed948dc2 Cong Wang      2022-06-15  1649  			++seq;
db4192a754ebd5 Cong Wang      2022-09-12  1650  			break;
db4192a754ebd5 Cong Wang      2022-09-12  1651  		}
04919bed948dc2 Cong Wang      2022-06-15  1652  	}
04919bed948dc2 Cong Wang      2022-06-15  1653  	return copied;
04919bed948dc2 Cong Wang      2022-06-15  1654  }
04919bed948dc2 Cong Wang      2022-06-15  1655  EXPORT_SYMBOL(tcp_read_skb);
04919bed948dc2 Cong Wang      2022-06-15  1656
diff mbox series

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0c3040a63ebd..45e7f39e67bc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1625,12 +1625,11 @@  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 	u32 seq = tp->copied_seq;
 	struct sk_buff *skb;
 	int copied = 0;
-	u32 offset;
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
 		u8 tcp_flags;
 		int used;