diff mbox series

[bpf,v3,07/12] bpf: sockmap incorrectly handling copied_seq

Message ID 20230403200138.937569-8-john.fastabend@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series bpf sockmap fixes

Checks

Context                         Check    Description
bpf/vmtest-bpf-PR               success  PR summary
bpf/vmtest-bpf-VM_Test-1        success  Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-VM_Test-2        success  Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-3        success  Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-4        fail     Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-5        success  Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-6        success  Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-7        fail     Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-8        success  Logs for set-matrix
netdev/series_format            success  Posting correctly formatted
netdev/tree_selection           success  Clearly marked for bpf, async
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 1012 this patch: 1012
netdev/cc_maintainers           warning  4 maintainers not CCed: kuba@kernel.org pabeni@redhat.com dsahern@kernel.org davem@davemloft.net
netdev/build_clang              success  Errors and warnings before: 121 this patch: 121
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 1018 this patch: 1018
netdev/checkpatch               warning  WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")'
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

John Fastabend April 3, 2023, 8:01 p.m. UTC
The read_skb() logic is incrementing tcp->copied_seq, which is used for,
among other things, calculating how many outstanding bytes can be read by
the application. This results in application errors: if the application
does an ioctl(FIONREAD) we return zero, because this is calculated from
the copied_seq value.
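
For reference, this is roughly what the application side of that check
looks like. It is a minimal userspace sketch, not part of the patch:
pending_bytes() and pending_after_write() are made-up helper names, and
an AF_UNIX socketpair is used only so the example runs without any
network setup. The ioctl(FIONREAD) call itself is the same one an
application would issue on a TCP socket in a sockmap.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the number of bytes the kernel reports as readable on fd,
 * or -1 if the ioctl fails. Hypothetical helper, not from the patch.
 */
static int pending_bytes(int fd)
{
	int avail = 0;

	if (ioctl(fd, FIONREAD, &avail) < 0)
		return -1;
	return avail;
}

/* Queue len bytes on one end of an AF_UNIX stream pair and report what
 * FIONREAD says on the other end before anything has been read.
 * Returns -1 on any setup or write failure.
 */
static int pending_after_write(const char *buf, size_t len)
{
	int sv[2];
	int avail;

	assert(buf != NULL);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (write(sv[0], buf, len) != (ssize_t)len) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	avail = pending_bytes(sv[1]);
	close(sv[0]);
	close(sv[1]);
	return avail;
}
```

On a healthy socket the value matches the unread byte count. With the
bug described above, a sockmap TCP socket would report zero here even
though data is still waiting for recvmsg().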

To fix this we move tcp->copied_seq accounting into the recv handler so
that we update it when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before, read_skb() was calling
tcp_rcv_space_adjust(), which would update 'number of bytes copied to
user in last RTT'. That is wrong for programs returning SK_PASS, where
the bytes are only copied to the user when recvmsg is handled.
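
The difference between the two accounting points can be shown with a
small userspace model. This is a sketch under stated assumptions, not
kernel code: struct model_tcp and the model_*() functions are invented
for illustration, and the FIONREAD answer is reduced to
rcv_nxt - copied_seq (the real calculation also handles FIN and urgent
data).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct tcp_sock fields involved. */
struct model_tcp {
	uint32_t rcv_nxt;    /* next sequence number expected from the peer */
	uint32_t copied_seq; /* first sequence number not yet given to the user */
	uint32_t queued;     /* bytes passed by the verdict program, awaiting recvmsg() */
};

/* Simplified FIONREAD answer: rcv_nxt - copied_seq, modulo 2^32. */
static uint32_t model_inq(const struct model_tcp *tp)
{
	return tp->rcv_nxt - tp->copied_seq;
}

/* A segment arrives and the verdict program returns SK_PASS. A non-zero
 * account_in_read_skb models the pre-patch behavior, where read_skb()
 * advanced copied_seq right away.
 */
static void model_rx_pass(struct model_tcp *tp, uint32_t len,
			  int account_in_read_skb)
{
	assert(tp != 0);
	tp->rcv_nxt += len;
	tp->queued += len;
	if (account_in_read_skb)
		tp->copied_seq += len;
}

/* recvmsg() copies up to want bytes to the user and returns the count.
 * A non-zero account_in_recvmsg models the patched behavior, where
 * copied_seq only moves once data has been copied out.
 */
static uint32_t model_recvmsg(struct model_tcp *tp, uint32_t want,
			      int account_in_recvmsg)
{
	uint32_t copied = want < tp->queued ? want : tp->queued;

	tp->queued -= copied;
	if (account_in_recvmsg)
		tp->copied_seq += copied;
	return copied;
}
```

In this model, 100 bytes passed with the old accounting leave
model_inq() at 0 before the application has read anything. With the new
accounting it stays at 100 and shrinks only as model_recvmsg() copies
data out.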

Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit trickier. Build a tcp_eat_skb() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single recv socket is
redirecting to many sockets (for example for load balancing) the receiver
sk will be hung even though we might expect it to continue. The hang
comes from not updating the copied_seq numbers and memory pressure
resulting from that.

We have a slight layering problem of calling tcp_eat_skb() even if it is
not a TCP socket. To fix this we could refactor and create per-type
receiver handlers. I decided this is more work than we want in the fix,
and we already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use, and then we can gate the
copied_seq updates on this.
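
The gating described in the last two paragraphs can be sketched as a
userspace model. It follows the guard order of tcp_eat_skb() as it
appears in the diff, but everything named model_* is an assumption made
for illustration, and the calls to tcp_rcv_space_adjust() and
__tcp_cleanup_rbuf() that the real helper makes are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real types are struct sock and struct sk_buff. */
struct model_sock {
	bool is_tcp;         /* what sk_is_tcp() would report */
	uint32_t copied_seq;
};

struct model_skb {
	uint32_t len;
	bool strparser;      /* what skb_bpf_strparser() would report */
};

/* Advance copied_seq for an skb that was redirected or dropped instead
 * of being delivered through recvmsg(). Does nothing for a missing or
 * empty skb, a non-TCP socket, or an skb carrying the strparser bit.
 * Returns true when copied_seq was advanced.
 */
static bool model_eat_skb(struct model_sock *sk, const struct model_skb *skb)
{
	assert(sk != NULL);
	if (!skb || !skb->len || !sk->is_tcp)
		return false;
	if (skb->strparser)
		return false;
	sk->copied_seq += skb->len;
	return true;
}
```

Because the strparser bit is now set regardless of the verdict, the
redirect and drop paths can call the helper unconditionally and let
these checks decide whether the sequence number moves.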

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tcp.h  |  3 +++
 net/core/skmsg.c   |  7 +++++--
 net/ipv4/tcp.c     | 10 +---------
 net/ipv4/tcp_bpf.c | 28 +++++++++++++++++++++++++++-
 4 files changed, 36 insertions(+), 12 deletions(-)

Comments

kernel test robot April 4, 2023, 3:02 a.m. UTC | #1
Hi John,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf/master]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Fastabend/bpf-sockmap-pass-skb-ownership-through-read_skb/20230404-040431
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230403200138.937569-8-john.fastabend%40gmail.com
patch subject: [PATCH bpf v3 07/12] bpf: sockmap incorrectly handling copied_seq
config: sparc64-randconfig-r025-20230403 (https://download.01.org/0day-ci/archive/20230404/202304041013.HATc3V5L-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/fbafbc850ec4ef5aa7e5c39d8133f291ec4c0bb8
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review John-Fastabend/bpf-sockmap-pass-skb-ownership-through-read_skb/20230404-040431
        git checkout fbafbc850ec4ef5aa7e5c39d8133f291ec4c0bb8
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc64 SHELL=/bin/bash net/core/ net/ipv4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304041013.HATc3V5L-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   net/core/skmsg.c: In function 'sk_psock_verdict_apply':
>> net/core/skmsg.c:1056:17: error: implicit declaration of function 'tcp_eat_skb'; did you mean 'tcp_read_skb'? [-Werror=implicit-function-declaration]
    1056 |                 tcp_eat_skb(psock->sk, skb);
         |                 ^~~~~~~~~~~
         |                 tcp_read_skb
   cc1: some warnings being treated as errors
--
>> net/ipv4/tcp_bpf.c:14:6: warning: no previous prototype for 'tcp_eat_skb' [-Wmissing-prototypes]
      14 | void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
         |      ^~~~~~~~~~~


vim +1056 net/core/skmsg.c

   986	
   987	static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
   988					  int verdict)
   989	{
   990		struct sk_psock_work_state *state;
   991		struct sock *sk_other;
   992		int err = 0;
   993		u32 len, off;
   994	
   995		switch (verdict) {
   996		case __SK_PASS:
   997			err = -EIO;
   998			sk_other = psock->sk;
   999			if (sock_flag(sk_other, SOCK_DEAD) ||
  1000			    !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
  1001				skb_bpf_redirect_clear(skb);
  1002				goto out_free;
  1003			}
  1004	
  1005			skb_bpf_set_ingress(skb);
  1006	
  1007			/* We need to grab mutex here because in-flight skb is in one of
  1008			 * the following states: either on ingress_skb, in psock->state
  1009			 * or being processed by backlog and neither in state->skb and
  1010			 * ingress_skb may be also empty. The troublesome case is when
  1011			 * the skb has been dequeued from ingress_skb list or taken from
  1012			 * state->skb because we can not easily test this case. Maybe we
  1013			 * could be clever with flags and resolve this but being clever
  1014			 * got us here in the first place and we note this is done under
  1015			 * sock lock and backlog conditions mean we are already running
  1016			 * into ENOMEM or other performance hindering cases so lets do
  1017			 * the obvious thing and grab the mutex.
  1018			 */
  1019			mutex_lock(&psock->work_mutex);
  1020			state = &psock->work_state;
  1021	
  1022			/* If the queue is empty then we can submit directly
  1023			 * into the msg queue. If its not empty we have to
  1024			 * queue work otherwise we may get OOO data. Otherwise,
  1025			 * if sk_psock_skb_ingress errors will be handled by
  1026			 * retrying later from workqueue.
  1027			 */
  1028			if (skb_queue_empty(&psock->ingress_skb) && likely(!state->skb)) {
  1029				len = skb->len;
  1030				off = 0;
  1031				if (skb_bpf_strparser(skb)) {
  1032					struct strp_msg *stm = strp_msg(skb);
  1033	
  1034					off = stm->offset;
  1035					len = stm->full_len;
  1036				}
  1037				err = sk_psock_skb_ingress_self(psock, skb, off, len);
  1038			}
  1039			if (err < 0) {
  1040				spin_lock_bh(&psock->ingress_lock);
  1041				if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
  1042					skb_queue_tail(&psock->ingress_skb, skb);
  1043					schedule_delayed_work(&psock->work, 0);
  1044					err = 0;
  1045				}
  1046				spin_unlock_bh(&psock->ingress_lock);
  1047				if (err < 0) {
  1048					skb_bpf_redirect_clear(skb);
  1049					mutex_unlock(&psock->work_mutex);
  1050					goto out_free;
  1051				}
  1052			}
  1053			mutex_unlock(&psock->work_mutex);
  1054			break;
  1055		case __SK_REDIRECT:
> 1056			tcp_eat_skb(psock->sk, skb);
  1057			err = sk_psock_skb_redirect(psock, skb);
  1058			break;
  1059		case __SK_DROP:
  1060		default:
  1061	out_free:
  1062			tcp_eat_skb(psock->sk, skb);
  1063			skb_bpf_redirect_clear(skb);
  1064			sock_drop(psock->sk, skb);
  1065		}
  1066	
  1067		return err;
  1068	}
  1069
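
One plausible reading of this failure, inferred from the diff rather
than stated by the robot: the new tcp_eat_skb() prototype is added just
above the "#endif /* CONFIG_BPF_SYSCALL */" line in include/net/tcp.h,
so a config that builds net/core/skmsg.c with that option disabled sees
no declaration. A common pattern for this situation is a no-op stub in
the #else branch. The sketch below shows that pattern in plain C.
MODEL_HAVE_TCP_BPF, model_eat() and model_drop() are hypothetical names,
and this is not the fix that was applied to the series.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock {
	unsigned int copied_seq;
};

/* MODEL_HAVE_TCP_BPF is a made-up switch standing in for a Kconfig
 * symbol. It is left undefined here, so the stub below is used.
 */
#ifdef MODEL_HAVE_TCP_BPF
void model_eat(struct model_sock *sk, unsigned int len);
#else
/* No-op stub so callers still compile when the real helper is not built. */
static inline void model_eat(struct model_sock *sk, unsigned int len)
{
	(void)sk;
	(void)len;
}
#endif

/* Caller that needs no #ifdef of its own around the call. Returns the
 * socket's copied_seq after the (possibly stubbed) helper has run.
 */
static unsigned int model_drop(struct model_sock *sk, unsigned int len)
{
	assert(sk != NULL);
	model_eat(sk, len);
	return sk->copied_seq;
}
```

With the switch undefined the call compiles to nothing and copied_seq is
left untouched, which is the behavior wanted when the TCP BPF helper is
not part of the build.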
kernel test robot April 4, 2023, 3:12 a.m. UTC | #2
Hi John,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf/master]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Fastabend/bpf-sockmap-pass-skb-ownership-through-read_skb/20230404-040431
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230403200138.937569-8-john.fastabend%40gmail.com
patch subject: [PATCH bpf v3 07/12] bpf: sockmap incorrectly handling copied_seq
config: hexagon-randconfig-r041-20230403 (https://download.01.org/0day-ci/archive/20230404/202304041028.XAQryEFM-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project 67409911353323ca5edf2049ef0df54132fa1ca7)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/fbafbc850ec4ef5aa7e5c39d8133f291ec4c0bb8
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review John-Fastabend/bpf-sockmap-pass-skb-ownership-through-read_skb/20230404-040431
        git checkout fbafbc850ec4ef5aa7e5c39d8133f291ec4c0bb8
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash net/core/ net/ipv4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304041028.XAQryEFM-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from net/core/skmsg.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
                                                     ^
   In file included from net/core/skmsg.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
                                                     ^
   In file included from net/core/skmsg.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
>> net/core/skmsg.c:1056:3: error: call to undeclared function 'tcp_eat_skb'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   tcp_eat_skb(psock->sk, skb);
                   ^
   net/core/skmsg.c:1056:3: note: did you mean 'tcp_read_skb'?
   include/net/tcp.h:682:5: note: 'tcp_read_skb' declared here
   int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
       ^
   net/core/skmsg.c:1216:3: error: call to undeclared function 'tcp_eat_skb'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   tcp_eat_skb(sk, skb);
                   ^
   6 warnings and 2 errors generated.
--
   In file included from net/ipv4/tcp_bpf.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
                                                     ^
   In file included from net/ipv4/tcp_bpf.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
                                                     ^
   In file included from net/ipv4/tcp_bpf.c:4:
   In file included from include/linux/skmsg.h:7:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
>> net/ipv4/tcp_bpf.c:14:6: warning: no previous prototype for function 'tcp_eat_skb' [-Wmissing-prototypes]
   void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
        ^
   net/ipv4/tcp_bpf.c:14:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
   ^
   static 
   7 warnings generated.



Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..674044b8bdaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1467,6 +1467,8 @@  static inline void tcp_adjust_rcv_ssthresh(struct sock *sk)
 }
 
 void tcp_cleanup_rbuf(struct sock *sk, int copied);
+void __tcp_cleanup_rbuf(struct sock *sk, int copied);
+
 
 /* We provision sk_rcvbuf around 200% of sk_rcvlowat.
  * If 87.5 % (7/8) of the space has been consumed, we want to override
@@ -2321,6 +2323,7 @@  struct sk_psock;
 struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
 int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
+void tcp_eat_skb(struct sock *sk, struct sk_buff *skb);
 #endif /* CONFIG_BPF_SYSCALL */
 
 int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress,
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a2e83d2aacf8..69983f40fbec 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1053,11 +1053,14 @@  static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		mutex_unlock(&psock->work_mutex);
 		break;
 	case __SK_REDIRECT:
+		tcp_eat_skb(psock->sk, skb);
 		err = sk_psock_skb_redirect(psock, skb);
 		break;
 	case __SK_DROP:
 	default:
 out_free:
+		tcp_eat_skb(psock->sk, skb);
+		skb_bpf_redirect_clear(skb);
 		sock_drop(psock->sk, skb);
 	}
 
@@ -1102,8 +1105,7 @@  static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		skb_dst_drop(skb);
 		skb_bpf_redirect_clear(skb);
 		ret = bpf_prog_run_pin_on_cpu(prog, skb);
-		if (ret == SK_PASS)
-			skb_bpf_set_strparser(skb);
+		skb_bpf_set_strparser(skb);
 		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
 		skb->sk = NULL;
 	}
@@ -1211,6 +1213,7 @@  static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		len = 0;
+		tcp_eat_skb(sk, skb);
 		sock_drop(sk, skb);
 		goto out;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1be305e3d3c7..5610f8341b38 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1568,7 +1568,7 @@  static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void __tcp_cleanup_rbuf(struct sock *sk, int copied)
+void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
@@ -1783,14 +1783,6 @@  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 			break;
 		}
 	}
-	WRITE_ONCE(tp->copied_seq, seq);
-
-	tcp_rcv_space_adjust(sk);
-
-	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
-		__tcp_cleanup_rbuf(sk, copied);
-
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_skb);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ae6c7130551c..9e94864ce130 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -11,6 +11,24 @@ 
 #include <net/inet_common.h>
 #include <net/tls.h>
 
+void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tcp;
+	int copied;
+
+	if (!skb || !skb->len || !sk_is_tcp(sk))
+		return;
+
+	if (skb_bpf_strparser(skb))
+		return;
+
+	tcp = tcp_sk(sk);
+	copied = tcp->copied_seq + skb->len;
+	WRITE_ONCE(tcp->copied_seq, copied);
+	tcp_rcv_space_adjust(sk);
+	__tcp_cleanup_rbuf(sk, skb->len);
+}
+
 static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
 			   struct sk_msg *msg, u32 apply_bytes, int flags)
 {
@@ -198,8 +216,10 @@  static int tcp_bpf_recvmsg_parser(struct sock *sk,
 				  int flags,
 				  int *addr_len)
 {
+	struct tcp_sock *tcp = tcp_sk(sk);
+	u32 seq = tcp->copied_seq;
 	struct sk_psock *psock;
-	int copied;
+	int copied = 0;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
@@ -244,9 +264,11 @@  static int tcp_bpf_recvmsg_parser(struct sock *sk,
 
 		if (is_fin) {
 			copied = 0;
+			seq++;
 			goto out;
 		}
 	}
+	seq += copied;
 	if (!copied) {
 		long timeo;
 		int data;
@@ -284,6 +306,10 @@  static int tcp_bpf_recvmsg_parser(struct sock *sk,
 		copied = -EAGAIN;
 	}
 out:
+	WRITE_ONCE(tcp->copied_seq, seq);
+	tcp_rcv_space_adjust(sk);
+	if (copied > 0)
+		__tcp_cleanup_rbuf(sk, copied);
 	release_sock(sk);
 	sk_psock_put(sk, psock);
 	return copied;