[bpf,v10,07/14] bpf: sockmap, wake up polling after data copy

Message ID 20230523025618.113937-8-john.fastabend@gmail.com (mailing list archive)
State Accepted
Delegated to: BPF
Series bpf sockmap fixes

Checks

Context                         Check    Description
bpf/vmtest-bpf-VM_Test-1        success  Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-6        success  Logs for set-matrix
bpf/vmtest-bpf-VM_Test-5        success  Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-2        success  Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-4        success  Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-3        success  Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-7        success  Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9        success  Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-10       success  Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-19       success  Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-20       success  Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-21       success  Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-22       success  Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-23       success  Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24       success  Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-25       success  Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-26       success  Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-27       success  Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-28       success  Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-29       success  Logs for veristat
bpf/vmtest-bpf-VM_Test-11       success  Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-13       success  Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-14       success  Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-15       success  Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-17       success  Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18       success  Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-16       success  Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-12       success  Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-8        success  Logs for test_maps on s390x with gcc
netdev/series_format            success  Posting correctly formatted
netdev/tree_selection           success  Clearly marked for bpf, async
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers           fail     1 blamed authors not CCed: cong.wang@bytedance.com; 4 maintainers not CCed: kuba@kernel.org cong.wang@bytedance.com davem@davemloft.net pabeni@redhat.com
netdev/build_clang              success  Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 10 this patch: 10
netdev/checkpatch               warning  WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")'
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
bpf/vmtest-bpf-PR               success  PR summary

Commit Message

John Fastabend May 23, 2023, 2:56 a.m. UTC
When the TCP stack has data ready to read, sk_data_ready() is called. Sockmap
overwrites this callback with its own handler, which calls into the BPF
verdict program. But the original TCP socket had sock_def_readable(), which
would additionally wake up any user space waiters with sk_wake_async().

Sockmap saved the original callback when the socket was created, so call the
saved data ready callback after the copy; that wakes up any epoll() logic
waiting on the read.

Note that we call it on 'copied >= 0' to account for read_skb() returning 0
when a FIN is received: user space must be woken in that case as well so it
can do the recvmsg() -> 0 and detect the shutdown.

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
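
The user-space side of the behavior described above can be sketched as follows. This is only an illustration under stated assumptions: it uses an AF_UNIX socketpair as a stand-in for a TCP socket in a sockmap, and the helper names wait_and_recv() and demo_fin_wakeup() are hypothetical. It shows a reader blocked in epoll_wait() that needs one wakeup for the data and another for the end of stream, where recv() returns 0.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read once.
 * Returns the recv() result (0 means the peer shut down its write side),
 * -1 on error, or -2 if epoll_wait() timed out without a wakeup.
 */
int wait_and_recv(int fd, char *buf, size_t len, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out;
	int epfd, n;

	assert(buf != NULL && len > 0);

	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	n = epoll_wait(epfd, &out, 1, timeout_ms);
	close(epfd);
	if (n < 0)
		return -1;
	if (n == 0)
		return -2;	/* no wakeup arrived: the waiter would hang */

	return (int)recv(fd, buf, len, 0);
}

/*
 * Hypothetical demo: the peer sends "hi" and then shuts down its write side.
 * Returns 0 if the reader saw 2 bytes followed by end of stream.
 */
int demo_fin_wakeup(void)
{
	int sv[2];
	char buf[16];
	int first, second;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	if (send(sv[0], "hi", 2, 0) != 2 || shutdown(sv[0], SHUT_WR) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	first = wait_and_recv(sv[1], buf, sizeof(buf), 1000);	/* data */
	second = wait_and_recv(sv[1], buf, sizeof(buf), 1000);	/* EOF */

	close(sv[0]);
	close(sv[1]);
	return (first == 2 && second == 0) ? 0 : -1;
}
```

On a plain socket both waits return promptly. The patch aims to keep the same behavior once sockmap has replaced the data ready callback; without the wakeup, a reader like this would be expected to time out instead.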

Comments

Eric Dumazet May 30, 2023, 6:30 a.m. UTC | #1
On Tue, May 23, 2023 at 4:56 AM John Fastabend <john.fastabend@gmail.com> wrote:
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index bcd45a99a3db..08be5f409fb8 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1199,12 +1199,21 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
>  static void sk_psock_verdict_data_ready(struct sock *sk)
>  {
>         struct socket *sock = sk->sk_socket;
> +       int copied;
>
>         trace_sk_data_ready(sk);
>
>         if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
>                 return;
> -       sock->ops->read_skb(sk, sk_psock_verdict_recv);
> +       copied = sock->ops->read_skb(sk, sk_psock_verdict_recv);
> +       if (copied >= 0) {
> +               struct sk_psock *psock;
> +
> +               rcu_read_lock();
> +               psock = sk_psock(sk);
> +               psock->saved_data_ready(sk);
> +               rcu_read_unlock();
> +       }
>  }
>
>  void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
> --
> 2.33.0
>

It seems psock could be NULL here, right?

What do you think if I submit the following fix?

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a9060e1f0e4378fa47cfd375b4729b5b0a9f54ec..a29508e1ff3568583263b9307f7b1a0e814ba76d 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1210,7 +1210,8 @@ static void sk_psock_verdict_data_ready(struct sock *sk)

                rcu_read_lock();
                psock = sk_psock(sk);
-               psock->saved_data_ready(sk);
+               if (psock)
+                       psock->saved_data_ready(sk);
                rcu_read_unlock();
        }
 }
John Fastabend May 30, 2023, 6:34 p.m. UTC | #2
Yes, please do. Presumably this is plausible if a user deletes the map entry
while data is being sent and we get a race. We don't have any tests for this
in our CI, though, because we never delete socks after adding them and
rely on the sock close. This shouldn't happen in that path because
data_ready is blocked on the SOCK_DEAD flag, iirc.

I'll think about whether we can add a stress test that does map update/delete
in a tight loop with a live socket sending/receiving traffic.

Thanks
John Fastabend May 30, 2023, 6:43 p.m. UTC | #3
I can also submit it if it's easier; just let me know.
Eric Dumazet May 30, 2023, 6:51 p.m. UTC | #4
I will. This is based on a syzbot report that I will also release.
Thanks!
Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index bcd45a99a3db..08be5f409fb8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1199,12 +1199,21 @@  static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 static void sk_psock_verdict_data_ready(struct sock *sk)
 {
 	struct socket *sock = sk->sk_socket;
+	int copied;
 
 	trace_sk_data_ready(sk);
 
 	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
 		return;
-	sock->ops->read_skb(sk, sk_psock_verdict_recv);
+	copied = sock->ops->read_skb(sk, sk_psock_verdict_recv);
+	if (copied >= 0) {
+		struct sk_psock *psock;
+
+		rcu_read_lock();
+		psock = sk_psock(sk);
+		psock->saved_data_ready(sk);
+		rcu_read_unlock();
+	}
 }
 
 void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
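
For readers who want to experiment with the control flow outside the kernel, here is a minimal user-space model of the patched handler, including the NULL check proposed in the comments. It is a sketch, not the kernel code: every model_* name is hypothetical, the structs are simplified stand-ins for struct sock and struct sk_psock, and the RCU read-side locking is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for struct sock and struct sk_psock. */
struct model_sock;

struct model_psock {
	void (*saved_data_ready)(struct model_sock *sk);
};

struct model_sock {
	struct model_psock *psock;	/* NULL once removed from the map */
	int pending;			/* return value of the modeled read_skb() */
	int wakeups;			/* times the saved callback ran */
};

/* Models the original callback (e.g. sock_def_readable) waking a waiter. */
void model_def_readable(struct model_sock *sk)
{
	sk->wakeups++;
}

/* Models sock->ops->read_skb(): bytes copied, 0 on FIN, negative on error. */
int model_read_skb(struct model_sock *sk)
{
	return sk->pending;
}

/* Models sk_psock_verdict_data_ready() with the proposed NULL check. */
void model_verdict_data_ready(struct model_sock *sk)
{
	int copied;

	assert(sk != NULL);

	copied = model_read_skb(sk);
	if (copied >= 0) {
		/* The kernel reads psock under rcu_read_lock(); omitted here. */
		struct model_psock *psock = sk->psock;

		if (psock)
			psock->saved_data_ready(sk);
	}
}

/* Runs one modeled data-ready event and returns how many wakeups it caused. */
int model_run(int pending, int attached)
{
	struct model_psock ps = { .saved_data_ready = model_def_readable };
	struct model_sock sk = {
		.psock = attached ? &ps : NULL,
		.pending = pending,
		.wakeups = 0,
	};

	model_verdict_data_ready(&sk);
	return sk.wakeups;
}
```

In this model, model_run(0, 1) stands for a FIN on an attached socket and produces one wakeup, a negative pending value produces none, and model_run(5, 0) stands for the delete race from the discussion: the handler returns without dereferencing a NULL psock.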