Message ID | 20220602012105.58853-2-xiyou.wangcong@gmail.com
---|---
State | Superseded
Delegated to: | BPF
Series | sockmap: some performance optimizations
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> This patch introduces tcp_read_skb() based on tcp_read_sock(),
> a preparation for the next patch which actually introduces
> a new sock ops.
>
> TCP is special here, because it has tcp_read_sock() which is
> mainly used by splice(). tcp_read_sock() supports partial read
> and arbitrary offset, neither of them is needed for sockmap.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/net/tcp.h |  2 ++
>  net/ipv4/tcp.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 1e99f5c61f84..878544d0f8f9 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
>  /* Read 'sendfile()'-style from a TCP socket */
>  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		  sk_read_actor_t recv_actor);
> +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> +		 sk_read_actor_t recv_actor);
>
>  void tcp_initialize_rcv_mss(struct sock *sk);
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9984d23a7f3e..a18e9ababf54 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  }
>  EXPORT_SYMBOL(tcp_read_sock);
>
> +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> +		 sk_read_actor_t recv_actor)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 seq = tp->copied_seq;
> +	struct sk_buff *skb;
> +	int copied = 0;
> +	u32 offset;
> +
> +	if (sk->sk_state == TCP_LISTEN)
> +		return -ENOTCONN;
> +
> +	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> +		int used;
> +
> +		__skb_unlink(skb, &sk->sk_receive_queue);
> +		used = recv_actor(desc, skb, 0, skb->len);
> +		if (used <= 0) {
> +			if (!copied)
> +				copied = used;
> +			break;
> +		}
> +		seq += used;
> +		copied += used;
> +
> +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> +			kfree_skb(skb);

Hi Cong, can you elaborate here from v2 comment.

"Hm, it is tricky here, we use the skb refcount after this patchset, so
it could be a real drop from another kfree_skb() in net/core/skmsg.c
which initiates the drop."

The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
to kick tracing infra even on good cases with kfree_skb()? In
sk_psock_verdict_recv() we do an skb_clone() there.

.John
On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
[...]
> > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > +			kfree_skb(skb);
>
> Hi Cong, can you elaborate here from v2 comment.
>
> "Hm, it is tricky here, we use the skb refcount after this patchset, so
> it could be a real drop from another kfree_skb() in net/core/skmsg.c
> which initiates the drop."

Sure.

This is the source code of consume_skb():

 911 void consume_skb(struct sk_buff *skb)
 912 {
 913         if (!skb_unref(skb))
 914                 return;
 915
 916         trace_consume_skb(skb);
 917         __kfree_skb(skb);
 918 }

and this is kfree_skb (or kfree_skb_reason()):

 770 void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 771 {
 772         if (!skb_unref(skb))
 773                 return;
 774
 775         DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);
 776
 777         trace_kfree_skb(skb, __builtin_return_address(0), reason);
 778         __kfree_skb(skb);
 779 }

So, both do refcnt before tracing, very clearly.

Now, let's do a simple case:

tcp_read_skb():
 -> tcp_recv_skb()   // Let's assume skb refcnt == 1 here
 -> recv_actor()
    -> skb_get()     // refcnt == 2
    -> kfree_skb()   // Let's assume users drop it intentionally
 -> kfree_skb()      // refcnt == 0 here, if we had consume_skb() it would
                     // not be counted as a drop

Of course you can give another example where consume_skb() is
correct, but the point here is that it is very tricky when refcounting is
involved. I even doubt we can do anything here, short of maybe moving
the trace before the refcount check.

> The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> to kick tracing infra even on good cases with kfree_skb()? In
> sk_psock_verdict_recv() we do an skb_clone() there.

I don't get your point here. Are you suggesting we should sacrifice
performance just to make the drop tracing more accurate?

Thanks.
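(For reference, the refcount ordering Cong describes can be sketched as the minimal example below; the helper name refcnt_trace_example is made up for illustration and appears nowhere in the patch. With two references held on an skb, whichever of kfree_skb() or consume_skb() drops the last one is the call whose tracepoint fires, so the drop/consume classification follows call ordering rather than intent.)

```c
#include <linux/skbuff.h>

/* Illustration only: assumes the skb enters with refcount == 1,
 * as in the tcp_read_skb() walkthrough above.
 */
static void refcnt_trace_example(struct sk_buff *skb)
{
	skb_get(skb);     /* refcount 1 -> 2, e.g. taken inside recv_actor() */
	kfree_skb(skb);   /* 2 -> 1: skb_unref() is false, no trace, no free */
	consume_skb(skb); /* 1 -> 0: trace_consume_skb() fires and the skb is
			   * freed, even if the earlier kfree_skb() was the
			   * "real" drop */
}
```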
Cong Wang wrote:
> On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
[...]
> > > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > > +			kfree_skb(skb);
> >
> > Hi Cong, can you elaborate here from v2 comment.
> >
> > "Hm, it is tricky here, we use the skb refcount after this patchset, so
> > it could be a real drop from another kfree_skb() in net/core/skmsg.c
> > which initiates the drop."
>
> Sure.
[...]
> So, both do refcnt before tracing, very clearly.
>
> Now, let's do a simple case:
>
> tcp_read_skb():
>  -> tcp_recv_skb()   // Let's assume skb refcnt == 1 here
>  -> recv_actor()
>     -> skb_get()     // refcnt == 2
>     -> kfree_skb()   // Let's assume users drop it intentionally
>  -> kfree_skb()      // refcnt == 0 here, if we had consume_skb() it would
>                      // not be counted as a drop

OK, great, thanks for that; it matches what I was thinking as well.

> Of course you can give another example where consume_skb() is
> correct, but the point here is that it is very tricky when refcounting is
> involved. I even doubt we can do anything here, short of maybe moving
> the trace before the refcount check.

Consider the other case, where we do kfree_skb() when consume_skb()
is correct. We have logic in the Cilium tracing tools (tetragon) to
trace kfree_skb()s and count them. So in the good case here
we end up tripping that logic even though it's expected.

The question is which is better: a noisy kfree_skb() even when
expected, or a missing kfree_skb() on the drops. I'm leaning towards
consume_skb() being safer instead of a noisy kfree_skb().

> > The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> > to kick tracing infra even on good cases with kfree_skb()? In
> > sk_psock_verdict_recv() we do an skb_clone() there.
>
> I don't get your point here. Are you suggesting we should sacrifice
> performance just to make the drop tracing more accurate?

No, let's not sacrifice the performance. I'm suggesting I would rather
go with consume_skb() and miss some kfree_skb()s than the other way
around and have extra kfree_skb()s that will trip monitoring.
Does the question make sense? I guess we have to pick one.

> Thanks.
On Thu, Jun 9, 2022 at 12:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
> Consider the other case, where we do kfree_skb() when consume_skb()
> is correct. We have logic in the Cilium tracing tools (tetragon) to
> trace kfree_skb()s and count them. So in the good case here
> we end up tripping that logic even though it's expected.
>
> The question is which is better: a noisy kfree_skb() even when
> expected, or a missing kfree_skb() on the drops. I'm leaning towards
> consume_skb() being safer instead of a noisy kfree_skb().

Oh, sure. As long as we all know neither of them is accurate,
I am 100% fine with changing it to consume_skb() to reduce the noise
for you.

Meanwhile, let me think about how to make it accurate, if possible at
all. But clearly this deserves a separate patch.

Thanks.
Cong Wang wrote:
> On Thu, Jun 9, 2022 at 12:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > Consider the other case, where we do kfree_skb() when consume_skb()
> > is correct. We have logic in the Cilium tracing tools (tetragon) to
> > trace kfree_skb()s and count them. So in the good case here
> > we end up tripping that logic even though it's expected.
> >
> > The question is which is better: a noisy kfree_skb() even when
> > expected, or a missing kfree_skb() on the drops. I'm leaning towards
> > consume_skb() being safer instead of a noisy kfree_skb().
>
> Oh, sure. As long as we all know neither of them is accurate,
> I am 100% fine with changing it to consume_skb() to reduce the noise
> for you.

Thanks, that would be great.

> Meanwhile, let me think about how to make it accurate, if possible at
> all. But clearly this deserves a separate patch.

Yep, should be OK. We set the error code in desc->error in the verdict
recv handler, so maybe we can track it through that.

> Thanks.
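(To make the calling contract of tcp_read_skb() concrete, here is a hypothetical sk_read_actor_t; the name example_read_skb_actor and its body are illustrative only and are not part of the patch or of the sockmap verdict handler. The skb arrives already unlinked from sk_receive_queue, offset is always 0, len is skb->len, the return value is the number of bytes consumed, and an error can be reported through desc->error as mentioned above.)

```c
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/tcp.h>

/* Hypothetical actor, for illustration only. tcp_read_skb() frees the skb
 * itself after this callback returns, so an actor that keeps the skb (as the
 * sockmap verdict path does) must take its own reference first.
 */
static int example_read_skb_actor(read_descriptor_t *desc, struct sk_buff *skb,
				  unsigned int offset, size_t len)
{
	if (!pskb_may_pull(skb, 1)) {	/* nothing usable in this skb */
		desc->error = -EINVAL;	/* report the failure to the caller */
		return 0;		/* consuming 0 bytes stops the loop */
	}

	skb_get(skb);			/* hold the skb beyond this callback */
	/* ... queue the skb for further processing here ... */
	consume_skb(skb);		/* placeholder: drop our reference again */

	return len;			/* report the whole skb as consumed */
}
```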
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1e99f5c61f84..878544d0f8f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
 /* Read 'sendfile()'-style from a TCP socket */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor);
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor);
 
 void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9984d23a7f3e..a18e9ababf54 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 seq = tp->copied_seq;
+	struct sk_buff *skb;
+	int copied = 0;
+	u32 offset;
+
+	if (sk->sk_state == TCP_LISTEN)
+		return -ENOTCONN;
+
+	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+		int used;
+
+		__skb_unlink(skb, &sk->sk_receive_queue);
+		used = recv_actor(desc, skb, 0, skb->len);
+		if (used <= 0) {
+			if (!copied)
+				copied = used;
+			break;
+		}
+		seq += used;
+		copied += used;
+
+		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
+			kfree_skb(skb);
+			++seq;
+			break;
+		}
+		kfree_skb(skb);
+		if (!desc->count)
+			break;
+		WRITE_ONCE(tp->copied_seq, seq);
+	}
+	WRITE_ONCE(tp->copied_seq, seq);
+
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	if (copied > 0)
+		tcp_cleanup_rbuf(sk, copied);
+
+	return copied;
+}
+EXPORT_SYMBOL(tcp_read_skb);
+
 int tcp_peek_len(struct socket *sock)
 {
 	return tcp_inq(sock->sk);
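(Based on the discussion above, the follow-up John and Cong agree on would amount to something like the change below on top of this patch. This is a sketch only, not the actual follow-up posting; it replaces the two kfree_skb() calls with consume_skb() so the good path does not show up as a drop in kfree_skb()-based tracing.)

```diff
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ in tcp_read_skb() @@
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
-			kfree_skb(skb);
+			consume_skb(skb);
 			++seq;
 			break;
 		}
-		kfree_skb(skb);
+		consume_skb(skb);
 		if (!desc->count)
 			break;
```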