Message ID | 20220602012105.58853-2-xiyou.wangcong@gmail.com
---|---
State | Superseded
Delegated to: | BPF
Series | sockmap: some performance optimizations
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> This patch introduces tcp_read_skb() based on tcp_read_sock(),
> a preparation for the next patch which actually introduces
> a new sock ops.
>
> TCP is special here, because it has tcp_read_sock() which is
> mainly used by splice(). tcp_read_sock() supports partial read
> and arbitrary offset, neither of them is needed for sockmap.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/net/tcp.h |  2 ++
>  net/ipv4/tcp.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 1e99f5c61f84..878544d0f8f9 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
>  /* Read 'sendfile()'-style from a TCP socket */
>  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		  sk_read_actor_t recv_actor);
> +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> +		 sk_read_actor_t recv_actor);
>
>  void tcp_initialize_rcv_mss(struct sock *sk);
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9984d23a7f3e..a18e9ababf54 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  }
>  EXPORT_SYMBOL(tcp_read_sock);
>
> +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> +		 sk_read_actor_t recv_actor)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 seq = tp->copied_seq;
> +	struct sk_buff *skb;
> +	int copied = 0;
> +	u32 offset;
> +
> +	if (sk->sk_state == TCP_LISTEN)
> +		return -ENOTCONN;
> +
> +	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> +		int used;
> +
> +		__skb_unlink(skb, &sk->sk_receive_queue);
> +		used = recv_actor(desc, skb, 0, skb->len);
> +		if (used <= 0) {
> +			if (!copied)
> +				copied = used;
> +			break;
> +		}
> +		seq += used;
> +		copied += used;
> +
> +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> +			kfree_skb(skb);

Hi Cong, can you elaborate here from v2 comment.

"Hm, it is tricky here, we use the skb refcount after this patchset, so
it could be a real drop from another kfree_skb() in net/core/skmsg.c
which initiates the drop."

The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
to kick tracing infra even on good cases with kfree_skb()? In
sk_psock_verdict_recv() we do an skb_clone() there.

.John
On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
[...]
> > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > +			kfree_skb(skb);
>
> Hi Cong, can you elaborate here from v2 comment.
>
> "Hm, it is tricky here, we use the skb refcount after this patchset, so
> it could be a real drop from another kfree_skb() in net/core/skmsg.c
> which initiates the drop."

Sure.

This is the source code of consume_skb():

 911 void consume_skb(struct sk_buff *skb)
 912 {
 913         if (!skb_unref(skb))
 914                 return;
 915
 916         trace_consume_skb(skb);
 917         __kfree_skb(skb);
 918 }

and this is kfree_skb (or kfree_skb_reason()):

 770 void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 771 {
 772         if (!skb_unref(skb))
 773                 return;
 774
 775         DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);
 776
 777         trace_kfree_skb(skb, __builtin_return_address(0), reason);
 778         __kfree_skb(skb);
 779 }

So, both do refcnt before tracing, very clearly.

Now, let's do a simple case:

tcp_read_skb():
 -> tcp_recv_skb()   // Let's assume skb refcnt == 1 here
 -> recv_actor()
    -> skb_get()     // refcnt == 2
    -> kfree_skb()   // Let's assume users drop it intentionally
 -> kfree_skb()      // refcnt == 0 here, if we had consume_skb() it would
                     // not be counted as a drop

Of course you can give another example where consume_skb() is
correct, but the point here is that it is very tricky when refcounting is
involved. I even doubt we can do anything here, short of maybe moving
the trace before the refcount check.

> The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> to kick tracing infra even on good cases with kfree_skb()? In
> sk_psock_verdict_recv() we do an skb_clone() there.

I don't get your point here. Are you suggesting we should sacrifice
performance just to make the drop tracing more accurate?

Thanks.
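(For reference, the refcount ordering Cong describes can be sketched as the minimal example below; the helper name refcnt_trace_example is made up for illustration and appears nowhere in the patch. With two references held on an skb, whichever of kfree_skb() or consume_skb() drops the last one is the call whose tracepoint fires, so the drop/consume classification follows call ordering rather than intent.)

```c
#include <linux/skbuff.h>

/* Illustration only: assumes the skb enters with refcount == 1,
 * as in the tcp_read_skb() walkthrough above.
 */
static void refcnt_trace_example(struct sk_buff *skb)
{
	skb_get(skb);     /* refcount 1 -> 2, e.g. taken inside recv_actor() */
	kfree_skb(skb);   /* 2 -> 1: skb_unref() is false, no trace, no free */
	consume_skb(skb); /* 1 -> 0: trace_consume_skb() fires and the skb is
			   * freed, even if the earlier kfree_skb() was the
			   * "real" drop */
}
```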
Cong Wang wrote:
> On Thu, Jun 9, 2022 at 8:08 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
[...]
> > > +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > > +			kfree_skb(skb);
> >
> > Hi Cong, can you elaborate here from v2 comment.
> >
> > "Hm, it is tricky here, we use the skb refcount after this patchset, so
> > it could be a real drop from another kfree_skb() in net/core/skmsg.c
> > which initiates the drop."
>
> Sure.
[...]
> So, both do refcnt before tracing, very clearly.
>
> Now, let's do a simple case:
>
> tcp_read_skb():
>  -> tcp_recv_skb()   // Let's assume skb refcnt == 1 here
>  -> recv_actor()
>     -> skb_get()     // refcnt == 2
>     -> kfree_skb()   // Let's assume users drop it intentionally
>  -> kfree_skb()      // refcnt == 0 here, if we had consume_skb() it would
>                      // not be counted as a drop

OK, great, thanks for that; it matches what I was thinking as well.

> Of course you can give another example where consume_skb() is
> correct, but the point here is that it is very tricky when refcounting is
> involved. I even doubt we can do anything here, short of maybe moving
> the trace before the refcount check.

Consider the other case, where we do kfree_skb() when consume_skb()
is correct. We have logic in the Cilium tracing tools (tetragon) to
trace kfree_skb()s and count them. So in the good case here
we end up tripping that logic even though it's expected.

The question is which is better: a noisy kfree_skb() even when
expected, or a missing kfree_skb() on the drops. I'm leaning towards
consume_skb() being safer instead of a noisy kfree_skb().

> > The tcp_read_sock() hook is using tcp_eat_recv_skb(). Are we going
> > to kick tracing infra even on good cases with kfree_skb()? In
> > sk_psock_verdict_recv() we do an skb_clone() there.
>
> I don't get your point here. Are you suggesting we should sacrifice
> performance just to make the drop tracing more accurate?

No, let's not sacrifice the performance. I'm suggesting I would rather
go with consume_skb() and miss some kfree_skb()s than the other way
around and have extra kfree_skb()s that will trip monitoring.
Does the question make sense? I guess we have to pick one.

> Thanks.
On Thu, Jun 9, 2022 at 12:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
> Consider the other case, where we do kfree_skb() when consume_skb()
> is correct. We have logic in the Cilium tracing tools (tetragon) to
> trace kfree_skb()s and count them. So in the good case here
> we end up tripping that logic even though it's expected.
>
> The question is which is better: a noisy kfree_skb() even when
> expected, or a missing kfree_skb() on the drops. I'm leaning towards
> consume_skb() being safer instead of a noisy kfree_skb().

Oh, sure. As long as we all know neither of them is accurate,
I am 100% fine with changing it to consume_skb() to reduce the noise
for you.

Meanwhile, let me think about how to make it accurate, if possible at
all. But clearly this deserves a separate patch.

Thanks.
Cong Wang wrote:
> On Thu, Jun 9, 2022 at 12:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > Consider the other case, where we do kfree_skb() when consume_skb()
> > is correct. We have logic in the Cilium tracing tools (tetragon) to
> > trace kfree_skb()s and count them. So in the good case here
> > we end up tripping that logic even though it's expected.
> >
> > The question is which is better: a noisy kfree_skb() even when
> > expected, or a missing kfree_skb() on the drops. I'm leaning towards
> > consume_skb() being safer instead of a noisy kfree_skb().
>
> Oh, sure. As long as we all know neither of them is accurate,
> I am 100% fine with changing it to consume_skb() to reduce the noise
> for you.

Thanks, that would be great.

> Meanwhile, let me think about how to make it accurate, if possible at
> all. But clearly this deserves a separate patch.

Yep, should be OK. We set the error code in desc->error in the verdict
recv handler, so maybe we can track it through that.

> Thanks.
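(To make the calling contract of tcp_read_skb() concrete, here is a hypothetical sk_read_actor_t; the name example_read_skb_actor and its body are illustrative only and are not part of the patch or of the sockmap verdict handler. The skb arrives already unlinked from sk_receive_queue, offset is always 0, len is skb->len, the return value is the number of bytes consumed, and an error can be reported through desc->error as mentioned above.)

```c
#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/tcp.h>

/* Hypothetical actor, for illustration only. tcp_read_skb() frees the skb
 * itself after this callback returns, so an actor that keeps the skb (as the
 * sockmap verdict path does) must take its own reference first.
 */
static int example_read_skb_actor(read_descriptor_t *desc, struct sk_buff *skb,
				  unsigned int offset, size_t len)
{
	if (!pskb_may_pull(skb, 1)) {	/* nothing usable in this skb */
		desc->error = -EINVAL;	/* report the failure to the caller */
		return 0;		/* consuming 0 bytes stops the loop */
	}

	skb_get(skb);			/* hold the skb beyond this callback */
	/* ... queue the skb for further processing here ... */
	consume_skb(skb);		/* placeholder: drop our reference again */

	return len;			/* report the whole skb as consumed */
}
```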
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1e99f5c61f84..878544d0f8f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -669,6 +669,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
 /* Read 'sendfile()'-style from a TCP socket */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor);
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor);
 
 void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9984d23a7f3e..a18e9ababf54 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1709,6 +1709,53 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 seq = tp->copied_seq;
+	struct sk_buff *skb;
+	int copied = 0;
+	u32 offset;
+
+	if (sk->sk_state == TCP_LISTEN)
+		return -ENOTCONN;
+
+	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+		int used;
+
+		__skb_unlink(skb, &sk->sk_receive_queue);
+		used = recv_actor(desc, skb, 0, skb->len);
+		if (used <= 0) {
+			if (!copied)
+				copied = used;
+			break;
+		}
+		seq += used;
+		copied += used;
+
+		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
+			kfree_skb(skb);
+			++seq;
+			break;
+		}
+		kfree_skb(skb);
+		if (!desc->count)
+			break;
+		WRITE_ONCE(tp->copied_seq, seq);
+	}
+	WRITE_ONCE(tp->copied_seq, seq);
+
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	if (copied > 0)
+		tcp_cleanup_rbuf(sk, copied);
+
+	return copied;
+}
+EXPORT_SYMBOL(tcp_read_skb);
+
 int tcp_peek_len(struct socket *sock)
 {
 	return tcp_inq(sock->sk);
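(Based on the discussion above, the follow-up John and Cong agree on would amount to something like the change below on top of this patch. This is a sketch only, not the actual follow-up posting; it replaces the two kfree_skb() calls with consume_skb() so the good path does not show up as a drop in kfree_skb()-based tracing.)

```diff
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ in tcp_read_skb() @@
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
-			kfree_skb(skb);
+			consume_skb(skb);
 			++seq;
 			break;
 		}
-		kfree_skb(skb);
+		consume_skb(skb);
 		if (!desc->count)
 			break;
```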