[v5] sock: add tracepoint for send recv length

Message ID	20230110091356.1524-1-cuiyunhui@bytedance.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-trace-kernel-owner@vger.kernel.org> From: Yunhui Cui <cuiyunhui@bytedance.com> To: rostedt@goodmis.org, mhiramat@kernel.org, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, kuniyu@amazon.com, xiyou.wangcong@gmail.com, duanxiongchun@bytedance.com, cuiyunhui@bytedance.com, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, netdev@vger.kernel.org, dust.li@linux.alibaba.com Subject: [PATCH v5] sock: add tracepoint for send recv length Date: Tue, 10 Jan 2023 17:13:56 +0800 Message-Id: <20230110091356.1524-1-cuiyunhui@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v5] sock: add tracepoint for send recv length \| expand [v5] sock: add tracepoint for send recv length

yunhui cui Jan. 10, 2023, 9:13 a.m. UTC

Add 2 tracepoints to monitor the tcp/udp traffic
of per process and per cgroup.

Regarding monitoring the tcp/udp traffic of each process, there are two
existing solutions, the first one is https://www.atoptool.nl/netatop.php.
The second is via kprobe/kretprobe.

Netatop solution is implemented by registering the hook function at the
hook point provided by the netfilter framework.

These hook functions may be in the soft interrupt context and cannot
directly obtain the pid. Some data structures are added to bind packets
and processes. For example, struct taskinfobucket, struct taskinfo ...

Every time the process sends and receives packets it needs multiple
hashmaps,resulting in low performance and it has the problem fo inaccurate
tcp/udp traffic statistics(for example: multiple threads share sockets).

We can obtain the information with kretprobe, but as we know, kprobe gets
the result by trappig in an exception, which loses performance compared
to tracepoint.

We compared the performance of tracepoints with the above two methods, and
the results are as follows:

ab -n 1000000 -c 1000 -r http://127.0.0.1/index.html
without trace:
Time per request: 39.660 [ms] (mean)
Time per request: 0.040 [ms] (mean, across all concurrent requests)

netatop:
Time per request: 50.717 [ms] (mean)
Time per request: 0.051 [ms] (mean, across all concurrent requests)

kr:
Time per request: 43.168 [ms] (mean)
Time per request: 0.043 [ms] (mean, across all concurrent requests)

tracepoint:
Time per request: 41.004 [ms] (mean)
Time per request: 0.041 [ms] (mean, across all concurrent requests

It can be seen that tracepoint has better performance.

Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Xiongchun Duan <duanxiongchun@bytedance.com>
---
 include/trace/events/sock.h | 44 +++++++++++++++++++++++++++++++++++++
 net/socket.c                | 36 ++++++++++++++++++++++++++----
 2 files changed, 76 insertions(+), 4 deletions(-)

Eric Dumazet Jan. 10, 2023, 11:49 a.m. UTC | #1

On Tue, Jan 10, 2023 at 10:15 AM Yunhui Cui <cuiyunhui@bytedance.com> wrote:
>
> Add 2 tracepoints to monitor the tcp/udp traffic
> of per process and per cgroup.
>
> Regarding monitoring the tcp/udp traffic of each process, there are two
> existing solutions, the first one is https://www.atoptool.nl/netatop.php.
> The second is via kprobe/kretprobe.
>
> Netatop solution is implemented by registering the hook function at the
> hook point provided by the netfilter framework.
>
> These hook functions may be in the soft interrupt context and cannot
> directly obtain the pid. Some data structures are added to bind packets
> and processes. For example, struct taskinfobucket, struct taskinfo ...
>
> Every time the process sends and receives packets it needs multiple
> hashmaps,resulting in low performance and it has the problem fo inaccurate
> tcp/udp traffic statistics(for example: multiple threads share sockets).
>
> We can obtain the information with kretprobe, but as we know, kprobe gets
> the result by trappig in an exception, which loses performance compared
> to tracepoint.
>
> We compared the performance of tracepoints with the above two methods, and
> the results are as follows:
>
> ab -n 1000000 -c 1000 -r http://127.0.0.1/index.html
> without trace:
> Time per request: 39.660 [ms] (mean)
> Time per request: 0.040 [ms] (mean, across all concurrent requests)
>
> netatop:
> Time per request: 50.717 [ms] (mean)
> Time per request: 0.051 [ms] (mean, across all concurrent requests)
>
> kr:
> Time per request: 43.168 [ms] (mean)
> Time per request: 0.043 [ms] (mean, across all concurrent requests)
>
> tracepoint:
> Time per request: 41.004 [ms] (mean)
> Time per request: 0.041 [ms] (mean, across all concurrent requests
>
> It can be seen that tracepoint has better performance.
>
> Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
> Signed-off-by: Xiongchun Duan <duanxiongchun@bytedance.com>
> ---
>  include/trace/events/sock.h | 44 +++++++++++++++++++++++++++++++++++++
>  net/socket.c                | 36 ++++++++++++++++++++++++++----
>  2 files changed, 76 insertions(+), 4 deletions(-)
>

...

> +static noinline void call_trace_sock_recv_length(struct sock *sk, int ret, int flags)
> +{
> +       trace_sock_recv_length(sk, !(flags & MSG_PEEK) ? ret :
> +                              (ret < 0 ? ret : 0), flags);

Maybe we should only 'fast assign' the two fields (ret and flags),
and let this logic happen later at 'print' time ?

This would reduce storage by one integer, and make fast path really fast.

This also could potentially remove the need for the peculiar construct with
these noinline helpers.

> +}
> +
>  static inline int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
>                                      int flags)


>  {
> -       return INDIRECT_CALL_INET(sock->ops->recvmsg, inet6_recvmsg,
> -                                 inet_recvmsg, sock, msg, msg_data_left(msg),
> -                                 flags);
> +       int ret = INDIRECT_CALL_INET(sock->ops->recvmsg, inet6_recvmsg,
> +                                    inet_recvmsg, sock, msg,
> +                                    msg_data_left(msg), flags);
> +
> +       if (trace_sock_recv_length_enabled())
> +               call_trace_sock_recv_length(sock->sk, !(flags & MSG_PEEK) ?
> +                                           ret : (ret < 0 ? ret : 0), flags);
> +       return ret;
>  }

Maybe you meant :

  if (trace_sock_recv_length_enabled())
      call_trace_sock_recv_length(sock->sk, ret, flags);

?

Please make sure to test your patches.

Steven Rostedt Jan. 10, 2023, 3:44 p.m. UTC | #2

On Tue, 10 Jan 2023 12:49:30 +0100
Eric Dumazet <edumazet@google.com> wrote:

> > +static noinline void call_trace_sock_recv_length(struct sock *sk, int ret, int flags)
> > +{
> > +       trace_sock_recv_length(sk, !(flags & MSG_PEEK) ? ret :
> > +                              (ret < 0 ? ret : 0), flags);  
> 
> Maybe we should only 'fast assign' the two fields (ret and flags),
> and let this logic happen later at 'print' time ?
> 
> This would reduce storage by one integer, and make fast path really fast.
> 
> This also could potentially remove the need for the peculiar construct with
> these noinline helpers.

I noticed that the trace_sock_send_length() doesn't have this logic, and
they are both DEFINE_EVENT() of the same DECLARE_EVENT_CLASS(). But we
could change this too, by the following:

/*
 * sock send/recv msg length
 */
DECLARE_EVENT_CLASS(sock_msg_length,

	TP_PROTO(struct sock *sk, int ret, int flags),

	TP_ARGS(sk, ret, flags),

	TP_STRUCT__entry(
		__field(void *, sk)
		__field(__u16, family)
		__field(__u16, protocol)
		__field(int, ret)
		__field(int, flags)
	),

	TP_fast_assign(
		__entry->sk = sk;
		__entry->family = sk->sk_family;
		__entry->protocol = sk->sk_protocol;
		__entry->length = ret > 0 ? ret : 0;
		__entry->error = ret < 0 ? ret : 0;
		__entry->flags = flags;
	),

	TP_printk("sk address = %p, family = %s protocol = %s, length = %d, error = %d, flags = 0x%x",
		  __entry->sk, show_family_name(__entry->family),
		  show_inet_protocol_name(__entry->protocol),
		  __entry->ret > 0 ? ret : 0, __entry->ret < 0 ? ret : 0,
		  __entry->flags)
);

DEFINE_EVENT(sock_msg_length, sock_send_length,
	TP_PROTO(struct sock *sk, int ret, int flags),

	TP_ARGS(sk, ret, flags)
);

DEFINE_EVENT_PRINT(sock_msg_length, sock_recv_length,
	TP_PROTO(struct sock *sk, int ret, int flags),

	TP_ARGS(sk, ret, flags)

	TP_printk("sk address = %p, family = %s protocol = %s, length = %d, error = %d, flags = 0x%x",
		  __entry->sk, show_family_name(__entry->family),
		  show_inet_protocol_name(__entry->protocol),
		  !(__entry->flags & MSG_PEEK) ? __entry->ret : __entry->ret > 0 ? ret : 0,
		  __entry->ret < 0 ? ret : 0,
		  __entry->flags)
);
#endif /* _TRACE_SOCK_H */

As DEFINE_EVENT_PRINT() uses the class template, but overrides the
TP_printk() portion (still saving memory).

And then both calls can just do:

	trace_sock_send_length(sk, ret, 0);

	trace_sock_recv_length(sock->sk, ret, flags);

And I bet that will also solve all the gcc being smart waste.

-- Steve

yunhui cui Jan. 11, 2023, 3:53 a.m. UTC | #3

On Tue, Jan 10, 2023 at 11:44 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
>
>         TP_printk("sk address = %p, family = %s protocol = %s, length = %d, error = %d, flags = 0x%x",
>                   __entry->sk, show_family_name(__entry->family),
>                   show_inet_protocol_name(__entry->protocol),
>                   __entry->ret > 0 ? ret : 0, __entry->ret < 0 ? ret : 0,
>                   __entry->flags)
> );
>
> DEFINE_EVENT(sock_msg_length, sock_send_length,
>         TP_PROTO(struct sock *sk, int ret, int flags),
>
>         TP_ARGS(sk, ret, flags)
> );
>
> DEFINE_EVENT_PRINT(sock_msg_length, sock_recv_length,
>         TP_PROTO(struct sock *sk, int ret, int flags),
>
>         TP_ARGS(sk, ret, flags)
>
>         TP_printk("sk address = %p, family = %s protocol = %s, length = %d, error = %d, flags = 0x%x",
>                   __entry->sk, show_family_name(__entry->family),
>                   show_inet_protocol_name(__entry->protocol),
>                   !(__entry->flags & MSG_PEEK) ? __entry->ret : __entry->ret > 0 ? ret : 0,
>                   __entry->ret < 0 ? ret : 0,
>                   __entry->flags)
> );
> #endif /* _TRACE_SOCK_H */
>
> As DEFINE_EVENT_PRINT() uses the class template, but overrides the
> TP_printk() portion (still saving memory).
>

Hi Steve, Based on your suggestion, can we use the following code
instead of using DEFINE_EVENT_PRINT ？

DECLARE_EVENT_CLASS(sock_msg_length,

        TP_PROTO(struct sock *sk, int ret, int flags),

        TP_ARGS(sk, ret, flags),

        TP_STRUCT__entry(
                __field(void *, sk)
                __field(__u16, family)
                __field(__u16, protocol)
                __field(int, ret)
                __field(int, flags)
        ),

        TP_fast_assign(
                __entry->sk = sk;
                __entry->family = sk->sk_family;
                __entry->protocol = sk->sk_protocol;
                __entry->ret = ret;
                __entry->flags = flags;
        ),

        TP_printk("sk address = %p, family = %s protocol = %s, length
= %d, error = %d, flags = 0x%x",
                  __entry->sk, show_family_name(__entry->family),
                  show_inet_protocol_name(__entry->protocol),
                  !(__entry->flags & MSG_PEEK) ?
                  (__entry->ret > 0 ? __entry->ret : 0) : 0,
                  __entry->ret < 0 ? __entry->ret : 0,
                  __entry->flags)
);

DEFINE_EVENT(sock_msg_length, sock_send_length,
        TP_PROTO(struct sock *sk, int ret, int flags),

        TP_ARGS(sk, ret, flags)
);

DEFINE_EVENT(sock_msg_length, sock_recv_length,
        TP_PROTO(struct sock *sk, int ret, int flags),

        TP_ARGS(sk, ret, flags)
);


> And then both calls can just do:
>
>         trace_sock_send_length(sk, ret, 0);
>
>         trace_sock_recv_length(sock->sk, ret, flags);
>
> And I bet that will also solve all the gcc being smart waste.
>
> -- Steve
>

Btw,  we still need noinline helpers, right?
Otherwise the following code would be inlined into sock_recvmsg:
mov    0x18(%r13),%rsi
mov    %gs:0x7e832d87(%rip),%eax        # 0x2e68c <pcpu_hot+12>
mov    %eax,%eax
bt     %rax,0xdca591(%rip)        # 0xffffffff825c5ea0 <__cpu_online_mask>


Thanks,
Yunhui

Steven Rostedt Jan. 11, 2023, 4:12 a.m. UTC | #4

On Wed, 11 Jan 2023 11:53:24 +0800
运辉崔 <cuiyunhui@bytedance.com> wrote:

> Hi Steve, Based on your suggestion, can we use the following code
> instead of using DEFINE_EVENT_PRINT ？

I only suggested it because you didn't have that logic for the
sock_send_length trace event. But if you don't care about that one, then
sure, use it for both.

-- Steve


> 
> DECLARE_EVENT_CLASS(sock_msg_length,
> 
>         TP_PROTO(struct sock *sk, int ret, int flags),
> 
>         TP_ARGS(sk, ret, flags),
> 
>         TP_STRUCT__entry(
>                 __field(void *, sk)
>                 __field(__u16, family)
>                 __field(__u16, protocol)
>                 __field(int, ret)
>                 __field(int, flags)
>         ),
> 
>         TP_fast_assign(
>                 __entry->sk = sk;
>                 __entry->family = sk->sk_family;
>                 __entry->protocol = sk->sk_protocol;
>                 __entry->ret = ret;
>                 __entry->flags = flags;
>         ),
> 
>         TP_printk("sk address = %p, family = %s protocol = %s, length
> = %d, error = %d, flags = 0x%x",
>                   __entry->sk, show_family_name(__entry->family),
>                   show_inet_protocol_name(__entry->protocol),
>                   !(__entry->flags & MSG_PEEK) ?
>                   (__entry->ret > 0 ? __entry->ret : 0) : 0,
>                   __entry->ret < 0 ? __entry->ret : 0,
>                   __entry->flags)
> );

yunhui cui Jan. 11, 2023, 6:17 a.m. UTC | #5

On Wed, Jan 11, 2023 at 12:12 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 11 Jan 2023 11:53:24 +0800
> 运辉崔 <cuiyunhui@bytedance.com> wrote:
>
> > Hi Steve, Based on your suggestion, can we use the following code
> > instead of using DEFINE_EVENT_PRINT ？
>
> I only suggested it because you didn't have that logic for the
> sock_send_length trace event. But if you don't care about that one, then
> sure, use it for both.
>
> -- Steve
>

Ok, Thank you, I'll post v6 later.

[v5] sock: add tracepoint for send recv length

Commit Message

Comments

Patch