
[net-next] net: tcp: Define tcp_listendrop() as noinline

Message ID 20241231055207.9208-1-laoar.shao@gmail.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series [net-next] net: tcp: Define tcp_listendrop() as noinline

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/build_tools success Errors and warnings before: 0 (+23) this patch: 0 (+23)
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 541 this patch: 541
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 909 this patch: 909
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 39 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-12-31--15-00 (tests: 881)

Commit Message

Yafang Shao Dec. 31, 2024, 5:52 a.m. UTC
The LINUX_MIB_LISTENDROPS counter can be incremented for several reasons,
such as:
- A full SYN backlog queue
- SYN flood attacks
- Memory exhaustion
- Other resource constraints

Recently, one of our services experienced frequent ListenDrops. While
attempting to trace the root cause, we discovered that tracing the function
tcp_listendrop() was not straightforward because it is currently inlined.
To overcome this, we had to create a livepatch that defined a non-inlined
version of the function, which we then traced using BPF programs.

  $ grep tcp_listendrop /proc/kallsyms
  ffffffffc093fba0 t tcp_listendrop_x     [livepatch_tmp]

Through this process, we eventually determined that the ListenDrops were
caused by SYN attacks.
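
For illustration, a rough bpftrace one-liner along the following lines can be
attached to that livepatched symbol (untested sketch; it assumes kernel BTF is
available so bpftrace can resolve struct sock):

  # Count drops per listener address:port and per kernel call path
  $ bpftrace -e 'kprobe:tcp_listendrop_x {
        $sk = (struct sock *)arg0;
        @listeners[ntop($sk->__sk_common.skc_rcv_saddr),
                   $sk->__sk_common.skc_num] = count();
        @stacks[kstack] = count();
    }'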

Since tcp_listendrop() is not part of the critical path, defining it as
noinline would make it significantly easier to trace and diagnose issues
without requiring workarounds such as livepatching. This function can be
used by kernel modules like smc, so export it with EXPORT_SYMBOL_GPL().

After that change, the result is as follows,

  $ grep tcp_listendrop /proc/kallsyms
  ffffffffb718eaa0 T __pfx_tcp_listendrop
  ffffffffb718eab0 T tcp_listendrop
  ffffffffb7e636b8 r __ksymtab_tcp_listendrop

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/net/tcp.h    | 13 +------------
 net/ipv4/tcp_input.c | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 12 deletions(-)

Comments

Eric Dumazet Dec. 31, 2024, 9:26 a.m. UTC | #1
On Tue, Dec 31, 2024 at 6:52 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> The LINUX_MIB_LISTENDROPS counter can be incremented for several reasons,
> such as:
> - A full SYN backlog queue
> - SYN flood attacks
> - Memory exhaustion
> - Other resource constraints
>
> Recently, one of our services experienced frequent ListenDrops. While
> attempting to trace the root cause, we discovered that tracing the function
> tcp_listendrop() was not straightforward because it is currently inlined.
> To overcome this, we had to create a livepatch that defined a non-inlined
> version of the function, which we then traced using BPF programs.
>
>   $ grep tcp_listendrop /proc/kallsyms
>   ffffffffc093fba0 t tcp_listendrop_x     [livepatch_tmp]
>
> Through this process, we eventually determined that the ListenDrops were
> caused by SYN attacks.
>
> Since tcp_listendrop() is not part of the critical path, defining it as
> noinline would make it significantly easier to trace and diagnose issues
> without requiring workarounds such as livepatching. This function can be
> used by kernel modules like smc, so export it with EXPORT_SYMBOL_GPL().
>
> After that change, the result is as follows,
>
>   $ grep tcp_listendrop /proc/kallsyms
>   ffffffffb718eaa0 T __pfx_tcp_listendrop
>   ffffffffb718eab0 T tcp_listendrop
>   ffffffffb7e636b8 r __ksymtab_tcp_listendrop
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> --

Are we going to accept one patch at a time to un-inline all TCP
related functions in the kernel ?

My understanding is that current tools work fine. You may need to upgrade yours.

# perf probe -a tcp_listendrop
Added new events:
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)
  probe:tcp_listendrop (on tcp_listendrop)

You can now use it in all perf tools, such as:

perf record -e probe:tcp_listendrop -aR sleep 1
Yafang Shao Dec. 31, 2024, 11:49 a.m. UTC | #2
On Tue, Dec 31, 2024 at 5:26 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Dec 31, 2024 at 6:52 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The LINUX_MIB_LISTENDROPS counter can be incremented for several reasons,
> > such as:
> > - A full SYN backlog queue
> > - SYN flood attacks
> > - Memory exhaustion
> > - Other resource constraints
> >
> > Recently, one of our services experienced frequent ListenDrops. While
> > attempting to trace the root cause, we discovered that tracing the function
> > tcp_listendrop() was not straightforward because it is currently inlined.
> > To overcome this, we had to create a livepatch that defined a non-inlined
> > version of the function, which we then traced using BPF programs.
> >
> >   $ grep tcp_listendrop /proc/kallsyms
> >   ffffffffc093fba0 t tcp_listendrop_x     [livepatch_tmp]
> >
> > Through this process, we eventually determined that the ListenDrops were
> > caused by SYN attacks.
> >
> > Since tcp_listendrop() is not part of the critical path, defining it as
> > noinline would make it significantly easier to trace and diagnose issues
> > without requiring workarounds such as livepatching. This function can be
> > used by kernel modules like smc, so export it with EXPORT_SYMBOL_GPL().
> >
> > After that change, the result is as follows,
> >
> >   $ grep tcp_listendrop /proc/kallsyms
> >   ffffffffb718eaa0 T __pfx_tcp_listendrop
> >   ffffffffb718eab0 T tcp_listendrop
> >   ffffffffb7e636b8 r __ksymtab_tcp_listendrop
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > --
>
> Are we going to accept one patch at a time to un-inline all TCP
> related functions in the kernel ?

I don't have an in-depth understanding of the TCP/IP stack, so I can't
consolidate all related TCP functions in the error paths into a single
patch. I would greatly appreciate it if you could help ensure these
functions are marked as non-inlined based on your expertise. If you
don't have time to do it directly, could you provide some guidance?

The inlining of TCP functions in error paths often complicates tracing
efforts. For instance, we recently encountered issues with the inlined
tcp_write_err(), although we were fortunate to have an alternative
tracepoint available: inet_sk_error_report.
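
For example, something along these lines can hang off that tracepoint (a
sketch; the field name is assumed from recent kernels' trace/events/sock.h,
which also exposes the ports and addresses):

  $ bpftrace -e 'tracepoint:sock:inet_sk_error_report
    { @errors[args->error] = count(); }'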

Unfortunately, no such alternative exists for tcp_listendrop(), which
is why I submitted this patch.

>
> My understanding is that current tools work fine. You may need to upgrade yours.
>
> # perf probe -a tcp_listendrop
> Added new events:
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>   probe:tcp_listendrop (on tcp_listendrop)
>
> You can now use it in all perf tools, such as:
>
> perf record -e probe:tcp_listendrop -aR sleep 1

The issue is that we can't extract the `struct sock *sk` argument from
tcp_listendrop() using perf. Please correct me if I’m mistaken.

In a containerized environment, it's common to have many pods running
on a single host, each assigned a unique IP. Our goal is to filter for
a specific local_ip:local_port combination while also capturing the
corresponding remote IP. This task is straightforward with bpftrace or
other BPF-based tools, but I’m unsure how to accomplish it using perf.
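
For reference, this is roughly the kind of filter we have in mind, written as
a bpftrace sketch (it assumes tcp_listendrop() is a probe-able symbol, i.e.
this patch or the livepatch mentioned above is applied, and that kernel BTF is
available; 8080 stands in for one pod's listening port). Note that the sk
passed to tcp_listendrop() is the listening socket, so the remote address of
the offending SYN would still have to be taken from the skb in the caller:

  $ bpftrace -e 'kprobe:tcp_listendrop {
        $sk = (struct sock *)arg0;
        if ($sk->__sk_common.skc_num == 8080) {
            printf("listen drop on %s:%d\n",
                   ntop($sk->__sk_common.skc_rcv_saddr),
                   $sk->__sk_common.skc_num);
        }
    }'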
Eric Dumazet Dec. 31, 2024, 12:34 p.m. UTC | #3
On Tue, Dec 31, 2024 at 12:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Dec 31, 2024 at 5:26 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Dec 31, 2024 at 6:52 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > The LINUX_MIB_LISTENDROPS counter can be incremented for several reasons,
> > > such as:
> > > - A full SYN backlog queue
> > > - SYN flood attacks
> > > - Memory exhaustion
> > > - Other resource constraints
> > >
> > > Recently, one of our services experienced frequent ListenDrops. While
> > > attempting to trace the root cause, we discovered that tracing the function
> > > tcp_listendrop() was not straightforward because it is currently inlined.
> > > To overcome this, we had to create a livepatch that defined a non-inlined
> > > version of the function, which we then traced using BPF programs.
> > >
> > >   $ grep tcp_listendrop /proc/kallsyms
> > >   ffffffffc093fba0 t tcp_listendrop_x     [livepatch_tmp]
> > >
> > > Through this process, we eventually determined that the ListenDrops were
> > > caused by SYN attacks.
> > >
> > > Since tcp_listendrop() is not part of the critical path, defining it as
> > > noinline would make it significantly easier to trace and diagnose issues
> > > without requiring workarounds such as livepatching. This function can be
> > > used by kernel modules like smc, so export it with EXPORT_SYMBOL_GPL().
> > >
> > > After that change, the result is as follows,
> > >
> > >   $ grep tcp_listendrop /proc/kallsyms
> > >   ffffffffb718eaa0 T __pfx_tcp_listendrop
> > >   ffffffffb718eab0 T tcp_listendrop
> > >   ffffffffb7e636b8 r __ksymtab_tcp_listendrop
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > --
> >
> > Are we going to accept one patch at a time to un-inline all TCP
> > related functions in the kernel ?
>
> I don't have an in-depth understanding of the TCP/IP stack, so I can't
> consolidate all related TCP functions in the error paths into a single
> patch. I would greatly appreciate it if you could help ensure these
> functions are marked as non-inlined based on your expertise. If you
> don't have time to do it directly, could you provide some guidance?
>
> The inlining of TCP functions in error paths often complicates tracing
> efforts. For instance, we recently encountered issues with the inlined
> tcp_write_err(), although we were fortunate to have an alternative
> tracepoint available: inet_sk_error_report.
>
> Unfortunately, no such alternative exists for tcp_listendrop(), which
> is why I submitted this patch.
>
> >
> > My understanding is that current tools work fine. You may need to upgrade yours.
> >
> > # perf probe -a tcp_listendrop
> > Added new events:
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >   probe:tcp_listendrop (on tcp_listendrop)
> >
> > You can now use it in all perf tools, such as:
> >
> > perf record -e probe:tcp_listendrop -aR sleep 1
>
> The issue is that we can't extract the `struct sock *sk` argument from
> tcp_listendrop() using perf. Please correct me if I’m mistaken.
>
> In a containerized environment, it's common to have many pods running
> on a single host, each assigned a unique IP. Our goal is to filter for
> a specific local_ip:local_port combination while also capturing the
> corresponding remote IP. This task is straightforward with bpftrace or
> other BPF-based tools, but I’m unsure how to accomplish it using perf.

I suggest you ask tracing experts' opinions, why is perf able to add a
probe, but not bpftrace ?

Again, accepting thousands of patches just because of tracing tools
seems not good to me.
Yafang Shao Dec. 31, 2024, 1:05 p.m. UTC | #4
On Tue, Dec 31, 2024 at 8:34 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Dec 31, 2024 at 12:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Dec 31, 2024 at 5:26 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Dec 31, 2024 at 6:52 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The LINUX_MIB_LISTENDROPS counter can be incremented for several reasons,
> > > > such as:
> > > > - A full SYN backlog queue
> > > > - SYN flood attacks
> > > > - Memory exhaustion
> > > > - Other resource constraints
> > > >
> > > > Recently, one of our services experienced frequent ListenDrops. While
> > > > attempting to trace the root cause, we discovered that tracing the function
> > > > tcp_listendrop() was not straightforward because it is currently inlined.
> > > > To overcome this, we had to create a livepatch that defined a non-inlined
> > > > version of the function, which we then traced using BPF programs.
> > > >
> > > >   $ grep tcp_listendrop /proc/kallsyms
> > > >   ffffffffc093fba0 t tcp_listendrop_x     [livepatch_tmp]
> > > >
> > > > Through this process, we eventually determined that the ListenDrops were
> > > > caused by SYN attacks.
> > > >
> > > > Since tcp_listendrop() is not part of the critical path, defining it as
> > > > noinline would make it significantly easier to trace and diagnose issues
> > > > without requiring workarounds such as livepatching. This function can be
> > > > used by kernel modules like smc, so export it with EXPORT_SYMBOL_GPL().
> > > >
> > > > After that change, the result is as follows,
> > > >
> > > >   $ grep tcp_listendrop /proc/kallsyms
> > > >   ffffffffb718eaa0 T __pfx_tcp_listendrop
> > > >   ffffffffb718eab0 T tcp_listendrop
> > > >   ffffffffb7e636b8 r __ksymtab_tcp_listendrop
> > > >
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > --
> > >
> > > Are we going to accept one patch at a time to un-inline all TCP
> > > related functions in the kernel ?
> >
> > I don't have an in-depth understanding of the TCP/IP stack, so I can't
> > consolidate all related TCP functions in the error paths into a single
> > patch. I would greatly appreciate it if you could help ensure these
> > functions are marked as non-inlined based on your expertise. If you
> > don't have time to do it directly, could you provide some guidance?
> >
> > The inlining of TCP functions in error paths often complicates tracing
> > efforts. For instance, we recently encountered issues with the inlined
> > tcp_write_err(), although we were fortunate to have an alternative
> > tracepoint available: inet_sk_error_report.
> >
> > Unfortunately, no such alternative exists for tcp_listendrop(), which
> > is why I submitted this patch.
> >
> > >
> > > My understanding is that current tools work fine. You may need to upgrade yours.
> > >
> > > # perf probe -a tcp_listendrop
> > > Added new events:
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >   probe:tcp_listendrop (on tcp_listendrop)
> > >
> > > You can now use it in all perf tools, such as:
> > >
> > > perf record -e probe:tcp_listendrop -aR sleep 1
> >
> > The issue is that we can't extract the `struct sock *sk` argument from
> > tcp_listendrop() using perf. Please correct me if I’m mistaken.
> >
> > In a containerized environment, it's common to have many pods running
> > on a single host, each assigned a unique IP. Our goal is to filter for
> > a specific local_ip:local_port combination while also capturing the
> > corresponding remote IP. This task is straightforward with bpftrace or
> > other BPF-based tools, but I’m unsure how to accomplish it using perf.
>
> I suggest you ask tracing experts' opinions, why is perf able to add a
> probe, but not bpftrace ?
>
> Again, accepting thousands of patches just because of tracing tools
> seems not good to me.

Thank you for your guidance on using perf. I now understand how to
achieve similar tracing with bpftrace.

I agree with your point that it’s not ideal to send thousands of
patches solely for the sake of tracing tools. It would indeed be much
better if this could be addressed in a single patch.
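
For the record, here is a rough sketch of how the sk fields could be pulled
out with perf probe itself (assuming kernel debuginfo is installed; whether
sk is visible at every inlined site depends on the DWARF the compiler
emitted):

  # perf probe -a 'tcp_listendrop sk->__sk_common.skc_num'
  # perf record -e probe:tcp_listendrop -aR sleep 10
  # perf script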

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9b37b76e894..65e6357e893b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2537,18 +2537,7 @@  static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
 		WRITE_ONCE(tp->data_segs_in, tp->data_segs_in + segs_in);
 }
 
-/*
- * TCP listen path runs lockless.
- * We forced "struct sock" to be const qualified to make sure
- * we don't modify one of its field by mistake.
- * Here, we increment sk_drops which is an atomic_t, so we can safely
- * make sock writable again.
- */
-static inline void tcp_listendrop(const struct sock *sk)
-{
-	atomic_inc(&((struct sock *)sk)->sk_drops);
-	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
-}
+void tcp_listendrop(const struct sock *sk);
 
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5bdf13ac26ef..0fcea815860b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -7174,6 +7174,20 @@  u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
 }
 EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
 
+/*
+ * TCP listen path runs lockless.
+ * We forced "struct sock" to be const qualified to make sure
+ * we don't modify one of its field by mistake.
+ * Here, we increment sk_drops which is an atomic_t, so we can safely
+ * make sock writable again.
+ */
+void tcp_listendrop(const struct sock *sk)
+{
+	atomic_inc(&((struct sock *)sk)->sk_drops);
+	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
+}
+EXPORT_SYMBOL_GPL(tcp_listendrop);
+
 int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		     const struct tcp_request_sock_ops *af_ops,
 		     struct sock *sk, struct sk_buff *skb)