diff mbox series

[net-next,v3] tcp: Dump bound-only sockets in inet_diag.

Message ID 49a05d612fc8968b17780ed82ecb1b96dcf78e5a.1701358163.git.gnault@redhat.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next,v3] tcp: Dump bound-only sockets in inet_diag. | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3981 this patch: 3981
netdev/cc_maintainers warning 12 maintainers not CCed: haoluo@google.com jolsa@kernel.org john.fastabend@gmail.com sdf@google.com ast@kernel.org andrii@kernel.org yonghong.song@linux.dev martin.lau@linux.dev daniel@iogearbox.net bpf@vger.kernel.org kpsingh@kernel.org song@kernel.org
netdev/build_clang success Errors and warnings before: 1312 this patch: 1312
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4209 this patch: 4209
netdev/checkpatch warning CHECK: multiple assignments should be avoided
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Guillaume Nault Nov. 30, 2023, 3:40 p.m. UTC
Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().

The code is inspired by the ->lhash2 loop. However there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.

There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.

With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follow:

  $ ss -t state bound-inactive
  Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
  0        0                0.0.0.0:40000           0.0.0.0:*
  0        0                   [::]:60000              [::]:*
  0        0                      *:64000                 *:*

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---

v3:
  * Grab sockets with sock_hold(), instead of refcount_inc_not_zero()
    (Kuniyuki Iwashima).
  * Use a new TCP pseudo-state (TCP_BOUND_INACTIVE), to dump bound-only
    sockets, so that "ss -l" won't print them (Eric Dumazet).

v2:
  * Use ->bhash2 instead of ->bhash (Kuniyuki Iwashima).
  * Process no more than 16 sockets at a time (Kuniyuki Iwashima).

 include/net/tcp_states.h |  2 +
 include/uapi/linux/bpf.h |  1 +
 net/ipv4/inet_diag.c     | 86 +++++++++++++++++++++++++++++++++++++++-
 net/ipv4/tcp.c           |  1 +
 4 files changed, 89 insertions(+), 1 deletion(-)

Comments

Guillaume Nault Nov. 30, 2023, 3:51 p.m. UTC | #1
On Thu, Nov 30, 2023 at 04:40:51PM +0100, Guillaume Nault wrote:
> Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
> that are bound but haven't yet called connect() or listen().
> 
> The code is inspired by the ->lhash2 loop. However there's no manual
> test of the source port, since this kind of filtering is already
> handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
> at a time, to avoid running with bh disabled for too long.
> 
> There's no TCP state for bound but otherwise inactive sockets. Such
> sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
> to only dump listening sockets, actually requests the kernel to dump
> sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
> bound-only sockets with "ss -l", we therefore need to define a new
> pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
> explicitly.
> 
> With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
> 40000, 64000, 60000, an updated version of iproute2 could work as
> follow:
> 
>   $ ss -t state bound-inactive
>   Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
>   0        0                0.0.0.0:40000           0.0.0.0:*
>   0        0                   [::]:60000              [::]:*
>   0        0                      *:64000                 *:*

Here's a patch for iproute2-next for easy testing.
I'll submit it formally once the kernel side will be in place.

-------- >8 --------

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 073e9f03..4ece41fa 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -40,6 +40,10 @@ established connections) sockets.
 .B \-l, \-\-listening
 Display only listening sockets (these are omitted by default).
 .TP
+.B \-B, \-\-bound-inactive
+Display only TCP bound but inactive (not listening, connecting, etc.) sockets
+(these are omitted by default).
+.TP
 .B \-o, \-\-options
 Show timer information. For TCP protocol, the output format is:
 .RS
@@ -456,6 +460,9 @@ states except for
 - opposite to
 .B bucket
 
+.B bound-inactive
+- bound but otherwise inactive sockets (not listening, connecting, etc.)
+
 .SH EXPRESSION
 
 .B EXPRESSION
diff --git a/misc/ss.c b/misc/ss.c
index 9438382b..45f01286 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -210,6 +210,8 @@ enum {
 	SS_LAST_ACK,
 	SS_LISTEN,
 	SS_CLOSING,
+	SS_NEW_SYN_RECV,
+	SS_BOUND_INACTIVE,
 	SS_MAX
 };
 
@@ -1381,6 +1383,8 @@ static void sock_state_print(struct sockstat *s)
 		[SS_LAST_ACK] = "LAST-ACK",
 		[SS_LISTEN] =	"LISTEN",
 		[SS_CLOSING] = "CLOSING",
+		[SS_NEW_SYN_RECV] = "NEW-SYN-RECV",
+		[SS_BOUND_INACTIVE] = "BOUND-INACTIVE",
 	};
 
 	switch (s->local.family) {
@@ -5333,6 +5337,7 @@ static void _usage(FILE *dest)
 "   -r, --resolve       resolve host names\n"
 "   -a, --all           display all sockets\n"
 "   -l, --listening     display listening sockets\n"
+"   -B, --bound-inactive display TCP bound but inactive sockets\n"
 "   -o, --options       show timer information\n"
 "   -e, --extended      show detailed socket information\n"
 "   -m, --memory        show socket memory usage\n"
@@ -5415,6 +5420,8 @@ static int scan_state(const char *state)
 		[SS_LAST_ACK] = "last-ack",
 		[SS_LISTEN] =	"listening",
 		[SS_CLOSING] = "closing",
+		[SS_NEW_SYN_RECV] = "new-syn-recv",
+		[SS_BOUND_INACTIVE] = "bound-inactive",
 	};
 	int i;
 
@@ -5481,6 +5488,7 @@ static const struct option long_opts[] = {
 	{ "vsock", 0, 0, OPT_VSOCK },
 	{ "all", 0, 0, 'a' },
 	{ "listening", 0, 0, 'l' },
+	{ "bound-inactive", 0, 0, 'B' },
 	{ "ipv4", 0, 0, '4' },
 	{ "ipv6", 0, 0, '6' },
 	{ "packet", 0, 0, '0' },
@@ -5519,7 +5527,7 @@ int main(int argc, char *argv[])
 	int state_filter = 0;
 
 	while ((ch = getopt_long(argc, argv,
-				 "dhaletuwxnro460spTbEf:mMiA:D:F:vVzZN:KHSO",
+				 "dhalBetuwxnro460spTbEf:mMiA:D:F:vVzZN:KHSO",
 				 long_opts, NULL)) != EOF) {
 		switch (ch) {
 		case 'n':
@@ -5584,6 +5592,9 @@ int main(int argc, char *argv[])
 		case 'l':
 			state_filter = (1 << SS_LISTEN) | (1 << SS_CLOSE);
 			break;
+		case 'B':
+			state_filter = 1 << SS_BOUND_INACTIVE;
+			break;
 		case '4':
 			filter_af_set(&current_filter, AF_INET);
 			break;
Eric Dumazet Nov. 30, 2023, 4:17 p.m. UTC | #2
On Thu, Nov 30, 2023 at 4:40 PM Guillaume Nault <gnault@redhat.com> wrote:
>
> Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
> that are bound but haven't yet called connect() or listen().
>
> The code is inspired by the ->lhash2 loop. However there's no manual
> test of the source port, since this kind of filtering is already
> handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
> at a time, to avoid running with bh disabled for too long.
>
> There's no TCP state for bound but otherwise inactive sockets. Such
> sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
> to only dump listening sockets, actually requests the kernel to dump
> sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
> bound-only sockets with "ss -l", we therefore need to define a new
> pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
> explicitly.
>
> With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
> 40000, 64000, 60000, an updated version of iproute2 could work as
> follow:
>
>   $ ss -t state bound-inactive
>   Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
>   0        0                0.0.0.0:40000           0.0.0.0:*
>   0        0                   [::]:60000              [::]:*
>   0        0                      *:64000                 *:*
>
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
>
> v3:
>   * Grab sockets with sock_hold(), instead of refcount_inc_not_zero()
>     (Kuniyuki Iwashima).
>   * Use a new TCP pseudo-state (TCP_BOUND_INACTIVE), to dump bound-only
>     sockets, so that "ss -l" won't print them (Eric Dumazet).
>


> +pause_bind_walk:
> +                       spin_unlock_bh(&ibb->lock);
> +
> +                       res = 0;
> +                       for (idx = 0; idx < accum; idx++) {
> +                               if (res >= 0) {
> +                                       res = inet_sk_diag_fill(sk_arr[idx],
> +                                                               NULL, skb, cb,
> +                                                               r, NLM_F_MULTI,
> +                                                               net_admin);
> +                                       if (res < 0)
> +                                               num = num_arr[idx];
> +                               }
> +                               sock_gen_put(sk_arr[idx]);

nit: this could be a mere sock_put(), because only full sockets are
hashed in bhash2[]

Reviewed-by: Eric Dumazet <edumazet@google.com>
Guillaume Nault Nov. 30, 2023, 4:30 p.m. UTC | #3
On Thu, Nov 30, 2023 at 05:17:57PM +0100, Eric Dumazet wrote:
> On Thu, Nov 30, 2023 at 4:40 PM Guillaume Nault <gnault@redhat.com> wrote:
> >
> > Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
> > that are bound but haven't yet called connect() or listen().
> >
> > The code is inspired by the ->lhash2 loop. However there's no manual
> > test of the source port, since this kind of filtering is already
> > handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
> > at a time, to avoid running with bh disabled for too long.
> >
> > There's no TCP state for bound but otherwise inactive sockets. Such
> > sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
> > to only dump listening sockets, actually requests the kernel to dump
> > sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
> > bound-only sockets with "ss -l", we therefore need to define a new
> > pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
> > explicitly.
> >
> > With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
> > 40000, 64000, 60000, an updated version of iproute2 could work as
> > follow:
> >
> >   $ ss -t state bound-inactive
> >   Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
> >   0        0                0.0.0.0:40000           0.0.0.0:*
> >   0        0                   [::]:60000              [::]:*
> >   0        0                      *:64000                 *:*
> >
> > Signed-off-by: Guillaume Nault <gnault@redhat.com>
> > ---
> >
> > v3:
> >   * Grab sockets with sock_hold(), instead of refcount_inc_not_zero()
> >     (Kuniyuki Iwashima).
> >   * Use a new TCP pseudo-state (TCP_BOUND_INACTIVE), to dump bound-only
> >     sockets, so that "ss -l" won't print them (Eric Dumazet).
> >
> 
> 
> > +pause_bind_walk:
> > +                       spin_unlock_bh(&ibb->lock);
> > +
> > +                       res = 0;
> > +                       for (idx = 0; idx < accum; idx++) {
> > +                               if (res >= 0) {
> > +                                       res = inet_sk_diag_fill(sk_arr[idx],
> > +                                                               NULL, skb, cb,
> > +                                                               r, NLM_F_MULTI,
> > +                                                               net_admin);
> > +                                       if (res < 0)
> > +                                               num = num_arr[idx];
> > +                               }
> > +                               sock_gen_put(sk_arr[idx]);
> 
> nit: this could be a mere sock_put(), because only full sockets are
> hashed in bhash2[]

Yes, makes sense.
I'll send a v4.

> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
diff mbox series

Patch

diff --git a/include/net/tcp_states.h b/include/net/tcp_states.h
index cc00118acca1..d60e8148ff4c 100644
--- a/include/net/tcp_states.h
+++ b/include/net/tcp_states.h
@@ -22,6 +22,7 @@  enum {
 	TCP_LISTEN,
 	TCP_CLOSING,	/* Now a valid state */
 	TCP_NEW_SYN_RECV,
+	TCP_BOUND_INACTIVE, /* Pseudo-state for inet_diag */
 
 	TCP_MAX_STATES	/* Leave at the end! */
 };
@@ -43,6 +44,7 @@  enum {
 	TCPF_LISTEN	 = (1 << TCP_LISTEN),
 	TCPF_CLOSING	 = (1 << TCP_CLOSING),
 	TCPF_NEW_SYN_RECV = (1 << TCP_NEW_SYN_RECV),
+	TCPF_BOUND_INACTIVE = (1 << TCP_BOUND_INACTIVE),
 };
 
 #endif	/* _LINUX_TCP_STATES_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7a5498242eaa..8ee2404d077c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6892,6 +6892,7 @@  enum {
 	BPF_TCP_LISTEN,
 	BPF_TCP_CLOSING,	/* Now a valid state */
 	BPF_TCP_NEW_SYN_RECV,
+	BPF_TCP_BOUND_INACTIVE,
 
 	BPF_TCP_MAX_STATES	/* Leave at the end! */
 };
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 7d0e7aaa71e0..05fa0edd78b1 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -1077,10 +1077,94 @@  void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 		s_i = num = s_num = 0;
 	}
 
+/* Process a maximum of SKARR_SZ sockets at a time when walking hash buckets
+ * with bh disabled.
+ */
+#define SKARR_SZ 16
+
+	/* Dump bound but inactive (not listening, connecting, etc.) sockets */
+	if (cb->args[0] == 1) {
+		if (!(idiag_states & TCPF_BOUND_INACTIVE))
+			goto skip_bind_ht;
+
+		for (i = s_i; i < hashinfo->bhash_size; i++) {
+			struct inet_bind_hashbucket *ibb;
+			struct inet_bind2_bucket *tb2;
+			struct sock *sk_arr[SKARR_SZ];
+			int num_arr[SKARR_SZ];
+			int idx, accum, res;
+
+resume_bind_walk:
+			num = 0;
+			accum = 0;
+			ibb = &hashinfo->bhash2[i];
+
+			spin_lock_bh(&ibb->lock);
+			inet_bind_bucket_for_each(tb2, &ibb->chain) {
+				if (!net_eq(ib2_net(tb2), net))
+					continue;
+
+				sk_for_each_bound_bhash2(sk, &tb2->owners) {
+					struct inet_sock *inet = inet_sk(sk);
+
+					if (num < s_num)
+						goto next_bind;
+
+					if (sk->sk_state != TCP_CLOSE ||
+					    !inet->inet_num)
+						goto next_bind;
+
+					if (r->sdiag_family != AF_UNSPEC &&
+					    r->sdiag_family != sk->sk_family)
+						goto next_bind;
+
+					if (!inet_diag_bc_sk(bc, sk))
+						goto next_bind;
+
+					sock_hold(sk);
+					num_arr[accum] = num;
+					sk_arr[accum] = sk;
+					if (++accum == SKARR_SZ)
+						goto pause_bind_walk;
+next_bind:
+					num++;
+				}
+			}
+pause_bind_walk:
+			spin_unlock_bh(&ibb->lock);
+
+			res = 0;
+			for (idx = 0; idx < accum; idx++) {
+				if (res >= 0) {
+					res = inet_sk_diag_fill(sk_arr[idx],
+								NULL, skb, cb,
+								r, NLM_F_MULTI,
+								net_admin);
+					if (res < 0)
+						num = num_arr[idx];
+				}
+				sock_gen_put(sk_arr[idx]);
+			}
+			if (res < 0)
+				goto done;
+
+			cond_resched();
+
+			if (accum == SKARR_SZ) {
+				s_num = num + 1;
+				goto resume_bind_walk;
+			}
+
+			s_num = 0;
+		}
+skip_bind_ht:
+		cb->args[0] = 2;
+		s_i = num = s_num = 0;
+	}
+
 	if (!(idiag_states & ~TCPF_LISTEN))
 		goto out;
 
-#define SKARR_SZ 16
 	for (i = s_i; i <= hashinfo->ehash_mask; i++) {
 		struct inet_ehash_bucket *head = &hashinfo->ehash[i];
 		spinlock_t *lock = inet_ehash_lockp(hashinfo, i);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..a100df07d34a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2605,6 +2605,7 @@  void tcp_set_state(struct sock *sk, int state)
 	BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
 	BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
 	BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+	BUILD_BUG_ON((int)BPF_TCP_BOUND_INACTIVE != (int)TCP_BOUND_INACTIVE);
 	BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
 
 	/* bpf uapi header bpf.h defines an anonymous enum with values