From patchwork Fri Jan 6 10:37:37 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jakub Sitnicki X-Patchwork-Id: 13091253 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D0783C3DA7A for ; Fri, 6 Jan 2023 10:37:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229724AbjAFKhx (ORCPT ); Fri, 6 Jan 2023 05:37:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41112 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229766AbjAFKhp (ORCPT ); Fri, 6 Jan 2023 05:37:45 -0500 Received: from mail-ej1-x635.google.com (mail-ej1-x635.google.com [IPv6:2a00:1450:4864:20::635]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A2626C2A0 for ; Fri, 6 Jan 2023 02:37:44 -0800 (PST) Received: by mail-ej1-x635.google.com with SMTP id ud5so2624069ejc.4 for ; Fri, 06 Jan 2023 02:37:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=PcKgZJ+P+SVP0hDiVyMH1T7PA6ZiK6DYMeWb6TJB5DE=; b=a9rdlVtcdYHy2N3YIpJU24BGhabHMiZSBEk5MG2RLE4ff2fMVfB5RE91SwntgWJmJs 0fSIxSzU3iTTTSkPGtkY388s6JJPWjnMOAlBf1O6aM4MUsll2Oh/90RwIMOmyVKkXDPm pL7g5OekTaCPbVzCCLfAB91OyFOuVyGk9U+9g= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=PcKgZJ+P+SVP0hDiVyMH1T7PA6ZiK6DYMeWb6TJB5DE=; b=d1I9ZtFn3teqMDPd/tYcls/aaNt1CkffTFhNOZrVf3jlqz0uzHyvNxwAgz5mF5lW5F laA4/8gu1SGvsqI5grG+IsSjIZYjvAMcI5a+0bnJzqE/NCV1pj9BKSgY1b1ll0+O1NgM eetrSOKaWT1n7lF2QNc1LW41ubfwe3ctTIiKhHaNLp8+XHqE3G1ur8MkXd/ShKRFNr2E MYbio93Xa+xIYcrjW8rJIJW/OwKtnAl4RyHXEmNlJW6hlKE6xJy0xH5MDnY33YSw+Tnd dSI0Y835EKPOTT3dSCoudSL2zJM3Y9wZ3AyeXObB39pD7ouh8sPA2MXcOeU7hy1CLsOO 8WYg== X-Gm-Message-State: AFqh2kofoXUE5OlFJOwB6zExEbrkeATUPSzFSVxozEOA0bMxYP4FFsBE 0U3mj3oWe4wP1nazBfd5dwyScJVsNKnXR8mf X-Google-Smtp-Source: AMrXdXuN00eFR49/AmE9orjfe0ug55vhSeROBbrftojJzrs+r+SwJkpVxn+T1+K5tPTIGN8xJ8Ys1w== X-Received: by 2002:a17:907:6d0c:b0:7c1:652:d109 with SMTP id sa12-20020a1709076d0c00b007c10652d109mr55920501ejc.35.1673001462236; Fri, 06 Jan 2023 02:37:42 -0800 (PST) Received: from cloudflare.com (79.184.146.66.ipv4.supernova.orange.pl. [79.184.146.66]) by smtp.gmail.com with ESMTPSA id g16-20020a170906539000b0084cb26c3b71sm282877ejo.1.2023.01.06.02.37.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 06 Jan 2023 02:37:41 -0800 (PST) From: Jakub Sitnicki To: netdev@vger.kernel.org, "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni Cc: kernel-team@cloudflare.com, Marek Majkowski Subject: [PATCH net-next 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option Date: Fri, 6 Jan 2023 11:37:37 +0100 Message-Id: <20221221-sockopt-port-range-v1-1-e2b094b60ffd@cloudflare.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20221221-sockopt-port-range-v1-0-e2b094b60ffd@cloudflare.com> References: <20221221-sockopt-port-range-v1-0-e2b094b60ffd@cloudflare.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_LO | (PORT_HI << 16)) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski Signed-off-by: Jakub Sitnicki --- include/net/inet_sock.h | 4 ++++ include/net/ip.h | 3 ++- include/uapi/linux/in.h | 1 + net/ipv4/inet_connection_sock.c | 22 ++++++++++++++++++++-- net/ipv4/inet_hashtables.c | 2 +- net/ipv4/ip_sockglue.c | 18 ++++++++++++++++++ net/ipv4/udp.c | 2 +- 7 files changed, 47 insertions(+), 5 deletions(-) diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index bf5654ce711e..51857117ac09 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -249,6 +249,10 @@ struct inet_sock { __be32 mc_addr; struct ip_mc_socklist __rcu *mc_list; struct inet_cork_full cork; + struct { + __u16 lo; + __u16 hi; + } local_port_range; }; #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */ diff --git a/include/net/ip.h b/include/net/ip.h index 144bdfbb25af..c3fffaa92d6e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o } \ } -void inet_get_local_port_range(struct net *net, int *low, int *high); +void inet_get_local_port_range(const struct net *net, int *low, int *high); +void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high); #ifdef CONFIG_SYSCTL static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port) diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h index 07a4cb149305..4b7f2df66b99 100644 --- a/include/uapi/linux/in.h +++ b/include/uapi/linux/in.h @@ -162,6 +162,7 @@ struct in_addr { #define MCAST_MSFILTER 48 #define IP_MULTICAST_ALL 49 #define IP_UNICAST_IF 50 +#define IP_LOCAL_PORT_RANGE 51 #define MCAST_EXCLUDE 0 #define MCAST_INCLUDE 1 diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index d1f837579398..ec40c9ee753c 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -117,7 +117,7 @@ bool inet_rcv_saddr_any(const struct sock *sk) return !sk->sk_rcv_saddr; } -void inet_get_local_port_range(struct net *net, int *low, int *high) +void inet_get_local_port_range(const struct net *net, int *low, int *high) { unsigned int seq; @@ -130,6 +130,24 @@ void inet_get_local_port_range(struct net *net, int *low, int *high) } EXPORT_SYMBOL(inet_get_local_port_range); +void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high) +{ + const struct inet_sock *inet = inet_sk(sk); + const struct net *net = sock_net(sk); + int lo, hi; + + inet_get_local_port_range(net, &lo, &hi); + + if (unlikely(inet->local_port_range.lo)) + lo = clamp_val(inet->local_port_range.lo, lo, hi); + if (unlikely(inet->local_port_range.hi)) + hi = clamp_val(inet->local_port_range.hi, lo, hi); + + *low = lo; + *high = hi; +} +EXPORT_SYMBOL(inet_sk_get_local_port_range); + static bool inet_use_bhash2_on_bind(const struct sock *sk) { #if IS_ENABLED(CONFIG_IPV6) @@ -316,7 +334,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret, ports_exhausted: attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0; other_half_scan: - inet_get_local_port_range(net, &low, &high); + inet_sk_get_local_port_range(sk, &low, &high); high++; /* [32768, 60999] -> [32768, 61000[ */ if (high - low < 4) attempt_half = 0; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 24a38b56fab9..22ee5f23ee5b 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -1013,7 +1013,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, l3mdev = inet_sk_bound_l3mdev(sk); - inet_get_local_port_range(net, &low, &high); + inet_sk_get_local_port_range(sk, &low, &high); high++; /* [32768, 60999] -> [32768, 61000[ */ remaining = high - low; if (likely(remaining > 1)) diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index 9f92ae35bb01..b511ff0adc0a 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -923,6 +923,7 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname, case IP_CHECKSUM: case IP_RECVFRAGSIZE: case IP_RECVERR_RFC4884: + case IP_LOCAL_PORT_RANGE: if (optlen >= sizeof(int)) { if (copy_from_sockptr(&val, optval, sizeof(val))) return -EFAULT; @@ -1365,6 +1366,20 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname, WRITE_ONCE(inet->min_ttl, val); break; + case IP_LOCAL_PORT_RANGE: + { + const __u16 lo = val; + const __u16 hi = val >> 16; + + if (optlen != sizeof(__u32)) + goto e_inval; + if (lo != 0 && hi != 0 && lo > hi) + goto e_inval; + + inet->local_port_range.lo = lo; + inet->local_port_range.hi = hi; + break; + } default: err = -ENOPROTOOPT; break; @@ -1743,6 +1758,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname, case IP_MINTTL: val = inet->min_ttl; break; + case IP_LOCAL_PORT_RANGE: + val = inet->local_port_range.hi << 16 | inet->local_port_range.lo; + break; default: sockopt_release_sock(sk); return -ENOPROTOOPT; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 9592fe3e444a..c605d171eb2d 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -248,7 +248,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum, int low, high, remaining; unsigned int rand; - inet_get_local_port_range(net, &low, &high); + inet_sk_get_local_port_range(sk, &low, &high); remaining = (high - low) + 1; rand = get_random_u32(); From patchwork Fri Jan 6 10:37:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jakub Sitnicki X-Patchwork-Id: 13091254 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BB32BC54EBD for ; Fri, 6 Jan 2023 10:37:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229766AbjAFKhy (ORCPT ); Fri, 6 Jan 2023 05:37:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41572 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231696AbjAFKhs (ORCPT ); Fri, 6 Jan 2023 05:37:48 -0500 Received: from mail-ed1-x534.google.com (mail-ed1-x534.google.com [IPv6:2a00:1450:4864:20::534]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6D7FD6C28A for ; Fri, 6 Jan 2023 02:37:46 -0800 (PST) Received: by mail-ed1-x534.google.com with SMTP id v30so1691064edb.9 for ; Fri, 06 Jan 2023 02:37:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Mipcc5MufqCxFvtx1A9Osqfo2DYN60zWBd75DM224sQ=; b=eDNZ6KUJWI2h91Mz5pxv0r3iOhcIeoedo2rmfKTfCMdWqoL1Hu7QmS+8MeoNCQHkzN 0StQ6NMvvi7GpnUX8+5xQFJMv2p6sNHnRIkm6uaL2+kgrm9XoF7MclOJ4jmq5IlPeooH bgE5JKcrPF1litOGqwttsGkXeWb6XjMbKa6W0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Mipcc5MufqCxFvtx1A9Osqfo2DYN60zWBd75DM224sQ=; b=GAPNdvH+Nw2rbqZtjuxd35T7aQpjdXY9fXkOO1F5s2Ml9kul1JhOdjRy+gChu5YMBo RscjGoJPwISMHvg/108sEFI2th1T+tfqhdkO+BBsC3yYBfSoUpwNfEcXs8qY+mr3XqPD 1Okk7yeApxu4KLz3hIIHOS2e5i31bUd0fnZq7tB0Wz7bkx3R90Mncb0RVfLQOI1byjh6 IKSZzbyrtn3wy4OOXgEHGl7YGYpM/7uSWqVNKce14AfEDh5/K8WYtjgguFeCI2vU1rw2 XsN3PW2AC3nYrmmC7rxpX6RzLYsPBSja7zhQmg6bp+Ec/0cd6R64o5qOimlRBRcGXbU/ 7EaQ== X-Gm-Message-State: AFqh2krMM6orrfHjlxojOCkzmtpvXOhViI4kUBgdnF5vT8GRKKmApInM Ll/vOTxrfHjMSia66+NL1/WLybH+NsoE1D4m X-Google-Smtp-Source: AMrXdXtpBafDNVu1GbWkdpaPBcd3ShpwyzXwQNGZktgtM4PeFAQotZoY4pE5DBXsGtlG4ntlhOJl0Q== X-Received: by 2002:a05:6402:320e:b0:48e:c3f0:1b1b with SMTP id g14-20020a056402320e00b0048ec3f01b1bmr11816423eda.41.1673001464079; Fri, 06 Jan 2023 02:37:44 -0800 (PST) Received: from cloudflare.com (79.184.146.66.ipv4.supernova.orange.pl. [79.184.146.66]) by smtp.gmail.com with ESMTPSA id n11-20020a170906118b00b0081bfd407ad0sm274426eja.135.2023.01.06.02.37.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 06 Jan 2023 02:37:43 -0800 (PST) From: Jakub Sitnicki To: netdev@vger.kernel.org, "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni Cc: kernel-team@cloudflare.com Subject: [PATCH net-next 2/2] selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option Date: Fri, 6 Jan 2023 11:37:38 +0100 Message-Id: <20221221-sockopt-port-range-v1-2-e2b094b60ffd@cloudflare.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20221221-sockopt-port-range-v1-0-e2b094b60ffd@cloudflare.com> References: <20221221-sockopt-port-range-v1-0-e2b094b60ffd@cloudflare.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios: 1. pass invalid values to setsockopt 2. pass a range outside of the per-netns port range 3. configure a single-port range 4. exhaust a configured multi-port range 5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT) 6. set then get the per-socket port range Signed-off-by: Jakub Sitnicki --- tools/testing/selftests/net/Makefile | 2 + tools/testing/selftests/net/ip_local_port_range.c | 439 +++++++++++++++++++++ tools/testing/selftests/net/ip_local_port_range.sh | 5 + 3 files changed, 446 insertions(+) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 3007e98a6d64..51b0b49ecd3b 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -45,6 +45,7 @@ TEST_PROGS += arp_ndisc_untracked_subnets.sh TEST_PROGS += stress_reuseport_listen.sh TEST_PROGS += l2_tos_ttl_inherit.sh TEST_PROGS += bind_bhash.sh +TEST_PROGS += ip_local_port_range.sh TEST_PROGS_EXTENDED := in_netns.sh setup_loopback.sh setup_veth.sh TEST_PROGS_EXTENDED += toeplitz_client.sh toeplitz.sh TEST_GEN_FILES = socket nettest @@ -75,6 +76,7 @@ TEST_GEN_PROGS += so_incoming_cpu TEST_PROGS += sctp_vrf.sh TEST_GEN_FILES += sctp_hello TEST_GEN_FILES += csum +TEST_GEN_FILES += ip_local_port_range TEST_FILES := settings diff --git a/tools/testing/selftests/net/ip_local_port_range.c b/tools/testing/selftests/net/ip_local_port_range.c new file mode 100644 index 000000000000..272b0cb9fc57 --- /dev/null +++ b/tools/testing/selftests/net/ip_local_port_range.c @@ -0,0 +1,439 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +// Copyright (c) 2022 Cloudflare + +/* Test IP_LOCAL_PORT_RANGE socket option: IPv4 + IPv6, TCP + UDP. + * + * Tests assume that net.ipv4.ip_local_port_range is [40000, 49999]. + * Don't run these directly but with ip_local_port_range.sh script. + */ + +#include +#include + +#include "../kselftest_harness.h" + +#ifndef IP_LOCAL_PORT_RANGE +#define IP_LOCAL_PORT_RANGE 51 +#endif + +static __u32 pack_port_range(__u16 lo, __u16 hi) +{ + return (hi << 16) | (lo << 0); +} + +static void unpack_port_range(__u32 range, __u16 *lo, __u16 *hi) +{ + *lo = range & 0xffff; + *hi = range >> 16; +} + +static int get_so_domain(int fd) +{ + int domain, err; + socklen_t len; + + len = sizeof(domain); + err = getsockopt(fd, SOL_SOCKET, SO_DOMAIN, &domain, &len); + if (err) + return -1; + + return domain; +} + +static int bind_to_loopback_any_port(int fd) +{ + union { + struct sockaddr sa; + struct sockaddr_in v4; + struct sockaddr_in6 v6; + } addr; + socklen_t addr_len; + + memset(&addr, 0, sizeof(addr)); + switch (get_so_domain(fd)) { + case AF_INET: + addr.v4.sin_family = AF_INET; + addr.v4.sin_port = htons(0); + addr.v4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + addr_len = sizeof(addr.v4); + break; + case AF_INET6: + addr.v6.sin6_family = AF_INET6; + addr.v6.sin6_port = htons(0); + addr.v6.sin6_addr = in6addr_loopback; + addr_len = sizeof(addr.v6); + break; + default: + return -1; + } + + return bind(fd, &addr.sa, addr_len); +} + +static int get_sock_port(int fd) +{ + union { + struct sockaddr sa; + struct sockaddr_in v4; + struct sockaddr_in6 v6; + } addr; + socklen_t addr_len; + int err; + + addr_len = sizeof(addr); + memset(&addr, 0, sizeof(addr)); + err = getsockname(fd, &addr.sa, &addr_len); + if (err) + return -1; + + switch (addr.sa.sa_family) { + case AF_INET: + return ntohs(addr.v4.sin_port); + case AF_INET6: + return ntohs(addr.v6.sin6_port); + default: + errno = EAFNOSUPPORT; + return -1; + } +} + +static int get_ip_local_port_range(int fd, __u32 *range) +{ + socklen_t len; + __u32 val; + int err; + + len = sizeof(val); + err = getsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val, &len); + if (err) + return -1; + + *range = val; + return 0; +} + +static const char *so_domain_to_str(int domain) +{ + switch (domain) { + case AF_INET: + return "ip4"; + case AF_INET6: + return "ip6"; + default: + __builtin_unreachable(); + } +} + +static const char *so_type_to_str(int type) +{ + switch (type) { + case SOCK_STREAM: + return "tcp"; + case SOCK_DGRAM: + return "udp"; + default: + __builtin_unreachable(); + } +} + +const struct socket_kind { + int domain; + int type; +} socket_matrix[] = { + { AF_INET, SOCK_STREAM }, + { AF_INET, SOCK_DGRAM }, + { AF_INET6, SOCK_STREAM }, + { AF_INET6, SOCK_DGRAM }, +}; + +#define for_each_socket_kind(sk) \ + for ((sk) = socket_matrix; \ + (sk) < socket_matrix + ARRAY_SIZE(socket_matrix) && log_socket_kind((sk)); \ + (sk)++) + +#define log_socket_kind(sk) \ + ({ \ + TH_LOG("%s/%s", \ + so_domain_to_str((sk)->domain), \ + so_type_to_str((sk)->type)); \ + true; \ + }) + +TEST(invalid_option_value) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + __u16 val16; + __u32 val32; + __u64 val64; + int fd, err; + + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + /* Too few bytes */ + val16 = 40000; + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val16, sizeof(val16)); + EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail"); + EXPECT_EQ(errno, EINVAL); + + /* Empty range: low port > high port */ + val32 = pack_port_range(40222, 40111); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val32, sizeof(val32)); + EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail"); + EXPECT_EQ(errno, EINVAL); + + /* Too many bytes */ + val64 = pack_port_range(40333, 40444); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &val64, sizeof(val64)); + EXPECT_TRUE(err) TH_LOG("expected setsockopt(IP_LOCAL_PORT_RANGE) to fail"); + EXPECT_EQ(errno, EINVAL); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } +} + +TEST(sock_port_range_out_of_netns_range) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + int fd, err, port; + __u32 range; + + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + range = pack_port_range(50000, 50111); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + err = bind_to_loopback_any_port(fd); + ASSERT_TRUE(!err) TH_LOG("bind failed"); + + /* Check that socket port range outside of ephemeral range is ignored */ + port = get_sock_port(fd); + ASSERT_GE(port, 40000) TH_LOG("expected port within ephemeral range"); + ASSERT_LE(port, 49999) TH_LOG("expected port within ephemeral range"); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } +} + +TEST(single_port_range) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + struct test { + __u16 range_lo; + __u16 range_hi; + __u16 expected; + } tests[] = { + /* single port range within ephemeral range */ + { 45000, 45000, 45000 }, + /* first port in the ephemeral range (clamp from above) */ + { 0, 40000, 40000 }, + /* last port in the ephemeral range (clamp from below) */ + { 49999, 0, 49999 }, + }; + struct test *t; + + for (t = tests; t < tests + ARRAY_SIZE(tests); t++) { + int fd, err, port; + __u32 range; + + TH_LOG("lo %5hu, hi %5hu, expected %5hu", + t->range_lo, t->range_hi, t->expected); + + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + range = pack_port_range(t->range_lo, t->range_hi); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + err = bind_to_loopback_any_port(fd); + ASSERT_TRUE(!err) TH_LOG("bind failed"); + + port = get_sock_port(fd); + ASSERT_EQ(port, t->expected) TH_LOG("unexpected local port"); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } + } +} + +TEST(exhaust_8_port_range) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + __u8 port_set = 0; + int i, fd, err; + __u32 range; + __u16 port; + int fds[8]; + + for (i = 0; i < ARRAY_SIZE(fds); i++) { + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + range = pack_port_range(40000, 40007); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + err = bind_to_loopback_any_port(fd); + ASSERT_TRUE(!err) TH_LOG("bind failed"); + + port = get_sock_port(fd); + ASSERT_GE(port, 40000) TH_LOG("expected port within sockopt range"); + ASSERT_LE(port, 40007) TH_LOG("expected port within sockopt range"); + + port_set |= 1 << (port - 40000); + fds[i] = fd; + } + + /* Check that all every port from the test range is in use */ + ASSERT_EQ(port_set, 0xff) TH_LOG("expected all ports to be busy"); + + /* Check that bind() fails because the whole range is busy */ + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + range = pack_port_range(40000, 40007); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + err = bind_to_loopback_any_port(fd); + ASSERT_TRUE(err) TH_LOG("expected bind to fail"); + ASSERT_EQ(errno, EADDRINUSE); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + + for (i = 0; i < ARRAY_SIZE(fds); i++) { + err = close(fds[i]); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } + } +} + +TEST(late_bind) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + union { + struct sockaddr sa; + struct sockaddr_in v4; + struct sockaddr_in6 v6; + } addr; + socklen_t addr_len; + const int one = 1; + int fd, err; + __u32 range; + __u16 port; + + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + range = pack_port_range(40100, 40199); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + err = setsockopt(fd, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_BIND_ADDRESS_NO_PORT) failed"); + + err = bind_to_loopback_any_port(fd); + ASSERT_TRUE(!err) TH_LOG("bind failed"); + + port = get_sock_port(fd); + ASSERT_EQ(port, 0) TH_LOG("getsockname failed"); + + /* Invalid destination */ + memset(&addr, 0, sizeof(addr)); + switch (sk->domain) { + case AF_INET: + addr.v4.sin_family = AF_INET; + addr.v4.sin_port = htons(0); + addr.v4.sin_addr.s_addr = htonl(INADDR_ANY); + addr_len = sizeof(addr.v4); + break; + case AF_INET6: + addr.v6.sin6_family = AF_INET6; + addr.v6.sin6_port = htons(0); + addr.v6.sin6_addr = in6addr_any; + addr_len = sizeof(addr.v6); + break; + default: + ASSERT_TRUE(false) TH_LOG("unsupported socket domain"); + } + + /* connect() doesn't need to succeed for late bind to happen */ + connect(fd, &addr.sa, addr_len); + + port = get_sock_port(fd); + ASSERT_GE(port, 40100); + ASSERT_LE(port, 40199); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } +} + +TEST(get_port_range) +{ + const struct socket_kind *sk; + + for_each_socket_kind(sk) { + __u16 lo, hi; + __u32 range; + int fd, err; + + fd = socket(sk->domain, sk->type, 0); + ASSERT_GE(fd, 0) TH_LOG("socket failed"); + + /* Get range before it will be set */ + err = get_ip_local_port_range(fd, &range); + ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed"); + + unpack_port_range(range, &lo, &hi); + ASSERT_EQ(lo, 0) TH_LOG("unexpected low port"); + ASSERT_EQ(hi, 0) TH_LOG("unexpected high port"); + + range = pack_port_range(12345, 54321); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + /* Get range after it has been set */ + err = get_ip_local_port_range(fd, &range); + ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed"); + + unpack_port_range(range, &lo, &hi); + ASSERT_EQ(lo, 12345) TH_LOG("unexpected low port"); + ASSERT_EQ(hi, 54321) TH_LOG("unexpected high port"); + + /* Unset the port range */ + range = pack_port_range(0, 0); + err = setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range)); + ASSERT_TRUE(!err) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE) failed"); + + /* Get range after it has been unset */ + err = get_ip_local_port_range(fd, &range); + ASSERT_TRUE(!err) TH_LOG("getsockopt(IP_LOCAL_PORT_RANGE) failed"); + + unpack_port_range(range, &lo, &hi); + ASSERT_EQ(lo, 0) TH_LOG("unexpected low port"); + ASSERT_EQ(hi, 0) TH_LOG("unexpected high port"); + + err = close(fd); + ASSERT_TRUE(!err) TH_LOG("close failed"); + } +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh new file mode 100755 index 000000000000..6c6ad346eaa0 --- /dev/null +++ b/tools/testing/selftests/net/ip_local_port_range.sh @@ -0,0 +1,5 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +./in_netns.sh \ + sh -c 'sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && ./ip_local_port_range'