diff mbox series

[RFC,net-next,03/10] ipv4: Add custom multipath hash policy

Message ID 20210502162257.3472453-4-idosch@idosch.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series Add support for custom multipath hash | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 4 maintainers not CCed: yoshfuji@linux-ipv6.org linux-doc@vger.kernel.org dsahern@kernel.org corbet@lwn.net
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 156 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 5 this patch: 5
netdev/header_inline success Link

Commit Message

Ido Schimmel May 2, 2021, 4:22 p.m. UTC
From: Ido Schimmel <idosch@nvidia.com>

Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.

The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if the 'need_outer' or 'need_inner' variables are not set.
These fields are set / cleared as part of the control path when the
fib_multipath_hash_fields sysctl is configured.

In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ip-sysctl.rst |   2 +
 net/ipv4/route.c                       | 121 +++++++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |   3 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

Comments

David Ahern May 4, 2021, 2:38 a.m. UTC | #1
On 5/2/21 10:22 AM, Ido Schimmel wrote:
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 8ab61f4edf02..549601494694 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -99,6 +99,8 @@ fib_multipath_hash_policy - INTEGER
>  	- 0 - Layer 3
>  	- 1 - Layer 4
>  	- 2 - Layer 3 or inner Layer 3 if present
> +	- 3 - Custom multipath hash. Fields used for multipath hash calculation
> +	  are determined by fib_multipath_hash_fields sysctl
>  

If you default the fields to L3, then seems like the custom option is a
more generic case of '0'.
David Ahern May 4, 2021, 2:40 a.m. UTC | #2
On 5/3/21 8:38 PM, David Ahern wrote:
> On 5/2/21 10:22 AM, Ido Schimmel wrote:
>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>> index 8ab61f4edf02..549601494694 100644
>> --- a/Documentation/networking/ip-sysctl.rst
>> +++ b/Documentation/networking/ip-sysctl.rst
>> @@ -99,6 +99,8 @@ fib_multipath_hash_policy - INTEGER
>>  	- 0 - Layer 3
>>  	- 1 - Layer 4
>>  	- 2 - Layer 3 or inner Layer 3 if present
>> +	- 3 - Custom multipath hash. Fields used for multipath hash calculation
>> +	  are determined by fib_multipath_hash_fields sysctl
>>  
> 
> If you default the fields to L3, then seems like the custom option is a
> more generic case of '0'.
> 

Actually this is the more generic case of all of them, so all of them
could be implemented using this custom field selection.
Ido Schimmel May 5, 2021, 8:45 a.m. UTC | #3
On Mon, May 03, 2021 at 08:40:18PM -0600, David Ahern wrote:
> On 5/3/21 8:38 PM, David Ahern wrote:
> > On 5/2/21 10:22 AM, Ido Schimmel wrote:
> >> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> >> index 8ab61f4edf02..549601494694 100644
> >> --- a/Documentation/networking/ip-sysctl.rst
> >> +++ b/Documentation/networking/ip-sysctl.rst
> >> @@ -99,6 +99,8 @@ fib_multipath_hash_policy - INTEGER
> >>  	- 0 - Layer 3
> >>  	- 1 - Layer 4
> >>  	- 2 - Layer 3 or inner Layer 3 if present
> >> +	- 3 - Custom multipath hash. Fields used for multipath hash calculation
> >> +	  are determined by fib_multipath_hash_fields sysctl
> >>  
> > 
> > If you default the fields to L3, then seems like the custom option is a
> > more generic case of '0'.
> > 
> 
> Actually this is the more generic case of all of them, so all of them
> could be implemented using this custom field selection.

Yes and no. The default policy (0) special cases ICMP packets whereas
the rest of the policies do not. Therefore, entirely ignoring the chosen
policy and only looking at the hash fields is not possible here.

Policy 1 can be implemented based on the hash fields alone, as it only
considers outer fields.

Policy 2 is a bit quirky. It will consider inner layer 3 fields and
fallback to outer fields if no encapsulation was encountered. It is not
possible to implement it as-is based on hash fields alone. A good
approximation would be to hash based on both outer and inner layer 3
fields, as the outer fields should remain constant throughout the
lifetime of a given encapsulated flow.

However, while some code can be reused between the different policies, I
do not think it is the primary consideration here. With the new policy,
I tried to minimize the amount of checks as much as I could, but at the
worst case it still adds a per-field check. Avoiding this penalty for
existing use cases was my primary motivation. Especially when the
multipath code can be called from XDP via bpf_fib_lookup() helper.
diff mbox series

Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 8ab61f4edf02..549601494694 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -99,6 +99,8 @@  fib_multipath_hash_policy - INTEGER
 	- 0 - Layer 3
 	- 1 - Layer 4
 	- 2 - Layer 3 or inner Layer 3 if present
+	- 3 - Custom multipath hash. Fields used for multipath hash calculation
+	  are determined by fib_multipath_hash_fields sysctl
 
 fib_multipath_hash_fields - list of comma separated ranges
 	When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9d61e969446e..995799a6e06f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1906,6 +1906,121 @@  static void ip_multipath_l3_keys(const struct sk_buff *skb,
 	hash_keys->addrs.v4addrs.dst = key_iph->daddr;
 }
 
+static u32 fib_multipath_custom_hash_outer(const struct net *net,
+					   const struct sk_buff *skb,
+					   bool *p_has_inner)
+{
+	unsigned long *hash_fields = net->ipv4.sysctl_fib_multipath_hash_fields;
+	struct flow_keys keys, hash_keys;
+
+	if (!net->ipv4.fib_multipath_hash_fields_need_outer)
+		return 0;
+
+	memset(&hash_keys, 0, sizeof(hash_keys));
+	skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_ENCAP);
+
+	hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(SRC_IP, hash_fields))
+		hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(DST_IP, hash_fields))
+		hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(IP_PROTO, hash_fields))
+		hash_keys.basic.ip_proto = keys.basic.ip_proto;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(SRC_PORT, hash_fields))
+		hash_keys.ports.src = keys.ports.src;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(DST_PORT, hash_fields))
+		hash_keys.ports.dst = keys.ports.dst;
+
+	*p_has_inner = !!(keys.control.flags & FLOW_DIS_ENCAPSULATION);
+	return flow_hash_from_keys(&hash_keys);
+}
+
+static u32 fib_multipath_custom_hash_inner(const struct net *net,
+					   const struct sk_buff *skb,
+					   bool has_inner)
+{
+	unsigned long *hash_fields = net->ipv4.sysctl_fib_multipath_hash_fields;
+	struct flow_keys keys, hash_keys;
+
+	/* We assume the packet carries an encapsulation, but if none was
+	 * encountered during dissection of the outer flow, then there is no
+	 * point in calling the flow dissector again.
+	 */
+	if (!has_inner)
+		return 0;
+
+	if (!net->ipv4.fib_multipath_hash_fields_need_inner)
+		return 0;
+
+	memset(&hash_keys, 0, sizeof(hash_keys));
+	skb_flow_dissect_flow_keys(skb, &keys, 0);
+
+	if (!(keys.control.flags & FLOW_DIS_ENCAPSULATION))
+		return 0;
+
+	if (keys.control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_SRC_IP, hash_fields))
+			hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+		if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_DST_IP, hash_fields))
+			hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
+	} else if (keys.control.addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_SRC_IP, hash_fields))
+			hash_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
+		if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_DST_IP, hash_fields))
+			hash_keys.addrs.v6addrs.dst = keys.addrs.v6addrs.dst;
+		if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_FLOWLABEL, hash_fields))
+			hash_keys.tags.flow_label = keys.tags.flow_label;
+	}
+
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_IP_PROTO, hash_fields))
+		hash_keys.basic.ip_proto = keys.basic.ip_proto;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_SRC_PORT, hash_fields))
+		hash_keys.ports.src = keys.ports.src;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(INNER_DST_PORT, hash_fields))
+		hash_keys.ports.dst = keys.ports.dst;
+
+	return flow_hash_from_keys(&hash_keys);
+}
+
+static u32 fib_multipath_custom_hash_skb(const struct net *net,
+					 const struct sk_buff *skb)
+{
+	u32 mhash, mhash_inner;
+	bool has_inner = true;
+
+	mhash = fib_multipath_custom_hash_outer(net, skb, &has_inner);
+	mhash_inner = fib_multipath_custom_hash_inner(net, skb, has_inner);
+
+	return jhash_2words(mhash, mhash_inner, 0);
+}
+
+static u32 fib_multipath_custom_hash_fl4(const struct net *net,
+					 const struct flowi4 *fl4)
+{
+	unsigned long *hash_fields = net->ipv4.sysctl_fib_multipath_hash_fields;
+	struct flow_keys hash_keys;
+
+	if (!net->ipv4.fib_multipath_hash_fields_need_outer)
+		return 0;
+
+	memset(&hash_keys, 0, sizeof(hash_keys));
+	hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(SRC_IP, hash_fields))
+		hash_keys.addrs.v4addrs.src = fl4->saddr;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(DST_IP, hash_fields))
+		hash_keys.addrs.v4addrs.dst = fl4->daddr;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(IP_PROTO, hash_fields))
+		hash_keys.basic.ip_proto = fl4->flowi4_proto;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(SRC_PORT, hash_fields))
+		hash_keys.ports.src = fl4->fl4_sport;
+	if (FIB_MULTIPATH_HASH_TEST_FIELD(DST_PORT, hash_fields))
+		hash_keys.ports.dst = fl4->fl4_dport;
+
+	return flow_hash_from_keys(&hash_keys);
+}
+
 /* if skb is set it will be used and fl4 can be NULL */
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
 		       const struct sk_buff *skb, struct flow_keys *flkeys)
@@ -1991,6 +2106,12 @@  int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
 		}
 		mhash = flow_hash_from_keys(&hash_keys);
 		break;
+	case 3:
+		if (skb)
+			mhash = fib_multipath_custom_hash_skb(net, skb);
+		else
+			mhash = fib_multipath_custom_hash_fl4(net, fl4);
+		break;
 	}
 
 	if (multipath_hash)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0db7e68c38cd..988defcd8ae3 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,7 @@ 
 #include <net/netevent.h>
 
 static int two = 2;
+static int three __maybe_unused = 3;
 static int four = 4;
 static int thousand = 1000;
 static int tcp_retr1_max = 255;
@@ -1075,7 +1076,7 @@  static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_fib_multipath_hash_policy,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= &two,
+		.extra2		= &three,
 	},
 	{
 		.procname	= "fib_multipath_hash_fields",