diff mbox series

[net-next,2/6] net: nexthop: Increase weight to u16

Message ID e0605ce114eb24323a05aaca1dcdb750b2e0329a.1722519021.git.petrm@nvidia.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series net: nexthop: Increase weight to u16 | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 42 this patch: 42
netdev/build_tools success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 43 this patch: 43
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 381 this patch: 381
netdev/checkpatch warning WARNING: line length of 90 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-03--00-00 (tests: 701)

Commit Message

Petr Machata Aug. 1, 2024, 4:23 p.m. UTC
In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patch increase the next hop weight from u8 to u16.

Increasing the width of an integral type can be tricky, because while the
code still compiles, the types may not check out anymore, and numerical
errors come up. To prevent this, the conversion was done in two steps.
First the type was changed from u8 to a single-member structure, which
invalidated all uses of the field. This allowed going through them one by
one and audit for type correctness. Then the structure was replaced with a
vanilla u16 again. This should ensure that no place was missed.

The UAPI for configuring nexthop group members is that an attribute
NHA_GROUP carries an array of struct nexthop_grp entries:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	resvd1;
		__u16	resvd2;
	};

The field resvd1 is currently validated and required to be zero. We can
lift this requirement and carry high-order bits of the weight in the
reserved field:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	weight_high;
		__u16	resvd2;
	};

Keeping the fields split this way was chosen in case an existing userspace
makes assumptions about the width of the weight field, and to sidestep any
endianes issues.

The weight field is currently encoded as the weight value minus one,
because weight of 0 is invalid. This same trick is impossible for the new
weight_high field, because zero must mean actual zero. With this in place:

- Old userspace is guaranteed to carry weight_high of 0, therefore
  configuring 8-bit weights as appropriate. When dumping nexthops with
  16-bit weight, it would only show the lower 8 bits. But configuring such
  nexthops implies existence of userspace aware of the extension in the
  first place.

- New userspace talking to an old kernel will work as long as it only
  attempts to configure 8-bit weights, where the high-order bits are zero.
  Old kernel will bounce attempts at configuring >8-bit weights.

Renaming reserved fields as they are allocated for some purpose is commonly
done in Linux. Whoever touches a reserved field is doing so at their own
risk. nexthop_grp::resvd1 in particular is currently used by at least
strace, however they carry an own copy of UAPI headers, and the conversion
should be trivial. A helper is provided for decoding the weight out of the
two fields. Forcing a conversion seems preferable to bending backwards and
introducing anonymous unions or whatever.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 include/net/nexthop.h        |  4 ++--
 include/uapi/linux/nexthop.h |  7 ++++++-
 net/ipv4/nexthop.c           | 37 ++++++++++++++++++++++--------------
 3 files changed, 31 insertions(+), 17 deletions(-)

Comments

Simon Horman Aug. 1, 2024, 7:39 p.m. UTC | #1
On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patch increase the next hop weight from u8 to u16.
> 
> Increasing the width of an integral type can be tricky, because while the
> code still compiles, the types may not check out anymore, and numerical
> errors come up. To prevent this, the conversion was done in two steps.
> First the type was changed from u8 to a single-member structure, which
> invalidated all uses of the field. This allowed going through them one by
> one and audit for type correctness. Then the structure was replaced with a
> vanilla u16 again. This should ensure that no place was missed.
> 
> The UAPI for configuring nexthop group members is that an attribute
> NHA_GROUP carries an array of struct nexthop_grp entries:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	resvd1;
> 		__u16	resvd2;
> 	};
> 
> The field resvd1 is currently validated and required to be zero. We can
> lift this requirement and carry high-order bits of the weight in the
> reserved field:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	weight_high;
> 		__u16	resvd2;
> 	};
> 
> Keeping the fields split this way was chosen in case an existing userspace
> makes assumptions about the width of the weight field, and to sidestep any
> endianes issues.

nit: endianness

...
Petr Machata Aug. 5, 2024, 12:33 p.m. UTC | #2
Simon Horman <horms@kernel.org> writes:

> On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote:
>> In CLOS networks, as link failures occur at various points in the network,
>> ECMP weights of the involved nodes are adjusted to compensate. With high
>> fan-out of the involved nodes, and overall high number of nodes,
>> a (non-)ECMP weight ratio that we would like to configure does not fit into
>> 8 bits. Instead of, say, 255:254, we might like to configure something like
>> 1000:999. For these deployments, the 8-bit weight may not be enough.
>> 
>> To that end, in this patch increase the next hop weight from u8 to u16.
>> 
>> Increasing the width of an integral type can be tricky, because while the
>> code still compiles, the types may not check out anymore, and numerical
>> errors come up. To prevent this, the conversion was done in two steps.
>> First the type was changed from u8 to a single-member structure, which
>> invalidated all uses of the field. This allowed going through them one by
>> one and audit for type correctness. Then the structure was replaced with a
>> vanilla u16 again. This should ensure that no place was missed.
>> 
>> The UAPI for configuring nexthop group members is that an attribute
>> NHA_GROUP carries an array of struct nexthop_grp entries:
>> 
>> 	struct nexthop_grp {
>> 		__u32	id;	  /* nexthop id - must exist */
>> 		__u8	weight;   /* weight of this nexthop */
>> 		__u8	resvd1;
>> 		__u16	resvd2;
>> 	};
>> 
>> The field resvd1 is currently validated and required to be zero. We can
>> lift this requirement and carry high-order bits of the weight in the
>> reserved field:
>> 
>> 	struct nexthop_grp {
>> 		__u32	id;	  /* nexthop id - must exist */
>> 		__u8	weight;   /* weight of this nexthop */
>> 		__u8	weight_high;
>> 		__u16	resvd2;
>> 	};
>> 
>> Keeping the fields split this way was chosen in case an existing userspace
>> makes assumptions about the width of the weight field, and to sidestep any
>> endianes issues.
>
> nit: endianness

Thanks, will fix for v2. For now I'm still waiting if there's other
feedback.
Przemek Kitszel Aug. 6, 2024, 2:02 p.m. UTC | #3
On 8/1/24 18:23, Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patch increase the next hop weight from u8 to u16.
> 
> Increasing the width of an integral type can be tricky, because while the
> code still compiles, the types may not check out anymore, and numerical
> errors come up. To prevent this, the conversion was done in two steps.
> First the type was changed from u8 to a single-member structure, which
> invalidated all uses of the field. This allowed going through them one by
> one and audit for type correctness. Then the structure was replaced with a
> vanilla u16 again. This should ensure that no place was missed.
> 
> The UAPI for configuring nexthop group members is that an attribute
> NHA_GROUP carries an array of struct nexthop_grp entries:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	resvd1;
> 		__u16	resvd2;
> 	};
> 
> The field resvd1 is currently validated and required to be zero. We can
> lift this requirement and carry high-order bits of the weight in the
> reserved field:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	weight_high;
> 		__u16	resvd2;
> 	};
> 
> Keeping the fields split this way was chosen in case an existing userspace
> makes assumptions about the width of the weight field, and to sidestep any
> endianes issues.
> 
> The weight field is currently encoded as the weight value minus one,
> because weight of 0 is invalid. This same trick is impossible for the new
> weight_high field, because zero must mean actual zero. With this in place:
> 
> - Old userspace is guaranteed to carry weight_high of 0, therefore
>    configuring 8-bit weights as appropriate. When dumping nexthops with
>    16-bit weight, it would only show the lower 8 bits. But configuring such
>    nexthops implies existence of userspace aware of the extension in the
>    first place.
> 
> - New userspace talking to an old kernel will work as long as it only
>    attempts to configure 8-bit weights, where the high-order bits are zero.
>    Old kernel will bounce attempts at configuring >8-bit weights.
> 
> Renaming reserved fields as they are allocated for some purpose is commonly
> done in Linux. Whoever touches a reserved field is doing so at their own
> risk. nexthop_grp::resvd1 in particular is currently used by at least
> strace, however they carry an own copy of UAPI headers, and the conversion
> should be trivial. A helper is provided for decoding the weight out of the
> two fields. Forcing a conversion seems preferable to bending backwards and
> introducing anonymous unions or whatever.
> 
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>   include/net/nexthop.h        |  4 ++--
>   include/uapi/linux/nexthop.h |  7 ++++++-
>   net/ipv4/nexthop.c           | 37 ++++++++++++++++++++++--------------
>   3 files changed, 31 insertions(+), 17 deletions(-)
> 

> -		if (nhg[i].weight > 254) {
> +		if (nexthop_grp_weight(&nhg[i]) == 0) {
> +			/* 0xffff got passed in, representing weight of 0x10000,
> +			 * which is too heavy.
> +			 */
>   			NL_SET_ERR_MSG(extack, "Invalid value for weight");
>   			return -EINVAL;
>   		}

code is fine, and I like the decision to just extend the width
(instead of, say apply this +1 arithmetic only to lower byte, then just
add higher byte), so:

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
diff mbox series

Patch

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 68463aebcc05..d9fb44e8b321 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -105,7 +105,7 @@  struct nh_grp_entry_stats {
 struct nh_grp_entry {
 	struct nexthop	*nh;
 	struct nh_grp_entry_stats __percpu	*stats;
-	u8		weight;
+	u16		weight;
 
 	union {
 		struct {
@@ -192,7 +192,7 @@  struct nh_notifier_single_info {
 };
 
 struct nh_notifier_grp_entry_info {
-	u8 weight;
+	u16 weight;
 	struct nh_notifier_single_info nh;
 };
 
diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index 2ed643207847..3f869a8fc949 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -16,10 +16,15 @@  struct nhmsg {
 struct nexthop_grp {
 	__u32	id;	  /* nexthop id - must exist */
 	__u8	weight;   /* weight of this nexthop */
-	__u8	resvd1;
+	__u8	weight_high;	/* high order bits of weight */
 	__u16	resvd2;
 };
 
+static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry)
+{
+	return ((entry->weight_high << 8) | entry->weight) + 1;
+}
+
 enum {
 	NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
 				  * default type if not specified
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 23caa13bf24d..9db10d409074 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -872,6 +872,7 @@  static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
 	size_t len = nhg->num_nh * sizeof(*p);
 	struct nlattr *nla;
 	u16 group_type = 0;
+	u16 weight;
 	int i;
 
 	*resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0;
@@ -890,9 +891,12 @@  static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
 
 	p = nla_data(nla);
 	for (i = 0; i < nhg->num_nh; ++i) {
+		weight = nhg->nh_entries[i].weight - 1;
+
 		*p++ = (struct nexthop_grp) {
 			.id = nhg->nh_entries[i].nh->id,
-			.weight = nhg->nh_entries[i].weight - 1,
+			.weight = weight,
+			.weight_high = weight >> 8,
 		};
 	}
 
@@ -1286,11 +1290,14 @@  static int nh_check_attr_group(struct net *net,
 
 	nhg = nla_data(tb[NHA_GROUP]);
 	for (i = 0; i < len; ++i) {
-		if (nhg[i].resvd1 || nhg[i].resvd2) {
-			NL_SET_ERR_MSG(extack, "Reserved fields in nexthop_grp must be 0");
+		if (nhg[i].resvd2) {
+			NL_SET_ERR_MSG(extack, "Reserved field in nexthop_grp must be 0");
 			return -EINVAL;
 		}
-		if (nhg[i].weight > 254) {
+		if (nexthop_grp_weight(&nhg[i]) == 0) {
+			/* 0xffff got passed in, representing weight of 0x10000,
+			 * which is too heavy.
+			 */
 			NL_SET_ERR_MSG(extack, "Invalid value for weight");
 			return -EINVAL;
 		}
@@ -1886,9 +1893,9 @@  static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table)
 static void nh_res_group_rebalance(struct nh_group *nhg,
 				   struct nh_res_table *res_table)
 {
-	int prev_upper_bound = 0;
-	int total = 0;
-	int w = 0;
+	u16 prev_upper_bound = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	INIT_LIST_HEAD(&res_table->uw_nh_entries);
@@ -1898,11 +1905,12 @@  static void nh_res_group_rebalance(struct nh_group *nhg,
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u16 upper_bound;
+		u64 btw;
 
 		w += nhge->weight;
-		upper_bound = DIV_ROUND_CLOSEST(res_table->num_nh_buckets * w,
-						total);
+		btw = ((u64)res_table->num_nh_buckets) * w;
+		upper_bound = DIV_ROUND_CLOSEST_ULL(btw, total);
 		nhge->res.wants_buckets = upper_bound - prev_upper_bound;
 		prev_upper_bound = upper_bound;
 
@@ -1968,8 +1976,8 @@  static void replace_nexthop_grp_res(struct nh_group *oldg,
 
 static void nh_hthr_group_rebalance(struct nh_group *nhg)
 {
-	int total = 0;
-	int w = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	for (i = 0; i < nhg->num_nh; ++i)
@@ -1977,7 +1985,7 @@  static void nh_hthr_group_rebalance(struct nh_group *nhg)
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u32 upper_bound;
 
 		w += nhge->weight;
 		upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1;
@@ -2719,7 +2727,8 @@  static struct nexthop *nexthop_create_group(struct net *net,
 			goto out_no_nh;
 		}
 		nhg->nh_entries[i].nh = nhe;
-		nhg->nh_entries[i].weight = entry[i].weight + 1;
+		nhg->nh_entries[i].weight = nexthop_grp_weight(&entry[i]);
+
 		list_add(&nhg->nh_entries[i].nh_list, &nhe->grp_list);
 		nhg->nh_entries[i].nh_parent = nh;
 	}