Message ID | e0605ce114eb24323a05aaca1dcdb750b2e0329a.1722519021.git.petrm@nvidia.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: nexthop: Increase weight to u16 | expand |
On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote: > In CLOS networks, as link failures occur at various points in the network, > ECMP weights of the involved nodes are adjusted to compensate. With high > fan-out of the involved nodes, and overall high number of nodes, > a (non-)ECMP weight ratio that we would like to configure does not fit into > 8 bits. Instead of, say, 255:254, we might like to configure something like > 1000:999. For these deployments, the 8-bit weight may not be enough. > > To that end, in this patch increase the next hop weight from u8 to u16. > > Increasing the width of an integral type can be tricky, because while the > code still compiles, the types may not check out anymore, and numerical > errors come up. To prevent this, the conversion was done in two steps. > First the type was changed from u8 to a single-member structure, which > invalidated all uses of the field. This allowed going through them one by > one and audit for type correctness. Then the structure was replaced with a > vanilla u16 again. This should ensure that no place was missed. > > The UAPI for configuring nexthop group members is that an attribute > NHA_GROUP carries an array of struct nexthop_grp entries: > > struct nexthop_grp { > __u32 id; /* nexthop id - must exist */ > __u8 weight; /* weight of this nexthop */ > __u8 resvd1; > __u16 resvd2; > }; > > The field resvd1 is currently validated and required to be zero. We can > lift this requirement and carry high-order bits of the weight in the > reserved field: > > struct nexthop_grp { > __u32 id; /* nexthop id - must exist */ > __u8 weight; /* weight of this nexthop */ > __u8 weight_high; > __u16 resvd2; > }; > > Keeping the fields split this way was chosen in case an existing userspace > makes assumptions about the width of the weight field, and to sidestep any > endianes issues. nit: endianness ...
Simon Horman <horms@kernel.org> writes: > On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote: >> In CLOS networks, as link failures occur at various points in the network, >> ECMP weights of the involved nodes are adjusted to compensate. With high >> fan-out of the involved nodes, and overall high number of nodes, >> a (non-)ECMP weight ratio that we would like to configure does not fit into >> 8 bits. Instead of, say, 255:254, we might like to configure something like >> 1000:999. For these deployments, the 8-bit weight may not be enough. >> >> To that end, in this patch increase the next hop weight from u8 to u16. >> >> Increasing the width of an integral type can be tricky, because while the >> code still compiles, the types may not check out anymore, and numerical >> errors come up. To prevent this, the conversion was done in two steps. >> First the type was changed from u8 to a single-member structure, which >> invalidated all uses of the field. This allowed going through them one by >> one and audit for type correctness. Then the structure was replaced with a >> vanilla u16 again. This should ensure that no place was missed. >> >> The UAPI for configuring nexthop group members is that an attribute >> NHA_GROUP carries an array of struct nexthop_grp entries: >> >> struct nexthop_grp { >> __u32 id; /* nexthop id - must exist */ >> __u8 weight; /* weight of this nexthop */ >> __u8 resvd1; >> __u16 resvd2; >> }; >> >> The field resvd1 is currently validated and required to be zero. We can >> lift this requirement and carry high-order bits of the weight in the >> reserved field: >> >> struct nexthop_grp { >> __u32 id; /* nexthop id - must exist */ >> __u8 weight; /* weight of this nexthop */ >> __u8 weight_high; >> __u16 resvd2; >> }; >> >> Keeping the fields split this way was chosen in case an existing userspace >> makes assumptions about the width of the weight field, and to sidestep any >> endianes issues. > > nit: endianness Thanks, will fix for v2. For now I'm still waiting if there's other feedback.
On 8/1/24 18:23, Petr Machata wrote: > In CLOS networks, as link failures occur at various points in the network, > ECMP weights of the involved nodes are adjusted to compensate. With high > fan-out of the involved nodes, and overall high number of nodes, > a (non-)ECMP weight ratio that we would like to configure does not fit into > 8 bits. Instead of, say, 255:254, we might like to configure something like > 1000:999. For these deployments, the 8-bit weight may not be enough. > > To that end, in this patch increase the next hop weight from u8 to u16. > > Increasing the width of an integral type can be tricky, because while the > code still compiles, the types may not check out anymore, and numerical > errors come up. To prevent this, the conversion was done in two steps. > First the type was changed from u8 to a single-member structure, which > invalidated all uses of the field. This allowed going through them one by > one and audit for type correctness. Then the structure was replaced with a > vanilla u16 again. This should ensure that no place was missed. > > The UAPI for configuring nexthop group members is that an attribute > NHA_GROUP carries an array of struct nexthop_grp entries: > > struct nexthop_grp { > __u32 id; /* nexthop id - must exist */ > __u8 weight; /* weight of this nexthop */ > __u8 resvd1; > __u16 resvd2; > }; > > The field resvd1 is currently validated and required to be zero. We can > lift this requirement and carry high-order bits of the weight in the > reserved field: > > struct nexthop_grp { > __u32 id; /* nexthop id - must exist */ > __u8 weight; /* weight of this nexthop */ > __u8 weight_high; > __u16 resvd2; > }; > > Keeping the fields split this way was chosen in case an existing userspace > makes assumptions about the width of the weight field, and to sidestep any > endianes issues. > > The weight field is currently encoded as the weight value minus one, > because weight of 0 is invalid. This same trick is impossible for the new > weight_high field, because zero must mean actual zero. With this in place: > > - Old userspace is guaranteed to carry weight_high of 0, therefore > configuring 8-bit weights as appropriate. When dumping nexthops with > 16-bit weight, it would only show the lower 8 bits. But configuring such > nexthops implies existence of userspace aware of the extension in the > first place. > > - New userspace talking to an old kernel will work as long as it only > attempts to configure 8-bit weights, where the high-order bits are zero. > Old kernel will bounce attempts at configuring >8-bit weights. > > Renaming reserved fields as they are allocated for some purpose is commonly > done in Linux. Whoever touches a reserved field is doing so at their own > risk. nexthop_grp::resvd1 in particular is currently used by at least > strace, however they carry an own copy of UAPI headers, and the conversion > should be trivial. A helper is provided for decoding the weight out of the > two fields. Forcing a conversion seems preferable to bending backwards and > introducing anonymous unions or whatever. > > Signed-off-by: Petr Machata <petrm@nvidia.com> > Reviewed-by: Ido Schimmel <idosch@nvidia.com> > --- > include/net/nexthop.h | 4 ++-- > include/uapi/linux/nexthop.h | 7 ++++++- > net/ipv4/nexthop.c | 37 ++++++++++++++++++++++-------------- > 3 files changed, 31 insertions(+), 17 deletions(-) > > - if (nhg[i].weight > 254) { > + if (nexthop_grp_weight(&nhg[i]) == 0) { > + /* 0xffff got passed in, representing weight of 0x10000, > + * which is too heavy. > + */ > NL_SET_ERR_MSG(extack, "Invalid value for weight"); > return -EINVAL; > } code is fine, and I like the decision to just extend the width (instead of, say apply this +1 arithmetic only to lower byte, then just add higher byte), so: Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
diff --git a/include/net/nexthop.h b/include/net/nexthop.h index 68463aebcc05..d9fb44e8b321 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -105,7 +105,7 @@ struct nh_grp_entry_stats { struct nh_grp_entry { struct nexthop *nh; struct nh_grp_entry_stats __percpu *stats; - u8 weight; + u16 weight; union { struct { @@ -192,7 +192,7 @@ struct nh_notifier_single_info { }; struct nh_notifier_grp_entry_info { - u8 weight; + u16 weight; struct nh_notifier_single_info nh; }; diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index 2ed643207847..3f869a8fc949 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -16,10 +16,15 @@ struct nhmsg { struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ - __u8 resvd1; + __u8 weight_high; /* high order bits of weight */ __u16 resvd2; }; +static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry) +{ + return ((entry->weight_high << 8) | entry->weight) + 1; +} + enum { NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group * default type if not specified diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 23caa13bf24d..9db10d409074 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -872,6 +872,7 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, size_t len = nhg->num_nh * sizeof(*p); struct nlattr *nla; u16 group_type = 0; + u16 weight; int i; *resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0; @@ -890,9 +891,12 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh, p = nla_data(nla); for (i = 0; i < nhg->num_nh; ++i) { + weight = nhg->nh_entries[i].weight - 1; + *p++ = (struct nexthop_grp) { .id = nhg->nh_entries[i].nh->id, - .weight = nhg->nh_entries[i].weight - 1, + .weight = weight, + .weight_high = weight >> 8, }; } @@ -1286,11 +1290,14 @@ static int nh_check_attr_group(struct net *net, nhg = nla_data(tb[NHA_GROUP]); for (i = 0; i < len; ++i) { - if (nhg[i].resvd1 || nhg[i].resvd2) { - NL_SET_ERR_MSG(extack, "Reserved fields in nexthop_grp must be 0"); + if (nhg[i].resvd2) { + NL_SET_ERR_MSG(extack, "Reserved field in nexthop_grp must be 0"); return -EINVAL; } - if (nhg[i].weight > 254) { + if (nexthop_grp_weight(&nhg[i]) == 0) { + /* 0xffff got passed in, representing weight of 0x10000, + * which is too heavy. + */ NL_SET_ERR_MSG(extack, "Invalid value for weight"); return -EINVAL; } @@ -1886,9 +1893,9 @@ static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table) static void nh_res_group_rebalance(struct nh_group *nhg, struct nh_res_table *res_table) { - int prev_upper_bound = 0; - int total = 0; - int w = 0; + u16 prev_upper_bound = 0; + u32 total = 0; + u32 w = 0; int i; INIT_LIST_HEAD(&res_table->uw_nh_entries); @@ -1898,11 +1905,12 @@ static void nh_res_group_rebalance(struct nh_group *nhg, for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i]; - int upper_bound; + u16 upper_bound; + u64 btw; w += nhge->weight; - upper_bound = DIV_ROUND_CLOSEST(res_table->num_nh_buckets * w, - total); + btw = ((u64)res_table->num_nh_buckets) * w; + upper_bound = DIV_ROUND_CLOSEST_ULL(btw, total); nhge->res.wants_buckets = upper_bound - prev_upper_bound; prev_upper_bound = upper_bound; @@ -1968,8 +1976,8 @@ static void replace_nexthop_grp_res(struct nh_group *oldg, static void nh_hthr_group_rebalance(struct nh_group *nhg) { - int total = 0; - int w = 0; + u32 total = 0; + u32 w = 0; int i; for (i = 0; i < nhg->num_nh; ++i) @@ -1977,7 +1985,7 @@ static void nh_hthr_group_rebalance(struct nh_group *nhg) for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i]; - int upper_bound; + u32 upper_bound; w += nhge->weight; upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1; @@ -2719,7 +2727,8 @@ static struct nexthop *nexthop_create_group(struct net *net, goto out_no_nh; } nhg->nh_entries[i].nh = nhe; - nhg->nh_entries[i].weight = entry[i].weight + 1; + nhg->nh_entries[i].weight = nexthop_grp_weight(&entry[i]); + list_add(&nhg->nh_entries[i].nh_list, &nhe->grp_list); nhg->nh_entries[i].nh_parent = nh; }