
[iproute2] iplink: work around rtattr length limits for IFLA_VFINFO_LIST

Message ID 20210115225950.18762-1-edwin.peer@broadcom.com (mailing list archive)
State Rejected
Delegated to: David Ahern
Series [iproute2] iplink: work around rtattr length limits for IFLA_VFINFO_LIST

Checks

Context                Check    Description
netdev/tree_selection  success  Not a local patch

Commit Message

Edwin Peer Jan. 15, 2021, 10:59 p.m. UTC
The maximum possible length of an RTNL attribute is 64KB, but the
nested VFINFO list exceeds this for more than about 220 VFs (each VF
consumes approximately 300 bytes, depending on alignment and optional
fields). Exceeding the limit causes IFLA_VFINFO_LIST's length to wrap
modulo 16 bits in the kernel's nla_nest_end().
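
The arithmetic behind the wrap can be sketched as follows. This is only an illustration under assumed numbers: the 300 bytes per VF is the approximate figure quoted above, the 4-byte nest header is an assumption, and all function names are hypothetical (none of them exist in the kernel or in iproute2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed figures: ~300 bytes per VF (from the text above), 4-byte header. */
#define APPROX_VF_BYTES 300u
#define NEST_HDR_BYTES    4u

/* True size in bytes of a nested list holding num_vf entries. */
size_t vflist_real_len(unsigned int num_vf)
{
	return NEST_HDR_BYTES + (size_t)num_vf * APPROX_VF_BYTES;
}

/* What a 16-bit length field ends up holding for that size. */
uint16_t vflist_published_len(unsigned int num_vf)
{
	return (uint16_t)(vflist_real_len(num_vf) & 0xffffu);
}

/* Non-zero when the published length no longer matches the real one. */
int vflist_len_wrapped(unsigned int num_vf)
{
	return vflist_real_len(num_vf) != vflist_published_len(num_vf);
}

/* Self-check with the approximate numbers used in this sketch. */
void vflist_len_selfcheck(void)
{
	assert(!vflist_len_wrapped(218));         /* 65404 bytes still fits   */
	assert(vflist_len_wrapped(220));          /* 66004 bytes does not     */
	assert(vflist_published_len(220) == 468); /* 66004 - 65536            */
}
```

With these assumed sizes, a 220-VF list would advertise a length of only 468 bytes, so a length-driven parser would stop after the first VF.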

This patch is a horrible hack exploiting the fact that the full set
of attributes is actually present in the netlink packet, even though
the published length of the nested rtattr may be considerably shorter.
The total number of VFs is known, however, and can instead be used as
the basis for the iteration over the VFINFO list.
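
The difference between the two iteration strategies can be sketched with a simplified, hypothetical TLV layout standing in for struct rtattr. The names below (struct tlv, tlv_fill, count_by_length, count_by_number) are illustrative only and are not iproute2 APIs. Unlike the patch, the count-driven walk here is also bounded by the real buffer size, which is an extra safety assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rtattr: 16-bit length, 16-bit type. */
struct tlv {
	uint16_t len;   /* header + payload, in bytes */
	uint16_t type;
};

#define TLV_ALIGN(n)  (((n) + 3u) & ~3u)
#define TLV_HDRLEN    ((unsigned int)sizeof(struct tlv))

/* Read the length field of the entry starting at p (alignment-safe). */
static uint16_t tlv_len_at(const unsigned char *p)
{
	struct tlv h;

	memcpy(&h, p, sizeof(h));
	return h.len;
}

/* Fill buf with up to num entries of entry_len bytes; returns bytes used. */
size_t tlv_fill(unsigned char *buf, size_t cap, unsigned int num,
		uint16_t entry_len)
{
	size_t off = 0;
	unsigned int i;

	assert(entry_len >= TLV_HDRLEN);
	memset(buf, 0, cap);
	for (i = 0; i < num && off + TLV_ALIGN(entry_len) <= cap; i++) {
		struct tlv h = { entry_len, 1 };

		memcpy(buf + off, &h, sizeof(h));
		off += TLV_ALIGN(entry_len);
	}
	return off;
}

/* Length-driven walk, in the style of RTA_OK()/RTA_NEXT(). */
unsigned int count_by_length(const unsigned char *p, int rem)
{
	unsigned int n = 0;

	while (rem >= (int)TLV_HDRLEN) {
		uint16_t len = tlv_len_at(p);

		if (len < TLV_HDRLEN || len > rem)
			break;
		p += TLV_ALIGN(len);
		rem -= (int)TLV_ALIGN(len);
		n++;
	}
	return n;
}

/* Count-driven walk: visit num entries, bounded by the real buffer size. */
unsigned int count_by_number(const unsigned char *p, size_t buf_len,
			     unsigned int num)
{
	size_t off = 0;
	unsigned int n = 0;

	while (n < num && buf_len - off >= TLV_HDRLEN) {
		uint16_t len = tlv_len_at(p + off);

		if (len < TLV_HDRLEN || len > buf_len - off)
			break;
		off += TLV_ALIGN(len);
		if (off > buf_len)
			off = buf_len;
		n++;
	}
	return n;
}
```

Given a buffer of five 8-byte entries whose advertised length is only 16, count_by_length() sees two entries while count_by_number() sees all five, which is the effect the patch relies on.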

As ugly as this solution is, it does appear to be a reasonable and
practical compromise selected from a number of alternate approaches
that were considered and deemed worse or otherwise unworkable:

  - Extending the apparent maximum length of rtattr:

  To do this in a way that maintains ABI compatibility is easier said
  than done. Pushing the nested contents through deflate in response to
  a special request filter flag so that the data still fits within the
  64KB limit was considered (not entirely as crazy as this first sounds
  because there is a lot of redundancy in the data that would definitely
  compress well) as well as approaches based on providing new attribute
  types to pair with ATTR_TYPE_NESTED that extend its length in various
  ways (such as a "more" attribute or an extended attribute header with
  a wider length type). Ultimately these length extension ideas were
  rejected because the client parser APIs are expressed in terms of the
  base rtattr type, which cannot be extended cleanly without tacking on
  kludgy helpers or otherwise conducting major rework of client APIs.

  - Filtering based approaches:

  An obvious idea is to reduce the amount of data actually sent using
  filters. For example, by extending RTEXT_FILTER_SKIP_STATS to the VF
  stats, which make up a large proportion of the dump. But, the problem
  arises when it is the stats that are desired. One now either has to
  filter by VF when requesting full resolution data (i.e. fetch each VF
  separately) or one has to pick another subset of fields to exclude
  and stitch the results together in the client. But, the requests are
  not atomic and the VF configuration could have changed in the interim.
  This may be less of a concern when requesting a VF's entire data as
  a whole (at least the data would necessarily apply to the same VF),
  but even so there would then need to be a mechanism to select only
  the VFINFO of interest, which is particularly messy given that we're
  not requesting a top level object here and would involve extensions
  to an otherwise frozen VF query API (and still not be atomic).

  - API redesign:

  The clean solution is to decompose the API into smaller granularity
  requests and otherwise rethink the structure of netlink attributes in
  a V2 RTM_GETLINK redesign. Such ideas are all moot, however, because
  VF config has been punted to switchdev and any new work should happen
  there instead.

Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
---
 ip/ipaddress.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

Comments

Jakub Kicinski Jan. 15, 2021, 11:53 p.m. UTC | #1
On Fri, 15 Jan 2021 14:59:50 -0800 Edwin Peer wrote:
> The maximum possible length of an RTNL attribute is 64KB, but the
> nested VFINFO list exceeds this for more than about 220 VFs (each VF
> consumes approximately 300 bytes, depending on alignment and optional
> fields). Exceeding the limit causes IFLA_VFINFO_LIST's length to wrap
> modulo 16 bits in the kernel's nla_nest_end().

Let's add Michal to CC, my faulty memory tells me he was fighting with
this in the past.
Michal Kubecek Jan. 16, 2021, 9:12 p.m. UTC | #2
On Fri, Jan 15, 2021 at 03:53:25PM -0800, Jakub Kicinski wrote:
> On Fri, 15 Jan 2021 14:59:50 -0800 Edwin Peer wrote:
> > The maximum possible length of an RTNL attribute is 64KB, but the
> > nested VFINFO list exceeds this for more than about 220 VFs (each VF
> > consumes approximately 300 bytes, depending on alignment and optional
> > fields). Exceeding the limit causes IFLA_VFINFO_LIST's length to wrap
> > modulo 16 bits in the kernel's nla_nest_end().
> 
> Let's add Michal to CC, my faulty memory tells me he was fighting with
> this in the past.

I've been looking into this some time ago and even tried to open
a discussion on this topic two or three times but there didn't seem
sufficient interest.

My idea back then was to use a separate query which would allow getting
VF information using a dump request (one VF per message); the reply for
RTM_GETLINK request would either list all VFs as now if possible or only
as many as fit into a nested attribute and indicate that the information
is incomplete (or maybe omit the VF information in such case as
usefulness of the truncated list is questionable).

However, my take from the discussions was that most developers who took
part rather thought that there is no need for such rtnetlink feature as
there is a devlink interface which does not suffer from this limit and
NICs with so many VFs that IFLA_VFINFO_LIST exceeds 65535 bytes can
provide devlink interface to handle them.

In any case, while the idea of handling the malformed messages composed
by existing kernels makes sense, we should IMHO consider this a kernel
bug which should be fixed so that kernel does not reply with malformed
netlink messages (independently of whether this patch is applied to
iproute2 or not).

Michal
Jakub Kicinski Jan. 17, 2021, 1:21 a.m. UTC | #3
On Sat, 16 Jan 2021 22:12:23 +0100 Michal Kubecek wrote:
> On Fri, Jan 15, 2021 at 03:53:25PM -0800, Jakub Kicinski wrote:
> > On Fri, 15 Jan 2021 14:59:50 -0800 Edwin Peer wrote:  
> > > The maximum possible length of an RTNL attribute is 64KB, but the
> > > nested VFINFO list exceeds this for more than about 220 VFs (each VF
> > > consumes approximately 300 bytes, depending on alignment and optional
> > > fields). Exceeding the limit causes IFLA_VFINFO_LIST's length to wrap
> > > modulo 16 bits in the kernel's nla_nest_end().  
> > 
> > Let's add Michal to CC, my faulty memory tells me he was fighting with
> > this in the past.  
> 
> I've been looking into this some time ago and even tried to open
> a discussion on this topic two or three times but there didn't seem
> sufficient interest.
> 
> My idea back then was to use a separate query which would allow getting
> VF information using a dump request (one VF per message); the reply for
> RTM_GETLINK request would either list all VFs as now if possible or only
> as many as fit into a nested attribute and indicate that the information
> is incomplete (or maybe omit the VF information in such case as
> usefulness of the truncated list is questionable).
> 
> However, my take from the discussions was that most developers who took
> part rather thought that there is no need for such rtnetlink feature as
> there is a devlink interface which does not suffer from this limit and
> NICs with so many VFs that IFLA_VFINFO_LIST exceeds 65535 bytes can
> provide devlink interface to handle them.

Indeed, that's still my position. AFAICT the options of "fixing" this
interface are rather limited and we don't want to perpetuate the
legacy-ndo-based method of configuring VFs - so reimplementation is
not appealing.

One way of working around the 64k limit we discussed with Edwin was
filtering attributes, effectively doing two dumps each with different
filtering flags, so that each one fits (e.g. dropping stats in one and
MAC addresses in the other).

> In any case, while the idea of handling the malformed messages composed
> by existing kernels makes sense, we should IMHO consider this a kernel
> bug which should be fixed so that kernel does not reply with malformed
> netlink messages (independently of whether this patch is applied to
> iproute2 or not).

I wonder. There is something inherently risky about making
a precedent for user space depending on invalid kernel output.

_If_ we want to fix the kernel, IMO we should only fix the kernel.
David Ahern Jan. 18, 2021, 3:48 a.m. UTC | #4
On 1/16/21 6:21 PM, Jakub Kicinski wrote:
> 
> I wonder. There is something inherently risky about making
> a precedent for user space depending on invalid kernel output.
> 
> _If_ we want to fix the kernel, IMO we should only fix the kernel.
> 

IMHO this is a kernel bug that should be fixed. An easy fix is to check the
overflow in nla_nest_end and return an error. Sadly, the nla_nest_end return
code is ignored and backporting any change to fix that will be a
nightmare. A warning will identify places that need to be fixed.

We can at least catch and fix this overflow which is by far the primary
known victim of the rollover.
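
A minimal userspace sketch of the kind of overflow check described above. This is not the kernel's nla_nest_end(); struct nest_hdr and nest_end_checked() are hypothetical names, and the choice of -EMSGSIZE as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a nested attribute header. */
struct nest_hdr {
	uint16_t len;
	uint16_t type;
};

/*
 * Close a nest whose real size is real_len bytes. Sizes that do not fit
 * in the 16-bit length field are rejected instead of being truncated,
 * and the header is left untouched in that case.
 */
int nest_end_checked(struct nest_hdr *nest, size_t real_len)
{
	assert(nest != NULL);
	if (real_len > UINT16_MAX)
		return -EMSGSIZE;
	nest->len = (uint16_t)real_len;
	return 0;
}
```

The point of returning an error rather than storing a wrapped value is that the caller can then decide what to do, provided it actually checks the return code.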
Edwin Peer Jan. 18, 2021, 5:31 p.m. UTC | #5
On Sat, Jan 16, 2021 at 5:21 PM Jakub Kicinski <kuba@kernel.org> wrote:

> I wonder. There is something inherently risky about making
> a precedent for user space depending on invalid kernel output.

In this instance, it's not depending on the invalid output at all.
Rather, the patch depends on a different and valid output that imparts
equivalent information. I do hear the concern though.

Regards,
Edwin Peer
Edwin Peer Jan. 18, 2021, 5:34 p.m. UTC | #6
On Sun, Jan 17, 2021 at 7:48 PM David Ahern <dsahern@gmail.com> wrote:

> IMHO this is a kernel bug that should be fixed. An easy fix to check the
> overflow in nla_nest_end and return an error. Sadly, nla_nest_end return
> code is ignored and backporting any change to fix that will be
> nightmare. A warning will identify places that need to be fixed.

Assuming we fix nla_nest_end() and error in some way, how does that
assist iproute2?

Regards,
Edwin Peer
David Ahern Jan. 18, 2021, 5:36 p.m. UTC | #7
On 1/18/21 10:34 AM, Edwin Peer wrote:
> On Sun, Jan 17, 2021 at 7:48 PM David Ahern <dsahern@gmail.com> wrote:
> 
>> IMHO this is a kernel bug that should be fixed. An easy fix to check the
>> overflow in nla_nest_end and return an error. Sadly, nla_nest_end return
>> code is ignored and backporting any change to fix that will be
>> nightmare. A warning will identify places that need to be fixed.
> 
> Assuming we fix nla_nest_end() and error in some way, how does that
> assist iproute2?
> 

I don't follow. The kernel is sending a malformed message; userspace
should not be guessing at how to interpret it.
Edwin Peer Jan. 18, 2021, 5:37 p.m. UTC | #8
On Sat, Jan 16, 2021 at 1:12 PM Michal Kubecek <mkubecek@suse.cz> wrote:

> My idea back then was to use a separate query which would allow getting
> VF information using a dump request (one VF per message); the reply for
> RTM_GETLINK request would either list all VFs as now if possible or only
> as many as fit into a nested attribute and indicate that the information
> is incomplete (or maybe omit the VF information in such case as
> usefulness of the truncated list is questionable).

Yip, that would probably be the right way to fix it if we can fix it
(falls into the 3rd approach mentioned in the patch).

> However, my take from the discussions was that most developers who took
> part rather thought that there is no need for such rtnetlink feature as
> there is a devlink interface which does not suffer from this limit and
> NICs with so many VFs that IFLA_VFINFO_LIST exceeds 65535 bytes can
> provide devlink interface to handle them.

Does that imply reworking ip link to use devlink interfaces as the fix?

Regards,
Edwin Peer
Edwin Peer Jan. 18, 2021, 5:42 p.m. UTC | #9
On Mon, Jan 18, 2021 at 9:36 AM David Ahern <dsahern@gmail.com> wrote:

> > Assuming we fix nla_nest_end() and error in some way, how does that
> > assist iproute2?
>
> I don't follow. The kernel is sending a malformed message; userspace
> should not be guessing at how to interpret it.

The user isn't going to care about this technicality. If the kernel
errors out here, then the user sees zero VFs when adding one more VF.
That's still a bug, even though the malformed message is fixed. An API
bug is still a bug, except we seemingly can't fix it because it's
deprecated.

Regards,
Edwin Peer
David Ahern Jan. 18, 2021, 5:49 p.m. UTC | #10
On 1/18/21 10:42 AM, Edwin Peer wrote:
> On Mon, Jan 18, 2021 at 9:36 AM David Ahern <dsahern@gmail.com> wrote:
> 
>>> Assuming we fix nla_nest_end() and error in some way, how does that
>>> assist iproute2?
>>
>> I don't follow. The kernel is sending a malformed message; userspace
>> should not be guessing at how to interpret it.
> 
> The user isn't going to care about this technicality. If the kernel
> errors out here, then the user sees zero VFs when adding one more VF.
> That's still a bug, even though the malformed message is fixed. An API
> bug is still a bug, except we seemingly can't fix it because it's
> deprecated.
> 

Different bug, different solution required. The networking stack hits
these kind of scalability problems from time to time with original
uapis, so workarounds are needed. One example is rtmsg which only allows
255 routing tables, so RTA_TABLE attribute was added as a u32. Once a
solution is found for the VF problem, iproute2 can be enhanced to
accommodate.
Edwin Peer Jan. 18, 2021, 6:20 p.m. UTC | #11
On Mon, Jan 18, 2021 at 9:49 AM David Ahern <dsahern@gmail.com> wrote:

> Different bug, different solution required. The networking stack hits
> these kind of scalability problems from time to time with original
> uapis, so workarounds are needed. One example is rtmsg which only allows
> 255 routing tables, so RTA_TABLE attribute was added as a u32. Once a
> solution is found for the VF problem, iproute2 can be enhanced to
> accommodate.

The problem is even worse, because user space already depends on the
broken behavior. Erroring out will cause the whole ip link show
command to fail, which works today. Even though the VF list is bust,
the rest of the netdevs are still dumped correctly. A hard fail would
break those too.

Regards,
Edwin Peer
Michal Kubecek Jan. 18, 2021, 6:30 p.m. UTC | #12
On Mon, Jan 18, 2021 at 10:20:35AM -0800, Edwin Peer wrote:
> On Mon, Jan 18, 2021 at 9:49 AM David Ahern <dsahern@gmail.com> wrote:
> 
> > Different bug, different solution required. The networking stack hits
> > these kind of scalability problems from time to time with original
> > uapis, so workarounds are needed. One example is rtmsg which only allows
> > 255 routing tables, so RTA_TABLE attribute was added as a u32. Once a
> > solution is found for the VF problem, iproute2 can be enhanced to
> > accommodate.
> 
> The problem is even worse, because user space already depends on the
> broken behavior. Erroring out will cause the whole ip link show
> command to fail, which works today. Even though the VF list is bust,
> the rest of the netdevs are still dumped correctly. A hard fail would
> break those too.

We could cut the list just before overflowing and inform userspace that
the list is incomplete. Not perfect but there is no perfect solution
which would not require userspace changes to work properly for devices
with "too many" VFs.

Michal
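
The truncation idea above can be sketched as a small helper that decides how many entries fit and reports whether the list was cut. This is only an illustration: entries_that_fit() is a hypothetical name, the 4-byte nest header is an assumption, and how the "incomplete" indication would be conveyed to userspace is left open here, as it is in the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEST_HDR_BYTES 4u   /* assumed size of the nested attribute header */

/*
 * How many of num entries (entry_len bytes each) fit in a nest whose
 * 16-bit length must stay <= UINT16_MAX. Sets *truncated to non-zero
 * when some entries have to be dropped.
 */
unsigned int entries_that_fit(unsigned int num, size_t entry_len,
			      int *truncated)
{
	size_t room = UINT16_MAX - NEST_HDR_BYTES;
	size_t max_entries;

	assert(entry_len > 0 && truncated != NULL);
	max_entries = room / entry_len;
	*truncated = num > max_entries;
	return *truncated ? (unsigned int)max_entries : num;
}
```

With the roughly 300 bytes per VF quoted in the commit message, a request for 220 VFs would be cut to 218 with the truncated flag set, while 100 VFs would pass through unchanged.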

Patch

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 571346b15cc3..3be61f49204c 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1198,13 +1198,13 @@  int print_linkinfo(struct nlmsghdr *n, void *arg)
 	}
 
 	if ((do_link || show_details) && tb[IFLA_VFINFO_LIST] && tb[IFLA_NUM_VF]) {
-		struct rtattr *i, *vflist = tb[IFLA_VFINFO_LIST];
-		int rem = RTA_PAYLOAD(vflist);
+		struct rtattr *vf = RTA_DATA(tb[IFLA_VFINFO_LIST]);
+		int i, ignore = 0, num_vf = rta_getattr_u32(tb[IFLA_NUM_VF]);
 
 		open_json_array(PRINT_JSON, "vfinfo_list");
-		for (i = RTA_DATA(vflist); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+		for (i = 0; i < num_vf; vf = RTA_NEXT(vf, ignore), i++) {
 			open_json_object(NULL);
-			print_vfinfo(fp, ifi, i);
+			print_vfinfo(fp, ifi, vf);
 			close_json_object();
 		}
 		close_json_array(PRINT_JSON, NULL);
@@ -2157,22 +2157,20 @@  out:
 static void
 ipaddr_loop_each_vf(struct rtattr *tb[], int vfnum, int *min, int *max)
 {
-	struct rtattr *vflist = tb[IFLA_VFINFO_LIST];
-	struct rtattr *i, *vf[IFLA_VF_MAX+1];
+	int i, ignore = 0, num_vf = rta_getattr_u32(tb[IFLA_NUM_VF]);
+	struct rtattr *vf = RTA_DATA(tb[IFLA_VFINFO_LIST]);
+	struct rtattr *vf_tb[IFLA_VF_MAX+1];
 	struct ifla_vf_rate *vf_rate;
-	int rem;
 
-	rem = RTA_PAYLOAD(vflist);
+	for (i = 0; i < num_vf; vf = RTA_NEXT(vf, ignore), i++) {
+		parse_rtattr_nested(vf_tb, IFLA_VF_MAX, vf);
 
-	for (i = RTA_DATA(vflist); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
-		parse_rtattr_nested(vf, IFLA_VF_MAX, i);
-
-		if (!vf[IFLA_VF_RATE]) {
+		if (!vf_tb[IFLA_VF_RATE]) {
 			fprintf(stderr, "VF min/max rate API not supported\n");
 			exit(1);
 		}
 
-		vf_rate = RTA_DATA(vf[IFLA_VF_RATE]);
+		vf_rate = RTA_DATA(vf_tb[IFLA_VF_RATE]);
 		if (vf_rate->vf == vfnum) {
 			*min = vf_rate->min_tx_rate;
 			*max = vf_rate->max_tx_rate;