Message ID | 20240303052408.310064-4-kuba@kernel.org (mailing list archive) |
---|---|
State | Accepted |
Commit | 87d381973e49404f658d6923a617932eeda9415f |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | netlink: handle EMSGSIZE errors in the core | expand |
On Sat, Mar 02, 2024 at 09:24:08PM -0800, Jakub Kicinski wrote: > Make sure ctrl_fill_info() returns sensible error codes and > propagate them out to netlink core. Let netlink core decide > when to return skb->len and when to treat the exit as an > error. Netlink core does better job at it, if we always > return skb->len the core doesn't know when we're done > dumping and NLMSG_DONE ends up in a separate read(). > > Reviewed-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Hi, On Sat, 2 Mar 2024 21:24:08 -0800 Jakub Kicinski <kuba@kernel.org> wrote: > Make sure ctrl_fill_info() returns sensible error codes and > propagate them out to netlink core. Let netlink core decide > when to return skb->len and when to treat the exit as an > error. Netlink core does better job at it, if we always > return skb->len the core doesn't know when we're done > dumping and NLMSG_DONE ends up in a separate read(). While this change is obviously correct, it breaks... well, broken applications that _wrongly_ rely on the fact that NLMSG_DONE is delivered in a separate datagram. This was the (embarrassing) case for passt(1), which I just fixed: https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/ but the "separate" NLMSG_DONE is such an established behaviour, I think, that this might raise a more general concern. From my perspective, I'm just happy that this change revealed the issue, but I wanted to report this anyway in case somebody has similar possible breakages in mind. > Reviewed-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > CC: jiri@resnulli.us > --- > net/netlink/genetlink.c | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c > index 50ec599a5cff..3b7666944b11 100644 > --- a/net/netlink/genetlink.c > +++ b/net/netlink/genetlink.c > @@ -1232,7 +1232,7 @@ static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq, > > hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd); > if (hdr == NULL) > - return -1; > + return -EMSGSIZE; > > if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) || > nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) || > @@ -1355,6 +1355,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) > struct net *net = sock_net(skb->sk); > int fams_to_skip = cb->args[0]; > unsigned int id; > + int err = 0; > > idr_for_each_entry(&genl_fam_idr, rt, id) { > if (!rt->netnsok && !net_eq(net, &init_net)) > @@ -1363,16 +1364,17 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) > if (n++ < fams_to_skip) > continue; > > - if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid, > - cb->nlh->nlmsg_seq, NLM_F_MULTI, > - skb, CTRL_CMD_NEWFAMILY) < 0) { > + err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid, > + cb->nlh->nlmsg_seq, NLM_F_MULTI, > + skb, CTRL_CMD_NEWFAMILY); > + if (err) { > n--; > break; > } > } > > cb->args[0] = n; > - return skb->len; > + return err; > } > > static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,
On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote: > > Make sure ctrl_fill_info() returns sensible error codes and > > propagate them out to netlink core. Let netlink core decide > > when to return skb->len and when to treat the exit as an > > error. Netlink core does better job at it, if we always > > return skb->len the core doesn't know when we're done > > dumping and NLMSG_DONE ends up in a separate read(). > > While this change is obviously correct, it breaks... well, broken > applications that _wrongly_ rely on the fact that NLMSG_DONE is > delivered in a separate datagram. > > This was the (embarrassing) case for passt(1), which I just fixed: > https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/ > > but the "separate" NLMSG_DONE is such an established behaviour, > I think, that this might raise a more general concern. > > From my perspective, I'm just happy that this change revealed the > issue, but I wanted to report this anyway in case somebody has > similar possible breakages in mind. Hi Stefano! I was worried this may happen :( I think we should revert offending commits, but I'd like to take it on case by case basis. I'd imagine majority of netlink is only exercised by iproute2 and libmnl-based tools. Does passt hang specifically on genetlink family dump? Your commit also mentions RTM_GETROUTE. This is not the only commit which removed DONE: $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection 6647b338fc5c netlink: fix netlink_diag_dump() return value
On Tue, Mar 19, 2024 at 4:55 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote: > > > Make sure ctrl_fill_info() returns sensible error codes and > > > propagate them out to netlink core. Let netlink core decide > > > when to return skb->len and when to treat the exit as an > > > error. Netlink core does better job at it, if we always > > > return skb->len the core doesn't know when we're done > > > dumping and NLMSG_DONE ends up in a separate read(). > > > > While this change is obviously correct, it breaks... well, broken > > applications that _wrongly_ rely on the fact that NLMSG_DONE is > > delivered in a separate datagram. > > > > This was the (embarrassing) case for passt(1), which I just fixed: > > https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/ > > > > but the "separate" NLMSG_DONE is such an established behaviour, > > I think, that this might raise a more general concern. > > > > From my perspective, I'm just happy that this change revealed the > > issue, but I wanted to report this anyway in case somebody has > > similar possible breakages in mind. > > Hi Stefano! I was worried this may happen :( I think we should revert > offending commits, but I'd like to take it on case by case basis. > I'd imagine majority of netlink is only exercised by iproute2 and > libmnl-based tools. Does passt hang specifically on genetlink family > dump? Your commit also mentions RTM_GETROUTE. This is not the only > commit which removed DONE: > > $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline > > 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() > 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families > 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection > 6647b338fc5c netlink: fix netlink_diag_dump() return value Lets not bring back more RTNL locking please for the handlers that still require it. The core can generate an NLMSG_DONE by itself, if we decide this needs to be done. I find this discussion a bit strange, because NLMSG_DONE being piggybacked has been a long established behavior. Jason patch was from 2017
On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote: > > Hi Stefano! I was worried this may happen :( I think we should revert > > offending commits, but I'd like to take it on case by case basis. > > I'd imagine majority of netlink is only exercised by iproute2 and > > libmnl-based tools. Does passt hang specifically on genetlink family > > dump? Your commit also mentions RTM_GETROUTE. This is not the only > > commit which removed DONE: > > > > $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline > > > > 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() > > 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families > > 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection > > 6647b338fc5c netlink: fix netlink_diag_dump() return value > > Lets not bring back more RTNL locking please for the handlers that > still require it. Definitely. My git log copy/paste is pretty inaccurate, these two are better examples: 5d9b7cb383bb nexthop: Simplify dump error handling 02e24903e5a4 netlink: let core handle error cases in dump operations I was trying to point out that we merged a handful of DONE "coalescing" patches, and if we need to revert - let's only do that for the exact commands needed. The comment was raised on my genetlink patch while the discussion in the link points to RTM_GETROUTE. > The core can generate an NLMSG_DONE by itself, if we decide this needs > to be done. Exactly.
On Tue, Mar 19, 2024 at 08:55:45AM -0700, Jakub Kicinski wrote: > On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote: > > > Make sure ctrl_fill_info() returns sensible error codes and > > > propagate them out to netlink core. Let netlink core decide > > > when to return skb->len and when to treat the exit as an > > > error. Netlink core does better job at it, if we always > > > return skb->len the core doesn't know when we're done > > > dumping and NLMSG_DONE ends up in a separate read(). > > > > While this change is obviously correct, it breaks... well, broken > > applications that _wrongly_ rely on the fact that NLMSG_DONE is > > delivered in a separate datagram. > > > > This was the (embarrassing) case for passt(1), which I just fixed: > > https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/ > > > > but the "separate" NLMSG_DONE is such an established behaviour, > > I think, that this might raise a more general concern. > > > > From my perspective, I'm just happy that this change revealed the > > issue, but I wanted to report this anyway in case somebody has > > similar possible breakages in mind. > > Hi Stefano! I was worried this may happen :( I think we should revert > offending commits, but I'd like to take it on case by case basis. > I'd imagine majority of netlink is only exercised by iproute2 and > libmnl-based tools. Does passt hang specifically on genetlink family > dump? Your commit also mentions RTM_GETROUTE. This is not the only > commit which removed DONE: I don't think there's anything specirfic to RTM_GETROUTE here from the kernel side. We've looked at the problem in passt more closely now, and it turns out we handled a merged NLMSG_DONE correctly in most cases. For various reasons internal to passt, our handling of RTM_GETROUTE on one path is more complex, and we had a subtle error there which broke the handling of a merged NLMSG_DONE. > > $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline > > 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() > 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families > 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection > 6647b338fc5c netlink: fix netlink_diag_dump() return value >
On 19/03/2024 19:40, Jakub Kicinski wrote: > On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote: >>> Hi Stefano! I was worried this may happen :( I think we should revert >>> offending commits, but I'd like to take it on case by case basis. >>> I'd imagine majority of netlink is only exercised by iproute2 and >>> libmnl-based tools. Does passt hang specifically on genetlink family >>> dump? Your commit also mentions RTM_GETROUTE. This is not the only >>> commit which removed DONE: >>> >>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline >>> >>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() >>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families >>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection >>> 6647b338fc5c netlink: fix netlink_diag_dump() return value >> >> Lets not bring back more RTNL locking please for the handlers that >> still require it. > > Definitely. My git log copy/paste is pretty inaccurate, these two are > better examples: > > 5d9b7cb383bb nexthop: Simplify dump error handling > 02e24903e5a4 netlink: let core handle error cases in dump operations > > I was trying to point out that we merged a handful of DONE "coalescing" > patches, and if we need to revert - let's only do that for the exact > commands needed. The comment was raised on my genetlink patch while > the discussion in the link points to RTM_GETROUTE. > >> The core can generate an NLMSG_DONE by itself, if we decide this needs >> to be done. > > Exactly. > We've encountered a new issue recently which I believe is related to this discussion. Following Eric's patch: 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()") Setting the interface mtu to < 1280 results in 'ip addr show eth2' returning an error, because the ipv6 dump fails. This is a degradation from the user's perspective. # ip addr show eth2 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff altname enp6s0f0np0 # ip link set dev eth2 mtu 1000 # ip addr show eth2 RTNETLINK answers: No such device Dump terminated
On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote: > We've encountered a new issue recently which I believe is related to > this discussion. > > Following Eric's patch: > 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()") > > Setting the interface mtu to < 1280 results in 'ip addr show eth2' > returning an error, because the ipv6 dump fails. This is a degradation > from the user's perspective. > > # ip addr show eth2 > 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group > default qlen 1000 > link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff > altname enp6s0f0np0 > # ip link set dev eth2 mtu 1000 > # ip addr show eth2 > RTNETLINK answers: No such device > Dump terminated I don't think it's the same issue. Original issue was about user space not knowing how to handle NLMSG_DONE being sent together with dump responses. The issue you reported seems to be related to an unintentional change in the return code when IPv6 is disabled on an interface. Can you please test the following patch? diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 247bd4d8ee45..92db9b474f2b 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb, err = 0; if (fillargs.ifindex) { - err = -ENODEV; dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex); - if (!dev) + if (!dev) { + err = -ENODEV; goto done; + } idev = __in6_dev_get(dev); if (idev) err = in6_dump_addrs(idev, skb, cb,
On 21/03/2024 15:51, Ido Schimmel wrote: > On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote: >> We've encountered a new issue recently which I believe is related to >> this discussion. >> >> Following Eric's patch: >> 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()") >> >> Setting the interface mtu to < 1280 results in 'ip addr show eth2' >> returning an error, because the ipv6 dump fails. This is a degradation >> from the user's perspective. >> >> # ip addr show eth2 >> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group >> default qlen 1000 >> link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff >> altname enp6s0f0np0 >> # ip link set dev eth2 mtu 1000 >> # ip addr show eth2 >> RTNETLINK answers: No such device >> Dump terminated > > I don't think it's the same issue. Original issue was about user space > not knowing how to handle NLMSG_DONE being sent together with dump > responses. The issue you reported seems to be related to an > unintentional change in the return code when IPv6 is disabled on an > interface. Can you please test the following patch? > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index 247bd4d8ee45..92db9b474f2b 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb, > > err = 0; > if (fillargs.ifindex) { > - err = -ENODEV; > dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex); > - if (!dev) > + if (!dev) { > + err = -ENODEV; > goto done; > + } > idev = __in6_dev_get(dev); > if (idev) > err = in6_dump_addrs(idev, skb, cb, This seems to fix it, thanks! Will you submit a patch?
On Thu, Mar 21, 2024 at 2:51 PM Ido Schimmel <idosch@idosch.org> wrote: > > On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote: > > We've encountered a new issue recently which I believe is related to > > this discussion. > > > > Following Eric's patch: > > 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()") > > > > Setting the interface mtu to < 1280 results in 'ip addr show eth2' > > returning an error, because the ipv6 dump fails. This is a degradation > > from the user's perspective. > > > > # ip addr show eth2 > > 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group > > default qlen 1000 > > link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff > > altname enp6s0f0np0 > > # ip link set dev eth2 mtu 1000 > > # ip addr show eth2 > > RTNETLINK answers: No such device > > Dump terminated > > I don't think it's the same issue. Original issue was about user space > not knowing how to handle NLMSG_DONE being sent together with dump > responses. The issue you reported seems to be related to an > unintentional change in the return code when IPv6 is disabled on an > interface. Can you please test the following patch? > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index 247bd4d8ee45..92db9b474f2b 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb, > > err = 0; > if (fillargs.ifindex) { > - err = -ENODEV; > dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex); > - if (!dev) > + if (!dev) { > + err = -ENODEV; > goto done; > + } > idev = __in6_dev_get(dev); > if (idev) > err = in6_dump_addrs(idev, skb, cb, Good catch, thanks ! Not sure why ip command is so picky about the error code here. There is no IPv6 on this device after all. The following seems to react quite differently : # ip addr show lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host proto kernel_lo valid_lft forever preferred_lft forever # ip link set dev lo mtu 1000 # ip addr show lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever Reviewed-by: Eric Dumazet <edumazet@google.com>
On Thu, Mar 21, 2024 at 06:26:31PM +0100, Eric Dumazet wrote: > The following seems to react quite differently : > > # ip addr show lo > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN > group default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo > valid_lft forever preferred_lft forever > inet6 ::1/128 scope host proto kernel_lo > valid_lft forever preferred_lft forever > # ip link set dev lo mtu 1000 > # ip addr show lo > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN > group default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo > valid_lft forever preferred_lft forever Yes, for loopback the NETDEV_CHANGEMTU event is treated as NETDEV_DOWN rather than NETDEV_UNREGISTER: if (dev->mtu < IPV6_MIN_MTU) { addrconf_ifdown(dev, dev != net->loopback_dev); break; } vim net/ipv6/addrconf.c +3656 > Reviewed-by: Eric Dumazet <edumazet@google.com> Sorry Eric, already sent the patch without your tag: https://lore.kernel.org/netdev/20240321173042.2151756-1-idosch@nvidia.com/
On 3/19/24 18:40, Jakub Kicinski wrote: > On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote: >>> Hi Stefano! I was worried this may happen :( I think we should revert >>> offending commits, but I'd like to take it on case by case basis. >>> I'd imagine majority of netlink is only exercised by iproute2 and >>> libmnl-based tools. Does passt hang specifically on genetlink family >>> dump? Your commit also mentions RTM_GETROUTE. This is not the only >>> commit which removed DONE: >>> >>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline >>> >>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr() >>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families >>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection >>> 6647b338fc5c netlink: fix netlink_diag_dump() return value >> >> Lets not bring back more RTNL locking please for the handlers that >> still require it. > > Definitely. My git log copy/paste is pretty inaccurate, these two are > better examples: > > 5d9b7cb383bb nexthop: Simplify dump error handling > 02e24903e5a4 netlink: let core handle error cases in dump operations > > I was trying to point out that we merged a handful of DONE "coalescing" > patches, and if we need to revert - let's only do that for the exact > commands needed. The comment was raised on my genetlink patch while > the discussion in the link points to RTM_GETROUTE. > >> The core can generate an NLMSG_DONE by itself, if we decide this needs >> to be done. > > Exactly. FWIW, it seems that Libreswan is suffering from the same issue on RTM_GETROUTE dump. On 6.9.0-rc1 I see: /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \ --start --asynchronous tun-in-1 recvfrom(7, [ [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], ... [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...] ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) recvfrom(7, <-- Stuck here forever On 6.8.0 the output is following: recvfrom(7, [ [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], ... [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...] ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) recvfrom(7, [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI,}, 0], 40728, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) close(7) So, it seems like it explicitly waits for NLMSG_DONE to be in a separate message. I reported the issue to Libreswan: https://github.com/libreswan/libreswan/issues/1675 but just wanted to let you know as well. Found this since it breaks IPsec system tests in Open vSwitch. Best regards, Ilya Maximets.
On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote: > /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \ > --start --asynchronous tun-in-1 > > recvfrom(7, > [ > [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], > ... > [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], > [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...] > ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) > > recvfrom(7, <-- Stuck here forever I think we should probably fix this.. Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() - it does seem to coalesce DONE..
On 4/11/24 17:16, Jakub Kicinski wrote: > On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote: >> /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \ >> --start --asynchronous tun-in-1 >> >> recvfrom(7, >> [ >> [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], >> ... >> [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...], >> [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...] >> ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) >> >> recvfrom(7, <-- Stuck here forever > > I think we should probably fix this.. > Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() - > it does seem to coalesce DONE.. Sure, the whole sequence looks like this in strace: socket(AF_NETLINK, SOCK_DGRAM|SOCK_CLOEXEC, NETLINK_ROUTE) = 7 sendto(7, [ { nlmsg_len=36, nlmsg_type=0x1a /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=138040 }, "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02" ], 36, 0, NULL, 0) = 36 recvfrom(7, [ [ {nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_BOOT, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [ [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.1.1.2")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")] ] ], [ {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, {rtm_family=AF_INET, rtm_dst_len=24, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [ [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.0")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")] ] ], [ {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0}, [ [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.1")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")] ] ], [ {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_BROADCAST, rtm_flags=0}, [ [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.255")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")] ] ], [ {nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, 0 ] ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) = 252 recvfrom(7, <unfinished ...> nlmsg_type=0x1a /* NLMSG_??? */ --> RTM_GETROUTE nlmsg_flags=NLM_F_REQUEST|0x300 --> NLM_F_REQUEST|NLM_F_DUMP Best regards, Ilya Maximets.
On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote: > nlmsg_type=0x1a /* NLMSG_??? */ --> RTM_GETROUTE > nlmsg_flags=NLM_F_REQUEST|0x300 --> NLM_F_REQUEST|NLM_F_DUMP Thanks! Also: "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02" ^^^^ So it's dumping AF_INET. I'm guessing its also going to dump v6, separately? To fix v4 I think something like this would do (untested): diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 48741352a88a..749baa74eee7 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) e++; } } + + /* Don't let NLM_DONE coalesce into a message, even if it could. + * Some user space expects NLM_DONE in a separate recv(). + */ + err = skb->len; out: cb->args[1] = e;
On 4/11/24 17:52, Jakub Kicinski wrote: > On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote: >> nlmsg_type=0x1a /* NLMSG_??? */ --> RTM_GETROUTE >> nlmsg_flags=NLM_F_REQUEST|0x300 --> NLM_F_REQUEST|NLM_F_DUMP > > Thanks! > > Also: > > "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02" > ^^^^ > > So it's dumping AF_INET. I'm guessing its also going to dump v6, > separately? AFAICT, it dumps the family according to the tunnel configuration. If ipsec is set up for IPv4, then it dumps IPv4, if v6 - v6. > To fix v4 I think something like this would do (untested): > > diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c > index 48741352a88a..749baa74eee7 100644 > --- a/net/ipv4/fib_frontend.c > +++ b/net/ipv4/fib_frontend.c > @@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) > e++; > } > } > + > + /* Don't let NLM_DONE coalesce into a message, even if it could. > + * Some user space expects NLM_DONE in a separate recv(). > + */ > + err = skb->len; > out: > > cb->args[1] = e; I tried that and IPv4 tests with Libreswan are passing with this change on latest net-next/main. IPv6 tests are stuck in the same way as IPv4 did before. The sendto as captured by strace is following: sendto(7, [ { nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084 }, "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48 NLMSG_DONE is part of the first recvfrom and the process is stuck waiting for something from the second one. Best regards, Ilya Maximets.
On Thu, 11 Apr 2024 18:38:19 +0200 Ilya Maximets wrote: > I tried that and IPv4 tests with Libreswan are passing with this change > on latest net-next/main. > > IPv6 tests are stuck in the same way as IPv4 did before. The sendto > as captured by strace is following: > > sendto(7, [ > { > nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */, > nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084 > }, > "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48 > > NLMSG_DONE is part of the first recvfrom and the process is stuck > waiting for something from the second one. Perfect, thank you! I just sent the v4 fix, just to be clear you were testing on -next right? Because AFAICT the v6 change hasn't made it to Linus.
On Thu, 11 Apr 2024 11:03:13 -0700 Jakub Kicinski wrote: > > NLMSG_DONE is part of the first recvfrom and the process is stuck > > waiting for something from the second one. > > Perfect, thank you! I just sent the v4 fix, > just to be clear you were testing on -next right? > Because AFAICT the v6 change hasn't made it to Linus. Oh, you said net-next/main earlier in your message, all makes sense.
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index 50ec599a5cff..3b7666944b11 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -1232,7 +1232,7 @@ static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq, hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd); if (hdr == NULL) - return -1; + return -EMSGSIZE; if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) || nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) || @@ -1355,6 +1355,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) struct net *net = sock_net(skb->sk); int fams_to_skip = cb->args[0]; unsigned int id; + int err = 0; idr_for_each_entry(&genl_fam_idr, rt, id) { if (!rt->netnsok && !net_eq(net, &init_net)) @@ -1363,16 +1364,17 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) if (n++ < fams_to_skip) continue; - if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid, - cb->nlh->nlmsg_seq, NLM_F_MULTI, - skb, CTRL_CMD_NEWFAMILY) < 0) { + err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, NLM_F_MULTI, + skb, CTRL_CMD_NEWFAMILY); + if (err) { n--; break; } } cb->args[0] = n; - return skb->len; + return err; } static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,