diff mbox series

[net-next,v2,3/3] genetlink: fit NLMSG_DONE into same read() as families

Message ID 20240303052408.310064-4-kuba@kernel.org (mailing list archive)
State Accepted
Commit 87d381973e49404f658d6923a617932eeda9415f
Delegated to: Netdev Maintainers
Headers show
Series netlink: handle EMSGSIZE errors in the core | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 942 this patch: 942
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 958 this patch: 958
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 36 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-05--21-00 (tests: 892)

Commit Message

Jakub Kicinski March 3, 2024, 5:24 a.m. UTC
Make sure ctrl_fill_info() returns sensible error codes and
propagate them out to netlink core. Let netlink core decide
when to return skb->len and when to treat the exit as an
error. Netlink core does better job at it, if we always
return skb->len the core doesn't know when we're done
dumping and NLMSG_DONE ends up in a separate read().

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jiri@resnulli.us
---
 net/netlink/genetlink.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Ido Schimmel March 3, 2024, 3:10 p.m. UTC | #1
On Sat, Mar 02, 2024 at 09:24:08PM -0800, Jakub Kicinski wrote:
> Make sure ctrl_fill_info() returns sensible error codes and
> propagate them out to netlink core. Let netlink core decide
> when to return skb->len and when to treat the exit as an
> error. Netlink core does better job at it, if we always
> return skb->len the core doesn't know when we're done
> dumping and NLMSG_DONE ends up in a separate read().
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Stefano Brivio March 15, 2024, 11:48 a.m. UTC | #2
Hi,

On Sat,  2 Mar 2024 21:24:08 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> Make sure ctrl_fill_info() returns sensible error codes and
> propagate them out to netlink core. Let netlink core decide
> when to return skb->len and when to treat the exit as an
> error. Netlink core does better job at it, if we always
> return skb->len the core doesn't know when we're done
> dumping and NLMSG_DONE ends up in a separate read().

While this change is obviously correct, it breaks... well, broken
applications that _wrongly_ rely on the fact that NLMSG_DONE is
delivered in a separate datagram.

This was the (embarrassing) case for passt(1), which I just fixed:
  https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/

but the "separate" NLMSG_DONE is such an established behaviour,
I think, that this might raise a more general concern.

From my perspective, I'm just happy that this change revealed the
issue, but I wanted to report this anyway in case somebody has
similar possible breakages in mind.

> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: jiri@resnulli.us
> ---
>  net/netlink/genetlink.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 50ec599a5cff..3b7666944b11 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -1232,7 +1232,7 @@ static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq,
>  
>  	hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd);
>  	if (hdr == NULL)
> -		return -1;
> +		return -EMSGSIZE;
>  
>  	if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) ||
>  	    nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) ||
> @@ -1355,6 +1355,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
>  	struct net *net = sock_net(skb->sk);
>  	int fams_to_skip = cb->args[0];
>  	unsigned int id;
> +	int err = 0;
>  
>  	idr_for_each_entry(&genl_fam_idr, rt, id) {
>  		if (!rt->netnsok && !net_eq(net, &init_net))
> @@ -1363,16 +1364,17 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
>  		if (n++ < fams_to_skip)
>  			continue;
>  
> -		if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
> -				   cb->nlh->nlmsg_seq, NLM_F_MULTI,
> -				   skb, CTRL_CMD_NEWFAMILY) < 0) {
> +		err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
> +				     cb->nlh->nlmsg_seq, NLM_F_MULTI,
> +				     skb, CTRL_CMD_NEWFAMILY);
> +		if (err) {
>  			n--;
>  			break;
>  		}
>  	}
>  
>  	cb->args[0] = n;
> -	return skb->len;
> +	return err;
>  }
>  
>  static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,
Jakub Kicinski March 19, 2024, 3:55 p.m. UTC | #3
On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > Make sure ctrl_fill_info() returns sensible error codes and
> > propagate them out to netlink core. Let netlink core decide
> > when to return skb->len and when to treat the exit as an
> > error. Netlink core does better job at it, if we always
> > return skb->len the core doesn't know when we're done
> > dumping and NLMSG_DONE ends up in a separate read().  
> 
> While this change is obviously correct, it breaks... well, broken
> applications that _wrongly_ rely on the fact that NLMSG_DONE is
> delivered in a separate datagram.
> 
> This was the (embarrassing) case for passt(1), which I just fixed:
>   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> 
> but the "separate" NLMSG_DONE is such an established behaviour,
> I think, that this might raise a more general concern.
> 
> From my perspective, I'm just happy that this change revealed the
> issue, but I wanted to report this anyway in case somebody has
> similar possible breakages in mind.

Hi Stefano! I was worried this may happen :( I think we should revert
offending commits, but I'd like to take it on case by case basis. 
I'd imagine majority of netlink is only exercised by iproute2 and
libmnl-based tools. Does passt hang specifically on genetlink family
dump? Your commit also mentions RTM_GETROUTE. This is not the only
commit which removed DONE:

$ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline 

9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
6647b338fc5c netlink: fix netlink_diag_dump() return value
Eric Dumazet March 19, 2024, 5:17 p.m. UTC | #4
On Tue, Mar 19, 2024 at 4:55 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > > Make sure ctrl_fill_info() returns sensible error codes and
> > > propagate them out to netlink core. Let netlink core decide
> > > when to return skb->len and when to treat the exit as an
> > > error. Netlink core does better job at it, if we always
> > > return skb->len the core doesn't know when we're done
> > > dumping and NLMSG_DONE ends up in a separate read().
> >
> > While this change is obviously correct, it breaks... well, broken
> > applications that _wrongly_ rely on the fact that NLMSG_DONE is
> > delivered in a separate datagram.
> >
> > This was the (embarrassing) case for passt(1), which I just fixed:
> >   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> >
> > but the "separate" NLMSG_DONE is such an established behaviour,
> > I think, that this might raise a more general concern.
> >
> > From my perspective, I'm just happy that this change revealed the
> > issue, but I wanted to report this anyway in case somebody has
> > similar possible breakages in mind.
>
> Hi Stefano! I was worried this may happen :( I think we should revert
> offending commits, but I'd like to take it on case by case basis.
> I'd imagine majority of netlink is only exercised by iproute2 and
> libmnl-based tools. Does passt hang specifically on genetlink family
> dump? Your commit also mentions RTM_GETROUTE. This is not the only
> commit which removed DONE:
>
> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>
> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> 6647b338fc5c netlink: fix netlink_diag_dump() return value

Lets not bring back more RTNL locking please for the handlers that
still require it.

The core can generate an NLMSG_DONE by itself, if we decide this needs
to be done.

I find this discussion a bit strange, because NLMSG_DONE being
piggybacked has been a long established behavior.

Jason patch was from 2017
Jakub Kicinski March 19, 2024, 5:40 p.m. UTC | #5
On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
> > Hi Stefano! I was worried this may happen :( I think we should revert
> > offending commits, but I'd like to take it on case by case basis.
> > I'd imagine majority of netlink is only exercised by iproute2 and
> > libmnl-based tools. Does passt hang specifically on genetlink family
> > dump? Your commit also mentions RTM_GETROUTE. This is not the only
> > commit which removed DONE:
> >
> > $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
> >
> > 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> > 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> > 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> > 6647b338fc5c netlink: fix netlink_diag_dump() return value  
> 
> Lets not bring back more RTNL locking please for the handlers that
> still require it.

Definitely. My git log copy/paste is pretty inaccurate, these two are
better examples:

5d9b7cb383bb nexthop: Simplify dump error handling
02e24903e5a4 netlink: let core handle error cases in dump operations

I was trying to point out that we merged a handful of DONE "coalescing"
patches, and if we need to revert - let's only do that for the exact
commands needed. The comment was raised on my genetlink patch while
the discussion in the link points to RTM_GETROUTE.

> The core can generate an NLMSG_DONE by itself, if we decide this needs
> to be done.

Exactly.
David Gibson March 19, 2024, 11:36 p.m. UTC | #6
On Tue, Mar 19, 2024 at 08:55:45AM -0700, Jakub Kicinski wrote:
> On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > > Make sure ctrl_fill_info() returns sensible error codes and
> > > propagate them out to netlink core. Let netlink core decide
> > > when to return skb->len and when to treat the exit as an
> > > error. Netlink core does better job at it, if we always
> > > return skb->len the core doesn't know when we're done
> > > dumping and NLMSG_DONE ends up in a separate read().  
> > 
> > While this change is obviously correct, it breaks... well, broken
> > applications that _wrongly_ rely on the fact that NLMSG_DONE is
> > delivered in a separate datagram.
> > 
> > This was the (embarrassing) case for passt(1), which I just fixed:
> >   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> > 
> > but the "separate" NLMSG_DONE is such an established behaviour,
> > I think, that this might raise a more general concern.
> > 
> > From my perspective, I'm just happy that this change revealed the
> > issue, but I wanted to report this anyway in case somebody has
> > similar possible breakages in mind.
> 
> Hi Stefano! I was worried this may happen :( I think we should revert
> offending commits, but I'd like to take it on case by case basis. 
> I'd imagine majority of netlink is only exercised by iproute2 and
> libmnl-based tools. Does passt hang specifically on genetlink family
> dump? Your commit also mentions RTM_GETROUTE. This is not the only
> commit which removed DONE:

I don't think there's anything specirfic to RTM_GETROUTE here from the
kernel side.  We've looked at the problem in passt more closely now,
and it turns out we handled a merged NLMSG_DONE correctly in most
cases.  For various reasons internal to passt, our handling of
RTM_GETROUTE on one path is more complex, and we had a subtle error
there which broke the handling of a merged NLMSG_DONE.

> 
> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline 
> 
> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> 6647b338fc5c netlink: fix netlink_diag_dump() return value
>
Gal Pressman March 21, 2024, 12:56 p.m. UTC | #7
On 19/03/2024 19:40, Jakub Kicinski wrote:
> On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
>>> Hi Stefano! I was worried this may happen :( I think we should revert
>>> offending commits, but I'd like to take it on case by case basis.
>>> I'd imagine majority of netlink is only exercised by iproute2 and
>>> libmnl-based tools. Does passt hang specifically on genetlink family
>>> dump? Your commit also mentions RTM_GETROUTE. This is not the only
>>> commit which removed DONE:
>>>
>>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>>>
>>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
>>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
>>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
>>> 6647b338fc5c netlink: fix netlink_diag_dump() return value  
>>
>> Lets not bring back more RTNL locking please for the handlers that
>> still require it.
> 
> Definitely. My git log copy/paste is pretty inaccurate, these two are
> better examples:
> 
> 5d9b7cb383bb nexthop: Simplify dump error handling
> 02e24903e5a4 netlink: let core handle error cases in dump operations
> 
> I was trying to point out that we merged a handful of DONE "coalescing"
> patches, and if we need to revert - let's only do that for the exact
> commands needed. The comment was raised on my genetlink patch while
> the discussion in the link points to RTM_GETROUTE.
> 
>> The core can generate an NLMSG_DONE by itself, if we decide this needs
>> to be done.
> 
> Exactly.
> 

We've encountered a new issue recently which I believe is related to
this discussion.

Following Eric's patch:
9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")

Setting the interface mtu to < 1280 results in 'ip addr show eth2'
returning an error, because the ipv6 dump fails. This is a degradation
from the user's perspective.

# ip addr show eth2
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
    link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
# ip link set dev eth2 mtu 1000
# ip addr show eth2
RTNETLINK answers: No such device
Dump terminated
Ido Schimmel March 21, 2024, 1:51 p.m. UTC | #8
On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
> We've encountered a new issue recently which I believe is related to
> this discussion.
> 
> Following Eric's patch:
> 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
> 
> Setting the interface mtu to < 1280 results in 'ip addr show eth2'
> returning an error, because the ipv6 dump fails. This is a degradation
> from the user's perspective.
> 
> # ip addr show eth2
> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> default qlen 1000
>     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
>     altname enp6s0f0np0
> # ip link set dev eth2 mtu 1000
> # ip addr show eth2
> RTNETLINK answers: No such device
> Dump terminated

I don't think it's the same issue. Original issue was about user space
not knowing how to handle NLMSG_DONE being sent together with dump
responses. The issue you reported seems to be related to an
unintentional change in the return code when IPv6 is disabled on an
interface. Can you please test the following patch?

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 247bd4d8ee45..92db9b474f2b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 
                err = 0;
                if (fillargs.ifindex) {
-                       err = -ENODEV;
                        dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
-                       if (!dev)
+                       if (!dev) {
+                               err = -ENODEV;
                                goto done;
+                       }
                        idev = __in6_dev_get(dev);
                        if (idev)
                                err = in6_dump_addrs(idev, skb, cb,
Gal Pressman March 21, 2024, 3:03 p.m. UTC | #9
On 21/03/2024 15:51, Ido Schimmel wrote:
> On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
>> We've encountered a new issue recently which I believe is related to
>> this discussion.
>>
>> Following Eric's patch:
>> 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
>>
>> Setting the interface mtu to < 1280 results in 'ip addr show eth2'
>> returning an error, because the ipv6 dump fails. This is a degradation
>> from the user's perspective.
>>
>> # ip addr show eth2
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
>> default qlen 1000
>>     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
>>     altname enp6s0f0np0
>> # ip link set dev eth2 mtu 1000
>> # ip addr show eth2
>> RTNETLINK answers: No such device
>> Dump terminated
> 
> I don't think it's the same issue. Original issue was about user space
> not knowing how to handle NLMSG_DONE being sent together with dump
> responses. The issue you reported seems to be related to an
> unintentional change in the return code when IPv6 is disabled on an
> interface. Can you please test the following patch?
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 247bd4d8ee45..92db9b474f2b 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>  
>                 err = 0;
>                 if (fillargs.ifindex) {
> -                       err = -ENODEV;
>                         dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
> -                       if (!dev)
> +                       if (!dev) {
> +                               err = -ENODEV;
>                                 goto done;
> +                       }
>                         idev = __in6_dev_get(dev);
>                         if (idev)
>                                 err = in6_dump_addrs(idev, skb, cb,

This seems to fix it, thanks!
Will you submit a patch?
Eric Dumazet March 21, 2024, 5:26 p.m. UTC | #10
On Thu, Mar 21, 2024 at 2:51 PM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
> > We've encountered a new issue recently which I believe is related to
> > this discussion.
> >
> > Following Eric's patch:
> > 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
> >
> > Setting the interface mtu to < 1280 results in 'ip addr show eth2'
> > returning an error, because the ipv6 dump fails. This is a degradation
> > from the user's perspective.
> >
> > # ip addr show eth2
> > 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> > default qlen 1000
> >     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > # ip link set dev eth2 mtu 1000
> > # ip addr show eth2
> > RTNETLINK answers: No such device
> > Dump terminated
>
> I don't think it's the same issue. Original issue was about user space
> not knowing how to handle NLMSG_DONE being sent together with dump
> responses. The issue you reported seems to be related to an
> unintentional change in the return code when IPv6 is disabled on an
> interface. Can you please test the following patch?
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 247bd4d8ee45..92db9b474f2b 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>
>                 err = 0;
>                 if (fillargs.ifindex) {
> -                       err = -ENODEV;
>                         dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
> -                       if (!dev)
> +                       if (!dev) {
> +                               err = -ENODEV;
>                                 goto done;
> +                       }
>                         idev = __in6_dev_get(dev);
>                         if (idev)
>                                 err = in6_dump_addrs(idev, skb, cb,

Good catch, thanks !

Not sure why ip command is so picky about the error code here.

There is no IPv6 on this device after all.

The following seems to react quite differently :

# ip addr  show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
# ip link set dev lo mtu 1000
# ip addr  show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever



Reviewed-by: Eric Dumazet <edumazet@google.com>
Ido Schimmel March 21, 2024, 5:41 p.m. UTC | #11
On Thu, Mar 21, 2024 at 06:26:31PM +0100, Eric Dumazet wrote:
> The following seems to react quite differently :
> 
> # ip addr  show lo
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host proto kernel_lo
>        valid_lft forever preferred_lft forever
> # ip link set dev lo mtu 1000
> # ip addr  show lo
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN
> group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever

Yes, for loopback the NETDEV_CHANGEMTU event is treated as NETDEV_DOWN
rather than NETDEV_UNREGISTER:

if (dev->mtu < IPV6_MIN_MTU) {
	addrconf_ifdown(dev, dev != net->loopback_dev);
	break;
}

vim net/ipv6/addrconf.c +3656

> Reviewed-by: Eric Dumazet <edumazet@google.com>

Sorry Eric, already sent the patch without your tag:

https://lore.kernel.org/netdev/20240321173042.2151756-1-idosch@nvidia.com/
Ilya Maximets April 3, 2024, 10:52 p.m. UTC | #12
On 3/19/24 18:40, Jakub Kicinski wrote:
> On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
>>> Hi Stefano! I was worried this may happen :( I think we should revert
>>> offending commits, but I'd like to take it on case by case basis.
>>> I'd imagine majority of netlink is only exercised by iproute2 and
>>> libmnl-based tools. Does passt hang specifically on genetlink family
>>> dump? Your commit also mentions RTM_GETROUTE. This is not the only
>>> commit which removed DONE:
>>>
>>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>>>
>>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
>>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
>>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
>>> 6647b338fc5c netlink: fix netlink_diag_dump() return value  
>>
>> Lets not bring back more RTNL locking please for the handlers that
>> still require it.
> 
> Definitely. My git log copy/paste is pretty inaccurate, these two are
> better examples:
> 
> 5d9b7cb383bb nexthop: Simplify dump error handling
> 02e24903e5a4 netlink: let core handle error cases in dump operations
> 
> I was trying to point out that we merged a handful of DONE "coalescing"
> patches, and if we need to revert - let's only do that for the exact
> commands needed. The comment was raised on my genetlink patch while
> the discussion in the link points to RTM_GETROUTE.
> 
>> The core can generate an NLMSG_DONE by itself, if we decide this needs
>> to be done.
> 
> Exactly.

FWIW, it seems that Libreswan is suffering from the same issue on
RTM_GETROUTE dump.

On 6.9.0-rc1 I see:

/usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
                     --start --asynchronous tun-in-1

recvfrom(7, 
[
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  ...
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])

recvfrom(7, <-- Stuck here forever


On 6.8.0 the output is following:

recvfrom(7, 
[
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  ...
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])

recvfrom(7,
  [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI,}, 0],
  40728, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
close(7)

So, it seems like it explicitly waits for NLMSG_DONE to be in a separate
message.

I reported the issue to Libreswan:
  https://github.com/libreswan/libreswan/issues/1675
but just wanted to let you know as well.

Found this since it breaks IPsec system tests in Open vSwitch.

Best regards, Ilya Maximets.
Jakub Kicinski April 11, 2024, 3:16 p.m. UTC | #13
On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote:
> /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
>                      --start --asynchronous tun-in-1
> 
> recvfrom(7, 
> [
>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>   ...
>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>   [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
> ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
> 
> recvfrom(7, <-- Stuck here forever

I think we should probably fix this..
Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() -
it does seem to coalesce DONE..
Ilya Maximets April 11, 2024, 3:39 p.m. UTC | #14
On 4/11/24 17:16, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote:
>> /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
>>                      --start --asynchronous tun-in-1
>>
>> recvfrom(7, 
>> [
>>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>>   ...
>>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>>   [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
>> ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
>>
>> recvfrom(7, <-- Stuck here forever
> 
> I think we should probably fix this..
> Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() -
> it does seem to coalesce DONE..

Sure, the whole sequence looks like this in strace:

socket(AF_NETLINK, SOCK_DGRAM|SOCK_CLOEXEC, NETLINK_ROUTE) = 7

sendto(7, [
  {
    nlmsg_len=36, nlmsg_type=0x1a /* NLMSG_??? */,
    nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=138040
  },
  "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
], 36, 0, NULL, 0) = 36

recvfrom(7, [
 [
   {nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN,
    rtm_protocol=RTPROT_BOOT, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN],
     [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.1.1.2")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=24, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.0")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_BROADCAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.255")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, 0
 ]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) = 252

recvfrom(7,  <unfinished ...>


nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP

Best regards, Ilya Maximets.
Jakub Kicinski April 11, 2024, 3:52 p.m. UTC | #15
On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote:
> nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
> nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP

Thanks!

Also:

"\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
 ^^^^

So it's dumping AF_INET. I'm guessing its also going to dump v6,
separately? To fix v4 I think something like this would do (untested):

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 48741352a88a..749baa74eee7 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 			e++;
 		}
 	}
+
+	/* Don't let NLM_DONE coalesce into a message, even if it could.
+	 * Some user space expects NLM_DONE in a separate recv().
+	 */
+	err = skb->len;
 out:
 
 	cb->args[1] = e;
Ilya Maximets April 11, 2024, 4:38 p.m. UTC | #16
On 4/11/24 17:52, Jakub Kicinski wrote:
> On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote:
>> nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
>> nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP
> 
> Thanks!
> 
> Also:
> 
> "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
>  ^^^^
> 
> So it's dumping AF_INET. I'm guessing its also going to dump v6,
> separately?

AFAICT, it dumps the family according to the tunnel configuration.
If ipsec is set up for IPv4, then it dumps IPv4, if v6 - v6.

> To fix v4 I think something like this would do (untested):
> 
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 48741352a88a..749baa74eee7 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  			e++;
>  		}
>  	}
> +
> +	/* Don't let NLM_DONE coalesce into a message, even if it could.
> +	 * Some user space expects NLM_DONE in a separate recv().
> +	 */
> +	err = skb->len;
>  out:
>  
>  	cb->args[1] = e;

I tried that and IPv4 tests with Libreswan are passing with this change
on latest net-next/main.

IPv6 tests are stuck in the same way as IPv4 did before.  The sendto
as captured by strace is following:

sendto(7, [
  {
    nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */,
    nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084
  },
  "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48

NLMSG_DONE is part of the first recvfrom and the process is stuck
waiting for something from the second one.

Best regards, Ilya Maximets.
Jakub Kicinski April 11, 2024, 6:03 p.m. UTC | #17
On Thu, 11 Apr 2024 18:38:19 +0200 Ilya Maximets wrote:
> I tried that and IPv4 tests with Libreswan are passing with this change
> on latest net-next/main.
> 
> IPv6 tests are stuck in the same way as IPv4 did before.  The sendto
> as captured by strace is following:
> 
> sendto(7, [
>   {
>     nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */,
>     nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084
>   },
>   "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48
> 
> NLMSG_DONE is part of the first recvfrom and the process is stuck
> waiting for something from the second one.

Perfect, thank you! I just sent the v4 fix,
just to be clear you were testing on -next right?
Because AFAICT the v6 change hasn't made it to Linus.
Jakub Kicinski April 11, 2024, 6:04 p.m. UTC | #18
On Thu, 11 Apr 2024 11:03:13 -0700 Jakub Kicinski wrote:
> > NLMSG_DONE is part of the first recvfrom and the process is stuck
> > waiting for something from the second one.  
> 
> Perfect, thank you! I just sent the v4 fix,
> just to be clear you were testing on -next right?
> Because AFAICT the v6 change hasn't made it to Linus.

Oh, you said net-next/main earlier in your message, all makes sense.
diff mbox series

Patch

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 50ec599a5cff..3b7666944b11 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1232,7 +1232,7 @@  static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq,
 
 	hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd);
 	if (hdr == NULL)
-		return -1;
+		return -EMSGSIZE;
 
 	if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) ||
 	    nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) ||
@@ -1355,6 +1355,7 @@  static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *net = sock_net(skb->sk);
 	int fams_to_skip = cb->args[0];
 	unsigned int id;
+	int err = 0;
 
 	idr_for_each_entry(&genl_fam_idr, rt, id) {
 		if (!rt->netnsok && !net_eq(net, &init_net))
@@ -1363,16 +1364,17 @@  static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 		if (n++ < fams_to_skip)
 			continue;
 
-		if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
-				   cb->nlh->nlmsg_seq, NLM_F_MULTI,
-				   skb, CTRL_CMD_NEWFAMILY) < 0) {
+		err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
+				     cb->nlh->nlmsg_seq, NLM_F_MULTI,
+				     skb, CTRL_CMD_NEWFAMILY);
+		if (err) {
 			n--;
 			break;
 		}
 	}
 
 	cb->args[0] = n;
-	return skb->len;
+	return err;
 }
 
 static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,