diff mbox series

[net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()

Message ID 20240601212517.644844-1-kuba@kernel.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr() | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3637 this patch: 3637
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 950 this patch: 950
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 3822 this patch: 3822
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 69 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 97 this patch: 97
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-06-02--03-00 (tests: 1040)

Commit Message

Jakub Kicinski June 1, 2024, 9:25 p.m. UTC
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), so revert to the
old behavior.

This is the same kind of fix as we added in
commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again")
so wrap the logic into a helper, to make it easier to keep track
of which dump handles we know to require legacy handling
(and possibly one day let sockets opt into not doing this).

Tested:

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
           --dump getaddr --json '{"ifa-family": 2}'

  ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
           --dump getroute --json '{"rtm-family": 2}'

Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: dsahern@kernel.org
---
 include/net/netlink.h   | 42 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/devinet.c      |  3 +++
 net/ipv4/fib_frontend.c |  5 +----
 3 files changed, 46 insertions(+), 4 deletions(-)

Comments

Stephen Hemminger June 1, 2024, 11:10 p.m. UTC | #1
On Sat,  1 Jun 2024 14:25:17 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> + * When used in combination with for_each_netdev_dump() - make sure to
> + * invalidate the ifindex when iteration is done. for_each_netdev_dump()
> + * does not move the iterator index "after" the last valid entry.
> + *
> + * NOTE: Do not use this helper for dumps without known legacy users!
> + *       Most families are accessed only using well-written libraries
> + *       so starting to coalesce NLM_DONE is perfectly fine, and more efficient.

Sorry, I disagree.

You can't just fix the problem areas. The split was an ABI change, and there could
be a problem in any dump. This the ABI version of the old argument 
  If a tree falls in a forest and no one is around to hear it, does it make a sound?

All dumps must behave the same. You are stuck with the legacy behavior.
Jakub Kicinski June 1, 2024, 11:48 p.m. UTC | #2
On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> Sorry, I disagree.
> 
> You can't just fix the problem areas. The split was an ABI change, and there could
> be a problem in any dump. This the ABI version of the old argument 
>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
> 
> All dumps must behave the same. You are stuck with the legacy behavior.

The dump partitioning is up to the family. Multiple families
coalesce NLM_DONE from day 1. "All dumps must behave the same"
is saying we should convert all families to be poorly behaved.

Admittedly changing the most heavily used parts of rtnetlink is very
risky. And there's couple more corner cases which I'm afraid someone
will hit. I'm adding this helper to clearly annotate "legacy"
callbacks, so we don't regress again. At the same time nobody should
use this in new code or "just to be safe" (read: because they don't
understand netlink).
David Ahern June 2, 2024, 2:23 a.m. UTC | #3
On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
>> Sorry, I disagree.
>>
>> You can't just fix the problem areas. The split was an ABI change, and there could
>> be a problem in any dump. This the ABI version of the old argument 
>>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
>>
>> All dumps must behave the same. You are stuck with the legacy behavior.

I don't agree with such a hard line stance. Mistakes made 20 years ago
cannot hold Linux back from moving forward. We have to continue
searching for ways to allow better or more performant behavior.

> 
> The dump partitioning is up to the family. Multiple families
> coalesce NLM_DONE from day 1. "All dumps must behave the same"
> is saying we should convert all families to be poorly behaved.
> 
> Admittedly changing the most heavily used parts of rtnetlink is very
> risky. And there's couple more corner cases which I'm afraid someone
> will hit. I'm adding this helper to clearly annotate "legacy"
> callbacks, so we don't regress again. At the same time nobody should
> use this in new code or "just to be safe" (read: because they don't
> understand netlink).

What about a socket option that says "I am a modern app and can handle
the new way" - similar to the strict mode option that was added? Then
the decision of requiring a separate message for NLM_DONE can be based
on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
to help old apps understand they need to move forward.
Eric Dumazet June 2, 2024, 10 a.m. UTC | #4
On Sun, Jun 2, 2024 at 4:23 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> >> Sorry, I disagree.
> >>
> >> You can't just fix the problem areas. The split was an ABI change, and there could
> >> be a problem in any dump. This the ABI version of the old argument
> >>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
> >>
> >> All dumps must behave the same. You are stuck with the legacy behavior.
>
> I don't agree with such a hard line stance. Mistakes made 20 years ago
> cannot hold Linux back from moving forward. We have to continue
> searching for ways to allow better or more performant behavior.
>
> >
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.

Main motivation for me was to avoid re-grabbing RTNL again just to get NLM_DONE.

The avoidance of the two system calls was really secondary.

I think we could make a generic change in netlink_dump() to force NLM_DONE
in an empty message _and_ avoiding a useless call to the dump method, which
might still use RTNL or other contended mutex.

In a prior feedback I suggested a sysctl that Jakub disliked,
but really we do not care yet, as long as we avoid RTNL as much as we can.

Jakub, what about the following generic change, instead of ad-hoc changes ?

I tested it, I can send it with the minimal change (the alloc skb
optim will reach net-next)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index fa9c090cf629e6e92c097285b262ed90324c7656..0a58e5d13b8e68dd3fbb2b3fb362c3399fa29909
100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2289,15 +2289,20 @@ static int netlink_dump(struct sock *sk, bool
lock_taken)
         * ever provided a big enough buffer.
         */
        cb = &nlk->cb;
-       alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
-
-       max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
-       if (alloc_min_size < max_recvmsg_len) {
-               alloc_size = max_recvmsg_len;
-               skb = alloc_skb(alloc_size,
+       if (nlk->dump_done_errno) {
+               alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
+               max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
+               if (alloc_min_size < max_recvmsg_len) {
+                       alloc_size = max_recvmsg_len;
+                       skb = alloc_skb(alloc_size,
                                (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
                                __GFP_NOWARN | __GFP_NORETRY);
+               }
+       } else {
+               /* Allocate the space needed for NLMSG_DONE alone. */
+               alloc_min_size = nlmsg_total_size(sizeof(nlk->dump_done_errno));
        }
+
        if (!skb) {
                alloc_size = alloc_min_size;
                skb = alloc_skb(alloc_size, GFP_KERNEL);
@@ -2350,8 +2355,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
                cb->extack = NULL;
        }

-       if (nlk->dump_done_errno > 0 ||
-           skb_tailroom(skb) <
nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
+       if (skb->len) {
                mutex_unlock(&nlk->nl_cb_mutex);

                if (sk_filter(sk, skb))
Jakub Kicinski June 2, 2024, 9:59 p.m. UTC | #5
On Sat, 1 Jun 2024 20:23:17 -0600 David Ahern wrote:
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> > 
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).  
> 
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app.

That seems like a good solution, with the helper marking the "legacy"
handlers - I hope it should be trivial to add such option and change
the helper's behavior based on the socket state.

> Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.

Hm, do you think people would actually modernize all the legacy apps?

Coincidentally, looking at Jaroslav's traces it appears that the app
sets ifindex for the link dump, so it must not be opting into strict
checking, either.
Jakub Kicinski June 2, 2024, 10:21 p.m. UTC | #6
On Sun, 2 Jun 2024 12:00:09 +0200 Eric Dumazet wrote:
> Main motivation for me was to avoid re-grabbing RTNL again just to get NLM_DONE.
> 
> The avoidance of the two system calls was really secondary.
> 
> I think we could make a generic change in netlink_dump() to force NLM_DONE
> in an empty message _and_ avoiding a useless call to the dump method, which
> might still use RTNL or other contended mutex.
> 
> In a prior feedback I suggested a sysctl that Jakub disliked,
> but really we do not care yet, as long as we avoid RTNL as much as we can.
> 
> Jakub, what about the following generic change, instead of ad-hoc changes ?

Netlink is full of legacy behavior, the only way to make it usable
in modern environments is to let new families not repeat the mistakes.
That's why I'd really rather not add a workaround at the af_netlink
level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
suddenly start needed another recv(). A lot of the time the entire dump
fits in one skb.

If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
has also been correctly coded from day 1) - we can add a trampoline for
rtnetlink dump handlers?

(just to illustrate, not even compile tested)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 522bbd70c205..c59b39ee544f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6487,6 +6487,18 @@ static int rtnl_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 /* Process one rtnetlink message. */
 
+static int rtnl_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	rtnl_dumpit_func dumpit = cb->data;
+	int err;
+
+	err = dumpit(skb, cb);
+	if (err < 0 && err != -EMSGSIZE)
+		return err;
+
+	return skb->len;
+}
+
 static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 			     struct netlink_ext_ack *extack)
 {
@@ -6546,10 +6558,11 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 		rtnl = net->rtnl;
 		if (err == 0) {
 			struct netlink_dump_control c = {
-				.dump		= dumpit,
+				.dump		= rtnl_dumpit,
 				.min_dump_alloc	= min_dump_alloc,
 				.module		= owner,
 				.flags		= flags,
+				.data		= dumpit,
 			};
 			err = netlink_dump_start(rtnl, skb, nlh, &c);
 			/* netlink_dump_start() will keep a reference on
David Ahern June 3, 2024, 2:42 a.m. UTC | #7
On 6/2/24 3:59 PM, Jakub Kicinski wrote:
> On Sat, 1 Jun 2024 20:23:17 -0600 David Ahern wrote:
>>> The dump partitioning is up to the family. Multiple families
>>> coalesce NLM_DONE from day 1. "All dumps must behave the same"
>>> is saying we should convert all families to be poorly behaved.
>>>
>>> Admittedly changing the most heavily used parts of rtnetlink is very
>>> risky. And there's couple more corner cases which I'm afraid someone
>>> will hit. I'm adding this helper to clearly annotate "legacy"
>>> callbacks, so we don't regress again. At the same time nobody should
>>> use this in new code or "just to be safe" (read: because they don't
>>> understand netlink).  
>>
>> What about a socket option that says "I am a modern app and can handle
>> the new way" - similar to the strict mode option that was added? Then
>> the decision of requiring a separate message for NLM_DONE can be based
>> on the app.
> 
> That seems like a good solution, with the helper marking the "legacy"
> handlers - I hope it should be trivial to add such option and change
> the helper's behavior based on the socket state.
> 
>> Could even throw a `pr_warn_once("modernize app %s/%d\n")`
>> to help old apps understand they need to move forward.
> 
> Hm, do you think people would actually modernize all the legacy apps?

I have worked for a few companies that do monitor dmesg and when given
the right push will update apps. Best an OS can do.

> 
> Coincidentally, looking at Jaroslav's traces it appears that the app
> sets ifindex for the link dump, so it must not be opting into strict
> checking, either.

:-(

I should have added a warning back when the option was introduced - that
and a warning when options to a dump are ignored.
Jakub Kicinski June 3, 2024, 1:54 p.m. UTC | #8
On Sun, 2 Jun 2024 15:21:02 -0700 Jakub Kicinski wrote:
> Netlink is full of legacy behavior, the only way to make it usable
> in modern environments is to let new families not repeat the mistakes.
> That's why I'd really rather not add a workaround at the af_netlink
> level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
> suddenly start needed another recv(). A lot of the time the entire dump
> fits in one skb.
> 
> If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
> has also been correctly coded from day 1) - we can add a trampoline for
> rtnetlink dump handlers?

Hi Eric, how do you feel about this approach? It would also let us
extract the "RTNL unlocked dump" handling from af_netlink.c, which
would be nice.

BTW it will probably need to be paired with fixing the
for_each_netdev_dump() foot gun, maybe (untested):

--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3025,7 +3025,8 @@ int call_netdevice_notifiers_info(unsigned long val,
 #define net_device_entry(lh)	list_entry(lh, struct net_device, dev_list)
 
 #define for_each_netdev_dump(net, d, ifindex)				\
-	xa_for_each_start(&(net)->dev_by_index, (ifindex), (d), (ifindex))
+	for (; (d = xa_find(&(net)->dev_by_index, &ifindex,		\
+			    ULONG_MAX, XA_PRESENT)); ifindex++)
 
 static inline struct net_device *next_net_device(struct net_device *dev)
 {
Jamal Hadi Salim June 3, 2024, 2:05 p.m. UTC | #9
On Sat, Jun 1, 2024 at 10:23 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> >> Sorry, I disagree.
> >>
> >> You can't just fix the problem areas. The split was an ABI change, and there could
> >> be a problem in any dump. This the ABI version of the old argument
> >>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
> >>
> >> All dumps must behave the same. You are stuck with the legacy behavior.
>
> I don't agree with such a hard line stance. Mistakes made 20 years ago
> cannot hold Linux back from moving forward. We have to continue
> searching for ways to allow better or more performant behavior.
>
> >
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.
>

Sorry, being a little lazy so asking instead:
NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
you get rid of it  - how does the user know there are more msgs coming
or the dump transaction is over? In addition to the user->kernel "I am
modern", perhaps set the nlmsg_flag in the reverse path to either say
"there's more coming" which you dont set on the last message or "we
are doing this the new way". Backward compat is very important - there
are dinosaur apps out there that will break otherwise.

cheers,
jamal
Eric Dumazet June 3, 2024, 2:22 p.m. UTC | #10
On Mon, Jun 3, 2024 at 3:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 2 Jun 2024 15:21:02 -0700 Jakub Kicinski wrote:
> > Netlink is full of legacy behavior, the only way to make it usable
> > in modern environments is to let new families not repeat the mistakes.
> > That's why I'd really rather not add a workaround at the af_netlink
> > level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
> > suddenly start needed another recv(). A lot of the time the entire dump
> > fits in one skb.
> >
> > If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
> > has also been correctly coded from day 1) - we can add a trampoline for
> > rtnetlink dump handlers?
>
> Hi Eric, how do you feel about this approach? It would also let us
> extract the "RTNL unlocked dump" handling from af_netlink.c, which
> would be nice.

Sure, I have not thought of af_netlink

>
> BTW it will probably need to be paired with fixing the
> for_each_netdev_dump() foot gun, maybe (untested):
>

I confess I am a bit lost : this part relates to your original submission,
when you set "ctx->ifindex = ULONG_MAX;"  in inet_dump_ifaddr() ?


> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3025,7 +3025,8 @@ int call_netdevice_notifiers_info(unsigned long val,
>  #define net_device_entry(lh)   list_entry(lh, struct net_device, dev_list)
>
>  #define for_each_netdev_dump(net, d, ifindex)                          \
> -       xa_for_each_start(&(net)->dev_by_index, (ifindex), (d), (ifindex))
> +       for (; (d = xa_find(&(net)->dev_by_index, &ifindex,             \
> +                           ULONG_MAX, XA_PRESENT)); ifindex++)
>
>  static inline struct net_device *next_net_device(struct net_device *dev)
>  {
>
Jakub Kicinski June 3, 2024, 2:59 p.m. UTC | #11
On Mon, 3 Jun 2024 16:22:19 +0200 Eric Dumazet wrote:
> > Hi Eric, how do you feel about this approach? It would also let us
> > extract the "RTNL unlocked dump" handling from af_netlink.c, which
> > would be nice.  
> 
> Sure, I have not thought of af_netlink
> 
> > BTW it will probably need to be paired with fixing the
> > for_each_netdev_dump() foot gun, maybe (untested):
> 
> I confess I am a bit lost : this part relates to your original submission,
> when you set "ctx->ifindex = ULONG_MAX;"  in inet_dump_ifaddr() ?

Yes, xa_for_each_start() leaves ifindex at the last valid entry, so
without this snippet it's not safe to call dump again after we reached
the end.
Eric Dumazet June 3, 2024, 3:26 p.m. UTC | #12
On Mon, Jun 3, 2024 at 4:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Jun 2024 16:22:19 +0200 Eric Dumazet wrote:
> > > Hi Eric, how do you feel about this approach? It would also let us
> > > extract the "RTNL unlocked dump" handling from af_netlink.c, which
> > > would be nice.
> >
> > Sure, I have not thought of af_netlink
> >
> > > BTW it will probably need to be paired with fixing the
> > > for_each_netdev_dump() foot gun, maybe (untested):
> >
> > I confess I am a bit lost : this part relates to your original submission,
> > when you set "ctx->ifindex = ULONG_MAX;"  in inet_dump_ifaddr() ?
>
> Yes, xa_for_each_start() leaves ifindex at the last valid entry, so
> without this snippet it's not safe to call dump again after we reached
> the end.

A generic change would be to make sure we remember we hit the end,
I guess this can be done later.
David Ahern June 3, 2024, 3:33 p.m. UTC | #13
On 6/3/24 8:05 AM, Jamal Hadi Salim wrote:
> 
> Sorry, being a little lazy so asking instead:
> NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
> you get rid of it  - how does the user know there are more msgs coming
> or the dump transaction is over? In addition to the user->kernel "I am
> modern", perhaps set the nlmsg_flag in the reverse path to either say
> "there's more coming" which you dont set on the last message or "we
> are doing this the new way". Backward compat is very important - there
> are dinosaur apps out there that will break otherwise.
> 

NLM_DONE is not getting removed. The recent changes allow the end
message signal without a separate system call.
Eric Dumazet June 3, 2024, 3:34 p.m. UTC | #14
On Mon, Jun 3, 2024 at 4:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Jun 1, 2024 at 10:23 PM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> > >> Sorry, I disagree.
> > >>
> > >> You can't just fix the problem areas. The split was an ABI change, and there could
> > >> be a problem in any dump. This the ABI version of the old argument
> > >>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
> > >>
> > >> All dumps must behave the same. You are stuck with the legacy behavior.
> >
> > I don't agree with such a hard line stance. Mistakes made 20 years ago
> > cannot hold Linux back from moving forward. We have to continue
> > searching for ways to allow better or more performant behavior.
> >
> > >
> > > The dump partitioning is up to the family. Multiple families
> > > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > > is saying we should convert all families to be poorly behaved.
> > >
> > > Admittedly changing the most heavily used parts of rtnetlink is very
> > > risky. And there's couple more corner cases which I'm afraid someone
> > > will hit. I'm adding this helper to clearly annotate "legacy"
> > > callbacks, so we don't regress again. At the same time nobody should
> > > use this in new code or "just to be safe" (read: because they don't
> > > understand netlink).
> >
> > What about a socket option that says "I am a modern app and can handle
> > the new way" - similar to the strict mode option that was added? Then
> > the decision of requiring a separate message for NLM_DONE can be based
> > on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> > to help old apps understand they need to move forward.
> >
>
> Sorry, being a little lazy so asking instead:
> NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
> you get rid of it  - how does the user know there are more msgs coming
> or the dump transaction is over? In addition to the user->kernel "I am
> modern", perhaps set the nlmsg_flag in the reverse path to either say
> "there's more coming" which you dont set on the last message or "we
> are doing this the new way". Backward compat is very important - there
> are dinosaur apps out there that will break otherwise.

The NLMSG_DONE was not removed.

Some applications expected it to be carried in a standalone message
for some of the rtnetlink operations,
because old kernel implementations accidentally had this
(undocumented) behavior.

When the kernel started to be smart and piggy-back the NLMSG_DONE in
the 'last given answer',
these applications started to complain.

Basically these applications do not correctly parse the full answer
the kernel gives to them.
diff mbox series

Patch

diff --git a/include/net/netlink.h b/include/net/netlink.h
index e78ce008e07c..8369aca32443 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1198,6 +1198,48 @@  nl_dump_check_consistent(struct netlink_callback *cb,
 	cb->prev_seq = cb->seq;
 }
 
+/**
+ * nl_dump_legacy_retval - get legacy return code for netlink dumps
+ * @err: error encountered during dump, negative errno or zero
+ * @skb: skb with the dump results
+ *
+ * Netlink dump callbacks get called multiple times per dump, because
+ * all the objects may not fit into a single skb. Whether another iteration
+ * is necessary gets decided based on the return value of the callback
+ * (with 0 meaning "end reached").
+ *
+ * The semantics used to be more complicated, with positive return values
+ * meaning "continue" and negative meaning "end with an error". A lot of
+ * handlers simplified this to return skb->len ? : -errno. Meaning that zero
+ * would only be returned when skb was empty, requiring another recvmsg()
+ * syscall just to get the NLM_DONE message.
+ *
+ * The current semantics allow handlers to also return -EMSGSIZE to continue.
+ *
+ * Unfortunately, some user space has started to depend on the NLM_DONE
+ * message being returned individually, in a separate recvmsg(). Select
+ * netlink dumps must preserve those semantics.
+ *
+ * This helper wraps the "legacy logic" and serves as an annotation for
+ * dumps which are known to require legacy handling.
+ *
+ * When used in combination with for_each_netdev_dump() - make sure to
+ * invalidate the ifindex when iteration is done. for_each_netdev_dump()
+ * does not move the iterator index "after" the last valid entry.
+ *
+ * NOTE: Do not use this helper for dumps without known legacy users!
+ *       Most families are accessed only using well-written libraries
+ *       so starting to coalesce NLM_DONE is perfectly fine, and more efficient.
+ *
+ * Return: return code to use for a dump handler
+ */
+static inline int nl_dump_legacy_retval(int err, const struct sk_buff *skb)
+{
+	if (err < 0 && err != -EMSGSIZE)
+		return err;
+	return skb->len;
+}
+
 /**************************************************************************
  * Netlink Attributes
  **************************************************************************/
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 96accde527da..6d0e5cbd95b4 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1911,7 +1911,10 @@  static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 		if (err < 0)
 			goto done;
 	}
+	ctx->ifindex = ULONG_MAX;
 done:
+	err = nl_dump_legacy_retval(err, skb);
+
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 	rcu_read_unlock();
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c484b1c0fc00..100d77eafe35 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1051,10 +1051,7 @@  static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 		}
 	}
 
-	/* Don't let NLM_DONE coalesce into a message, even if it could.
-	 * Some user space expects NLM_DONE in a separate recv().
-	 */
-	err = skb->len;
+	err = nl_dump_legacy_retval(err, skb);
 out:
 
 	cb->args[1] = e;