diff mbox series

[net,v2] neighbour: add RTNL_FLAG_DUMP_SPLIT_NLM_DONE to RTM_GETNEIGH

Message ID 20240615113224.4141608-1-maze@google.com (mailing list archive)
State Deferred
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] neighbour: add RTNL_FLAG_DUMP_SPLIT_NLM_DONE to RTM_GETNEIGH | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 863 this patch: 863
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 863 this patch: 863
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 867 this patch: 867
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-06-16--00-00 (tests: 654)

Commit Message

Maciej Żenczykowski June 15, 2024, 11:32 a.m. UTC
without this Android's net test, available at:
  https://cs.android.com/android/platform/superproject/main/+/main:kernel/tests/net/test/
run via:
  /...aosp-tests.../net/test/run_net_test.sh --builder neighbour_test.py
fails with:
  TypeError: NLMsgHdr requires a bytes object of length 16, got 4

Fixes: 7e4975f7e7fb ("neighbour: fix neigh_dump_info() return value")
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 net/core/neighbour.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Maciej Żenczykowski June 16, 2024, 8:09 a.m. UTC | #1
For the other patch, I've tracked down:
  32affa5578f0 ("fib: rules: no longer hold RTNL in fib_nl_dumprule()")
which causes half the regression.

But... I haven't figured out what causes the final half (or third
depending on how you look at it).

I've also spent quite a while trying to figure out what exactly is
going wrong in the python netlink parsing code.
The code leaves a *lot* to be desired...

Turns out it doesn't honour the nlmsghdr.length field of NLMSG_DONE
messages, so it only reads the header (16 bytes) instead of the kernel
generated 20=16+4 NULL bytes.  I'm not sure why those extra 4 bytes
are there, but they are... (anyone know?)
This results in a leftover 4 bytes, which then fail to parse as
another nlmsghdr (because it also effectively ignores that it's a DONE
and continues parsing).
Which explains the failure:
  TypeError: NLMsgHdr requires a bytes object of length 16, got 4

Fixing the parsing, results in things hanging, because we ignore the DONE.

Fixing that... causes more issues (or I'm still confused about how the
rest works, it's hard to follow, complicated by python's lack of types
and some apparently dead code).

Ultimately I think the right answer is to simply fix the horribly
broken netlink parser, which only ever worked by (more-or-less)
chance.  We have plenty of time (months) to fix it in time for the
next release of Android after 15/V, which will be the first one to
support a kernel newer than 6.6 LTS anyway.

Furthermore, the python netlink parser is only used in the test
framework, while the non-test code itself uses C++& java netlink
parsers (that I have not yet looked at) but is likely to either work
or contain entirely different classes of bugs ;-)
Jakub Kicinski June 17, 2024, 4:36 p.m. UTC | #2
On Sun, 16 Jun 2024 10:09:07 +0200 Maciej Żenczykowski wrote:
> For the other patch, I've tracked down:
>   32affa5578f0 ("fib: rules: no longer hold RTNL in fib_nl_dumprule()")
> which causes half the regression.
> 
> But... I haven't figured out what causes the final half (or third
> depending on how you look at it).

To be completely honest I also have a fix queued for the other case,
since it was reported already a month ago. But I "forgot" to send it.
I had these tags on it:

    Reported-by: Stefano Brivio <sbrivio@redhat.com>
    Link: https://lore.kernel.org/all/20240315124808.033ff58d@elisabeth
    Reported-by: Ilya Maximets <i.maximets@ovn.org>
    Link: https://lore.kernel.org/all/02b50aae-f0e9-47a4-8365-a977a85975d3@ovn.org
    Fixes: 4ce5dc9316de ("inet: switch inet_dump_fib() to RCU protection")
    Fixes: 5fc68320c1fb ("ipv6: remove RTNL protection from inet6_dump_fib()")
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

LMK if you want to resend yours or I should send mine, because Ilya
pinged the old thread this morning..

> I've also spent quite a while trying to figure out what exactly is
> going wrong in the python netlink parsing code.
> The code leaves a *lot* to be desired...
> 
> Turns out it doesn't honour the nlmsghdr.length field of NLMSG_DONE
> messages, so it only reads the header (16 bytes) instead of the kernel
> generated 20=16+4 NULL bytes.  I'm not sure why those extra 4 bytes
> are there, but they are... (anyone know?)

They are the error code. Just like in a netlink ack. And similarly
extack attrs may follow. Main difference with ack off the top of my
head is that DONE never echos the request.

> This results in a leftover 4 bytes, which then fail to parse as
> another nlmsghdr (because it also effectively ignores that it's a DONE
> and continues parsing).
> Which explains the failure:
>   TypeError: NLMsgHdr requires a bytes object of length 16, got 4
> 
> Fixing the parsing, results in things hanging, because we ignore the DONE.
> 
> Fixing that... causes more issues (or I'm still confused about how the
> rest works, it's hard to follow, complicated by python's lack of types
> and some apparently dead code).
> 
> Ultimately I think the right answer is to simply fix the horribly
> broken netlink parser, which only ever worked by (more-or-less)
> chance.  We have plenty of time (months) to fix it in time for the
> next release of Android after 15/V, which will be the first one to
> support a kernel newer than 6.6 LTS anyway.
> 
> Furthermore, the python netlink parser is only used in the test
> framework, while the non-test code itself uses C++& java netlink
> parsers (that I have not yet looked at) but is likely to either work
> or contain entirely different classes of bugs ;-)

We do have: tools/net/ynl/lib/ynl.py in the tree, FWIW. 
It's BSD-licensed, feel free to lift it / some of it.
It's designed for the netlink YAML specs but the basics like 
message / attr parsing should work.
diff mbox series

Patch

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 45fd88405b6b..e1d12c6ac5b6 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3892,7 +3892,7 @@  static int __init neigh_init(void)
 	rtnl_register(PF_UNSPEC, RTM_NEWNEIGH, neigh_add, NULL, 0);
 	rtnl_register(PF_UNSPEC, RTM_DELNEIGH, neigh_delete, NULL, 0);
 	rtnl_register(PF_UNSPEC, RTM_GETNEIGH, neigh_get, neigh_dump_info,
-		      RTNL_FLAG_DUMP_UNLOCKED);
+		      RTNL_FLAG_DUMP_UNLOCKED | RTNL_FLAG_DUMP_SPLIT_NLM_DONE);
 
 	rtnl_register(PF_UNSPEC, RTM_GETNEIGHTBL, NULL, neightbl_dump_info,
 		      0);