mbox series

[net,v2,0/5] Make TCP-MD5-diag slightly less broken

Message ID 20241113-tcp-md5-diag-prep-v2-0-00a2a7feb1fa@gmail.com (mailing list archive)
Headers show
Series Make TCP-MD5-diag slightly less broken | expand

Message

Dmitry Safonov via B4 Relay Nov. 13, 2024, 6:46 p.m. UTC
Changes in v2:
- Fixup for uninitilized md5sig_count stack variable
  (Oops! Kudos to kernel test robot <lkp@intel.com>)
- Correct space damage, add a missing Fixes tag &
  reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
- Take out patch for maximum attribute length, see (4) below.
  Going to send it later with the next TCP-AO-diag part
  (Kuniyuki Iwashima)
- Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail.com

My original intent was to replace the last non-upstream Arista's TCP-AO
piece. That is per-netns procfs seqfile which lists AO keys. In my view
an acceptable upstream alternative would be TCP-AO-diag uAPI.

So, I started by looking and reviewing TCP-MD5-diag code. And straight
away I saw a bunch of issues:

1. Similarly to TCP_MD5SIG_EXT, which doesn't check tcpm_flags for
   unknown flags and so being non-extendable setsockopt(), the same
   way tcp_diag_put_md5sig() dumps md5 keys in an array of
   tcp_diag_md5sig, which makes it ABI non-extendable structure
   as userspace can't tolerate any new members in it.

2. Inet-diag allocates netlink message for sockets in
   inet_diag_dump_one_icsk(), which uses a TCP-diag callback
   .idiag_get_aux_size(), that pre-calculates the needed space for
   TCP-diag related information. But as neither socket lock nor
   rcu_readlock() are held between allocation and the actual TCP
   info filling, the TCP-related space requirement may change before
   reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
   a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
   return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().

3. Inet-diag "do" request* can create skb of any message required size.
   But "dump" request* the skb size, since d35c99ff77ec ("netlink: do
   not enter direct reclaim from netlink_dump()") is limited by
   32 KB. Having in mind that sizeof(struct tcp_diag_md5sig) = 100 bytes, 
   dumps for sockets that have more than 327 keys are going to fail
   (not counting other diag infos, which lower this limit futher).
   That is much lower than the number of TCP-MD5 keys that can be
   allocated on a socket with the current default
   optmem_max limit (128Kb).

So, then I went and written selftests for TCP-MD5-diag and besides
confirming that (2) and (3) are not theoretical issues, I also
discovered another issues, that I didn't notice on code inspection:

4. nlattr::nla_len is __u16, which limits the largest netlink attibute
   by 64Kb or by 655 tcp_diag_md5sig keys in the diag array. What
   happens de-facto is that the netlink attribute gets u16 overflow,
   breaking the userspace parsing - RTA_NEXT(), that should point
   to the next attribute, points into the middle of md5 keys array.

In this patch set issues (2) and (4) are addressed.
(2) by not returning EMSGSIZE when the dump raced with modifying
TCP-MD5 keys on a socket, but mark the dump inconsistent by setting
NLM_F_DUMP_INTR nlmsg flag. Which changes uAPI in situations where
previously kernel did WARN() and errored the dump.
(4) by artificially limiting the maximum attribute size by U16_MAX - 1.

In order to remove the new limit from (4) solution, my plan is to
convert the dump of TCP-MD5 keys from an array to
NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
And for (3), it's needed to teach tcp-diag how-to remember not only
socket on which previous recvmsg() stopped, but potentially TCP-MD5
key as well.

I plan in the next part of patch set address (3), (1) and the new limit
for (4), together with adding new TCP-AO-diag.

* Terminology from Documentation/userspace-api/netlink/intro.rst

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
Dmitry Safonov (5):
      net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
      net/diag: Warn only once on EMSGSIZE
      net/diag: Pre-allocate optional info only if requested
      net/diag: Always pre-allocate tcp_ulp info
      net/netlink: Correct the comment on netlink message max cap

 include/linux/inet_diag.h |  3 +-
 include/net/tcp.h         |  1 -
 net/ipv4/inet_diag.c      | 89 ++++++++++++++++++++++++++++++++++++++---------
 net/ipv4/tcp_diag.c       | 68 ++++++++++++++++++------------------
 net/mptcp/diag.c          | 20 -----------
 net/netlink/af_netlink.c  |  4 +--
 net/tls/tls_main.c        | 17 ---------
 7 files changed, 110 insertions(+), 92 deletions(-)
---
base-commit: f1b785f4c7870c42330b35522c2514e39a1e28e7
change-id: 20241106-tcp-md5-diag-prep-2f0dcf371d90

Best regards,

Comments

MPTCP CI Nov. 13, 2024, 7:52 p.m. UTC | #1
Hi Dmitry,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal: Success! ✅
- KVM Validation: debug: Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/11823966620

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/0952c60a2572
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=909394


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)
Jakub Kicinski Nov. 16, 2024, 12:08 a.m. UTC | #2
On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> 2. Inet-diag allocates netlink message for sockets in
>    inet_diag_dump_one_icsk(), which uses a TCP-diag callback
>    .idiag_get_aux_size(), that pre-calculates the needed space for
>    TCP-diag related information. But as neither socket lock nor
>    rcu_readlock() are held between allocation and the actual TCP
>    info filling, the TCP-related space requirement may change before
>    reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
>    a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
>    return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().

Would it be too ugly if we simply retried with a 32kB skb if the initial
dump failed with EMSGSIZE?

Another option would be to automatically grow the skb. The size
accounting is an endless source of bugs. We'd just need to scan
the codebase to make sure there are no cases where someone does

	ptr = __nla_reserve();
	nla_put();
	*ptr = 0;

Which may be too much of a project and source of bugs in itself.

Or do both, retry as a fix, and auto-grow in net-next.

> In order to remove the new limit from (4) solution, my plan is to
> convert the dump of TCP-MD5 keys from an array to
> NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> And for (3), it's needed to teach tcp-diag how-to remember not only
> socket on which previous recvmsg() stopped, but potentially TCP-MD5
> key as well.

Just putting the same attribute type multiple times is preferable
to array types.
patchwork-bot+netdevbpf@kernel.org Nov. 16, 2024, 12:20 a.m. UTC | #3
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 13 Nov 2024 18:46:39 +0000 you wrote:
> Changes in v2:
> - Fixup for uninitilized md5sig_count stack variable
>   (Oops! Kudos to kernel test robot <lkp@intel.com>)
> - Correct space damage, add a missing Fixes tag &
>   reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
> - Take out patch for maximum attribute length, see (4) below.
>   Going to send it later with the next TCP-AO-diag part
>   (Kuniyuki Iwashima)
> - Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail.com
> 
> [...]

Here is the summary with links:
  - [net,v2,1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
    (no matching commit)
  - [net,v2,2/5] net/diag: Warn only once on EMSGSIZE
    (no matching commit)
  - [net,v2,3/5] net/diag: Pre-allocate optional info only if requested
    (no matching commit)
  - [net,v2,4/5] net/diag: Always pre-allocate tcp_ulp info
    (no matching commit)
  - [net,v2,5/5] net/netlink: Correct the comment on netlink message max cap
    https://git.kernel.org/netdev/net-next/c/e51edeaf3506

You are awesome, thank you!
Dmitry Safonov Nov. 16, 2024, 12:48 a.m. UTC | #4
Hi Jakub,

On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > 2. Inet-diag allocates netlink message for sockets in
> >    inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> >    .idiag_get_aux_size(), that pre-calculates the needed space for
> >    TCP-diag related information. But as neither socket lock nor
> >    rcu_readlock() are held between allocation and the actual TCP
> >    info filling, the TCP-related space requirement may change before
> >    reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> >    a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> >    return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
>
> Would it be too ugly if we simply retried with a 32kB skb if the initial
> dump failed with EMSGSIZE?

Yeah, I'm not sure. I thought of keeping it simple and just marking
the nlmsg "inconsistent". This is arguably a change of meaning for
NLM_F_DUMP_INTR because previously, it meant that the multi-message
dump became inconsistent between recvmsg() calls. And now, it is also
utilized in the "do" version if it raced with the socket setsockopts()
in another thread.

> Another option would be to automatically grow the skb. The size
> accounting is an endless source of bugs. We'd just need to scan
> the codebase to make sure there are no cases where someone does
>
>         ptr = __nla_reserve();
>         nla_put();
>         *ptr = 0;
>
> Which may be too much of a project and source of bugs in itself.

This seems quite more complex than just marking the dump inconsistent
and letting the userspace deal with the result or retry if it wants
precise key information.

> Or do both, retry as a fix, and auto-grow in net-next.
>
> > In order to remove the new limit from (4) solution, my plan is to
> > convert the dump of TCP-MD5 keys from an array to
> > NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> > And for (3), it's needed to teach tcp-diag how-to remember not only
> > socket on which previous recvmsg() stopped, but potentially TCP-MD5
> > key as well.
>
> Just putting the same attribute type multiple times is preferable
> to array types.

Cool. I didn't know that. I think I was confused by iproute way of
parsing [which I read very briefly, so might have misunderstood]:
: while (RTA_OK(rta, len)) {
:         type = rta->rta_type & ~flags;
:         if ((type <= max) && (!tb[type]))
:                 tb[type] = rta;
:         rta = RTA_NEXT(rta, len);
: }
https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526

which seems like it will just ignore duplicate attributes.

That doesn't mean iproute has to dictate new code in kernel, for sure.

Thanks,
             Dmitry
Jakub Kicinski Nov. 16, 2024, 1:58 a.m. UTC | #5
On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:
> On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:  
> > > 2. Inet-diag allocates netlink message for sockets in
> > >    inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> > >    .idiag_get_aux_size(), that pre-calculates the needed space for
> > >    TCP-diag related information. But as neither socket lock nor
> > >    rcu_readlock() are held between allocation and the actual TCP
> > >    info filling, the TCP-related space requirement may change before
> > >    reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> > >    a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> > >    return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().  
> >
> > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > dump failed with EMSGSIZE?  
> 
> Yeah, I'm not sure. I thought of keeping it simple and just marking
> the nlmsg "inconsistent". This is arguably a change of meaning for
> NLM_F_DUMP_INTR because previously, it meant that the multi-message
> dump became inconsistent between recvmsg() calls. And now, it is also
> utilized in the "do" version if it raced with the socket setsockopts()
> in another thread.

NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
was a workaround for consistency of the dump as a whole. Single message
we can re-generate quite easily in the kernel, so forcing the user to
handle INTR and retry seems unnecessarily cruel ;)

> > > In order to remove the new limit from (4) solution, my plan is to
> > > convert the dump of TCP-MD5 keys from an array to
> > > NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> > > And for (3), it's needed to teach tcp-diag how-to remember not only
> > > socket on which previous recvmsg() stopped, but potentially TCP-MD5
> > > key as well.  
> >
> > Just putting the same attribute type multiple times is preferable
> > to array types.  
> 
> Cool. I didn't know that. I think I was confused by iproute way of
> parsing [which I read very briefly, so might have misunderstood]:
> : while (RTA_OK(rta, len)) {
> :         type = rta->rta_type & ~flags;
> :         if ((type <= max) && (!tb[type]))
> :                 tb[type] = rta;
> :         rta = RTA_NEXT(rta, len);
> : }
> https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526
> 
> which seems like it will just ignore duplicate attributes.
> 
> That doesn't mean iproute has to dictate new code in kernel, for sure.

Right, the table based parsing doesn't work well with multi-attr,
but other table formats aren't fundamentally better. Or at least
I never came up with a good way of solving this. And the multi-attr
at least doesn't suffer from the u16 problem.
Dmitry Safonov Nov. 16, 2024, 3:52 a.m. UTC | #6
On Sat, 16 Nov 2024 at 01:58, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:
> > On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
> > > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > > dump failed with EMSGSIZE?
> >
> > Yeah, I'm not sure. I thought of keeping it simple and just marking
> > the nlmsg "inconsistent". This is arguably a change of meaning for
> > NLM_F_DUMP_INTR because previously, it meant that the multi-message
> > dump became inconsistent between recvmsg() calls. And now, it is also
> > utilized in the "do" version if it raced with the socket setsockopts()
> > in another thread.
>
> NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
> was a workaround for consistency of the dump as a whole. Single message
> we can re-generate quite easily in the kernel, so forcing the user to
> handle INTR and retry seems unnecessarily cruel ;)

Kind of agree. But then, it seems to be quite rare. Even on a
purposely created selftest it fires not each time (maybe I'm not
skilful enough). Yet somewhat sceptical about a re-try in the kernel:
the need for it is caused by another thread manipulating keys, so we
may need another re-try after the first re-try... So, then we would
have to introduce a limit on retries :D

Hmm, what do you think about a kind of middle-ground/compromise
solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
hardly ever/never happen by purposely allocating larger skb. I don't
want to set some value in stone as one day it might become not enough
for all different socket infos, but maybe just add 4kB more to the
initial allocation? So, for it to reproduce, another thread would have
to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
socket between this thread's skb allocation and filling of the info
array. I'd call it "attempting to be nice to a user, but not at their
busylooping expense".

> > > Just putting the same attribute type multiple times is preferable
> > > to array types.
> >
> > Cool. I didn't know that. I think I was confused by iproute way of
> > parsing [which I read very briefly, so might have misunderstood]:
> > : while (RTA_OK(rta, len)) {
> > :         type = rta->rta_type & ~flags;
> > :         if ((type <= max) && (!tb[type]))
> > :                 tb[type] = rta;
> > :         rta = RTA_NEXT(rta, len);
> > : }
> > https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526
> >
> > which seems like it will just ignore duplicate attributes.
> >
> > That doesn't mean iproute has to dictate new code in kernel, for sure.
>
> Right, the table based parsing doesn't work well with multi-attr,
> but other table formats aren't fundamentally better. Or at least
> I never came up with a good way of solving this. And the multi-attr
> at least doesn't suffer from the u16 problem.

Yeah, also an array of structs that makes it impossible to extend such
an ABI with new members.

And with regards to u16, I was thinking of this diff for net-next, but
was not sure if it's worth it:

diff --git a/lib/nlattr.c b/lib/nlattr.c
index be9c576b6e2d..01c5a49ffa34 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
int attrtype, int attrlen)
 {
  struct nlattr *nla;

+ DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
+     "requested nlattr::nla_len %d >= U16_MAX", attrlen);
+
  nla = skb_put(skb, nla_total_size(attrlen));
  nla->nla_type = attrtype;
  nla->nla_len = nla_attr_size(attrlen);
--->8---

Thanks,
             Dmitry
Jakub Kicinski Nov. 19, 2024, 12:12 a.m. UTC | #7
On Sat, 16 Nov 2024 03:52:47 +0000 Dmitry Safonov wrote:
> On Sat, 16 Nov 2024 at 01:58, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:  
> > > Yeah, I'm not sure. I thought of keeping it simple and just marking
> > > the nlmsg "inconsistent". This is arguably a change of meaning for
> > > NLM_F_DUMP_INTR because previously, it meant that the multi-message
> > > dump became inconsistent between recvmsg() calls. And now, it is also
> > > utilized in the "do" version if it raced with the socket setsockopts()
> > > in another thread.  
> >
> > NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
> > was a workaround for consistency of the dump as a whole. Single message
> > we can re-generate quite easily in the kernel, so forcing the user to
> > handle INTR and retry seems unnecessarily cruel ;)  
> 
> Kind of agree. But then, it seems to be quite rare. Even on a
> purposely created selftest it fires not each time (maybe I'm not
> skilful enough). Yet somewhat sceptical about a re-try in the kernel:
> the need for it is caused by another thread manipulating keys, so we
> may need another re-try after the first re-try... So, then we would
> have to introduce a limit on retries :D

Wouldn't be the first time ;)
But I'd just retry once with a "very large" buffer.

> Hmm, what do you think about a kind of middle-ground/compromise
> solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
> hardly ever/never happen by purposely allocating larger skb. I don't
> want to set some value in stone as one day it might become not enough
> for all different socket infos, but maybe just add 4kB more to the
> initial allocation? So, for it to reproduce, another thread would have
> to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
> socket between this thread's skb allocation and filling of the info
> array. I'd call it "attempting to be nice to a user, but not at their
> busylooping expense".

The size of the retry buffer should be larger than any valid size.
We can add a warning if calculated size >= 32kB.
If we support an inf number of md5 keys we need to cap it.

Eric is back later this week, perhaps we should wait for his advice.

> > Right, the table based parsing doesn't work well with multi-attr,
> > but other table formats aren't fundamentally better. Or at least
> > I never came up with a good way of solving this. And the multi-attr
> > at least doesn't suffer from the u16 problem.  
> 
> Yeah, also an array of structs that makes it impossible to extend such
> an ABI with new members.
> 
> And with regards to u16, I was thinking of this diff for net-next, but
> was not sure if it's worth it:
> 
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index be9c576b6e2d..01c5a49ffa34 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
> int attrtype, int attrlen)
>  {
>   struct nlattr *nla;
> 
> + DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
> +     "requested nlattr::nla_len %d >= U16_MAX", attrlen);
> +
>   nla = skb_put(skb, nla_total_size(attrlen));
>   nla->nla_type = attrtype;
>   nla->nla_len = nla_attr_size(attrlen);

I'm slightly worried that this can be triggered already from user
space, but we can try DEBUG_NET_* and see. Here and in nla_nest_end().
Dmitry Safonov Nov. 20, 2024, 12:19 a.m. UTC | #8
On Tue, 19 Nov 2024 at 00:12, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 16 Nov 2024 03:52:47 +0000 Dmitry Safonov wrote:
> > Kind of agree. But then, it seems to be quite rare. Even on a
> > purposely created selftest it fires not each time (maybe I'm not
> > skilful enough). Yet somewhat sceptical about a re-try in the kernel:
> > the need for it is caused by another thread manipulating keys, so we
> > may need another re-try after the first re-try... So, then we would
> > have to introduce a limit on retries :D
>
> Wouldn't be the first time ;)
> But I'd just retry once with a "very large" buffer.
>
> > Hmm, what do you think about a kind of middle-ground/compromise
> > solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
> > hardly ever/never happen by purposely allocating larger skb. I don't
> > want to set some value in stone as one day it might become not enough
> > for all different socket infos, but maybe just add 4kB more to the
> > initial allocation? So, for it to reproduce, another thread would have
> > to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
> > socket between this thread's skb allocation and filling of the info
> > array. I'd call it "attempting to be nice to a user, but not at their
> > busylooping expense".
>
> The size of the retry buffer should be larger than any valid size.
> We can add a warning if calculated size >= 32kB.

Currently, md5/ao keys are limited by sock_kmalloc(), which uses
optmem_max sysctl limit. The default nowadays is 128KB.

From [1] I see that the current in-kernel (struct tcp_md5sig_key) hits
optmem_max on
# ok 38 optmem limit was hit on adding 655 key
IOW, with the default limit and sizeof(struct tcp_diag_md5sig) = 100,
the maximum skb size would be ~= 65Kb. Sounds a little too big for
kmemcache allocation.

Initially, my idea was to limit this old version of tcp-md5-diag by
U16_MAX. Now I'm thinking of adopting your idea by always allocating
32kB skb for single-message and marking it somehow, if it's not big
enough to fit all the keys on a socket (NLM_F_DUMP_INTR or any other
alternative for userspace to get a clue that the single message wasn't
enough).

Then, as I planned, teach the multi-message dump iterator to stop
between recvmsg() on N-th md5/ao key and continue the dump from that
key on the next recvmsg().

> If we support an inf number of md5 keys we need to cap it.

Yeah, unfortunately, we have some customers with 1000 peers (and
because of that we internally test BGP with even more peers).
And that's with an assumption of one key per peer, which is not
necessarily true for AO.

> Eric is back later this week, perhaps we should wait for his advice.

Sure, I will be glad to have advice from you both, thanks!

> > > Right, the table based parsing doesn't work well with multi-attr,
> > > but other table formats aren't fundamentally better. Or at least
> > > I never came up with a good way of solving this. And the multi-attr
> > > at least doesn't suffer from the u16 problem.
> >
> > Yeah, also an array of structs that makes it impossible to extend such
> > an ABI with new members.
> >
> > And with regards to u16, I was thinking of this diff for net-next, but
> > was not sure if it's worth it:
> >
> > diff --git a/lib/nlattr.c b/lib/nlattr.c
> > index be9c576b6e2d..01c5a49ffa34 100644
> > --- a/lib/nlattr.c
> > +++ b/lib/nlattr.c
> > @@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
> > int attrtype, int attrlen)
> >  {
> >   struct nlattr *nla;
> >
> > + DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
> > +     "requested nlattr::nla_len %d >= U16_MAX", attrlen);
> > +
> >   nla = skb_put(skb, nla_total_size(attrlen));
> >   nla->nla_type = attrtype;
> >   nla->nla_len = nla_attr_size(attrlen);
>
> I'm slightly worried that this can be triggered already from user
> space, but we can try DEBUG_NET_* and see. Here and in nla_nest_end().

Yeah, I thought that CONFIG_DEBUG_NET is not enabled on generic
distros, but the description is:
:          Enable extra sanity checks in networking.
:          This is mostly used by fuzzers, but is safe to select.

not sure if that guards any production users from enabling it.
But that would be interesting to see if, with those new additions,
netdev doesn't produce any warnings.

[1] https://netdev-3.bots.linux.dev/vmksft-tcp-ao/results/867500/14-setsockopt-closed-ipv4/stdout

Thanks,
             Dmitry
Johannes Berg Nov. 20, 2024, 8:44 a.m. UTC | #9
(Sorry, late to the party)

On Fri, 2024-11-15 at 16:08 -0800, Jakub Kicinski wrote:
> On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > 2. Inet-diag allocates netlink message for sockets in
> >    inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> >    .idiag_get_aux_size(), that pre-calculates the needed space for
> >    TCP-diag related information. But as neither socket lock nor
> >    rcu_readlock() are held between allocation and the actual TCP
> >    info filling, the TCP-related space requirement may change before
> >    reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> >    a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> >    return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
> 
> Would it be too ugly if we simply retried with a 32kB skb if the initial
> dump failed with EMSGSIZE?

We have min_dump_alloc, which a number of places are setting much higher
than the default, partially at least because there were similar issues,
e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.

Kind of ugly? Sure! And we shouldn't use it now with newer userspace
that knows to request a more finely split dump. For older userspace it's
the only way though.

Also, we don't even give all the data to older userspace (it must
support split dumps to get information about the more modern features, 6
GHz channels, etc.), but I gather that's not an option here.

> Another option would be to automatically grow the skb. The size
> accounting is an endless source of bugs. We'd just need to scan
> the codebase to make sure there are no cases where someone does
> 
> 	ptr = __nla_reserve();
> 	nla_put();
> 	*ptr = 0;
> 
> Which may be too much of a project and source of bugs in itself.
> 
> Or do both, retry as a fix, and auto-grow in net-next.

For auto-grow you'd also have to have information about the userspace
buffer, I think? It still has to fit there, might as well fail anyway if
that buffer is too small? I'm not sure we have that link back? But I'm
not really sure right now, just remember this as an additional wrinkle
from the above-mentioned nl80211 problem.

johannes
Dmitry Safonov Nov. 20, 2024, 4:13 p.m. UTC | #10
On Wed, 20 Nov 2024 at 08:44, Johannes Berg <johannes@sipsolutions.net> wrote:
>
> (Sorry, late to the party)

Thanks for joining! :-)

> On Fri, 2024-11-15 at 16:08 -0800, Jakub Kicinski wrote:
> > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > dump failed with EMSGSIZE?
>
> We have min_dump_alloc, which a number of places are setting much higher
> than the default, partially at least because there were similar issues,
> e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.

Yeah, your example seems alike what netlink_dump() does with
min_dump_alloc and max_recvmsg_len. You have there
.doit = nl80211_get_wiphy,
.dumpit = nl80211_dump_wiphy,

So at this initial patch set, I'm trying to fix
inet_diag_handler::dump_one() callback, which is to my understanding
same as .doit() for generic netlink [should we just rename struct
inet_diag_handler callbacks to match the generics?]. See
inet_diag_handler_cmd() and NLM_F_DUMP in
Documentation/userspace-api/netlink/intro.rst
For TCP-MD5-diag even the single message reply may have a variable
number of keys on a socket's dump. For multi-messages dump, my plan is
to use netlink_callback::ctx[] and add an iterator type that will
allow to stop on N-th key between recvmsg() calls.

> Kind of ugly? Sure! And we shouldn't use it now with newer userspace
> that knows to request a more finely split dump. For older userspace it's
> the only way though.

Heh, the comment in nl80211_dump_wiphy() on sending an empty message
and retrying is ouch!

>
> Also, we don't even give all the data to older userspace (it must
> support split dumps to get information about the more modern features, 6
> GHz channels, etc.), but I gather that's not an option here.
>
> > Another option would be to automatically grow the skb. The size
> > accounting is an endless source of bugs. We'd just need to scan
> > the codebase to make sure there are no cases where someone does
> >
> >       ptr = __nla_reserve();
> >       nla_put();
> >       *ptr = 0;
> >
> > Which may be too much of a project and source of bugs in itself.
> >
> > Or do both, retry as a fix, and auto-grow in net-next.
>
> For auto-grow you'd also have to have information about the userspace
> buffer, I think? It still has to fit there, might as well fail anyway if
> that buffer is too small? I'm not sure we have that link back? But I'm
> not really sure right now, just remember this as an additional wrinkle
> from the above-mentioned nl80211 problem.

Yeah, netlink_recvmsg() attempts to track what the userspace is asking:

:        /* Record the max length of recvmsg() calls for future allocations */
:        max_recvmsg_len = max(READ_ONCE(nlk->max_recvmsg_len), len);
:        max_recvmsg_len = min_t(size_t, max_recvmsg_len,
:                                SKB_WITH_OVERHEAD(32768));
:        WRITE_ONCE(nlk->max_recvmsg_len, max_recvmsg_len);

Thanks,
             Dmitry
Johannes Berg Nov. 20, 2024, 7:36 p.m. UTC | #11
On Wed, 2024-11-20 at 16:13 +0000, Dmitry Safonov wrote:
> > We have min_dump_alloc, which a number of places are setting much higher
> > than the default, partially at least because there were similar issues,
> > e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.
> 
> Yeah, your example seems alike what netlink_dump() does with
> min_dump_alloc and max_recvmsg_len. You have there
> .doit = nl80211_get_wiphy,
> .dumpit = nl80211_dump_wiphy,
> 
> So at this initial patch set, I'm trying to fix
> inet_diag_handler::dump_one() callback, which is to my understanding
> same as .doit() for generic netlink [should we just rename struct
> inet_diag_handler callbacks to match the generics?].

dump_one() doesn't sound like doit(), it sounds more like dump one
object? In generic netlink dumpit has to handle that internally, and
doit() is for commands, without F_DUMP.

But also generic netlink is just one netlink family, so I wouldn't
really rename anything to match it anyway.

>  See
> inet_diag_handler_cmd() and NLM_F_DUMP in
> Documentation/userspace-api/netlink/intro.rst
> For TCP-MD5-diag even the single message reply may have a variable
> number of keys on a socket's dump.

Right.

> For multi-messages dump, my plan is
> to use netlink_callback::ctx[] and add an iterator type that will
> allow to stop on N-th key between recvmsg() calls.

Right, so userspace has to understand that format. In nl80211 we've made
an input flag attribute (inputs are often unused with dump, but are
present) to request that split format.

> > Kind of ugly? Sure! And we shouldn't use it now with newer userspace
> > that knows to request a more finely split dump. For older userspace it's
> > the only way though.
> 
> Heh, the comment in nl80211_dump_wiphy() on sending an empty message
> and retrying is ouch!

Yeah ... Luckily we basically converted all userspace to request split
dumps, so we shouldn't ever get there now.

> > For auto-grow you'd also have to have information about the userspace
> > buffer, I think? It still has to fit there, might as well fail anyway if
> > that buffer is too small? I'm not sure we have that link back? But I'm
> > not really sure right now, just remember this as an additional wrinkle
> > from the above-mentioned nl80211 problem.
> 
> Yeah, netlink_recvmsg() attempts to track what the userspace is asking:
> 
> :        /* Record the max length of recvmsg() calls for future allocations */
> :        max_recvmsg_len = max(READ_ONCE(nlk->max_recvmsg_len), len);
> :        max_recvmsg_len = min_t(size_t, max_recvmsg_len,
> :                                SKB_WITH_OVERHEAD(32768));
> :        WRITE_ONCE(nlk->max_recvmsg_len, max_recvmsg_len);

Right, OK, so that's sorted then :)

johannes