diff mbox series

[v1,net,2/2] rds: tcp: Fix use-after-free of net in reqsk_timer_handler().

Message ID 20240223172448.94084-3-kuniyu@amazon.com (mailing list archive)
State Not Applicable
Headers show
Series rds: Fix use-after-free of net by tcp reqsk timer. | expand

Commit Message

Kuniyuki Iwashima Feb. 23, 2024, 5:24 p.m. UTC
syzkaller reported a warning of netns tracker [0] followed by KASAN
splat [1] and another ref tracker warning [1].

syzkaller could not find a repro, but in the log, the only suspicious
sequence was as follows:

  18:26:22 executing program 1:
  r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
  ...
  connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)

The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.

So, the scenario would be:

  1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
      rds_tcp_listen_init().
  2. syz-executor connect()s to it and creates a reqsk.
  3. syz-executor exit()s immediately.
  4. netns is dismantled.  [0]
  5. reqsk timer is fired, and UAF happens while freeing reqsk.  [1]
  6. listener is freed after RCU grace period.  [2]

Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.

Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash, but
the issue still exists for the global ehash.

We can apply the same fix, but this issue is specific to RDS.

Instead of iterating potentially large ehash and purging reqsk during
netns dismantle, let's hold netns refcount for the kernel TCP listener.

[0]:
ref_tracker: net notrefcnt@0000000065449cc3 has 1/1 users at
     sk_alloc (./include/net/net_namespace.h:337 net/core/sock.c:2146)
     inet6_create (net/ipv6/af_inet6.c:192 net/ipv6/af_inet6.c:119)
     __sock_create (net/socket.c:1572)
     rds_tcp_listen_init (net/rds/tcp_listen.c:279)
     rds_tcp_init_net (net/rds/tcp.c:577)
     ops_init (net/core/net_namespace.c:137)
     setup_net (net/core/net_namespace.c:340)
     copy_net_ns (net/core/net_namespace.c:497)
     create_new_namespaces (kernel/nsproxy.c:110)
     unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
     ksys_unshare (kernel/fork.c:3429)
     __x64_sys_unshare (kernel/fork.c:3496)
     do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
     entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
...
WARNING: CPU: 0 PID: 27 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179)

[1]:
BUG: KASAN: slab-use-after-free in inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
Read of size 8 at addr ffff88801b370400 by task swapper/0/0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
 print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
 kasan_report (mm/kasan/report.c:603)
 inet_csk_reqsk_queue_drop (./include/net/inet_hashtables.h:180 net/ipv4/inet_connection_sock.c:952 net/ipv4/inet_connection_sock.c:966)
 reqsk_timer_handler (net/ipv4/inet_connection_sock.c:979 net/ipv4/inet_connection_sock.c:1092)
 call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701)
 __run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2038)
 run_timer_softirq (kernel/time/timer.c:2053)
 __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
 irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
 sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
 </IRQ>

Allocated by task 258 on cpu 0 at 83.612050s:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:68)
 __kasan_slab_alloc (mm/kasan/common.c:343)
 kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
 copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
 create_new_namespaces (kernel/nsproxy.c:110)
 unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
 ksys_unshare (kernel/fork.c:3429)
 __x64_sys_unshare (kernel/fork.c:3496)
 do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)

Freed by task 27 on cpu 0 at 329.158864s:
 kasan_save_stack (mm/kasan/common.c:48)
 kasan_save_track (mm/kasan/common.c:68)
 kasan_save_free_info (mm/kasan/generic.c:643)
 __kasan_slab_free (mm/kasan/common.c:265)
 kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
 cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
 process_one_work (kernel/workqueue.c:2638)
 worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:250)

The buggy address belongs to the object at ffff88801b370000
 which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
 freed 4352-byte region [ffff88801b370000, ffff88801b371100)

[2]:
WARNING: CPU: 0 PID: 95 at lib/ref_tracker.c:228 ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:ref_tracker_free (lib/ref_tracker.c:228 (discriminator 1))
...
Call Trace:
<IRQ>
 __sk_destruct (./include/net/net_namespace.h:353 net/core/sock.c:2204)
 rcu_core (./arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2165 kernel/rcu/tree.c:2433)
 __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554)
 irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:644)
 sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
</IRQ>

Reported-by: syzkaller <syzkaller@googlegroups.com>
Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/rds/tcp_listen.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Eric Dumazet Feb. 23, 2024, 6:09 p.m. UTC | #1
On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> syzkaller reported a warning of netns tracker [0] followed by KASAN
> splat [1] and another ref tracker warning [1].
>
> syzkaller could not find a repro, but in the log, the only suspicious
> sequence was as follows:
>
>   18:26:22 executing program 1:
>   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
>   ...
>   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
>
> The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
>
> So, the scenario would be:
>
>   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
>       rds_tcp_listen_init().
>   2. syz-executor connect()s to it and creates a reqsk.
>   3. syz-executor exit()s immediately.
>   4. netns is dismantled.  [0]
>   5. reqsk timer is fired, and UAF happens while freeing reqsk.  [1]
>   6. listener is freed after RCU grace period.  [2]
>
> Basically, reqsk assumes that the listener guarantees netns safety
> until all reqsk timers are expired by holding the listener's refcount.
> However, this was not the case for kernel sockets.
>
> Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> inet_twsk_purge()") fixed this issue only for per-netns ehash, but
> the issue still exists for the global ehash.
>
> We can apply the same fix, but this issue is specific to RDS.
>
> Instead of iterating potentially large ehash and purging reqsk during
> netns dismantle, let's hold netns refcount for the kernel TCP listener.
>
>
> Reported-by: syzkaller <syzkaller@googlegroups.com>
> Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/rds/tcp_listen.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> index 05008ce5c421..4f7863932df7 100644
> --- a/net/rds/tcp_listen.c
> +++ b/net/rds/tcp_listen.c
> @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct net *net, bool isv6)
>                 goto out;
>         }
>
> +       __netns_tracker_free(net, &sock->sk->ns_tracker, false);
> +       sock->sk->sk_net_refcnt = 1;
> +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> +       sock_inuse_add(net, 1);
> +

Why using sock_create_kern() then later 'convert' this kernel socket
to a user one ?

Would using __sock_create() avoid this ?
Kuniyuki Iwashima Feb. 23, 2024, 6:28 p.m. UTC | #2
From: Eric Dumazet <edumazet@google.com>
Date: Fri, 23 Feb 2024 19:09:27 +0100
> On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > syzkaller reported a warning of netns tracker [0] followed by KASAN
> > splat [1] and another ref tracker warning [1].
> >
> > syzkaller could not find a repro, but in the log, the only suspicious
> > sequence was as follows:
> >
> >   18:26:22 executing program 1:
> >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> >   ...
> >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
> >
> > The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
> >
> > So, the scenario would be:
> >
> >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> >       rds_tcp_listen_init().
> >   2. syz-executor connect()s to it and creates a reqsk.
> >   3. syz-executor exit()s immediately.
> >   4. netns is dismantled.  [0]
> >   5. reqsk timer is fired, and UAF happens while freeing reqsk.  [1]
> >   6. listener is freed after RCU grace period.  [2]
> >
> > Basically, reqsk assumes that the listener guarantees netns safety
> > until all reqsk timers are expired by holding the listener's refcount.
> > However, this was not the case for kernel sockets.
> >
> > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > inet_twsk_purge()") fixed this issue only for per-netns ehash, but
> > the issue still exists for the global ehash.
> >
> > We can apply the same fix, but this issue is specific to RDS.
> >
> > Instead of iterating potentially large ehash and purging reqsk during
> > netns dismantle, let's hold netns refcount for the kernel TCP listener.
> >
> >
> > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/rds/tcp_listen.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > index 05008ce5c421..4f7863932df7 100644
> > --- a/net/rds/tcp_listen.c
> > +++ b/net/rds/tcp_listen.c
> > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct net *net, bool isv6)
> >                 goto out;
> >         }
> >
> > +       __netns_tracker_free(net, &sock->sk->ns_tracker, false);
> > +       sock->sk->sk_net_refcnt = 1;
> > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > +       sock_inuse_add(net, 1);
> > +
> 
> Why using sock_create_kern() then later 'convert' this kernel socket
> to a user one ?
> 
> Would using __sock_create() avoid this ?

I think yes, but LSM would see kern=0 in pre/post socket() hooks.

Probably we can use __sock_create() in net-next and see if someone
complains.
Kuniyuki Iwashima Feb. 26, 2024, 7:14 p.m. UTC | #3
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri, 23 Feb 2024 10:28:32 -0800
> From: Eric Dumazet <edumazet@google.com>
> Date: Fri, 23 Feb 2024 19:09:27 +0100
> > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > syzkaller reported a warning of netns tracker [0] followed by KASAN
> > > splat [1] and another ref tracker warning [1].
> > >
> > > syzkaller could not find a repro, but in the log, the only suspicious
> > > sequence was as follows:
> > >
> > >   18:26:22 executing program 1:
> > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > >   ...
> > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, @loopback}, 0x1c) (async)
> > >
> > > The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
> > >
> > > So, the scenario would be:
> > >
> > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > >       rds_tcp_listen_init().
> > >   2. syz-executor connect()s to it and creates a reqsk.
> > >   3. syz-executor exit()s immediately.
> > >   4. netns is dismantled.  [0]
> > >   5. reqsk timer is fired, and UAF happens while freeing reqsk.  [1]
> > >   6. listener is freed after RCU grace period.  [2]
> > >
> > > Basically, reqsk assumes that the listener guarantees netns safety
> > > until all reqsk timers are expired by holding the listener's refcount.
> > > However, this was not the case for kernel sockets.
> > >
> > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > inet_twsk_purge()") fixed this issue only for per-netns ehash, but
> > > the issue still exists for the global ehash.
> > >
> > > We can apply the same fix, but this issue is specific to RDS.
> > >
> > > Instead of iterating potentially large ehash and purging reqsk during
> > > netns dismantle, let's hold netns refcount for the kernel TCP listener.
> > >
> > >
> > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  net/rds/tcp_listen.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > index 05008ce5c421..4f7863932df7 100644
> > > --- a/net/rds/tcp_listen.c
> > > +++ b/net/rds/tcp_listen.c
> > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct net *net, bool isv6)
> > >                 goto out;
> > >         }
> > >
> > > +       __netns_tracker_free(net, &sock->sk->ns_tracker, false);
> > > +       sock->sk->sk_net_refcnt = 1;
> > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > +       sock_inuse_add(net, 1);
> > > +
> > 
> > Why using sock_create_kern() then later 'convert' this kernel socket
> > to a user one ?
> > 
> > Would using __sock_create() avoid this ?
> 
> I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> 
> Probably we can use __sock_create() in net-next and see if someone
> complains.

I noticed the patchwork status is Changes Requested.
https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*

Should we use __sock_create() for RDS or add another parameter
to __sock_create(..., kern=true/false, netref=true/false) and
fix other similar uses (MPTCP, SMC, Netlink) altogether ?

Thanks!
Allison Henderson Feb. 26, 2024, 7:22 p.m. UTC | #4
On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Fri, 23 Feb 2024 10:28:32 -0800
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > <kuniyu@amazon.com> wrote:
> > > > 
> > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > KASAN
> > > > splat [1] and another ref tracker warning [1].
> > > > 
> > > > syzkaller could not find a repro, but in the log, the only
> > > > suspicious
> > > > sequence was as follows:
> > > > 
> > > >   18:26:22 executing program 1:
> > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > >   ...
> > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > @loopback}, 0x1c) (async)
> > > > 
> > > > The notable thing here is 0x4001 in connect(), which is
> > > > RDS_TCP_PORT.
> > > > 
> > > > So, the scenario would be:
> > > > 
> > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > >       rds_tcp_listen_init().
> > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > >   3. syz-executor exit()s immediately.
> > > >   4. netns is dismantled.  [0]
> > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > reqsk.  [1]
> > > >   6. listener is freed after RCU grace period.  [2]
> > > > 
> > > > Basically, reqsk assumes that the listener guarantees netns
> > > > safety
> > > > until all reqsk timers are expired by holding the listener's
> > > > refcount.
> > > > However, this was not the case for kernel sockets.
> > > > 
> > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > but
> > > > the issue still exists for the global ehash.
> > > > 
> > > > We can apply the same fix, but this issue is specific to RDS.
> > > > 
> > > > Instead of iterating potentially large ehash and purging reqsk
> > > > during
> > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > listener.
> > > > 
> > > > 
> > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > endpoints, one per netns.")
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > >  net/rds/tcp_listen.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > index 05008ce5c421..4f7863932df7 100644
> > > > --- a/net/rds/tcp_listen.c
> > > > +++ b/net/rds/tcp_listen.c
> > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > net *net, bool isv6)
> > > >                 goto out;
> > > >         }
> > > > 
> > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > false);
> > > > +       sock->sk->sk_net_refcnt = 1;
> > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > +       sock_inuse_add(net, 1);
> > > > +
> > > 
> > > Why using sock_create_kern() then later 'convert' this kernel
> > > socket
> > > to a user one ?
> > > 
> > > Would using __sock_create() avoid this ?
> > 
> > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > 
> > Probably we can use __sock_create() in net-next and see if someone
> > complains.
> 
> I noticed the patchwork status is Changes Requested.
> https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
>  
> 
> Should we use __sock_create() for RDS or add another parameter
> to __sock_create(..., kern=true/false, netref=true/false) and
> fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> 
> Thanks!

Hi all,

Thank you for looking at this.  I've been doing a little investigation
in the area to better understand the issue and this fix.  While I
understand what this patch is trying to do here, I'd like to do a
little more digging as to why 740ea3c4a0b2 didnt work for rds, or what
else rds may not be doing correctly that the other sockets are.  I'm
not quite sure about setting the kern parameter to 0 for socket_create.
While it seems like it would have a similar effect, this looks
incorrect since this is not a user space socket.  

I'll do a little more diging myself too.  If you had another idea about
adding parameters to __sock_create, I'd be happy to take a look.  Thank
you!

Allison
Kuniyuki Iwashima Feb. 26, 2024, 7:38 p.m. UTC | #5
From: Allison Henderson <allison.henderson@oracle.com>
Date: Mon, 26 Feb 2024 19:22:01 +0000
> On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > From: Eric Dumazet <edumazet@google.com>
> > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > <kuniyu@amazon.com> wrote:
> > > > > 
> > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > KASAN
> > > > > splat [1] and another ref tracker warning [1].
> > > > > 
> > > > > syzkaller could not find a repro, but in the log, the only
> > > > > suspicious
> > > > > sequence was as follows:
> > > > > 
> > > > >   18:26:22 executing program 1:
> > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > >   ...
> > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > @loopback}, 0x1c) (async)
> > > > > 
> > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > RDS_TCP_PORT.
> > > > > 
> > > > > So, the scenario would be:
> > > > > 
> > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > >       rds_tcp_listen_init().
> > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > >   3. syz-executor exit()s immediately.
> > > > >   4. netns is dismantled.  [0]
> > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > reqsk.  [1]
> > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > 
> > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > safety
> > > > > until all reqsk timers are expired by holding the listener's
> > > > > refcount.
> > > > > However, this was not the case for kernel sockets.
> > > > > 
> > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > but
> > > > > the issue still exists for the global ehash.
> > > > > 
> > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > 
> > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > during
> > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > listener.
> > > > > 
> > > > > 
> > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > endpoints, one per netns.")
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > --- a/net/rds/tcp_listen.c
> > > > > +++ b/net/rds/tcp_listen.c
> > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > net *net, bool isv6)
> > > > >                 goto out;
> > > > >         }
> > > > > 
> > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > false);
> > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > +       sock_inuse_add(net, 1);
> > > > > +
> > > > 
> > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > socket
> > > > to a user one ?
> > > > 
> > > > Would using __sock_create() avoid this ?
> > > 
> > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > 
> > > Probably we can use __sock_create() in net-next and see if someone
> > > complains.
> > 
> > I noticed the patchwork status is Changes Requested.
> > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> >  
> > 
> > Should we use __sock_create() for RDS or add another parameter
> > to __sock_create(..., kern=true/false, netref=true/false) and
> > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > 
> > Thanks!
> 
> Hi all,
> 
> Thank you for looking at this.  I've been doing a little investigation
> in the area to better understand the issue and this fix.  While I
> understand what this patch is trying to do here, I'd like to do a
> little more digging as to why 740ea3c4a0b2 didnt work for rds,

740ea3c4a0b2 works only for netns with its dedicated ehash, which
is unshare(CLONE_NEWNET)d with net.ipv4.tcp_child_ehash_entries != 0.

With the diff below, we can fix the issue, but as noted in the
description, this slows down netns dismantle where no reqsk, this
is true if the netns did not have kernel TCP sockets.

---8<---
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9e85f2a0bddd..0ecc7311dc6c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -398,10 +398,6 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
 			/* Even if tw_refcount == 1, we must clean up kernel reqsk */
 			inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
 		} else if (!purged_once) {
-			/* The last refcount is decremented in tcp_sk_exit_batch() */
-			if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
-				continue;
-
 			inet_twsk_purge(&tcp_hashinfo, family);
 			purged_once = true;
 		}
---8<---


> or what
> else rds may not be doing correctly that the other sockets are.

RDS has to clean up all sockets before netns are destroyed, but
it doesn't.  RDS per-netns TCP listener could have reqsk tied to
it, and reqsk timer could be fired after the netns and the listener
are freed.

Here we have 2 options to fix the issue.

  1) iterate ehash and purge reqsk during netns dismantle
  2) defer netns dismantle until reqsk timers are all fired

and 2) is preferred as the issue is specific to RDS.


> I'm
> not quite sure about setting the kern parameter to 0 for socket_create.
> While it seems like it would have a similar effect, this looks
> incorrect since this is not a user space socket.  

Probably kern parameter could be enum.

  SOCKET_USER = 0,
  SOCKET_KERN,
  SOCKET_KERN_NET_REFCNT,

If the enum is > 0, we can invoke LSM and mask it with
SOCKET_KERN_NET_REFCNT and pass down to it.
Eric Dumazet Feb. 26, 2024, 7:40 p.m. UTC | #6
On Mon, Feb 26, 2024 at 8:22 PM Allison Henderson
<allison.henderson@oracle.com> wrote:
>
> On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > From: Eric Dumazet <edumazet@google.com>
> > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > <kuniyu@amazon.com> wrote:
> > > > >
> > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > KASAN
> > > > > splat [1] and another ref tracker warning [1].
> > > > >
> > > > > syzkaller could not find a repro, but in the log, the only
> > > > > suspicious
> > > > > sequence was as follows:
> > > > >
> > > > >   18:26:22 executing program 1:
> > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > >   ...
> > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > @loopback}, 0x1c) (async)
> > > > >
> > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > RDS_TCP_PORT.
> > > > >
> > > > > So, the scenario would be:
> > > > >
> > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > >       rds_tcp_listen_init().
> > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > >   3. syz-executor exit()s immediately.
> > > > >   4. netns is dismantled.  [0]
> > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > reqsk.  [1]
> > > > >   6. listener is freed after RCU grace period.  [2]
> > > > >
> > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > safety
> > > > > until all reqsk timers are expired by holding the listener's
> > > > > refcount.
> > > > > However, this was not the case for kernel sockets.
> > > > >
> > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > but
> > > > > the issue still exists for the global ehash.
> > > > >
> > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > >
> > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > during
> > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > listener.
> > > > >
> > > > >
> > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > endpoints, one per netns.")
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > ---
> > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > --- a/net/rds/tcp_listen.c
> > > > > +++ b/net/rds/tcp_listen.c
> > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > net *net, bool isv6)
> > > > >                 goto out;
> > > > >         }
> > > > >
> > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > false);
> > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > +       sock_inuse_add(net, 1);
> > > > > +
> > > >
> > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > socket
> > > > to a user one ?
> > > >
> > > > Would using __sock_create() avoid this ?
> > >
> > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > >
> > > Probably we can use __sock_create() in net-next and see if someone
> > > complains.
> >
> > I noticed the patchwork status is Changes Requested.
> > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> >
> >
> > Should we use __sock_create() for RDS or add another parameter
> > to __sock_create(..., kern=true/false, netref=true/false) and
> > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> >
> > Thanks!
>
> Hi all,
>
> Thank you for looking at this.  I've been doing a little investigation
> in the area to better understand the issue and this fix.  While I
> understand what this patch is trying to do here, I'd like to do a
> little more digging as to why 740ea3c4a0b2 didnt work for rds, or what
> else rds may not be doing correctly that the other sockets are.  I'm
> not quite sure about setting the kern parameter to 0 for socket_create.
> While it seems like it would have a similar effect, this looks
> incorrect since this is not a user space socket.
>
> I'll do a little more diging myself too.  If you had another idea about
> adding parameters to __sock_create, I'd be happy to take a look.  Thank
> you!

I wonder if the following change would help ?

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9e85f2a0bddd4978b1bde6add1efc6aad351db8b..0ecc7311dc6ceedd8ada7b99b1441a562a6be4d6
100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -398,10 +398,6 @@ void tcp_twsk_purge(struct list_head
*net_exit_list, int family)
                        /* Even if tw_refcount == 1, we must clean up
kernel reqsk */

inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
                } else if (!purged_once) {
-                       /* The last refcount is decremented in
tcp_sk_exit_batch() */
-                       if
(refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
-                               continue;
-
                        inet_twsk_purge(&tcp_hashinfo, family);
                        purged_once = true;
                }
Kuniyuki Iwashima Feb. 26, 2024, 7:48 p.m. UTC | #7
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 26 Feb 2024 20:40:10 +0100
> On Mon, Feb 26, 2024 at 8:22 PM Allison Henderson
> <allison.henderson@oracle.com> wrote:
> >
> > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > > <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > > KASAN
> > > > > > splat [1] and another ref tracker warning [1].
> > > > > >
> > > > > > syzkaller could not find a repro, but in the log, the only
> > > > > > suspicious
> > > > > > sequence was as follows:
> > > > > >
> > > > > >   18:26:22 executing program 1:
> > > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > > >   ...
> > > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > > @loopback}, 0x1c) (async)
> > > > > >
> > > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > > RDS_TCP_PORT.
> > > > > >
> > > > > > So, the scenario would be:
> > > > > >
> > > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > > >       rds_tcp_listen_init().
> > > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > > >   3. syz-executor exit()s immediately.
> > > > > >   4. netns is dismantled.  [0]
> > > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > > reqsk.  [1]
> > > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > >
> > > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > > safety
> > > > > > until all reqsk timers are expired by holding the listener's
> > > > > > refcount.
> > > > > > However, this was not the case for kernel sockets.
> > > > > >
> > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > > but
> > > > > > the issue still exists for the global ehash.
> > > > > >
> > > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > >
> > > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > > during
> > > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > > listener.
> > > > > >
> > > > > >
> > > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > > endpoints, one per netns.")
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > ---
> > > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > > >  1 file changed, 5 insertions(+)
> > > > > >
> > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > > --- a/net/rds/tcp_listen.c
> > > > > > +++ b/net/rds/tcp_listen.c
> > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > > net *net, bool isv6)
> > > > > >                 goto out;
> > > > > >         }
> > > > > >
> > > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > > false);
> > > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > > +       sock_inuse_add(net, 1);
> > > > > > +
> > > > >
> > > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > > socket
> > > > > to a user one ?
> > > > >
> > > > > Would using __sock_create() avoid this ?
> > > >
> > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > >
> > > > Probably we can use __sock_create() in net-next and see if someone
> > > > complains.
> > >
> > > I noticed the patchwork status is Changes Requested.
> > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> > >
> > >
> > > Should we use __sock_create() for RDS or add another parameter
> > > to __sock_create(..., kern=true/false, netref=true/false) and
> > > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > >
> > > Thanks!
> >
> > Hi all,
> >
> > Thank you for looking at this.  I've been doing a little investigation
> > in the area to better understand the issue and this fix.  While I
> > understand what this patch is trying to do here, I'd like to do a
> > little more digging as to why 740ea3c4a0b2 didnt work for rds, or what
> > else rds may not be doing correctly that the other sockets are.  I'm
> > not quite sure about setting the kern parameter to 0 for socket_create.
> > While it seems like it would have a similar effect, this looks
> > incorrect since this is not a user space socket.
> >
> > I'll do a little more diging myself too.  If you had another idea about
> > adding parameters to __sock_create, I'd be happy to take a look.  Thank
> > you!
> 
> I wonder if the following change would help ?

Yes, it also fixes the issue. :)
https://lore.kernel.org/netdev/20240226193857.69672-1-kuniyu@amazon.com/

but it will trigger full ehash iteration for netns with no RDS usage
(and even without TCP).

So, I think __sock_create() or the netref conversion would be better.


> 
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 9e85f2a0bddd4978b1bde6add1efc6aad351db8b..0ecc7311dc6ceedd8ada7b99b1441a562a6be4d6
> 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -398,10 +398,6 @@ void tcp_twsk_purge(struct list_head
> *net_exit_list, int family)
>                         /* Even if tw_refcount == 1, we must clean up
> kernel reqsk */
> 
> inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
>                 } else if (!purged_once) {
> -                       /* The last refcount is decremented in
> tcp_sk_exit_batch() */
> -                       if
> (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> -                               continue;
> -
>                         inet_twsk_purge(&tcp_hashinfo, family);
>                         purged_once = true;
>                 }
Eric Dumazet Feb. 26, 2024, 7:52 p.m. UTC | #8
On Mon, Feb 26, 2024 at 8:39 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Allison Henderson <allison.henderson@oracle.com>
> Date: Mon, 26 Feb 2024 19:22:01 +0000
> > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > > <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > > KASAN
> > > > > > splat [1] and another ref tracker warning [1].
> > > > > >
> > > > > > syzkaller could not find a repro, but in the log, the only
> > > > > > suspicious
> > > > > > sequence was as follows:
> > > > > >
> > > > > >   18:26:22 executing program 1:
> > > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > > >   ...
> > > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > > @loopback}, 0x1c) (async)
> > > > > >
> > > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > > RDS_TCP_PORT.
> > > > > >
> > > > > > So, the scenario would be:
> > > > > >
> > > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > > >       rds_tcp_listen_init().
> > > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > > >   3. syz-executor exit()s immediately.
> > > > > >   4. netns is dismantled.  [0]
> > > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > > reqsk.  [1]
> > > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > >
> > > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > > safety
> > > > > > until all reqsk timers are expired by holding the listener's
> > > > > > refcount.
> > > > > > However, this was not the case for kernel sockets.
> > > > > >
> > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > > but
> > > > > > the issue still exists for the global ehash.
> > > > > >
> > > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > >
> > > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > > during
> > > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > > listener.
> > > > > >
> > > > > >
> > > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > > endpoints, one per netns.")
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > ---
> > > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > > >  1 file changed, 5 insertions(+)
> > > > > >
> > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > > --- a/net/rds/tcp_listen.c
> > > > > > +++ b/net/rds/tcp_listen.c
> > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > > net *net, bool isv6)
> > > > > >                 goto out;
> > > > > >         }
> > > > > >
> > > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > > false);
> > > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > > +       sock_inuse_add(net, 1);
> > > > > > +
> > > > >
> > > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > > socket
> > > > > to a user one ?
> > > > >
> > > > > Would using __sock_create() avoid this ?
> > > >
> > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > >
> > > > Probably we can use __sock_create() in net-next and see if someone
> > > > complains.
> > >
> > > I noticed the patchwork status is Changes Requested.
> > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> > >
> > >
> > > Should we use __sock_create() for RDS or add another parameter
> > > to __sock_create(..., kern=true/false, netref=true/false) and
> > > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > >
> > > Thanks!
> >
> > Hi all,
> >
> > Thank you for looking at this.  I've been doing a little investigation
> > in the area to better understand the issue and this fix.  While I
> > understand what this patch is trying to do here, I'd like to do a
> > little more digging as to why 740ea3c4a0b2 didnt work for rds,
>
> 740ea3c4a0b2 works only for netns with its dedicated ehash, which
> is unshare(CLONE_NEWNET)d with net.ipv4.tcp_child_ehash_entries != 0.
>
> With the diff below, we can fix the issue, but as noted in the
> description, this slows down netns dismantle where no reqsk, this
> is true if the netns did not have kernel TCP sockets.

BTW I note that inet_twsk_purge() probably would need this "goto
restart;" as well....

sk_nulls_for_each_rcu(sk, node, &head->chain)  is not a _safe variant...

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5befa4de5b2416281ad2795713a70d0fd847b0b2..a62031d85bcce00261193136dd72d9f57ffd34fc
100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -283,10 +283,11 @@ void inet_twsk_purge(struct inet_hashinfo
*hashinfo, int family)
                                 * freed.  Userspace listener and
reqsk never exist here.
                                 */
                                if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
-                                            hashinfo->pernet)) {
+
!refcount_read(&sock_net(sk)->ns.count))) {
                                        struct request_sock *req =
inet_reqsk(sk);


inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
+                                       goto restart;
                                }

                                continue;
Eric Dumazet Feb. 26, 2024, 8 p.m. UTC | #9
On Mon, Feb 26, 2024 at 8:48 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 26 Feb 2024 20:40:10 +0100
> > On Mon, Feb 26, 2024 at 8:22 PM Allison Henderson
> > <allison.henderson@oracle.com> wrote:
> > >
> > > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > > > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > > > <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > > > KASAN
> > > > > > > splat [1] and another ref tracker warning [1].
> > > > > > >
> > > > > > > syzkaller could not find a repro, but in the log, the only
> > > > > > > suspicious
> > > > > > > sequence was as follows:
> > > > > > >
> > > > > > >   18:26:22 executing program 1:
> > > > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > > > >   ...
> > > > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > > > @loopback}, 0x1c) (async)
> > > > > > >
> > > > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > > > RDS_TCP_PORT.
> > > > > > >
> > > > > > > So, the scenario would be:
> > > > > > >
> > > > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > > > >       rds_tcp_listen_init().
> > > > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > > > >   3. syz-executor exit()s immediately.
> > > > > > >   4. netns is dismantled.  [0]
> > > > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > > > reqsk.  [1]
> > > > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > > >
> > > > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > > > safety
> > > > > > > until all reqsk timers are expired by holding the listener's
> > > > > > > refcount.
> > > > > > > However, this was not the case for kernel sockets.
> > > > > > >
> > > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > > > but
> > > > > > > the issue still exists for the global ehash.
> > > > > > >
> > > > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > > >
> > > > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > > > during
> > > > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > > > listener.
> > > > > > >
> > > > > > >
> > > > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > > > endpoints, one per netns.")
> > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > ---
> > > > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > > > >  1 file changed, 5 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > > > --- a/net/rds/tcp_listen.c
> > > > > > > +++ b/net/rds/tcp_listen.c
> > > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > > > net *net, bool isv6)
> > > > > > >                 goto out;
> > > > > > >         }
> > > > > > >
> > > > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > > > false);
> > > > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > > > +       sock_inuse_add(net, 1);
> > > > > > > +
> > > > > >
> > > > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > > > socket
> > > > > > to a user one ?
> > > > > >
> > > > > > Would using __sock_create() avoid this ?
> > > > >
> > > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > > >
> > > > > Probably we can use __sock_create() in net-next and see if someone
> > > > > complains.
> > > >
> > > > I noticed the patchwork status is Changes Requested.
> > > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> > > >
> > > >
> > > > Should we use __sock_create() for RDS or add another parameter
> > > > to __sock_create(..., kern=true/false, netref=true/false) and
> > > > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > > >
> > > > Thanks!
> > >
> > > Hi all,
> > >
> > > Thank you for looking at this.  I've been doing a little investigation
> > > in the area to better understand the issue and this fix.  While I
> > > understand what this patch is trying to do here, I'd like to do a
> > > little more digging as to why 740ea3c4a0b2 didnt work for rds, or what
> > > else rds may not be doing correctly that the other sockets are.  I'm
> > > not quite sure about setting the kern parameter to 0 for socket_create.
> > > While it seems like it would have a similar effect, this looks
> > > incorrect since this is not a user space socket.
> > >
> > > I'll do a little more diging myself too.  If you had another idea about
> > > adding parameters to __sock_create, I'd be happy to take a look.  Thank
> > > you!
> >
> > I wonder if the following change would help ?
>
> Yes, it also fixes the issue. :)
> https://lore.kernel.org/netdev/20240226193857.69672-1-kuniyu@amazon.com/
>
> but it will trigger full ehash iteration for netns with no RDS usage
> (and even without TCP).
>
> So, I think __sock_create() or the netref conversion would be better.

Maybe, although you could add debugging/assertions to make sure no
TCP_NEW_SYN_RECV request are created on behalf of a kernel socket...

I am pretty sure other layers in the kernel use kernel socket TCP
listeners, SUNRPC for instance.
Kuniyuki Iwashima Feb. 26, 2024, 8 p.m. UTC | #10
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 26 Feb 2024 20:52:35 +0100
> On Mon, Feb 26, 2024 at 8:39 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Allison Henderson <allison.henderson@oracle.com>
> > Date: Mon, 26 Feb 2024 19:22:01 +0000
> > > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > > > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > > > <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > > > KASAN
> > > > > > > splat [1] and another ref tracker warning [1].
> > > > > > >
> > > > > > > syzkaller could not find a repro, but in the log, the only
> > > > > > > suspicious
> > > > > > > sequence was as follows:
> > > > > > >
> > > > > > >   18:26:22 executing program 1:
> > > > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > > > >   ...
> > > > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > > > @loopback}, 0x1c) (async)
> > > > > > >
> > > > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > > > RDS_TCP_PORT.
> > > > > > >
> > > > > > > So, the scenario would be:
> > > > > > >
> > > > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > > > >       rds_tcp_listen_init().
> > > > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > > > >   3. syz-executor exit()s immediately.
> > > > > > >   4. netns is dismantled.  [0]
> > > > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > > > reqsk.  [1]
> > > > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > > >
> > > > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > > > safety
> > > > > > > until all reqsk timers are expired by holding the listener's
> > > > > > > refcount.
> > > > > > > However, this was not the case for kernel sockets.
> > > > > > >
> > > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > > > but
> > > > > > > the issue still exists for the global ehash.
> > > > > > >
> > > > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > > >
> > > > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > > > during
> > > > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > > > listener.
> > > > > > >
> > > > > > >
> > > > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > > > endpoints, one per netns.")
> > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > ---
> > > > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > > > >  1 file changed, 5 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > > > --- a/net/rds/tcp_listen.c
> > > > > > > +++ b/net/rds/tcp_listen.c
> > > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > > > net *net, bool isv6)
> > > > > > >                 goto out;
> > > > > > >         }
> > > > > > >
> > > > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > > > false);
> > > > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > > > +       sock_inuse_add(net, 1);
> > > > > > > +
> > > > > >
> > > > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > > > socket
> > > > > > to a user one ?
> > > > > >
> > > > > > Would using __sock_create() avoid this ?
> > > > >
> > > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > > >
> > > > > Probably we can use __sock_create() in net-next and see if someone
> > > > > complains.
> > > >
> > > > I noticed the patchwork status is Changes Requested.
> > > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> > > >
> > > >
> > > > Should we use __sock_create() for RDS or add another parameter
> > > > to __sock_create(..., kern=true/false, netref=true/false) and
> > > > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > > >
> > > > Thanks!
> > >
> > > Hi all,
> > >
> > > Thank you for looking at this.  I've been doing a little investigation
> > > in the area to better understand the issue and this fix.  While I
> > > understand what this patch is trying to do here, I'd like to do a
> > > little more digging as to why 740ea3c4a0b2 didnt work for rds,
> >
> > 740ea3c4a0b2 works only for netns with its dedicated ehash, which
> > is unshare(CLONE_NEWNET)d with net.ipv4.tcp_child_ehash_entries != 0.
> >
> > With the diff below, we can fix the issue, but as noted in the
> > description, this slows down netns dismantle where no reqsk, this
> > is true if the netns did not have kernel TCP sockets.
> 
> BTW I note that inet_twsk_purge() probably would need this "goto
> restart;" as well....
> 
> sk_nulls_for_each_rcu(sk, node, &head->chain)  is not a _safe variant...

Ah exactly!  Will you post it or I can include it as the first patch.

The following patch will remove it and the end result will be the
same though.


> 
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 5befa4de5b2416281ad2795713a70d0fd847b0b2..a62031d85bcce00261193136dd72d9f57ffd34fc
> 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -283,10 +283,11 @@ void inet_twsk_purge(struct inet_hashinfo
> *hashinfo, int family)
>                                  * freed.  Userspace listener and
> reqsk never exist here.
>                                  */
>                                 if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> -                                            hashinfo->pernet)) {
> +
> !refcount_read(&sock_net(sk)->ns.count))) {
>                                         struct request_sock *req =
> inet_reqsk(sk);
> 
> 
> inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> +                                       goto restart;
>                                 }
> 
>                                 continue;
Kuniyuki Iwashima Feb. 26, 2024, 8:05 p.m. UTC | #11
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 26 Feb 2024 21:00:07 +0100
> On Mon, Feb 26, 2024 at 8:48 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 26 Feb 2024 20:40:10 +0100
> > > On Mon, Feb 26, 2024 at 8:22 PM Allison Henderson
> > > <allison.henderson@oracle.com> wrote:
> > > >
> > > > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote:
> > > > > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > Date: Fri, 23 Feb 2024 10:28:32 -0800
> > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > Date: Fri, 23 Feb 2024 19:09:27 +0100
> > > > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima
> > > > > > > <kuniyu@amazon.com> wrote:
> > > > > > > >
> > > > > > > > syzkaller reported a warning of netns tracker [0] followed by
> > > > > > > > KASAN
> > > > > > > > splat [1] and another ref tracker warning [1].
> > > > > > > >
> > > > > > > > syzkaller could not find a repro, but in the log, the only
> > > > > > > > suspicious
> > > > > > > > sequence was as follows:
> > > > > > > >
> > > > > > > >   18:26:22 executing program 1:
> > > > > > > >   r0 = socket$inet6_mptcp(0xa, 0x1, 0x106)
> > > > > > > >   ...
> > > > > > > >   connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0,
> > > > > > > > @loopback}, 0x1c) (async)
> > > > > > > >
> > > > > > > > The notable thing here is 0x4001 in connect(), which is
> > > > > > > > RDS_TCP_PORT.
> > > > > > > >
> > > > > > > > So, the scenario would be:
> > > > > > > >
> > > > > > > >   1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
> > > > > > > >       rds_tcp_listen_init().
> > > > > > > >   2. syz-executor connect()s to it and creates a reqsk.
> > > > > > > >   3. syz-executor exit()s immediately.
> > > > > > > >   4. netns is dismantled.  [0]
> > > > > > > >   5. reqsk timer is fired, and UAF happens while freeing
> > > > > > > > reqsk.  [1]
> > > > > > > >   6. listener is freed after RCU grace period.  [2]
> > > > > > > >
> > > > > > > > Basically, reqsk assumes that the listener guarantees netns
> > > > > > > > safety
> > > > > > > > until all reqsk timers are expired by holding the listener's
> > > > > > > > refcount.
> > > > > > > > However, this was not the case for kernel sockets.
> > > > > > > >
> > > > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> > > > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash,
> > > > > > > > but
> > > > > > > > the issue still exists for the global ehash.
> > > > > > > >
> > > > > > > > We can apply the same fix, but this issue is specific to RDS.
> > > > > > > >
> > > > > > > > Instead of iterating potentially large ehash and purging reqsk
> > > > > > > > during
> > > > > > > > netns dismantle, let's hold netns refcount for the kernel TCP
> > > > > > > > listener.
> > > > > > > >
> > > > > > > >
> > > > > > > > Reported-by: syzkaller <syzkaller@googlegroups.com>
> > > > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen
> > > > > > > > endpoints, one per netns.")
> > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > > > > ---
> > > > > > > >  net/rds/tcp_listen.c | 5 +++++
> > > > > > > >  1 file changed, 5 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> > > > > > > > index 05008ce5c421..4f7863932df7 100644
> > > > > > > > --- a/net/rds/tcp_listen.c
> > > > > > > > +++ b/net/rds/tcp_listen.c
> > > > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct
> > > > > > > > net *net, bool isv6)
> > > > > > > >                 goto out;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > +       __netns_tracker_free(net, &sock->sk->ns_tracker,
> > > > > > > > false);
> > > > > > > > +       sock->sk->sk_net_refcnt = 1;
> > > > > > > > +       get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
> > > > > > > > +       sock_inuse_add(net, 1);
> > > > > > > > +
> > > > > > >
> > > > > > > Why using sock_create_kern() then later 'convert' this kernel
> > > > > > > socket
> > > > > > > to a user one ?
> > > > > > >
> > > > > > > Would using __sock_create() avoid this ?
> > > > > >
> > > > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks.
> > > > > >
> > > > > > Probably we can use __sock_create() in net-next and see if someone
> > > > > > complains.
> > > > >
> > > > > I noticed the patchwork status is Changes Requested.
> > > > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$
> > > > >
> > > > >
> > > > > Should we use __sock_create() for RDS or add another parameter
> > > > > to __sock_create(..., kern=true/false, netref=true/false) and
> > > > > fix other similar uses (MPTCP, SMC, Netlink) altogether ?
> > > > >
> > > > > Thanks!
> > > >
> > > > Hi all,
> > > >
> > > > Thank you for looking at this.  I've been doing a little investigation
> > > > in the area to better understand the issue and this fix.  While I
> > > > understand what this patch is trying to do here, I'd like to do a
> > > > little more digging as to why 740ea3c4a0b2 didnt work for rds, or what
> > > > else rds may not be doing correctly that the other sockets are.  I'm
> > > > not quite sure about setting the kern parameter to 0 for socket_create.
> > > > While it seems like it would have a similar effect, this looks
> > > > incorrect since this is not a user space socket.
> > > >
> > > > I'll do a little more diging myself too.  If you had another idea about
> > > > adding parameters to __sock_create, I'd be happy to take a look.  Thank
> > > > you!
> > >
> > > I wonder if the following change would help ?
> >
> > Yes, it also fixes the issue. :)
> > https://lore.kernel.org/netdev/20240226193857.69672-1-kuniyu@amazon.com/
> >
> > but it will trigger full ehash iteration for netns with no RDS usage
> > (and even without TCP).
> >
> > So, I think __sock_create() or the netref conversion would be better.
> 
> Maybe, although you could add debugging/assertions to make sure no
> TCP_NEW_SYN_RECV request are created on behalf of a kernel socket...
> 
> I am pretty sure other layers in the kernel use kernel socket TCP
> listeners, SUNRPC for instance.

That sounds nice!

I'll include this diff in v2.

---8<--
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df7b13f0e5e0..d949501d13dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6972,6 +6972,8 @@ struct request_sock *inet_reqsk_alloc(const struct request_sock_ops *ops,
 	if (req) {
 		struct inet_request_sock *ireq = inet_rsk(req);
 
+		DEBUG_NET_WARN_ON_ONCE(!sk_listener->sk_net_refcnt);
+
 		ireq->ireq_opt = NULL;
 #if IS_ENABLED(CONFIG_IPV6)
 		ireq->pktopts = NULL;
---8<---
diff mbox series

Patch

diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 05008ce5c421..4f7863932df7 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -282,6 +282,11 @@  struct socket *rds_tcp_listen_init(struct net *net, bool isv6)
 		goto out;
 	}
 
+	__netns_tracker_free(net, &sock->sk->ns_tracker, false);
+	sock->sk->sk_net_refcnt = 1;
+	get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
+	sock_inuse_add(net, 1);
+
 	sock->sk->sk_reuse = SK_CAN_REUSE;
 	tcp_sock_set_nodelay(sock->sk);