Message ID | a5fb1fc4-2284-3359-f6a0-e4e390239d7b@I-love.SAKURA.ne.jp (mailing list archive) |
---|---|
State | Accepted |
Commit | 3a58f13a881ed351198ffab4cf9953cf19d2ab3a |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2] net: rds: acquire refcount on TCP sockets | expand |
> On 2 May 2022, at 03:40, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > --- > Changes in v2: > Add Fixes: tag. > Move to inside lock_sock() section. > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > to RDS") was added to 2.6.32. > > net/rds/tcp.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 5327d130c4b5..2f638f8b7b1e 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); > + /* TCP timer functions might access net namespace even after > + * a process which created this net namespace terminated. > + */ > + if (!sk->sk_net_refcnt) { > + sk->sk_net_refcnt = 1; > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); Don't you need a corresponding put_net_track()? Thxs, Håkon > + sock_inuse_add(net, 1); > + } > if (rtn->sndbuf_size > 0) { > sk->sk_sndbuf = rtn->sndbuf_size; > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > -- > 2.34.1 > >
On 2022/05/02 23:12, Haakon Bugge wrote: >> + /* TCP timer functions might access net namespace even after >> + * a process which created this net namespace terminated. >> + */ >> + if (!sk->sk_net_refcnt) { >> + sk->sk_net_refcnt = 1; >> + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > Don't you need a corresponding put_net_track()? __sk_free() and __sk_destruct() will do if sk->sk_net_refcnt is set. > >> + sock_inuse_add(net, 1); >> + }
Hello, On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > --- > Changes in v2: > Add Fixes: tag. > Move to inside lock_sock() section. > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > to RDS") was added to 2.6.32. > > net/rds/tcp.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > index 5327d130c4b5..2f638f8b7b1e 100644 > --- a/net/rds/tcp.c > +++ b/net/rds/tcp.c > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > tcp_sock_set_nodelay(sock->sk); > lock_sock(sk); > + /* TCP timer functions might access net namespace even after > + * a process which created this net namespace terminated. > + */ > + if (!sk->sk_net_refcnt) { > + sk->sk_net_refcnt = 1; > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > + sock_inuse_add(net, 1); > + } > if (rtn->sndbuf_size > 0) { > sk->sk_sndbuf = rtn->sndbuf_size; > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; This looks equivalent to the fix presented here: https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ but the latter looks a more generic solution. @Tetsuo could you please test the above in your setup? Thanks! Paolo
On 2022/05/03 18:02, Paolo Abeni wrote: > This looks equivalent to the fix presented here: Not equivalent. > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? I already tested that fix, and the result was https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ .
On Tue, 2022-05-03 at 18:56 +0900, Tetsuo Handa wrote: > On 2022/05/03 18:02, Paolo Abeni wrote: > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > but the latter looks a more generic solution. @Tetsuo could you please > > test the above in your setup? > > I already tested that fix, and the result was > https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ . Thanks, I somewhat missed that reply. Paolo
Hello: This patch was applied to netdev/net.git (master) by Paolo Abeni <pabeni@redhat.com>: On Mon, 2 May 2022 10:40:18 +0900 you wrote: > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > for TCP socket used by RDS is accessing sock_net() without acquiring a > refcount on net namespace. Since TCP's retransmission can happen after > a process which created net namespace terminated, we need to explicitly > acquire a refcount. > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > [...] Here is the summary with links: - [v2] net: rds: acquire refcount on TCP sockets https://git.kernel.org/netdev/net/c/3a58f13a881e You are awesome, thank you!
From: Paolo Abeni > Sent: 03 May 2022 10:03 > > Hello, > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > --- > > Changes in v2: > > Add Fixes: tag. > > Move to inside lock_sock() section. > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > to RDS") was added to 2.6.32. > > > > net/rds/tcp.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > index 5327d130c4b5..2f638f8b7b1e 100644 > > --- a/net/rds/tcp.c > > +++ b/net/rds/tcp.c > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > tcp_sock_set_nodelay(sock->sk); > > lock_sock(sk); > > + /* TCP timer functions might access net namespace even after > > + * a process which created this net namespace terminated. > > + */ > > + if (!sk->sk_net_refcnt) { > > + sk->sk_net_refcnt = 1; > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > + sock_inuse_add(net, 1); > > + } > > if (rtn->sndbuf_size > 0) { > > sk->sk_sndbuf = rtn->sndbuf_size; > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > This looks equivalent to the fix presented here: > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? Wouldn't a more generic solution be to add a flag to sock_create_kern() so that it acquires a reference to the namespace? This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. I've a driver that uses __sock_create() in order to get that reference. I'm pretty sure the extra 'security' check will never fail. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote: > > From: Paolo Abeni > > Sent: 03 May 2022 10:03 > > > > Hello, > > > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > > kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > --- > > > Changes in v2: > > > Add Fixes: tag. > > > Move to inside lock_sock() section. > > > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > > to RDS") was added to 2.6.32. > > > > > > net/rds/tcp.c | 8 ++++++++ > > > 1 file changed, 8 insertions(+) > > > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > > index 5327d130c4b5..2f638f8b7b1e 100644 > > > --- a/net/rds/tcp.c > > > +++ b/net/rds/tcp.c > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > > > tcp_sock_set_nodelay(sock->sk); > > > lock_sock(sk); > > > + /* TCP timer functions might access net namespace even after > > > + * a process which created this net namespace terminated. > > > + */ > > > + if (!sk->sk_net_refcnt) { > > > + sk->sk_net_refcnt = 1; > > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > > + sock_inuse_add(net, 1); > > > + } > > > if (rtn->sndbuf_size > 0) { > > > sk->sk_sndbuf = rtn->sndbuf_size; > > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > > > This looks equivalent to the fix presented here: > > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > but the latter looks a more generic solution. @Tetsuo could you please > > test the above in your setup? > > Wouldn't a more generic solution be to add a flag to sock_create_kern() > so that it acquires a reference to the namespace? > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. > > I've a driver that uses __sock_create() in order to get that reference. > I'm pretty sure the extra 'security' check will never fail. > This would be silly really. Definition of a 'kernel socket' is that it does not hold a reference to the namespace. (otherwise a netns could not be destroyed by user space) A kernel layer using kernel sockets needs to properly dismantle them when a namespace is destroyed. In the RDS case, the socket was a user socket, or RDS lacked proper tracking of all the sockets so that they can be dismantled properly.
On Tue, May 3, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote: > > Hello, > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > --- > > Changes in v2: > > Add Fixes: tag. > > Move to inside lock_sock() section. > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > to RDS") was added to 2.6.32. > > > > net/rds/tcp.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > index 5327d130c4b5..2f638f8b7b1e 100644 > > --- a/net/rds/tcp.c > > +++ b/net/rds/tcp.c > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > tcp_sock_set_nodelay(sock->sk); > > lock_sock(sk); > > + /* TCP timer functions might access net namespace even after > > + * a process which created this net namespace terminated. > > + */ > > + if (!sk->sk_net_refcnt) { > > + sk->sk_net_refcnt = 1; > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > + sock_inuse_add(net, 1); > > + } > > if (rtn->sndbuf_size > 0) { > > sk->sk_sndbuf = rtn->sndbuf_size; > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > This looks equivalent to the fix presented here: > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ I think this is still needed for layers (NFS ?) that dismantle their TCP sockets whenever a netns is dismantled. But RDS case was different, only the listener is a kernel socket. > > but the latter looks a more generic solution. @Tetsuo could you please > test the above in your setup? > > Thanks! > > Paolo >
On 2022/05/03 22:45, Eric Dumazet wrote: >> This looks equivalent to the fix presented here: >> >> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ I retested the fix above using unshare -n sh -c ' ip link set lo up iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP telnet 127.0.0.1 16385 dmesg -c netstat -tanpe' < /dev/null as a test case, but it seems racy; sometimes timer function is called again and crashes. [ 426.086565][ C2] general protection fault, probably for non-canonical address 0x6b6af3ebcc3b6bc3: 0000 [#1] PREEMPT SMP KASAN [ 426.096339][ C2] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.18.0-rc5-dirty #807 [ 426.103769][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 426.111851][ C2] RIP: 0010:__tcp_transmit_skb+0xe72/0x1b80 [ 426.117512][ C2] Code: e8 b3 ea dc fd 48 8d 7d 30 45 0f b7 77 30 e8 95 ec dc fd 48 8b 5d 30 48 8d bb b8 02 00 00 e8 85 ec dc fd 48 8b 83 b8 02 00 00 <65> 4c 01 70 58 e9 67 fd ff ff e8 ef 56 ac fd 48 8d bd d0 09 00 00 [ 426.124692][ C2] RSP: 0018:ffff888060d09ac8 EFLAGS: 00010246 [ 426.126845][ C2] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8880145c8000 RCX: ffffffff838cc28b [ 426.129616][ C2] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880145c82b8 [ 426.132374][ C2] RBP: ffff8880129f8000 R08: 0000000000000000 R09: 0000000000000007 [ 426.135077][ C2] R10: ffffffff838cbfd4 R11: 0000000000000001 R12: ffff8880129f8760 [ 426.137793][ C2] R13: ffff88800f6e0118 R14: 0000000000000001 R15: ffff88800f6e00e8 [ 426.140489][ C2] FS: 0000000000000000(0000) GS:ffff888060d00000(0000) knlGS:0000000000000000 [ 426.143525][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 426.145792][ C2] CR2: 000055b5bb0adabc CR3: 000000000e003000 CR4: 00000000000506e0 [ 426.148509][ C2] Call Trace: [ 426.149442][ C2] <IRQ> [ 426.150183][ C2] ? __tcp_select_window+0x710/0x710 [ 426.151457][ C2] ? __sanitizer_cov_trace_cmp4+0x1c/0x70 [ 426.153007][ C2] ? tcp_current_mss+0x165/0x280 [ 426.154245][ C2] ? tcp_trim_head+0x300/0x300 [ 426.155396][ C2] ? find_held_lock+0x85/0xa0 [ 426.156734][ C2] ? mark_held_locks+0x65/0x90 [ 426.157967][ C2] tcp_write_wakeup+0x2e2/0x340 [ 426.159149][ C2] tcp_send_probe0+0x2a/0x2c0 [ 426.160368][ C2] tcp_write_timer_handler+0x5cb/0x670 [ 426.161740][ C2] tcp_write_timer+0x86/0x250 [ 426.162896][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.164285][ C2] call_timer_fn+0x15d/0x5f0 [ 426.165481][ C2] ? add_timer_on+0x2e0/0x2e0 [ 426.166667][ C2] ? lock_downgrade+0x3c0/0x3c0 [ 426.167921][ C2] ? mark_held_locks+0x24/0x90 [ 426.169263][ C2] ? _raw_spin_unlock_irq+0x1f/0x40 [ 426.170564][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.171920][ C2] __run_timers.part.0+0x523/0x740 [ 426.173181][ C2] ? call_timer_fn+0x5f0/0x5f0 [ 426.174321][ C2] ? pvclock_clocksource_read+0xdc/0x1a0 [ 426.175655][ C2] run_timer_softirq+0x66/0xe0 [ 426.176825][ C2] __do_softirq+0x1c2/0x670 [ 426.177944][ C2] __irq_exit_rcu+0xf8/0x140 [ 426.179120][ C2] irq_exit_rcu+0x5/0x20 [ 426.180150][ C2] sysvec_apic_timer_interrupt+0x8e/0xc0 [ 426.181486][ C2] </IRQ> [ 426.182180][ C2] <TASK> [ 426.182845][ C2] asm_sysvec_apic_timer_interrupt+0x12/0x20 > > I think this is still needed for layers (NFS ?) that dismantle their > TCP sockets whenever a netns > is dismantled. But RDS case was different, only the listener is a kernel socket. We can't apply the fix above. I think that the fundamental problem is that we use net->ns.count for both "avoiding use-after-free" purpose and "allowing dismantle from user event" purpose. Why not to use separated counters?
From: Eric Dumazet > Sent: 03 May 2022 14:43 > > On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote: > > > > From: Paolo Abeni > > > Sent: 03 May 2022 10:03 > > > > > > Hello, > > > > > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote: > > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > > refcount on net namespace. Since TCP's retransmission can happen after > > > > a process which created net namespace terminated, we need to explicitly > > > > acquire a refcount. > > > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel > sockets.") > > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a > > > kernel socket") > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > --- > > > > Changes in v2: > > > > Add Fixes: tag. > > > > Move to inside lock_sock() section. > > > > > > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag, > > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport > > > > to RDS") was added to 2.6.32. > > > > > > > > net/rds/tcp.c | 8 ++++++++ > > > > 1 file changed, 8 insertions(+) > > > > > > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c > > > > index 5327d130c4b5..2f638f8b7b1e 100644 > > > > --- a/net/rds/tcp.c > > > > +++ b/net/rds/tcp.c > > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) > > > > > > > > tcp_sock_set_nodelay(sock->sk); > > > > lock_sock(sk); > > > > + /* TCP timer functions might access net namespace even after > > > > + * a process which created this net namespace terminated. > > > > + */ > > > > + if (!sk->sk_net_refcnt) { > > > > + sk->sk_net_refcnt = 1; > > > > + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > > > > + sock_inuse_add(net, 1); > > > > + } > > > > if (rtn->sndbuf_size > 0) { > > > > sk->sk_sndbuf = rtn->sndbuf_size; > > > > sk->sk_userlocks |= SOCK_SNDBUF_LOCK; > > > > > > This looks equivalent to the fix presented here: > > > > > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > > > but the latter looks a more generic solution. @Tetsuo could you please > > > test the above in your setup? > > > > Wouldn't a more generic solution be to add a flag to sock_create_kern() > > so that it acquires a reference to the namespace? > > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK. > > > > I've a driver that uses __sock_create() in order to get that reference. > > I'm pretty sure the extra 'security' check will never fail. > > > > This would be silly really. > > Definition of a 'kernel socket' is that it does not hold a reference > to the namespace. > (otherwise a netns could not be destroyed by user space) > > A kernel layer using kernel sockets needs to properly dismantle them > when a namespace is destroyed. I think it depends on why the driver is using a socket. If the driver is a 'user' of a TCP connection that happens to be is a kernel driver then holding the a reference to the namespace is no different to an application socket holding a reference. An example might be nfs/tcp - you need to unmount the filesystem before you can delete the namespace. OTOH if part of a protocol stack is using a socket for internal calls (I think I've seen routing sockets used that way) then the presence of the socket probably shouldn't stop the namespace being deleted. Listening sockets are a slight problem - probably for userspace as well. It would be nicer to be able to get TCP (etc) to error out listening sockets if they are the only thing stopping a namespace being deleted. > In the RDS case, the socket was a user socket, or RDS lacked proper > tracking of all the sockets > so that they can be dismantled properly. I think they probably are sockets created in order act on requests from applications. I think they should have the same effect on namespaces as a direct user socket - you can't delete the socket while the connection is active. Kill all the relevant processes, tell the driver to stop, and you can delete the namespace. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > Hello: > > This patch was applied to netdev/net.git (master) > by Paolo Abeni <pabeni@redhat.com>: > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > refcount on net namespace. Since TCP's retransmission can happen after > > a process which created net namespace terminated, we need to explicitly > > acquire a refcount. > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > [...] > > Here is the summary with links: > - [v2] net: rds: acquire refcount on TCP sockets > https://git.kernel.org/netdev/net/c/3a58f13a881e > > You are awesome, thank you! > -- > Deet-doot-dot, I am a bot. > https://korg.docs.kernel.org/patchwork/pwbot.html > > I think we merged this patch too soon. My question is : What prevents rds_tcp_conn_path_connect(), and thus rds_tcp_tune() to be called after the netns refcount already reached 0 ? I guess we can wait for next syzbot report, but I think that get_net() should be replaced by maybe_get_net()
On Tue, May 3, 2022 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > Hello: > > > > This patch was applied to netdev/net.git (master) > > by Paolo Abeni <pabeni@redhat.com>: > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > [...] > > > > Here is the summary with links: > > - [v2] net: rds: acquire refcount on TCP sockets > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > You are awesome, thank you! > > -- > > Deet-doot-dot, I am a bot. > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > I think we merged this patch too soon. > > My question is : What prevents rds_tcp_conn_path_connect(), and thus > rds_tcp_tune() to be called > after the netns refcount already reached 0 ? > > I guess we can wait for next syzbot report, but I think that get_net() > should be replaced > by maybe_get_net() Yes, syzbot was fast to trigger this exact issue: HEAD commit: 3a58f13a net: rds: acquire refcount on TCP sockets git tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 6934 at lib/refcount.c:25 refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25 Modules linked in: CPU: 1 PID: 6934 Comm: kworker/u4:17 Not tainted 5.18.0-rc4-syzkaller-00209-g3a58f13a881e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: krdsd rds_connect_worker RIP: 0010:refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25 Code: 09 31 ff 89 de e8 f7 b9 81 fd 84 db 0f 85 36 ff ff ff e8 0a b6 81 fd 48 c7 c7 40 eb 26 8a c6 05 75 1f ac 09 01 e8 56 75 2d 05 <0f> 0b e9 17 ff ff ff e8 eb b5 81 fd 0f b6 1d 5a 1f ac 09 31 ff 89 RSP: 0018:ffffc9000b5e7b80 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88807a948000 RSI: ffffffff81600c08 RDI: fffff520016bcf62 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815fb5de R11: 0000000000000000 R12: ffff888021e69b80 R13: ffff88805bc82a00 R14: ffff888021e69ccc R15: ffff8880741a2900 FS: 0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2cb5c000 CR3: 000000005688f000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_add include/linux/refcount.h:199 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] get_net include/net/net_namespace.h:248 [inline] get_net_track include/net/net_namespace.h:334 [inline] rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298 </TASK>
On 2022/05/04 7:37, Eric Dumazet wrote: >> I think we merged this patch too soon. >> >> My question is : What prevents rds_tcp_conn_path_connect(), and thus >> rds_tcp_tune() to be called >> after the netns refcount already reached 0 ? >> >> I guess we can wait for next syzbot report, but I think that get_net() >> should be replaced >> by maybe_get_net() > > Yes, syzbot was fast to trigger this exact issue: Does maybe_get_net() help? Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). rds_tcp_conn_path_connect() { sock_create_kern(net = rds_conn_net(conn)) { __sock_create(net = rds_conn_net(conn), kern = 1) { err = pf->create(net = rds_conn_net(conn), kern = 1) { // pf->create is either inet_create or inet6_create sk_alloc(net = rds_conn_net(conn), kern = 1) { sk->sk_net_refcnt = kern ? 0 : 1; if (likely(sk->sk_net_refcnt)) { get_net_track(net, &sk->ns_tracker, priority); sock_inuse_add(net, 1); } sock_net_set(sk, net); } } } } rds_tcp_tune() { if (!sk->sk_net_refcnt) { sk->sk_net_refcnt = 1; get_net_track(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } } } "struct rds_connection" needs to hold a ref in order to safely allow rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection" must have a ref. Situation where we are protected by maybe_get_net() is quite limited if long-lived object is not holding a ref. Hmm, can we simply use &init_net instead of rds_conn_net(conn) ?
On Tue, May 3, 2022 at 6:04 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > On 2022/05/04 7:37, Eric Dumazet wrote: > >> I think we merged this patch too soon. > >> > >> My question is : What prevents rds_tcp_conn_path_connect(), and thus > >> rds_tcp_tune() to be called > >> after the netns refcount already reached 0 ? > >> > >> I guess we can wait for next syzbot report, but I think that get_net() > >> should be replaced > >> by maybe_get_net() > > > > Yes, syzbot was fast to trigger this exact issue: > > Does maybe_get_net() help? > > Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically > possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d > if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). Nope. RDS has an exit() handler called from cleanup_net() (struct pernet_operations)->exit() or exit_batch() : rds_tcp_exit_net() (rds_tcp_kill_sock()) This exit() handler _has_ to remove all known listeners, and definitely cancel work queues (synchronous operation) before the actual "struct net" free can happen later. > > rds_tcp_conn_path_connect() { > sock_create_kern(net = rds_conn_net(conn)) { > __sock_create(net = rds_conn_net(conn), kern = 1) { > err = pf->create(net = rds_conn_net(conn), kern = 1) { > // pf->create is either inet_create or inet6_create > sk_alloc(net = rds_conn_net(conn), kern = 1) { > sk->sk_net_refcnt = kern ? 0 : 1; > if (likely(sk->sk_net_refcnt)) { > get_net_track(net, &sk->ns_tracker, priority); > sock_inuse_add(net, 1); > } > sock_net_set(sk, net); > } > } > } > } > rds_tcp_tune() { > if (!sk->sk_net_refcnt) { > sk->sk_net_refcnt = 1; > get_net_track(net, &sk->ns_tracker, GFP_KERNEL); > sock_inuse_add(net, 1); > } > } > } > > "struct rds_connection" needs to hold a ref in order to safely allow > rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless > to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection" > must have a ref. Situation where we are protected by maybe_get_net() is > quite limited if long-lived object is not holding a ref. > > Hmm, can we simply use &init_net instead of rds_conn_net(conn) ? Only if you plan making RDS unavailable for non init netns.
On 2022/05/04 12:09, Eric Dumazet wrote: >> Does maybe_get_net() help? >> >> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically >> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d >> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set(). > > Nope. RDS has an exit() handler called from cleanup_net() > > (struct pernet_operations)->exit() or exit_batch() : > rds_tcp_exit_net() (rds_tcp_kill_sock()) Hmm, when put_net() called __put_net(), this "struct net" is chained to cleanup_list. When cleanup_net() is called via net_cleanup_work, rds_tcp_exit_net() is called from ops_exit_list(). Therefore, we can call maybe_get_net() until rds_tcp_exit_net() returns. That's good. > > This exit() handler _has_ to remove all known listeners, and > definitely cancel work queues (synchronous operation) > before the actual "struct net" free can happen later. But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w). But I can't see how rds_tcp_exit_net() synchronously cancels all works associated with "struct rds_conn_path". struct rds_conn_path { struct delayed_work cp_send_w; struct delayed_work cp_recv_w; struct delayed_work cp_conn_w; struct work_struct cp_down_w; } These works are queued to rds_wq, but flush_workqueue() waits for completion only if already queued. What if timer for queue_delayed_work() has not expired, or was about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient? Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed or cancelled, the fix would look like something below? net/rds/tcp.c | 11 ++++++++--- net/rds/tcp.h | 2 +- net/rds/tcp_connect.c | 5 ++++- net/rds/tcp_listen.c | 5 ++++- 4 files changed, 17 insertions(+), 6 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 2f638f8b7b1e..8e26bcf02044 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -487,11 +487,11 @@ struct rds_tcp_net { /* All module specific customizations to the RDS-TCP socket should be done in * rds_tcp_tune() and applied after socket creation. */ -void rds_tcp_tune(struct socket *sock) +bool rds_tcp_tune(struct socket *sock) { struct sock *sk = sock->sk; struct net *net = sock_net(sk); - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct rds_tcp_net *rtn; tcp_sock_set_nodelay(sock->sk); lock_sock(sk); @@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock) * a process which created this net namespace terminated. */ if (!sk->sk_net_refcnt) { + if (!maybe_get_net(net)) { + release_sock(sk); + return false; + } sk->sk_net_refcnt = 1; - get_net_track(net, &sk->ns_tracker, GFP_KERNEL); sock_inuse_add(net, 1); } + rtn = net_generic(net, rds_tcp_netid); if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; @@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock) sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); + return true; } static void rds_tcp_accept_worker(struct work_struct *work) diff --git a/net/rds/tcp.h b/net/rds/tcp.h index dc8d745d6857..f8b5930d7b34 100644 --- a/net/rds/tcp.h +++ b/net/rds/tcp.h @@ -49,7 +49,7 @@ struct rds_tcp_statistics { }; /* tcp.c */ -void rds_tcp_tune(struct socket *sock); +bool rds_tcp_tune(struct socket *sock); void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_restore_callbacks(struct socket *sock, diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 5461d77fff4f..f0c477c5d1db 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp) if (ret < 0) goto out; - rds_tcp_tune(sock); + if (!rds_tcp_tune(sock)) { + ret = -EINVAL; + goto out; + } if (isv6) { sin6.sin6_family = AF_INET6; diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 09cadd556d1e..7edf2e69d3fe 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock) __module_get(new_sock->ops->owner); rds_tcp_keepalive(new_sock); - rds_tcp_tune(new_sock); + if (!rds_tcp_tune(new_sock)) { + ret = -EINVAL; + goto out; + } inet = inet_sk(new_sock->sk);
On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote: > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > Hello: > > > > This patch was applied to netdev/net.git (master) > > by Paolo Abeni <pabeni@redhat.com>: > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > refcount on net namespace. Since TCP's retransmission can happen after > > > a process which created net namespace terminated, we need to explicitly > > > acquire a refcount. > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > [...] > > > > Here is the summary with links: > > - [v2] net: rds: acquire refcount on TCP sockets > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > You are awesome, thank you! > > -- > > Deet-doot-dot, I am a bot. > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > I think we merged this patch too soon. My fault. > My question is : What prevents rds_tcp_conn_path_connect(), and thus > rds_tcp_tune() to be called > after the netns refcount already reached 0 ? > > I guess we can wait for next syzbot report, but I think that get_net() > should be replaced > by maybe_get_net() > Should we revert this patch before the next pull request, if a suitable incremental fix is not available by then? It looks like the window of opportunity for the race is roughly the same? Thanks! Paolo
On Wed, May 4, 2022 at 6:09 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote: > > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote: > > > > > > Hello: > > > > > > This patch was applied to netdev/net.git (master) > > > by Paolo Abeni <pabeni@redhat.com>: > > > > > > On Mon, 2 May 2022 10:40:18 +0900 you wrote: > > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], > > > > for TCP socket used by RDS is accessing sock_net() without acquiring a > > > > refcount on net namespace. Since TCP's retransmission can happen after > > > > a process which created net namespace terminated, we need to explicitly > > > > acquire a refcount. > > > > > > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] > > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") > > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") > > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [v2] net: rds: acquire refcount on TCP sockets > > > https://git.kernel.org/netdev/net/c/3a58f13a881e > > > > > > You are awesome, thank you! > > > -- > > > Deet-doot-dot, I am a bot. > > > https://korg.docs.kernel.org/patchwork/pwbot.html > > > > > > > > > > I think we merged this patch too soon. > > My fault. > > > > My question is : What prevents rds_tcp_conn_path_connect(), and thus > > rds_tcp_tune() to be called > > after the netns refcount already reached 0 ? > > > > I guess we can wait for next syzbot report, but I think that get_net() > > should be replaced > > by maybe_get_net() > > > Should we revert this patch before the next pull request, if a suitable > incremental fix is not available by then? > > It looks like the window of opportunity for the race is roughly the > same? > No need to revert the patch, we certainly are in a better situation, as refcount_t helps here. We can refine the logic in a followup. Thanks.
On 2022/05/04 13:58, Tetsuo Handa wrote: > On 2022/05/04 12:09, Eric Dumazet wrote: >> This exit() handler _has_ to remove all known listeners, and >> definitely cancel work queues (synchronous operation) >> before the actual "struct net" free can happen later. > > But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from > rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that > rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and > rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w). > > But I can't see how rds_tcp_exit_net() synchronously cancels all works associated > with "struct rds_conn_path". > > struct rds_conn_path { > struct delayed_work cp_send_w; > struct delayed_work cp_recv_w; > struct delayed_work cp_conn_w; > struct work_struct cp_down_w; > } > > These works are queued to rds_wq, but flush_workqueue() waits for completion only > if already queued. What if timer for queue_delayed_work() has not expired, or was > about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient? rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 rds_tcp_conn_path_connect is referenced by "struct rds_transport rds_tcp_transport"->conn_path_connect. It is invoked by ret = conn->c_trans->conn_path_connect(cp) in rds_connect_worker(). rds_connect_worker is referenced by "struct rds_conn_path"->cp_conn_w via INIT_DELAYED_WORK(). queue_delayed_work(rds_wq, &cp->cp_conn_w, *) is called by rds_queue_reconnect() or rds_conn_path_connect_if_down(). If rds_conn_path_connect_if_down() were called from rds_tcp_accept_one_path() from rds_tcp_accept_one(), rds_tcp_tune() from rds_tcp_accept_one() was already called before rds_tcp_tune() from rds_tcp_conn_path_connect() is called. Since the addition on 0 was not reported at rds_tcp_tune() from rds_tcp_accept_one(), what Eric is reporting cannot be from rds_tcp_accept_one() from rds_tcp_accept_worker(). Despite rds_tcp_kill_sock() sets rtn->rds_tcp_listen_sock = NULL and waits for rds_tcp_accept_one() from rds_tcp_accept_worker() to complete using flush_workqueue(rds_wq), what Eric is reporting is different from what syzbot+694120e1002c117747ed was reporting. > > Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed > or cancelled, the fix would look like something below? I think it is OK to apply below diff in order to avoid addition on 0 problem, but it is not proven that kmem_cache_free() is not yet called. What should we do? > > net/rds/tcp.c | 11 ++++++++--- > net/rds/tcp.h | 2 +- > net/rds/tcp_connect.c | 5 ++++- > net/rds/tcp_listen.c | 5 ++++- > 4 files changed, 17 insertions(+), 6 deletions(-) >
diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 5327d130c4b5..2f638f8b7b1e 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock) tcp_sock_set_nodelay(sock->sk); lock_sock(sk); + /* TCP timer functions might access net namespace even after + * a process which created this net namespace terminated. + */ + if (!sk->sk_net_refcnt) { + sk->sk_net_refcnt = 1; + get_net_track(net, &sk->ns_tracker, GFP_KERNEL); + sock_inuse_add(net, 1); + } if (rtn->sndbuf_size > 0) { sk->sk_sndbuf = rtn->sndbuf_size; sk->sk_userlocks |= SOCK_SNDBUF_LOCK;