Message ID: 20240523050357.43941-1-kerneljasonxing@gmail.com (mailing list archive)
State: Rejected
Delegated to: Netdev Maintainers
Series: [v2,net] tcp: fix a race when purging the netns and allocating tw socket
On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> Syzbot[1] reported that a decrement of the reference count hits zero,
> leaking memory.
>
> Race condition:
>     CPU 0                               CPU 1
>     -----                               -----
>     inet_twsk_purge                     tcp_time_wait
>      inet_twsk_deschedule_put            __inet_twsk_schedule
>                                           mod_timer(tw_timer...
>       del_timer_sync
>       inet_twsk_kill
>        refcount_dec(tw_refcount) [1]
>                                          refcount_inc(tw_refcount) [2]
>
> The race happens because [1] decrements the refcount before [2]
> increments it.
>
> After we reorder the mod_timer() and refcount_inc() in the initialization
> phase, we can use the status of the timer as an indicator to test whether
> we want to destroy the tw socket in inet_twsk_purge() or postpone that to
> tw_timer_handler().
>
> After this patch is applied, we get four possible cases:
> 1) if we can see the armed timer during the initialization phase
>     CPU 0                               CPU 1
>     -----                               -----
>     inet_twsk_purge                     tcp_time_wait
>      inet_twsk_deschedule_put            __inet_twsk_schedule
>                                           refcount_inc(tw_refcount)
>                                           mod_timer(tw_timer...
>       test if the timer is queued
>       //timer is queued
>       del_timer_sync
>       inet_twsk_kill
>        refcount_dec(tw_refcount)
> Note: we finish it up in the purge process.
>
> 2) if we fail to see the armed timer during the initialization phase
>     CPU 0                               CPU 1
>     -----                               -----
>     inet_twsk_purge                     tcp_time_wait
>      inet_twsk_deschedule_put            __inet_twsk_schedule
>                                           refcount_inc(tw_refcount)
>       test if the timer is queued
>       //timer isn't queued
>       postpone
>                                           mod_timer(tw_timer...
> Note: later, in another context, the expired timer will finish up the
> tw socket.
>
> 3) if we're seeing a running timer after the initialization phase
>     CPU 0                 CPU 1                            CPU 2
>     -----                 -----                            -----
>                           tcp_time_wait
>                            __inet_twsk_schedule
>                             refcount_inc(tw_refcount)
>                             mod_timer(tw_timer...
>                           ...(around 1 min)...
>     inet_twsk_purge
>      inet_twsk_deschedule_put
>       test if the timer is queued
>       // timer is running
>       skip tw_timer_handler
> Note: CPU 2 is destroying the timewait socket
>
> 4) if we're seeing a pending timer after the initialization phase
>     CPU 0                               CPU 1
>     -----                               -----
>                                         tcp_time_wait
>                                          __inet_twsk_schedule
>                                           refcount_inc(tw_refcount)
>                                           mod_timer(tw_timer...
>     ...(< 1 min)...
>     inet_twsk_purge
>      inet_twsk_deschedule_put
>       test if the timer is queued
>       // timer is queued
>       del_timer_sync
>       inet_twsk_kill
>
> Therefore, by making sure that we either call inet_twsk_purge() or
> call tw_timer_handler() to destroy the timewait socket, we can handle
> all the cases above.
>
> [1]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> Modules linked in:
> CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  __refcount_dec include/linux/refcount.h:336 [inline]
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
>  ksys_unshare+0x419/0x970 kernel/fork.c:3323
>  __do_sys_unshare kernel/fork.c:3394 [inline]
>  __se_sys_unshare kernel/fork.c:3392 [inline]
>  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f56d7c7cee9
>
> Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> v2
> Link: https://lore.kernel.org/all/20240521144930.23805-1-kerneljasonxing@gmail.com/
> 1. Use the timer as a flag to test whether we can safely destroy the
> timewait socket, on top of the patch Eric wrote.
> 2. Change the title and add more explanation to the body message.
> ---
>  net/ipv4/inet_timewait_sock.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e28075f0006e..b890d1c280a1 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>
>  		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
>  						     LINUX_MIB_TIMEWAITED);
> -		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>  		refcount_inc(&tw->tw_dr->tw_refcount);
> +		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>  	} else {
>  		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>  	}
> @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
>  		rcu_read_unlock();
>  		local_bh_disable();
>  		if (state == TCP_TIME_WAIT) {
> -			inet_twsk_deschedule_put(inet_twsk(sk));
> +			struct inet_timewait_sock *tw = inet_twsk(sk);
> +
> +			/* If the timer is armed, we can safely destroy
> +			 * it, or else we postpone the process of destruction
> +			 * to tw_timer_handler().
> +			 */
> +			if (timer_pending(&tw->tw_timer))
> +				inet_twsk_deschedule_put(tw);

This patch is not needed, and a timer_pending() test would be racy anyway.

As already explained, del_timer_sync() takes care of this with proper locking:

inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
	if (del_timer_sync(&tw->tw_timer))
		inet_twsk_kill(tw);
	inet_twsk_put(tw);
}
On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > [patch description and diff snipped]
>
> This patch is not needed, and a timer_pending() test would be racy anyway.
>
> As already explained, del_timer_sync() takes care of this with proper locking:
>
> inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
> 	if (del_timer_sync(&tw->tw_timer))
> 		inet_twsk_kill(tw);
> 	inet_twsk_put(tw);
> }

Sorry, I'm lost. But now I understand that what I wrote in the morning
is not correct...

After rethinking this, only reordering the mod_timer() and
refcount_inc() in the initialization phase (just like the patch you
wrote) can ensure safety.

CPU 0 uses del_timer_sync() to detect whether the timer is running
while CPU 1 calls mod_timer() and then increments the refcount.

1) If CPU 0 sees the queued timer, which means the tw socket has
finished incrementing the refcount, then CPU 0 can kill
(inet_twsk_kill()) the socket.

2) If CPU 0 cannot see the queued timer (and will not kill the
socket), then CPU 1 will call mod_timer() and the timer will expire in
one minute. Another cpu will call inet_twsk_kill() to clear the tw
socket at last.

The patch you provided seems to solve the race issue in the right
way... I cannot see the problem with it :(

So what is the problem to be solved by your patch? Thanks for your
patience and help.

Thanks,
Jason
On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [earlier discussion snipped]
>
> Sorry, I'm lost. But now I understand that what I wrote in the morning
> is not correct...
>
> After rethinking this, only reordering the mod_timer() and
> refcount_inc() in the initialization phase (just like the patch you
> wrote) can ensure safety.
>
> CPU 0 uses del_timer_sync() to detect whether the timer is running
> while CPU 1 calls mod_timer() and then increments the refcount.
>
> 1) If CPU 0 sees the queued timer, which means the tw socket has
> finished incrementing the refcount, then CPU 0 can kill
> (inet_twsk_kill()) the socket.
>
> 2) If CPU 0 cannot see the queued timer (and will not kill the
> socket), then CPU 1 will call mod_timer() and the timer will expire in
> one minute. Another cpu will call inet_twsk_kill() to clear the tw
> socket at last.
>
> The patch you provided seems to solve the race issue in the right
> way... I cannot see the problem with it :(
>
> So what is the problem to be solved by your patch? Thanks for your
> patience and help.

I already explained that when we do the mod_timer() and
refcount_inc(&tw->tw_dr->tw_refcount), the tw refcount is 0, and no
other cpu can possibly use the tw.

So the order of these operations is not really relevant.

Another cpu, before trying to use the tw, would first have to acquire
a reference on the tw refcount, and this is not allowed/possible.

inet_twsk_purge() would hit this at that failed attempt:

	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
		continue;

I think we should wait for more syzbot occurrences before spending
more time on this.

It is possible the single report was caused by unrelated memory
mangling in the kernel, or by some more fundamental issue involving
kernel listeners and sockets.
On Thu, May 23, 2024 at 5:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > [earlier discussion snipped]
>
> I already explained that when we do the mod_timer() and
> refcount_inc(&tw->tw_dr->tw_refcount), the tw refcount is 0, and no
> other cpu can possibly use the tw.
>
> So the order of these operations is not really relevant.
>
> Another cpu, before trying to use the tw, would first have to acquire
> a reference on the tw refcount, and this is not allowed/possible.
>
> inet_twsk_purge() would hit this at that failed attempt:
>
> 	if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> 		continue;
>
> I think we should wait for more syzbot occurrences before spending
> more time on this.
>
> It is possible the single report was caused by unrelated memory
> mangling in the kernel, or by some more fundamental issue involving
> kernel listeners and sockets.

Thank you so much, Eric!! I learned a lot.
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e..b890d1c280a1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 
 		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
 						     LINUX_MIB_TIMEWAITED);
-		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
 		refcount_inc(&tw->tw_dr->tw_refcount);
+		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
@@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
 		rcu_read_unlock();
 		local_bh_disable();
 		if (state == TCP_TIME_WAIT) {
-			inet_twsk_deschedule_put(inet_twsk(sk));
+			struct inet_timewait_sock *tw = inet_twsk(sk);
+
+			/* If the timer is armed, we can safely destroy
+			 * it, or else we postpone the process of destruction
+			 * to tw_timer_handler().
+			 */
+			if (timer_pending(&tw->tw_timer))
+				inet_twsk_deschedule_put(tw);
 		} else {
 			struct request_sock *req = inet_reqsk(sk);