Message ID | 20240521144930.23805-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] Revert "rds: tcp: Fix use-after-free of net in reqsk_timer_handler()." | expand |
On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > If we failed in setup_net() and try to undo the setup process, the > reference now is 1 which shouldn't be decremented. However, it happened > actually. > > After applying this patch which allows us to check the reference first, > it will not hit zero anymore in tcp_twsk_purge() without calling > inet_twsk_purge() one more time. > > [1] > refcount_t: decrement hit 0; leaking memory. > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > Modules linked in: > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > Call Trace: > <TASK> > __refcount_dec include/linux/refcount.h:336 [inline] > refcount_dec include/linux/refcount.h:351 [inline] > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > __do_sys_unshare kernel/fork.c:3394 [inline] > __se_sys_unshare kernel/fork.c:3392 [inline] > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f56d7c7cee9 > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > Signed-off-by: Jason Xing <kernelxing@tencent.com> > --- > The reverted patch trying to solve another issue causes unexpected error as above. I > think that issue can be properly analyzed and handled later. So can we revert it first? > --- > net/ipv4/tcp_minisocks.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index b93619b2384b..46e6f9db4227 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > } else if (!purged_once) { > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > + continue; > + > inet_twsk_purge(&tcp_hashinfo); > purged_once = true; > } > -- This can not be a fix for a race condition. By definition a TW has a refcount on tcp_death_row.tw_refcount only if its timer is armed. And inet_twsk_deschedule_put() does if (del_timer_sync(&tw->tw_timer)) inet_twsk_kill(tw); I think you need to provide a full explanation, instead of a shot in the dark. Before releasing this syzbot, I thought that maybe the refcount_inc() was done too late, but considering other invariants, this should not matter. diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm) __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : LINUX_MIB_TIMEWAITED); - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); refcount_inc(&tw->tw_dr->tw_refcount); + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); } else { mod_timer_pending(&tw->tw_timer, jiffies + timeo); }
Hello Eric, On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > > > If we failed in setup_net() and try to undo the setup process, the > > reference now is 1 which shouldn't be decremented. However, it happened > > actually. > > > > After applying this patch which allows us to check the reference first, > > it will not hit zero anymore in tcp_twsk_purge() without calling > > inet_twsk_purge() one more time. > > > > [1] > > refcount_t: decrement hit 0; leaking memory. > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > Modules linked in: > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > > Call Trace: > > <TASK> > > __refcount_dec include/linux/refcount.h:336 [inline] > > refcount_dec include/linux/refcount.h:351 [inline] > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > > __do_sys_unshare kernel/fork.c:3394 [inline] > > __se_sys_unshare kernel/fork.c:3392 [inline] > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > RIP: 0033:0x7f56d7c7cee9 > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > --- > > The reverted patch trying to solve another issue causes unexpected error as above. I > > think that issue can be properly analyzed and handled later. So can we revert it first? > > --- > > net/ipv4/tcp_minisocks.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > index b93619b2384b..46e6f9db4227 100644 > > --- a/net/ipv4/tcp_minisocks.c > > +++ b/net/ipv4/tcp_minisocks.c > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > > } else if (!purged_once) { > > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > > + continue; > > + > > inet_twsk_purge(&tcp_hashinfo); > > purged_once = true; > > } > > -- > > This can not be a fix for a race condition. > > By definition a TW has a refcount on tcp_death_row.tw_refcount only > if its timer is armed. > > And inet_twsk_deschedule_put() does > > if (del_timer_sync(&tw->tw_timer)) > inet_twsk_kill(tw); > > I think you need to provide a full explanation, instead of a shot in the dark. > > Before releasing this syzbot, I thought that maybe the refcount_inc() > was done too late, > but considering other invariants, this should not matter. > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc > 100644 > --- a/net/ipv4/inet_timewait_sock.c > +++ b/net/ipv4/inet_timewait_sock.c > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct > inet_timewait_sock *tw, int timeo, bool rearm) > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : > LINUX_MIB_TIMEWAITED); > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > refcount_inc(&tw->tw_dr->tw_refcount); > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > } else { > mod_timer_pending(&tw->tw_timer, jiffies + timeo); > } Thanks for your information. What you wrote is right, I think for a while. In inet_twsk_deschedule_put(), there are two possible cases: 1) if it finds the timer is armed, then it can kill the tw socket by decrementing the refcount in the right way. So it's a good/lucky thing for us. or 2) if it misses the point, then the tw socket arms the timer which will expire in 60 seconds in the initialization phase. The tw socket would have a chance to be destroyed when the timer expires. It seems that you don't think your code could solve the race issue? Thanks, Jason
On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > Hello Eric, > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > > > > > If we failed in setup_net() and try to undo the setup process, the > > > reference now is 1 which shouldn't be decremented. However, it happened > > > actually. > > > > > > After applying this patch which allows us to check the reference first, > > > it will not hit zero anymore in tcp_twsk_purge() without calling > > > inet_twsk_purge() one more time. > > > > > > [1] > > > refcount_t: decrement hit 0; leaking memory. > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > Modules linked in: > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > > > Call Trace: > > > <TASK> > > > __refcount_dec include/linux/refcount.h:336 [inline] > > > refcount_dec include/linux/refcount.h:351 [inline] > > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > > > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > > > __do_sys_unshare kernel/fork.c:3394 [inline] > > > __se_sys_unshare kernel/fork.c:3392 [inline] > > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > RIP: 0033:0x7f56d7c7cee9 > > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > --- > > > The reverted patch trying to solve another issue causes unexpected error as above. I > > > think that issue can be properly analyzed and handled later. So can we revert it first? > > > --- > > > net/ipv4/tcp_minisocks.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > > index b93619b2384b..46e6f9db4227 100644 > > > --- a/net/ipv4/tcp_minisocks.c > > > +++ b/net/ipv4/tcp_minisocks.c > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > > > } else if (!purged_once) { > > > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > > > + continue; > > > + > > > inet_twsk_purge(&tcp_hashinfo); > > > purged_once = true; > > > } > > > -- > > > > This can not be a fix for a race condition. > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount only > > if its timer is armed. > > > > And inet_twsk_deschedule_put() does > > > > if (del_timer_sync(&tw->tw_timer)) > > inet_twsk_kill(tw); > > > > I think you need to provide a full explanation, instead of a shot in the dark. > > > > Before releasing this syzbot, I thought that maybe the refcount_inc() > > was done too late, > > but considering other invariants, this should not matter. > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc > > 100644 > > --- a/net/ipv4/inet_timewait_sock.c > > +++ b/net/ipv4/inet_timewait_sock.c > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct > > inet_timewait_sock *tw, int timeo, bool rearm) > > > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : > > LINUX_MIB_TIMEWAITED); > > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > refcount_inc(&tw->tw_dr->tw_refcount); > > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > } else { > > mod_timer_pending(&tw->tw_timer, jiffies + timeo); > > } > > Thanks for your information. > > What you wrote is right, I think for a while. In > inet_twsk_deschedule_put(), there are two possible cases: > 1) if it finds the timer is armed, then it can kill the tw socket by > decrementing the refcount in the right way. So it's a good/lucky thing > for us. > or > 2) if it misses the point, then the tw socket arms the timer which > will expire in 60 seconds in the initialization phase. The tw socket > would have a chance to be destroyed when the timer expires. > > It seems that you don't think your code could solve the race issue? inet_twsk_schedule(tw) is called while the tw refcount is still 0, so this tw can be be found in inet_twsk_purge() at the same time. Look at inet_twsk_hashdance() for details.
On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > Hello Eric, > > > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > > > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > > > > > > > If we failed in setup_net() and try to undo the setup process, the > > > > reference now is 1 which shouldn't be decremented. However, it happened > > > > actually. > > > > > > > > After applying this patch which allows us to check the reference first, > > > > it will not hit zero anymore in tcp_twsk_purge() without calling > > > > inet_twsk_purge() one more time. > > > > > > > > [1] > > > > refcount_t: decrement hit 0; leaking memory. > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > Modules linked in: > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > > > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > > > > Call Trace: > > > > <TASK> > > > > __refcount_dec include/linux/refcount.h:336 [inline] > > > > refcount_dec include/linux/refcount.h:351 [inline] > > > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > > > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > > > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > > > > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > > > > __do_sys_unshare kernel/fork.c:3394 [inline] > > > > __se_sys_unshare kernel/fork.c:3392 [inline] > > > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > RIP: 0033:0x7f56d7c7cee9 > > > > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > > --- > > > > The reverted patch trying to solve another issue causes unexpected error as above. I > > > > think that issue can be properly analyzed and handled later. So can we revert it first? > > > > --- > > > > net/ipv4/tcp_minisocks.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > > > index b93619b2384b..46e6f9db4227 100644 > > > > --- a/net/ipv4/tcp_minisocks.c > > > > +++ b/net/ipv4/tcp_minisocks.c > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > > > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > > > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > > > > } else if (!purged_once) { > > > > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > > > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > > > > + continue; > > > > + > > > > inet_twsk_purge(&tcp_hashinfo); > > > > purged_once = true; > > > > } > > > > -- > > > > > > This can not be a fix for a race condition. > > > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount only > > > if its timer is armed. > > > > > > And inet_twsk_deschedule_put() does > > > > > > if (del_timer_sync(&tw->tw_timer)) > > > inet_twsk_kill(tw); > > > > > > I think you need to provide a full explanation, instead of a shot in the dark. > > > > > > Before releasing this syzbot, I thought that maybe the refcount_inc() > > > was done too late, > > > but considering other invariants, this should not matter. > > > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc > > > 100644 > > > --- a/net/ipv4/inet_timewait_sock.c > > > +++ b/net/ipv4/inet_timewait_sock.c > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct > > > inet_timewait_sock *tw, int timeo, bool rearm) > > > > > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : > > > LINUX_MIB_TIMEWAITED); > > > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > refcount_inc(&tw->tw_dr->tw_refcount); > > > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > } else { > > > mod_timer_pending(&tw->tw_timer, jiffies + timeo); > > > } > > > > Thanks for your information. > > > > What you wrote is right, I think for a while. In > > inet_twsk_deschedule_put(), there are two possible cases: > > 1) if it finds the timer is armed, then it can kill the tw socket by > > decrementing the refcount in the right way. So it's a good/lucky thing > > for us. > > or > > 2) if it misses the point, then the tw socket arms the timer which > > will expire in 60 seconds in the initialization phase. The tw socket > > would have a chance to be destroyed when the timer expires. > > > > It seems that you don't think your code could solve the race issue? > > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so > this tw can be be found in inet_twsk_purge() > at the same time. Look at inet_twsk_hashdance() for details. Yes, after inet_twsk_purge() finds the tw, there are two cases like my previous email said after applying your code. For 1) case, everything is good. inet_twsk_purge() will finish it up because it can decrement the refcount safely. For 2) case, even though inet_twsk_purge() finds it, it's not the time to destroy it until the expired tw timer will finally handle the process of destruction by calling inet_twsk_kill(), right? Let the timer handle the destruction until its end of life, which I think is a normal process for all the timewait sockets. The only difference in 2) case is that inet_twsk_purge() calls inet_twsk_put() twice while tw_timer_handler() only calls it one time. Am I missing something? Thanks, Jason
On Wed, May 22, 2024 at 12:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > Hello Eric, > > > > > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > > > > > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > > > > > > > > > If we failed in setup_net() and try to undo the setup process, the > > > > > reference now is 1 which shouldn't be decremented. However, it happened > > > > > actually. > > > > > > > > > > After applying this patch which allows us to check the reference first, > > > > > it will not hit zero anymore in tcp_twsk_purge() without calling > > > > > inet_twsk_purge() one more time. > > > > > > > > > > [1] > > > > > refcount_t: decrement hit 0; leaking memory. > > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > > Modules linked in: > > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > > > > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > > > > > Call Trace: > > > > > <TASK> > > > > > __refcount_dec include/linux/refcount.h:336 [inline] > > > > > refcount_dec include/linux/refcount.h:351 [inline] > > > > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > > > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > > > > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > > > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > > > > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > > > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > > > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > > > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > > > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > > > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > > > > > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > > > > > __do_sys_unshare kernel/fork.c:3394 [inline] > > > > > __se_sys_unshare kernel/fork.c:3392 [inline] > > > > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > > RIP: 0033:0x7f56d7c7cee9 > > > > > > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > > > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > > > --- > > > > > The reverted patch trying to solve another issue causes unexpected error as above. I > > > > > think that issue can be properly analyzed and handled later. So can we revert it first? > > > > > --- > > > > > net/ipv4/tcp_minisocks.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > > > > index b93619b2384b..46e6f9db4227 100644 > > > > > --- a/net/ipv4/tcp_minisocks.c > > > > > +++ b/net/ipv4/tcp_minisocks.c > > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > > > > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > > > > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > > > > > } else if (!purged_once) { > > > > > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > > > > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > > > > > + continue; > > > > > + > > > > > inet_twsk_purge(&tcp_hashinfo); > > > > > purged_once = true; > > > > > } > > > > > -- > > > > > > > > This can not be a fix for a race condition. > > > > > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount only > > > > if its timer is armed. > > > > > > > > And inet_twsk_deschedule_put() does > > > > > > > > if (del_timer_sync(&tw->tw_timer)) > > > > inet_twsk_kill(tw); > > > > > > > > I think you need to provide a full explanation, instead of a shot in the dark. > > > > > > > > Before releasing this syzbot, I thought that maybe the refcount_inc() > > > > was done too late, > > > > but considering other invariants, this should not matter. > > > > > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc > > > > 100644 > > > > --- a/net/ipv4/inet_timewait_sock.c > > > > +++ b/net/ipv4/inet_timewait_sock.c > > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct > > > > inet_timewait_sock *tw, int timeo, bool rearm) > > > > > > > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : > > > > LINUX_MIB_TIMEWAITED); > > > > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > > refcount_inc(&tw->tw_dr->tw_refcount); > > > > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > > } else { > > > > mod_timer_pending(&tw->tw_timer, jiffies + timeo); > > > > } > > > > > > Thanks for your information. > > > > > > What you wrote is right, I think for a while. In > > > inet_twsk_deschedule_put(), there are two possible cases: > > > 1) if it finds the timer is armed, then it can kill the tw socket by > > > decrementing the refcount in the right way. So it's a good/lucky thing > > > for us. > > > or > > > 2) if it misses the point, then the tw socket arms the timer which > > > will expire in 60 seconds in the initialization phase. The tw socket > > > would have a chance to be destroyed when the timer expires. > > > > > > It seems that you don't think your code could solve the race issue? > > > > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so > > this tw can be be found in inet_twsk_purge() > > at the same time. Look at inet_twsk_hashdance() for details. > > Yes, after inet_twsk_purge() finds the tw, there are two cases like my > previous email said after applying your code. > > For 1) case, everything is good. inet_twsk_purge() will finish it up > because it can decrement the refcount safely. > > For 2) case, even though inet_twsk_purge() finds it, it's not the time > to destroy it until the expired tw timer will finally handle the > process of destruction by calling inet_twsk_kill(), right? Let the > timer handle the destruction until its end of life, which I think is a > normal process for all the timewait sockets. > > The only difference in 2) case is that inet_twsk_purge() calls > inet_twsk_put() twice while tw_timer_handler() only calls it one time. > > Am I missing something? You are missing that inet_twsk_deschedule_put() is doing : if (del_timer_sync(&tw->tw_timer)) inet_twsk_kill(tw); You can not have both tw_timer_handler() and inet_twsk_deschedule_put() calling inet_twsk_kill(tw); Take a look at del_timer_sync()
On Wed, May 22, 2024 at 6:54 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, May 22, 2024 at 12:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > Hello Eric, > > > > > > > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24. > > > > > > > > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory. > > > > > > > > > > > > If we failed in setup_net() and try to undo the setup process, the > > > > > > reference now is 1 which shouldn't be decremented. However, it happened > > > > > > actually. > > > > > > > > > > > > After applying this patch which allows us to check the reference first, > > > > > > it will not hit zero anymore in tcp_twsk_purge() without calling > > > > > > inet_twsk_purge() one more time. > > > > > > > > > > > > [1] > > > > > > refcount_t: decrement hit 0; leaking memory. > > > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > > > Modules linked in: > > > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0 > > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > > > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31 > > > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282 > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000 > > > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001 > > > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000 > > > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80 > > > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567 > > > > > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000 > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0 > > > > > > Call Trace: > > > > > > <TASK> > > > > > > __refcount_dec include/linux/refcount.h:336 [inline] > > > > > > refcount_dec include/linux/refcount.h:351 [inline] > > > > > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 > > > > > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline] > > > > > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 > > > > > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402 > > > > > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 > > > > > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 > > > > > > setup_net+0x714/0xb40 net/core/net_namespace.c:375 > > > > > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 > > > > > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 > > > > > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 > > > > > > ksys_unshare+0x419/0x970 kernel/fork.c:3323 > > > > > > __do_sys_unshare kernel/fork.c:3394 [inline] > > > > > > __se_sys_unshare kernel/fork.c:3392 [inline] > > > > > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392 > > > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 > > > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > > > RIP: 0033:0x7f56d7c7cee9 > > > > > > > > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().") > > > > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com > > > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251 > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > > > > --- > > > > > > The reverted patch trying to solve another issue causes unexpected error as above. I > > > > > > think that issue can be properly analyzed and handled later. So can we revert it first? > > > > > > --- > > > > > > net/ipv4/tcp_minisocks.c | 4 ++++ > > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > > > > > index b93619b2384b..46e6f9db4227 100644 > > > > > > --- a/net/ipv4/tcp_minisocks.c > > > > > > +++ b/net/ipv4/tcp_minisocks.c > > > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) > > > > > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */ > > > > > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); > > > > > > } else if (!purged_once) { > > > > > > + /* The last refcount is decremented in tcp_sk_exit_batch() */ > > > > > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > > > > > > + continue; > > > > > > + > > > > > > inet_twsk_purge(&tcp_hashinfo); > > > > > > purged_once = true; > > > > > > } > > > > > > -- > > > > > > > > > > This can not be a fix for a race condition. > > > > > > > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount only > > > > > if its timer is armed. > > > > > > > > > > And inet_twsk_deschedule_put() does > > > > > > > > > > if (del_timer_sync(&tw->tw_timer)) > > > > > inet_twsk_kill(tw); > > > > > > > > > > I think you need to provide a full explanation, instead of a shot in the dark. > > > > > > > > > > Before releasing this syzbot, I thought that maybe the refcount_inc() > > > > > was done too late, > > > > > but considering other invariants, this should not matter. > > > > > > > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c > > > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc > > > > > 100644 > > > > > --- a/net/ipv4/inet_timewait_sock.c > > > > > +++ b/net/ipv4/inet_timewait_sock.c > > > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct > > > > > inet_timewait_sock *tw, int timeo, bool rearm) > > > > > > > > > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : > > > > > LINUX_MIB_TIMEWAITED); > > > > > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > > > refcount_inc(&tw->tw_dr->tw_refcount); > > > > > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); > > > > > } else { > > > > > mod_timer_pending(&tw->tw_timer, jiffies + timeo); > > > > > } > > > > > > > > Thanks for your information. > > > > > > > > What you wrote is right, I think for a while. In > > > > inet_twsk_deschedule_put(), there are two possible cases: > > > > 1) if it finds the timer is armed, then it can kill the tw socket by > > > > decrementing the refcount in the right way. So it's a good/lucky thing > > > > for us. > > > > or > > > > 2) if it misses the point, then the tw socket arms the timer which > > > > will expire in 60 seconds in the initialization phase. The tw socket > > > > would have a chance to be destroyed when the timer expires. > > > > > > > > It seems that you don't think your code could solve the race issue? > > > > > > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so > > > this tw can be be found in inet_twsk_purge() > > > at the same time. Look at inet_twsk_hashdance() for details. > > > > Yes, after inet_twsk_purge() finds the tw, there are two cases like my > > previous email said after applying your code. > > > > For 1) case, everything is good. inet_twsk_purge() will finish it up > > because it can decrement the refcount safely. > > > > For 2) case, even though inet_twsk_purge() finds it, it's not the time > > to destroy it until the expired tw timer will finally handle the > > process of destruction by calling inet_twsk_kill(), right? Let the > > timer handle the destruction until its end of life, which I think is a > > normal process for all the timewait sockets. > > > > The only difference in 2) case is that inet_twsk_purge() calls > > inet_twsk_put() twice while tw_timer_handler() only calls it one time. > > > > Am I missing something? > > You are missing that inet_twsk_deschedule_put() is doing : > > if (del_timer_sync(&tw->tw_timer)) > inet_twsk_kill(tw); > > You can not have both tw_timer_handler() and > inet_twsk_deschedule_put() calling inet_twsk_kill(tw); [...] > > Take a look at del_timer_sync() Oops, I missed this function...Thanks for pointing it out. I can test if the timer is active or queued before we can delete, like this on top of your patch: diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index e28075f0006e..b890d1c280a1 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm) __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED : LINUX_MIB_TIMEWAITED); - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); refcount_inc(&tw->tw_dr->tw_refcount); + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo)); } else { mod_timer_pending(&tw->tw_timer, jiffies + timeo); } @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo) rcu_read_unlock(); local_bh_disable(); if (state == TCP_TIME_WAIT) { - inet_twsk_deschedule_put(inet_twsk(sk)); + struct inet_timewait_sock *tw = inet_twsk(sk); + + /* If the timer is armed, we can safely destroy + * it, or else we postpone the process of destruction + * to tw_timer_handler(). + */ + if (timer_pending(&tw->tw_timer)) + inet_twsk_deschedule_put(tw); } else { struct request_sock *req = inet_reqsk(sk); As above, if we find the timer is armed, then we can destroy it naturally in inet_twsk_deschedule_put(). If not, we can skip it in the inet_twsk_purge() and then postpone the destruction process in the timer handler. I think I need to add another test about whether the timer is running or not. If testing the status of timer is not a good way to go, perhaps I would like to introduce a new flag to indicate whether the tw socket still exists in the initialization phase. Sorry, It's __not__ tested because it's too late for me, I will spend more time taking care of this race condition tomorrow. Thanks, Jason
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index b93619b2384b..46e6f9db4227 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list) /* Even if tw_refcount == 1, we must clean up kernel reqsk */ inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); } else if (!purged_once) { + /* The last refcount is decremented in tcp_sk_exit_batch() */ + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) + continue; + inet_twsk_purge(&tcp_hashinfo); purged_once = true; }