diff mbox series

[v2,net] tcp: fix a race when purging the netns and allocating tw socket

Message ID 20240523050357.43941-1-kerneljasonxing@gmail.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Headers show
Series [v2,net] tcp: fix a race when purging the netns and allocating tw socket | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 907 this patch: 907
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 909 this patch: 909
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 911 this patch: 911
netdev/checkpatch warning WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-05-23--12-00 (tests: 1038)

Commit Message

Jason Xing May 23, 2024, 5:03 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

Syzbot[1] reported that a decrement of the reference count hit zero, leaking memory.

Race condition:
   CPU 0                      CPU 1
   -----                      -----
inet_twsk_purge            tcp_time_wait
inet_twsk_deschedule_put   __inet_twsk_schedule
                           mod_timer(tw_timer...
del_timer_sync
inet_twsk_kill
refcount_dec(tw_refcount)[1]
                           refcount_inc(tw_refcount)[2]

The race happens because the decrement at [1] runs before the increment at [2], so the refcount hits zero prematurely.

After reordering mod_timer() and refcount_inc() in the initialization
phase, we can use the status of the timer as an indicator to decide
whether to destroy the tw socket in inet_twsk_purge() or postpone the
destruction to tw_timer_handler().

After this patch is applied, there are four possible cases:
1) if we can see the armed timer during the initialization phase
   CPU 0                      CPU 1
   -----                      -----
inet_twsk_purge            tcp_time_wait
inet_twsk_deschedule_put   __inet_twsk_schedule
                           refcount_inc(tw_refcount)
                           mod_timer(tw_timer...
test if the timer is queued
//timer is queued
del_timer_sync
inet_twsk_kill
refcount_dec(tw_refcount)
Note: we finish it up in the purge process.

2) if we fail to see the armed timer during the initialization phase
   CPU 0                      CPU 1
   -----                      -----
inet_twsk_purge            tcp_time_wait
inet_twsk_deschedule_put   __inet_twsk_schedule
                           refcount_inc(tw_refcount)
test if the timer is queued
//timer isn't queued
postpone
                           mod_timer(tw_timer...
Note: later, in another context, the expired timer will finish up the tw socket

3) if we're seeing a running timer after the initialization phase
   CPU 0                      CPU 1                    CPU 2
   -----                      -----                    -----
                           tcp_time_wait
                           __inet_twsk_schedule
                           refcount_inc(tw_refcount)
                           mod_timer(tw_timer...
                           ...(around 1 min)...
inet_twsk_purge
inet_twsk_deschedule_put
test if the timer is queued
// timer is running
skip                                              tw_timer_handler
Note: CPU 2 is destroying the timewait socket

4) if we're seeing a pending timer after the initialization phase
   CPU 0                      CPU 1
   -----                      -----
                           tcp_time_wait
                           __inet_twsk_schedule
                           refcount_inc(tw_refcount)
                           mod_timer(tw_timer...
                           ...(< 1 min)...
inet_twsk_purge
inet_twsk_deschedule_put
test if the timer is queued
// timer is queued
del_timer_sync
inet_twsk_kill

Therefore, by making sure that either inet_twsk_purge() or
tw_timer_handler(), but never both, destroys the timewait socket, we
can handle all four cases above.

[1]
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:336 [inline]
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
 ksys_unshare+0x419/0x970 kernel/fork.c:3323
 __do_sys_unshare kernel/fork.c:3394 [inline]
 __se_sys_unshare kernel/fork.c:3392 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f56d7c7cee9

Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
v2
Link: https://lore.kernel.org/all/20240521144930.23805-1-kerneljasonxing@gmail.com/
1. Use the timer as a flag to test whether we can safely destroy the timewait
socket, on top of the patch Eric wrote.
2. Change the title and add more explanation to the commit message.
---
 net/ipv4/inet_timewait_sock.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Eric Dumazet May 23, 2024, 7:19 a.m. UTC | #1
On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> [...]
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e28075f0006e..b890d1c280a1 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
>
>                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
>                                                      LINUX_MIB_TIMEWAITED);
> -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>                 refcount_inc(&tw->tw_dr->tw_refcount);
> +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>         } else {
>                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>         }
> @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
>                         rcu_read_unlock();
>                         local_bh_disable();
>                         if (state == TCP_TIME_WAIT) {
> -                               inet_twsk_deschedule_put(inet_twsk(sk));
> +                               struct inet_timewait_sock *tw = inet_twsk(sk);
> +
> +                               /* If the timer is armed, we can safely destroy
> +                                * it, or else we postpone the process of destruction
> +                                * to tw_timer_handler().
> +                                */
> +                               if (timer_pending(&tw->tw_timer))
> +                                       inet_twsk_deschedule_put(tw);


This patch is not needed, and a timer_pending() check would be racy anyway.

As already explained, del_timer_sync() takes care of this with proper locking.

inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
  if (del_timer_sync(&tw->tw_timer))
     inet_twsk_kill(tw);
  inet_twsk_put(tw);
}
Jason Xing May 23, 2024, 9:23 a.m. UTC | #2
On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > [...]
>
>
> This patch is not needed, and a timer_pending() check would be racy anyway.
>
> As already explained, del_timer_sync() takes care of this with proper locking.
>
> inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
>   if (del_timer_sync(&tw->tw_timer))
>      inet_twsk_kill(tw);
>   inet_twsk_put(tw);
> }

Sorry, I'm lost. But now I understand that what I wrote in the morning is
not correct...

After rethinking this, only reordering mod_timer() and
refcount_inc() in the initialization phase (just like the patch you
wrote) can ensure safety.

CPU 0 uses del_timer_sync() to detect whether the timer is queued, while
CPU 1 increments the refcount and then calls mod_timer().

1) If CPU 0 sees the queued timer, which means the tw socket has
finished incrementing the refcount, then CPU 0 can kill
(inet_twsk_kill()) the socket.

2) If CPU 0 cannot see the queued timer (and so will not kill the
socket), then CPU 1's mod_timer() will expire in one minute and
another CPU will call inet_twsk_kill() to clear the tw socket.

The patch you provided seems to solve the race issue in the right
way... I cannot see the problem with it :(

So what is the problem to be solved with your patch? Thanks for your
patience and help.

Thanks,
Jason
Eric Dumazet May 23, 2024, 9:36 a.m. UTC | #3
On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > [...]
> >
> >
> > This patch is not needed, and a timer_pending() would be racy anyway.
> >
> > As already explained, del_timer_sync() takes care of this with proper locking.
> >
> > inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > {
> >   if (del_timer_sync(&tw->tw_timer))
> >      inet_twsk_kill(tw);
> >   inet_twsk_put(tw);
> > }
>
> Sorry, I'm lost. But now I understand that what I wrote this morning
> was not correct...
>
> After rethinking about this, only reordering the mod_timer() and
> refcount_inc() in the initialization phase (just like the patch you
> wrote) can ensure safety.
>
> CPU 0 uses del_timer_sync() to detect if the timer is running while
> CPU 1 calls mod_timer() and then increments the refcount.
>
> 1) If CPU 0 sees the queued timer, which means the tw socket has
> finished incrementing the refcount, then CPU 0 can kill
> (inet_twsk_kill()) the socket.
>
> 2) If CPU 0 cannot see the queued timer (and will not kill the
> socket), then CPU 1 will call mod_timer() and the timer will expire in
> one minute. Another CPU will then call inet_twsk_kill() to clean up
> the tw socket.
>
> The patch you provided seems to solve the race issue in the right
> way... I cannot see a problem with it :(
>
> So what problem is your patch meant to solve? Thanks for your
> patience and help.

I already explained that when we do the mod_timer() and
refcount_inc(&tw->tw_dr->tw_refcount),
the tw refcount is 0, and no other CPU can possibly use the tw.

So the order of these operations seems not really relevant.

Another CPU, before trying to use it, would first have to acquire a
reference on the tw refcount,
and this is not allowed/possible.

inet_twsk_purge() would skip such a socket at this failed attempt:

if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
     continue;

I think we should wait for more syzbot occurrences before spending
more time on this.

It is possible the single report was caused by unrelated memory
mangling in the kernel,
or by some more fundamental issue with kernel listeners and sockets.
Jason Xing May 23, 2024, 12:15 p.m. UTC | #4
On Thu, May 23, 2024 at 5:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Syzbot [1] reported a reference count decrement hitting zero, leaking memory.
> > > >
> > > > Race condition:
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            mod_timer(tw_timer...
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > > refcount_dec(tw_refcount)[1]
> > > >                            refcount_inc(tw_refcount)[2]
> > > >
> > > > The race happens because [1] decrements the refcount before [2] increments it.
> > > >
> > > > After we reorder mod_timer() and refcount_inc() in the initialization
> > > > phase, we can use the status of the timer as an indicator of whether to
> > > > destroy the tw socket in inet_twsk_purge() or to postpone that to
> > > > tw_timer_handler().
> > > >
> > > > After this patch is applied, we get four possible cases:
> > > > 1) if we can see the armed timer during the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > > test if the timer is queued
> > > > //timer is queued
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > > refcount_dec(tw_refcount)
> > > > Note: we finish it up in the purge process.
> > > >
> > > > 2) if we fail to see the armed timer during the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > > test if the timer is queued
> > > > //timer isn't queued
> > > > postpone
> > > >                            mod_timer(tw_timer...
> > > > Note: later, in another context, expired timer will finish up tw socket
> > > >
> > > > 3) if we're seeing a running timer after the initialization phase
> > > >    CPU 0                      CPU 1                    CPU 2
> > > >    -----                      -----                    -----
> > > >                            tcp_time_wait
> > > >                            __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > >                            ...(around 1 min)...
> > > > inet_twsk_purge
> > > > inet_twsk_deschedule_put
> > > > test if the timer is queued
> > > > // timer is running
> > > > skip                                              tw_timer_handler
> > > > Note: CPU 2 is destroying the timewait socket
> > > >
> > > > 4) if we're seeing a pending timer after the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > >                            tcp_time_wait
> > > >                            __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > >                            ...(< 1 min)...
> > > > inet_twsk_purge
> > > > inet_twsk_deschedule_put
> > > > test if the timer is queued
> > > > // timer is queued
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > >
> > > > Therefore, as long as we make sure that either inet_twsk_purge() or
> > > > tw_timer_handler() destroys the timewait socket, we can handle all the
> > > > cases above.
> > > >
> > > > [1]
> > > > refcount_t: decrement hit 0; leaking memory.
> > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > Modules linked in:
> > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > Call Trace:
> > > >  <TASK>
> > > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > > >  refcount_dec include/linux/refcount.h:351 [inline]
> > > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > RIP: 0033:0x7f56d7c7cee9
> > > >
> > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > v2
> > > > Link: https://lore.kernel.org/all/20240521144930.23805-1-kerneljasonxing@gmail.com/
> > > > 1. Use the timer as a flag to test whether we can safely destroy the
> > > > timewait socket, on top of the patch Eric wrote.
> > > > 2. Change the title and add more explanation to the commit message.
> > > > ---
> > > >  net/ipv4/inet_timewait_sock.c | 11 +++++++++--
> > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > index e28075f0006e..b890d1c280a1 100644
> > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
> > > >
> > > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > > >                                                      LINUX_MIB_TIMEWAITED);
> > > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >         } else {
> > > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > > >         }
> > > > @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> > > >                         rcu_read_unlock();
> > > >                         local_bh_disable();
> > > >                         if (state == TCP_TIME_WAIT) {
> > > > -                               inet_twsk_deschedule_put(inet_twsk(sk));
> > > > +                               struct inet_timewait_sock *tw = inet_twsk(sk);
> > > > +
> > > > +                               /* If the timer is armed, we can safely destroy
> > > > +                                * it, or else we postpone the process of destruction
> > > > +                                * to tw_timer_handler().
> > > > +                                */
> > > > +                               if (timer_pending(&tw->tw_timer))
> > > > +                                       inet_twsk_deschedule_put(tw);
> > >
> > >
> > > This patch is not needed, and a timer_pending() would be racy anyway.
> > >
> > > As already explained, del_timer_sync() takes care of this with proper locking.
> > >
> > > inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > > {
> > >   if (del_timer_sync(&tw->tw_timer))
> > >      inet_twsk_kill(tw);
> > >   inet_twsk_put(tw);
> > > }
> >
> > Sorry, I'm lost. But now I understand that what I wrote this morning
> > was not correct...
> >
> > After rethinking about this, only reordering the mod_timer() and
> > refcount_inc() in the initialization phase (just like the patch you
> > wrote) can ensure safety.
> >
> > CPU 0 uses del_timer_sync() to detect if the timer is running while
> > CPU 1 calls mod_timer() and then increments the refcount.
> >
> > 1) If CPU 0 sees the queued timer, which means the tw socket has
> > finished incrementing the refcount, then CPU 0 can kill
> > (inet_twsk_kill()) the socket.
> >
> > 2) If CPU 0 cannot see the queued timer (and will not kill the
> > socket), then CPU 1 will call mod_timer() and the timer will expire in
> > one minute. Another CPU will then call inet_twsk_kill() to clean up
> > the tw socket.
> >
> > The patch you provided seems to solve the race issue in the right
> > way... I cannot see a problem with it :(
> >
> > So what problem is your patch meant to solve? Thanks for your
> > patience and help.
>
> I already explained that when we do the mod_timer() and
> refcount_inc(&tw->tw_dr->tw_refcount),
> the tw refcount is 0, and no other CPU can possibly use the tw.
>
> So the order of these operations seems not really relevant.
>
> Another CPU, before trying to use it, would first have to acquire a
> reference on the tw refcount,
> and this is not allowed/possible.
>
> inet_twsk_purge() would skip such a socket at this failed attempt:
>
> if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>      continue;
>
> I think we should wait for more syzbot occurrences before spending
> more time on this.
>
> It is possible the single report was caused by unrelated memory
> mangling in the kernel,
> or by some more fundamental issue with kernel listeners and sockets.

Thank you so much, Eric!! I learned a lot.

Patch

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e..b890d1c280a1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -255,8 +255,8 @@  void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 
 		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
 						     LINUX_MIB_TIMEWAITED);
-		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
 		refcount_inc(&tw->tw_dr->tw_refcount);
+		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
@@ -301,7 +301,14 @@  void inet_twsk_purge(struct inet_hashinfo *hashinfo)
 			rcu_read_unlock();
 			local_bh_disable();
 			if (state == TCP_TIME_WAIT) {
-				inet_twsk_deschedule_put(inet_twsk(sk));
+				struct inet_timewait_sock *tw = inet_twsk(sk);
+
+				/* If the timer is armed, we can safely destroy
+				 * it, or else we postpone the process of destruction
+				 * to tw_timer_handler().
+				 */
+				if (timer_pending(&tw->tw_timer))
+					inet_twsk_deschedule_put(tw);
 			} else {
 				struct request_sock *req = inet_reqsk(sk);