diff mbox series

[net] Revert "rds: tcp: Fix use-after-free of net in reqsk_timer_handler()."

Message ID 20240521144930.23805-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] Revert "rds: tcp: Fix use-after-free of net in reqsk_timer_handler()." | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 907 this patch: 907
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 909 this patch: 909
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 911 this patch: 911
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-05-21--21-00 (tests: 1039)

Commit Message

Jason Xing May 21, 2024, 2:49 p.m. UTC
From: Jason Xing <kernelxing@tencent.com>

This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.

Syzbot[1] reported the drecrement of reference count hits leaking memory.

If we failed in setup_net() and try to undo the setup process, the
reference now is 1 which shouldn't be decremented. However, it happened
actually.

After applying this patch which allows us to check the reference first,
it will not hit zero anymore in tcp_twsk_purge() without calling
inet_twsk_purge() one more time.

[1]
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:336 [inline]
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
 ksys_unshare+0x419/0x970 kernel/fork.c:3323
 __do_sys_unshare kernel/fork.c:3394 [inline]
 __se_sys_unshare kernel/fork.c:3392 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f56d7c7cee9

Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
The reverted patch trying to solve another issue causes unexpected error as above. I
think that issue can be properly analyzed and handled later. So can we revert it first?
---
 net/ipv4/tcp_minisocks.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Eric Dumazet May 21, 2024, 3:49 p.m. UTC | #1
On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
>
> Syzbot[1] reported the drecrement of reference count hits leaking memory.
>
> If we failed in setup_net() and try to undo the setup process, the
> reference now is 1 which shouldn't be decremented. However, it happened
> actually.
>
> After applying this patch which allows us to check the reference first,
> it will not hit zero anymore in tcp_twsk_purge() without calling
> inet_twsk_purge() one more time.
>
> [1]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> Modules linked in:
> CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  __refcount_dec include/linux/refcount.h:336 [inline]
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
>  ksys_unshare+0x419/0x970 kernel/fork.c:3323
>  __do_sys_unshare kernel/fork.c:3394 [inline]
>  __se_sys_unshare kernel/fork.c:3392 [inline]
>  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f56d7c7cee9
>
> Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> The reverted patch trying to solve another issue causes unexpected error as above. I
> think that issue can be properly analyzed and handled later. So can we revert it first?
> ---
>  net/ipv4/tcp_minisocks.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index b93619b2384b..46e6f9db4227 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
>                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
>                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
>                 } else if (!purged_once) {
> +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> +                               continue;
> +
>                         inet_twsk_purge(&tcp_hashinfo);
>                         purged_once = true;
>                 }
> --

This can not be a fix for a race condition.

By definition a TW has a refcount on tcp_death_row.tw_refcount   only
if its timer is armed.

And inet_twsk_deschedule_put() does

if (del_timer_sync(&tw->tw_timer))
    inet_twsk_kill(tw);

I think you need to provide a full explanation, instead of a shot in the dark.

Before releasing this syzbot, I thought that maybe the refcount_inc()
was done too late,
but considering other invariants, this should not matter.

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
inet_timewait_sock *tw, int timeo, bool rearm)

                __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
                                                     LINUX_MIB_TIMEWAITED);
-               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
                refcount_inc(&tw->tw_dr->tw_refcount);
+               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
        } else {
                mod_timer_pending(&tw->tw_timer, jiffies + timeo);
        }
Jason Xing May 22, 2024, 6:27 a.m. UTC | #2
Hello Eric,

On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> >
> > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> >
> > If we failed in setup_net() and try to undo the setup process, the
> > reference now is 1 which shouldn't be decremented. However, it happened
> > actually.
> >
> > After applying this patch which allows us to check the reference first,
> > it will not hit zero anymore in tcp_twsk_purge() without calling
> > inet_twsk_purge() one more time.
> >
> > [1]
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > Modules linked in:
> > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > Call Trace:
> >  <TASK>
> >  __refcount_dec include/linux/refcount.h:336 [inline]
> >  refcount_dec include/linux/refcount.h:351 [inline]
> >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> >  __do_sys_unshare kernel/fork.c:3394 [inline]
> >  __se_sys_unshare kernel/fork.c:3392 [inline]
> >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7f56d7c7cee9
> >
> > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > The reverted patch trying to solve another issue causes unexpected error as above. I
> > think that issue can be properly analyzed and handled later. So can we revert it first?
> > ---
> >  net/ipv4/tcp_minisocks.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index b93619b2384b..46e6f9db4227 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> >                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> >                 } else if (!purged_once) {
> > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > +                               continue;
> > +
> >                         inet_twsk_purge(&tcp_hashinfo);
> >                         purged_once = true;
> >                 }
> > --
>
> This can not be a fix for a race condition.
>
> By definition a TW has a refcount on tcp_death_row.tw_refcount   only
> if its timer is armed.
>
> And inet_twsk_deschedule_put() does
>
> if (del_timer_sync(&tw->tw_timer))
>     inet_twsk_kill(tw);
>
> I think you need to provide a full explanation, instead of a shot in the dark.
>
> Before releasing this syzbot, I thought that maybe the refcount_inc()
> was done too late,
> but considering other invariants, this should not matter.
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> inet_timewait_sock *tw, int timeo, bool rearm)
>
>                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
>                                                      LINUX_MIB_TIMEWAITED);
> -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>                 refcount_inc(&tw->tw_dr->tw_refcount);
> +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
>         } else {
>                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
>         }

Thanks for your information.

What you wrote is right, I think for a while. In
inet_twsk_deschedule_put(), there are two possible cases:
1) if it finds the timer is armed, then it can kill the tw socket by
decrementing the refcount in the right way. So it's a good/lucky thing
for us.
or
2) if it misses the point, then the tw socket arms the timer which
will expire in 60 seconds in the initialization phase. The tw socket
would have a chance to be destroyed when the timer expires.

It seems that you don't think your code could solve the race issue?

Thanks,
Jason
Eric Dumazet May 22, 2024, 6:51 a.m. UTC | #3
On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> > >
> > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > >
> > > If we failed in setup_net() and try to undo the setup process, the
> > > reference now is 1 which shouldn't be decremented. However, it happened
> > > actually.
> > >
> > > After applying this patch which allows us to check the reference first,
> > > it will not hit zero anymore in tcp_twsk_purge() without calling
> > > inet_twsk_purge() one more time.
> > >
> > > [1]
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > Modules linked in:
> > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > Call Trace:
> > >  <TASK>
> > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > >  refcount_dec include/linux/refcount.h:351 [inline]
> > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > RIP: 0033:0x7f56d7c7cee9
> > >
> > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > The reverted patch trying to solve another issue causes unexpected error as above. I
> > > think that issue can be properly analyzed and handled later. So can we revert it first?
> > > ---
> > >  net/ipv4/tcp_minisocks.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > index b93619b2384b..46e6f9db4227 100644
> > > --- a/net/ipv4/tcp_minisocks.c
> > > +++ b/net/ipv4/tcp_minisocks.c
> > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> > >                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> > >                 } else if (!purged_once) {
> > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > +                               continue;
> > > +
> > >                         inet_twsk_purge(&tcp_hashinfo);
> > >                         purged_once = true;
> > >                 }
> > > --
> >
> > This can not be a fix for a race condition.
> >
> > By definition a TW has a refcount on tcp_death_row.tw_refcount   only
> > if its timer is armed.
> >
> > And inet_twsk_deschedule_put() does
> >
> > if (del_timer_sync(&tw->tw_timer))
> >     inet_twsk_kill(tw);
> >
> > I think you need to provide a full explanation, instead of a shot in the dark.
> >
> > Before releasing this syzbot, I thought that maybe the refcount_inc()
> > was done too late,
> > but considering other invariants, this should not matter.
> >
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> > 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> > inet_timewait_sock *tw, int timeo, bool rearm)
> >
> >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> >                                                      LINUX_MIB_TIMEWAITED);
> > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> >         } else {
> >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> >         }
>
> Thanks for your information.
>
> What you wrote is right, I think for a while. In
> inet_twsk_deschedule_put(), there are two possible cases:
> 1) if it finds the timer is armed, then it can kill the tw socket by
> decrementing the refcount in the right way. So it's a good/lucky thing
> for us.
> or
> 2) if it misses the point, then the tw socket arms the timer which
> will expire in 60 seconds in the initialization phase. The tw socket
> would have a chance to be destroyed when the timer expires.
>
> It seems that you don't think your code could solve the race issue?

inet_twsk_schedule(tw) is called while the tw refcount is still 0, so
this tw can be be found in  inet_twsk_purge()
at the same time. Look at inet_twsk_hashdance() for details.
Jason Xing May 22, 2024, 10:41 a.m. UTC | #4
On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> > > >
> > > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > > >
> > > > If we failed in setup_net() and try to undo the setup process, the
> > > > reference now is 1 which shouldn't be decremented. However, it happened
> > > > actually.
> > > >
> > > > After applying this patch which allows us to check the reference first,
> > > > it will not hit zero anymore in tcp_twsk_purge() without calling
> > > > inet_twsk_purge() one more time.
> > > >
> > > > [1]
> > > > refcount_t: decrement hit 0; leaking memory.
> > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > Modules linked in:
> > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > Call Trace:
> > > >  <TASK>
> > > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > > >  refcount_dec include/linux/refcount.h:351 [inline]
> > > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > RIP: 0033:0x7f56d7c7cee9
> > > >
> > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > The reverted patch trying to solve another issue causes unexpected error as above. I
> > > > think that issue can be properly analyzed and handled later. So can we revert it first?
> > > > ---
> > > >  net/ipv4/tcp_minisocks.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > index b93619b2384b..46e6f9db4227 100644
> > > > --- a/net/ipv4/tcp_minisocks.c
> > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> > > >                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> > > >                 } else if (!purged_once) {
> > > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > +                               continue;
> > > > +
> > > >                         inet_twsk_purge(&tcp_hashinfo);
> > > >                         purged_once = true;
> > > >                 }
> > > > --
> > >
> > > This can not be a fix for a race condition.
> > >
> > > By definition a TW has a refcount on tcp_death_row.tw_refcount   only
> > > if its timer is armed.
> > >
> > > And inet_twsk_deschedule_put() does
> > >
> > > if (del_timer_sync(&tw->tw_timer))
> > >     inet_twsk_kill(tw);
> > >
> > > I think you need to provide a full explanation, instead of a shot in the dark.
> > >
> > > Before releasing this syzbot, I thought that maybe the refcount_inc()
> > > was done too late,
> > > but considering other invariants, this should not matter.
> > >
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> > > 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> > > inet_timewait_sock *tw, int timeo, bool rearm)
> > >
> > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > >                                                      LINUX_MIB_TIMEWAITED);
> > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > >         } else {
> > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > >         }
> >
> > Thanks for your information.
> >
> > What you wrote is right, I think for a while. In
> > inet_twsk_deschedule_put(), there are two possible cases:
> > 1) if it finds the timer is armed, then it can kill the tw socket by
> > decrementing the refcount in the right way. So it's a good/lucky thing
> > for us.
> > or
> > 2) if it misses the point, then the tw socket arms the timer which
> > will expire in 60 seconds in the initialization phase. The tw socket
> > would have a chance to be destroyed when the timer expires.
> >
> > It seems that you don't think your code could solve the race issue?
>
> inet_twsk_schedule(tw) is called while the tw refcount is still 0, so
> this tw can be be found in  inet_twsk_purge()
> at the same time. Look at inet_twsk_hashdance() for details.

Yes, after inet_twsk_purge() finds the tw, there are two cases like my
previous email said after applying your code.

For 1) case, everything is good. inet_twsk_purge() will finish it up
because it can decrement the refcount safely.

For 2) case, even though inet_twsk_purge() finds it, it's not the time
to destroy it until the expired tw timer will finally handle the
process of destruction by calling inet_twsk_kill(), right? Let the
timer handle the destruction until its end of life, which I think is a
normal process for all the timewait sockets.

The only difference in 2) case is that inet_twsk_purge() calls
inet_twsk_put() twice while tw_timer_handler() only calls it one time.

Am I missing something?

Thanks,
Jason
Eric Dumazet May 22, 2024, 10:53 a.m. UTC | #5
On Wed, May 22, 2024 at 12:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > Hello Eric,
> > >
> > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> > > > >
> > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > > > >
> > > > > If we failed in setup_net() and try to undo the setup process, the
> > > > > reference now is 1 which shouldn't be decremented. However, it happened
> > > > > actually.
> > > > >
> > > > > After applying this patch which allows us to check the reference first,
> > > > > it will not hit zero anymore in tcp_twsk_purge() without calling
> > > > > inet_twsk_purge() one more time.
> > > > >
> > > > > [1]
> > > > > refcount_t: decrement hit 0; leaking memory.
> > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > Modules linked in:
> > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > > > >  refcount_dec include/linux/refcount.h:351 [inline]
> > > > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > > > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > > > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > RIP: 0033:0x7f56d7c7cee9
> > > > >
> > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > > The reverted patch trying to solve another issue causes unexpected error as above. I
> > > > > think that issue can be properly analyzed and handled later. So can we revert it first?
> > > > > ---
> > > > >  net/ipv4/tcp_minisocks.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > index b93619b2384b..46e6f9db4227 100644
> > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> > > > >                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > > > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> > > > >                 } else if (!purged_once) {
> > > > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > > +                               continue;
> > > > > +
> > > > >                         inet_twsk_purge(&tcp_hashinfo);
> > > > >                         purged_once = true;
> > > > >                 }
> > > > > --
> > > >
> > > > This can not be a fix for a race condition.
> > > >
> > > > By definition a TW has a refcount on tcp_death_row.tw_refcount   only
> > > > if its timer is armed.
> > > >
> > > > And inet_twsk_deschedule_put() does
> > > >
> > > > if (del_timer_sync(&tw->tw_timer))
> > > >     inet_twsk_kill(tw);
> > > >
> > > > I think you need to provide a full explanation, instead of a shot in the dark.
> > > >
> > > > Before releasing this syzbot, I thought that maybe the refcount_inc()
> > > > was done too late,
> > > > but considering other invariants, this should not matter.
> > > >
> > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> > > > 100644
> > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> > > > inet_timewait_sock *tw, int timeo, bool rearm)
> > > >
> > > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > > >                                                      LINUX_MIB_TIMEWAITED);
> > > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >         } else {
> > > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > > >         }
> > >
> > > Thanks for your information.
> > >
> > > What you wrote is right, I think for a while. In
> > > inet_twsk_deschedule_put(), there are two possible cases:
> > > 1) if it finds the timer is armed, then it can kill the tw socket by
> > > decrementing the refcount in the right way. So it's a good/lucky thing
> > > for us.
> > > or
> > > 2) if it misses the point, then the tw socket arms the timer which
> > > will expire in 60 seconds in the initialization phase. The tw socket
> > > would have a chance to be destroyed when the timer expires.
> > >
> > > It seems that you don't think your code could solve the race issue?
> >
> > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so
> > this tw can be be found in  inet_twsk_purge()
> > at the same time. Look at inet_twsk_hashdance() for details.
>
> Yes, after inet_twsk_purge() finds the tw, there are two cases like my
> previous email said after applying your code.
>
> For 1) case, everything is good. inet_twsk_purge() will finish it up
> because it can decrement the refcount safely.
>
> For 2) case, even though inet_twsk_purge() finds it, it's not the time
> to destroy it until the expired tw timer will finally handle the
> process of destruction by calling inet_twsk_kill(), right? Let the
> timer handle the destruction until its end of life, which I think is a
> normal process for all the timewait sockets.
>
> The only difference in 2) case is that inet_twsk_purge() calls
> inet_twsk_put() twice while tw_timer_handler() only calls it one time.
>
> Am I missing something?

You are missing that inet_twsk_deschedule_put() is doing :

if (del_timer_sync(&tw->tw_timer))
    inet_twsk_kill(tw);

You can not have both tw_timer_handler() and
inet_twsk_deschedule_put() calling inet_twsk_kill(tw);

Take a look at del_timer_sync()
Jason Xing May 22, 2024, 3:37 p.m. UTC | #6
On Wed, May 22, 2024 at 6:54 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 22, 2024 at 12:42 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > Hello Eric,
> > > >
> > > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> > > > > >
> > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > > > > >
> > > > > > If we failed in setup_net() and try to undo the setup process, the
> > > > > > reference now is 1 which shouldn't be decremented. However, it happened
> > > > > > actually.
> > > > > >
> > > > > > After applying this patch which allows us to check the reference first,
> > > > > > it will not hit zero anymore in tcp_twsk_purge() without calling
> > > > > > inet_twsk_purge() one more time.
> > > > > >
> > > > > > [1]
> > > > > > refcount_t: decrement hit 0; leaking memory.
> > > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > > Modules linked in:
> > > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > > > Call Trace:
> > > > > >  <TASK>
> > > > > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > > > > >  refcount_dec include/linux/refcount.h:351 [inline]
> > > > > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > > > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > > > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > > > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > > > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > > > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > > > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > > > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > > > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > > > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > > > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > > > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > > > > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > > > > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > > RIP: 0033:0x7f56d7c7cee9
> > > > > >
> > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > > > Reported-by: syzbot+2eca27bdcb48ed330251@syzkaller.appspotmail.com
> > > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > > The reverted patch trying to solve another issue causes unexpected error as above. I
> > > > > > think that issue can be properly analyzed and handled later. So can we revert it first?
> > > > > > ---
> > > > > >  net/ipv4/tcp_minisocks.c | 4 ++++
> > > > > >  1 file changed, 4 insertions(+)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > > index b93619b2384b..46e6f9db4227 100644
> > > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> > > > > >                         /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > > > > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> > > > > >                 } else if (!purged_once) {
> > > > > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > > > +                               continue;
> > > > > > +
> > > > > >                         inet_twsk_purge(&tcp_hashinfo);
> > > > > >                         purged_once = true;
> > > > > >                 }
> > > > > > --
> > > > >
> > > > > This can not be a fix for a race condition.
> > > > >
> > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount   only
> > > > > if its timer is armed.
> > > > >
> > > > > And inet_twsk_deschedule_put() does
> > > > >
> > > > > if (del_timer_sync(&tw->tw_timer))
> > > > >     inet_twsk_kill(tw);
> > > > >
> > > > > I think you need to provide a full explanation, instead of a shot in the dark.
> > > > >
> > > > > Before releasing this syzbot, I thought that maybe the refcount_inc()
> > > > > was done too late,
> > > > > but considering other invariants, this should not matter.
> > > > >
> > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> > > > > 100644
> > > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> > > > > inet_timewait_sock *tw, int timeo, bool rearm)
> > > > >
> > > > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > > > >                                                      LINUX_MIB_TIMEWAITED);
> > > > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > > >         } else {
> > > > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > > > >         }
> > > >
> > > > Thanks for your information.
> > > >
> > > > What you wrote is right, I think for a while. In
> > > > inet_twsk_deschedule_put(), there are two possible cases:
> > > > 1) if it finds the timer is armed, then it can kill the tw socket by
> > > > decrementing the refcount in the right way. So it's a good/lucky thing
> > > > for us.
> > > > or
> > > > 2) if it misses the point, then the tw socket arms the timer which
> > > > will expire in 60 seconds in the initialization phase. The tw socket
> > > > would have a chance to be destroyed when the timer expires.
> > > >
> > > > It seems that you don't think your code could solve the race issue?
> > >
> > > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so
> > > this tw can be be found in  inet_twsk_purge()
> > > at the same time. Look at inet_twsk_hashdance() for details.
> >
> > Yes, after inet_twsk_purge() finds the tw, there are two cases like my
> > previous email said after applying your code.
> >
> > For 1) case, everything is good. inet_twsk_purge() will finish it up
> > because it can decrement the refcount safely.
> >
> > For 2) case, even though inet_twsk_purge() finds it, it's not the time
> > to destroy it until the expired tw timer will finally handle the
> > process of destruction by calling inet_twsk_kill(), right? Let the
> > timer handle the destruction until its end of life, which I think is a
> > normal process for all the timewait sockets.
> >
> > The only difference in 2) case is that inet_twsk_purge() calls
> > inet_twsk_put() twice while tw_timer_handler() only calls it one time.
> >
> > Am I missing something?
>
> You are missing that inet_twsk_deschedule_put() is doing :
>
> if (del_timer_sync(&tw->tw_timer))
>     inet_twsk_kill(tw);
>
> You can not have both tw_timer_handler() and
> inet_twsk_deschedule_put() calling inet_twsk_kill(tw);
[...]
>
> Take a look at del_timer_sync()

Oops, I missed this function...Thanks for pointing it out.

I can test if the timer is active or queued before we can delete, like
this on top of your patch:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e..b890d1c280a1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
inet_timewait_sock *tw, int timeo, bool rearm)

                __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
                                                     LINUX_MIB_TIMEWAITED);
-               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
                refcount_inc(&tw->tw_dr->tw_refcount);
+               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
        } else {
                mod_timer_pending(&tw->tw_timer, jiffies + timeo);
        }
@@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
                        rcu_read_unlock();
                        local_bh_disable();
                        if (state == TCP_TIME_WAIT) {
-                               inet_twsk_deschedule_put(inet_twsk(sk));
+                               struct inet_timewait_sock *tw = inet_twsk(sk);
+
+                               /* If the timer is armed, we can safely destroy
+                                * it, or else we postpone the process
of destruction
+                                * to tw_timer_handler().
+                                */
+                               if (timer_pending(&tw->tw_timer))
+                                       inet_twsk_deschedule_put(tw);
                        } else {
                                struct request_sock *req = inet_reqsk(sk);

As above, if we find the timer is armed, then we can destroy it
naturally in inet_twsk_deschedule_put(). If not, we can skip it in the
inet_twsk_purge() and then postpone the destruction process in the
timer handler.

I think I need to add another test about whether the timer is running or not.

If testing the status of timer is not a good way to go, perhaps I
would like to introduce a new flag to indicate whether the tw socket
still exists in the initialization phase.

Sorry, It's __not__ tested because it's too late for me, I will spend
more time taking care of this race condition tomorrow.

Thanks,
Jason
diff mbox series

Patch

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b93619b2384b..46e6f9db4227 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -399,6 +399,10 @@  void tcp_twsk_purge(struct list_head *net_exit_list)
 			/* Even if tw_refcount == 1, we must clean up kernel reqsk */
 			inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
 		} else if (!purged_once) {
+			/* The last refcount is decremented in tcp_sk_exit_batch() */
+			if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
+				continue;
+
 			inet_twsk_purge(&tcp_hashinfo);
 			purged_once = true;
 		}