diff mbox series

[v1,net] tcp: Clean up kernel listener's reqsk in inet_twsk_purge()

Message ID 20221012145036.74960-1-kuniyu@amazon.com (mailing list archive)
State Accepted
Commit 740ea3c4a0b2e326b23d7cdf05472a0e92aa39bc
Delegated to: Netdev Maintainers
Headers show
Series [v1,net] tcp: Clean up kernel listener's reqsk in inet_twsk_purge() | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: Possible repeated word: 'Google' WARNING: line length of 82 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kuniyuki Iwashima Oct. 12, 2022, 2:50 p.m. UTC
Eric Dumazet reported a use-after-free related to the per-netns ehash
series. [0]

When we create a TCP socket from userspace, the socket always holds a
refcnt of the netns.  This guarantees that a reqsk timer is always fired
before netns dismantle.  Each reqsk has a refcnt of its listener, so the
listener is not freed before the reqsk, and the net is not freed before
the listener as well.

OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
of its netns.  Thus, a reqsk timer can be fired after the netns dismantle
and access freed per-netns ehash.

To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
in inet_twsk_purge() if the netns uses a per-netns ehash.

[0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/

BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
include/net/inet_hashtables.h:181 [inline]
BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
net/ipv4/inet_connection_sock.c:913
Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301

CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
6.0.0-syzkaller-02757-gaf7d23f9d96a #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/22/2022
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
__run_timers kernel/time/timer.c:1768 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
__do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
invoke_softirq kernel/softirq.c:445 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
</IRQ>

Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
 net/ipv4/tcp_minisocks.c      |  9 +++++----
 2 files changed, 19 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Oct. 12, 2022, 4:31 p.m. UTC | #1
On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Eric Dumazet reported a use-after-free related to the per-netns ehash
> series. [0]
>
> When we create a TCP socket from userspace, the socket always holds a
> refcnt of the netns.  This guarantees that a reqsk timer is always fired
> before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> listener is not freed before the reqsk, and the net is not freed before
> the listener as well.
>
> OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
> of its netns.  Thus, a reqsk timer can be fired after the netns dismantle
> and access freed per-netns ehash.

Patch seems good, but changelog is incorrect.

1) we have a TCP listener (or more) on a netns
2) We receive SYN packets, creating SYN_RECV request sockets, added in
ehash table.
3) job is killed, TCP listener closed.
4) When a TCP listener is closed, we do not purge all SYN_RECV
requests sockets, because we rely
   on normal per-request timer firing, then finding the listener is no
longer in LISTEN state -> drop the request socket.
   (We do not maintain a per-listener list of request sockets, and
going through ehash would be quite expensive on busy servers)
5) netns is deleted (and optional TCP ehashinfo freed)
6) request socket timer fire, and wecrash while trying to unlink the
request socket from the freed ehash table.

In short, I think the case could happen with normal TCP sockets,
allocated from user space.

>
> To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
> in inet_twsk_purge() if the netns uses a per-netns ehash.
>
> [0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
>
> BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
> include/net/inet_hashtables.h:181 [inline]
> BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
> net/ipv4/inet_connection_sock.c:913
> Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
>
> CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
> 6.0.0-syzkaller-02757-gaf7d23f9d96a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 09/22/2022
> Call Trace:
> <IRQ>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> print_address_description mm/kasan/report.c:317 [inline]
> print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
> kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
> tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
> reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
> inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
> inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
> reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
> call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
> expire_timers kernel/time/timer.c:1519 [inline]
> __run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
> __run_timers kernel/time/timer.c:1768 [inline]
> run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
> __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
> invoke_softirq kernel/softirq.c:445 [inline]
> __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
> irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
> sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
> </IRQ>
>
> Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Eric Dumazet <edumazet@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
>  net/ipv4/tcp_minisocks.c      |  9 +++++----
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 71d3bb0abf6c..66fc940f9521 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -268,8 +268,21 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
>                 rcu_read_lock();
>  restart:
>                 sk_nulls_for_each_rcu(sk, node, &head->chain) {
> -                       if (sk->sk_state != TCP_TIME_WAIT)
> +                       if (sk->sk_state != TCP_TIME_WAIT) {
> +                               /* A kernel listener socket might not hold refcnt for net,
> +                                * so reqsk_timer_handler() could be fired after net is
> +                                * freed.  Userspace listener and reqsk never exist here.
> +                                */
> +                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> +                                            hashinfo->pernet)) {
> +                                       struct request_sock *req = inet_reqsk(sk);
> +
> +                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> +                               }
> +
>                                 continue;
> +                       }
> +
>                         tw = inet_twsk(sk);
>                         if ((tw->tw_family != family) ||
>                                 refcount_read(&twsk_net(tw)->ns.count))
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 79f30f026d89..c375f603a16c 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -353,13 +353,14 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
>         struct net *net;
>
>         list_for_each_entry(net, net_exit_list, exit_list) {
> -               /* The last refcount is decremented in tcp_sk_exit_batch() */
> -               if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> -                       continue;
> -
>                 if (net->ipv4.tcp_death_row.hashinfo->pernet) {
> +                       /* Even if tw_refcount == 1, we must clean up kernel reqsk */
>                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
>                 } else if (!purged_once) {
> +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> +                               continue;
> +
>                         inet_twsk_purge(&tcp_hashinfo, family);
>                         purged_once = true;
>                 }
> --
> 2.30.2
>
Kuniyuki Iwashima Oct. 12, 2022, 5:28 p.m. UTC | #2
From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 12 Oct 2022 09:31:44 -0700
> On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > Eric Dumazet reported a use-after-free related to the per-netns ehash
> > series. [0]
> >
> > When we create a TCP socket from userspace, the socket always holds a
> > refcnt of the netns.  This guarantees that a reqsk timer is always fired
> > before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> > listener is not freed before the reqsk, and the net is not freed before
> > the listener as well.
> >
> > OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
> > of its netns.  Thus, a reqsk timer can be fired after the netns dismantle
> > and access freed per-netns ehash.
> 
> Patch seems good, but changelog is incorrect.
> 
> 1) we have a TCP listener (or more) on a netns
> 2) We receive SYN packets, creating SYN_RECV request sockets, added in
> ehash table.
> 3) job is killed, TCP listener closed.
> 4) When a TCP listener is closed, we do not purge all SYN_RECV
> requests sockets, because we rely
>    on normal per-request timer firing, then finding the listener is no
> longer in LISTEN state -> drop the request socket.
>    (We do not maintain a per-listener list of request sockets, and
> going through ehash would be quite expensive on busy servers)
> 5) netns is deleted (and optional TCP ehashinfo freed)
> 6) request socket timer fire, and wecrash while trying to unlink the
> request socket from the freed ehash table.
> 
> In short, I think the case could happen with normal TCP sockets,
> allocated from user space.

Hmm.. I think 5) always happens after reqsk_timer for TCP socket
allocated from user space because reqsk has a refcnt for its netns
indirectly via the listener.

reqsk has its listener's sk_refcnt, so when reqsk timer is fired
the last reqsk_put() in inet_csk_reqsk_queue_drop_and_put() calls
sock_put() for the listener, and then listener release the last
refcnt for the net, and finally net is queued up to the free-list.

---8<---
static void __sk_destruct(struct rcu_head *head)
{
...
	if (likely(sk->sk_net_refcnt))
		put_net(sock_net(sk));
	sk_prot_free(sk->sk_prot_creator, sk);
}
---8<---


I did some tests with this script, but KASAN did not detect UAF
and I checked the timer is always executed before netns dismantle
for userspace listener.

---8<---
set -e

sysctl -w net.ipv4.tcp_child_ehash_entries=128
sysctl -w net.ipv4.tcp_max_orphans=0

cat <<EOF>test.py
from socket import *
from subprocess import run

s = socket()
s.bind(('localhost', 80))
s.listen()

c = socket()
c.connect(('localhost', 80))
run('netstat -tan'.split())
EOF

cat <<EOF>test_net.sh
set -e

sysctl net.ipv4.tcp_child_ehash_entries

ip link set lo up

iptables -A OUTPUT -o lo -p tcp --dport 80 --tcp-flags ACK ACK -j DROP

python3 test.py

netstat -tan

EOF

unshare -n bash test_net.sh
---8<---



> > To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
> > in inet_twsk_purge() if the netns uses a per-netns ehash.
> >
> > [0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
> >
> > BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
> > include/net/inet_hashtables.h:181 [inline]
> > BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
> > net/ipv4/inet_connection_sock.c:913
> > Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
> >
> > CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
> > 6.0.0-syzkaller-02757-gaf7d23f9d96a #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 09/22/2022
> > Call Trace:
> > <IRQ>
> > __dump_stack lib/dump_stack.c:88 [inline]
> > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > print_address_description mm/kasan/report.c:317 [inline]
> > print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
> > kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
> > tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
> > reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
> > inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
> > inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
> > reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
> > call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
> > expire_timers kernel/time/timer.c:1519 [inline]
> > __run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
> > __run_timers kernel/time/timer.c:1768 [inline]
> > run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
> > __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
> > invoke_softirq kernel/softirq.c:445 [inline]
> > __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
> > irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
> > sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
> > </IRQ>
> >
> > Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> > Reported-by: syzbot <syzkaller@googlegroups.com>
> > Reported-by: Eric Dumazet <edumazet@google.com>
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
> >  net/ipv4/tcp_minisocks.c      |  9 +++++----
> >  2 files changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 71d3bb0abf6c..66fc940f9521 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -268,8 +268,21 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
> >                 rcu_read_lock();
> >  restart:
> >                 sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > -                       if (sk->sk_state != TCP_TIME_WAIT)
> > +                       if (sk->sk_state != TCP_TIME_WAIT) {
> > +                               /* A kernel listener socket might not hold refcnt for net,
> > +                                * so reqsk_timer_handler() could be fired after net is
> > +                                * freed.  Userspace listener and reqsk never exist here.
> > +                                */
> > +                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> > +                                            hashinfo->pernet)) {
> > +                                       struct request_sock *req = inet_reqsk(sk);
> > +
> > +                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> > +                               }
> > +
> >                                 continue;
> > +                       }
> > +
> >                         tw = inet_twsk(sk);
> >                         if ((tw->tw_family != family) ||
> >                                 refcount_read(&twsk_net(tw)->ns.count))
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index 79f30f026d89..c375f603a16c 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -353,13 +353,14 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
> >         struct net *net;
> >
> >         list_for_each_entry(net, net_exit_list, exit_list) {
> > -               /* The last refcount is decremented in tcp_sk_exit_batch() */
> > -               if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > -                       continue;
> > -
> >                 if (net->ipv4.tcp_death_row.hashinfo->pernet) {
> > +                       /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
> >                 } else if (!purged_once) {
> > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > +                               continue;
> > +
> >                         inet_twsk_purge(&tcp_hashinfo, family);
> >                         purged_once = true;
> >                 }
> > --
> > 2.30.2
Eric Dumazet Oct. 12, 2022, 6:54 p.m. UTC | #3
On Wed, Oct 12, 2022 at 10:28 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Eric Dumazet <edumazet@google.com>
> Date:   Wed, 12 Oct 2022 09:31:44 -0700
> > On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > Eric Dumazet reported a use-after-free related to the per-netns ehash
> > > series. [0]
> > >
> > > When we create a TCP socket from userspace, the socket always holds a
> > > refcnt of the netns.  This guarantees that a reqsk timer is always fired
> > > before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> > > listener is not freed before the reqsk, and the net is not freed before
> > > the listener as well.
> > >
> > > OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
> > > of its netns.  Thus, a reqsk timer can be fired after the netns dismantle
> > > and access freed per-netns ehash.
> >
> > Patch seems good, but changelog is incorrect.
> >
> > 1) we have a TCP listener (or more) on a netns
> > 2) We receive SYN packets, creating SYN_RECV request sockets, added in
> > ehash table.
> > 3) job is killed, TCP listener closed.
> > 4) When a TCP listener is closed, we do not purge all SYN_RECV
> > requests sockets, because we rely
> >    on normal per-request timer firing, then finding the listener is no
> > longer in LISTEN state -> drop the request socket.
> >    (We do not maintain a per-listener list of request sockets, and
> > going through ehash would be quite expensive on busy servers)
> > 5) netns is deleted (and optional TCP ehashinfo freed)
> > 6) request socket timer fire, and wecrash while trying to unlink the
> > request socket from the freed ehash table.
> >
> > In short, I think the case could happen with normal TCP sockets,
> > allocated from user space.
>
> Hmm.. I think 5) always happens after reqsk_timer for TCP socket
> allocated from user space because reqsk has a refcnt for its netns
> indirectly via the listener.
>
> reqsk has its listener's sk_refcnt, so when reqsk timer is fired
> the last reqsk_put() in inet_csk_reqsk_queue_drop_and_put() calls
> sock_put() for the listener, and then listener release the last
> refcnt for the net, and finally net is queued up to the free-list.
>
> ---8<---
> static void __sk_destruct(struct rcu_head *head)
> {
> ...
>         if (likely(sk->sk_net_refcnt))
>                 put_net(sock_net(sk));
>         sk_prot_free(sk->sk_prot_creator, sk);
> }
> ---8<---
>
>
> I did some tests with this script, but KASAN did not detect UAF
> and I checked the timer is always executed before netns dismantle
> for userspace listener.
>
> ---8<---
> set -e
>
> sysctl -w net.ipv4.tcp_child_ehash_entries=128
> sysctl -w net.ipv4.tcp_max_orphans=0
>
> cat <<EOF>test.py
> from socket import *
> from subprocess import run
>
> s = socket()
> s.bind(('localhost', 80))
> s.listen()
>
> c = socket()
> c.connect(('localhost', 80))
> run('netstat -tan'.split())
> EOF
>
> cat <<EOF>test_net.sh
> set -e
>
> sysctl net.ipv4.tcp_child_ehash_entries
>
> ip link set lo up
>
> iptables -A OUTPUT -o lo -p tcp --dport 80 --tcp-flags ACK ACK -j DROP
>
> python3 test.py
>
> netstat -tan
>
> EOF
>
> unshare -n bash test_net.sh
> ---8<---
>

Hmm... maybe , but I have other syzbot reports that suggest the netns
refcounting is wrong...

Maybe this is time I add a ref tracker dir for 'kernel sockets',
making sure this tracker is empty right before netns is destroyed.

BUG: KASAN: use-after-free in tcp_probe_timer net/ipv4/tcp_timer.c:378 [inline]
BUG: KASAN: use-after-free in tcp_write_timer_handler+0x7e2/0x920
net/ipv4/tcp_timer.c:624
Read of size 1 at addr ffff88807d6ba385 by task syz-executor.2/3638

CPU: 0 PID: 3638 Comm: syz-executor.2 Not tainted
6.0.0-syzkaller-09589-g55be6084c8e0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/22/2022
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106
 print_address_description+0x65/0x4b0 mm/kasan/report.c:317
 print_report+0x108/0x220 mm/kasan/report.c:433
 kasan_report+0xfb/0x130 mm/kasan/report.c:495
 tcp_probe_timer net/ipv4/tcp_timer.c:378 [inline]
 tcp_write_timer_handler+0x7e2/0x920 net/ipv4/tcp_timer.c:624
 tcp_write_timer+0x176/0x280 net/ipv4/tcp_timer.c:637
 call_timer_fn+0xf5/0x210 kernel/time/timer.c:1474
 expire_timers kernel/time/timer.c:1519 [inline]
 __run_timers+0x76a/0x980 kernel/time/timer.c:1790
 run_timer_softirq+0x63/0xf0 kernel/time/timer.c:1803
 __do_softirq+0x277/0x75b kernel/softirq.c:571
 __irq_exit_rcu+0xec/0x170 kernel/softirq.c:650
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
 sysvec_apic_timer_interrupt+0x91/0xb0 arch/x86/kernel/apic/apic.c:1107




>
>
> > > To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
> > > in inet_twsk_purge() if the netns uses a per-netns ehash.
> > >
> > > [0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
> > >
> > > BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
> > > include/net/inet_hashtables.h:181 [inline]
> > > BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
> > > net/ipv4/inet_connection_sock.c:913
> > > Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
> > >
> > > CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
> > > 6.0.0-syzkaller-02757-gaf7d23f9d96a #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > BIOS Google 09/22/2022
> > > Call Trace:
> > > <IRQ>
> > > __dump_stack lib/dump_stack.c:88 [inline]
> > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > > print_address_description mm/kasan/report.c:317 [inline]
> > > print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
> > > kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
> > > tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
> > > reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
> > > inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
> > > inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
> > > reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
> > > call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
> > > expire_timers kernel/time/timer.c:1519 [inline]
> > > __run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
> > > __run_timers kernel/time/timer.c:1768 [inline]
> > > run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
> > > __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
> > > invoke_softirq kernel/softirq.c:445 [inline]
> > > __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
> > > irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
> > > sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
> > > </IRQ>
> > >
> > > Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > Reported-by: Eric Dumazet <edumazet@google.com>
> > > Suggested-by: Eric Dumazet <edumazet@google.com>
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > >  net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
> > >  net/ipv4/tcp_minisocks.c      |  9 +++++----
> > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 71d3bb0abf6c..66fc940f9521 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -268,8 +268,21 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
> > >                 rcu_read_lock();
> > >  restart:
> > >                 sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > > -                       if (sk->sk_state != TCP_TIME_WAIT)
> > > +                       if (sk->sk_state != TCP_TIME_WAIT) {
> > > +                               /* A kernel listener socket might not hold refcnt for net,
> > > +                                * so reqsk_timer_handler() could be fired after net is
> > > +                                * freed.  Userspace listener and reqsk never exist here.
> > > +                                */
> > > +                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> > > +                                            hashinfo->pernet)) {
> > > +                                       struct request_sock *req = inet_reqsk(sk);
> > > +
> > > +                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> > > +                               }
> > > +
> > >                                 continue;
> > > +                       }
> > > +
> > >                         tw = inet_twsk(sk);
> > >                         if ((tw->tw_family != family) ||
> > >                                 refcount_read(&twsk_net(tw)->ns.count))
> > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > index 79f30f026d89..c375f603a16c 100644
> > > --- a/net/ipv4/tcp_minisocks.c
> > > +++ b/net/ipv4/tcp_minisocks.c
> > > @@ -353,13 +353,14 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
> > >         struct net *net;
> > >
> > >         list_for_each_entry(net, net_exit_list, exit_list) {
> > > -               /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > -               if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > -                       continue;
> > > -
> > >                 if (net->ipv4.tcp_death_row.hashinfo->pernet) {
> > > +                       /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
> > >                 } else if (!purged_once) {
> > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > +                               continue;
> > > +
> > >                         inet_twsk_purge(&tcp_hashinfo, family);
> > >                         purged_once = true;
> > >                 }
> > > --
> > > 2.30.2
Kuniyuki Iwashima Oct. 12, 2022, 7:36 p.m. UTC | #4
From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 12 Oct 2022 11:54:42 -0700
> On Wed, Oct 12, 2022 at 10:28 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Eric Dumazet <edumazet@google.com>
> > Date:   Wed, 12 Oct 2022 09:31:44 -0700
> > > On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > Eric Dumazet reported a use-after-free related to the per-netns ehash
> > > > series. [0]
> > > >
> > > > When we create a TCP socket from userspace, the socket always holds a
> > > > refcnt of the netns.  This guarantees that a reqsk timer is always fired
> > > > before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> > > > listener is not freed before the reqsk, and the net is not freed before
> > > > the listener as well.
> > > >
> > > > OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
> > > > of its netns.  Thus, a reqsk timer can be fired after the netns dismantle
> > > > and access freed per-netns ehash.
> > >
> > > Patch seems good, but changelog is incorrect.
> > >
> > > 1) we have a TCP listener (or more) on a netns
> > > 2) We receive SYN packets, creating SYN_RECV request sockets, added in
> > > ehash table.
> > > 3) job is killed, TCP listener closed.
> > > 4) When a TCP listener is closed, we do not purge all SYN_RECV
> > > requests sockets, because we rely
> > >    on normal per-request timer firing, then finding the listener is no
> > > longer in LISTEN state -> drop the request socket.
> > >    (We do not maintain a per-listener list of request sockets, and
> > > going through ehash would be quite expensive on busy servers)
> > > 5) netns is deleted (and optional TCP ehashinfo freed)
> > > 6) request socket timer fire, and wecrash while trying to unlink the
> > > request socket from the freed ehash table.
> > >
> > > In short, I think the case could happen with normal TCP sockets,
> > > allocated from user space.
> >
> > Hmm.. I think 5) always happens after reqsk_timer for TCP socket
> > allocated from user space because reqsk has a refcnt for its netns
> > indirectly via the listener.
> >
> > reqsk has its listener's sk_refcnt, so when reqsk timer is fired
> > the last reqsk_put() in inet_csk_reqsk_queue_drop_and_put() calls
> > sock_put() for the listener, and then listener release the last
> > refcnt for the net, and finally net is queued up to the free-list.
> >
> > ---8<---
> > static void __sk_destruct(struct rcu_head *head)
> > {
> > ...
> >         if (likely(sk->sk_net_refcnt))
> >                 put_net(sock_net(sk));
> >         sk_prot_free(sk->sk_prot_creator, sk);
> > }
> > ---8<---
> >
> >
> > I did some tests with this script, but KASAN did not detect UAF
> > and I checked the timer is always executed before netns dismantle
> > for userspace listener.
> >
> > ---8<---
> > set -e
> >
> > sysctl -w net.ipv4.tcp_child_ehash_entries=128
> > sysctl -w net.ipv4.tcp_max_orphans=0
> >
> > cat <<EOF>test.py
> > from socket import *
> > from subprocess import run
> >
> > s = socket()
> > s.bind(('localhost', 80))
> > s.listen()
> >
> > c = socket()
> > c.connect(('localhost', 80))
> > run('netstat -tan'.split())
> > EOF
> >
> > cat <<EOF>test_net.sh
> > set -e
> >
> > sysctl net.ipv4.tcp_child_ehash_entries
> >
> > ip link set lo up
> >
> > iptables -A OUTPUT -o lo -p tcp --dport 80 --tcp-flags ACK ACK -j DROP
> >
> > python3 test.py
> >
> > netstat -tan
> >
> > EOF
> >
> > unshare -n bash test_net.sh
> > ---8<---
> >
> 
> Hmm... maybe , but I have other syzbot reports that suggest the netns
> refcounting is wrong...
> 
> Maybe this is time I add a ref tracker dir for 'kernel sockets',
> making sure this tracker is empty right before netns is destroyed.

Yes, that makes sense.
It seems a similar issue (kernel sk vs TCP full sk timer) happend before.
https://lore.kernel.org/all/00000000000045dc96059f4d7b02@google.com/

Actually I was thinking of adding a kernel reqsk counter like
tw_refcount to short-circuit inet_twsk_purge(), but considering
per-netns ehash size is usually small, it might not be worth adding.


> BUG: KASAN: use-after-free in tcp_probe_timer net/ipv4/tcp_timer.c:378 [inline]
> BUG: KASAN: use-after-free in tcp_write_timer_handler+0x7e2/0x920
> net/ipv4/tcp_timer.c:624
> Read of size 1 at addr ffff88807d6ba385 by task syz-executor.2/3638
> 
> CPU: 0 PID: 3638 Comm: syz-executor.2 Not tainted
> 6.0.0-syzkaller-09589-g55be6084c8e0 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 09/22/2022
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106
>  print_address_description+0x65/0x4b0 mm/kasan/report.c:317
>  print_report+0x108/0x220 mm/kasan/report.c:433
>  kasan_report+0xfb/0x130 mm/kasan/report.c:495
>  tcp_probe_timer net/ipv4/tcp_timer.c:378 [inline]
>  tcp_write_timer_handler+0x7e2/0x920 net/ipv4/tcp_timer.c:624
>  tcp_write_timer+0x176/0x280 net/ipv4/tcp_timer.c:637
>  call_timer_fn+0xf5/0x210 kernel/time/timer.c:1474
>  expire_timers kernel/time/timer.c:1519 [inline]
>  __run_timers+0x76a/0x980 kernel/time/timer.c:1790
>  run_timer_softirq+0x63/0xf0 kernel/time/timer.c:1803
>  __do_softirq+0x277/0x75b kernel/softirq.c:571
>  __irq_exit_rcu+0xec/0x170 kernel/softirq.c:650
>  irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
>  sysvec_apic_timer_interrupt+0x91/0xb0 arch/x86/kernel/apic/apic.c:1107
> 
> 
> 
> 
> >
> >
> > > > To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
> > > > in inet_twsk_purge() if the netns uses a per-netns ehash.
> > > >
> > > > [0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
> > > >
> > > > BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
> > > > include/net/inet_hashtables.h:181 [inline]
> > > > BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
> > > > net/ipv4/inet_connection_sock.c:913
> > > > Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
> > > >
> > > > CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
> > > > 6.0.0-syzkaller-02757-gaf7d23f9d96a #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > > BIOS Google 09/22/2022
> > > > Call Trace:
> > > > <IRQ>
> > > > __dump_stack lib/dump_stack.c:88 [inline]
> > > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > > > print_address_description mm/kasan/report.c:317 [inline]
> > > > print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
> > > > kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
> > > > tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
> > > > reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
> > > > inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
> > > > inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
> > > > reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
> > > > call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
> > > > expire_timers kernel/time/timer.c:1519 [inline]
> > > > __run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
> > > > __run_timers kernel/time/timer.c:1768 [inline]
> > > > run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
> > > > __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
> > > > invoke_softirq kernel/softirq.c:445 [inline]
> > > > __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
> > > > irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
> > > > sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
> > > > </IRQ>
> > > >
> > > > Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> > > > Reported-by: syzbot <syzkaller@googlegroups.com>
> > > > Reported-by: Eric Dumazet <edumazet@google.com>
> > > > Suggested-by: Eric Dumazet <edumazet@google.com>
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > >  net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
> > > >  net/ipv4/tcp_minisocks.c      |  9 +++++----
> > > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > index 71d3bb0abf6c..66fc940f9521 100644
> > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > @@ -268,8 +268,21 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
> > > >                 rcu_read_lock();
> > > >  restart:
> > > >                 sk_nulls_for_each_rcu(sk, node, &head->chain) {
> > > > -                       if (sk->sk_state != TCP_TIME_WAIT)
> > > > +                       if (sk->sk_state != TCP_TIME_WAIT) {
> > > > +                               /* A kernel listener socket might not hold refcnt for net,
> > > > +                                * so reqsk_timer_handler() could be fired after net is
> > > > +                                * freed.  Userspace listener and reqsk never exist here.
> > > > +                                */
> > > > +                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> > > > +                                            hashinfo->pernet)) {
> > > > +                                       struct request_sock *req = inet_reqsk(sk);
> > > > +
> > > > +                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> > > > +                               }
> > > > +
> > > >                                 continue;
> > > > +                       }
> > > > +
> > > >                         tw = inet_twsk(sk);
> > > >                         if ((tw->tw_family != family) ||
> > > >                                 refcount_read(&twsk_net(tw)->ns.count))
> > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > index 79f30f026d89..c375f603a16c 100644
> > > > --- a/net/ipv4/tcp_minisocks.c
> > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > @@ -353,13 +353,14 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
> > > >         struct net *net;
> > > >
> > > >         list_for_each_entry(net, net_exit_list, exit_list) {
> > > > -               /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > -               if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > -                       continue;
> > > > -
> > > >                 if (net->ipv4.tcp_death_row.hashinfo->pernet) {
> > > > +                       /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > > >                         inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
> > > >                 } else if (!purged_once) {
> > > > +                       /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > +                       if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > +                               continue;
> > > > +
> > > >                         inet_twsk_purge(&tcp_hashinfo, family);
> > > >                         purged_once = true;
> > > >                 }
> > > > --
> > > > 2.30.2
Eric Dumazet Oct. 13, 2022, 3:56 p.m. UTC | #5
On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Eric Dumazet reported a use-after-free related to the per-netns ehash
> series. [0]
>
> When we create a TCP socket from userspace, the socket always holds a
> refcnt of the netns.  This guarantees that a reqsk timer is always fired
> before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> listener is not freed before the reqsk, and the net is not freed before
> the listener as well.
>
...

> Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Eric Dumazet <edumazet@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
patchwork-bot+netdevbpf@kernel.org Oct. 13, 2022, 4:50 p.m. UTC | #6
Hello:

This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 12 Oct 2022 07:50:36 -0700 you wrote:
> Eric Dumazet reported a use-after-free related to the per-netns ehash
> series. [0]
> 
> When we create a TCP socket from userspace, the socket always holds a
> refcnt of the netns.  This guarantees that a reqsk timer is always fired
> before netns dismantle.  Each reqsk has a refcnt of its listener, so the
> listener is not freed before the reqsk, and the net is not freed before
> the listener as well.
> 
> [...]

Here is the summary with links:
  - [v1,net] tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
    https://git.kernel.org/netdev/net/c/740ea3c4a0b2

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 71d3bb0abf6c..66fc940f9521 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -268,8 +268,21 @@  void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
 		rcu_read_lock();
 restart:
 		sk_nulls_for_each_rcu(sk, node, &head->chain) {
-			if (sk->sk_state != TCP_TIME_WAIT)
+			if (sk->sk_state != TCP_TIME_WAIT) {
+				/* A kernel listener socket might not hold refcnt for net,
+				 * so reqsk_timer_handler() could be fired after net is
+				 * freed.  Userspace listener and reqsk never exist here.
+				 */
+				if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
+					     hashinfo->pernet)) {
+					struct request_sock *req = inet_reqsk(sk);
+
+					inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
+				}
+
 				continue;
+			}
+
 			tw = inet_twsk(sk);
 			if ((tw->tw_family != family) ||
 				refcount_read(&twsk_net(tw)->ns.count))
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 79f30f026d89..c375f603a16c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -353,13 +353,14 @@  void tcp_twsk_purge(struct list_head *net_exit_list, int family)
 	struct net *net;
 
 	list_for_each_entry(net, net_exit_list, exit_list) {
-		/* The last refcount is decremented in tcp_sk_exit_batch() */
-		if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
-			continue;
-
 		if (net->ipv4.tcp_death_row.hashinfo->pernet) {
+			/* Even if tw_refcount == 1, we must clean up kernel reqsk */
 			inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
 		} else if (!purged_once) {
+			/* The last refcount is decremented in tcp_sk_exit_batch() */
+			if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
+				continue;
+
 			inet_twsk_purge(&tcp_hashinfo, family);
 			purged_once = true;
 		}