[v2] net: rds: acquire refcount on TCP sockets

Message ID	a5fb1fc4-2284-3359-f6a0-e4e390239d7b@I-love.SAKURA.ne.jp (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <linux-rdma-owner@kernel.org> Message-ID: <a5fb1fc4-2284-3359-f6a0-e4e390239d7b@I-love.SAKURA.ne.jp> Date: Mon, 2 May 2022 10:40:18 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 Subject: [PATCH v2] net: rds: acquire refcount on TCP sockets Content-Language: en-US To: Eric Dumazet <edumazet@google.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>, netdev <netdev@vger.kernel.org>, syzkaller-bugs <syzkaller-bugs@googlegroups.com>, OFED mailing list <linux-rdma@vger.kernel.org> References: <00000000000045dc96059f4d7b02@google.com> <000000000000f75af905d3ba0716@google.com> <c389e47f-8f82-fd62-8c1d-d9481d2f71ff@I-love.SAKURA.ne.jp> <b0f99499-fb6a-b9ec-7bd3-f535f11a885d@I-love.SAKURA.ne.jp> <5f90c2b8-283e-6ca5-65f9-3ea96df00984@I-love.SAKURA.ne.jp> <f8ae5dcd-a5ed-2d8b-dd7a-08385e9c3675@I-love.SAKURA.ne.jp> <CANn89iJukWcN9-fwk4HEH-StAjnTVJ34UiMsrN=mdRbwVpo8AA@mail.gmail.com> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> In-Reply-To: <CANn89iJukWcN9-fwk4HEH-StAjnTVJ34UiMsrN=mdRbwVpo8AA@mail.gmail.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk
Series	[v2] net: rds: acquire refcount on TCP sockets \| expand [v2] net: rds: acquire refcount on TCP sockets

Tetsuo Handa May 2, 2022, 1:40 a.m. UTC

syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
for TCP socket used by RDS is accessing sock_net() without acquiring a
refcount on net namespace. Since TCP's retransmission can happen after
a process which created net namespace terminated, we need to explicitly
acquire a refcount.

Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
---
Changes in v2:
  Add Fixes: tag.
  Move to inside lock_sock() section.

I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
to RDS") was added to 2.6.32.

 net/rds/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Haakon Bugge May 2, 2022, 2:12 p.m. UTC | #1

> On 2 May 2022, at 03:40, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> 
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
> 
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> ---
> Changes in v2:
>  Add Fixes: tag.
>  Move to inside lock_sock() section.
> 
> I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> to RDS") was added to 2.6.32.
> 
> net/rds/tcp.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 5327d130c4b5..2f638f8b7b1e 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> 
> 	tcp_sock_set_nodelay(sock->sk);
> 	lock_sock(sk);
> +	/* TCP timer functions might access net namespace even after
> +	 * a process which created this net namespace terminated.
> +	 */
> +	if (!sk->sk_net_refcnt) {
> +		sk->sk_net_refcnt = 1;
> +		get_net_track(net, &sk->ns_tracker, GFP_KERNEL);

Don't you need a corresponding put_net_track()?


Thxs, Håkon


> +		sock_inuse_add(net, 1);
> +	}
> 	if (rtn->sndbuf_size > 0) {
> 		sk->sk_sndbuf = rtn->sndbuf_size;
> 		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> -- 
> 2.34.1
> 
>

Tetsuo Handa May 2, 2022, 2:29 p.m. UTC | #2

On 2022/05/02 23:12, Haakon Bugge wrote:
>> +	/* TCP timer functions might access net namespace even after
>> +	 * a process which created this net namespace terminated.
>> +	 */
>> +	if (!sk->sk_net_refcnt) {
>> +		sk->sk_net_refcnt = 1;
>> +		get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> 
> Don't you need a corresponding put_net_track()?

__sk_free() and __sk_destruct() will do if sk->sk_net_refcnt is set.

> 
>> +		sock_inuse_add(net, 1);
>> +	}

Paolo Abeni May 3, 2022, 9:02 a.m. UTC | #3

Hello,

On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
> 
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> ---
> Changes in v2:
>   Add Fixes: tag.
>   Move to inside lock_sock() section.
> 
> I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> to RDS") was added to 2.6.32.
> 
>  net/rds/tcp.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 5327d130c4b5..2f638f8b7b1e 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
>  
>  	tcp_sock_set_nodelay(sock->sk);
>  	lock_sock(sk);
> +	/* TCP timer functions might access net namespace even after
> +	 * a process which created this net namespace terminated.
> +	 */
> +	if (!sk->sk_net_refcnt) {
> +		sk->sk_net_refcnt = 1;
> +		get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> +		sock_inuse_add(net, 1);
> +	}
>  	if (rtn->sndbuf_size > 0) {
>  		sk->sk_sndbuf = rtn->sndbuf_size;
>  		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;

This looks equivalent to the fix presented here:

https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/

but the latter looks a more generic solution. @Tetsuo could you please
test the above in your setup?

Thanks!

Paolo

Tetsuo Handa May 3, 2022, 9:56 a.m. UTC | #4

On 2022/05/03 18:02, Paolo Abeni wrote:
> This looks equivalent to the fix presented here:

Not equivalent.

> 
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> 
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?

I already tested that fix, and the result was
https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ .

Paolo Abeni May 3, 2022, 11:10 a.m. UTC | #5

On Tue, 2022-05-03 at 18:56 +0900, Tetsuo Handa wrote:
> On 2022/05/03 18:02, Paolo Abeni wrote:
> 
> > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> > 
> > but the latter looks a more generic solution. @Tetsuo could you please
> > test the above in your setup?
> 
> I already tested that fix, and the result was
> https://lore.kernel.org/all/78cdbf25-4511-a567-bb09-0c07edae8b50@I-love.SAKURA.ne.jp/ .

Thanks, I somewhat missed that reply.

Paolo

patchwork-bot+netdevbpf@kernel.org May 3, 2022, 11:40 a.m. UTC | #6

Hello:

This patch was applied to netdev/net.git (master)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> for TCP socket used by RDS is accessing sock_net() without acquiring a
> refcount on net namespace. Since TCP's retransmission can happen after
> a process which created net namespace terminated, we need to explicitly
> acquire a refcount.
> 
> Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> 
> [...]

Here is the summary with links:
  - [v2] net: rds: acquire refcount on TCP sockets
    https://git.kernel.org/netdev/net/c/3a58f13a881e

You are awesome, thank you!

David Laight May 3, 2022, 1:27 p.m. UTC | #7

From: Paolo Abeni
> Sent: 03 May 2022 10:03
> 
> Hello,
> 
> On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > ---
> > Changes in v2:
> >   Add Fixes: tag.
> >   Move to inside lock_sock() section.
> >
> > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > to RDS") was added to 2.6.32.
> >
> >  net/rds/tcp.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > index 5327d130c4b5..2f638f8b7b1e 100644
> > --- a/net/rds/tcp.c
> > +++ b/net/rds/tcp.c
> > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> >
> >  	tcp_sock_set_nodelay(sock->sk);
> >  	lock_sock(sk);
> > +	/* TCP timer functions might access net namespace even after
> > +	 * a process which created this net namespace terminated.
> > +	 */
> > +	if (!sk->sk_net_refcnt) {
> > +		sk->sk_net_refcnt = 1;
> > +		get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > +		sock_inuse_add(net, 1);
> > +	}
> >  	if (rtn->sndbuf_size > 0) {
> >  		sk->sk_sndbuf = rtn->sndbuf_size;
> >  		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> 
> This looks equivalent to the fix presented here:
> 
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> 
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?

Wouldn't a more generic solution be to add a flag to sock_create_kern()
so that it acquires a reference to the namespace?
This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.

I've a driver that uses __sock_create() in order to get that reference.
I'm pretty sure the extra 'security' check will never fail.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Eric Dumazet May 3, 2022, 1:43 p.m. UTC | #8

On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Paolo Abeni
> > Sent: 03 May 2022 10:03
> >
> > Hello,
> >
> > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> > kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > ---
> > > Changes in v2:
> > >   Add Fixes: tag.
> > >   Move to inside lock_sock() section.
> > >
> > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > > to RDS") was added to 2.6.32.
> > >
> > >  net/rds/tcp.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > > index 5327d130c4b5..2f638f8b7b1e 100644
> > > --- a/net/rds/tcp.c
> > > +++ b/net/rds/tcp.c
> > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> > >
> > >     tcp_sock_set_nodelay(sock->sk);
> > >     lock_sock(sk);
> > > +   /* TCP timer functions might access net namespace even after
> > > +    * a process which created this net namespace terminated.
> > > +    */
> > > +   if (!sk->sk_net_refcnt) {
> > > +           sk->sk_net_refcnt = 1;
> > > +           get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > > +           sock_inuse_add(net, 1);
> > > +   }
> > >     if (rtn->sndbuf_size > 0) {
> > >             sk->sk_sndbuf = rtn->sndbuf_size;
> > >             sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> >
> > This looks equivalent to the fix presented here:
> >
> > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> >
> > but the latter looks a more generic solution. @Tetsuo could you please
> > test the above in your setup?
>
> Wouldn't a more generic solution be to add a flag to sock_create_kern()
> so that it acquires a reference to the namespace?
> This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.
>
> I've a driver that uses __sock_create() in order to get that reference.
> I'm pretty sure the extra 'security' check will never fail.
>

This would be silly really.

Definition of a 'kernel socket' is that it does not hold a reference
to the namespace.
(otherwise a netns could not be destroyed by user space)

A kernel layer using kernel sockets needs to properly dismantle them
when a namespace is destroyed.

In the RDS case, the socket was a user socket, or RDS lacked proper
tracking of all the sockets
so that they can be dismantled properly.

Eric Dumazet May 3, 2022, 1:45 p.m. UTC | #9

On Tue, May 3, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > ---
> > Changes in v2:
> >   Add Fixes: tag.
> >   Move to inside lock_sock() section.
> >
> > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > to RDS") was added to 2.6.32.
> >
> >  net/rds/tcp.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > index 5327d130c4b5..2f638f8b7b1e 100644
> > --- a/net/rds/tcp.c
> > +++ b/net/rds/tcp.c
> > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> >
> >       tcp_sock_set_nodelay(sock->sk);
> >       lock_sock(sk);
> > +     /* TCP timer functions might access net namespace even after
> > +      * a process which created this net namespace terminated.
> > +      */
> > +     if (!sk->sk_net_refcnt) {
> > +             sk->sk_net_refcnt = 1;
> > +             get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > +             sock_inuse_add(net, 1);
> > +     }
> >       if (rtn->sndbuf_size > 0) {
> >               sk->sk_sndbuf = rtn->sndbuf_size;
> >               sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
>
> This looks equivalent to the fix presented here:
>
> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/

I think this is still needed for layers (NFS ?) that dismantle their
TCP sockets whenever a netns
is dismantled. But RDS case was different, only the listener is a kernel socket.

>
> but the latter looks a more generic solution. @Tetsuo could you please
> test the above in your setup?
>
> Thanks!
>
> Paolo
>

Tetsuo Handa May 3, 2022, 2:08 p.m. UTC | #10

On 2022/05/03 22:45, Eric Dumazet wrote:
>> This looks equivalent to the fix presented here:
>>
>> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/

I retested the fix above using

unshare -n sh -c '
ip link set lo up
iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP
telnet 127.0.0.1 16385
dmesg -c
netstat -tanpe' < /dev/null

as a test case, but it seems racy; sometimes timer function is called again and crashes.

[  426.086565][    C2] general protection fault, probably for non-canonical address 0x6b6af3ebcc3b6bc3: 0000 [#1] PREEMPT SMP KASAN
[  426.096339][    C2] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.18.0-rc5-dirty #807
[  426.103769][    C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  426.111851][    C2] RIP: 0010:__tcp_transmit_skb+0xe72/0x1b80
[  426.117512][    C2] Code: e8 b3 ea dc fd 48 8d 7d 30 45 0f b7 77 30 e8 95 ec dc fd 48 8b 5d 30 48 8d bb b8 02 00 00 e8 85 ec dc fd 48 8b 83 b8 02 00 00 <65> 4c 01 70 58 e9 67 fd ff ff e8 ef 56 ac fd 48 8d bd d0 09 00 00
[  426.124692][    C2] RSP: 0018:ffff888060d09ac8 EFLAGS: 00010246
[  426.126845][    C2] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8880145c8000 RCX: ffffffff838cc28b
[  426.129616][    C2] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880145c82b8
[  426.132374][    C2] RBP: ffff8880129f8000 R08: 0000000000000000 R09: 0000000000000007
[  426.135077][    C2] R10: ffffffff838cbfd4 R11: 0000000000000001 R12: ffff8880129f8760
[  426.137793][    C2] R13: ffff88800f6e0118 R14: 0000000000000001 R15: ffff88800f6e00e8
[  426.140489][    C2] FS:  0000000000000000(0000) GS:ffff888060d00000(0000) knlGS:0000000000000000
[  426.143525][    C2] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  426.145792][    C2] CR2: 000055b5bb0adabc CR3: 000000000e003000 CR4: 00000000000506e0
[  426.148509][    C2] Call Trace:
[  426.149442][    C2]  <IRQ>
[  426.150183][    C2]  ? __tcp_select_window+0x710/0x710
[  426.151457][    C2]  ? __sanitizer_cov_trace_cmp4+0x1c/0x70
[  426.153007][    C2]  ? tcp_current_mss+0x165/0x280
[  426.154245][    C2]  ? tcp_trim_head+0x300/0x300
[  426.155396][    C2]  ? find_held_lock+0x85/0xa0
[  426.156734][    C2]  ? mark_held_locks+0x65/0x90
[  426.157967][    C2]  tcp_write_wakeup+0x2e2/0x340
[  426.159149][    C2]  tcp_send_probe0+0x2a/0x2c0
[  426.160368][    C2]  tcp_write_timer_handler+0x5cb/0x670
[  426.161740][    C2]  tcp_write_timer+0x86/0x250
[  426.162896][    C2]  ? tcp_write_timer_handler+0x670/0x670
[  426.164285][    C2]  call_timer_fn+0x15d/0x5f0
[  426.165481][    C2]  ? add_timer_on+0x2e0/0x2e0
[  426.166667][    C2]  ? lock_downgrade+0x3c0/0x3c0
[  426.167921][    C2]  ? mark_held_locks+0x24/0x90
[  426.169263][    C2]  ? _raw_spin_unlock_irq+0x1f/0x40
[  426.170564][    C2]  ? tcp_write_timer_handler+0x670/0x670
[  426.171920][    C2]  __run_timers.part.0+0x523/0x740
[  426.173181][    C2]  ? call_timer_fn+0x5f0/0x5f0
[  426.174321][    C2]  ? pvclock_clocksource_read+0xdc/0x1a0
[  426.175655][    C2]  run_timer_softirq+0x66/0xe0
[  426.176825][    C2]  __do_softirq+0x1c2/0x670
[  426.177944][    C2]  __irq_exit_rcu+0xf8/0x140
[  426.179120][    C2]  irq_exit_rcu+0x5/0x20
[  426.180150][    C2]  sysvec_apic_timer_interrupt+0x8e/0xc0
[  426.181486][    C2]  </IRQ>
[  426.182180][    C2]  <TASK>
[  426.182845][    C2]  asm_sysvec_apic_timer_interrupt+0x12/0x20

> 
> I think this is still needed for layers (NFS ?) that dismantle their
> TCP sockets whenever a netns
> is dismantled. But RDS case was different, only the listener is a kernel socket.

We can't apply the fix above.

I think that the fundamental problem is that we use net->ns.count for both
"avoiding use-after-free" purpose and "allowing dismantle from user event" purpose.
Why not to use separated counters?

David Laight May 3, 2022, 2:25 p.m. UTC | #11

From: Eric Dumazet
> Sent: 03 May 2022 14:43
> 
> On Tue, May 3, 2022 at 6:27 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Paolo Abeni
> > > Sent: 03 May 2022 10:03
> > >
> > > Hello,
> > >
> > > On Mon, 2022-05-02 at 10:40 +0900, Tetsuo Handa wrote:
> > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > > refcount on net namespace. Since TCP's retransmission can happen after
> > > > a process which created net namespace terminated, we need to explicitly
> > > > acquire a refcount.
> > > >
> > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel
> sockets.")
> > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a
> > > kernel socket")
> > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > ---
> > > > Changes in v2:
> > > >   Add Fixes: tag.
> > > >   Move to inside lock_sock() section.
> > > >
> > > > I chose 26abe14379f8e2fa and 8a68173691f03661 which went to 4.2 for Fixes: tag,
> > > > for refcount was implicitly taken when 70041088e3b97662 ("RDS: Add TCP transport
> > > > to RDS") was added to 2.6.32.
> > > >
> > > >  net/rds/tcp.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> > > > index 5327d130c4b5..2f638f8b7b1e 100644
> > > > --- a/net/rds/tcp.c
> > > > +++ b/net/rds/tcp.c
> > > > @@ -495,6 +495,14 @@ void rds_tcp_tune(struct socket *sock)
> > > >
> > > >     tcp_sock_set_nodelay(sock->sk);
> > > >     lock_sock(sk);
> > > > +   /* TCP timer functions might access net namespace even after
> > > > +    * a process which created this net namespace terminated.
> > > > +    */
> > > > +   if (!sk->sk_net_refcnt) {
> > > > +           sk->sk_net_refcnt = 1;
> > > > +           get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
> > > > +           sock_inuse_add(net, 1);
> > > > +   }
> > > >     if (rtn->sndbuf_size > 0) {
> > > >             sk->sk_sndbuf = rtn->sndbuf_size;
> > > >             sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> > >
> > > This looks equivalent to the fix presented here:
> > >
> > > https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> > >
> > > but the latter looks a more generic solution. @Tetsuo could you please
> > > test the above in your setup?
> >
> > Wouldn't a more generic solution be to add a flag to sock_create_kern()
> > so that it acquires a reference to the namespace?
> > This could be a bit on one of the existing parameters - like SOCK_NONBLOCK.
> >
> > I've a driver that uses __sock_create() in order to get that reference.
> > I'm pretty sure the extra 'security' check will never fail.
> >
> 
> This would be silly really.
> 
> Definition of a 'kernel socket' is that it does not hold a reference
> to the namespace.
> (otherwise a netns could not be destroyed by user space)
> 
> A kernel layer using kernel sockets needs to properly dismantle them
> when a namespace is destroyed.

I think it depends on why the driver is using a socket.

If the driver is a 'user' of a TCP connection that happens to
be is a kernel driver then holding the a reference to the namespace
is no different to an application socket holding a reference.
An example might be nfs/tcp - you need to unmount the filesystem
before you can delete the namespace.

OTOH if part of a protocol stack is using a socket for internal
calls (I think I've seen routing sockets used that way) then the
presence of the socket probably shouldn't stop the namespace
being deleted.

Listening sockets are a slight problem - probably for userspace as well.
It would be nicer to be able to get TCP (etc) to error out listening
sockets if they are the only thing stopping a namespace being deleted.

> In the RDS case, the socket was a user socket, or RDS lacked proper
> tracking of all the sockets
> so that they can be dismantled properly.

I think they probably are sockets created in order act on requests
from applications.
I think they should have the same effect on namespaces as a direct
user socket - you can't delete the socket while the connection is
active.
Kill all the relevant processes, tell the driver to stop, and you
can delete the namespace.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Eric Dumazet May 3, 2022, 9:17 p.m. UTC | #12

On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This patch was applied to netdev/net.git (master)
> by Paolo Abeni <pabeni@redhat.com>:
>
> On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > refcount on net namespace. Since TCP's retransmission can happen after
> > a process which created net namespace terminated, we need to explicitly
> > acquire a refcount.
> >
> > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> >
> > [...]
>
> Here is the summary with links:
>   - [v2] net: rds: acquire refcount on TCP sockets
>     https://git.kernel.org/netdev/net/c/3a58f13a881e
>
> You are awesome, thank you!
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
>
>

I think we merged this patch too soon.

My question is : What prevents rds_tcp_conn_path_connect(), and thus
rds_tcp_tune() to be called
after the netns refcount already reached 0 ?

I guess we can wait for next syzbot report, but I think that get_net()
should be replaced
by maybe_get_net()

Eric Dumazet May 3, 2022, 10:37 p.m. UTC | #13

On Tue, May 3, 2022 at 2:17 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> >
> > Hello:
> >
> > This patch was applied to netdev/net.git (master)
> > by Paolo Abeni <pabeni@redhat.com>:
> >
> > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > >
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > >
> > > [...]
> >
> > Here is the summary with links:
> >   - [v2] net: rds: acquire refcount on TCP sockets
> >     https://git.kernel.org/netdev/net/c/3a58f13a881e
> >
> > You are awesome, thank you!
> > --
> > Deet-doot-dot, I am a bot.
> > https://korg.docs.kernel.org/patchwork/pwbot.html
> >
> >
>
> I think we merged this patch too soon.
>
> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> rds_tcp_tune() to be called
> after the netns refcount already reached 0 ?
>
> I guess we can wait for next syzbot report, but I think that get_net()
> should be replaced
> by maybe_get_net()

Yes, syzbot was fast to trigger this exact issue:

HEAD commit:    3a58f13a net: rds: acquire refcount on TCP sockets
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 6934 at lib/refcount.c:25
refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25
Modules linked in:
CPU: 1 PID: 6934 Comm: kworker/u4:17 Not tainted
5.18.0-rc4-syzkaller-00209-g3a58f13a881e #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: krdsd rds_connect_worker
RIP: 0010:refcount_warn_saturate+0x169/0x1e0 lib/refcount.c:25
Code: 09 31 ff 89 de e8 f7 b9 81 fd 84 db 0f 85 36 ff ff ff e8 0a b6
81 fd 48 c7 c7 40 eb 26 8a c6 05 75 1f ac 09 01 e8 56 75 2d 05 <0f> 0b
e9 17 ff ff ff e8 eb b5 81 fd 0f b6 1d 5a 1f ac 09 31 ff 89
RSP: 0018:ffffc9000b5e7b80 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88807a948000 RSI: ffffffff81600c08 RDI: fffff520016bcf62
RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815fb5de R11: 0000000000000000 R12: ffff888021e69b80
R13: ffff88805bc82a00 R14: ffff888021e69ccc R15: ffff8880741a2900
FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2cb5c000 CR3: 000000005688f000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __refcount_add include/linux/refcount.h:199 [inline]
 __refcount_inc include/linux/refcount.h:250 [inline]
 refcount_inc include/linux/refcount.h:267 [inline]
 get_net include/net/net_namespace.h:248 [inline]
 get_net_track include/net/net_namespace.h:334 [inline]
 rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503
 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127
 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176
 process_one_work+0x996/0x1610 kernel/workqueue.c:2289
 worker_thread+0x665/0x1080 kernel/workqueue.c:2436
 kthread+0x2e9/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298
 </TASK>

Tetsuo Handa May 4, 2022, 1:04 a.m. UTC | #14

On 2022/05/04 7:37, Eric Dumazet wrote:
>> I think we merged this patch too soon.
>>
>> My question is : What prevents rds_tcp_conn_path_connect(), and thus
>> rds_tcp_tune() to be called
>> after the netns refcount already reached 0 ?
>>
>> I guess we can wait for next syzbot report, but I think that get_net()
>> should be replaced
>> by maybe_get_net()
> 
> Yes, syzbot was fast to trigger this exact issue:

Does maybe_get_net() help?

Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().

rds_tcp_conn_path_connect() {
  sock_create_kern(net = rds_conn_net(conn)) {
    __sock_create(net = rds_conn_net(conn), kern = 1) {
      err = pf->create(net = rds_conn_net(conn), kern = 1) {
        // pf->create is either inet_create or inet6_create
        sk_alloc(net = rds_conn_net(conn), kern = 1) {
          sk->sk_net_refcnt = kern ? 0 : 1;
          if (likely(sk->sk_net_refcnt)) {
            get_net_track(net, &sk->ns_tracker, priority);
            sock_inuse_add(net, 1);
          }
          sock_net_set(sk, net);
        }
      }
    }
  }
  rds_tcp_tune() {
    if (!sk->sk_net_refcnt) {
      sk->sk_net_refcnt = 1;
      get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
      sock_inuse_add(net, 1);
    }
  }
}

"struct rds_connection" needs to hold a ref in order to safely allow
rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless
to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection"
must have a ref. Situation where we are protected by maybe_get_net() is
quite limited if long-lived object is not holding a ref.

Hmm, can we simply use &init_net instead of rds_conn_net(conn) ?

Eric Dumazet May 4, 2022, 3:09 a.m. UTC | #15

On Tue, May 3, 2022 at 6:04 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2022/05/04 7:37, Eric Dumazet wrote:
> >> I think we merged this patch too soon.
> >>
> >> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> >> rds_tcp_tune() to be called
> >> after the netns refcount already reached 0 ?
> >>
> >> I guess we can wait for next syzbot report, but I think that get_net()
> >> should be replaced
> >> by maybe_get_net()
> >
> > Yes, syzbot was fast to trigger this exact issue:
>
> Does maybe_get_net() help?
>
> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().

Nope. RDS has an exit() handler called from cleanup_net()

(struct pernet_operations)->exit() or exit_batch() :
rds_tcp_exit_net() (rds_tcp_kill_sock())

This exit() handler _has_ to remove all known listeners, and
definitely cancel work queues (synchronous operation)
before the actual "struct net" free can happen later.



>
> rds_tcp_conn_path_connect() {
>   sock_create_kern(net = rds_conn_net(conn)) {
>     __sock_create(net = rds_conn_net(conn), kern = 1) {
>       err = pf->create(net = rds_conn_net(conn), kern = 1) {
>         // pf->create is either inet_create or inet6_create
>         sk_alloc(net = rds_conn_net(conn), kern = 1) {
>           sk->sk_net_refcnt = kern ? 0 : 1;
>           if (likely(sk->sk_net_refcnt)) {
>             get_net_track(net, &sk->ns_tracker, priority);
>             sock_inuse_add(net, 1);
>           }
>           sock_net_set(sk, net);
>         }
>       }
>     }
>   }
>   rds_tcp_tune() {
>     if (!sk->sk_net_refcnt) {
>       sk->sk_net_refcnt = 1;
>       get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
>       sock_inuse_add(net, 1);
>     }
>   }
> }
>
> "struct rds_connection" needs to hold a ref in order to safely allow
> rds_tcp_tune() to call maybe_get_net(), which in turn makes pointless
> to use maybe_get_net() from rds_tcp_tune() because "struct rds_connection"
> must have a ref. Situation where we are protected by maybe_get_net() is
> quite limited if long-lived object is not holding a ref.
>
> Hmm, can we simply use &init_net instead of rds_conn_net(conn) ?

Only if you plan making RDS unavailable for non init netns.

Tetsuo Handa May 4, 2022, 4:58 a.m. UTC | #16

On 2022/05/04 12:09, Eric Dumazet wrote:
>> Does maybe_get_net() help?
>>
>> Since rds_conn_net() returns a net namespace without holding a ref, it is theoretically
>> possible that the net namespace returned by rds_conn_net() is already kmem_cache_free()d
>> if refcount dropped to 0 by the moment sk_alloc() calls sock_net_set().
> 
> Nope. RDS has an exit() handler called from cleanup_net()
> 
> (struct pernet_operations)->exit() or exit_batch() :
> rds_tcp_exit_net() (rds_tcp_kill_sock())

Hmm, when put_net() called __put_net(), this "struct net" is chained to cleanup_list.
When cleanup_net() is called via net_cleanup_work, rds_tcp_exit_net() is called from
ops_exit_list(). Therefore, we can call maybe_get_net() until rds_tcp_exit_net() returns.
That's good.

> 
> This exit() handler _has_ to remove all known listeners, and
> definitely cancel work queues (synchronous operation)
> before the actual "struct net" free can happen later.

But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from
rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that
rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and
rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w).

But I can't see how rds_tcp_exit_net() synchronously cancels all works associated
with "struct rds_conn_path".

struct rds_conn_path {
        struct delayed_work     cp_send_w;
        struct delayed_work     cp_recv_w;
        struct delayed_work     cp_conn_w;
        struct work_struct      cp_down_w;
}

These works are queued to rds_wq, but flush_workqueue() waits for completion only
if already queued. What if timer for queue_delayed_work() has not expired, or was
about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient?

Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed
or cancelled, the fix would look like something below?

 net/rds/tcp.c         | 11 ++++++++---
 net/rds/tcp.h         |  2 +-
 net/rds/tcp_connect.c |  5 ++++-
 net/rds/tcp_listen.c  |  5 ++++-
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2f638f8b7b1e..8e26bcf02044 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -487,11 +487,11 @@ struct rds_tcp_net {
 /* All module specific customizations to the RDS-TCP socket should be done in
  * rds_tcp_tune() and applied after socket creation.
  */
-void rds_tcp_tune(struct socket *sock)
+bool rds_tcp_tune(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
 	struct net *net = sock_net(sk);
-	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
+	struct rds_tcp_net *rtn;
 
 	tcp_sock_set_nodelay(sock->sk);
 	lock_sock(sk);
@@ -499,10 +499,14 @@ void rds_tcp_tune(struct socket *sock)
 	 * a process which created this net namespace terminated.
 	 */
 	if (!sk->sk_net_refcnt) {
+		if (!maybe_get_net(net)) {
+			release_sock(sk);
+			return false;
+		}
 		sk->sk_net_refcnt = 1;
-		get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
 		sock_inuse_add(net, 1);
 	}
+	rtn = net_generic(net, rds_tcp_netid);
 	if (rtn->sndbuf_size > 0) {
 		sk->sk_sndbuf = rtn->sndbuf_size;
 		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
@@ -512,6 +516,7 @@ void rds_tcp_tune(struct socket *sock)
 		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 	}
 	release_sock(sk);
+	return true;
 }
 
 static void rds_tcp_accept_worker(struct work_struct *work)
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index dc8d745d6857..f8b5930d7b34 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -49,7 +49,7 @@ struct rds_tcp_statistics {
 };
 
 /* tcp.c */
-void rds_tcp_tune(struct socket *sock);
+bool rds_tcp_tune(struct socket *sock);
 void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_restore_callbacks(struct socket *sock,
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 5461d77fff4f..f0c477c5d1db 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -124,7 +124,10 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 	if (ret < 0)
 		goto out;
 
-	rds_tcp_tune(sock);
+	if (!rds_tcp_tune(sock)) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	if (isv6) {
 		sin6.sin6_family = AF_INET6;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 09cadd556d1e..7edf2e69d3fe 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -133,7 +133,10 @@ int rds_tcp_accept_one(struct socket *sock)
 	__module_get(new_sock->ops->owner);
 
 	rds_tcp_keepalive(new_sock);
-	rds_tcp_tune(new_sock);
+	if (!rds_tcp_tune(new_sock)) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	inet = inet_sk(new_sock->sk);

Paolo Abeni May 4, 2022, 1:09 p.m. UTC | #17

On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote:
> On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> > 
> > Hello:
> > 
> > This patch was applied to netdev/net.git (master)
> > by Paolo Abeni <pabeni@redhat.com>:
> > 
> > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > refcount on net namespace. Since TCP's retransmission can happen after
> > > a process which created net namespace terminated, we need to explicitly
> > > acquire a refcount.
> > > 
> > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > 
> > > [...]
> > 
> > Here is the summary with links:
> >   - [v2] net: rds: acquire refcount on TCP sockets
> >     https://git.kernel.org/netdev/net/c/3a58f13a881e
> > 
> > You are awesome, thank you!
> > --
> > Deet-doot-dot, I am a bot.
> > https://korg.docs.kernel.org/patchwork/pwbot.html
> > 
> > 
> 
> I think we merged this patch too soon.

My fault.


> My question is : What prevents rds_tcp_conn_path_connect(), and thus
> rds_tcp_tune() to be called
> after the netns refcount already reached 0 ?
> 
> I guess we can wait for next syzbot report, but I think that get_net()
> should be replaced
> by maybe_get_net()
> 
Should we revert this patch before the next pull request, if a suitable
incremental fix is not available by then?

It looks like the window of opportunity for the race is roughly the
same?

Thanks!

Paolo

Eric Dumazet May 4, 2022, 1:25 p.m. UTC | #18

On Wed, May 4, 2022 at 6:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2022-05-03 at 14:17 -0700, Eric Dumazet wrote:
> > On Tue, May 3, 2022 at 4:40 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
> > >
> > > Hello:
> > >
> > > This patch was applied to netdev/net.git (master)
> > > by Paolo Abeni <pabeni@redhat.com>:
> > >
> > > On Mon, 2 May 2022 10:40:18 +0900 you wrote:
> > > > syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
> > > > for TCP socket used by RDS is accessing sock_net() without acquiring a
> > > > refcount on net namespace. Since TCP's retransmission can happen after
> > > > a process which created net namespace terminated, we need to explicitly
> > > > acquire a refcount.
> > > >
> > > > Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
> > > > Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > > Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
> > > > Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
> > > > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
> > > >
> > > > [...]
> > >
> > > Here is the summary with links:
> > >   - [v2] net: rds: acquire refcount on TCP sockets
> > >     https://git.kernel.org/netdev/net/c/3a58f13a881e
> > >
> > > You are awesome, thank you!
> > > --
> > > Deet-doot-dot, I am a bot.
> > > https://korg.docs.kernel.org/patchwork/pwbot.html
> > >
> > >
> >
> > I think we merged this patch too soon.
>
> My fault.
>
>
> > My question is : What prevents rds_tcp_conn_path_connect(), and thus
> > rds_tcp_tune() to be called
> > after the netns refcount already reached 0 ?
> >
> > I guess we can wait for next syzbot report, but I think that get_net()
> > should be replaced
> > by maybe_get_net()
> >
> Should we revert this patch before the next pull request, if a suitable
> incremental fix is not available by then?
>
> It looks like the window of opportunity for the race is roughly the
> same?
>

No need to revert the patch, we certainly are in a better situation,
as refcount_t helps here.

We can refine the logic in a followup.

Thanks.

Tetsuo Handa May 4, 2022, 3:15 p.m. UTC | #19

On 2022/05/04 13:58, Tetsuo Handa wrote:
> On 2022/05/04 12:09, Eric Dumazet wrote:
>> This exit() handler _has_ to remove all known listeners, and
>> definitely cancel work queues (synchronous operation)
>> before the actual "struct net" free can happen later.
> 
> But in your report, rds_tcp_tune() is called from rds_tcp_conn_path_connect() from
> rds_connect_worker() via "struct rds_connection"->cp_conn_w work. I can see that
> rds_tcp_kill_sock() calls rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w), and
> rds_tcp_listen_stop() calls flush_workqueue(rds_wq) and flush_work(&rtn->rds_tcp_accept_w).
> 
> But I can't see how rds_tcp_exit_net() synchronously cancels all works associated
> with "struct rds_conn_path".
> 
> struct rds_conn_path {
>         struct delayed_work     cp_send_w;
>         struct delayed_work     cp_recv_w;
>         struct delayed_work     cp_conn_w;
>         struct work_struct      cp_down_w;
> }
> 
> These works are queued to rds_wq, but flush_workqueue() waits for completion only
> if already queued. What if timer for queue_delayed_work() has not expired, or was
> about to call queue_delayed_work() ? Is flush_workqueue(rds_wq) sufficient?

 rds_tcp_tune+0x5a0/0x5f0 net/rds/tcp.c:503
 rds_tcp_conn_path_connect+0x489/0x880 net/rds/tcp_connect.c:127
 rds_connect_worker+0x1a5/0x2c0 net/rds/threads.c:176
 process_one_work+0x996/0x1610 kernel/workqueue.c:2289

rds_tcp_conn_path_connect is referenced by
"struct rds_transport rds_tcp_transport"->conn_path_connect.
It is invoked by

  ret = conn->c_trans->conn_path_connect(cp)

in rds_connect_worker().

rds_connect_worker is referenced by "struct rds_conn_path"->cp_conn_w
via INIT_DELAYED_WORK().

queue_delayed_work(rds_wq, &cp->cp_conn_w, *) is called by
rds_queue_reconnect() or rds_conn_path_connect_if_down().

If rds_conn_path_connect_if_down() were called from
rds_tcp_accept_one_path() from rds_tcp_accept_one(),
rds_tcp_tune() from rds_tcp_accept_one() was already called
before rds_tcp_tune() from rds_tcp_conn_path_connect() is called.
Since the addition on 0 was not reported at rds_tcp_tune() from
rds_tcp_accept_one(), what Eric is reporting cannot be from
rds_tcp_accept_one() from rds_tcp_accept_worker().

Despite rds_tcp_kill_sock() sets rtn->rds_tcp_listen_sock = NULL and
waits for rds_tcp_accept_one() from rds_tcp_accept_worker() to complete
using flush_workqueue(rds_wq), what Eric is reporting is different from
what syzbot+694120e1002c117747ed was reporting.

> 
> Anyway, if rds_tcp_kill_sock() can somehow guarantee that all works are completed
> or cancelled, the fix would look like something below?

I think it is OK to apply below diff in order to avoid addition on 0 problem, but
it is not proven that kmem_cache_free() is not yet called. What should we do?

> 
>  net/rds/tcp.c         | 11 ++++++++---
>  net/rds/tcp.h         |  2 +-
>  net/rds/tcp_connect.c |  5 ++++-
>  net/rds/tcp_listen.c  |  5 ++++-
>  4 files changed, 17 insertions(+), 6 deletions(-)
>

[v2] net: rds: acquire refcount on TCP sockets

Commit Message

Comments

Patch