diff mbox series

[RFC,net] net: sockmap: avoid race between sock_map_destroy() and sk_psock_put()

Message ID 20240905064257.3870271-1-dmantipov@yandex.ru (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [RFC,net] net: sockmap: avoid race between sock_map_destroy() and sk_psock_put() | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 2 maintainers not CCed: bpf@vger.kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 16 this patch: 16
netdev/checkpatch warning WARNING: The commit message has 'syzkaller', perhaps it also needs a 'Fixes:' tag?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Dmitry Antipov Sept. 5, 2024, 6:42 a.m. UTC
At https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45, syzbot
has triggered the following race condition:

On CPU0, 'sk_psock_drop()' is running at [1]:

void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
{
        write_lock_bh(&sk->sk_callback_lock);
        sk_psock_restore_proto(sk, psock);                                  [1]
        rcu_assign_sk_user_data(sk, NULL);
        if (psock->progs.stream_parser)
                sk_psock_stop_strp(sk, psock);
        else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
                sk_psock_stop_verdict(sk, psock);
        write_unlock_bh(&sk->sk_callback_lock);

        sk_psock_stop(psock);

        INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
        queue_rcu_work(system_wq, &psock->rwork);
}

If 'sock_map_destroy()' is scheduled on CPU1 at the same time, psock is
always NULL at [2]. But, since [1] may be is in progress during [3], the
value of 'saved_destroy' at this point is undefined:

void sock_map_destroy(struct sock *sk)
{
        void (*saved_destroy)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock_get(sk);                                           [2]
        if (unlikely(!psock)) {
                rcu_read_unlock();
                saved_destroy = READ_ONCE(sk->sk_prot)->destroy;            [3]
        } else {
                saved_destroy = psock->saved_destroy;
                sock_map_remove_links(sk, psock);
                rcu_read_unlock();
                sk_psock_stop(psock);
                sk_psock_put(sk, psock);
        }
        if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
                return;
        if (saved_destroy)
                saved_destroy(sk);
}

The following fix is helpful to avoid the race and does not introduce
psock leak (when running the syzbot reproducer from the above), but
I'm not sure whether the latter is always true in some other scenario.
So comments are highly appreciated.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
 net/core/sock_map.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Cong Wang Sept. 8, 2024, 6:36 p.m. UTC | #1
On Thu, Sep 05, 2024 at 09:42:57AM +0300, Dmitry Antipov wrote:
> At https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45, syzbot
> has triggered the following race condition:

Are you sure it is due to sockmap code?

I see rds_tcp_accept_one() in the stack trace. This is why I highly
suspect that it is due to RDS code instead of sockmap code.

I have the following patch ready for testing, in case you are
interested.

Thanks.

--------------->

commit 4068420e2c82137ab95d387346c0776a36c69e5d
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Sun Sep 1 17:01:49 2024 -0700

    rds: check sock->sk->sk_user_data conflicts
    
    Signed-off-by: Cong Wang <cong.wang@bytedance.com>

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351ac1747224..54ee7f6b8f34 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -134,11 +134,12 @@ void rds_tcp_restore_callbacks(struct socket *sock,
  * it is set.  The absence of RDS_CONN_UP bit protects those paths
  * from being called while it isn't set.
  */
-void rds_tcp_reset_callbacks(struct socket *sock,
-			     struct rds_conn_path *cp)
+int rds_tcp_reset_callbacks(struct socket *sock,
+			    struct rds_conn_path *cp)
 {
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
 	struct socket *osock = tc->t_sock;
+	int ret = 0;
 
 	if (!osock)
 		goto newsock;
@@ -181,21 +182,25 @@ void rds_tcp_reset_callbacks(struct socket *sock,
 newsock:
 	rds_send_path_reset(cp);
 	lock_sock(sock->sk);
-	rds_tcp_set_callbacks(sock, cp);
+	ret = rds_tcp_set_callbacks(sock, cp);
 	release_sock(sock->sk);
+	return ret;
 }
 
 /* Add tc to rds_tcp_tc_list and set tc->t_sock. See comments
  * above rds_tcp_reset_callbacks for notes about synchronization
  * with data path
  */
-void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
+int rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 {
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
 
-	rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc);
 	write_lock_bh(&sock->sk->sk_callback_lock);
-
+	if (sock->sk->sk_user_data) {
+		write_unlock_bh(&sock->sk->sk_callback_lock);
+		return -EBUSY;
+	}
+	rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc);
 	/* done under the callback_lock to serialize with write_space */
 	spin_lock(&rds_tcp_tc_list_lock);
 	list_add_tail(&tc->t_list_item, &rds_tcp_tc_list);
@@ -222,6 +227,7 @@ void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 	sock->sk->sk_state_change = rds_tcp_state_change;
 
 	write_unlock_bh(&sock->sk->sk_callback_lock);
+	return 0;
 }
 
 /* Handle RDS_INFO_TCP_SOCKETS socket option.  It only returns IPv4
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 053aa7da87ef..710cc7fa41af 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -50,8 +50,8 @@ struct rds_tcp_statistics {
 
 /* tcp.c */
 bool rds_tcp_tune(struct socket *sock);
-void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
-void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
+int rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
+int rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_restore_callbacks(struct socket *sock,
 			       struct rds_tcp_connection *tc);
 u32 rds_tcp_write_seq(struct rds_tcp_connection *tc);
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index d89bd8d0c354..695456455aee 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -205,11 +205,15 @@ int rds_tcp_accept_one(struct socket *sock)
 		goto rst_nsk;
 	if (rs_tcp->t_sock) {
 		/* Duelling SYN has been handled in rds_tcp_accept_one() */
-		rds_tcp_reset_callbacks(new_sock, cp);
+		ret = rds_tcp_reset_callbacks(new_sock, cp);
+		if (ret)
+			goto rst_nsk;
 		/* rds_connect_path_complete() marks RDS_CONN_UP */
 		rds_connect_path_complete(cp, RDS_CONN_RESETTING);
 	} else {
-		rds_tcp_set_callbacks(new_sock, cp);
+		ret = rds_tcp_set_callbacks(new_sock, cp);
+		if (ret)
+			goto rst_nsk;
 		rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
 	}
 	new_sock = NULL;
Dmitry Antipov Sept. 9, 2024, 7:04 a.m. UTC | #2
On 9/8/24 9:36 PM, Cong Wang wrote:

> Are you sure it is due to sockmap code?

No, and that's why my patch has RFC tag in subject :-).

> I see rds_tcp_accept_one() in the stack trace. This is why I highly
> suspect that it is due to RDS code instead of sockmap code.
> 
> I have the following patch ready for testing, in case you are
> interested.

Does it work for you? Running current upstream with this patch applied,
I'm still seeing the same warning at net/core/sock_map.c:1663.

Again, I'm suspecting the race just because 'sk_psock_drop()' issues
'sk_psock_restore_proto()' with 'sk->sk_callback_lock' write locked,
but 'sock_map_destroy()' just uses 'READ_ONCE()' to obtain a callback
which may be changed underneath.

BTW looking here and there again, I suppose that my patch is not correct
too because it moves and/or shrinks the race window but doesn't eliminate
it completely.

Dmitry
Cong Wang Sept. 11, 2024, 4:32 a.m. UTC | #3
On Mon, Sep 09, 2024 at 10:04:21AM +0300, Dmitry Antipov wrote:
> On 9/8/24 9:36 PM, Cong Wang wrote:
> 
> > Are you sure it is due to sockmap code?
> 
> No, and that's why my patch has RFC tag in subject :-).
> 
> > I see rds_tcp_accept_one() in the stack trace. This is why I highly
> > suspect that it is due to RDS code instead of sockmap code.
> > 
> > I have the following patch ready for testing, in case you are
> > interested.
> 
> Does it work for you? Running current upstream with this patch applied,
> I'm still seeing the same warning at net/core/sock_map.c:1663.

I never tested the RDS code (hence why I didn't post it). But for the warning
itself, actually disabling CONFIG_RDS made it disappear on my side, yet
another reason why I suspect it is RDS related.

Thanks.
Dmitry Antipov Sept. 11, 2024, 9:51 a.m. UTC | #4
On 9/11/24 7:32 AM, Cong Wang wrote:

> I never tested the RDS code (hence why I didn't post it). But for the warning
> itself, actually disabling CONFIG_RDS made it disappear on my side, yet
> another reason why I suspect it is RDS related.

OTOH sockmap code depends from CONFIG_BPF_SYSCALL. So I'm pretty sure that
there are more sockmap users beyond RDS and turning off CONFIG_RDS by itself
is not too useful for further investigations of this case.

Dmitry
Cong Wang Sept. 11, 2024, 4:45 p.m. UTC | #5
On Wed, Sep 11, 2024 at 12:51:04PM +0300, Dmitry Antipov wrote:
> On 9/11/24 7:32 AM, Cong Wang wrote:
> 
> > I never tested the RDS code (hence why I didn't post it). But for the warning
> > itself, actually disabling CONFIG_RDS made it disappear on my side, yet
> > another reason why I suspect it is RDS related.
> 
> OTOH sockmap code depends from CONFIG_BPF_SYSCALL. So I'm pretty sure that
> there are more sockmap users beyond RDS and turning off CONFIG_RDS by itself
> is not too useful for further investigations of this case.
> 

I guess you totally misunderstand my point. As a significant sockmap
contributor, I am certainly aware of sockmap users. My point is that I
needed to narrow down the problem to CONFIG_RDS when I was debugging it.

So, please let me know if you can still reproduce this after disabling
CONFIG_RDS, because I could not reproduce it any more. If you can,
please kindly share the stack trace without rds_* functions.

Thanks.
Dmitry Antipov Sept. 12, 2024, 3:59 p.m. UTC | #6
On 9/11/24 7:45 PM, Cong Wang wrote:

> I guess you totally misunderstand my point. As a significant sockmap
> contributor, I am certainly aware of sockmap users. My point is that I
> needed to narrow down the problem to CONFIG_RDS when I was debugging it.

I've narrowed down the problem to possible race condition between two
functions. "Narrowing down" the problem to a 17.5Kloc-sized subsystem
is not too helpful.

> So, please let me know if you can still reproduce this after disabling
> CONFIG_RDS, because I could not reproduce it any more. If you can,
> please kindly share the stack trace without rds_* functions.

Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
syzbot reproducer I'm working with doesn't create RDS sockets explicitly
(with 'socket(AF_RDS, ..., ...)' or so). When two options above are enabled,
the default network namespace has special kernel-space socket which is
created in 'rds_tcp_listen_init()' and (if my understanding of the namespaces
is correct) may be inherited with 'unshare(CLONE_NEWNET)'. So just enabling
these two options makes the kernel vulnerable.

So I'm still gently asking you to check whether there is a race condition
I've talked about. Hopefully this shouldn't be too hard for a significant
sockmap contributor.

Dmitry
Cong Wang Sept. 14, 2024, 12:34 a.m. UTC | #7
On Thu, Sep 12, 2024 at 06:59:39PM +0300, Dmitry Antipov wrote:
> On 9/11/24 7:45 PM, Cong Wang wrote:
> 
> > I guess you totally misunderstand my point. As a significant sockmap
> > contributor, I am certainly aware of sockmap users. My point is that I
> > needed to narrow down the problem to CONFIG_RDS when I was debugging it.
> 
> I've narrowed down the problem to possible race condition between two
> functions. "Narrowing down" the problem to a 17.5Kloc-sized subsystem
> is not too helpful.

Narrowing down from more 30 millions lines of code to 17.5K is already a huge
win to me, maybe not for you. :)

> 
> > So, please let me know if you can still reproduce this after disabling
> > CONFIG_RDS, because I could not reproduce it any more. If you can,
> > please kindly share the stack trace without rds_* functions.
> 
> Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
> syzbot reproducer I'm working with doesn't create RDS sockets explicitly
> (with 'socket(AF_RDS, ..., ...)' or so). When two options above are enabled,
> the default network namespace has special kernel-space socket which is
> created in 'rds_tcp_listen_init()' and (if my understanding of the namespaces
> is correct) may be inherited with 'unshare(CLONE_NEWNET)'. So just enabling
> these two options makes the kernel vulnerable.

Thanks for confirming it.

I did notice the RDS kernel socket, but, without my patch, we can still
use sockops to hook TCP socket under the RDS socket and add it to a
sockmap, hence the conflict of sock->sk->sk_user_data.

My patch basically prevents such TCP socket under RDS socket from being
added to any sockmap.

> 
> So I'm still gently asking you to check whether there is a race condition
> I've talked about. Hopefully this shouldn't be too hard for a significant
> sockmap contributor.

If you can kindly explain why this race condition is not related to RDS
despite the fact it only happens with CONFIG_RDS enabled, I'd happy to
review it. Otherwise, I feel like you may head to a wrong direction.

Thanks.
Dmitry Antipov Sept. 18, 2024, 3:42 p.m. UTC | #8
+CC Allison

On 9/14/24 3:34 AM, Cong Wang wrote:
> On Thu, Sep 12, 2024 at 06:59:39PM +0300, Dmitry Antipov wrote:
>> On 9/11/24 7:45 PM, Cong Wang wrote:
>>
>>> I guess you totally misunderstand my point. As a significant sockmap
>>> contributor, I am certainly aware of sockmap users. My point is that I
>>> needed to narrow down the problem to CONFIG_RDS when I was debugging it.
>>
>> I've narrowed down the problem to possible race condition between two
>> functions. "Narrowing down" the problem to a 17.5Kloc-sized subsystem
>> is not too helpful.
> 
> Narrowing down from more 30 millions lines of code to 17.5K is already a huge
> win to me, maybe not for you. :)
> 
>>
>>> So, please let me know if you can still reproduce this after disabling
>>> CONFIG_RDS, because I could not reproduce it any more. If you can,
>>> please kindly share the stack trace without rds_* functions.
>>
>> Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
>> syzbot reproducer I'm working with doesn't create RDS sockets explicitly
>> (with 'socket(AF_RDS, ..., ...)' or so). When two options above are enabled,
>> the default network namespace has special kernel-space socket which is
>> created in 'rds_tcp_listen_init()' and (if my understanding of the namespaces
>> is correct) may be inherited with 'unshare(CLONE_NEWNET)'. So just enabling
>> these two options makes the kernel vulnerable.
> 
> Thanks for confirming it.
> 
> I did notice the RDS kernel socket, but, without my patch, we can still
> use sockops to hook TCP socket under the RDS socket and add it to a
> sockmap, hence the conflict of sock->sk->sk_user_data.
> 
> My patch basically prevents such TCP socket under RDS socket from being
> added to any sockmap.
> 
>>
>> So I'm still gently asking you to check whether there is a race condition
>> I've talked about. Hopefully this shouldn't be too hard for a significant
>> sockmap contributor.
> 
> If you can kindly explain why this race condition is not related to RDS
> despite the fact it only happens with CONFIG_RDS enabled, I'd happy to
> review it. Otherwise, I feel like you may head to a wrong direction.
> 
> Thanks.
diff mbox series

Patch

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..fef4231fc872 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1649,7 +1649,7 @@  void sock_map_destroy(struct sock *sk)
 	struct sk_psock *psock;
 
 	rcu_read_lock();
-	psock = sk_psock_get(sk);
+	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
 		saved_destroy = READ_ONCE(sk->sk_prot)->destroy;
@@ -1658,7 +1658,6 @@  void sock_map_destroy(struct sock *sk)
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
 		sk_psock_stop(psock);
-		sk_psock_put(sk, psock);
 	}
 	if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
 		return;