Message ID | 20240905064257.3870271-1-dmantipov@yandex.ru
---|---
State | Superseded
Delegated to | Netdev Maintainers
Series | [RFC,net] net: sockmap: avoid race between sock_map_destroy() and sk_psock_put()
On Thu, Sep 05, 2024 at 09:42:57AM +0300, Dmitry Antipov wrote:
> At https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45, syzbot
> has triggered the following race condition:

Are you sure it is due to sockmap code?

I see rds_tcp_accept_one() in the stack trace. This is why I highly
suspect that it is due to RDS code instead of sockmap code.

I have the following patch ready for testing, in case you are
interested.

Thanks.

--------------->

commit 4068420e2c82137ab95d387346c0776a36c69e5d
Author: Cong Wang <cong.wang@bytedance.com>
Date:   Sun Sep 1 17:01:49 2024 -0700

    rds: check sock->sk->sk_user_data conflicts

    Signed-off-by: Cong Wang <cong.wang@bytedance.com>

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351ac1747224..54ee7f6b8f34 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -134,11 +134,12 @@ void rds_tcp_restore_callbacks(struct socket *sock,
  * it is set. The absence of RDS_CONN_UP bit protects those paths
  * from being called while it isn't set.
  */
-void rds_tcp_reset_callbacks(struct socket *sock,
-			     struct rds_conn_path *cp)
+int rds_tcp_reset_callbacks(struct socket *sock,
+			    struct rds_conn_path *cp)
 {
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
 	struct socket *osock = tc->t_sock;
+	int ret = 0;
 
 	if (!osock)
 		goto newsock;
@@ -181,21 +182,25 @@ void rds_tcp_reset_callbacks(struct socket *sock,
 newsock:
 	rds_send_path_reset(cp);
 	lock_sock(sock->sk);
-	rds_tcp_set_callbacks(sock, cp);
+	ret = rds_tcp_set_callbacks(sock, cp);
 	release_sock(sock->sk);
+
+	return ret;
 }
 
 /* Add tc to rds_tcp_tc_list and set tc->t_sock. See comments
  * above rds_tcp_reset_callbacks for notes about synchronization
  * with data path
  */
-void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
+int rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 {
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
 
-	rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc);
 	write_lock_bh(&sock->sk->sk_callback_lock);
-
+	if (sock->sk->sk_user_data) {
+		write_unlock_bh(&sock->sk->sk_callback_lock);
+		return -EBUSY;
+	}
+	rdsdebug("setting sock %p callbacks to tc %p\n", sock, tc);
 	/* done under the callback_lock to serialize with write_space */
 	spin_lock(&rds_tcp_tc_list_lock);
 	list_add_tail(&tc->t_list_item, &rds_tcp_tc_list);
@@ -222,6 +227,7 @@ void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 	sock->sk->sk_state_change = rds_tcp_state_change;
 
 	write_unlock_bh(&sock->sk->sk_callback_lock);
+	return 0;
 }
 
 /* Handle RDS_INFO_TCP_SOCKETS socket option. It only returns IPv4
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 053aa7da87ef..710cc7fa41af 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -50,8 +50,8 @@ struct rds_tcp_statistics {
 
 /* tcp.c */
 bool rds_tcp_tune(struct socket *sock);
-void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
-void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
+int rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
+int rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_restore_callbacks(struct socket *sock,
 			       struct rds_tcp_connection *tc);
 u32 rds_tcp_write_seq(struct rds_tcp_connection *tc);
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index d89bd8d0c354..695456455aee 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -205,11 +205,15 @@ int rds_tcp_accept_one(struct socket *sock)
 		goto rst_nsk;
 	if (rs_tcp->t_sock) {
 		/* Duelling SYN has been handled in rds_tcp_accept_one() */
-		rds_tcp_reset_callbacks(new_sock, cp);
+		ret = rds_tcp_reset_callbacks(new_sock, cp);
+		if (ret)
+			goto rst_nsk;
 		/* rds_connect_path_complete() marks RDS_CONN_UP */
 		rds_connect_path_complete(cp, RDS_CONN_RESETTING);
 	} else {
-		rds_tcp_set_callbacks(new_sock, cp);
+		ret = rds_tcp_set_callbacks(new_sock, cp);
+		if (ret)
+			goto rst_nsk;
 		rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
 	}
 	new_sock = NULL;
On 9/8/24 9:36 PM, Cong Wang wrote:

> Are you sure it is due to sockmap code?

No, and that's why my patch has an RFC tag in the subject :-).

> I see rds_tcp_accept_one() in the stack trace. This is why I highly
> suspect that it is due to RDS code instead of sockmap code.
>
> I have the following patch ready for testing, in case you are
> interested.

Does it work for you? Running current upstream with this patch applied,
I'm still seeing the same warning at net/core/sock_map.c:1663.

Again, I'm suspecting the race because 'sk_psock_drop()' issues
'sk_psock_restore_proto()' with 'sk->sk_callback_lock' write-locked,
but 'sock_map_destroy()' just uses 'READ_ONCE()' to obtain a callback
which may be changed underneath.

BTW, looking here and there again, I suppose that my patch is not
correct either, because it moves and/or shrinks the race window but
doesn't eliminate it completely.

Dmitry
On Mon, Sep 09, 2024 at 10:04:21AM +0300, Dmitry Antipov wrote:
> On 9/8/24 9:36 PM, Cong Wang wrote:
>
> > Are you sure it is due to sockmap code?
>
> No, and that's why my patch has an RFC tag in the subject :-).
>
> > I see rds_tcp_accept_one() in the stack trace. This is why I highly
> > suspect that it is due to RDS code instead of sockmap code.
> >
> > I have the following patch ready for testing, in case you are
> > interested.
>
> Does it work for you? Running current upstream with this patch applied,
> I'm still seeing the same warning at net/core/sock_map.c:1663.

I never tested the RDS code (hence why I didn't post it). But as for the
warning itself, disabling CONFIG_RDS made it disappear on my side, which
is yet another reason why I suspect it is RDS related.

Thanks.
On 9/11/24 7:32 AM, Cong Wang wrote:

> I never tested the RDS code (hence why I didn't post it). But as for the
> warning itself, disabling CONFIG_RDS made it disappear on my side, which
> is yet another reason why I suspect it is RDS related.

OTOH, the sockmap code depends on CONFIG_BPF_SYSCALL. So I'm pretty sure
that there are more sockmap users beyond RDS, and turning off CONFIG_RDS
by itself is not too useful for further investigation of this case.

Dmitry
On Wed, Sep 11, 2024 at 12:51:04PM +0300, Dmitry Antipov wrote:
> On 9/11/24 7:32 AM, Cong Wang wrote:
>
> > I never tested the RDS code (hence why I didn't post it). But as for the
> > warning itself, disabling CONFIG_RDS made it disappear on my side, which
> > is yet another reason why I suspect it is RDS related.
>
> OTOH, the sockmap code depends on CONFIG_BPF_SYSCALL. So I'm pretty sure
> that there are more sockmap users beyond RDS, and turning off CONFIG_RDS
> by itself is not too useful for further investigation of this case.

I guess you totally misunderstand my point. As a significant sockmap
contributor, I am certainly aware of sockmap users. My point is that I
needed to narrow the problem down to CONFIG_RDS when I was debugging it.

So, please let me know if you can still reproduce this after disabling
CONFIG_RDS, because I could not reproduce it any more. If you can,
please kindly share a stack trace without rds_* functions.

Thanks.
On 9/11/24 7:45 PM, Cong Wang wrote:

> I guess you totally misunderstand my point. As a significant sockmap
> contributor, I am certainly aware of sockmap users. My point is that I
> needed to narrow the problem down to CONFIG_RDS when I was debugging it.

I've narrowed the problem down to a possible race condition between two
functions. "Narrowing down" the problem to a 17.5 Kloc subsystem is not
too helpful.

> So, please let me know if you can still reproduce this after disabling
> CONFIG_RDS, because I could not reproduce it any more. If you can,
> please kindly share a stack trace without rds_* functions.

Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
the syzbot reproducer I'm working with doesn't create RDS sockets
explicitly (with 'socket(AF_RDS, ..., ...)' or so). When the two options
above are enabled, the default network namespace has a special
kernel-space socket which is created in 'rds_tcp_listen_init()' and (if
my understanding of the namespaces is correct) may be inherited with
'unshare(CLONE_NEWNET)'. So just enabling these two options makes the
kernel vulnerable.

So I'm still gently asking you to check whether there is the race
condition I've talked about. Hopefully this shouldn't be too hard for a
significant sockmap contributor.

Dmitry
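For illustration, a minimal user-space sketch of the namespace step
described above might look like the following. This is not the syzbot
reproducer; everything beyond the unshare() call itself is an assumption
based on the thread.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch only, not the syzbot reproducer. With CONFIG_RDS and
 * CONFIG_RDS_TCP enabled, RDS' pernet init hooks call
 * rds_tcp_listen_init(), so a freshly unshared network namespace already
 * contains the kernel-space RDS/TCP listening socket -- no AF_RDS socket
 * has to be created by the program itself.
 */
int main(void)
{
	/* CLONE_NEWNET needs CAP_SYS_ADMIN, or an unprivileged user
	 * namespace created first. */
	if (unshare(CLONE_NEWNET) < 0) {
		perror("unshare(CLONE_NEWNET)");
		return 1;
	}

	/* From here on, ordinary TCP connections inside this namespace can
	 * be steered into a sockmap via BPF sockops, which is where the
	 * conflict over sk->sk_user_data with RDS comes from. */
	return 0;
}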
On Thu, Sep 12, 2024 at 06:59:39PM +0300, Dmitry Antipov wrote:
> On 9/11/24 7:45 PM, Cong Wang wrote:
>
> > I guess you totally misunderstand my point. As a significant sockmap
> > contributor, I am certainly aware of sockmap users. My point is that I
> > needed to narrow the problem down to CONFIG_RDS when I was debugging it.
>
> I've narrowed the problem down to a possible race condition between two
> functions. "Narrowing down" the problem to a 17.5 Kloc subsystem is not
> too helpful.

Narrowing down from more than 30 million lines of code to 17.5K is
already a huge win to me, maybe not for you. :)

> > So, please let me know if you can still reproduce this after disabling
> > CONFIG_RDS, because I could not reproduce it any more. If you can,
> > please kindly share a stack trace without rds_* functions.
>
> Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
> the syzbot reproducer I'm working with doesn't create RDS sockets
> explicitly (with 'socket(AF_RDS, ..., ...)' or so). When the two options
> above are enabled, the default network namespace has a special
> kernel-space socket which is created in 'rds_tcp_listen_init()' and (if
> my understanding of the namespaces is correct) may be inherited with
> 'unshare(CLONE_NEWNET)'. So just enabling these two options makes the
> kernel vulnerable.

Thanks for confirming it.

I did notice the RDS kernel socket, but, without my patch, we can still
use sockops to hook the TCP socket under the RDS socket and add it to a
sockmap, hence the conflict over sock->sk->sk_user_data.

My patch basically prevents such a TCP socket under an RDS socket from
being added to any sockmap.

> So I'm still gently asking you to check whether there is the race
> condition I've talked about. Hopefully this shouldn't be too hard for a
> significant sockmap contributor.

If you can kindly explain why this race condition is not related to RDS,
despite the fact that it only happens with CONFIG_RDS enabled, I'd be
happy to review it. Otherwise, I feel like you may be heading in the
wrong direction.

Thanks.
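For reference, the sockops path described above could be sketched roughly
as follows; the map layout, program name and chosen callbacks are
illustrative assumptions, not code from this thread.

/* Rough sketch of a sockops program that stashes established TCP sockets
 * (including the one RDS drives underneath) into a sockmap. Adding a
 * socket to the map installs the sockmap proto and psock, i.e. sockmap
 * takes over sk->sk_user_data, which is what conflicts with RDS.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} sock_map SEC(".maps");

SEC("sockops")
int stash_established(struct bpf_sock_ops *skops)
{
	__u32 key = 0;

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* Put the just-established socket into the sockmap. */
		bpf_sock_map_update(skops, &sock_map, &key, BPF_ANY);
		break;
	}
	return 0;
}

char _license[] SEC("license") = "GPL";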
+CC Allison

On 9/14/24 3:34 AM, Cong Wang wrote:
> On Thu, Sep 12, 2024 at 06:59:39PM +0300, Dmitry Antipov wrote:
>> On 9/11/24 7:45 PM, Cong Wang wrote:
>>
>>> I guess you totally misunderstand my point. As a significant sockmap
>>> contributor, I am certainly aware of sockmap users. My point is that I
>>> needed to narrow the problem down to CONFIG_RDS when I was debugging it.
>>
>> I've narrowed the problem down to a possible race condition between two
>> functions. "Narrowing down" the problem to a 17.5 Kloc subsystem is not
>> too helpful.
>
> Narrowing down from more than 30 million lines of code to 17.5K is
> already a huge win to me, maybe not for you. :)
>
>>> So, please let me know if you can still reproduce this after disabling
>>> CONFIG_RDS, because I could not reproduce it any more. If you can,
>>> please kindly share a stack trace without rds_* functions.
>>
>> Yes, this issue requires CONFIG_RDS and CONFIG_RDS_TCP to reproduce. But
>> the syzbot reproducer I'm working with doesn't create RDS sockets
>> explicitly (with 'socket(AF_RDS, ..., ...)' or so). When the two options
>> above are enabled, the default network namespace has a special
>> kernel-space socket which is created in 'rds_tcp_listen_init()' and (if
>> my understanding of the namespaces is correct) may be inherited with
>> 'unshare(CLONE_NEWNET)'. So just enabling these two options makes the
>> kernel vulnerable.
>
> Thanks for confirming it.
>
> I did notice the RDS kernel socket, but, without my patch, we can still
> use sockops to hook the TCP socket under the RDS socket and add it to a
> sockmap, hence the conflict over sock->sk->sk_user_data.
>
> My patch basically prevents such a TCP socket under an RDS socket from
> being added to any sockmap.
>
>> So I'm still gently asking you to check whether there is the race
>> condition I've talked about. Hopefully this shouldn't be too hard for a
>> significant sockmap contributor.
>
> If you can kindly explain why this race condition is not related to RDS,
> despite the fact that it only happens with CONFIG_RDS enabled, I'd be
> happy to review it. Otherwise, I feel like you may be heading in the
> wrong direction.
>
> Thanks.
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..fef4231fc872 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1649,7 +1649,7 @@ void sock_map_destroy(struct sock *sk)
 	struct sk_psock *psock;
 
 	rcu_read_lock();
-	psock = sk_psock_get(sk);
+	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		rcu_read_unlock();
 		saved_destroy = READ_ONCE(sk->sk_prot)->destroy;
@@ -1658,7 +1658,6 @@ void sock_map_destroy(struct sock *sk)
 		sock_map_remove_links(sk, psock);
 		rcu_read_unlock();
 		sk_psock_stop(psock);
-		sk_psock_put(sk, psock);
 	}
 	if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
 		return;
At https://syzkaller.appspot.com/bug?extid=f363afac6b0ace576f45, syzbot
has triggered the following race condition:

On CPU0, 'sk_psock_drop()' is running and has reached [1]:

void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
{
	write_lock_bh(&sk->sk_callback_lock);
	sk_psock_restore_proto(sk, psock);                        [1]
	rcu_assign_sk_user_data(sk, NULL);
	if (psock->progs.stream_parser)
		sk_psock_stop_strp(sk, psock);
	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
		sk_psock_stop_verdict(sk, psock);
	write_unlock_bh(&sk->sk_callback_lock);

	sk_psock_stop(psock);

	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
	queue_rcu_work(system_wq, &psock->rwork);
}

If 'sock_map_destroy()' is scheduled on CPU1 at the same time, psock is
always NULL at [2]. But, since [1] may be in progress during [3], the
value of 'saved_destroy' at this point is undefined:

void sock_map_destroy(struct sock *sk)
{
	void (*saved_destroy)(struct sock *sk);
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock_get(sk);                                 [2]
	if (unlikely(!psock)) {
		rcu_read_unlock();
		saved_destroy = READ_ONCE(sk->sk_prot)->destroy;  [3]
	} else {
		saved_destroy = psock->saved_destroy;
		sock_map_remove_links(sk, psock);
		rcu_read_unlock();
		sk_psock_stop(psock);
		sk_psock_put(sk, psock);
	}
	if (WARN_ON_ONCE(saved_destroy == sock_map_destroy))
		return;
	if (saved_destroy)
		saved_destroy(sk);
}

The following fix helps to avoid the race and does not introduce a psock
leak (when running the syzbot reproducer from above), but I'm not sure
whether the latter is always true in some other scenario. So comments
are highly appreciated.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
 net/core/sock_map.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)