
[v4,net,1/2] tcp: Fix use-after-free in inet_twsk_purge().

Message ID 20240307232151.55963-2-kuniyu@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series tcp/rds: Fix use-after-free around kernel TCP reqsk.

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 958 this patch: 958
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 973 this patch: 973
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 974 this patch: 974
netdev/checkpatch warning WARNING: line length of 87 exceeds 80 columns; WARNING: line length of 90 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-08--09-00 (tests: 890)

Commit Message

Kuniyuki Iwashima March 7, 2024, 11:21 p.m. UTC
Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") added changes in inet_twsk_purge() to purge
reqsk in per-netns ehash during netns dismantle.

inet_csk_reqsk_queue_drop_and_put() will remove the reqsk from the
per-netns ehash, but the iteration uses sk_nulls_for_each_rcu(), which
is not safe once an entry has been unlinked.  After removing a reqsk,
we need to restart the iteration.

Also, we need to use refcount_inc_not_zero() to make sure the reqsk
has not already been freed by its timer.

Fixes: 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in inet_twsk_purge()")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inet_timewait_sock.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
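
For readers skimming the thread, the two points above boil down to the
following excerpt of the loop body after this patch (condensed from the
diff below; the comments are added here for clarity and are not part of
the patch):

	req = inet_reqsk(sk);
	if (unlikely(!refcount_inc_not_zero(&req->rsk_refcnt)))
		continue;	/* reqsk already being freed by reqsk_timer_handler() */

	if (unlikely(sk->sk_family != family ||
		     refcount_read(&sock_net(sk)->ns.count))) {
		reqsk_put(req);	/* re-check under the reference failed; back off */
		continue;
	}

	inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
	goto restart;		/* unlinking invalidates the sk_nulls_for_each_rcu() walk */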

Comments

Eric Dumazet March 8, 2024, 4:35 p.m. UTC | #1
On Fri, Mar 8, 2024 at 12:22 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
> inet_twsk_purge()") added changes in inet_twsk_purge() to purge
> reqsk in per-netns ehash during netns dismantle.
>
> inet_csk_reqsk_queue_drop_and_put() will remove the reqsk from the
> per-netns ehash, but the iteration uses sk_nulls_for_each_rcu(), which
> is not safe once an entry has been unlinked.  After removing a reqsk,
> we need to restart the iteration.
>
> Also, we need to use refcount_inc_not_zero() to make sure the reqsk
> has not already been freed by its timer.
>
> Fixes: 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in inet_twsk_purge()")
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/inet_timewait_sock.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 5befa4de5b24..c81f83893fc7 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -278,18 +278,32 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
>  restart:
>                 sk_nulls_for_each_rcu(sk, node, &head->chain) {
>                         if (sk->sk_state != TCP_TIME_WAIT) {
> +                               struct request_sock *req;
> +
> +                               if (likely(sk->sk_state != TCP_NEW_SYN_RECV))
> +                                       continue;
> +
>                                 /* A kernel listener socket might not hold refcnt for net,
>                                  * so reqsk_timer_handler() could be fired after net is
>                                  * freed.  Userspace listener and reqsk never exist here.
>                                  */
> -                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> -                                            hashinfo->pernet)) {
> -                                       struct request_sock *req = inet_reqsk(sk);
>
> -                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> +                               if (sk->sk_family != family ||
> +                                   refcount_read(&sock_net(sk)->ns.count))
> +                                       continue;
> +
> +                               req = inet_reqsk(sk);
> +                               if (unlikely(!refcount_inc_not_zero(&req->rsk_refcnt)))
> +                                       continue;
> +
> +                               if (unlikely(sk->sk_family != family ||
> +                                            refcount_read(&sock_net(sk)->ns.count))) {
> +                                       reqsk_put(req);
> +                                       continue;
>                                 }
>
> -                               continue;
> +                               inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);

> +                               goto restart;
>                         }
>
>                         tw = inet_twsk(sk);
> --
> 2.30.2
>

Unfortunately, inet_csk_reqsk_queue_drop_and_put() also needs the
local_bh_disable() part.
Presumably this code never met LOCKDEP :/

What about factoring the code like this?
(Untested patch, only compiled)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5befa4de5b2416281ad2795713a70d0fd847b0b2..21a9b267fc8b843f8320767fb705de32f2c7b7b0 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -265,10 +265,9 @@ EXPORT_SYMBOL_GPL(__inet_twsk_schedule);

 void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
 {
-       struct inet_timewait_sock *tw;
-       struct sock *sk;
        struct hlist_nulls_node *node;
        unsigned int slot;
+       struct sock *sk;

        for (slot = 0; slot <= hashinfo->ehash_mask; slot++) {
                struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
@@ -277,38 +276,35 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
                rcu_read_lock();
restart:
                sk_nulls_for_each_rcu(sk, node, &head->chain) {
-                       if (sk->sk_state != TCP_TIME_WAIT) {
-                               /* A kernel listener socket might not hold refcnt for net,
-                                * so reqsk_timer_handler() could be fired after net is
-                                * freed.  Userspace listener and reqsk never exist here.
-                                */
-                               if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
-                                            hashinfo->pernet)) {
-                                       struct request_sock *req = inet_reqsk(sk);
-
-                                       inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
-                               }
+                       int state = inet_sk_state_load(sk);

+                       if ((1 << state) & ~(TCPF_TIME_WAIT |
+                                            TCPF_NEW_SYN_RECV))
                                continue;
-                       }

-                       tw = inet_twsk(sk);
-                       if ((tw->tw_family != family) ||
-                               refcount_read(&twsk_net(tw)->ns.count))
+                       if ((sk->sk_family != family) ||
+                               refcount_read(&sock_net(sk)->ns.count))
                                continue;

-                       if (unlikely(!refcount_inc_not_zero(&tw->tw_refcnt)))
+                       if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
                                continue;

-                       if (unlikely((tw->tw_family != family) ||
-                                    refcount_read(&twsk_net(tw)->ns.count))) {
-                               inet_twsk_put(tw);
+                       if (unlikely((sk->sk_family != family) ||
+                                    refcount_read(&sock_net(sk)->ns.count))) {
+                               sock_gen_put(sk);
                                goto restart;
                        }

                        rcu_read_unlock();
                        local_bh_disable();
-                       inet_twsk_deschedule_put(tw);
+                       if (state == TCP_TIME_WAIT) {
+                               inet_twsk_deschedule_put(inet_twsk(sk));
+                       } else {
+                               struct request_sock *req = inet_reqsk(sk);
+
+                               inet_csk_reqsk_queue_drop_and_put(req->rsk_listener,
+                                                                 req);
+                       }
                        local_bh_enable();
                        goto restart_rcu;
                }
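
Stripped of the diff markers, the tail of the loop in this suggestion
would read roughly as follows (comments added for clarity; not part of
the proposal):

	rcu_read_unlock();
	local_bh_disable();	/* both put paths below need BHs disabled */
	if (state == TCP_TIME_WAIT) {
		inet_twsk_deschedule_put(inet_twsk(sk));
	} else {
		struct request_sock *req = inet_reqsk(sk);

		inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
	}
	local_bh_enable();
	goto restart_rcu;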

Patch

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5befa4de5b24..c81f83893fc7 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -278,18 +278,32 @@  void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
 restart:
 		sk_nulls_for_each_rcu(sk, node, &head->chain) {
 			if (sk->sk_state != TCP_TIME_WAIT) {
+				struct request_sock *req;
+
+				if (likely(sk->sk_state != TCP_NEW_SYN_RECV))
+					continue;
+
 				/* A kernel listener socket might not hold refcnt for net,
 				 * so reqsk_timer_handler() could be fired after net is
 				 * freed.  Userspace listener and reqsk never exist here.
 				 */
-				if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
-					     hashinfo->pernet)) {
-					struct request_sock *req = inet_reqsk(sk);
 
-					inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
+				if (sk->sk_family != family ||
+				    refcount_read(&sock_net(sk)->ns.count))
+					continue;
+
+				req = inet_reqsk(sk);
+				if (unlikely(!refcount_inc_not_zero(&req->rsk_refcnt)))
+					continue;
+
+				if (unlikely(sk->sk_family != family ||
+					     refcount_read(&sock_net(sk)->ns.count))) {
+					reqsk_put(req);
+					continue;
 				}
 
-				continue;
+				inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
+				goto restart;
 			}
 
 			tw = inet_twsk(sk);