Message ID | 20240408161336.612064-2-mhal@rbox.co (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | af_unix: Garbage collector vs connect() race condition |
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 8 Apr 2024 17:58:45 +0200
> Garbage collector does not take into account the risk of embryo getting
> enqueued during the garbage collection. If such embryo has a peer that
> carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
> different set of children. Leading to an incorrectly elevated inflight
> count, and then a dangling pointer within the gc_inflight_list.
>
> sockets are AF_UNIX/SOCK_STREAM
> S is an unconnected socket
> L is a listening in-flight socket bound to addr, not in fdtable
> V's fd will be passed via sendmsg(), gets inflight count bumped
>
> connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
> ----------------    -------------------------   -----------
>
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> unix_state_lock(L)
> unix_peer(S) = NS
>                     // V count=1 inflight=0
>
>                     NS = unix_peer(S)
>                     skb2 = sock_alloc()
>                     skb_queue_tail(NS, skb2[V])
>
>                     // V became in-flight
>                     // V count=2 inflight=1
>
>                     close(V)
>
>                     // V count=1 inflight=1
>                     // GC candidate condition met
>
>                                                 for u in gc_inflight_list:
>                                                   if (total_refs == inflight_refs)
>                                                     add u to gc_candidates
>
>                                                 // gc_candidates={L, V}
>
>                                                 for u in gc_candidates:
>                                                   scan_children(u, dec_inflight)
>
>                                                 // embryo (skb1) was not
>                                                 // reachable from L yet, so V's
>                                                 // inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
>                                                 for u in gc_candidates:
>                                                   if (u.inflight)
>                                                     scan_children(u, inc_inflight_move_tail)
>
>                                                 // V count=1 inflight=2 (!)
>
> If there is a GC-candidate listening socket, lock/unlock its state. This
> makes GC wait until the end of any ongoing connect() to that socket. After
> flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
> there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
> already took the unix_gc_lock.
>
> Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/unix/garbage.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index fa39b6265238..cd3e8585ceb2 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
>  	 * receive queues. Other, non candidate sockets _can_ be
>  	 * added to queue, so we must make sure only to touch
>  	 * candidates.
> +	 *
> +	 * Embryos, though never candidates themselves, affect which
> +	 * candidates are reachable by the garbage collector. Before
> +	 * being added to a listener's queue, an embryo may already
> +	 * receive data carrying SCM_RIGHTS, potentially making the
> +	 * passed socket a candidate that is not yet reachable by the
> +	 * collector. It becomes reachable once the embryo is
> +	 * enqueued. Therefore, we must ensure that no SCM-laden
> +	 * embryo appears in a (candidate) listener's queue between
> +	 * consecutive scan_children() calls.
>  	 */
>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> -		long total_refs;
> -
> -		total_refs = file_count(u->sk.sk_socket->file);
> +		struct sock *sk = &u->sk;
> +		long total_refs = file_count(sk->sk_socket->file);
>
>  		WARN_ON_ONCE(!u->inflight);
>  		WARN_ON_ONCE(total_refs < u->inflight);
> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>  			list_move_tail(&u->link, &gc_candidates);
>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> +
> +			if (sk->sk_state == TCP_LISTEN) {
> +				unix_state_lock(sk);
> +				unix_state_unlock(sk);

Less likely though, what if the same connect() happens after this ?

connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
----------------    -------------------------   -----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
                                                for u in gc_inflight_list:
                                                  if (total_refs == inflight_refs)
                                                    add u to gc_candidates
                                                      // L was already traversed
                                                      // in a previous iteration.
unix_state_lock(L)
unix_peer(S) = NS

                                                // gc_candidates={L, V}

                                                for u in gc_candidates:
                                                  scan_children(u, dec_inflight)

                                                // embryo (skb1) was not
                                                // reachable from L yet, so V's
                                                // inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
                                                for u in gc_candidates:
                                                  if (u.inflight)
                                                    scan_children(u, inc_inflight_move_tail)

                                                // V count=1 inflight=2 (!)

As you pointed out, this GC's assumption is basically wrong; the GC
works correctly only when the set of traversed sockets does not change
over 3 scan_children() calls.

That's why I reworked the GC not to rely on receive queue.
https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/

> +			}
>  		}
>  	}
>
> --
> 2.44.0
>
On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 8 Apr 2024 17:58:45 +0200
...
>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>> -		long total_refs;
>> -
>> -		total_refs = file_count(u->sk.sk_socket->file);
>> +		struct sock *sk = &u->sk;
>> +		long total_refs = file_count(sk->sk_socket->file);
>>
>>  		WARN_ON_ONCE(!u->inflight);
>>  		WARN_ON_ONCE(total_refs < u->inflight);
>> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>>  			list_move_tail(&u->link, &gc_candidates);
>>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
>> +
>> +			if (sk->sk_state == TCP_LISTEN) {
>> +				unix_state_lock(sk);
>> +				unix_state_unlock(sk);
>
> Less likely though, what if the same connect() happens after this ?
>
> connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
> ----------------    -------------------------   -----------
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
>                                                 for u in gc_inflight_list:
>                                                   if (total_refs == inflight_refs)
>                                                     add u to gc_candidates
>                                                       // L was already traversed
>                                                       // in a previous iteration.
> unix_state_lock(L)
> unix_peer(S) = NS
>
>                                                 // gc_candidates={L, V}
>
>                                                 for u in gc_candidates:
>                                                   scan_children(u, dec_inflight)
>
>                                                 // embryo (skb1) was not
>                                                 // reachable from L yet, so V's
>                                                 // inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
>                                                 for u in gc_candidates:
>                                                   if (u.inflight)
>                                                     scan_children(u, inc_inflight_move_tail)
>
>                                                 // V count=1 inflight=2 (!)

If I understand your question, in this case L's queue technically does change
between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
(i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
same unix_gc_lock.

Is there something I'm missing?

> As you pointed out, this GC's assumption is basically wrong; the GC
> works correctly only when the set of traversed sockets does not change
> over 3 scan_children() calls.
>
> That's why I reworked the GC not to rely on receive queue.
> https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/

Right, I'll try to get my head around your series :)
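The argument in the reply above (no new SCM_RIGHTS reference can be registered while the collector holds unix_gc_lock, because unix_inflight() takes the same lock) can be sketched as a small single-threaded userspace model. Everything here is hypothetical: `toy_gc_state`, `toy_unix_inflight()` and `toy_edges_added_during_gc()` are illustrative names, and a plain flag stands in for the spinlock. In the kernel the sender would spin on the lock rather than return; the model only shows that the inflight graph cannot gain an edge during the collection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state, not a kernel structure. */
struct toy_gc_state {
	bool gc_lock_held;	/* stands in for unix_gc_lock */
	int inflight_edges;	/* number of fds currently in flight */
};

/*
 * Stand-in for the sender path. The real unix_inflight() would wait
 * for unix_gc_lock; this model just reports that it cannot proceed.
 */
static bool toy_unix_inflight(struct toy_gc_state *st)
{
	if (st->gc_lock_held)
		return false;
	st->inflight_edges++;
	return true;
}

/* Count how many inflight edges appear while the "collector" runs. */
static int toy_edges_added_during_gc(struct toy_gc_state *st, int attempts)
{
	int before = st->inflight_edges;
	int i;

	st->gc_lock_held = true;	/* collector starts */
	for (i = 0; i < attempts; i++)
		toy_unix_inflight(st);	/* senders try to pass fds */
	st->gc_lock_held = false;	/* collector finishes */

	return st->inflight_edges - before;
}
```

In this model an embryo may still be enqueued on the listener during the collection, but it cannot be carrying a freshly passed fd, which is the property the reply relies on.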
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 01:25:23 +0200
> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 8 Apr 2024 17:58:45 +0200
> ...
> >>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {

Please move sk declaration here and

> >> -		long total_refs;
> >> -
> >> -		total_refs = file_count(u->sk.sk_socket->file);

keep these 3 lines for reverse xmas tree order.

> >> +		struct sock *sk = &u->sk;
> >> +		long total_refs = file_count(sk->sk_socket->file);
> >>
> >>  		WARN_ON_ONCE(!u->inflight);
> >>  		WARN_ON_ONCE(total_refs < u->inflight);
> >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
> >>  			list_move_tail(&u->link, &gc_candidates);
> >>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
> >>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> >> +
> >> +			if (sk->sk_state == TCP_LISTEN) {
> >> +				unix_state_lock(sk);
> >> +				unix_state_unlock(sk);
> >
> > Less likely though, what if the same connect() happens after this ?
> >
> > connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
> > ----------------    -------------------------   -----------
> > NS = unix_create1()
> > skb1 = sock_wmalloc(NS)
> > L = unix_find_other(addr)
> >                                                 for u in gc_inflight_list:
> >                                                   if (total_refs == inflight_refs)
> >                                                     add u to gc_candidates
> >                                                       // L was already traversed
> >                                                       // in a previous iteration.
> > unix_state_lock(L)
> > unix_peer(S) = NS
> >
> >                                                 // gc_candidates={L, V}
> >
> >                                                 for u in gc_candidates:
> >                                                   scan_children(u, dec_inflight)
> >
> >                                                 // embryo (skb1) was not
> >                                                 // reachable from L yet, so V's
> >                                                 // inflight remains unchanged
> > __skb_queue_tail(L, skb1)
> > unix_state_unlock(L)
> >                                                 for u in gc_candidates:
> >                                                   if (u.inflight)
> >                                                     scan_children(u, inc_inflight_move_tail)
> >
> >                                                 // V count=1 inflight=2 (!)
>
> If I understand your question, in this case L's queue technically does change
> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> same unix_gc_lock.
>
> Is there something I'm missing?

Ah exactly, you are right.

Could you repost this patch only with my comment above addressed ?

Thanks!
On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Tue, 9 Apr 2024 01:25:23 +0200
>> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 8 Apr 2024 17:58:45 +0200
>> ...
>>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>
> Please move sk declaration here and
>
>>>> -		long total_refs;
>>>> -
>>>> -		total_refs = file_count(u->sk.sk_socket->file);
>
> keep these 3 lines for reverse xmas tree order.

Tricky to have them all 3 in reverse xmas. Did you mean

	struct sock *sk = &u->sk;
	long total_refs;

	total_refs = file_count(sk->sk_socket->file);

?

>>> connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
>>> ----------------    -------------------------   -----------
>>> NS = unix_create1()
>>> skb1 = sock_wmalloc(NS)
>>> L = unix_find_other(addr)
>>>                                                 for u in gc_inflight_list:
>>>                                                   if (total_refs == inflight_refs)
>>>                                                     add u to gc_candidates
>>>                                                       // L was already traversed
>>>                                                       // in a previous iteration.
>>> unix_state_lock(L)
>>> unix_peer(S) = NS
>>>
>>>                                                 // gc_candidates={L, V}
>>>
>>>                                                 for u in gc_candidates:
>>>                                                   scan_children(u, dec_inflight)
>>>
>>>                                                 // embryo (skb1) was not
>>>                                                 // reachable from L yet, so V's
>>>                                                 // inflight remains unchanged
>>> __skb_queue_tail(L, skb1)
>>> unix_state_unlock(L)
>>>                                                 for u in gc_candidates:
>>>                                                   if (u.inflight)
>>>                                                     scan_children(u, inc_inflight_move_tail)
>>>
>>>                                                 // V count=1 inflight=2 (!)
>>
>> If I understand your question, in this case L's queue technically does change
>> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
>> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
>> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
>> same unix_gc_lock.
>>
>> Is there something I'm missing?
>
> Ah exactly, you are right.
>
> Could you repost this patch only with my comment above addressed ?

Yeah, sure. One question though: what I wrote above is basically a rephrasing of
the commit message:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another connect() coming, its embryo won't
    carry SCM_RIGHTS as we already took the unix_gc_lock.

As I understand, the important missing part was the clarification that embryo,
even though enqueued after the lock flipping, won't affect the inflight graph,
right? So how about:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another embryo coming, it can not possibly carry
    SCM_RIGHTS. At this point, unix_inflight() can not happen because
    unix_gc_lock is already taken. Inflight graph remains unaffected.

Thanks!
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 11:16:35 +0200
> On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Tue, 9 Apr 2024 01:25:23 +0200
> >> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon, 8 Apr 2024 17:58:45 +0200
> >> ...
> >>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> >
> > Please move sk declaration here and
> >
> >>>> -		long total_refs;
> >>>> -
> >>>> -		total_refs = file_count(u->sk.sk_socket->file);
> >
> > keep these 3 lines for reverse xmas tree order.
>
> Tricky to have them all 3 in reverse xmas. Did you mean
>
> 	struct sock *sk = &u->sk;
> 	long total_refs;
>
> 	total_refs = file_count(sk->sk_socket->file);
>
> ?

Yes, it's netdev convention.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

>
> >>> connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
> >>> ----------------    -------------------------   -----------
> >>> NS = unix_create1()
> >>> skb1 = sock_wmalloc(NS)
> >>> L = unix_find_other(addr)
> >>>                                                 for u in gc_inflight_list:
> >>>                                                   if (total_refs == inflight_refs)
> >>>                                                     add u to gc_candidates
> >>>                                                       // L was already traversed
> >>>                                                       // in a previous iteration.
> >>> unix_state_lock(L)
> >>> unix_peer(S) = NS
> >>>
> >>>                                                 // gc_candidates={L, V}
> >>>
> >>>                                                 for u in gc_candidates:
> >>>                                                   scan_children(u, dec_inflight)
> >>>
> >>>                                                 // embryo (skb1) was not
> >>>                                                 // reachable from L yet, so V's
> >>>                                                 // inflight remains unchanged
> >>> __skb_queue_tail(L, skb1)
> >>> unix_state_unlock(L)
> >>>                                                 for u in gc_candidates:
> >>>                                                   if (u.inflight)
> >>>                                                     scan_children(u, inc_inflight_move_tail)
> >>>
> >>>                                                 // V count=1 inflight=2 (!)
> >>
> >> If I understand your question, in this case L's queue technically does change
> >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> >> same unix_gc_lock.
> >>
> >> Is there something I'm missing?
> >
> > Ah exactly, you are right.
> >
> > Could you repost this patch only with my comment above addressed ?
>
> Yeah, sure. One question though: what I wrote above is basically a rephrasing of
> the commit message:
>
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another connect() coming, its embryo won't
>     carry SCM_RIGHTS as we already took the unix_gc_lock.
>
> As I understand, the important missing part was the clarification that embryo,
> even though enqueued after the lock flipping, won't affect the inflight graph,
> right? So how about:
>
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another embryo coming, it can not possibly carry
>     SCM_RIGHTS. At this point, unix_inflight() can not happen because
>     unix_gc_lock is already taken. Inflight graph remains unaffected.

Sounds good to me.

Thanks!
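The declaration ordering agreed on above can be shown in isolation. This is only a sketch with made-up stand-in types (`toy_file`, `toy_socket`, `toy_sock`, `toy_unix_sock`) and a hypothetical helper `toy_total_refs()`; it mirrors the shape of the snippet quoted in the thread, not the actual kernel structures. Declarations run from the longest line to the shortest, and the initialisation that depends on `sk` moves out of the declaration block so the ordering can be kept.

```c
#include <assert.h>

/* Made-up stand-ins for the kernel types, for illustration only. */
struct toy_file { long count; };
struct toy_socket { struct toy_file *file; };
struct toy_sock { struct toy_socket *sk_socket; };
struct toy_unix_sock { struct toy_sock sk; };

static long toy_total_refs(struct toy_unix_sock *u)
{
	struct toy_sock *sk = &u->sk;	/* longest declaration line first */
	long total_refs;		/* shorter line below it */

	/* Assigned separately so the declarations stay ordered by length. */
	total_refs = sk->sk_socket->file->count;

	return total_refs;
}
```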
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..cd3e8585ceb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
 	 * receive queues. Other, non candidate sockets _can_ be
 	 * added to queue, so we must make sure only to touch
 	 * candidates.
+	 *
+	 * Embryos, though never candidates themselves, affect which
+	 * candidates are reachable by the garbage collector. Before
+	 * being added to a listener's queue, an embryo may already
+	 * receive data carrying SCM_RIGHTS, potentially making the
+	 * passed socket a candidate that is not yet reachable by the
+	 * collector. It becomes reachable once the embryo is
+	 * enqueued. Therefore, we must ensure that no SCM-laden
+	 * embryo appears in a (candidate) listener's queue between
+	 * consecutive scan_children() calls.
 	 */
 	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
+		struct sock *sk = &u->sk;
+		long total_refs = file_count(sk->sk_socket->file);

 		WARN_ON_ONCE(!u->inflight);
 		WARN_ON_ONCE(total_refs < u->inflight);
@@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
 			list_move_tail(&u->link, &gc_candidates);
 			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
 			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+
+			if (sk->sk_state == TCP_LISTEN) {
+				unix_state_lock(sk);
+				unix_state_unlock(sk);
+			}
 		}
 	}
Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)    sendmsg(S, [V]); close(V)   __unix_gc()
----------------    -------------------------   -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
                    // V count=1 inflight=0

                    NS = unix_peer(S)
                    skb2 = sock_alloc()
                    skb_queue_tail(NS, skb2[V])

                    // V became in-flight
                    // V count=2 inflight=1

                    close(V)

                    // V count=1 inflight=1
                    // GC candidate condition met

                                                for u in gc_inflight_list:
                                                  if (total_refs == inflight_refs)
                                                    add u to gc_candidates

                                                // gc_candidates={L, V}

                                                for u in gc_candidates:
                                                  scan_children(u, dec_inflight)

                                                // embryo (skb1) was not
                                                // reachable from L yet, so V's
                                                // inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
                                                for u in gc_candidates:
                                                  if (u.inflight)
                                                    scan_children(u, inc_inflight_move_tail)

                                                // V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
already took the unix_gc_lock.

Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/garbage.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
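The counting error described in the commit message can be reproduced with a deliberately simplified, single-threaded userspace model. All names here (`toy_sock`, `toy_listener`, `toy_finish_connect()`, `toy_scan_children()`, `toy_gc()`) are hypothetical and nothing is taken from net/unix/garbage.c. The model assumes one listener whose pending embryo carries a single reference to V, and it collapses the collector into one decrement pass and one increment pass, with the racing connect() completing between them.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a unix socket: only the inflight counter matters. */
struct toy_sock {
	int inflight;
};

/*
 * Toy listener: at most one embryo, whose own queue holds one skb
 * carrying a reference to `passed` (V in the diagram).
 */
struct toy_listener {
	struct toy_sock *passed;
	bool embryo_enqueued;
	bool connect_pending;
};

/* Stand-in for the last step of connect(): __skb_queue_tail(L, skb1). */
static void toy_finish_connect(struct toy_listener *l)
{
	if (l->connect_pending) {
		l->embryo_enqueued = true;
		l->connect_pending = false;
	}
}

/* Stand-in for scan_children(): only an already enqueued embryo is seen. */
static void toy_scan_children(struct toy_listener *l, int delta)
{
	if (l->embryo_enqueued)
		l->passed->inflight += delta;
}

/* Returns V's inflight count after the decrement and increment passes. */
static int toy_gc(bool flip_listener_lock)
{
	struct toy_listener l = { .connect_pending = true };
	struct toy_sock v = { .inflight = 1 };

	l.passed = &v;

	/* Locking and unlocking the listener waits for the ongoing connect(). */
	if (flip_listener_lock)
		toy_finish_connect(&l);

	toy_scan_children(&l, -1);	/* dec_inflight pass */
	toy_finish_connect(&l);		/* racing connect() completes, if still pending */
	toy_scan_children(&l, +1);	/* inc_inflight_move_tail pass */

	return v.inflight;
}
```

In this model `toy_gc(false)` leaves V with an inflight count of 2, matching the `inflight=2 (!)` line in the diagram, while `toy_gc(true)` leaves it at 1 because both passes see the same queue.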