[net,1/2] af_unix: Fix garbage collector racing against connect()

Message ID 20240408161336.612064-2-mhal@rbox.co (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series af_unix: Garbage collector vs connect() race condition

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 944 this patch: 944
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 6 maintainers
netdev/build_clang success Errors and warnings before: 954 this patch: 954
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 955 this patch: 955
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 34 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Michal Luczaj April 8, 2024, 3:58 p.m. UTC
The garbage collector does not take into account the risk of an embryo getting
enqueued during garbage collection. If such an embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children, leading to an incorrectly elevated inflight
count and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
			// V count=1 inflight=0

 			NS = unix_peer(S)
 			skb2 = sock_alloc()
			skb_queue_tail(NS, skb2[V])

			// V became in-flight
			// V count=2 inflight=1

			close(V)

			// V count=1 inflight=1
			// GC candidate condition met

						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
already took the unix_gc_lock.

Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/garbage.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
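
For readers less familiar with fd passing, here is a minimal userspace sketch of
the "sendmsg(S, [V])" step in the diagram above, i.e. how V's fd gets attached
as SCM_RIGHTS ancillary data and thereby has its inflight count bumped. It is
illustrative only, not a reproducer: hitting the race additionally requires L to
be an in-flight listener absent from any fdtable and precise timing against
__unix_gc(). The names send_fd(), sock and fd_to_pass are made up for the
example.

/* Illustrative sketch only, not part of the patch: pass fd_to_pass (V)
 * over an already connected AF_UNIX stream socket `sock` (S) as
 * SCM_RIGHTS ancillary data.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t send_fd(int sock, int fd_to_pass)
{
	char data = 'x';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u = { 0 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	/* On success the kernel records the passed socket as in-flight,
	 * which is the "V count=2 inflight=1" step in the diagram above.
	 */
	return sendmsg(sock, &msg, 0);
}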

Comments

Kuniyuki Iwashima April 8, 2024, 9:18 p.m. UTC | #1
From: Michal Luczaj <mhal@rbox.co>
Date: Mon,  8 Apr 2024 17:58:45 +0200
> The garbage collector does not take into account the risk of an embryo getting
> enqueued during garbage collection. If such an embryo has a peer that
> carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
> different set of children, leading to an incorrectly elevated inflight
> count and then a dangling pointer within the gc_inflight_list.
> 
> sockets are AF_UNIX/SOCK_STREAM
> S is an unconnected socket
> L is a listening in-flight socket bound to addr, not in fdtable
> V's fd will be passed via sendmsg(), gets inflight count bumped
> 
> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> ----------------	-------------------------	-----------
> 
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> unix_state_lock(L)
> unix_peer(S) = NS
> 			// V count=1 inflight=0
> 
>  			NS = unix_peer(S)
>  			skb2 = sock_alloc()
> 			skb_queue_tail(NS, skb2[V])
> 
> 			// V became in-flight
> 			// V count=2 inflight=1
> 
> 			close(V)
> 
> 			// V count=1 inflight=1
> 			// GC candidate condition met
> 
> 						for u in gc_inflight_list:
> 						  if (total_refs == inflight_refs)
> 						    add u to gc_candidates
> 
> 						// gc_candidates={L, V}
> 
> 						for u in gc_candidates:
> 						  scan_children(u, dec_inflight)
> 
> 						// embryo (skb1) was not
> 						// reachable from L yet, so V's
> 						// inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> 						for u in gc_candidates:
> 						  if (u.inflight)
> 						    scan_children(u, inc_inflight_move_tail)
> 
> 						// V count=1 inflight=2 (!)
> 
> If there is a GC-candidate listening socket, lock/unlock its state. This
> makes GC wait until the end of any ongoing connect() to that socket. After
> flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
> there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
> already took the unix_gc_lock.
> 
> Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/unix/garbage.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index fa39b6265238..cd3e8585ceb2 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
>  	 * receive queues.  Other, non candidate sockets _can_ be
>  	 * added to queue, so we must make sure only to touch
>  	 * candidates.
> +	 *
> +	 * Embryos, though never candidates themselves, affect which
> +	 * candidates are reachable by the garbage collector.  Before
> +	 * being added to a listener's queue, an embryo may already
> +	 * receive data carrying SCM_RIGHTS, potentially making the
> +	 * passed socket a candidate that is not yet reachable by the
> +	 * collector.  It becomes reachable once the embryo is
> +	 * enqueued.  Therefore, we must ensure that no SCM-laden
> +	 * embryo appears in a (candidate) listener's queue between
> +	 * consecutive scan_children() calls.
>  	 */
>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> -		long total_refs;
> -
> -		total_refs = file_count(u->sk.sk_socket->file);
> +		struct sock *sk = &u->sk;
> +		long total_refs = file_count(sk->sk_socket->file);
>  
>  		WARN_ON_ONCE(!u->inflight);
>  		WARN_ON_ONCE(total_refs < u->inflight);
> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>  			list_move_tail(&u->link, &gc_candidates);
>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> +
> +			if (sk->sk_state == TCP_LISTEN) {
> +				unix_state_lock(sk);
> +				unix_state_unlock(sk);

Less likely though, what if the same connect() happens after this ?

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates
						    // L was already traversed
						    // in a previous iteration.
unix_state_lock(L)
unix_peer(S) = NS

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)


As you pointed out, this GC's assumption is basically wrong; the GC
works correctly only when the set of traversed sockets does not change
over 3 scan_children() calls.

That's why I reworked the GC not to rely on receive queue.
https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/


> +			}
>  		}
>  	}
>  
> -- 
> 2.44.0
>
Michal Luczaj April 8, 2024, 11:25 p.m. UTC | #2
On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon,  8 Apr 2024 17:58:45 +0200
...
>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>> -		long total_refs;
>> -
>> -		total_refs = file_count(u->sk.sk_socket->file);
>> +		struct sock *sk = &u->sk;
>> +		long total_refs = file_count(sk->sk_socket->file);
>>  
>>  		WARN_ON_ONCE(!u->inflight);
>>  		WARN_ON_ONCE(total_refs < u->inflight);
>> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>>  			list_move_tail(&u->link, &gc_candidates);
>>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
>> +
>> +			if (sk->sk_state == TCP_LISTEN) {
>> +				unix_state_lock(sk);
>> +				unix_state_unlock(sk);
> 
> Less likely though, what if the same connect() happens after this ?
> 
> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> ----------------	-------------------------	-----------
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> 						for u in gc_inflight_list:
> 						  if (total_refs == inflight_refs)
> 						    add u to gc_candidates
> 						    // L was already traversed
> 						    // in a previous iteration.
> unix_state_lock(L)
> unix_peer(S) = NS
> 
> 						// gc_candidates={L, V}
> 
> 						for u in gc_candidates:
> 						  scan_children(u, dec_inflight)
> 
> 						// embryo (skb1) was not
> 						// reachable from L yet, so V's
> 						// inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> 						for u in gc_candidates:
> 						  if (u.inflight)
> 						    scan_children(u, inc_inflight_move_tail)
> 
> 						// V count=1 inflight=2 (!)

If I understand your question, in this case L's queue technically does change
between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
(i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
same unix_gc_lock.

Is there something I'm missing?
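
A condensed way to see it, paraphrasing the two code paths rather than quoting
the kernel sources verbatim (the bodies below are reduced to comments): the path
that attaches an SCM_RIGHTS fd and the collector serialize on the same spinlock,
so nothing new can become in-flight while __unix_gc() runs.

/* Paraphrased sketch, not verbatim kernel code. */

void unix_inflight(struct user_struct *user, struct sock *sk)
{
	spin_lock(&unix_gc_lock);
	/* Record sk as in-flight: bump its inflight counter and make sure
	 * it sits on gc_inflight_list.  Attaching an SCM_RIGHTS fd to an
	 * skb goes through here.
	 */
	spin_unlock(&unix_gc_lock);
}

static void __unix_gc(struct work_struct *work)
{
	spin_lock(&unix_gc_lock);
	/* Candidate selection and the scan_children() passes run with the
	 * lock held, so unix_inflight() cannot interleave with them; an
	 * embryo enqueued meanwhile carries no SCM_RIGHTS and leaves the
	 * in-flight graph untouched.
	 */
	spin_unlock(&unix_gc_lock);
}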

> As you pointed out, this GC's assumption is basically wrong; the GC
> works correctly only when the set of traversed sockets does not change
> over 3 scan_children() calls.
> 
> That's why I reworked the GC not to rely on receive queue.
> https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/

Right, I'll try to get my head around your series :)
Kuniyuki Iwashima April 9, 2024, 12:22 a.m. UTC | #3
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 01:25:23 +0200
> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon,  8 Apr 2024 17:58:45 +0200
> ...
> >>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {

Please move sk declaration here and


> >> -		long total_refs;
> >> -
> >> -		total_refs = file_count(u->sk.sk_socket->file);

keep these 3 lines for reverse xmas tree order.


> >> +		struct sock *sk = &u->sk;
> >> +		long total_refs = file_count(sk->sk_socket->file);
> >>  
> >>  		WARN_ON_ONCE(!u->inflight);
> >>  		WARN_ON_ONCE(total_refs < u->inflight);
> >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
> >>  			list_move_tail(&u->link, &gc_candidates);
> >>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
> >>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> >> +
> >> +			if (sk->sk_state == TCP_LISTEN) {
> >> +				unix_state_lock(sk);
> >> +				unix_state_unlock(sk);
> > 
> > Less likely though, what if the same connect() happens after this ?
> > 
> > connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> > ----------------	-------------------------	-----------
> > NS = unix_create1()
> > skb1 = sock_wmalloc(NS)
> > L = unix_find_other(addr)
> > 						for u in gc_inflight_list:
> > 						  if (total_refs == inflight_refs)
> > 						    add u to gc_candidates
> > 						    // L was already traversed
> > 						    // in a previous iteration.
> > unix_state_lock(L)
> > unix_peer(S) = NS
> > 
> > 						// gc_candidates={L, V}
> > 
> > 						for u in gc_candidates:
> > 						  scan_children(u, dec_inflight)
> > 
> > 						// embryo (skb1) was not
> > 						// reachable from L yet, so V's
> > 						// inflight remains unchanged
> > __skb_queue_tail(L, skb1)
> > unix_state_unlock(L)
> > 						for u in gc_candidates:
> > 						  if (u.inflight)
> > 						    scan_children(u, inc_inflight_move_tail)
> > 
> > 						// V count=1 inflight=2 (!)
> 
> If I understand your question, in this case L's queue technically does change
> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> same unix_gc_lock.
> 
> Is there something I'm missing?

Ah exactly, you are right.

Could you repost this patch only with my comment above addressed ?

Thanks!
Michal Luczaj April 9, 2024, 9:16 a.m. UTC | #4
On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Tue, 9 Apr 2024 01:25:23 +0200
>> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon,  8 Apr 2024 17:58:45 +0200
>> ...
>>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> 
> Please move sk declaration here and
> 
>>>> -		long total_refs;
>>>> -
>>>> -		total_refs = file_count(u->sk.sk_socket->file);
> 
> keep these 3 lines for reverse xmas tree order.

Tricky to have all 3 of them in reverse xmas tree order. Did you mean

	struct sock *sk = &u->sk;
	long total_refs;

	total_refs = file_count(sk->sk_socket->file);

?

>>> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
>>> ----------------	-------------------------	-----------
>>> NS = unix_create1()
>>> skb1 = sock_wmalloc(NS)
>>> L = unix_find_other(addr)
>>> 						for u in gc_inflight_list:
>>> 						  if (total_refs == inflight_refs)
>>> 						    add u to gc_candidates
>>> 						    // L was already traversed
>>> 						    // in a previous iteration.
>>> unix_state_lock(L)
>>> unix_peer(S) = NS
>>>
>>> 						// gc_candidates={L, V}
>>>
>>> 						for u in gc_candidates:
>>> 						  scan_children(u, dec_inflight)
>>>
>>> 						// embryo (skb1) was not
>>> 						// reachable from L yet, so V's
>>> 						// inflight remains unchanged
>>> __skb_queue_tail(L, skb1)
>>> unix_state_unlock(L)
>>> 						for u in gc_candidates:
>>> 						  if (u.inflight)
>>> 						    scan_children(u, inc_inflight_move_tail)
>>>
>>> 						// V count=1 inflight=2 (!)
>>
>> If I understand your question, in this case L's queue technically does change
>> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
>> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
>> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
>> same unix_gc_lock.
>>
>> Is there something I'm missing?
> 
> Ah exactly, you are right.
> 
> Could you repost this patch only with my comment above addressed ?

Yeah, sure. One question though: what I wrote above is basically a rephrasing of
the commit message:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another connect() coming, its embryo won't
    carry SCM_RIGHTS as we already took the unix_gc_lock.

As I understand, the important missing part was the clarification that embryo,
even though enqueued after the lock flipping, won't affect the inflight graph,
right? So how about:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another embryo coming, it can not possibly carry
    SCM_RIGHTS. At this point, unix_inflight() can not happen because
    unix_gc_lock is already taken. Inflight graph remains unaffected.

Thanks!
Kuniyuki Iwashima April 9, 2024, 7:02 p.m. UTC | #5
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 11:16:35 +0200
> On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Tue, 9 Apr 2024 01:25:23 +0200
> >> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon,  8 Apr 2024 17:58:45 +0200
> >> ...
> >>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> > 
> > Please move sk declaration here and
> > 
> >>>> -		long total_refs;
> >>>> -
> >>>> -		total_refs = file_count(u->sk.sk_socket->file);
> > 
> > keep these 3 lines for reverse xmas tree order.
> 
> Tricky to have all 3 of them in reverse xmas tree order. Did you mean
> 
> 	struct sock *sk = &u->sk;
> 	long total_refs;
> 
> 	total_refs = file_count(sk->sk_socket->file);
> 
> ?

Yes, it's netdev convention.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs


> 
> >>> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> >>> ----------------	-------------------------	-----------
> >>> NS = unix_create1()
> >>> skb1 = sock_wmalloc(NS)
> >>> L = unix_find_other(addr)
> >>> 						for u in gc_inflight_list:
> >>> 						  if (total_refs == inflight_refs)
> >>> 						    add u to gc_candidates
> >>> 						    // L was already traversed
> >>> 						    // in a previous iteration.
> >>> unix_state_lock(L)
> >>> unix_peer(S) = NS
> >>>
> >>> 						// gc_candidates={L, V}
> >>>
> >>> 						for u in gc_candidates:
> >>> 						  scan_children(u, dec_inflight)
> >>>
> >>> 						// embryo (skb1) was not
> >>> 						// reachable from L yet, so V's
> >>> 						// inflight remains unchanged
> >>> __skb_queue_tail(L, skb1)
> >>> unix_state_unlock(L)
> >>> 						for u in gc_candidates:
> >>> 						  if (u.inflight)
> >>> 						    scan_children(u, inc_inflight_move_tail)
> >>>
> >>> 						// V count=1 inflight=2 (!)
> >>
> >> If I understand your question, in this case L's queue technically does change
> >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> >> same unix_gc_lock.
> >>
> >> Is there something I'm missing?
> > 
> > Ah exactly, you are right.
> > 
> > Could you repost this patch only with my comment above addressed ?
> 
> Yeah, sure. One question though: what I wrote above is basically a rephrasing of
> the commit message:
> 
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another connect() coming, its embryo won't
>     carry SCM_RIGHTS as we already took the unix_gc_lock.
> 
> As I understand, the important missing part was the clarification that embryo,
> even though enqueued after the lock flipping, won't affect the inflight graph,
> right? So how about:
> 
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another embryo coming, it can not possibly carry
>     SCM_RIGHTS. At this point, unix_inflight() can not happen because
>     unix_gc_lock is already taken. Inflight graph remains unaffected.

Sounds good to me.

Thanks!
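
Per the review above, a respin would presumably keep the file_count() call on
its own line, so the top of the list_for_each_entry_safe() body ends up as
follows (a sketch of the reverse xmas tree ordering settled on in the thread,
not an actual v2 posting):

		struct sock *sk = &u->sk;
		long total_refs;

		total_refs = file_count(sk->sk_socket->file);

The longer declaration line comes first, and the initialization that would break
the ordering moves below the declarations.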

Patch

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..cd3e8585ceb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -274,11 +274,20 @@  static void __unix_gc(struct work_struct *work)
 	 * receive queues.  Other, non candidate sockets _can_ be
 	 * added to queue, so we must make sure only to touch
 	 * candidates.
+	 *
+	 * Embryos, though never candidates themselves, affect which
+	 * candidates are reachable by the garbage collector.  Before
+	 * being added to a listener's queue, an embryo may already
+	 * receive data carrying SCM_RIGHTS, potentially making the
+	 * passed socket a candidate that is not yet reachable by the
+	 * collector.  It becomes reachable once the embryo is
+	 * enqueued.  Therefore, we must ensure that no SCM-laden
+	 * embryo appears in a (candidate) listener's queue between
+	 * consecutive scan_children() calls.
 	 */
 	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
+		struct sock *sk = &u->sk;
+		long total_refs = file_count(sk->sk_socket->file);
 
 		WARN_ON_ONCE(!u->inflight);
 		WARN_ON_ONCE(total_refs < u->inflight);
@@ -286,6 +295,11 @@  static void __unix_gc(struct work_struct *work)
 			list_move_tail(&u->link, &gc_candidates);
 			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
 			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+
+			if (sk->sk_state == TCP_LISTEN) {
+				unix_state_lock(sk);
+				unix_state_unlock(sk);
+			}
 		}
 	}