[v2,net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.

Message ID 20240510093905.25510-1-kuniyu@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [v2,net] af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 931 this patch: 931
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 1 maintainers not CCed: dhowells@redhat.com
netdev/build_clang success Errors and warnings before: 937 this patch: 937
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 942 this patch: 942
netdev/checkpatch warning WARNING: Commit log lines starting with '#' are dropped by git as comments
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kuniyuki Iwashima May 10, 2024, 9:39 a.m. UTC
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().

__unix_gc() tries to garbage-collect close()d inflight sockets, and
then, if a socket has a MSG_OOB skb in unix_sk(sk)->oob_skb, GC drops
the reference and sets the pointer to NULL locklessly.

However, the peer socket can still send a MSG_OOB message, and
queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading to
a NULL pointer dereference. [0]
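
One possible interleaving (a simplified sketch, assuming both sides
operate on the same old oob_skb; each ends up dropping a reference
on it):

  CPU A: __unix_gc()                 CPU B: queue_oob()
  ------------------                 ------------------
  if (u->oob_skb)         /* old */
                                     if (ousk->oob_skb)  /* old */
  kfree_skb(u->oob_skb)   /* put */
                                     consume_skb(ousk->oob_skb) /* put */
  u->oob_skb = NULL
                                     WRITE_ONCE(ousk->oob_skb, skb)

The old skb is then freed while still linked on the receive queue,
and the skb_dequeue() in unix_release_sock() later trips over it.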

To fix the issue, let's update unix_sk(sk)->oob_skb under
sk_receive_queue's lock and take the lock everywhere we touch oob_skb.

Note that the same issue exists in the new GC, and the change
in queue_oob() can be applied as is.

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS:  0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
 <TASK>
 unix_release_sock (net/unix/af_unix.c:654)
 unix_release (net/unix/af_unix.c:1050)
 __sock_release (net/socket.c:660)
 sock_close (net/socket.c:1423)
 __fput (fs/file_table.c:423)
 delayed_fput (fs/file_table.c:444 (discriminator 3))
 process_one_work (kernel/workqueue.c:3259)
 worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
 kthread (kernel/kthread.c:388)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>
Modules linked in:
CR2: 0000000000000008

Fixes: 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
  * Add recvq locking everywhere we touch oob_skb (Paolo)

v1: https://lore.kernel.org/netdev/20240507170018.83385-1-kuniyu@amazon.com/
---
 net/unix/af_unix.c | 18 ++++++++++++++----
 net/unix/garbage.c |  4 +++-
 2 files changed, 17 insertions(+), 5 deletions(-)

Comments

Paolo Abeni May 10, 2024, 10:44 a.m. UTC | #1
On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index 0104be9d4704..b87e48e2b51b 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
>  		scan_children(&u->sk, inc_inflight, &hitlist);
>  
>  #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> +		spin_lock(&u->sk.sk_receive_queue.lock);
>  		if (u->oob_skb) {
> -			kfree_skb(u->oob_skb);
> +			WARN_ON_ONCE(skb_unref(u->oob_skb));

Sorry for not asking this first, but it's not clear to me why the above
change (just the 'WARN_ON_ONCE' introduction) is needed and if it's
related to the addressed issue???

Thanks!

Paolo
Kuniyuki Iwashima May 10, 2024, 10:54 a.m. UTC | #2
From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 10 May 2024 12:44:58 +0200
> On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > index 0104be9d4704..b87e48e2b51b 100644
> > --- a/net/unix/garbage.c
> > +++ b/net/unix/garbage.c
> > @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
> >  		scan_children(&u->sk, inc_inflight, &hitlist);
> >  
> >  #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > +		spin_lock(&u->sk.sk_receive_queue.lock);
> >  		if (u->oob_skb) {
> > -			kfree_skb(u->oob_skb);
> > +			WARN_ON_ONCE(skb_unref(u->oob_skb));
> 
> Sorry for not asking this first, but it's not clear to me why the above
> change (just the 'WARN_ON_ONCE' introduction) is needed and if it's
> related to the addressed issue???

I think I added it to make it clear that we don't actually free the skb
here, and to stay consistent with manage_oob().

But I don't have a strong preference, as it will be removed soon.
Paolo Abeni May 10, 2024, 11:11 a.m. UTC | #3
On Fri, 2024-05-10 at 19:54 +0900, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Fri, 10 May 2024 12:44:58 +0200
> > On Fri, 2024-05-10 at 18:39 +0900, Kuniyuki Iwashima wrote:
> > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> > > index 0104be9d4704..b87e48e2b51b 100644
> > > --- a/net/unix/garbage.c
> > > +++ b/net/unix/garbage.c
> > > @@ -342,10 +342,12 @@ static void __unix_gc(struct work_struct *work)
> > >  		scan_children(&u->sk, inc_inflight, &hitlist);
> > >  
> > >  #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
> > > +		spin_lock(&u->sk.sk_receive_queue.lock);
> > >  		if (u->oob_skb) {
> > > -			kfree_skb(u->oob_skb);
> > > +			WARN_ON_ONCE(skb_unref(u->oob_skb));
> > 
> > Sorry for not asking this first, but it's not clear to me why the above
> > change (just the 'WARN_ON_ONCE' introduction) is needed and if it's
> > related to the addressed issue???
> 
> I think I added it to make it clear that we don't actually free the skb
> here, and to stay consistent with manage_oob().
> 
> But I don't have a strong preference, as it will be removed soon.

Ok, thanks for the explanation. I'm fine with the above.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Michal Luczaj May 12, 2024, 2:47 p.m. UTC | #4
On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  		consume_skb(skb);
>  		skb = NULL;
>  	} else {
> +		spin_lock(&sk->sk_receive_queue.lock);
> +
>  		if (skb == u->oob_skb) {
>  			if (copied) {
>  				skb = NULL;
> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>  			} else if (flags & MSG_PEEK) {
>  				skb = NULL;
>  			} else {
> -				skb_unlink(skb, &sk->sk_receive_queue);
> +				__skb_unlink(skb, &sk->sk_receive_queue);
>  				WRITE_ONCE(u->oob_skb, NULL);
>  				if (!WARN_ON_ONCE(skb_unref(skb)))
>  					kfree_skb(skb);
>  				skb = skb_peek(&sk->sk_receive_queue);
>  			}
>  		}
> +
> +		spin_unlock(&sk->sk_receive_queue.lock);
>  	}
>  	return skb;
>  }

Now it is
  
  spin_lock(&sk->sk_receive_queue.lock)
  kfree_skb
    unix_destruct_scm
      unix_notinflight
        spin_lock(&unix_gc_lock)

I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
But that's benign, right?

thanks,
Michal
Kuniyuki Iwashima May 13, 2024, 6:12 a.m. UTC | #5
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 12 May 2024 16:47:11 +0200
> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> > @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >  		consume_skb(skb);
> >  		skb = NULL;
> >  	} else {
> > +		spin_lock(&sk->sk_receive_queue.lock);
> > +
> >  		if (skb == u->oob_skb) {
> >  			if (copied) {
> >  				skb = NULL;
> > @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >  			} else if (flags & MSG_PEEK) {
> >  				skb = NULL;
> >  			} else {
> > -				skb_unlink(skb, &sk->sk_receive_queue);
> > +				__skb_unlink(skb, &sk->sk_receive_queue);
> >  				WRITE_ONCE(u->oob_skb, NULL);
> >  				if (!WARN_ON_ONCE(skb_unref(skb)))
> >  					kfree_skb(skb);
> >  				skb = skb_peek(&sk->sk_receive_queue);
> >  			}
> >  		}
> > +
> > +		spin_unlock(&sk->sk_receive_queue.lock);
> >  	}
> >  	return skb;
> >  }
> 
> Now it is
>   
>   spin_lock(&sk->sk_receive_queue.lock)
>   kfree_skb

This does not actually free the skb; it just drops the reference taken
by skb_get() in queue_oob().


>     unix_destruct_scm

So, here we don't reach unix_destruct_scm().

That's why I changed kfree_skb() to skb_unref() in __unix_gc().
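
A toy model of the refcounting I have in mind (userspace sketch;
toy_skb and toy_skb_unref are made-up stand-ins for skb->users and
skb_unref(), not the kernel helpers):

#include <assert.h>
#include <stdbool.h>

struct toy_skb {
	int users;
};

/* Like skb_unref(): drop one reference and return true only if it
 * was the last one, i.e. the caller must free the skb now.
 */
static bool toy_skb_unref(struct toy_skb *skb)
{
	return --skb->users == 0;
}

int main(void)
{
	/* queue_oob(): one ref held by the receive queue, plus one
	 * taken by skb_get() for ousk->oob_skb.
	 */
	struct toy_skb oob = { .users = 2 };

	/* __unix_gc(): drop only the oob_skb ref.  The queue still
	 * holds one, so this is never the final put -- hence
	 * WARN_ON_ONCE(skb_unref(...)) with no kfree_skb().
	 */
	assert(!toy_skb_unref(&oob));

	/* The final put happens when the queue is purged later. */
	assert(toy_skb_unref(&oob));
	return 0;
}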

Thanks!


>       unix_notinflight
>         spin_lock(&unix_gc_lock)
> 
> I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
> But that's benign, right?
Michal Luczaj May 13, 2024, 6:40 a.m. UTC | #6
On 5/13/24 08:12, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 12 May 2024 16:47:11 +0200
>> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
>>> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>>>  		consume_skb(skb);
>>>  		skb = NULL;
>>>  	} else {
>>> +		spin_lock(&sk->sk_receive_queue.lock);
>>> +
>>>  		if (skb == u->oob_skb) {
>>>  			if (copied) {
>>>  				skb = NULL;
>>> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
>>>  			} else if (flags & MSG_PEEK) {
>>>  				skb = NULL;
>>>  			} else {
>>> -				skb_unlink(skb, &sk->sk_receive_queue);
>>> +				__skb_unlink(skb, &sk->sk_receive_queue);
>>>  				WRITE_ONCE(u->oob_skb, NULL);
>>>  				if (!WARN_ON_ONCE(skb_unref(skb)))
>>>  					kfree_skb(skb);
>>>  				skb = skb_peek(&sk->sk_receive_queue);
>>>  			}
>>>  		}
>>> +
>>> +		spin_unlock(&sk->sk_receive_queue.lock);
>>>  	}
>>>  	return skb;
>>>  }
>>
>> Now it is
>>   
>>   spin_lock(&sk->sk_receive_queue.lock)
>>   kfree_skb
> 
> This does not actually free the skb; it just drops the reference taken
> by skb_get() in queue_oob().

I suspect you refer to the change in __unix_gc()

 	if (u->oob_skb) {
-		kfree_skb(u->oob_skb);
+		WARN_ON_ONCE(skb_unref(u->oob_skb));
 	}

What I'm talking about is the (unchanged) part quoted above in manage_oob():

	if (!WARN_ON_ONCE(skb_unref(skb)))
  		kfree_skb(skb);

I might be missing something, but

from array import array
from socket import *
a, b = socketpair(AF_UNIX, SOCK_STREAM)
# Put an fd in flight via SCM_RIGHTS, so freeing the OOB skb goes
# through unix_notinflight(), which takes unix_gc_lock.
scm = (SOL_SOCKET, SCM_RIGHTS, array("i", [b.fileno()]))
b.sendmsg([b'x'], [scm], MSG_OOB)
b.close()
# recv() ends up in manage_oob(), which now frees that skb while
# holding sk_receive_queue.lock.
a.recv(MSG_DONTWAIT)

[   72.513125] ======================================================
[   72.513148] WARNING: possible circular locking dependency detected
[   72.513170] 6.9.0-rc7nokasan+ #25 Not tainted
[   72.513193] ------------------------------------------------------
[   72.513215] python/1054 is trying to acquire lock:
[   72.513237] ffffffff83563898 (unix_gc_lock){+.+.}-{2:2}, at: unix_notinflight+0x23/0x100
[   72.513266]
               but task is already holding lock:
[   72.513288] ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
[   72.513313]
               which lock already depends on the new lock.

[   72.513336]
               the existing dependency chain (in reverse order) is:
[   72.513358]
               -> #1 (rlock-AF_UNIX){+.+.}-{2:2}:
[   72.513381]        _raw_spin_lock+0x2f/0x40
[   72.513404]        scan_inflight+0x36/0x1e0
[   72.513428]        __unix_gc+0x17c/0x4b0
[   72.513450]        process_one_work+0x217/0x700
[   72.513474]        worker_thread+0x1ca/0x3b0
[   72.513497]        kthread+0xdd/0x110
[   72.513519]        ret_from_fork+0x2d/0x50
[   72.513543]        ret_from_fork_asm+0x1a/0x30
[   72.513565]
               -> #0 (unix_gc_lock){+.+.}-{2:2}:
[   72.513589]        __lock_acquire+0x137b/0x20e0
[   72.513612]        lock_acquire+0xc5/0x2c0
[   72.513635]        _raw_spin_lock+0x2f/0x40
[   72.513657]        unix_notinflight+0x23/0x100
[   72.513680]        unix_destruct_scm+0x95/0xa0
[   72.513702]        skb_release_head_state+0x20/0x60
[   72.513726]        kfree_skb_reason+0x53/0x1e0
[   72.513748]        unix_stream_read_generic+0xb69/0xbc0
[   72.513771]        unix_stream_recvmsg+0x68/0x80
[   72.513794]        sock_recvmsg+0xb9/0xc0
[   72.513817]        __sys_recvfrom+0xa1/0x110
[   72.513840]        __x64_sys_recvfrom+0x20/0x30
[   72.513862]        do_syscall_64+0x93/0x190
[   72.513886]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   72.513909]
               other info that might help us debug this:

[   72.513932]  Possible unsafe locking scenario:

[   72.513954]        CPU0                    CPU1
[   72.513976]        ----                    ----
[   72.513998]   lock(rlock-AF_UNIX);
[   72.514020]                                lock(unix_gc_lock);
[   72.514043]                                lock(rlock-AF_UNIX);
[   72.514066]   lock(unix_gc_lock);
[   72.514088]
                *** DEADLOCK ***

[   72.514110] 3 locks held by python/1054:
[   72.514133]  #0: ffff88811eb10cf0 (&u->iolock){+.+.}-{3:3}, at: unix_stream_read_generic+0xd4/0xbc0
[   72.514158]  #1: ffff88811eb10de0 (&u->lock){+.+.}-{2:2}, at: unix_stream_read_generic+0x110/0xbc0
[   72.514184]  #2: ffff88811eb10898 (rlock-AF_UNIX){+.+.}-{2:2}, at: unix_stream_read_generic+0x178/0xbc0
[   72.514209]
               stack backtrace:
[   72.514231] CPU: 4 PID: 1054 Comm: python Not tainted 6.9.0-rc7nokasan+ #25
[   72.514254] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   72.514278] Call Trace:
[   72.514300]  <TASK>
[   72.514321]  dump_stack_lvl+0x73/0xb0
[   72.514345]  check_noncircular+0x108/0x120
[   72.514369]  __lock_acquire+0x137b/0x20e0
[   72.514392]  lock_acquire+0xc5/0x2c0
[   72.514415]  ? unix_notinflight+0x23/0x100
[   72.514439]  _raw_spin_lock+0x2f/0x40
[   72.514461]  ? unix_notinflight+0x23/0x100
[   72.514484]  unix_notinflight+0x23/0x100
[   72.514507]  unix_destruct_scm+0x95/0xa0
[   72.514530]  skb_release_head_state+0x20/0x60
[   72.514553]  kfree_skb_reason+0x53/0x1e0
[   72.514575]  unix_stream_read_generic+0xb69/0xbc0
[   72.514600]  unix_stream_recvmsg+0x68/0x80
[   72.514623]  ? __pfx_unix_stream_read_actor+0x10/0x10
[   72.514646]  sock_recvmsg+0xb9/0xc0
[   72.514669]  __sys_recvfrom+0xa1/0x110
[   72.514692]  ? lock_release+0x133/0x290
[   72.514715]  ? syscall_exit_to_user_mode+0x11/0x280
[   72.514739]  ? do_syscall_64+0xa0/0x190
[   72.514762]  __x64_sys_recvfrom+0x20/0x30
[   72.514784]  do_syscall_64+0x93/0x190
[   72.514807]  ? clear_bhb_loop+0x45/0xa0
[   72.514829]  ? clear_bhb_loop+0x45/0xa0
[   72.514852]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   72.514875] RIP: 0033:0x7fc16bb3594d
[   72.514899] Code: 02 02 00 00 00 5d c3 66 0f 1f 44 00 00 f3 0f 1e fa 80 3d 25 8a 0c 00 00 41 89 ca 74 20 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89
[   72.514925] RSP: 002b:00007fffae4ab2f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
[   72.514949] RAX: ffffffffffffffda RBX: 00007fffae4ab3c8 RCX: 00007fc16bb3594d
[   72.514972] RDX: 0000000000000040 RSI: 00007fc15e298f30 RDI: 0000000000000003
[   72.514994] RBP: 00007fffae4ab310 R08: 0000000000000000 R09: 0000000000000000
[   72.515017] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc15e34af90
[   72.515040] R13: 0000000000000000 R14: ffffffffc4653600 R15: 0000000000000000
[   72.515064]  </TASK>

>>     unix_destruct_scm
> 
> So, here we don't reach unix_destruct_scm().
> 
> That's why I changed kfree_skb() to skb_unref() in __unix_gc().
> 
> Thanks!
> 
> 
>>       unix_notinflight
>>         spin_lock(&unix_gc_lock)
>>
>> I.e. sk_receive_queue.lock -> unix_gc_lock, inversion of what unix_gc() does.
>> But that's benign, right?
Kuniyuki Iwashima May 13, 2024, 7:44 a.m. UTC | #7
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 13 May 2024 08:40:34 +0200
> On 5/13/24 08:12, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Sun, 12 May 2024 16:47:11 +0200
> >> On 5/10/24 11:39, Kuniyuki Iwashima wrote:
> >>> @@ -2655,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >>>  		consume_skb(skb);
> >>>  		skb = NULL;
> >>>  	} else {
> >>> +		spin_lock(&sk->sk_receive_queue.lock);
> >>> +
> >>>  		if (skb == u->oob_skb) {
> >>>  			if (copied) {
> >>>  				skb = NULL;
> >>> @@ -2666,13 +2674,15 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> >>>  			} else if (flags & MSG_PEEK) {
> >>>  				skb = NULL;
> >>>  			} else {
> >>> -				skb_unlink(skb, &sk->sk_receive_queue);
> >>> +				__skb_unlink(skb, &sk->sk_receive_queue);
> >>>  				WRITE_ONCE(u->oob_skb, NULL);
> >>>  				if (!WARN_ON_ONCE(skb_unref(skb)))
> >>>  					kfree_skb(skb);
> >>>  				skb = skb_peek(&sk->sk_receive_queue);
> >>>  			}
> >>>  		}
> >>> +
> >>> +		spin_unlock(&sk->sk_receive_queue.lock);
> >>>  	}
> >>>  	return skb;
> >>>  }
> >>
> >> Now it is
> >>   
> >>   spin_lock(&sk->sk_receive_queue.lock)
> >>   kfree_skb
> > 
> > This does not actually free the skb; it just drops the reference taken
> > by skb_get() in queue_oob().
> 
> I suspect you refer to the change in __unix_gc()
> 
>  	if (u->oob_skb) {
> -		kfree_skb(u->oob_skb);
> +		WARN_ON_ONCE(skb_unref(u->oob_skb));
>  	}
> 
> What I'm talking about is the (unchanged) part quoted above in manage_oob():
> 
> 	if (!WARN_ON_ONCE(skb_unref(skb)))
>   		kfree_skb(skb);

Ah, I got your point, good catch!

Somehow I was thinking of the new GC, where an alive recvq is not
touched and lockdep would end up with a false positive.

We need to delay freeing oob_skb in that case like below.

I'll respin v3 later, thanks!

---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index c555464cf1fb..35ca2be2c984 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2661,6 +2661,8 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                consume_skb(skb);
                skb = NULL;
        } else {
+               struct sk_buff *unlinked_skb = NULL;
+
                spin_lock(&sk->sk_receive_queue.lock);
 
                if (skb == u->oob_skb) {
@@ -2676,13 +2678,18 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
                        } else {
                                __skb_unlink(skb, &sk->sk_receive_queue);
                                WRITE_ONCE(u->oob_skb, NULL);
-                               if (!WARN_ON_ONCE(skb_unref(skb)))
-                                       kfree_skb(skb);
+                               unlinked_skb = skb;
                                skb = skb_peek(&sk->sk_receive_queue);
                        }
                }
 
                spin_unlock(&sk->sk_receive_queue.lock);
+
+               if (unlinked_skb) {
+                       WARN_ON_ONCE(skb_unref(unlinked_skb));
+                       kfree_skb(unlinked_skb);
+               }
+
        }
        return skb;
 }
---8<---




> [ repro and lockdep splat snipped ]
Michal Luczaj May 13, 2024, 9:14 a.m. UTC | #8
On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 13 May 2024 08:40:34 +0200
>> What I'm talking about is the (unchanged) part quoted above in manage_oob():
>>
>> 	if (!WARN_ON_ONCE(skb_unref(skb)))
>>   		kfree_skb(skb);
> 
> Ah, I got your point, good catch!
> 
> Somehow I was thinking of the new GC, where an alive recvq is not
> touched and lockdep would end up with a false positive.
> 
> We need to delay freeing oob_skb in that case like below.
> ...

So this is not a lockdep false positive after all?

Here's my understanding: the only way manage_oob() can lead to an inverted locking
order is when the receiver socket is _not_ in gc_candidates. And when it's not
there, no risk of deadlock. What do you think?
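
For reference, the old GC's candidate selection is roughly (paraphrasing
net/unix/garbage.c, not the literal code):

	total_refs = file_count(u->sk.sk_socket->file);
	if (total_refs == u->inflight) {
		/* all refs are in flight: userspace cannot
		 * reach this socket anymore
		 */
		list_move_tail(&u->link, &gc_candidates);
	}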
Kuniyuki Iwashima May 13, 2024, 9:24 a.m. UTC | #9
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 13 May 2024 11:14:39 +0200
> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 13 May 2024 08:40:34 +0200
> >> What I'm talking about is the (unchanged) part quoted above in manage_oob():
> >>
> >> 	if (!WARN_ON_ONCE(skb_unref(skb)))
> >>   		kfree_skb(skb);
> > 
> > Ah, I got your point, good catch!
> > 
> > Somehow I was thinking of the new GC, where an alive recvq is not
> > touched and lockdep would end up with a false positive.
> > 
> > We need to delay freeing oob_skb in that case like below.
> > ...
> 
> So this is not a lockdep false positive after all?
> 
> Here's my understanding: the only way manage_oob() can lead to an inverted locking
> order is when the receiver socket is _not_ in gc_candidates. And when it's not
> there, no risk of deadlock. What do you think?

For the new GC, it's a false positive, but for the old GC, it's not.

The old GC takes unix_gc_lock and can iterate over alive sockets as long
as they are linked to gc_inflight_list, and then the recvq is locked.

The new GC only touches the recvq when the socket is detected as dead,
meaning there's no recv() call in progress on it.

This patch is for both GC implementations, so we need to free the skb
after unlocking the recvq in manage_oob().
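
To summarize the two orderings at play (as I read the splat):

  unix_gc():     unix_gc_lock -> sk_receive_queue.lock
                 (via scan_children() -> scan_inflight())
  manage_oob():  sk_receive_queue.lock -> unix_gc_lock
                 (via kfree_skb() -> unix_destruct_scm() -> unix_notinflight())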
Michal Luczaj May 13, 2024, 10:15 a.m. UTC | #10
On 5/13/24 11:24, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 13 May 2024 11:14:39 +0200
>> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 13 May 2024 08:40:34 +0200
> > >> What I'm talking about is the (unchanged) part quoted above in manage_oob():
>>>>
>>>> 	if (!WARN_ON_ONCE(skb_unref(skb)))
>>>>   		kfree_skb(skb);
>>>
>>> Ah, I got your point, good catch!
>>>
> > > Somehow I was thinking of the new GC, where an alive recvq is not
> > > touched and lockdep would end up with a false positive.
>>>
>>> We need to delay freeing oob_skb in that case like below.
>>> ...
>>
> > So this is not a lockdep false positive after all?
>>
>> Here's my understanding: the only way manage_oob() can lead to an inverted locking
>> order is when the receiver socket is _not_ in gc_candidates. And when it's not
>> there, no risk of deadlock. What do you think?
> 
> For the new GC, it's a false positive, but for the old GC, it's not.
>
> The old GC takes unix_gc_lock and can iterate over alive sockets as long
> as they are linked to gc_inflight_list, and then the recvq is locked.
> ...

The recvq is locked not for all sockets in gc_inflight_list, but only its
subset, gc_candidates, i.e. sockets that fulfil the 'total_refs == u->inflight'
condition, right? So doesn't this imply that our receiver is not user-reachable
and manage_oob() cannot be called/raced?

I wouldn't be surprised if I was missing something important, but it's not like
I didn't try deadlocking this code :)
Kuniyuki Iwashima May 13, 2024, 12:44 p.m. UTC | #11
Date: Mon, 13 May 2024 12:15:57 +0200
From: Michal Luczaj <mhal@rbox.co>
> On 5/13/24 11:24, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 13 May 2024 11:14:39 +0200
> >> On 5/13/24 09:44, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon, 13 May 2024 08:40:34 +0200
> >>>> What I'm talking about is the (unchanged) part quoted above in manage_oob():
> >>>>
> >>>> 	if (!WARN_ON_ONCE(skb_unref(skb)))
> >>>>   		kfree_skb(skb);
> >>>
> >>> Ah, I got your point, good catch!
> >>>
> >>> Somehow I was thinking of the new GC, where an alive recvq is not
> >>> touched and lockdep would end up with a false positive.
> >>>
> >>> We need to delay freeing oob_skb in that case like below.
> >>> ...
> >>
> >> So this is not a lockdep false positive after all?
> >>
> >> Here's my understanding: the only way manage_oob() can lead to an inverted locking
> >> order is when the receiver socket is _not_ in gc_candidates. And when it's not
> >> there, no risk of deadlock. What do you think?
> > 
> > For the new GC, it's a false positive, but for the old GC, it's not.
> >
> > The old GC takes unix_gc_lock and can iterate over alive sockets as long
> > as they are linked to gc_inflight_list, and then the recvq is locked.
> > ...
> 
> The recvq is locked not for all sockets in gc_inflight_list, but only its
> subset, gc_candidates, i.e. sockets that fulfil the 'total_refs == u->inflight'
> condition, right? So doesn't this imply that our receiver is not user-reachable
> and manage_oob() cannot be called/raced?

Ah, yes, the splat was a false positive for the old GC too.

Instead of using a different lock class for the recvq in GC, it would
be better to unlock it earlier in manage_oob(), so I'll post v3.

Thanks!

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9a6ad5974dff..c555464cf1fb 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2217,13 +2217,15 @@  static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 	maybe_add_creds(skb, sock, other);
 	skb_get(skb);
 
+	scm_stat_add(other, skb);
+
+	spin_lock(&other->sk_receive_queue.lock);
 	if (ousk->oob_skb)
 		consume_skb(ousk->oob_skb);
-
 	WRITE_ONCE(ousk->oob_skb, skb);
+	__skb_queue_tail(&other->sk_receive_queue, skb);
+	spin_unlock(&other->sk_receive_queue.lock);
 
-	scm_stat_add(other, skb);
-	skb_queue_tail(&other->sk_receive_queue, skb);
 	sk_send_sigurg(other);
 	unix_state_unlock(other);
 	other->sk_data_ready(other);
@@ -2614,8 +2616,10 @@  static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 
 	mutex_lock(&u->iolock);
 	unix_state_lock(sk);
+	spin_lock(&sk->sk_receive_queue.lock);
 
 	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
+		spin_unlock(&sk->sk_receive_queue.lock);
 		unix_state_unlock(sk);
 		mutex_unlock(&u->iolock);
 		return -EINVAL;
@@ -2627,6 +2631,8 @@  static int unix_stream_recv_urg(struct unix_stream_read_state *state)
 		WRITE_ONCE(u->oob_skb, NULL);
 	else
 		skb_get(oob_skb);
+
+	spin_unlock(&sk->sk_receive_queue.lock);
 	unix_state_unlock(sk);
 
 	chunk = state->recv_actor(oob_skb, 0, chunk, state);
@@ -2655,6 +2661,8 @@  static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 		consume_skb(skb);
 		skb = NULL;
 	} else {
+		spin_lock(&sk->sk_receive_queue.lock);
+
 		if (skb == u->oob_skb) {
 			if (copied) {
 				skb = NULL;
@@ -2666,13 +2674,15 @@  static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 			} else if (flags & MSG_PEEK) {
 				skb = NULL;
 			} else {
-				skb_unlink(skb, &sk->sk_receive_queue);
+				__skb_unlink(skb, &sk->sk_receive_queue);
 				WRITE_ONCE(u->oob_skb, NULL);
 				if (!WARN_ON_ONCE(skb_unref(skb)))
 					kfree_skb(skb);
 				skb = skb_peek(&sk->sk_receive_queue);
 			}
 		}
+
+		spin_unlock(&sk->sk_receive_queue.lock);
 	}
 	return skb;
 }
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0104be9d4704..b87e48e2b51b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -342,10 +342,12 @@  static void __unix_gc(struct work_struct *work)
 		scan_children(&u->sk, inc_inflight, &hitlist);
 
 #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+		spin_lock(&u->sk.sk_receive_queue.lock);
 		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
+			WARN_ON_ONCE(skb_unref(u->oob_skb));
 			u->oob_skb = NULL;
 		}
+		spin_unlock(&u->sk.sk_receive_queue.lock);
 #endif
 	}