diff mbox series

[v5,net,1/2] af_unix: Fix garbage collection of embryos carrying OOB/SCM_RIGHTS.

Message ID 20240515003204.43153-2-kuniyu@amazon.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series af_unix: Fix memleak and null-ptr-deref around MSG_OOB and GC. | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 927 this patch: 927
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 936 this patch: 936
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 938 this patch: 938
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 44 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kuniyuki Iwashima May 15, 2024, 12:32 a.m. UTC
From: Michal Luczaj <mhal@rbox.co>

GC attempts to explicitly drop oob_skb's refcount before purging the hit
list.

The problem is with embryos: instead of trying to kfree_skb(u->oob_skb)
of an embryo socket, GC goes for its parent-listener socket, which never
carries u->oob_skb.

The python script below [0] sends a listener's fd to its embryo as OOB
data.  Then, GC does not iterates the embryo from the listener to drop
the OOB skb's refcount, and the skb in embryo's receive queue keeps the
listener's refcount.  As a result, the listener is leaked and the warning
[1] is hit.

Tell GC to dispose the right socket's oob_skb.

[0]:
from array import array
from socket import *

addr = 'unix-oob-splat'
lis = socket(AF_UNIX, SOCK_STREAM)
lis.bind(addr)
lis.listen(1)

s = socket(AF_UNIX, SOCK_STREAM)
s.connect(addr)
scm = (SOL_SOCKET, SCM_RIGHTS, array('i', [lis.fileno()]))
s.sendmsg([b'x'], [scm], MSG_OOB)
lis.close()

[1]:
WARNING: CPU: 2 PID: 546 at net/unix/garbage.c:371 __unix_gc+0x50e/0x520
Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel virtio_blk ghash_clmulni_intel serio_raw fuse qemu_fw_cfg virtio_console
CPU: 2 PID: 546 Comm: kworker/u32:5 Not tainted 6.9.0-rc7nokasan+ #28
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound __unix_gc
RIP: 0010:__unix_gc+0x50e/0x520
Code: 83 fa 01 0f 84 07 fe ff ff 85 d2 0f 8f 01 fe ff ff be 03 00 00 00 e8 f1 f7 9a ff e9 f2 fd ff ff e8 b7 f9 ff ff e9 28 fd ff ff <0f> 0b e9 07 ff ff ff e8 36 0a 1a 00 66 0f 1f 44 00 00 90 90 90 90
RSP: 0018:ffffc9000051fd90 EFLAGS: 00010283
RAX: ffff88810b316f30 RBX: ffffffff83563230 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff82956cfb RDI: ffffffff83563880
RBP: ffffc9000051fe38 R08: 00000000cba2db62 R09: 00000000000003e5
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000051fdb0
R13: ffff88810b316a00 R14: ffffc9000051fd90 R15: ffffffff83563860
FS:  0000000000000000(0000) GS:ffff88842fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557b30b7406c CR3: 000000011e5be000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn.cold+0xb1/0x13e
 ? __unix_gc+0x50e/0x520
 ? report_bug+0xe6/0x170
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? __unix_gc+0x50e/0x520
 process_one_work+0x21f/0x590
 ? move_linked_works+0x70/0xa0
 worker_thread+0x1bf/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xdd/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Fixes: aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/netdev/6915a10c-cc57-4a68-9f91-a5efdf42091d@rbox.co/
[ kuniyu: edited commit message ]
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/garbage.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

Comments

Michal Luczaj May 15, 2024, 9:34 a.m. UTC | #1
On 5/15/24 02:32, Kuniyuki Iwashima wrote:
> ...
> The python script below [0] sends a listener's fd to its embryo as OOB
> data.  Then, GC does not iterates the embryo from the listener to drop
> the OOB skb's refcount, and the skb in embryo's receive queue keeps the
> listener's refcount.  As a result, the listener is leaked and the warning
> [1] is hit.
> ...

Sorry, this does not convey what I wrote. And I think your edit is
incorrect.

GC starts from the in-flight listener and *does* iterate the embryo; see
scan_children() where scan_inflight() is called for all the embryos.
The skb in embryo's RQ *does not* keep the listener's refcount; skb from RQ
ends up in the hit list and is purged.
It is embryo's oob_skb that holds the refcount; see how __unix_gc() goes
over gc_candidates attempting to kfree_skb(u->oob_skb), notice that `u`
here is a listener, not an embryo.

I understand you're "in rush for the merge window", but would it be okay if
I ask you not to edit my commit messages so heavily?

Thanks,
Michal
Kuniyuki Iwashima May 15, 2024, 1:35 p.m. UTC | #2
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 15 May 2024 11:34:51 +0200
> On 5/15/24 02:32, Kuniyuki Iwashima wrote:
> > ...
> > The python script below [0] sends a listener's fd to its embryo as OOB
> > data.  Then, GC does not iterates the embryo from the listener to drop
> > the OOB skb's refcount, and the skb in embryo's receive queue keeps the
> > listener's refcount.  As a result, the listener is leaked and the warning
> > [1] is hit.
> > ...
> 
> Sorry, this does not convey what I wrote. And I think your edit is
> incorrect.
> 
> GC starts from the in-flight listener and *does* iterate the embryo; see
> scan_children() where scan_inflight() is called for all the embryos.

I meant the current code does not call skb_unref() for embryos's OOB skb
because it's done _after_ scan_inflight(), not in scan_inflight().


> The skb in embryo's RQ *does not* keep the listener's refcount; skb from RQ
> ends up in the hit list and is purged.

unix_sk(sk)->oob_skb is a pointer to skb in recvq.  Perhaps I should
have written "the skb which was in embryo's receive queue stays as
unix_sk(sk)->oob_skb and keeps the listener's refcount".


> It is embryo's oob_skb that holds the refcount; see how __unix_gc() goes
> over gc_candidates attempting to kfree_skb(u->oob_skb), notice that `u`
> here is a listener, not an embryo.
> 
> I understand you're "in rush for the merge window", but would it be okay if
> I ask you not to edit my commit messages so heavily?

I noticed the new gc code was merged in Linus' tree.  It's still not
synced with net.git, but I guess it will be done soon and your patch
will not apply on net.git.  Then, I cannot include your patch as a
series, so please feel free to send it to each stable tree.

Thanks
Michal Luczaj May 16, 2024, 10:33 a.m. UTC | #3
On 5/15/24 15:35, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Wed, 15 May 2024 11:34:51 +0200
>> On 5/15/24 02:32, Kuniyuki Iwashima wrote:
>>> ...
>>> The python script below [0] sends a listener's fd to its embryo as OOB
>>> data.  Then, GC does not iterates the embryo from the listener to drop
>>> the OOB skb's refcount, and the skb in embryo's receive queue keeps the
>>> listener's refcount.  As a result, the listener is leaked and the warning
>>> [1] is hit.
>>> ...
>>
>> Sorry, this does not convey what I wrote. And I think your edit is
>> incorrect.
>>
>> GC starts from the in-flight listener and *does* iterate the embryo; see
>> scan_children() where scan_inflight() is called for all the embryos.
> 
> I meant the current code does not call skb_unref() for embryos's OOB skb
> because it's done _after_ scan_inflight(), not in scan_inflight().

Right, I think I see what you mean.

>> The skb in embryo's RQ *does not* keep the listener's refcount; skb from RQ
>> ends up in the hit list and is purged.
> 
> unix_sk(sk)->oob_skb is a pointer to skb in recvq.  Perhaps I should
> have written "the skb which was in embryo's receive queue stays as
> unix_sk(sk)->oob_skb and keeps the listener's refcount".

I wholeheartedly concur with you!

>> It is embryo's oob_skb that holds the refcount; see how __unix_gc() goes
>> over gc_candidates attempting to kfree_skb(u->oob_skb), notice that `u`
>> here is a listener, not an embryo.
>>
>> I understand you're "in rush for the merge window", but would it be okay if
>> I ask you not to edit my commit messages so heavily?
> 
> I noticed the new gc code was merged in Linus' tree.  It's still not
> synced with net.git, but I guess it will be done soon and your patch
> will not apply on net.git.  Then, I cannot include your patch as a
> series, so please feel free to send it to each stable tree.

All right, no problem. Does it mean you'll be posting PATCH 2/2 ("af_unix:
Update unix_sk(sk)->oob_skb under sk_receive_queue lock") to stable(s)?

Moving on to the New GC: Python test from this patch shows that the New GC
is memleaking in pretty much the same fashion.

$ grep splat /proc/net/unix
$ ./unix-oob-splat.py
$ rm unix-oob-splat
$ ./unix-oob-splat.py
$ grep splat /proc/net/unix
0000000000000000: 00000002 00000000 00000000 0001 02     0 unix-oob-splat
0000000000000000: 00000002 00000000 00000000 0001 02     0 unix-oob-splat
0000000000000000: 00000002 00000000 00010000 0001 01  6643 unix-oob-splat
0000000000000000: 00000002 00000000 00010000 0001 01  2920 unix-oob-splat

I've posted a patch:
https://lore.kernel.org/netdev/20240516103049.1132040-1-mhal@rbox.co/

I tried to align with your version of the commit message, but feel free to
chime in. Also, I took the liberty to introduce a small sanity check. Would
you prefer if I dropped this hunk or possibly made it a separate patch?

Thanks!
Kuniyuki Iwashima May 16, 2024, 12:19 p.m. UTC | #4
From: Michal Luczaj <mhal@rbox.co>
Date: Thu, 16 May 2024 12:33:35 +0200
> On 5/15/24 15:35, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Wed, 15 May 2024 11:34:51 +0200
> >> On 5/15/24 02:32, Kuniyuki Iwashima wrote:
> >>> ...
> >>> The python script below [0] sends a listener's fd to its embryo as OOB
> >>> data.  Then, GC does not iterates the embryo from the listener to drop
> >>> the OOB skb's refcount, and the skb in embryo's receive queue keeps the
> >>> listener's refcount.  As a result, the listener is leaked and the warning
> >>> [1] is hit.
> >>> ...
> >>
> >> Sorry, this does not convey what I wrote. And I think your edit is
> >> incorrect.
> >>
> >> GC starts from the in-flight listener and *does* iterate the embryo; see
> >> scan_children() where scan_inflight() is called for all the embryos.
> > 
> > I meant the current code does not call skb_unref() for embryos's OOB skb
> > because it's done _after_ scan_inflight(), not in scan_inflight().
> 
> Right, I think I see what you mean.
> 
> >> The skb in embryo's RQ *does not* keep the listener's refcount; skb from RQ
> >> ends up in the hit list and is purged.
> > 
> > unix_sk(sk)->oob_skb is a pointer to skb in recvq.  Perhaps I should
> > have written "the skb which was in embryo's receive queue stays as
> > unix_sk(sk)->oob_skb and keeps the listener's refcount".
> 
> I wholeheartedly concur with you!
> 
> >> It is embryo's oob_skb that holds the refcount; see how __unix_gc() goes
> >> over gc_candidates attempting to kfree_skb(u->oob_skb), notice that `u`
> >> here is a listener, not an embryo.
> >>
> >> I understand you're "in rush for the merge window", but would it be okay if
> >> I ask you not to edit my commit messages so heavily?
> > 
> > I noticed the new gc code was merged in Linus' tree.  It's still not
> > synced with net.git, but I guess it will be done soon and your patch
> > will not apply on net.git.  Then, I cannot include your patch as a
> > series, so please feel free to send it to each stable tree.
> 
> All right, no problem. Does it mean you'll be posting PATCH 2/2 ("af_unix:
> Update unix_sk(sk)->oob_skb under sk_receive_queue lock") to stable(s)?

I'll post patch 2/2 to net.git and it will be sent to stable later
by netdev maintainers.  Then, with your patch, the issue is completely
fixed for the old gc.


> 
> Moving on to the New GC: Python test from this patch shows that the New GC
> is memleaking in pretty much the same fashion.

Good catch!

> 
> $ grep splat /proc/net/unix
> $ ./unix-oob-splat.py
> $ rm unix-oob-splat
> $ ./unix-oob-splat.py
> $ grep splat /proc/net/unix
> 0000000000000000: 00000002 00000000 00000000 0001 02     0 unix-oob-splat
> 0000000000000000: 00000002 00000000 00000000 0001 02     0 unix-oob-splat
> 0000000000000000: 00000002 00000000 00010000 0001 01  6643 unix-oob-splat
> 0000000000000000: 00000002 00000000 00010000 0001 01  2920 unix-oob-splat
> 
> I've posted a patch:
> https://lore.kernel.org/netdev/20240516103049.1132040-1-mhal@rbox.co/
> 
> I tried to align with your version of the commit message, but feel free to
> chime in. Also, I took the liberty to introduce a small sanity check. Would
> you prefer if I dropped this hunk or possibly made it a separate patch?

Will comment on the patch thread.

Thanks!
diff mbox series

Patch

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 0104be9d4704..beecd0bfbf48 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -170,10 +170,11 @@  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			/* Process the descriptors of this socket */
 			int nfd = UNIXCB(skb).fp->count;
 			struct file **fp = UNIXCB(skb).fp->fp;
+			struct unix_sock *u;
 
 			while (nfd--) {
 				/* Get the socket the fd matches if it indeed does so */
-				struct unix_sock *u = unix_get_socket(*fp++);
+				u = unix_get_socket(*fp++);
 
 				/* Ignore non-candidates, they could have been added
 				 * to the queues after starting the garbage collection
@@ -187,6 +188,14 @@  static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			if (hit && hitlist != NULL) {
 				__skb_unlink(skb, &x->sk_receive_queue);
 				__skb_queue_tail(hitlist, skb);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+				u = unix_sk(x);
+				if (u->oob_skb == skb) {
+					WARN_ON_ONCE(skb_unref(u->oob_skb));
+					u->oob_skb = NULL;
+				}
+#endif
 			}
 		}
 	}
@@ -338,17 +347,9 @@  static void __unix_gc(struct work_struct *work)
 	 * which are creating the cycle(s).
 	 */
 	skb_queue_head_init(&hitlist);
-	list_for_each_entry(u, &gc_candidates, link) {
+	list_for_each_entry(u, &gc_candidates, link)
 		scan_children(&u->sk, inc_inflight, &hitlist);
 
-#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
-		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
-			u->oob_skb = NULL;
-		}
-#endif
-	}
-
 	/* not_cycle_list contains those sockets which do not make up a
 	 * cycle.  Restore these to the inflight list.
 	 */