diff mbox series

[vfs/for-next,v6] cgroup: fix top cgroup refcnt leak

Message ID 20190103035426.23526-1-avagin@gmail.com (mailing list archive)
State New, archived
Headers show
Series [vfs/for-next,v6] cgroup: fix top cgroup refcnt leak | expand

Commit Message

Andrei Vagin Jan. 3, 2019, 3:54 a.m. UTC
It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag
v3: fix a reference leak when kernfs_node_dentry fails
v4: call deactivate_locked_super() in a error case
v5: don't dereference fc->root after dput()
v6: rebase on today's vfs/for-next

 kernel/cgroup/cgroup-v1.c |  2 +-
 kernel/cgroup/cgroup.c    | 25 ++++++++++++++++++-------
 2 files changed, 19 insertions(+), 8 deletions(-)

Comments

Al Viro Jan. 3, 2019, 8:32 a.m. UTC | #1
On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote:

[I'm thoroughly sick of refcounting in that thing, TBH ;-/]

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index a19f0fec9d82..fe67b5e81f9a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  
>  	ret = kernfs_get_tree(fc);
>  	if (ret < 0)
> -		goto out_cgrp;
> +		return ret;

Why does that case avoid needing cgroup_put()?  Note, BTW, that we
also have this:
/*
 * Destroy a cgroup filesystem context.
 */
static void cgroup_fs_context_free(struct fs_context *fc)
{
        struct cgroup_fs_context *ctx = cgroup_fc2context(fc);

        kfree(ctx->name);
        kfree(ctx->release_agent);
        if (ctx->root)
                cgroup_put(&ctx->root->cgrp);
        put_cgroup_ns(ctx->ns);
        kernfs_free_fs_context(fc);
        kfree(ctx);
}

which also needs to be taken into account.
Andrei Vagin Jan. 3, 2019, 5:34 p.m. UTC | #2
On Thu, Jan 03, 2019 at 08:32:29AM +0000, Al Viro wrote:
> On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote:
> 
> [I'm thoroughly sick of refcounting in that thing, TBH ;-/]

feel your pain...

> 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index a19f0fec9d82..fe67b5e81f9a 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc)
> >  
> >  	ret = kernfs_get_tree(fc);
> >  	if (ret < 0)
> > -		goto out_cgrp;
> > +		return ret;
> 
> Why does that case avoid needing cgroup_put()?  Note, BTW, that we
> also have this:

The origin patch which added this cgroup_put says that it is needed
because mount() and kill_sb() is not a one-to-one match. sb is created
in kernfs_get_tree(), so I decided that this cgroup_put() is needed only
after kernfs_get_tree().

commit c6b3d5bcd67c75961a1e8b9564d1475c0f194a84
Author: Li Zefan <lizefan@huawei.com>
Date:   Fri Apr 4 17:14:41 2014 +0800

cgroup: fix top cgroup refcnt leak

As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.


> /*
>  * Destroy a cgroup filesystem context.
>  */
> static void cgroup_fs_context_free(struct fs_context *fc)
> {
>         struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> 
>         kfree(ctx->name);
>         kfree(ctx->release_agent);
>         if (ctx->root)
>                 cgroup_put(&ctx->root->cgrp);
>         put_cgroup_ns(ctx->ns);
>         kernfs_free_fs_context(fc);
>         kfree(ctx);
> }
> 
> which also needs to be taken into account.

we have unconditional cgroup_get in cgroup1_get_tree() to match this
put, but we need to check error paths.
David Howells Jan. 3, 2019, 9:54 p.m. UTC | #3
Hi Andrei,

It turns out that the cgroup-v1 refcounting is slightly broken upstream.  If
kernfs_get_inode() fails in kernfs_fill_super(), then there's refcount
breakage in the error handling.

This can be provoked by making the following change:

--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -240,6 +240,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 	sb->s_shrink.seeks = 0;
 
 	/* get root inode, initialize and unlock it */
+	if (strcmp(current->comm, "foobar") == 0)
+		return -ENOANO;
 	mutex_lock(&kernfs_mutex);
 	inode = kernfs_get_inode(sb, info->root->kn);
 	mutex_unlock(&kernfs_mutex);

and then copying /bin/mount to /tmp/foobar and doing:

[root@andromeda ~]# /tmp/foobar -t cgroup -o none,name=xxxxy xxx /tmp/x/a1
foobar: /tmp/x/a1: mount(2) system call failed: No anode.

In dmesg I see the attached traces (see below).

The problem appears to be because cgroup_do_mount() calls kernfs_mount(), but
the refcount on the new root cgroup object hasn't been properly initialised
yet.  However, because we make it past sget(), the superblock thinks it has
taken the caller's ref - and this gets eaten by cgroup_kill_sb().

Further, another ref appears to be released by cgroup_do_mount() in the event
that kernfs_mount() fails.

David

------------[ cut here ]------------
percpu_ref_kill_and_confirm called more than once on css_release!
WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x4b/0x14c
Modules linked in:
CPU: 3 PID: 3218 Comm: bugger Not tainted 4.20.0-fscache+ #1287
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_kill_and_confirm+0x4b/0x14c
Code: c6 74 29 80 3d 42 0a 00 01 00 75 20 48 8b 53 10 48 c7 c6 f0 cd e9 81 48 c7 c7 82 99 16 82 c6 05 27 0a 00 01 01 e8 04 54 ac ff <0f> 0b 48 83 4b 08 02 4c 89 e6 48 89 df e8 fb fc ff ff 65 ff 05 54
RSP: 0018:ffff8880c5103c60 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff8880d35e8020 RCX: ffff8880c5103b4c
RDX: 0000000000000046 RSI: ffffffff8245fef8 RDI: ffffffff810ac1ec
RBP: ffff8880c5103c78 R08: 0000000000000041 R09: 0000000000021900
R10: ffff8880c5103900 R11: 0000006423114dc0 R12: 0000000000000000
R13: 000000000027e0eb R14: 0000000000000246 R15: 0000000000000000
FS:  00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0
Call Trace:
 cgroup_kill_sb+0x131/0x141
 deactivate_locked_super+0x29/0x5b
 kernfs_mount_ns+0x1fa/0x223
 cgroup_do_mount+0x36/0x1c8
 cgroup1_mount+0x5b2/0x610
 cgroup_mount+0x33b/0x37f
 mount_fs+0x6a/0x10b
 vfs_kern_mount+0x67/0x13c
 do_mount+0x90e/0xb7e
 ? kmem_cache_alloc_trace+0x241/0x27d
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f8ec1e14ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada
RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0
RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480
R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0
R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184
irq event stamp: 3532
hardirqs last  enabled at (3531): [<ffffffff811c0584>] kfree+0x152/0x159
hardirqs last disabled at (3532): [<ffffffff819e27e8>] _raw_spin_lock_irqsave+0x12/0x44
softirqs last  enabled at (3326): [<ffffffff81c00353>] __do_softirq+0x353/0x38f
softirqs last disabled at (3317): [<ffffffff81058c16>] irq_exit+0x63/0xd1
---[ end trace 9bca09dc135d9213 ]---
WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:359 percpu_ref_reinit+0x10/0x17
Modules linked in:
CPU: 3 PID: 3218 Comm: bugger Tainted: G        W         4.20.0-fscache+ #1287
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_reinit+0x10/0x17
Code: fa ff ff 48 8d 65 f0 4c 89 f6 48 c7 c7 e0 8d 51 82 5b 41 5e 5d e9 3c 50 45 00 48 8b 47 08 a8 03 74 08 48 8b 07 48 85 c0 74 02 <0f> 0b e9 c4 fe ff ff 55 31 f6 53 48 89 fb 48 83 ec 30 65 48 8b 04
RSP: 0018:ffff8880c5103d38 EFLAGS: 00010282
RAX: fffffffffffffffe RBX: ffff8880d35e8000 RCX: ffffffff810f1a8f
RDX: 00000000ffffbf7e RSI: 00000000425b018d RDI: ffff8880d35e8020
RBP: ffff8880c5103da8 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8880c5103d40 R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000000 R14: ffffffffffffffc9 R15: ffff88803f63f001
FS:  00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0
Call Trace:
 cgroup1_mount+0x5d1/0x610
 cgroup_mount+0x33b/0x37f
 mount_fs+0x6a/0x10b
 vfs_kern_mount+0x67/0x13c
 do_mount+0x90e/0xb7e
 ? kmem_cache_alloc_trace+0x241/0x27d
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f8ec1e14ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada
RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0
RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480
R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0
R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184
irq event stamp: 3666
hardirqs last  enabled at (3665): [<ffffffff810bb0b1>] __call_rcu+0x1dc/0x1fa
hardirqs last disabled at (3666): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (3608): [<ffffffff81c00353>] __do_softirq+0x353/0x38f
softirqs last disabled at (3535): [<ffffffff81058c16>] irq_exit+0x63/0xd1
---[ end trace 9bca09dc135d9214 ]---
diff mbox series

Patch

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 4b189e821cad..de7d625ec077 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1285,8 +1285,8 @@  int cgroup1_get_tree(struct fs_context *fc)
 		mutex_lock(&cgroup_mutex);
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
 		mutex_unlock(&cgroup_mutex);
-		cgroup_get(&root->cgrp);
 	}
+	cgroup_get(&root->cgrp);
 
 	/*
 	 * If @pinned_sb, we're reusing an existing root and holding an
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a19f0fec9d82..fe67b5e81f9a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2019,7 +2019,7 @@  int cgroup_do_get_tree(struct fs_context *fc)
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
-		goto out_cgrp;
+		return ret;
 
 	/*
 	 * In non-init cgroup namespace, instead of root cgroup's dentry,
@@ -2038,19 +2038,30 @@  int cgroup_do_get_tree(struct fs_context *fc)
 		mutex_unlock(&cgroup_mutex);
 
 		nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
-		if (IS_ERR(nsdentry))
-			return PTR_ERR(nsdentry);
+		if (IS_ERR(nsdentry)) {
+			ret = PTR_ERR(nsdentry);
+			goto out_cgrp;
+		}
 		dput(fc->root);
 		fc->root = nsdentry;
 	}
 
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
+	if (!ctx->kfc.new_sb_created)
+		apply_cgroup_root_flags(ctx->flags);
 
 out_cgrp:
+	if (!ctx->kfc.new_sb_created)
+		cgroup_put(&ctx->root->cgrp);
+
+	if (unlikely(ret)) {
+		struct super_block *sb = fc->root->d_sb;
+
+		dput(fc->root);
+		deactivate_locked_super(sb);
+		fc->root = NULL;
+	}
+
 	return ret;
 }