[vfs/for-next,v2] cgroup: fix top cgroup refcnt leak

Message ID	20181229000400.26333-1-avagin@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Andrei Vagin <avagin@gmail.com> To: Alexander Viro <viro@zeniv.linux.org.uk>, David Howells <dhowells@redhat.com> Cc: linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, Andrei Vagin <avagin@gmail.com>, Li Zefan <lizefan@huawei.com> Subject: [PATCH vfs/for-next v2] cgroup: fix top cgroup refcnt leak Date: Fri, 28 Dec 2018 16:04:00 -0800 Message-Id: <20181229000400.26333-1-avagin@gmail.com> In-Reply-To: <20181228235900.21468-1-avagin@gmail.com> References: <20181228235900.21468-1-avagin@gmail.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	[vfs/for-next,v2] cgroup: fix top cgroup refcnt leak \| expand [vfs/for-next,v2] cgroup: fix top cgroup refcnt leak

Andrei Vagin Dec. 29, 2018, 12:04 a.m. UTC

It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
commit was reverted by mistake.

$ mkdir /tmp/cgroup
$ mkdir /tmp/cgroup2
$ mount -t cgroup -o none,name=test test /tmp/cgroup
$ mount -t cgroup -o none,name=test test /tmp/cgroup2
$ umount /tmp/cgroup
$ umount /tmp/cgroup2
$ cat /proc/self/cgroup | grep test
12:name=test:/

You can see the test cgroup was not freed.

Cc: Li Zefan <lizefan@huawei.com>
Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
---

v2: clean up code and add the vfs/for-next tag

 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

Andrei Vagin Dec. 30, 2018, 7:41 p.m. UTC | #1

Alexander and David,

This patch fixed the problem which was reported more than a mounth ago
https://patchwork.kernel.org/patch/10610671/

Now this code is in the linux-next and the problem blocks running CRIU
tests on the linux next kernels:

https://travis-ci.org/avagin/linux/builds

Thanks,
Andrei

On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> ---
> 
> v2: clean up code and add the vfs/for-next tag
> 
>  kernel/cgroup/cgroup.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index fb0717696895..f63974a3725f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  	ret = 0;
>  	if (ctx->kfc.new_sb_created)
>  		goto out_cgrp;
> +	else
> +		cgroup_put(&ctx->root->cgrp);
> +
>  	apply_cgroup_root_flags(ctx->flags);
>  	return 0;
>  
> -- 
> 2.17.2
>

Al Viro Jan. 2, 2019, 2:28 a.m. UTC | #2

On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>
> ---
> 
> v2: clean up code and add the vfs/for-next tag
> 
>  kernel/cgroup/cgroup.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index fb0717696895..f63974a3725f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  	ret = 0;
>  	if (ctx->kfc.new_sb_created)
>  		goto out_cgrp;
> +	else
> +		cgroup_put(&ctx->root->cgrp);
> +
>  	apply_cgroup_root_flags(ctx->flags);
>  	return 0;

That looks horrible, especially since out_cgrp is return ret;
If anything, it should be
	if (!ctx->kfc.new_sb_created) {
		cgroup_put(&ctx->root->cgrp);
		apply_cgroup_root_flags(ctx->flags);
	}
	return 0;

What I don't understand is why apply_cgroup_root_flags() is not
called in "new superblock" case here.  It used to, prior to that
conversion...

Another fishy place I see there is
                nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
                if (IS_ERR(nsdentry))
                        return PTR_ERR(nsdentry);
                dput(fc->root);
                fc->root = nsdentry;
What happens if we get here with non-NULL fc->root (and we'd better,
after successful from kernfs_get_tree() a bit earlier) and hit that
failure exit?  A leak?

With apologies for being MIA for a week - it had been insane here...

Andrei Vagin Jan. 2, 2019, 7:37 p.m. UTC | #3

On Wed, Jan 02, 2019 at 02:28:04AM +0000, Al Viro wrote:
> On Fri, Dec 28, 2018 at 04:04:00PM -0800, Andrei Vagin wrote:
> > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > commit was reverted by mistake.
> > 
> > $ mkdir /tmp/cgroup
> > $ mkdir /tmp/cgroup2
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > $ umount /tmp/cgroup
> > $ umount /tmp/cgroup2
> > $ cat /proc/self/cgroup | grep test
> > 12:name=test:/
> > 
> > You can see the test cgroup was not freed.
> > 
> > Cc: Li Zefan <lizefan@huawei.com>
> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > ---
> > 
> > v2: clean up code and add the vfs/for-next tag
> > 
> >  kernel/cgroup/cgroup.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index fb0717696895..f63974a3725f 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2047,6 +2047,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
> >  	ret = 0;
> >  	if (ctx->kfc.new_sb_created)
> >  		goto out_cgrp;
> > +	else
> > +		cgroup_put(&ctx->root->cgrp);
> > +
> >  	apply_cgroup_root_flags(ctx->flags);
> >  	return 0;
> 
> That looks horrible, especially since out_cgrp is return ret;
> If anything, it should be
> 	if (!ctx->kfc.new_sb_created) {
> 		cgroup_put(&ctx->root->cgrp);
> 		apply_cgroup_root_flags(ctx->flags);
> 	}
> 	return 0;
> 
> What I don't understand is why apply_cgroup_root_flags() is not
> called in "new superblock" case here.  It used to, prior to that
> conversion...

It is a good question and I don't have an answer on it. I think David
can tell more about this.

> 
> Another fishy place I see there is
>                 nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb);
>                 if (IS_ERR(nsdentry))
>                         return PTR_ERR(nsdentry);
>                 dput(fc->root);
>                 fc->root = nsdentry;
> What happens if we get here with non-NULL fc->root (and we'd better,
> after successful from kernfs_get_tree() a bit earlier) and hit that
> failure exit?  A leak?

Yes, here is a leak too. I fixed it and sent v3, but then I decided that
it would be good to test this error path and found one more problem:

[   22.669696] ================================================
[   22.670468] WARNING: lock held when returning to user space!
[   22.671225] 4.20.0-rc1-00081-g01a72fa4bd0e-dirty #12 Not tainted
[   22.672018] ------------------------------------------------
[   22.672817] mount/1148 is leaving the kernel with locks still held!
[   22.673660] 1 lock held by mount/1148:
[   22.674165]  #0: 00000000c07be72c (&fc->fs_type->s_umount_key#41){+.+.}, at: grab_super+0x29/0x90

deactivate_locked_super() has to be called on the error path.  I sent v4 with
this fix. I'm sorry for the noise in the mailing list, I had to test this code
before sending v3;

> With apologies for being MIA for a week - it had been insane here...

Good to see you back in stride.

David Howells Jan. 2, 2019, 10:26 p.m. UTC | #4

Andrei Vagin <avagin@gmail.com> wrote:

> It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> commit was reverted by mistake.
> 
> $ mkdir /tmp/cgroup
> $ mkdir /tmp/cgroup2
> $ mount -t cgroup -o none,name=test test /tmp/cgroup
> $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> $ umount /tmp/cgroup
> $ umount /tmp/cgroup2
> $ cat /proc/self/cgroup | grep test
> 12:name=test:/
> 
> You can see the test cgroup was not freed.
> 
> Cc: Li Zefan <lizefan@huawei.com>
> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: Andrei Vagin <avagin@gmail.com>

The kernel (Al's for-next branch, that is) seems to work fine without this
patch; this patch causes the kernel to go bang (trace below).

David
---
percpu ref (css_release) <= 0 (0) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
Call Trace:
 <IRQ>
 ? rcu_process_callbacks+0x469/0x6de
 ? percpu_ref_exit+0x26/0x26
 rcu_process_callbacks+0x4d7/0x6de
 __do_softirq+0x1a5/0x38f
 irq_exit+0x63/0xd1
 smp_apic_timer_interrupt+0x1cd/0x1e0
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
 do_idle+0x163/0x1ea
 cpu_startup_entry+0x1d/0x1f
 start_secondary+0x175/0x190
 secondary_startup_64+0xa4/0xb0
irq event stamp: 99625
hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
---[ end trace f59fd95ebc091779 ]---

Andrei Vagin Jan. 2, 2019, 11:06 p.m. UTC | #5

On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote:
> Andrei Vagin <avagin@gmail.com> wrote:
> 
> > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > commit was reverted by mistake.
> > 
> > $ mkdir /tmp/cgroup
> > $ mkdir /tmp/cgroup2
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > $ umount /tmp/cgroup
> > $ umount /tmp/cgroup2
> > $ cat /proc/self/cgroup | grep test
> > 12:name=test:/
> > 
> > You can see the test cgroup was not freed.
> > 
> > Cc: Li Zefan <lizefan@huawei.com>
> > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> 
> The kernel (Al's for-next branch, that is) seems to work fine without this
> patch;

It doesn't work for me. The mount system call stucks in cgroup1_get_tree:

[root@fc24 ~]# uname -a
Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux

[avagin@laptop linux]$ git log | head -n 6
commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673
Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Dec 28 02:05:08 2018 -0500

    Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next


$ ps axf
...
  636 ?        S      0:00      \_ sshd: root@pts/0
  643 pts/0    Ss     0:00          \_ -bash
  710 pts/0    S      0:00              \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint
 1449 pts/0    S      0:00              |   \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups
 1450 pts/0    S      0:00              |       \_ /bin/sh ./zdtm_umount_cgroups
 1456 pts/0    D      0:00              |           \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8


[root@fc24 criu]# cat /proc/1456/stack 
[<0>] msleep+0x38/0x40
[<0>] cgroup1_get_tree+0x47b/0x76b
[<0>] vfs_get_tree+0x3d/0x100
[<0>] do_mount+0x2d8/0xde0
[<0>] ksys_mount+0xba/0xd0
[<0>] __x64_sys_mount+0x21/0x30
[<0>] do_syscall_64+0x5a/0x200
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff

> this patch causes the kernel to go bang (trace below).
> 
> David
> ---
> percpu ref (css_release) <= 0 (0) after switching to atomic

I have updated to Al's vfs-next and now I see this error too.

I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next")
and it works fine there...


> WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> Modules linked in:
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
> RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
> RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
> RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
> R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
> R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
> FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
> Call Trace:
>  <IRQ>
>  ? rcu_process_callbacks+0x469/0x6de
>  ? percpu_ref_exit+0x26/0x26
>  rcu_process_callbacks+0x4d7/0x6de
>  __do_softirq+0x1a5/0x38f
>  irq_exit+0x63/0xd1
>  smp_apic_timer_interrupt+0x1cd/0x1e0
>  apic_timer_interrupt+0xf/0x20
>  </IRQ>
> RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
> Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
> RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
> RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
> RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
> R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
> R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
>  do_idle+0x163/0x1ea
>  cpu_startup_entry+0x1d/0x1f
>  start_secondary+0x175/0x190
>  secondary_startup_64+0xa4/0xb0
> irq event stamp: 99625
> hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
> hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
> softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
> ---[ end trace f59fd95ebc091779 ]---

Andrei Vagin Jan. 2, 2019, 11:31 p.m. UTC | #6

On Wed, Jan 02, 2019 at 03:06:55PM -0800, Andrei Vagin wrote:
> On Wed, Jan 02, 2019 at 10:26:20PM +0000, David Howells wrote:
> > Andrei Vagin <avagin@gmail.com> wrote:
> > 
> > > It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak")
> > > commit was reverted by mistake.
> > > 
> > > $ mkdir /tmp/cgroup
> > > $ mkdir /tmp/cgroup2
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup
> > > $ mount -t cgroup -o none,name=test test /tmp/cgroup2
> > > $ umount /tmp/cgroup
> > > $ umount /tmp/cgroup2
> > > $ cat /proc/self/cgroup | grep test
> > > 12:name=test:/
> > > 
> > > You can see the test cgroup was not freed.
> > > 
> > > Cc: Li Zefan <lizefan@huawei.com>
> > > Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> > > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > 
> > The kernel (Al's for-next branch, that is) seems to work fine without this
> > patch;
> 
> It doesn't work for me. The mount system call stucks in cgroup1_get_tree:

Here is a reproducer for this problem:

[root@fc24 ~]# cat fs-vs-cg 
set -m
d=$(mktemp -d /tmp/cg.XXXXXX)
for i in `seq 2`; do 
	mkdir $d/a$i
	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
done
mkdir -p $d/a1/test
for i in `seq 2`; do 
	umount $d/a$i
done
d=$(mktemp -d /tmp/cg.XXXXXX)
for i in `seq 2`; do 
	mkdir $d/a$i
	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
done
rmdir $d/a1/test
for i in `seq 2`; do 
	umount $d/a$i
done

You need to execute this script twice and it will stuck on a second
attmept.

I tested it on Al's vfs-next without my patch.

You will see this warning in the kernel log:

[   47.043634] ------------[ cut here ]------------
[   47.044522] percpu ref (css_release) <= 0 (0) after switching to atomic
[   47.044544] WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1dc/0x210
[   47.047369] Modules linked in:
[   47.047911] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-00071-g1fab5fff0a7a #4

So my patch is probably correct and the problem is in another place.

> 
> [root@fc24 ~]# uname -a
> Linux fc24 4.20.0-rc1-00071-g1fab5fff0a7a #4 SMP Wed Jan 2 14:59:36 PST 2019 x86_64 x86_64 x86_64 GNU/Linux
> 
> [avagin@laptop linux]$ git log | head -n 6
> commit 1fab5fff0a7ae1fa3b78383a78f7a56f03a3d673
> Merge: ea5751ccd665 fd6261f4322c 4addd2640fca a40612ef0ee1 f91528955d00
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Fri Dec 28 02:05:08 2018 -0500
> 
>     Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next
> 
> 
> $ ps axf
> ...
>   636 ?        S      0:00      \_ sshd: root@pts/0
>   643 pts/0    Ss     0:00          \_ -bash
>   710 pts/0    S      0:00              \_ python test/zdtm.py run -p 4 --keep-going --report report -T .*cgroup.* --ignore-taint
>  1449 pts/0    S      0:00              |   \_ flock zdtm_mount_cgroups.lock ./zdtm_umount_cgroups
>  1450 pts/0    S      0:00              |       \_ /bin/sh ./zdtm_umount_cgroups
>  1456 pts/0    D      0:00              |           \_ mount -t cgroup -o none,name=zdtmtst.defaultroot zdtm zdtm.EW1qB8
> 
> 
> [root@fc24 criu]# cat /proc/1456/stack 
> [<0>] msleep+0x38/0x40
> [<0>] cgroup1_get_tree+0x47b/0x76b
> [<0>] vfs_get_tree+0x3d/0x100
> [<0>] do_mount+0x2d8/0xde0
> [<0>] ksys_mount+0xba/0xd0
> [<0>] __x64_sys_mount+0x21/0x30
> [<0>] do_syscall_64+0x5a/0x200
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
> 
> > this patch causes the kernel to go bang (trace below).
> > 
> > David
> > ---
> > percpu ref (css_release) <= 0 (0) after switching to atomic
> 
> I have updated to Al's vfs-next and now I see this error too.
> 
> I tested by patch on 40effd960bec ("Merge branches 'work.mount', 'work.misc', 'misc.misc' and 'work.iov_iter' into for-next")
> and it works fine there...
> 
> 
> > WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> > Modules linked in:
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.20.0-rc1-fscache+ #1256
> > Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> > RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x90/0x1a0
> > Code: d8 48 85 c0 7f 26 80 3d 29 60 ff 00 00 75 1d 48 8b 53 d8 48 c7 c7 00 3c 16 82 c6 05 15 60 ff 00 01 48 8b 73 e8 e8 26 1f ac ff <0f> 0b 48 8d 43 d8 48 89 c7 49 89 c6 ff 53 f0 48 c7 43 f0 00 00 00
> > RSP: 0018:ffff8800c6c83ed8 EFLAGS: 00010286
> > RAX: 0000000000000000 RBX: ffff8800d3a6e048 RCX: ffff8800c6c83dc4
> > RDX: 0000000000000003 RSI: ffff8800c5c9cb08 RDI: ffff8800c5c9c340
> > RBP: ffff8800c6c83ef8 R08: 000000000000003b R09: 0000000000021900
> > R10: ffff8800c6c83b00 R11: 0000004ac8b230c2 R12: 000060ff39019cd0
> > R13: ffffffff825b0be8 R14: ffffffffffffffff R15: 0000000000000009
> > FS:  0000000000000000(0000) GS:ffff8800c6c80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f33b5ad1f54 CR3: 0000000002410001 CR4: 00000000001606e0
> > Call Trace:
> >  <IRQ>
> >  ? rcu_process_callbacks+0x469/0x6de
> >  ? percpu_ref_exit+0x26/0x26
> >  rcu_process_callbacks+0x4d7/0x6de
> >  __do_softirq+0x1a5/0x38f
> >  irq_exit+0x63/0xd1
> >  smp_apic_timer_interrupt+0x1cd/0x1e0
> >  apic_timer_interrupt+0xf/0x20
> >  </IRQ>
> > RIP: 0010:cpuidle_enter_state+0x24e/0x2b1
> > Code: ff e8 7b be 8a ff 45 84 ed 74 17 9c 58 0f ba e0 09 73 08 0f 0b fa e8 63 8d 94 ff 31 ff e8 14 af 8f ff e8 cd 8b 94 ff fb 85 db <78> 47 48 8b 14 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48
> > RSP: 0018:ffff8800c5d23ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> > RAX: ffff8800c5c9c340 RBX: 0000000000000005 RCX: 000000000000001f
> > RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffff8800c5c9c340
> > RBP: 0000000000000005 R08: 0000000000000002 R09: 0000000000021900
> > R10: 071c71c71c71c71c R11: 0000004ac7f2c3ed R12: ffffe8ffffc89ed0
> > R13: 0000000000000000 R14: ffffffff8251d478 R15: 0000000000000000
> >  do_idle+0x163/0x1ea
> >  cpu_startup_entry+0x1d/0x1f
> >  start_secondary+0x175/0x190
> >  secondary_startup_64+0xa4/0xb0
> > irq event stamp: 99625
> > hardirqs last  enabled at (99624): [<ffffffff810b18a5>] vprintk_emit+0xe6/0x24a
> > hardirqs last disabled at (99625): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c
> > softirqs last  enabled at (99590): [<ffffffff8105eba9>] irq_enter+0x42/0x5d
> > softirqs last disabled at (99591): [<ffffffff8105ec27>] irq_exit+0x63/0xd1
> > ---[ end trace f59fd95ebc091779 ]---

David Howells Jan. 3, 2019, 12:33 a.m. UTC | #7

Andrei Vagin <avagin@gmail.com> wrote:

> Here is a reproducer for this problem:
> 
> [root@fc24 ~]# cat fs-vs-cg 
> set -m
> d=$(mktemp -d /tmp/cg.XXXXXX)
> for i in `seq 2`; do 
> 	mkdir $d/a$i
> 	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
> done
> mkdir -p $d/a1/test
> for i in `seq 2`; do 
> 	umount $d/a$i
> done
> d=$(mktemp -d /tmp/cg.XXXXXX)
> for i in `seq 2`; do 
> 	mkdir $d/a$i
> 	mount -t cgroup -o none,name=xxxxy xxx $d/a$i
> done
> rmdir $d/a1/test
> for i in `seq 2`; do 
> 	umount $d/a$i
> done
> 
> You need to execute this script twice and it will stuck on a second
> attmept.

This causes:

	percpu ref (css_release) <= 0 (0) after switching to atomic

without your patch and:

	percpu ref (css_release) <= 0 (-2) after switching to atomic

with your patch, both followed by:

	WARNING: CPU: 0 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x90/0x1a0

on Al's for-next branch.

David

David Howells Jan. 3, 2019, 1:41 p.m. UTC | #8

I'm seeing something a bit weird, but I'm not sure why.  If I expand Andrei's
test script to:

	set -m
	d=$(mktemp -d /tmp/cg.XXXXXX)
	mkdir $d/a1
	mount -t cgroup -o none,name=xxxxy xxx $d/a1
	mkdir $d/a2
	mount -t cgroup -o none,name=xxxxy xxx $d/a2
	mkdir -p $d/a1/test
	umount $d/a1
	umount $d/a2

	d=$(mktemp -d /tmp/cg.XXXXXX)
	mkdir $d/a1
	mount -t cgroup -o none,name=xxxxy xxx $d/a1
	mkdir $d/a2
	mount -t cgroup -o none,name=xxxxy xxx $d/a2

	rmdir $d/a1/test
	umount $d/a1
	umount $d/a2

then a cgroup gets leaked at the end.  Just retaining the final:

	for i in `seq 2`; do 
		umount $d/a$i
	done

in place of the two final expanded umount commands fixes the issue.

The mounts are getting released, but not the cgroups as far as I can tell.

David

David Howells Jan. 3, 2019, 1:42 p.m. UTC | #9

David Howells <dhowells@redhat.com> wrote:

> 	umount $d/a1
> 	umount $d/a2

Putting a small sleep between the two fixes the problem too.

David

David Howells Jan. 3, 2019, 3:27 p.m. UTC | #10

David Howells <dhowells@redhat.com> wrote:

> 	umount $d/a1
> 	umount $d/a2
> 
> then a cgroup gets leaked at the end.  Just retaining the final:
> 
> 	for i in `seq 2`; do 
> 		umount $d/a$i
> 	done
> 
> in place of the two final expanded umount commands fixes the issue.

The issue also goes away if I set PERCPU_REF_INIT_ATOMIC on the refcount
object when initialising it:-/

For reference, the sequence of css_get/tryget/put calls is:

*** 00000000e2655b5f RIN   1: cgroup1_get_tree+0x5de/0x73b
*** 00000000e2655b5f GET   2: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   1: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRYO  2: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f PUT   1: cgroup_do_get_tree+0x1c5/0x1fa
*** 00000000e2655b5f GET   2: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   1: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRY   2: cgroup_kn_lock_live+0x145/0x18a
*** 00000000e2655b5f GET   3: cgroup_mkdir+0x241/0x454
*** 00000000e2655b5f PUT   2: cgroup_mkdir+0xad/0x454
*** 00000000e2655b5f PUT   1: cgroup_kill_sb+0x134/0x161

*** 00000000e2655b5f TRYO  2: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f GET   3: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   2: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f TRYO  3: cgroup1_get_tree+0x4aa/0x73b
*** 00000000e2655b5f PUT   2: cgroup_do_get_tree+0x1c5/0x1fa
*** 00000000e2655b5f GET   3: cgroup1_get_tree+0x6e1/0x73b
*** 00000000e2655b5f PUT   2: cgroup_fs_context_free+0x12c/0x15f

*** 00000000e2655b5f KIL   2: cgroup_kill_sb+0x146/0x161
*** 00000000e2655b5f PUT   0: css_free_rwork_fn+0x399/0x5ec

As obtained with the attached patch.

I'm not sure the refcount is correct after the second block (at the end of the
second mount).  Should it have one ref per active sb?

David
---
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694fe035b..23d022eb669f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -165,6 +165,8 @@ struct cgroup_subsys_state {
 	 * fields of the containing structure.
 	 */
 	struct cgroup_subsys_state *parent;
+
+	bool debug;
 };
 
 /*
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index bb0c7da50ed2..76ca2be4986d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -295,6 +295,8 @@ void css_task_iter_end(struct css_task_iter *it);
  * Inline functions.
  */
 
+extern void css_refcount(struct cgroup_subsys_state *css, const char *op);
+
 /**
  * css_get - obtain a reference on the specified css
  * @css: target css
@@ -303,8 +305,10 @@ void css_task_iter_end(struct css_task_iter *it);
  */
 static inline void css_get(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get(&css->refcnt);
+		css_refcount(css, "GET");
+	}
 }
 
 /**
@@ -316,8 +320,10 @@ static inline void css_get(struct cgroup_subsys_state *css)
  */
 static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get_many(&css->refcnt, n);
+		css_refcount(css, "GETM");
+	}
 }
 
 /**
@@ -333,8 +339,11 @@ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
  */
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget(&css->refcnt))
+		return false;
+	css_refcount(css, "TRY");
 	return true;
 }
 
@@ -350,8 +359,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
  */
 static inline bool css_tryget_online(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget_live(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget_live(&css->refcnt))
+		return false;
+	css_refcount(css, "TRYO");
 	return true;
 }
 
@@ -383,8 +395,10 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
  */
 static inline void css_put(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put(&css->refcnt);
+		css_refcount(css, "PUT");
+	}
 }
 
 /**
@@ -396,8 +410,10 @@ static inline void css_put(struct cgroup_subsys_state *css)
  */
 static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put_many(&css->refcnt, n);
+		css_refcount(css, "PUTM");
+	}
 }
 
 static inline void cgroup_get(struct cgroup *cgrp)
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index de7d625ec077..d87a0b8c3441 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1175,7 +1175,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		    ss->root == &cgrp_dfl_root)
 			continue;
 
-		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+		if (!css_tryget_online(&ss->root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			goto err_restart;
 		}
@@ -1228,7 +1228,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		 */
 		pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
 		if (IS_ERR(pinned_sb) ||
-		    !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+		    !css_tryget_online(&root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			if (!IS_ERR_OR_NULL(pinned_sb))
 				deactivate_super(pinned_sb);
@@ -1284,6 +1284,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 	if (new_root) {
 		mutex_lock(&cgroup_mutex);
 		percpu_ref_reinit(&root->cgrp.self.refcnt);
+		css_refcount(&root->cgrp.self, "RIN");
 		mutex_unlock(&cgroup_mutex);
 	}
 	cgroup_get(&root->cgrp);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7dbe71b23941..299e3fb30731 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1904,8 +1904,11 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 	root->flags = ctx->flags;
 	if (ctx->release_agent)
 		strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
-	if (ctx->name)
+	if (ctx->name) {
 		strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
+		if (strcmp(root->name, "xxxxy") == 0)
+			cgrp->self.debug = true;
+	}
 	if (ctx->cpuset_clone_children)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
@@ -1927,7 +1930,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 	root_cgrp->ancestor_ids[0] = ret;
 
 	ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
-			      ref_flags, GFP_KERNEL);
+			      ref_flags | PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out;
 
@@ -2166,8 +2169,10 @@ static void cgroup_kill_sb(struct super_block *sb)
 	if (!list_empty(&root->cgrp.self.children) ||
 	    root == &cgrp_dfl_root)
 		cgroup_put(&root->cgrp);
-	else
+	else {
+		css_refcount(&root->cgrp.self, "KIL");
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }
@@ -4891,7 +4896,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 
 	init_and_link_css(css, ss, cgrp);
 
-	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
+	err = percpu_ref_init(&css->refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (err)
 		goto err_free_css;
 
@@ -4946,7 +4951,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
+	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out_free_cgrp;
 
@@ -5194,6 +5199,7 @@ static void kill_css(struct cgroup_subsys_state *css)
 	 * Use percpu_ref_kill_and_confirm() to get notifications as each
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
+	css_refcount(css, "KLC");
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
 }
 
@@ -5278,6 +5284,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	cgroup1_check_for_release(parent);
 
 	/* put the base reference */
+	css_refcount(&cgrp->self, "KIL");
 	percpu_ref_kill(&cgrp->self.refcnt);
 
 	return 0;
@@ -6106,3 +6113,11 @@ static int __init cgroup_sysfs_init(void)
 }
 subsys_initcall(cgroup_sysfs_init);
 #endif /* CONFIG_SYSFS */
+
+noinline void css_refcount(struct cgroup_subsys_state *css, const char *op)
+{
+	if (css->debug)
+		printk("*** %p %s %2ld: %pSR\n",
+		       css, op, atomic_long_read(&css->refcnt.count),
+		       __builtin_return_address(0));
+}

David Howells Jan. 3, 2019, 3:44 p.m. UTC | #11

David Howells <dhowells@redhat.com> wrote:

> I'm not sure the refcount is correct after the second block (at the end of the
> second mount).  Should it have one ref per active sb?

No, it shouldn't because there's only one sb.

David

[vfs/for-next,v2] cgroup: fix top cgroup refcnt leak

Commit Message

Comments

Patch