diff mbox series

[V2] cgroup: fix memory leak caused by missing cgroup_bpf_offline

Message ID 20211018075623.26884-1-quanyang.wang@windriver.com (mailing list archive)
State Accepted
Delegated to: BPF
Headers show
Series [V2] cgroup: fix memory leak caused by missing cgroup_bpf_offline | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf success VM_Test
bpf/vmtest-bpf-next success VM_Test

Commit Message

Quanyang Wang Oct. 18, 2021, 7:56 a.m. UTC
From: Quanyang Wang <quanyang.wang@windriver.com>

When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
the command as below:

    $mount -t cgroup -o none,name=foo cgroup cgroup/
    $umount cgroup/

unreferenced object 0xc3585c40 (size 64):
  comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
  hex dump (first 32 bytes):
    01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
    00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
  backtrace:
    [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
    [<1f03679c>] cgroup_setup_root+0x174/0x37c
    [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
    [<f85b12fd>] vfs_get_tree+0x24/0x108
    [<f55aec5c>] path_mount+0x384/0x988
    [<e2d5e9cd>] do_mount+0x64/0x9c
    [<208c9cfe>] sys_mount+0xfc/0x1f4
    [<06dd06e0>] ret_fast_syscall+0x0/0x48
    [<a8308cb3>] 0xbeb4daa8

This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
is called by cgroup_setup_root when mounting, but not freed along with
root_cgrp when umounting. Adding cgroup_bpf_offline which calls
percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
umount path.

This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
cleanup that frees the resources which are allocated by cgroup_bpf_inherit
in cgroup_setup_root. 

And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
cgroup_put is at the end of cgroup_bpf_release which is called by
cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
cgroup's refcount.

Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
---
V1 ---> V2:
1. As per Daniel's suggestion, add description to commit msg about the
balance of cgroup's refcount in cgroup_bpf_offline.
2. As per Michal's suggestion, add tag "Fixes: 4bfc0bb2c60e" and add
description about it.
3. Fix indentation on the percpu_ref_is_dying line.
---
 kernel/cgroup/cgroup.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Ming Lei Oct. 18, 2021, 9:02 a.m. UTC | #1
On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
> From: Quanyang Wang <quanyang.wang@windriver.com>
> 
> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
> the command as below:
> 
>     $mount -t cgroup -o none,name=foo cgroup cgroup/
>     $umount cgroup/
> 
> unreferenced object 0xc3585c40 (size 64):
>   comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>   hex dump (first 32 bytes):
>     01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>     00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>   backtrace:
>     [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>     [<1f03679c>] cgroup_setup_root+0x174/0x37c
>     [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>     [<f85b12fd>] vfs_get_tree+0x24/0x108
>     [<f55aec5c>] path_mount+0x384/0x988
>     [<e2d5e9cd>] do_mount+0x64/0x9c
>     [<208c9cfe>] sys_mount+0xfc/0x1f4
>     [<06dd06e0>] ret_fast_syscall+0x0/0x48
>     [<a8308cb3>] 0xbeb4daa8
> 
> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
> is called by cgroup_setup_root when mounting, but not freed along with
> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
> umount path.
> 
> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
> in cgroup_setup_root. 
> 
> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
> cgroup_put is at the end of cgroup_bpf_release which is called by
> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
> cgroup's refcount.
> 
> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")

If I understand correctly, cgroup_bpf_release() won't be called without
your patch. So anything allocated in cgroup_bpf_inherit() will be
leaked?

If that is true, 'Fixes: 2b0d3d3e4fcf' looks misleading, cause people has to
backport your patch if 4bfc0bb2c60e is applied. Meantime, this fix isn't
needed if 4bfc0bb2c60e isn't merged.


Thanks,
Ming
Quanyang Wang Oct. 18, 2021, 10:06 a.m. UTC | #2
Hi Ming,

On 10/18/21 5:02 PM, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
>> From: Quanyang Wang <quanyang.wang@windriver.com>
>>
>> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
>> the command as below:
>>
>>      $mount -t cgroup -o none,name=foo cgroup cgroup/
>>      $umount cgroup/
>>
>> unreferenced object 0xc3585c40 (size 64):
>>    comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>>    hex dump (first 32 bytes):
>>      01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>>      00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>>    backtrace:
>>      [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>>      [<1f03679c>] cgroup_setup_root+0x174/0x37c
>>      [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>>      [<f85b12fd>] vfs_get_tree+0x24/0x108
>>      [<f55aec5c>] path_mount+0x384/0x988
>>      [<e2d5e9cd>] do_mount+0x64/0x9c
>>      [<208c9cfe>] sys_mount+0xfc/0x1f4
>>      [<06dd06e0>] ret_fast_syscall+0x0/0x48
>>      [<a8308cb3>] 0xbeb4daa8
>>
>> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
>> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
>> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
>> is called by cgroup_setup_root when mounting, but not freed along with
>> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
>> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
>> umount path.
>>
>> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
>> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
>> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
>> in cgroup_setup_root.
>>
>> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
>> cgroup_put is at the end of cgroup_bpf_release which is called by
>> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
>> cgroup's refcount.
>>
>> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
> 
> If I understand correctly, cgroup_bpf_release() won't be called without
> your patch. So anything allocated in cgroup_bpf_inherit() will be
> leaked?
No, for now cgroup_bpf_release is called to release bpf.refcnt.data of 
the cgroup which is not root_cgroup. Only root_cgroup's bpf data is leaked.

For non-root cgroup:
cgroup_mkdir
-> cgroup_create
-->cgroup_bpf_inherit(cgrp_A)
cgroup_rmdir
->cgroup_destroy_locked()
-->cgroup_bpf_offline(cgrp_A)
So for non-root cgroup, there is no memory leak.


For root cgroup:
cgroup_setup_root
->cgroup_bpf_inherit(root_cgrp)
cgroup_kill_sb:
-> (Here should be call cgroup_bpf_offline, or else leak occurs)

Thanks,
Quanyang


> 
> If that is true, 'Fixes: 2b0d3d3e4fcf' looks misleading, cause people has to
> backport your patch if 4bfc0bb2c60e is applied. Meantime, this fix isn't
> needed if 4bfc0bb2c60e isn't merged.
> 
> 
> Thanks,
> Ming
>
Ming Lei Oct. 18, 2021, 12:59 p.m. UTC | #3
On Mon, Oct 18, 2021 at 06:06:28PM +0800, Quanyang Wang wrote:
> Hi Ming,
> 
> On 10/18/21 5:02 PM, Ming Lei wrote:
> > On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
> > > From: Quanyang Wang <quanyang.wang@windriver.com>
> > > 
> > > When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
> > > the command as below:
> > > 
> > >      $mount -t cgroup -o none,name=foo cgroup cgroup/
> > >      $umount cgroup/
> > > 
> > > unreferenced object 0xc3585c40 (size 64):
> > >    comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
> > >    hex dump (first 32 bytes):
> > >      01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
> > >      00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
> > >    backtrace:
> > >      [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
> > >      [<1f03679c>] cgroup_setup_root+0x174/0x37c
> > >      [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
> > >      [<f85b12fd>] vfs_get_tree+0x24/0x108
> > >      [<f55aec5c>] path_mount+0x384/0x988
> > >      [<e2d5e9cd>] do_mount+0x64/0x9c
> > >      [<208c9cfe>] sys_mount+0xfc/0x1f4
> > >      [<06dd06e0>] ret_fast_syscall+0x0/0x48
> > >      [<a8308cb3>] 0xbeb4daa8
> > > 
> > > This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
> > > memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
> > > is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
> > > is called by cgroup_setup_root when mounting, but not freed along with
> > > root_cgrp when umounting. Adding cgroup_bpf_offline which calls
> > > percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
> > > umount path.
> > > 
> > > This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
> > > of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
> > > cleanup that frees the resources which are allocated by cgroup_bpf_inherit
> > > in cgroup_setup_root.
> > > 
> > > And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
> > > cgroup_put is at the end of cgroup_bpf_release which is called by
> > > cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
> > > cgroup's refcount.
> > > 
> > > Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
> > 
> > If I understand correctly, cgroup_bpf_release() won't be called without
> > your patch. So anything allocated in cgroup_bpf_inherit() will be
> > leaked?
> No, for now cgroup_bpf_release is called to release bpf.refcnt.data of the
> cgroup which is not root_cgroup. Only root_cgroup's bpf data is leaked.

You mean that cgroup_bpf_inherit() allocates nothing for root_cgroup?

If yes, I agree you can add 'Fixes: 2b0d3d3e4fcf', otherwise please
remove it.


Thanks, 
Ming
Quanyang Wang Oct. 19, 2021, 10:41 a.m. UTC | #4
Hi Ming,

On 10/18/21 8:59 PM, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 06:06:28PM +0800, Quanyang Wang wrote:
>> Hi Ming,
>>
>> On 10/18/21 5:02 PM, Ming Lei wrote:
>>> On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
>>>> From: Quanyang Wang <quanyang.wang@windriver.com>
>>>>
>>>> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
>>>> the command as below:
>>>>
>>>>       $mount -t cgroup -o none,name=foo cgroup cgroup/
>>>>       $umount cgroup/
>>>>
>>>> unreferenced object 0xc3585c40 (size 64):
>>>>     comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>>>>     hex dump (first 32 bytes):
>>>>       01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>>>>       00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>>>>     backtrace:
>>>>       [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>>>>       [<1f03679c>] cgroup_setup_root+0x174/0x37c
>>>>       [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>>>>       [<f85b12fd>] vfs_get_tree+0x24/0x108
>>>>       [<f55aec5c>] path_mount+0x384/0x988
>>>>       [<e2d5e9cd>] do_mount+0x64/0x9c
>>>>       [<208c9cfe>] sys_mount+0xfc/0x1f4
>>>>       [<06dd06e0>] ret_fast_syscall+0x0/0x48
>>>>       [<a8308cb3>] 0xbeb4daa8
>>>>
>>>> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
>>>> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
>>>> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
>>>> is called by cgroup_setup_root when mounting, but not freed along with
>>>> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
>>>> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
>>>> umount path.
>>>>
>>>> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
>>>> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
>>>> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
>>>> in cgroup_setup_root.
>>>>
>>>> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
>>>> cgroup_put is at the end of cgroup_bpf_release which is called by
>>>> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
>>>> cgroup's refcount.
>>>>
>>>> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
>>>
>>> If I understand correctly, cgroup_bpf_release() won't be called without
>>> your patch. So anything allocated in cgroup_bpf_inherit() will be
>>> leaked?
>> No, for now cgroup_bpf_release is called to release bpf.refcnt.data of the
>> cgroup which is not root_cgroup. Only root_cgroup's bpf data is leaked.
> 
> You mean that cgroup_bpf_inherit() allocates nothing for root_cgroup?
Yes, cgroup_bpf_inherit allocates something for root_cgroup.

The earlier commit 4bfc0bb2c60e ("bpf: decouple the lifetime of 
cgroup_bpf from cgroup itself") introduces an imbalance that call 
cgroup_bpf_inherit(&root_cgroup) but not call 
cgroup_bpf_offline(&root_cgroup). But there was no memory leak here.

When the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of 
percpu_ref in fast path") applies, some data is allocated for 
root_cgroup and not released with root_cgroup, and memory leak is observed.

So I add 2 "Fixes tags" here to indicate that 2 commits introduce two 
different issues.

But it seems that 2 "Fixes tags" is misleading now.
So maybe just fix earlier commit 4bfc0bb2c60e which introduces imbalance?

Thanks,
Quanyang
> 
> If yes, I agree you can add 'Fixes: 2b0d3d3e4fcf', otherwise please
> remove it.
> 
> 
> Thanks,
> Ming
>
Quanyang Wang Oct. 19, 2021, 10:41 a.m. UTC | #5
Hi Ming,

On 10/18/21 8:59 PM, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 06:06:28PM +0800, Quanyang Wang wrote:
>> Hi Ming,
>>
>> On 10/18/21 5:02 PM, Ming Lei wrote:
>>> On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
>>>> From: Quanyang Wang <quanyang.wang@windriver.com>
>>>>
>>>> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
>>>> the command as below:
>>>>
>>>>       $mount -t cgroup -o none,name=foo cgroup cgroup/
>>>>       $umount cgroup/
>>>>
>>>> unreferenced object 0xc3585c40 (size 64):
>>>>     comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>>>>     hex dump (first 32 bytes):
>>>>       01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>>>>       00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>>>>     backtrace:
>>>>       [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>>>>       [<1f03679c>] cgroup_setup_root+0x174/0x37c
>>>>       [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>>>>       [<f85b12fd>] vfs_get_tree+0x24/0x108
>>>>       [<f55aec5c>] path_mount+0x384/0x988
>>>>       [<e2d5e9cd>] do_mount+0x64/0x9c
>>>>       [<208c9cfe>] sys_mount+0xfc/0x1f4
>>>>       [<06dd06e0>] ret_fast_syscall+0x0/0x48
>>>>       [<a8308cb3>] 0xbeb4daa8
>>>>
>>>> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
>>>> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
>>>> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
>>>> is called by cgroup_setup_root when mounting, but not freed along with
>>>> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
>>>> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
>>>> umount path.
>>>>
>>>> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
>>>> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
>>>> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
>>>> in cgroup_setup_root.
>>>>
>>>> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
>>>> cgroup_put is at the end of cgroup_bpf_release which is called by
>>>> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
>>>> cgroup's refcount.
>>>>
>>>> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
>>>
>>> If I understand correctly, cgroup_bpf_release() won't be called without
>>> your patch. So anything allocated in cgroup_bpf_inherit() will be
>>> leaked?
>> No, for now cgroup_bpf_release is called to release bpf.refcnt.data of the
>> cgroup which is not root_cgroup. Only root_cgroup's bpf data is leaked.
> 
> You mean that cgroup_bpf_inherit() allocates nothing for root_cgroup?
> 
> If yes, I agree you can add 'Fixes: 2b0d3d3e4fcf', otherwise please
> remove it.
> 
> 
> Thanks,
> Ming
>
Michal Koutný Oct. 19, 2021, 5:10 p.m. UTC | #6
Hi.

On Tue, Oct 19, 2021 at 06:41:14PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
> So I add 2 "Fixes tags" here to indicate that 2 commits introduce two
> different issues.

AFAIU, both the changes are needed to cause the leak, a single patch
alone won't cause the issue. Is that correct? (Perhaps not as I realize,
see below.)

But on second thought, the problem is the missing percpu_ref_exit() in
the (root) cgroup release path and percpu counter would allocate the
percpu_count_ptr anyway, so 4bfc0bb2c60e is only making the leak more
visible. Is this correct?

I agree the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of
percpu_ref in fast path") alone did nothing wrong.

[On a related (but independent) note, there seems to be an optimization
opportunity in not dealing with cgroup_bpf at all on the non-default
hierarchies.]

Regards,
Michal
Ming Lei Oct. 20, 2021, 2:17 a.m. UTC | #7
On Tue, Oct 19, 2021 at 07:10:26PM +0200, Michal Koutný wrote:
> Hi.
> 
> On Tue, Oct 19, 2021 at 06:41:14PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
> > So I add 2 "Fixes tags" here to indicate that 2 commits introduce two
> > different issues.
> 
> AFAIU, both the changes are needed to cause the leak, a single patch
> alone won't cause the issue. Is that correct? (Perhaps not as I realize,
> see below.)
> 
> But on second thought, the problem is the missing percpu_ref_exit() in
> the (root) cgroup release path and percpu counter would allocate the
> percpu_count_ptr anyway, so 4bfc0bb2c60e is only making the leak more
> visible. Is this correct?
> 
> I agree the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of
> percpu_ref in fast path") alone did nothing wrong.

If only precpu_ref data is leaked, it is fine to add "Fixes: 2b0d3d3e4fcf",
I thought cgroup_bpf_release() needs to release more for root cgroup, but
looks not true.


Thanks,
Ming
Quanyang Wang Oct. 20, 2021, 5:16 a.m. UTC | #8
Hi Michal,

On 10/20/21 1:10 AM, Michal Koutný wrote:
> Hi.
> 
> On Tue, Oct 19, 2021 at 06:41:14PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
>> So I add 2 "Fixes tags" here to indicate that 2 commits introduce two
>> different issues.
> 
> AFAIU, both the changes are needed to cause the leak, a single patch
> alone won't cause the issue. Is that correct? (Perhaps not as I realize,
> see below.)
Yes, I back to the earlier commit 4bfc0bb2c60e and no memory leak is 
observed.
> 
> But on second thought, the problem is the missing percpu_ref_exit() in
> the (root) cgroup release path and percpu counter would allocate the
> percpu_count_ptr anyway, so 4bfc0bb2c60e is only making the leak more
> visible. Is this correct?
No, the earlier commit 4bfc0bb2c60e introduces a imbalance and the later
commit 2b0d3d3e4fcf introduces a visible leak.

Thanks,
Quanyang
> 
> I agree the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of
> percpu_ref in fast path") alone did nothing wrong.
> 
> [On a related (but independent) note, there seems to be an optimization
> opportunity in not dealing with cgroup_bpf at all on the non-default
> hierarchies.]
> 
> Regards,
> Michal
>
Quanyang Wang Oct. 20, 2021, 5:22 a.m. UTC | #9
Hi Ming,

On 10/20/21 10:17 AM, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 07:10:26PM +0200, Michal Koutný wrote:
>> Hi.
>>
>> On Tue, Oct 19, 2021 at 06:41:14PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
>>> So I add 2 "Fixes tags" here to indicate that 2 commits introduce two
>>> different issues.
>>
>> AFAIU, both the changes are needed to cause the leak, a single patch
>> alone won't cause the issue. Is that correct? (Perhaps not as I realize,
>> see below.)
>>
>> But on second thought, the problem is the missing percpu_ref_exit() in
>> the (root) cgroup release path and percpu counter would allocate the
>> percpu_count_ptr anyway, so 4bfc0bb2c60e is only making the leak more
>> visible. Is this correct?
>>
>> I agree the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of
>> percpu_ref in fast path") alone did nothing wrong.
> 
> If only precpu_ref data is leaked, it is fine to add "Fixes: 2b0d3d3e4fcf",
> I thought cgroup_bpf_release() needs to release more for root cgroup, but
> looks not true.
For now, I can only observe that precpu_ref data is leaked when running 
ltp testsuite.
Thanks,
Quanyang
> 
> 
> Thanks,
> Ming
>
Michal Koutný Oct. 20, 2021, 5:28 p.m. UTC | #10
On Wed, Oct 20, 2021 at 01:22:06PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
> > If only precpu_ref data is leaked, it is fine to add "Fixes: 2b0d3d3e4fcf",
> > I thought cgroup_bpf_release() needs to release more for root cgroup, but
> > looks not true.
> For now, I can only observe that precpu_ref data is leaked when running ltp
> testsuite.

I assume you refer to ref->data. I considered the ref->percpu_count_ptr
allocated with __alloc_percpu_gfp(). Could it be that kmemleak won't
detect leaked percpu allocations?

(The patch you sent resolves this as well, I'm just curious.)

Michal
Roman Gushchin Oct. 22, 2021, 1:30 a.m. UTC | #11
On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
> From: Quanyang Wang <quanyang.wang@windriver.com>
> 
> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
> the command as below:
> 
>     $mount -t cgroup -o none,name=foo cgroup cgroup/
>     $umount cgroup/
> 
> unreferenced object 0xc3585c40 (size 64):
>   comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>   hex dump (first 32 bytes):
>     01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>     00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>   backtrace:
>     [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>     [<1f03679c>] cgroup_setup_root+0x174/0x37c
>     [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>     [<f85b12fd>] vfs_get_tree+0x24/0x108
>     [<f55aec5c>] path_mount+0x384/0x988
>     [<e2d5e9cd>] do_mount+0x64/0x9c
>     [<208c9cfe>] sys_mount+0xfc/0x1f4
>     [<06dd06e0>] ret_fast_syscall+0x0/0x48
>     [<a8308cb3>] 0xbeb4daa8
> 
> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
> is called by cgroup_setup_root when mounting, but not freed along with
> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
> umount path.
> 
> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
> in cgroup_setup_root. 
> 
> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
> cgroup_put is at the end of cgroup_bpf_release which is called by
> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
> cgroup's refcount.
> 
> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
> ---
> V1 ---> V2:
> 1. As per Daniel's suggestion, add description to commit msg about the
> balance of cgroup's refcount in cgroup_bpf_offline.
> 2. As per Michal's suggestion, add tag "Fixes: 4bfc0bb2c60e" and add
> description about it.
> 3. Fix indentation on the percpu_ref_is_dying line.

Acked-by: Roman Gushchin <guro@fb.com>

The fix looks correct, two fixes tag are fine too, if only it won't
confuse scripts picking up patches for stable backports.

In fact, it's a very cold path, which is arguably never hit in the real
life. On cgroup v2 it's not an issue. I'm not sure we need a stable
backport at all, only if it creates a noise for some automation tests.

Quanyang, out of curiosity, how did you find it?

Anyway, thanks for catching and fixing it!

Roman
Quanyang Wang Oct. 22, 2021, 11:03 a.m. UTC | #12
Hi Michal,

On 10/21/21 1:28 AM, Michal Koutný wrote:
> On Wed, Oct 20, 2021 at 01:22:06PM +0800, Quanyang Wang <quanyang.wang@windriver.com> wrote:
>>> If only precpu_ref data is leaked, it is fine to add "Fixes: 2b0d3d3e4fcf",
>>> I thought cgroup_bpf_release() needs to release more for root cgroup, but
>>> looks not true.
>> For now, I can only observe that precpu_ref data is leaked when running ltp
>> testsuite.
> 
> I assume you refer to ref->data. I considered the ref->percpu_count_ptr
> allocated with __alloc_percpu_gfp(). Could it be that kmemleak won't
> detect leaked percpu allocations?
Yes, kmemleak can't detect percpu allocations. I find some message about 
this:

commit f528f0b8e53d
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Mon Sep 26 17:12:53 2011 +0100

kmemleak: Handle percpu memory allocation

This patch adds kmemleak callbacks from the percpu allocator, reducing a
number of false positives caused by kmemleak not scanning such memory
blocks. The percpu chunks are never reported as leaks because of current
kmemleak limitations with the __percpu pointer not pointing directly to
the actual chunks.

Thanks,
Quanyang
> 
> (The patch you sent resolves this as well, I'm just curious.)
> 
> Michal
>
Quanyang Wang Oct. 22, 2021, 11:55 a.m. UTC | #13
Hi Roman,

On 10/22/21 9:30 AM, Roman Gushchin wrote:
> On Mon, Oct 18, 2021 at 03:56:23PM +0800, quanyang.wang@windriver.com wrote:
>> From: Quanyang Wang <quanyang.wang@windriver.com>
>>
>> When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
>> the command as below:
>>
>>      $mount -t cgroup -o none,name=foo cgroup cgroup/
>>      $umount cgroup/
>>
>> unreferenced object 0xc3585c40 (size 64):
>>    comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
>>    hex dump (first 32 bytes):
>>      01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00  ......(.........
>>      00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00  ........lC......
>>    backtrace:
>>      [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
>>      [<1f03679c>] cgroup_setup_root+0x174/0x37c
>>      [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
>>      [<f85b12fd>] vfs_get_tree+0x24/0x108
>>      [<f55aec5c>] path_mount+0x384/0x988
>>      [<e2d5e9cd>] do_mount+0x64/0x9c
>>      [<208c9cfe>] sys_mount+0xfc/0x1f4
>>      [<06dd06e0>] ret_fast_syscall+0x0/0x48
>>      [<a8308cb3>] 0xbeb4daa8
>>
>> This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
>> memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
>> is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
>> is called by cgroup_setup_root when mounting, but not freed along with
>> root_cgrp when umounting. Adding cgroup_bpf_offline which calls
>> percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
>> umount path.
>>
>> This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
>> of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
>> cleanup that frees the resources which are allocated by cgroup_bpf_inherit
>> in cgroup_setup_root.
>>
>> And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
>> cgroup_put is at the end of cgroup_bpf_release which is called by
>> cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
>> cgroup's refcount.
>>
>> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
>> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
>> Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
>> ---
>> V1 ---> V2:
>> 1. As per Daniel's suggestion, add description to commit msg about the
>> balance of cgroup's refcount in cgroup_bpf_offline.
>> 2. As per Michal's suggestion, add tag "Fixes: 4bfc0bb2c60e" and add
>> description about it.
>> 3. Fix indentation on the percpu_ref_is_dying line.
> 
> Acked-by: Roman Gushchin <guro@fb.com>
> 
> The fix looks correct, two fixes tag are fine too, if only it won't
> confuse scripts picking up patches for stable backports.
> 
> In fact, it's a very cold path, which is arguably never hit in the real
> life. On cgroup v2 it's not an issue. I'm not sure we need a stable
> backport at all, only if it creates a noise for some automation tests.
> 
> Quanyang, out of curiosity, how did you find it?
I ran ltp testsuite to find this.

./runltp -f controllers -s cgroup

Thanks,
Quanyang
> 
> Anyway, thanks for catching and fixing it!
> 
> Roman
>
Alexei Starovoitov Oct. 22, 2021, 9:29 p.m. UTC | #14
On Fri, Oct 22, 2021 at 4:56 AM Quanyang Wang
<quanyang.wang@windriver.com> wrote:
> >> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
> >> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
> >> Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
> >> ---
> >> V1 ---> V2:
> >> 1. As per Daniel's suggestion, add description to commit msg about the
> >> balance of cgroup's refcount in cgroup_bpf_offline.
> >> 2. As per Michal's suggestion, add tag "Fixes: 4bfc0bb2c60e" and add
> >> description about it.
> >> 3. Fix indentation on the percpu_ref_is_dying line.
> >
> > Acked-by: Roman Gushchin <guro@fb.com>
> >
> > The fix looks correct, two fixes tag are fine too, if only it won't
> > confuse scripts picking up patches for stable backports.
> >
> > In fact, it's a very cold path, which is arguably never hit in the real
> > life. On cgroup v2 it's not an issue. I'm not sure we need a stable
> > backport at all, only if it creates a noise for some automation tests.
> >
> > Quanyang, out of curiosity, how did you find it?
> I ran ltp testsuite to find this.
>
> ./runltp -f controllers -s cgroup
>
> Thanks,
> Quanyang
> >
> > Anyway, thanks for catching and fixing it!

Applied to bpf tree. Thanks everyone!
diff mbox series

Patch

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 570b0c97392a..ea08f01d0111 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2187,8 +2187,10 @@  static void cgroup_kill_sb(struct super_block *sb)
 	 * And don't kill the default root.
 	 */
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
-	    !percpu_ref_is_dying(&root->cgrp.self.refcnt))
+	    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+		cgroup_bpf_offline(&root->cgrp);
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 	cgroup_put(&root->cgrp);
 	kernfs_kill_sb(sb);
 }