
[RFC] mm: memcg/slab: Stop reparented obj_cgroups from charging root

Message ID 20201014190749.24607-1-rpalethorpe@suse.com (mailing list archive)
State New, archived

Commit Message

Richard Palethorpe Oct. 14, 2020, 7:07 p.m. UTC
SLAB objects which outlive their memcg are moved to their parent
memcg where they may be uncharged. However, if they are moved to the
root memcg, uncharging will result in negative page counter values as
root has no page counters.

To prevent this, we check whether we are about to uncharge the root
memcg and skip it if we are. Possibly the obj_cgroups should instead
be removed from their slabs and any per-CPU stocks, rather than being
reparented to root?

The warning can be reproduced, unreliably, with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
is present. Although the commit listed in 'Fixes' appears to introduce
the bug, I cannot reproduce it with just that commit, and bisecting
runs into other bugs.
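
As a minimal sketch of the scenario described above (a simplified
userspace model, not kernel code; charge(), uncharge() and the struct
layout are illustrative stand-ins for mm/page_counter.c):

#include <stdio.h>

struct page_counter {
	long usage;
	struct page_counter *parent;
};

static void charge(struct page_counter *c, long nr)
{
	for (; c; c = c->parent)
		c->usage += nr;
}

static void uncharge(struct page_counter *c, long nr)
{
	for (; c; c = c->parent) {
		c->usage -= nr;
		if (c->usage < 0)
			printf("underflow: usage = %ld\n", c->usage);
	}
}

int main(void)
{
	/* use_hierarchy == 0: the child counter has no parent link, so
	 * charges made in the child never reach the root's counters. */
	struct page_counter root  = { 0, NULL };
	struct page_counter child = { 0, NULL };

	charge(&child, 8);	/* slab objects charged to the child memcg */
	/* the child memcg is deleted, its objcgs are reparented to root */
	uncharge(&root, 8);	/* a later drain uncharges root: goes negative */
	return 0;
}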

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

Reported-By: ltp@lists.linux.it
Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
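
The patch body itself is not reproduced on this archive page, only the
diffstat above. Purely as a sketch of the approach the commit message
describes (skip the uncharge when the target memcg is the root), and
not the actual diff, a toy userspace model could look like this. The
names uncharge_pages and is_root are made-up stand-ins:

#include <stdbool.h>
#include <stdio.h>

struct page_counter { long usage; };
struct mem_cgroup  { struct page_counter memory; bool is_root; };

static void uncharge_pages(struct mem_cgroup *memcg, long nr)
{
	/* Skip the root memcg: with use_hierarchy == 0 its counters never
	 * saw the original charge, so uncharging it would underflow. */
	if (memcg->is_root)
		return;
	memcg->memory.usage -= nr;
}

int main(void)
{
	struct mem_cgroup root = { { 0 }, true };

	uncharge_pages(&root, 8);	/* no-op instead of an underflow */
	printf("root usage: %ld\n", root.memory.usage);
	return 0;
}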

Comments

Roman Gushchin Oct. 14, 2020, 8:08 p.m. UTC | #1
Hi Richard!

> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
> 
> To prevent this, we check whether we are about to uncharge the root
> memcg and skip it if we are. Possibly instead; the obj_cgroups should
> be removed from their slabs and any per cpu stocks instead of
> reparenting them to root?

It would be really complex. I think your fix is totally fine.
We have similar checks in cancel_charge(), uncharge_batch(),
mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.

> 
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
> 
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> ...
> Reported-By: ltp@lists.linux.it
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!
Richard Palethorpe Oct. 16, 2020, 5:40 a.m. UTC | #2
Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> Hi Richard!
>
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
>> 
>> To prevent this, we check whether we are about to uncharge the root
>> memcg and skip it if we are. Possibly instead; the obj_cgroups should
>> be removed from their slabs and any per cpu stocks instead of
>> reparenting them to root?
>
> It would be really complex. I think your fix is totally fine.
> We have similar checks in cancel_charge(), uncharge_batch(),
> mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.
>
>
> Acked-by: Roman Gushchin <guro@fb.com>
>
> Thanks!

Great, I will respin.
Michal Koutný Oct. 16, 2020, 9:47 a.m. UTC | #3
Hello.

On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up the
hierarchy, including root memcg, so if they're later uncharged in root
after reparenting, it should still break even. (Or did I miss some stock
imbalance?)

(But the patch seems justifiable to me as objects (not)charged directly to
root memcg may be incorrectly uncharged.)

Thanks,
Michal
Richard Palethorpe Oct. 16, 2020, 10:41 a.m. UTC | #4
Hello Michal,

Michal Koutný <mkoutny@suse.com> writes:

> Hello.
>
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

I traced it and can see they are reparented objects and that the root
group's counters are zero (or negative if I run madvise06 multiple times)
before a drain takes place. I'm guessing this is because the root group
has 'use_hierarchy' set to false, so that the children's page_counter
parents are set to NULL. However, I will check, because I'm not sure
about either.

>
> (But the patch seems justifiable to me as objects (not)charged directly to
> root memcg may be incorrectly uncharged.)
>
> Thanks,
> Michal
Johannes Weiner Oct. 16, 2020, 2:53 p.m. UTC | #5
On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> Hello.
> 
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> > SLAB objects which outlive their memcg are moved to their parent
> > memcg where they may be uncharged. However if they are moved to the
> > root memcg, uncharging will result in negative page counter values as
> > root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

Looking a bit closer at this code, it's kind of a mess right now.

The central try_charge() function charges recursively all the way up
to and including the root. But not if it's called directly on the
root, in which case it bails and does nothing.

kmem and objcg use try_charge(), so they have the same
behavior. get_obj_cgroup_from_current() does it's own redundant
filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
does not, but its callsite __memcg_kmem_charge_page() does.
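
A compressed userspace mock of the root handling described in the last
two paragraphs (a paraphrase for illustration only, not the real
mm/memcontrol.c code; the struct layout and the loop are simplified):

#include <stdio.h>

struct mem_cgroup {
	struct mem_cgroup *parent;
	long pages;
};

static struct mem_cgroup root = { NULL, 0 };

static int try_charge(struct mem_cgroup *memcg, long nr)
{
	/* Called directly on the root: bail out and do nothing. */
	if (memcg == &root)
		return 0;
	/* Otherwise charge every ancestor, up to and including the root. */
	for (; memcg; memcg = memcg->parent)
		memcg->pages += nr;
	return 0;
}

int main(void)
{
	struct mem_cgroup child = { &root, 0 };

	try_charge(&child, 4);	/* root.pages becomes 4 */
	try_charge(&root, 4);	/* no-op: root.pages stays 4 */
	printf("root usage: %ld pages\n", root.pages);
	return 0;
}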

We should clean this up one way or another: either charge the root or
don't, but do it consistently.

Since we export memory.stat at the root now, we should probably just
always charge the root instead of special-casing it all over the place
and risking bugs.

Indeed, it looks like there is at least one bug where the root-level
memory.stat shows non-root slab objects, but not root ones, whereas it
shows all anon and cache pages, root or no root.
Richard Palethorpe Oct. 16, 2020, 3:05 p.m. UTC | #6
Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello Michal,
>
> Michal Koutný <mkoutny@suse.com> writes:
>
>> Hello.
>>
>> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
>>> SLAB objects which outlive their memcg are moved to their parent
>>> memcg where they may be uncharged. However if they are moved to the
>>> root memcg, uncharging will result in negative page counter values as
>>> root has no page counters.
>> Why do you think those are reparented objects? If those are originally
>> charged in a non-root cgroup, then the charge value should be propagated up the
>> hierarchy, including root memcg, so if they're later uncharged in root
>> after reparenting, it should still break even. (Or did I miss some stock
>> imbalance?)
>
> I traced it and can see they are reparented objects and that the root
> groups counters are zero (or negative if I run madvise06 multiple times)
> before a drain takes place. I'm guessing this is because the root group
> has 'use_hierachy' set to false so that the childs page_counter parents
> are set to NULL. However I will check, because I'm not sure about
> either.

Yes, it appears that use_hierarchy=0, which is probably because the test
mounts cgroup v1, creates a child group within that, and does not set
use_hierarchy on the root. On v2 the root always has use_hierarchy enabled.

>
>>
>> (But the patch seems justifiable to me as objects (not)charged directly to
>> root memcg may be incorrectly uncharged.)
>>
>> Thanks,
>> Michal

I don't know if that could happen without reparenting. I suppose if
use_hierarchy=1 then this patch will actually result in root being
overcharged, so perhaps it should also check for use_hierarchy?
Roman Gushchin Oct. 16, 2020, 5:02 p.m. UTC | #7
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> > Hello.
> > 
> > On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> > > SLAB objects which outlive their memcg are moved to their parent
> > > memcg where they may be uncharged. However if they are moved to the
> > > root memcg, uncharging will result in negative page counter values as
> > > root has no page counters.
> > Why do you think those are reparented objects? If those are originally
> > charged in a non-root cgroup, then the charge value should be propagated up the
> > hierarchy, including root memcg, so if they're later uncharged in root
> > after reparenting, it should still break even. (Or did I miss some stock
> > imbalance?)
> 
> Looking a bit closer at this code, it's kind of a mess right now.
> 
> The central try_charge() function charges recursively all the way up
> to and including the root. But not if it's called directly on the
> root, in which case it bails and does nothing.
> 
> kmem and objcg use try_charge(), so they have the same
> behavior. get_obj_cgroup_from_current() does it's own redundant
> filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
> does not, but its callsite __memcg_kmem_charge_page() does.
> 
> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.

+1

> 
> Since we export memory.stat at the root now, we should probably just
> always charge the root instead of special-casing it all over the place
> and risking bugs.

Hm, we export memory.stat but not memory.current. Charging the root memcg
seems to be an extra atomic operation, which can be avoided.

I wonder if we can handle it in page_counter.c, so there will be a single
place where we do the check.

> 
> Indeed, it looks like there is at least one bug where the root-level
> memory.stat shows non-root slab objects, but not root ones, whereas it
> shows all anon and cache pages, root or no root.

I'll take a look.
Michal Koutný Oct. 16, 2020, 5:15 p.m. UTC | #8
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hierarchy=0 (which is the case here, as Richard
wrote). The reparenting is hence somewhat incompatible with
new_parent.use_hierarchy=0 :-/

> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding root memcg from
charging is that users are spared from the charging overhead when memcg
tree is not created.  (Actually, I thought that was the reason for this
exception.)

Michal
Michal Koutný Oct. 16, 2020, 5:26 p.m. UTC | #9
On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe <rpalethorpe@suse.de> wrote:
> I'm don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was
originally (not)charged in the root memcg or it was only reparented to
it. (I originally considered only the genuine root objcgs.)

In this light, homogeneous charging to the root memcg looks much simpler,
but I wonder what other surprises it brings about.

Michal
Richard Palethorpe Oct. 19, 2020, 8:45 a.m. UTC | #10
Hello,

Michal Koutný <mkoutny@suse.com> writes:

> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> The central try_charge() function charges recursively all the way up
>> to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
>

Yes, and it also seems

new_parent.use_hierarchy=0 -> new_child.use_hierarchy=0

and

new_parent.use_hierarchy=0 -> new_child.use_hierarchy=1

are considered valid on cgroups v1. The kernel will also allow more
descendants under new_child.use_hierarchy=0, but sets
broken_hierarchy=1. However, this will not stop the stack trace occurring
(AFAICT) when the reparenting happens between two descendants.

>> We should clean this up one way or another: either charge the root or
>> don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)
>
> Michal
Roman Gushchin Oct. 19, 2020, 10:28 p.m. UTC | #11
On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > The central try_charge() function charges recursively all the way up
> > to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
> 
> > We should clean this up one way or another: either charge the root or
> > don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)

Yeah, I'm completely on the same page. Moving a process to the root memory
cgroup is currently a good way to estimate the memory cgroup overhead.

How about the patch below, which consistently avoids charging the root
memory cgroup? It seems like it doesn't add too many checks.

Thanks!

--

From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Mon, 19 Oct 2020 14:37:35 -0700
Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup

Currently the root memory cgroup is never charged directly, but
if an ancestor cgroup is charged, the charge is propagated up to the
root memory cgroup. The root memory cgroup doesn't show the charge
to a user, neither it does allow to set any limits/protections.
So the information about the current charge is completely useless.

Avoiding to charge the root memory cgroup allows to:
1) simplify the model and the code, so, hopefully, fewer bugs will
   be introduced in the future;
2) avoid unnecessary atomic operations, which are used to (un)charge
   corresponding root page counters.

In the default hierarchy case or if use_hiearchy == true, it's very
straightforward: when the page counters tree is traversed to the root,
the root page counter (the one with parent == NULL), should be
skipped. To avoid multiple identical checks over the page counters
code, for_each_nonroot_ancestor() macro is introduced.

To handle the use_hierarchy == false case without adding custom
checks, let's make page counters of all non-root memory cgroup
direct ascendants of the corresponding root memory cgroup's page
counters. In this case for_each_nonroot_ancestor() will work correctly
as well.

Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
However, it's not based on page counters (refer to mem_cgroup_usage()).

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c   | 21 ++++++++++++++++-----
 mm/page_counter.c | 21 ++++++++++++---------
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..34cac7522e74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		/* root memory cgroup */
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->tcpmem, NULL);
+	} else if (parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
-		page_counter_init(&memcg->memory, NULL);
-		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->kmem, NULL);
-		page_counter_init(&memcg->tcpmem, NULL);
+		/*
+		 * If use_hierarchy == false, consider all page counters direct
+		 * descendants of the corresponding root level counters.
+		 */
+		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
+
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b24a60b28bb0..8901b184b9d5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,9 @@
 #include <linux/bug.h>
 #include <asm/page.h>
 
+#define for_each_nonroot_ancestor(c, counter) \
+	for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
+
 static void propagate_protected_usage(struct page_counter *c,
 				      unsigned long usage)
 {
@@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
 	unsigned long low, min;
 	long delta;
 
-	if (!c->parent)
-		return;
-
 	min = READ_ONCE(c->min);
 	if (min || atomic_long_read(&c->min_usage)) {
 		protected = min(usage, min);
@@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent) {
+	for_each_nonroot_ancestor(c, counter) {
 		long new;
 
 		new = atomic_long_add_return(nr_pages, &c->usage);
@@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent) {
+	for_each_nonroot_ancestor(c, counter) {
 		long new;
 		/*
 		 * Charge speculatively to avoid an expensive CAS.  If
@@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
 	return true;
 
 failed:
-	for (c = counter; c != *fail; c = c->parent)
+	for_each_nonroot_ancestor(c, counter) {
+		if (c == *fail)
+			break;
 		page_counter_cancel(c, nr_pages);
+	}
 
 	return false;
 }
@@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		page_counter_cancel(c, nr_pages);
 }
 
@@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
 
 	WRITE_ONCE(counter->min, nr_pages);
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		propagate_protected_usage(c, atomic_long_read(&c->usage));
 }
 
@@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
 
 	WRITE_ONCE(counter->low, nr_pages);
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		propagate_protected_usage(c, atomic_long_read(&c->usage));
 }
Richard Palethorpe Oct. 20, 2020, 6:04 a.m. UTC | #12
Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > The central try_charge() function charges recursively all the way up
>> > to and including the root.
>> Except for use_hiearchy=0 (which is the case here as Richard
>> wrote). The reparenting is hence somewhat incompatible with
>> new_parent.use_hiearchy=0 :-/
>> 
>> > We should clean this up one way or another: either charge the root or
>> > don't, but do it consistently.
>> I agree this'd be good to unify. One upside of excluding root memcg from
>> charging is that users are spared from the charging overhead when memcg
>> tree is not created.  (Actually, I thought that was the reason for this
>> exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c   | 21 ++++++++++++++++-----
>  mm/page_counter.c | 21 ++++++++++++---------
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
>  	}
> -	if (parent && parent->use_hierarchy) {
> +	if (!parent) {
> +		/* root memory cgroup */
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->swap, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->tcpmem, NULL);
> +	} else if (parent->use_hierarchy) {
>  		memcg->use_hierarchy = true;
>  		page_counter_init(&memcg->memory, &parent->memory);
>  		page_counter_init(&memcg->swap, &parent->swap);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
>  		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>  	} else {
> -		page_counter_init(&memcg->memory, NULL);
> -		page_counter_init(&memcg->swap, NULL);
> -		page_counter_init(&memcg->kmem, NULL);
> -		page_counter_init(&memcg->tcpmem, NULL);
> +		/*
> +		 * If use_hierarchy == false, consider all page counters direct
> +		 * descendants of the corresponding root level counters.
> +		 */
> +		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> +		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> +		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> +		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this

Perhaps in this case, where the hierarchy is broken, objcgs should also
be reparented directly to root? Otherwise it will still be possible to
underflow the counter in a descendant of root which has use_hierarchy=0,
but also has children.

> [...]
Richard Palethorpe Oct. 20, 2020, 12:02 p.m. UTC | #13
Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello Roman,
>
> Roman Gushchin <guro@fb.com> writes:
>
>> -		page_counter_init(&memcg->memory, NULL);
>> -		page_counter_init(&memcg->swap, NULL);
>> -		page_counter_init(&memcg->kmem, NULL);
>> -		page_counter_init(&memcg->tcpmem, NULL);
>> +		/*
>> +		 * If use_hierarchy == false, consider all page counters direct
>> +		 * descendants of the corresponding root level counters.
>> +		 */
>> +		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
>> +		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
>> +		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
>> +		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
>> +
>>  		/*
>>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>>  		 * much sense so let cgroup subsystem know about this
>
> Perhaps in this case, where the hierarchy is broken, objcgs should also
> be reparented directly to root? Otherwise it will still be possible to
> underflow the counter in a descendant of root which has use_hierarchy=0,
> but also has children.

Sorry, ignore me: parent_mem_cgroup() already selects root. So in the case
of a broken hierarchy, objcgs are reparented directly to root.
Richard Palethorpe Oct. 20, 2020, 2:48 p.m. UTC | #14
Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> [...]

This for sure prevents the counter underflow reported by madvise06 and
makes my patch redundant. It is perhaps significantly more intrusive,
though, so maybe not suitable for the 5.9 stable branch?

Tested-by: Richard Palethorpe <rpalethorpe@suse.com>
Michal Koutný Oct. 20, 2020, 4:27 p.m. UTC | #15
Hi.

On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <guro@fb.com> wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/

> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...

> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.


> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root
counters and keeping the traversal as is.

> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)

Michal
Shakeel Butt Oct. 20, 2020, 4:55 p.m. UTC | #16
On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > The central try_charge() function charges recursively all the way up
> > > to and including the root.
> > Except for use_hiearchy=0 (which is the case here as Richard
> > wrote). The reparenting is hence somewhat incompatible with
> > new_parent.use_hiearchy=0 :-/
> >
> > > We should clean this up one way or another: either charge the root or
> > > don't, but do it consistently.
> > I agree this'd be good to unify. One upside of excluding root memcg from
> > charging is that users are spared from the charging overhead when memcg
> > tree is not created.  (Actually, I thought that was the reason for this
> > exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

This patch is only doing the page counter part of the cleanup (i.e. not
updating root's page counters when a descendant's page counter is
updated) but not the stats part.

For the user memory, we do update the stats for the root memcg and do
set page->mem_cgroup = root_mem_cgroup. However, this is not done for
the kmem/obj. I thought this is what Johannes was asking for in the
cleanup.
Roman Gushchin Oct. 20, 2020, 5:07 p.m. UTC | #17
On Tue, Oct 20, 2020 at 06:27:14PM +0200, Michal Koutny wrote:
> Hi.
> 
> On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> s/ancestor/descendant/

Oops, will fix, thanks!

> 
> > The root memory cgroup doesn't show the charge to a user, neither it
> > does allow to set any limits/protections.
> An appealing claim, I'd like this to be true...
> 
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> ...and it almost is. But there are still exposed kmem and tcpmem counters.

Hm, I wonder what they show, given that we never set sk->sk_memcg
to the root_mem_cgroup (see mem_cgroup_sk_alloc()) and we never charge
the root_mem_cgroup for !slab kmem allocations (see __memcg_kmem_charge_page()).

So yeah, it's quite a mess now, and it looks like it has been broken
in multiple places and for a while.

If we want these counter to function properly, then we should go into the opposite
direction and remove the special handling of the root memory cgroup in many places.

> > To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> If the assumptions behind this patch's idea were true, I think the
> implementation would be simpler by merely (not)connecting the root
> counters and keep the traversal as is.

We use some fields in root page counters to calculate protections:
see propagate_protected_usage().

Thanks!
Roman Gushchin Oct. 20, 2020, 5:17 p.m. UTC | #18
On Tue, Oct 20, 2020 at 09:55:38AM -0700, Shakeel Butt wrote:
> On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <guro@fb.com> wrote:
> > [...]
> 
> This patch is only doing the page counter part of the cleanup (i.e. to
> not update root's page counters when descendent's page counter is
> updated) but not the stats part.
> 
> For the user memory, we do update the stats for the root memcg and do
> set page->mem_cgroup = root_mem_cgroup. However this is not done for
> the kmem/obj. I thought this is what Johannes was asking for the
> cleanup.

Yes, it's not the whole story, of course.

Actually I missed that we do export root kmem and tcpmem counters
on cgroup v1 (thanks, Michal, for pointing it out!). If we want them to
function properly, we have to go in the opposite direction and start
charging the root cgroup for all kinds of kernel memory allocations.

We also have the same problem with the root MEMCG_SOCK stats, which seem
to be broken now.

I'll prepare a patch.

Thanks!
Johannes Weiner Oct. 20, 2020, 6:18 p.m. UTC | #19
On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> If we want these counter to function properly, then we should go into the opposite
> direction and remove the special handling of the root memory cgroup in many places.

I suspect this is also by far the most robust solution from a code and
maintenance POV.

I don't recall the page counter at the root level having been a
concern in recent years, even though it's widely used in production
environments. It's lockless and cache compact. It's also per-cpu
batched, which means it isn't actually part of the memcg hotpath.
Roman Gushchin Oct. 21, 2020, 7:33 p.m. UTC | #20
On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > If we want these counter to function properly, then we should go into the opposite
> > direction and remove the special handling of the root memory cgroup in many places.
> 
> I suspect this is also by far the most robust solution from a code and
> maintenance POV.
> 
> I don't recall the page counter at the root level having been a
> concern in recent years, even though it's widely used in production
> environments. It's lockless and cache compact. It's also per-cpu
> batched, which means it isn't actually part of the memcg hotpath.


I agree.

Here is my first attempt. Comments are welcome!

It doesn't solve the original problem though (use_hierarchy == false and
objcg reparenting); I'll send a separate patch for that.

Thanks!

--

From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 20 Oct 2020 18:05:43 -0700
Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
 specially

Currently the root memory cgroup is treated in a special way:
it's not charged and uncharged directly (only indirectly with their
descendants), processes belonging to the root memory cgroup are exempt
from the kernel- and the socket memory accounting.

At the same time some of root level statistics and data are available
to a user:
  - cgroup v2: memory.stat
  - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
               memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes

Historically the reason for a special treatment was an avoidance
of extra performance cost, however now it's unlikely a good reason:
over years there was a significant improvement in the performance
of the memory cgroup code. Also on a modern system actively using
cgroups (e.g. managed by systemd) there are usually no (significant)
processes in the root memory cgroup.

The special treatment of the root memory cgroups creates a number of
issues visible to a user:
1) slab stats on the root level do not include the slab memory
   consumed by processes in the root memory cgroup
2) non-slab kernel memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.usage_in_bytes
3) socket memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.tcp.usage_in_bytes

It complicates the code and increases a risk of new bugs.

This patch removes a number of exceptions related to the handling of
the root memory cgroup. With this patch applied the root memory cgroup
is treated the same as other cgroups in the following cases:
1) the root memory cgroup is charged and uncharged directly, try_charge()
   and cancel_charge() do not return immediately if the root memory
   cgroup is passed. uncharge_batch() and __mem_cgroup_clear_mc()
   do not handle the root memory cgroup specially.
2) per-memcg slab statistics are gathered for the root memory cgroup
3) shrinkers infra treats the root memory cgroup as any other memory
   cgroup
4) non-slab kernel memory accounting doesn't exclude pages allocated
   by processes belonging to the root memory cgroup
5) if a socket is opened by a process in the root memory cgroup,
   the socket memory is accounted
6) root cgroup is charged for the used swap memory.

Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            | 82 ++++++++++++++------------------------
 mm/vmscan.c                |  9 +----
 3 files changed, 31 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..d3653eb5d1b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,8 +416,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 {
 	/*
-	 * The root memcg doesn't account charges, and doesn't support
-	 * protection.
+	 * The root memcg doesn't support memory protection.
 	 */
 	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..a8bdca0f58f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -438,9 +438,6 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	struct memcg_shrinker_map *map;
 	int nid;
 
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	for_each_node(nid) {
 		pn = mem_cgroup_nodeinfo(memcg, nid);
 		map = rcu_dereference_protected(pn->shrinker_map, true);
@@ -455,9 +452,6 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	struct memcg_shrinker_map *map;
 	int nid, size, ret = 0;
 
-	if (mem_cgroup_is_root(memcg))
-		return 0;
-
 	mutex_lock(&memcg_shrinker_map_mutex);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
@@ -489,8 +483,6 @@ int memcg_expand_shrinker_maps(int new_id)
 		goto unlock;
 
 	for_each_mem_cgroup(memcg) {
-		if (mem_cgroup_is_root(memcg))
-			continue;
 		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
@@ -506,7 +498,7 @@ int memcg_expand_shrinker_maps(int new_id)
 
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
-	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+	if (shrinker_id >= 0 && memcg) {
 		struct memcg_shrinker_map *map;
 
 		rcu_read_lock();
@@ -868,7 +860,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 	memcg = mem_cgroup_from_obj(p);
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
-	if (!memcg || memcg == root_mem_cgroup) {
+	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
 	} else {
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -2439,8 +2431,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							     gfp_mask, true);
 		psi_memstall_leave(&pflags);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return nr_reclaimed;
 }
@@ -2532,8 +2523,7 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 		overage = calculate_overage(page_counter_read(&memcg->memory),
 					    READ_ONCE(memcg->memory.high));
 		max_overage = max(overage, max_overage);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return max_overage;
 }
@@ -2548,8 +2538,7 @@ static u64 swap_find_max_overage(struct mem_cgroup *memcg)
 		if (overage)
 			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
 		max_overage = max(overage, max_overage);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return max_overage;
 }
@@ -2686,8 +2675,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool drained = false;
 	unsigned long pflags;
 
-	if (mem_cgroup_is_root(memcg))
-		return 0;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
@@ -2873,9 +2860,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 #if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -2978,7 +2962,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 	else
 		memcg = mem_cgroup_from_task(current);
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		objcg = rcu_dereference(memcg->objcg);
 		if (objcg && obj_cgroup_tryget(objcg))
 			break;
@@ -3096,15 +3080,16 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	int ret = 0;
 
 	memcg = get_mem_cgroup_from_current();
-	if (memcg && !mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
-		if (!ret) {
-			page->mem_cgroup = memcg;
-			__SetPageKmemcg(page);
-			return 0;
-		}
-		css_put(&memcg->css);
+	if (!memcg)
+		return 0;
+
+	ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+	if (!ret) {
+		page->mem_cgroup = memcg;
+		__SetPageKmemcg(page);
+		return 0;
 	}
+	css_put(&memcg->css);
 	return ret;
 }
 
@@ -3121,7 +3106,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (!memcg)
 		return;
 
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 	css_put(&memcg->css);
@@ -5913,8 +5897,7 @@ static void __mem_cgroup_clear_mc(void)
 	/* we must fixup refcnts and charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
-		if (!mem_cgroup_is_root(mc.from))
-			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
+		page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
 		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
@@ -5922,8 +5905,7 @@ static void __mem_cgroup_clear_mc(void)
 		 * we charged both to->memory and to->memsw, so we
 		 * should uncharge to->memory.
 		 */
-		if (!mem_cgroup_is_root(mc.to))
-			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+		page_counter_uncharge(&mc.to->memory, mc.moved_swap);
 
 		mc.moved_swap = 0;
 	}
@@ -6824,14 +6806,12 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(ug->memcg)) {
-		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
-		if (do_memsw_account())
-			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
-	}
+	page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+		page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+	memcg_oom_recover(ug->memcg);
 
 	local_irq_save(flags);
 	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
@@ -7013,8 +6993,6 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
-	if (memcg == root_mem_cgroup)
-		goto out;
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
 		goto out;
 	if (css_tryget(&memcg->css))
@@ -7195,12 +7173,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	page->mem_cgroup = NULL;
 
-	if (!mem_cgroup_is_root(memcg))
-		page_counter_uncharge(&memcg->memory, nr_entries);
+	page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (!cgroup_memory_noswap && memcg != swap_memcg) {
-		if (!mem_cgroup_is_root(swap_memcg))
-			page_counter_charge(&swap_memcg->memsw, nr_entries);
+		page_counter_charge(&swap_memcg->memsw, nr_entries);
 		page_counter_uncharge(&memcg->memsw, nr_entries);
 	}
 
@@ -7249,7 +7225,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 
 	memcg = mem_cgroup_id_get_online(memcg);
 
-	if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+	if (!cgroup_memory_noswap &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7281,7 +7257,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
-		if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
+		if (!cgroup_memory_noswap) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
@@ -7299,7 +7275,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 
 	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
 				      READ_ONCE(memcg->swap.max) -
 				      page_counter_read(&memcg->swap));
@@ -7321,7 +7297,7 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		unsigned long usage = page_counter_read(&memcg->swap);
 
 		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d848c76e035a..fb6b3cbe0764 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -651,14 +651,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
 
-	/*
-	 * The root memcg might be allocated even though memcg is disabled
-	 * via "cgroup_disable=memory" boot parameter.  This could make
-	 * mem_cgroup_is_root() return false, then just run memcg slab
-	 * shrink, but skip global shrink.  This may result in premature
-	 * oom.
-	 */
-	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_disabled())
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
 	if (!down_read_trylock(&shrinker_rwsem))
Johannes Weiner Oct. 23, 2020, 4:30 p.m. UTC | #21
On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > If we want these counter to function properly, then we should go into the opposite
> > > direction and remove the special handling of the root memory cgroup in many places.
> > 
> > I suspect this is also by far the most robust solution from a code and
> > maintenance POV.
> > 
> > I don't recall the page counter at the root level having been a
> > concern in recent years, even though it's widely used in production
> > environments. It's lockless and cache compact. It's also per-cpu
> > batched, which means it isn't actually part of the memcg hotpath.
> 
> 
> I agree.
> 
> Here is my first attempt. Comments are welcome!
> 
> It doesn't solve the original problem though (use_hierarchy == false and
> objcg reparenting), I'll send a separate patch for that.
> 
> Thanks!
> 
> --
> 
> From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Tue, 20 Oct 2020 18:05:43 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
>  specially
> 
> Currently the root memory cgroup is treated in a special way:
> it's not charged and uncharged directly (only indirectly with their
> descendants), processes belonging to the root memory cgroup are exempt
> from the kernel- and the socket memory accounting.
> 
> At the same time some of root level statistics and data are available
> to a user:
>   - cgroup v2: memory.stat
>   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
>                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> 
> Historically the reason for a special treatment was an avoidance
> of extra performance cost, however now it's unlikely a good reason:
> over years there was a significant improvement in the performance
> of the memory cgroup code. Also on a modern system actively using
> cgroups (e.g. managed by systemd) there are usually no (significant)
> processes in the root memory cgroup.
> 
> The special treatment of the root memory cgroups creates a number of
> issues visible to a user:
> 1) slab stats on the root level do not include the slab memory
>    consumed by processes in the root memory cgroup
> 2) non-slab kernel memory consumed by processes in the root memory cgroup
>    is not included into memory.kmem.usage_in_bytes
> 3) socket memory consumed by processes in the root memory cgroup
>    is not included into memory.kmem.tcp.usage_in_bytes
> 
> It complicates the code and increases a risk of new bugs.
> 
> This patch removes a number of exceptions related to the handling of
> the root memory cgroup. With this patch applied the root memory cgroup
> is treated uniformly to other cgroups in the following cases:
> 1) root memory cgroup is charged and uncharged directly, try_charge()
>    and cancel_charge() do not return immediately if the root memory
>    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
>    do not handle the root memory cgroup specially.
> 2) per-memcg slab statistics is gathered for the root memory cgroup
> 3) shrinkers infra treats the root memory cgroup as any other memory
>    cgroup
> 4) non-slab kernel memory accounting doesn't exclude pages allocated
>    by processes belonging to the root memory cgroup
> 5) if a socket is opened by a process in the root memory cgroup,
>    the socket memory is accounted
> 6) root cgroup is charged for the used swap memory.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

This looks great.

The try_charge(), cancel_charge() etc. paths are relatively
straight-forward and look correct to me.

The swap counters too.

Slab is a bit trickier, but it also looks correct to me.

I'm having some trouble with the shrinkers. Currently, tracked objects
allocated in non-root cgroups live in that cgroup. Tracked objects in
the root cgroup, as well as untracked objects, live in a global pool.
When reclaim iterates all memcgs and calls shrink_slab(), we special
case the root_mem_cgroup and redirect to the global pool.

After your patch we have tracked objects allocated in the root cgroup
actually live in the root cgroup. Removing the shrinker special case
is correct in order to shrink those - but it removes the call to
shrink the global pool of untracked allocation classes.

I think we need to restore the double call to shrink_slab() we had
prior to this:

commit aeed1d325d429ac9699c4bf62d17156d60905519
Author: Vladimir Davydov <vdavydov.dev@gmail.com>
Date:   Fri Aug 17 15:48:17 2018 -0700

    mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
    
    The patch makes shrink_slab() be called for root_mem_cgroup in the same
    way as it's called for the rest of cgroups.  This simplifies the logic
    and improves the readability.

where we iterate through all cgroups, including the root, to reclaim
objects accounted to those respective groups; and then a call to scan
the global pool of untracked objects in that numa node.
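
Roughly, the shape could be something like this (illustrative only, not
the actual code; names follow mm/vmscan.c, details and error handling
elided, and the NULL memcg argument just stands for "the global pool"):

/* in shrink_node(): walk all memcgs, the root memcg included */
memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
do {
	/* tracked objects charged to this memcg */
	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
	...
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));

/* plus one pass over the global pool of untracked objects */
shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, sc->priority);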

For ease of review/verification, it could be helpful to split the
patch and remove the root exception case-by-case (not callsite by
callsite, but e.g. the swap counter, the memory counter etc.).
Michal Hocko Nov. 3, 2020, 1:22 p.m. UTC | #22
Hi,
I am sorry, but I was quite busy at the time this was discussed.
Now that I have finally got to this thread, I am wondering whether this
and the other root cgroup exceptions have had any follow-up.

I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
memcg: link page counters to root if use_hierarchy is false"). Thanks
for taking care of that! I do agree that this approach is the most
reasonable immediate fix.

Btw. we have been carrying a warning about non-hierarchical hierarchies
in our distribution kernels for years and haven't heard of any actual
bug reports (except for LTP-driven ones). So we might be really close to
simply dropping this functionality completely. This would simplify the
code and prevent future surprises.

Thanks!
Roman Gushchin Nov. 3, 2020, 9:30 p.m. UTC | #23
On Tue, Nov 03, 2020 at 02:22:21PM +0100, Michal Hocko wrote:
> Hi,
> I am sorry but I was quite busy at the time this has been discussed.
> Now that I got to this thread finally I am wondering whether this
> resp. other root cgroup exceptions have any follow up.

I'll address the feedback I've got from Johannes and will post an updated
version soon.

> 
> I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
> memcg: link page counters to root if use_hierarchy is false"). Thanks
> for taking care of that! I do agree that this approach is the most
> reasonable immediate fix.

Thanks!

> 
> Btw. we have been carrying a warning about non hierarchical hierarchies
> for years in our distribution kernels and haven't heard of any actual
> bug reports (except for LTP driven ones). So we might be really close to
> simply drop this functionality completely. This would simplify the code
> and prevent from future surprises.

Just sent an rfc patchset. Your feedback will be greatly appreciated!

Thanks!
Roman Gushchin Nov. 10, 2020, 1:27 a.m. UTC | #24
On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > If we want these counter to function properly, then we should go into the opposite
> > > > direction and remove the special handling of the root memory cgroup in many places.
> > > 
> > > I suspect this is also by far the most robust solution from a code and
> > > maintenance POV.
> > > 
> > > I don't recall the page counter at the root level having been a
> > > concern in recent years, even though it's widely used in production
> > > environments. It's lockless and cache compact. It's also per-cpu
> > > batched, which means it isn't actually part of the memcg hotpath.
> > 
> > 
> > I agree.
> > 
> > Here is my first attempt. Comments are welcome!
> > 
> > It doesn't solve the original problem though (use_hierarchy == false and
> > objcg reparenting), I'll send a separate patch for that.
> > 
> > Thanks!
> > 
> > --
> > 
> > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> >  specially
> > 
> > Currently the root memory cgroup is treated in a special way:
> > it's not charged and uncharged directly (only indirectly with their
> > descendants), processes belonging to the root memory cgroup are exempt
> > from the kernel- and the socket memory accounting.
> > 
> > At the same time some of root level statistics and data are available
> > to a user:
> >   - cgroup v2: memory.stat
> >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > 
> > Historically the reason for a special treatment was an avoidance
> > of extra performance cost, however now it's unlikely a good reason:
> > over years there was a significant improvement in the performance
> > of the memory cgroup code. Also on a modern system actively using
> > cgroups (e.g. managed by systemd) there are usually no (significant)
> > processes in the root memory cgroup.
> > 
> > The special treatment of the root memory cgroups creates a number of
> > issues visible to a user:
> > 1) slab stats on the root level do not include the slab memory
> >    consumed by processes in the root memory cgroup
> > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> >    is not included into memory.kmem.usage_in_bytes
> > 3) socket memory consumed by processes in the root memory cgroup
> >    is not included into memory.kmem.tcp.usage_in_bytes
> > 
> > It complicates the code and increases a risk of new bugs.
> > 
> > This patch removes a number of exceptions related to the handling of
> > the root memory cgroup. With this patch applied the root memory cgroup
> > is treated uniformly to other cgroups in the following cases:
> > 1) root memory cgroup is charged and uncharged directly, try_charge()
> >    and cancel_charge() do not return immediately if the root memory
> >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> >    do not handle the root memory cgroup specially.
> > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > 3) shrinkers infra treats the root memory cgroup as any other memory
> >    cgroup
> > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> >    by processes belonging to the root memory cgroup
> > 5) if a socket is opened by a process in the root memory cgroup,
> >    the socket memory is accounted
> > 6) root cgroup is charged for the used swap memory.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> This looks great.
> 
> The try_charge(), cancel_charge() etc. paths are relatively
> straight-forward and look correct to me.
> 
> The swap counters too.
> 
> Slab is a bit trickier, but it also looks correct to me.
> 
> I'm having some trouble with the shrinkers. Currently, tracked objects
> allocated in non-root cgroups live in that cgroup. Tracked objects in
> the root cgroup, as well as untracked objects, live in a global pool.
> When reclaim iterates all memcgs and calls shrink_slab(), we special
> case the root_mem_cgroup and redirect to the global pool.
> 
> After your patch we have tracked objects allocated in the root cgroup
> actually live in the root cgroup. Removing the shrinker special case
> is correct in order to shrink those - but it removes the call to
> shrink the global pool of untracked allocation classes.
> 
> I think we need to restore the double call to shrink_slab() we had
> prior to this:
> 
> commit aeed1d325d429ac9699c4bf62d17156d60905519
> Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> Date:   Fri Aug 17 15:48:17 2018 -0700
> 
>     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
>     
>     The patch makes shrink_slab() be called for root_mem_cgroup in the same
>     way as it's called for the rest of cgroups.  This simplifies the logic
>     and improves the readability.
> 
> where we iterate through all cgroups, including the root, to reclaim
> objects accounted to those respective groups; and then a call to scan
> the global pool of untracked objects in that numa node.

I agree, thank you for pointing at this commit.

> 
> For ease of review/verification, it could be helpful to split the
> patch and remove the root exception case-by-case (not callsite by
> callsite, but e.g. the swap counter, the memory counter etc.).

Sorry for the long pause, here's an update. I've split the patch,
fixed a couple of issues and was almost ready to send it upstream,
but then I noticed that on cgroup v1 the kmem and memsw counters
sometimes head into negative territory and generate a warning
in dmesg. It happens for a short time at early stages of the system
uptime. I haven't seen it happening with the memory counter.

My investigation showed that the reason is that the result of a
cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
early stages. Depending on the return value we charge or skip the kmem
counter and also handle the swap/memsw counter differently.

The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
can change at any moment. So I don't see how to make all of root's
counters consistent without tracking them all no matter which cgroup
version is used, which is obviously overkill and would add an overhead
that can hardly be justified.
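
To illustrate the asymmetry (simplified from __memcg_kmem_charge() and
__memcg_kmem_uncharge(); only the v1-only kmem counter handling is
shown):

/* charge side, while cgroup_subsys_on_dfl() still returns true (v2): */
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))	/* false -> skipped */
	page_counter_try_charge(&memcg->kmem, nr_pages, &counter);

/* ... the v1 hierarchy is mounted, cgroup_subsys_on_dfl() flips ... */

/* uncharge side, cgroup_subsys_on_dfl() now returns false (v1): */
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))	/* true -> executed */
	page_counter_uncharge(&memcg->kmem, nr_pages);

/* memcg->kmem was never charged: the counter goes negative and warns */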

I'd appreciate any ideas, but I don't see a good path forward here
(except fixing the particular issue with root's slab stats with
Muchun's patch).

Thanks!
Shakeel Butt Nov. 10, 2020, 3:11 p.m. UTC | #25
On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > >
> > > > I suspect this is also by far the most robust solution from a code and
> > > > maintenance POV.
> > > >
> > > > I don't recall the page counter at the root level having been a
> > > > concern in recent years, even though it's widely used in production
> > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > batched, which means it isn't actually part of the memcg hotpath.
> > >
> > >
> > > I agree.
> > >
> > > Here is my first attempt. Comments are welcome!
> > >
> > > It doesn't solve the original problem though (use_hierarchy == false and
> > > objcg reparenting), I'll send a separate patch for that.
> > >
> > > Thanks!
> > >
> > > --
> > >
> > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <guro@fb.com>
> > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > >  specially
> > >
> > > Currently the root memory cgroup is treated in a special way:
> > > it's not charged and uncharged directly (only indirectly with their
> > > descendants), processes belonging to the root memory cgroup are exempt
> > > from the kernel- and the socket memory accounting.
> > >
> > > At the same time some of root level statistics and data are available
> > > to a user:
> > >   - cgroup v2: memory.stat
> > >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > >
> > > Historically the reason for a special treatment was an avoidance
> > > of extra performance cost, however now it's unlikely a good reason:
> > > over years there was a significant improvement in the performance
> > > of the memory cgroup code. Also on a modern system actively using
> > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > processes in the root memory cgroup.
> > >
> > > The special treatment of the root memory cgroups creates a number of
> > > issues visible to a user:
> > > 1) slab stats on the root level do not include the slab memory
> > >    consumed by processes in the root memory cgroup
> > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > >    is not included into memory.kmem.usage_in_bytes
> > > 3) socket memory consumed by processes in the root memory cgroup
> > >    is not included into memory.kmem.tcp.usage_in_bytes
> > >
> > > It complicates the code and increases a risk of new bugs.
> > >
> > > This patch removes a number of exceptions related to the handling of
> > > the root memory cgroup. With this patch applied the root memory cgroup
> > > is treated uniformly to other cgroups in the following cases:
> > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > >    and cancel_charge() do not return immediately if the root memory
> > >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > >    do not handle the root memory cgroup specially.
> > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > >    cgroup
> > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > >    by processes belonging to the root memory cgroup
> > > 5) if a socket is opened by a process in the root memory cgroup,
> > >    the socket memory is accounted
> > > 6) root cgroup is charged for the used swap memory.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > This looks great.
> >
> > The try_charge(), cancel_charge() etc. paths are relatively
> > straight-forward and look correct to me.
> >
> > The swap counters too.
> >
> > Slab is a bit trickier, but it also looks correct to me.
> >
> > I'm having some trouble with the shrinkers. Currently, tracked objects
> > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > the root cgroup, as well as untracked objects, live in a global pool.
> > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > case the root_mem_cgroup and redirect to the global pool.
> >
> > After your patch we have tracked objects allocated in the root cgroup
> > actually live in the root cgroup. Removing the shrinker special case
> > is correct in order to shrink those - but it removes the call to
> > shrink the global pool of untracked allocation classes.
> >
> > I think we need to restore the double call to shrink_slab() we had
> > prior to this:
> >
> > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Date:   Fri Aug 17 15:48:17 2018 -0700
> >
> >     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> >
> >     The patch makes shrink_slab() be called for root_mem_cgroup in the same
> >     way as it's called for the rest of cgroups.  This simplifies the logic
> >     and improves the readability.
> >
> > where we iterate through all cgroups, including the root, to reclaim
> > objects accounted to those respective groups; and then a call to scan
> > the global pool of untracked objects in that numa node.
>
> I agree, thank you for pointing at this commit.
>
> >
> > For ease of review/verification, it could be helpful to split the
> > patch and remove the root exception case-by-case (not callsite by
> > callsite, but e.g. the swap counter, the memory counter etc.).
>
> Sorry for a long pause, here's an update. I've split the patch,
> fixed a couple of issues and was almost ready to send it upstream,
> but then I've noticed that on cgroup v1 kmem and memsw counters
> are sometimes heading into a negative territory and generating a warning
> in dmesg. It happens for a short amount of time at early stages
> of the system uptime. I haven't seen it happening with the memory counter.
>
> My investigation showed that the reason is that the result of a
> cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> early stages. Depending on the return value we charge or skip the kmem
> counter and also handle the swap/memsw counter differently.
>
> The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> can change at any particular moment. So I don't see how to make all root's
> counters consistent without tracking them all no matter which cgroup version
> is used. Which is obviously an overkill and will lead to an overhead, which
> unlikely can be justified.
>
> I'll appreciate any ideas, but I don't see a good path forward here
> (except fixing a particular issue with root's slab stats with the
> Muchun's patch).
>

Since commit 0158115f702b0 ("memcg, kmem: deprecate
kmem.limit_in_bytes"), we are in the process of deprecating the limit
on kmem. If we decide that now is the time to deprecate it, we can
convert the kmem page counter to a memcg stat, update it for both v1
and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
memcg stat is more efficient than the page counter, so I don't think
overhead should be an issue. This new memcg stat would represent all
types of kmem memory for a memcg, like slab, stack and no-type. What do
you think?
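
For concreteness, roughly something like this (just a sketch; MEMCG_KMEM
here is a hypothetical new entry in enum memcg_stat_item, it doesn't
exist today):

/* kmem charge/uncharge paths update a memcg stat instead of the
 * dedicated kmem page counter */
mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
...
mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);

/* and v1's memory.kmem.usage_in_bytes would be served from the stat */
return (u64)memcg_page_state(memcg, MEMCG_KMEM) * PAGE_SIZE;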
Roman Gushchin Nov. 10, 2020, 7:13 p.m. UTC | #26
On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > > >
> > > > > I suspect this is also by far the most robust solution from a code and
> > > > > maintenance POV.
> > > > >
> > > > > I don't recall the page counter at the root level having been a
> > > > > concern in recent years, even though it's widely used in production
> > > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > > batched, which means it isn't actually part of the memcg hotpath.
> > > >
> > > >
> > > > I agree.
> > > >
> > > > Here is my first attempt. Comments are welcome!
> > > >
> > > > It doesn't solve the original problem though (use_hierarchy == false and
> > > > objcg reparenting), I'll send a separate patch for that.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > >
> > > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > > From: Roman Gushchin <guro@fb.com>
> > > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > > >  specially
> > > >
> > > > Currently the root memory cgroup is treated in a special way:
> > > > it's not charged and uncharged directly (only indirectly with their
> > > > descendants), processes belonging to the root memory cgroup are exempt
> > > > from the kernel- and the socket memory accounting.
> > > >
> > > > At the same time some of root level statistics and data are available
> > > > to a user:
> > > >   - cgroup v2: memory.stat
> > > >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > > >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > > >
> > > > Historically the reason for a special treatment was an avoidance
> > > > of extra performance cost, however now it's unlikely a good reason:
> > > > over years there was a significant improvement in the performance
> > > > of the memory cgroup code. Also on a modern system actively using
> > > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > > processes in the root memory cgroup.
> > > >
> > > > The special treatment of the root memory cgroups creates a number of
> > > > issues visible to a user:
> > > > 1) slab stats on the root level do not include the slab memory
> > > >    consumed by processes in the root memory cgroup
> > > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > > >    is not included into memory.kmem.usage_in_bytes
> > > > 3) socket memory consumed by processes in the root memory cgroup
> > > >    is not included into memory.kmem.tcp.usage_in_bytes
> > > >
> > > > It complicates the code and increases a risk of new bugs.
> > > >
> > > > This patch removes a number of exceptions related to the handling of
> > > > the root memory cgroup. With this patch applied the root memory cgroup
> > > > is treated uniformly to other cgroups in the following cases:
> > > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > >    and cancel_charge() do not return immediately if the root memory
> > > >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > >    do not handle the root memory cgroup specially.
> > > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > > >    cgroup
> > > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > > >    by processes belonging to the root memory cgroup
> > > > 5) if a socket is opened by a process in the root memory cgroup,
> > > >    the socket memory is accounted
> > > > 6) root cgroup is charged for the used swap memory.
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > This looks great.
> > >
> > > The try_charge(), cancel_charge() etc. paths are relatively
> > > straight-forward and look correct to me.
> > >
> > > The swap counters too.
> > >
> > > Slab is a bit trickier, but it also looks correct to me.
> > >
> > > I'm having some trouble with the shrinkers. Currently, tracked objects
> > > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > > the root cgroup, as well as untracked objects, live in a global pool.
> > > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > > case the root_mem_cgroup and redirect to the global pool.
> > >
> > > After your patch we have tracked objects allocated in the root cgroup
> > > actually live in the root cgroup. Removing the shrinker special case
> > > is correct in order to shrink those - but it removes the call to
> > > shrink the global pool of untracked allocation classes.
> > >
> > > I think we need to restore the double call to shrink_slab() we had
> > > prior to this:
> > >
> > > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > > Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > Date:   Fri Aug 17 15:48:17 2018 -0700
> > >
> > >     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> > >
> > >     The patch makes shrink_slab() be called for root_mem_cgroup in the same
> > >     way as it's called for the rest of cgroups.  This simplifies the logic
> > >     and improves the readability.
> > >
> > > where we iterate through all cgroups, including the root, to reclaim
> > > objects accounted to those respective groups; and then a call to scan
> > > the global pool of untracked objects in that numa node.
> >
> > I agree, thank you for pointing at this commit.
> >
> > >
> > > For ease of review/verification, it could be helpful to split the
> > > patch and remove the root exception case-by-case (not callsite by
> > > callsite, but e.g. the swap counter, the memory counter etc.).
> >
> > Sorry for a long pause, here's an update. I've split the patch,
> > fixed a couple of issues and was almost ready to send it upstream,
> > but then I've noticed that on cgroup v1 kmem and memsw counters
> > are sometimes heading into a negative territory and generating a warning
> > in dmesg. It happens for a short amount of time at early stages
> > of the system uptime. I haven't seen it happening with the memory counter.
> >
> > My investigation showed that the reason is that the result of a
> > cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> > early stages. Depending on the return value we charge or skip the kmem
> > counter and also handle the swap/memsw counter differently.
> >
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment. So I don't see how to make all root's
> > counters consistent without tracking them all no matter which cgroup version
> > is used. Which is obviously an overkill and will lead to an overhead, which
> > unlikely can be justified.
> >
> > I'll appreciate any ideas, but I don't see a good path forward here
> > (except fixing a particular issue with root's slab stats with the
> > Muchun's patch).
> >
> 
> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
> memcg stat is more efficient than the page counter, so I don't think
> overhead should be an issue. This new memcg stat represents all types
> of kmem memory for a memcg like slab, stack and no-type. What do you
> think?

Hm, I don't see why it can't work, but it will not solve the memsw problem,
right?

Correct handling of root's kmem and tcpmem counters was the main reason
to select "always charge the root memory cgroup" over "never charge the
root memory cgroup". If we're not going to use these counters directly,
then it's probably better to never charge the root memory cgroup, at
least from a performance point of view.

Thanks!
Michal Koutný Nov. 20, 2020, 5:46 p.m. UTC | #27
Hi.

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when a singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely freed.)

> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the
necessary counters in both directions using the statistics (or some
complementary accounting, without removing the fine-grained counters).

I didn't check all the charging/uncharging places; these are just my 2
cents on the issue.

Michal
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..214e1fe4e9a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@  static void obj_cgroup_release(struct percpu_ref *ref)
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	memcg = obj_cgroup_memcg(objcg);
-	if (nr_pages)
+	if (nr_pages && !mem_cgroup_is_root(memcg))
 		__memcg_kmem_uncharge(memcg, nr_pages);
 	list_del(&objcg->list);
 	mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
+	struct mem_cgroup *memcg;
 
 	if (!old)
 		return;
@@ -3110,7 +3111,9 @@  static void drain_obj_stock(struct memcg_stock_pcp *stock)
 
 		if (nr_pages) {
 			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			memcg = obj_cgroup_memcg(old);
+			if (!mem_cgroup_is_root(memcg))
+				__memcg_kmem_uncharge(memcg, nr_pages);
 			rcu_read_unlock();
 		}