[RFC,00/15] Use obj_cgroup APIs to charge the LRU pages

Message ID 20210330101531.82752-1-songmuchun@bytedance.com (mailing list archive)
Series Use obj_cgroup APIs to charge the LRU pages

Message

Muchun Song March 30, 2021, 10:15 a.m. UTC
Since the following patchsets were applied, all of the kernel memory is
charged with the new obj_cgroup APIs:

	[v17,00/19] The new cgroup slab memory controller
	[v5,0/7] Use obj_cgroup APIs to charge kmem pages

But user memory allocations (LRU pages) can pin memcgs for a long time.
This happens at a larger scale and causes recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted into
a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.

We can fix this problem by converting LRU pages, and most other raw memcg
pins, to the objcg direction; the LRU pages then no longer pin the memcgs.

This patchset aims to make LRU pages drop their reference to the memory
cgroup by using the obj_cgroup APIs. With the series applied, the number
of dying cgroups no longer increases when running the following test
script.

```bash
#!/bin/bash

cat /proc/cgroups | grep memory

cd /sys/fs/cgroup/memory

for i in {1..500}
do
	mkdir test
	echo $$ > test/cgroup.procs
	sleep 60 &
	echo $$ > cgroup.procs
	echo `cat test/cgroup.procs` > cgroup.procs
	rmdir test
done

cat /proc/cgroups | grep memory
```
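
For reference, a rough way to see how many of those memory cgroups are
dying (removed but still pinned) is to compare the count reported in
/proc/cgroups with the number of cgroup directories that are still
visible. This is only an illustrative sketch for a cgroup v1 setup, not
part of the series:

```bash
#!/bin/bash
# Total memcgs known to the kernel (includes offlined/dying ones).
total=$(awk '$1 == "memory" { print $3 }' /proc/cgroups)
# Memcgs that are still visible as directories (live ones).
visible=$(find /sys/fs/cgroup/memory -type d | wc -l)
echo "total memcgs:   $total"
echo "visible memcgs: $visible"
echo "dying (approx): $((total - visible))"
```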

Patch 1 fixes page charging in page replacement.
Patches 2-5 are code cleanups and simplifications.
Patches 6-15 convert the LRU page pin to the objcg direction.

Muchun Song (15):
  mm: memcontrol: fix page charging in page replacement
  mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
  mm: memcontrol: use lruvec_memcg in lruvec_holds_page_lru_lock
  mm: memcontrol: simplify the logic of objcg pinning memcg
  mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
  mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
  mm: memcontrol: make lruvec lock safe when the LRU pages reparented
  mm: thp: introduce lock/unlock_split_queue{_irqsave}()
  mm: thp: make deferred split queue lock safe when the LRU pages
    reparented
  mm: memcontrol: make all the callers of page_memcg() safe
  mm: memcontrol: introduce memcg_reparent_ops
  mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
  mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
  mm: lru: add VM_BUG_ON_PAGE to lru maintenance function

 Documentation/admin-guide/cgroup-v1/memory.rst |   2 +-
 fs/buffer.c                                    |  13 +-
 fs/fs-writeback.c                              |  23 +-
 fs/iomap/buffered-io.c                         |   4 +-
 include/linux/memcontrol.h                     | 197 +++++----
 include/linux/mm_inline.h                      |   6 +
 mm/compaction.c                                |  42 +-
 mm/filemap.c                                   |   2 +-
 mm/huge_memory.c                               | 175 +++++++-
 mm/memcontrol.c                                | 578 +++++++++++++++++--------
 mm/migrate.c                                   |   4 +
 mm/page-writeback.c                            |  28 +-
 mm/page_io.c                                   |   5 +-
 mm/rmap.c                                      |  14 +-
 mm/swap.c                                      |   7 +-
 mm/vmscan.c                                    |   3 +-
 mm/workingset.c                                |   2 +-
 17 files changed, 762 insertions(+), 343 deletions(-)

Comments

Shakeel Butt March 30, 2021, 6:34 p.m. UTC | #1
On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
>
>         [v17,00/19] The new cgroup slab memory controller
>         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
>
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
>
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
>
> ```bash
> #!/bin/bash
>
> cat /proc/cgroups | grep memory
>
> cd /sys/fs/cgroup/memory
>
> for i in range{1..500}
> do
>         mkdir test
>         echo $$ > test/cgroup.procs
>         sleep 60 &
>         echo $$ > cgroup.procs
>         echo `cat test/cgroup.procs` > cgroup.procs
>         rmdir test
> done
>
> cat /proc/cgroups | grep memory
> ```
>
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-15 convert LRU pages pin to the objcg direction.

The main concern I have with *just* reparenting LRU pages is that for
the long running systems, the root memcg will become a dumping ground.
In addition a job running multiple times on a machine will see
inconsistent memory usage if it re-accesses the file pages which were
reparented to the root memcg.

Please note that I do agree with the mentioned problem and we do see
this issue in our fleet. Internally we have a "memcg mount option"
feature which couples a file system with a memcg and all file pages
allocated on that file system will be charged to that memcg. Multiple
instances (concurrent or subsequent) of the job will use that file
system (with a dedicated memcg) without leaving the zombies behind. I
am not pushing for this solution as it comes with its own intricacies
(e.g. if memcg coupled with a file system has a limit, the oom
behavior would be awkward and therefore internally we don't put a
limit on such memcgs). Though I want this to be part of discussion.

I think the underlying reasons behind this issue are:

1) Filesystem shared by disjoint jobs.
2) For job dedicated filesystems, the lifetime of the filesystem is
different from the lifetime of the job.

For now, we have two potential solutions to the
zombies-due-to-offlined-LRU-pages i.e. (1) reparent and (2) pairing
memcg and filesystem. For reparenting, the cons are inconsistent
memory usage and root memcg potentially becoming dumping ground. For
pairing, the oom behavior is awkward which is the same for any type of
remote charging.

I am wondering how we can resolve the cons for each. To resolve the
root-memcg-dump issue in the reparenting, maybe we uncharge the page
when it reaches the root and the next accessor gets charged. For
inconsistent memory usage, either users accept the inconsistency or
force reclaim before offline which will reduce the benefit of sharing
filesystem with subsequent instances of the job. For the pairing,
maybe we can punt to the user/admin to not set a limit on such memcg
to avoid awkward oom situations.

Thoughts?
Roman Gushchin March 30, 2021, 6:58 p.m. UTC | #2
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> >
> >         [v17,00/19] The new cgroup slab memory controller
> >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> >
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> >
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> >
> > ```bash
> > #!/bin/bash
> >
> > cat /proc/cgroups | grep memory
> >
> > cd /sys/fs/cgroup/memory
> >
> > for i in range{1..500}
> > do
> >         mkdir test
> >         echo $$ > test/cgroup.procs
> >         sleep 60 &
> >         echo $$ > cgroup.procs
> >         echo `cat test/cgroup.procs` > cgroup.procs
> >         rmdir test
> > done
> >
> > cat /proc/cgroups | grep memory
> > ```
> >
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-15 convert LRU pages pin to the objcg direction.
> 
> The main concern I have with *just* reparenting LRU pages is that for
> the long running systems, the root memcg will become a dumping ground.
> In addition a job running multiple times on a machine will see
> inconsistent memory usage if it re-accesses the file pages which were
> reparented to the root memcg.

I agree, but reparenting also isn't perfect in combination with memory
protections (e.g. memory.low).

Imagine the following configuration:
workload.slice
- workload_gen_1.service   memory.min = 30G
- workload_gen_2.service   memory.min = 30G
- workload_gen_3.service   memory.min = 30G
  ...

Parent cgroup and several generations of the child cgroup, protected by a memory.low.
Once the memory is getting reparented, it's not protected anymore.
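
Such a hierarchy can be set up on cgroup v2 roughly like this (paths and
sizes are illustrative; systemd would normally manage these files via
MemoryMin=):

```bash
#!/bin/bash
# Illustrative only: recreate the hierarchy above by hand (as root).
cd /sys/fs/cgroup
mkdir -p workload.slice
# Let the children of workload.slice use the memory controller.
echo "+memory" > workload.slice/cgroup.subtree_control
for gen in 1 2 3; do
	mkdir -p workload.slice/workload_gen_$gen.service
	echo 30G > workload.slice/workload_gen_$gen.service/memory.min
done
# Note: for the children's protection to be effective, workload.slice
# itself also needs a memory.min grant from its own parent.
```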

I guess we need something smarter: e.g. reassign a page to a different
cgroup if the page is activated/rotated and is currently on a dying lru.

Also, I'm somewhat concerned about the interaction of the reparenting
with the writeback and dirty throttling. How does it work together?

Thanks!
Johannes Weiner March 30, 2021, 9:10 p.m. UTC | #3
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> >
> >         [v17,00/19] The new cgroup slab memory controller
> >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> >
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> >
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> >
> > ```bash
> > #!/bin/bash
> >
> > cat /proc/cgroups | grep memory
> >
> > cd /sys/fs/cgroup/memory
> >
> > for i in range{1..500}
> > do
> >         mkdir test
> >         echo $$ > test/cgroup.procs
> >         sleep 60 &
> >         echo $$ > cgroup.procs
> >         echo `cat test/cgroup.procs` > cgroup.procs
> >         rmdir test
> > done
> >
> > cat /proc/cgroups | grep memory
> > ```
> >
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-15 convert LRU pages pin to the objcg direction.
> 
> The main concern I have with *just* reparenting LRU pages is that for
> the long running systems, the root memcg will become a dumping ground.
> In addition a job running multiple times on a machine will see
> inconsistent memory usage if it re-accesses the file pages which were
> reparented to the root memcg.

I don't understand how Muchun's patches are supposed to *change* the
behavior the way you are describing it. This IS today's behavior.

We have hierarchical accounting, and a page that belongs to a leaf
cgroup will automatically belong to all its parents.

Further, if you delete a cgroup today, the abandoned cache will stay
physically linked to that cgroup, but that zombie group no longer acts
as a control structure: it imposes no limit and no protection; the
pages will be reclaimed as if it WERE linked to the parent.

For all intents and purposes, when you delete a cgroup today, its
remaining pages ARE dumped onto the parent.

The only difference is that today they pointlessly pin the leaf cgroup
as a holding vessel - which is then round-robin'd from the parent
during reclaim in order to pretend that all these child pages actually
ARE linked to the parent's LRU list.

Remember how we used to have every page physically linked to multiple
lrus? The leaf cgroup and the root?

All pages always belong to the (virtual) LRU list of all ancestor
cgroups. The only thing Muchun changes is that they no longer pin a
cgroup that has no semantical meaning anymore (because it's neither
visible to the user nor exerts any control over the pages anymore).

Maybe I'm missing something that either you or Roman can explain to
me. But this series looks like a (rare) pure win.

Whether you like the current semantics is a separate discussion IMO.

> Please note that I do agree with the mentioned problem and we do see
> this issue in our fleet. Internally we have a "memcg mount option"
> feature which couples a file system with a memcg and all file pages
> allocated on that file system will be charged to that memcg. Multiple
> instances (concurrent or subsequent) of the job will use that file
> system (with a dedicated memcg) without leaving the zombies behind. I
> am not pushing for this solution as it comes with its own intricacies
> (e.g. if memcg coupled with a file system has a limit, the oom
> behavior would be awkward and therefore internally we don't put a
> limit on such memcgs). Though I want this to be part of discussion.

Right, you disconnect memory from the tasks that are allocating it,
and so you can't assign culpability when you need to.

OOM is one thing, but there are also CPU cycles and IO bandwidth
consumed during reclaim.

> I think the underlying reasons behind this issue are:
> 
> 1) Filesystem shared by disjoint jobs.
> 2) For job dedicated filesystems, the lifetime of the filesystem is
> different from the lifetime of the job.

There is also the case of deleting a cgroup just to recreate it right
after for the same job. Many job managers do this on restart right now
- like systemd, and what we're using in our fleet. This seems
avoidable by recycling a group for another instance of the same job.

Sharing is a more difficult discussion. If you access a page that you
share with another cgroup, it may or may not be subject to your own or
your buddy's memory limits. The only limit it is guaranteed to be
subjected to is that of your parent. So one thing I could imagine is,
instead of having a separate cgroup outside the hierarchy, we would
reparent live pages the second they are accessed from a foreign
cgroup. And reparent them until you reach the first common ancestor.

This way, when you mount a filesystem shared by two jobs, you can put
them into a joint subtree, and the root level of this subtree captures
all the memory (as well as the reclaim CPU and IO) used by the two
jobs - the private portions and the shared portions - and doesn't make
them the liability of jobs in the system that DON'T share the same fs.
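
A minimal cgroup v2 sketch of that layout (names hypothetical, run as
root) might look like:

```bash
#!/bin/bash
cd /sys/fs/cgroup
# Joint subtree whose top level owns all memory used by both jobs,
# including the page cache of the shared filesystem.
mkdir -p shared_fs.slice
echo "+memory" > shared_fs.slice/cgroup.subtree_control
mkdir -p shared_fs.slice/job_a shared_fs.slice/job_b
# A limit at the top of the subtree bounds private + shared memory of
# both jobs, so reclaim work for the shared cache stays inside it.
echo 100G > shared_fs.slice/memory.max
```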

But again, this is a useful discussion to have, but I don't quite see
why it's relevant to Muchun's patches. They're purely an optimization.

So I'd like to clear that up first before going further.
Johannes Weiner March 30, 2021, 9:30 p.m. UTC | #4
On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > >
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > >
> > >         [v17,00/19] The new cgroup slab memory controller
> > >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > >
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > and make page reclaim very inefficient.
> > >
> > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > >
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > of the dying cgroups will not increase if we run the following test script.
> > >
> > > ```bash
> > > #!/bin/bash
> > >
> > > cat /proc/cgroups | grep memory
> > >
> > > cd /sys/fs/cgroup/memory
> > >
> > > for i in range{1..500}
> > > do
> > >         mkdir test
> > >         echo $$ > test/cgroup.procs
> > >         sleep 60 &
> > >         echo $$ > cgroup.procs
> > >         echo `cat test/cgroup.procs` > cgroup.procs
> > >         rmdir test
> > > done
> > >
> > > cat /proc/cgroups | grep memory
> > > ```
> > >
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > 
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
> 
> I agree, but also the reparenting is not the perfect thing in a combination
> with any memory protections (e.g. memory.low).
> 
> Imagine the following configuration:
> workload.slice
> - workload_gen_1.service   memory.min = 30G
> - workload_gen_2.service   memory.min = 30G
> - workload_gen_3.service   memory.min = 30G
>   ...
> 
> Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> Once the memory is getting reparented, it's not protected anymore.

That doesn't sound right.

A deleted cgroup today exerts no control over its abandoned
pages. css_reset() will blow out any control settings.

If you're talking about protection previously inherited by
workload.slice, that continues to apply as it always has.

None of this is really accidental. Per definition the workload.slice
control domain includes workload_gen_1.service. And per definition,
the workload_gen_1.service domain ceases to exist when you delete it.

There are no (or shouldn't be any!) semantic changes from the physical
unlinking from a dead control domain.

> Also, I'm somewhat concerned about the interaction of the reparenting
> with the writeback and dirty throttling. How does it work together?

What interaction specifically?

When you delete a cgroup that had both the block and the memory
controller enabled, the control domain of both goes away and it
becomes subject to whatever control domain is above it (if any).

A higher control domain in turn takes a recursive view of the subtree,
see mem_cgroup_wb_stats(), so when control is exerted, it applies
regardless of how and where pages are physically linked in children.

It's also already possible to enable e.g. block control only at a very
high level and memory control down to a lower level. Per design this
code can live with different domain sizes for memory and block.
Roman Gushchin March 30, 2021, 10:05 p.m. UTC | #5
On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > >
> > > > Since the following patchsets applied. All the kernel memory are charged
> > > > with the new APIs of obj_cgroup.
> > > >
> > > >         [v17,00/19] The new cgroup slab memory controller
> > > >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > >
> > > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > > it exists at a larger scale and is causing recurring problems in the real
> > > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > > second, third, fourth, ... instance of the same job that was restarted into
> > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > > and make page reclaim very inefficient.
> > > >
> > > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > >
> > > > This patchset aims to make the LRU pages to drop the reference to memory
> > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > > of the dying cgroups will not increase if we run the following test script.
> > > >
> > > > ```bash
> > > > #!/bin/bash
> > > >
> > > > cat /proc/cgroups | grep memory
> > > >
> > > > cd /sys/fs/cgroup/memory
> > > >
> > > > for i in range{1..500}
> > > > do
> > > >         mkdir test
> > > >         echo $$ > test/cgroup.procs
> > > >         sleep 60 &
> > > >         echo $$ > cgroup.procs
> > > >         echo `cat test/cgroup.procs` > cgroup.procs
> > > >         rmdir test
> > > > done
> > > >
> > > > cat /proc/cgroups | grep memory
> > > > ```
> > > >
> > > > Patch 1 aims to fix page charging in page replacement.
> > > > Patch 2-5 are code cleanup and simplification.
> > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > 
> > > The main concern I have with *just* reparenting LRU pages is that for
> > > the long running systems, the root memcg will become a dumping ground.
> > > In addition a job running multiple times on a machine will see
> > > inconsistent memory usage if it re-accesses the file pages which were
> > > reparented to the root memcg.
> > 
> > I agree, but also the reparenting is not the perfect thing in a combination
> > with any memory protections (e.g. memory.low).
> > 
> > Imagine the following configuration:
> > workload.slice
> > - workload_gen_1.service   memory.min = 30G
> > - workload_gen_2.service   memory.min = 30G
> > - workload_gen_3.service   memory.min = 30G
> >   ...
> > 
> > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > Once the memory is getting reparented, it's not protected anymore.
> 
> That doesn't sound right.
> 
> A deleted cgroup today exerts no control over its abandoned
> pages. css_reset() will blow out any control settings.

I know. Currently it works in the following way: once cgroup gen_1 is deleted,
its memory is not protected anymore, so eventually it gets evicted and
re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
of course. But long-term we likely wanna re-charge such pages to new cgroups
and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
help and likely will complicate this change. So I'm a bit skeptical here.

Also, in my experience the pagecache is not the main/worst memcg reference
holder (writeback structures are way worse). Pages are relatively large
(in comparison to some slab objects), and it's rare that only one page pins
a separate memcg. And switching to obj_cgroup doesn't completely eliminate
the problem: we just switch from accumulating larger mem_cgroups to accumulating
smaller obj_cgroups.

With all this said, I'm not necessarily opposing the patchset, but it's
necessary to discuss how it fits into the long-term picture.
E.g. if we're going to use the obj_cgroup API for page-sized objects, shouldn't
we split it back into the reparenting and byte-sized accounting parts,
as I initially suggested? And shouldn't we move the reparenting part to
the cgroup core level, so we could use it for other controllers
(e.g. to fix the writeback problem)?

> 
> If you're talking about protection previously inherited by
> workload.slice, that continues to apply as it always has.
> 
> None of this is really accidental. Per definition the workload.slice
> control domain includes workload_gen_1.service. And per definition,
> the workload_gen_1.service domain ceases to exist when you delete it.
> 
> There are no (or shouldn't be any!) semantic changes from the physical
> unlinking from a dead control domain.
> 
> > Also, I'm somewhat concerned about the interaction of the reparenting
> > with the writeback and dirty throttling. How does it work together?
> 
> What interaction specifically?
> 
> When you delete a cgroup that had both the block and the memory
> controller enabled, the control domain of both goes away and it
> becomes subject to whatever control domain is above it (if any).
> 
> A higher control domain in turn takes a recursive view of the subtree,
> see mem_cgroup_wb_stats(), so when control is exerted, it applies
> regardless of how and where pages are physically linked in children.
> 
> It's also already possible to enable e.g. block control only at a very
> high level and memory control down to a lower level. Per design this
> code can live with different domain sizes for memory and block.

I'm totally happy if it's safe, I just don't know this code well enough
to be sure without taking a closer look.
Shakeel Butt March 31, 2021, 12:28 a.m. UTC | #6
On Tue, Mar 30, 2021 at 2:10 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[...]
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
>
> I don't understand how Muchun's patches are supposed to *change* the
> behavior the way you are describing it. This IS today's behavior.
>
> We have hierarchical accounting, and a page that belongs to a leaf
> cgroup will automatically belong to all its parents.
>
> Further, if you delete a cgroup today, the abandoned cache will stay
> physically linked to that cgroup, but that zombie group no longer acts
> as a control structure: it imposes no limit and no protection; the
> pages will be reclaimed as if it WERE linked to the parent.
>
> For all intents and purposes, when you delete a cgroup today, its
> remaining pages ARE dumped onto the parent.
>
> The only difference is that today they pointlessly pin the leaf cgroup
> as a holding vessel - which is then round-robin'd from the parent
> during reclaim in order to pretend that all these child pages actually
> ARE linked to the parent's LRU list.
>
> Remember how we used to have every page physically linked to multiple
> lrus? The leaf cgroup and the root?
>
> All pages always belong to the (virtual) LRU list of all ancestor
> cgroups. The only thing Muchun changes is that they no longer pin a
> cgroup that has no semantical meaning anymore (because it's neither
> visible to the user nor exerts any contol over the pages anymore).
>

Indeed you are right. Even if the physical representation of the tree
has changed, the logical picture remains the same.

[Subconsciously I was sad that we will lose the information about the
origin memcg of the page for debugging purposes but then I thought if
we really need it then we can just add that metadata in the obj_cgroup
object. So, never mind.]

> Maybe I'm missing something that either you or Roman can explain to
> me. But this series looks like a (rare) pure win.
>
> Whether you like the current semantics is a separate discussion IMO.
>
> > Please note that I do agree with the mentioned problem and we do see
> > this issue in our fleet. Internally we have a "memcg mount option"
> > feature which couples a file system with a memcg and all file pages
> > allocated on that file system will be charged to that memcg. Multiple
> > instances (concurrent or subsequent) of the job will use that file
> > system (with a dedicated memcg) without leaving the zombies behind. I
> > am not pushing for this solution as it comes with its own intricacies
> > (e.g. if memcg coupled with a file system has a limit, the oom
> > behavior would be awkward and therefore internally we don't put a
> > limit on such memcgs). Though I want this to be part of discussion.
>
> Right, you disconnect memory from the tasks that are allocating it,
> and so you can't assign culpability when you need to.
>
> OOM is one thing, but there are also CPU cycles and IO bandwidth
> consumed during reclaim.
>

We didn't really have any issue regarding CPU or IO but that might be
due to our unique setup (i.e. no local disk).

> > I think the underlying reasons behind this issue are:
> >
> > 1) Filesystem shared by disjoint jobs.
> > 2) For job dedicated filesystems, the lifetime of the filesystem is
> > different from the lifetime of the job.
>
> There is also the case of deleting a cgroup just to recreate it right
> after for the same job. Many job managers do this on restart right now
> - like systemd, and what we're using in our fleet. This seems
> avoidable by recycling a group for another instance of the same job.

I was bundling the scenario you mentioned with (2) i.e. the filesystem
persists across multiple subsequent instances of the same job.

>
> Sharing is a more difficult discussion. If you access a page that you
> share with another cgroup, it may or may not be subject to your own or
> your buddy's memory limits. The only limit it is guaranteed to be
> subjected to is that of your parent. So One thing I could imagine is,
> instead of having a separate cgroup outside the hierarchy, we would
> reparent live pages the second they are accessed from a foreign
> cgroup. And reparent them until you reach the first common ancestor.
>
> This way, when you mount a filesystem shared by two jobs, you can put
> them into a joint subtree, and the root level of this subtree captures
> all the memory (as well as the reclaim CPU and IO) used by the two
> jobs - the private portions and the shared portions - and doesn't make
> them the liability of jobs in the system that DON'T share the same fs.

I will give more thought on this idea and see where it goes.

>
> But again, this is a useful discussion to have, but I don't quite see
> why it's relevant to Muchun's patches. They're purely an optimization.
>
> So I'd like to clear that up first before going further.

I think we are on the same page i.e. these patches change the physical
representation of the memcg tree but logically it remains the same and
fixes the zombie memcg issue.
Education Directorate March 31, 2021, 3:28 a.m. UTC | #7
On Tue, Mar 30, 2021 at 05:10:04PM -0400, Johannes Weiner wrote:
> On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > >
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > >
> > >         [v17,00/19] The new cgroup slab memory controller
> > >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > >
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > and make page reclaim very inefficient.
> > >
> > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > >
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > of the dying cgroups will not increase if we run the following test script.
> > >
> > > ```bash
> > > #!/bin/bash
> > >
> > > cat /proc/cgroups | grep memory
> > >
> > > cd /sys/fs/cgroup/memory
> > >
> > > for i in range{1..500}
> > > do
> > >         mkdir test
> > >         echo $$ > test/cgroup.procs
> > >         sleep 60 &
> > >         echo $$ > cgroup.procs
> > >         echo `cat test/cgroup.procs` > cgroup.procs
> > >         rmdir test
> > > done
> > >
> > > cat /proc/cgroups | grep memory
> > > ```
> > >
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > 
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
> 
> I don't understand how Muchun's patches are supposed to *change* the
> behavior the way you are describing it. This IS today's behavior.
> 
> We have hierarchical accounting, and a page that belongs to a leaf
> cgroup will automatically belong to all its parents.
> 
> Further, if you delete a cgroup today, the abandoned cache will stay
> physically linked to that cgroup, but that zombie group no longer acts
> as a control structure: it imposes no limit and no protection; the
> pages will be reclaimed as if it WERE linked to the parent.
> 
> For all intents and purposes, when you delete a cgroup today, its
> remaining pages ARE dumped onto the parent.
> 
> The only difference is that today they pointlessly pin the leaf cgroup
> as a holding vessel - which is then round-robin'd from the parent
> during reclaim in order to pretend that all these child pages actually
> ARE linked to the parent's LRU list.
> 
> Remember how we used to have every page physically linked to multiple
> lrus? The leaf cgroup and the root?
> 
> All pages always belong to the (virtual) LRU list of all ancestor
> cgroups. The only thing Muchun changes is that they no longer pin a
> cgroup that has no semantical meaning anymore (because it's neither
> visible to the user nor exerts any contol over the pages anymore).
> 
> Maybe I'm missing something that either you or Roman can explain to
> me. But this series looks like a (rare) pure win.
> 
> Whether you like the current semantics is a separate discussion IMO.
> 
> > Please note that I do agree with the mentioned problem and we do see
> > this issue in our fleet. Internally we have a "memcg mount option"
> > feature which couples a file system with a memcg and all file pages
> > allocated on that file system will be charged to that memcg. Multiple
> > instances (concurrent or subsequent) of the job will use that file
> > system (with a dedicated memcg) without leaving the zombies behind. I
> > am not pushing for this solution as it comes with its own intricacies
> > (e.g. if memcg coupled with a file system has a limit, the oom
> > behavior would be awkward and therefore internally we don't put a
> > limit on such memcgs). Though I want this to be part of discussion.
> 
> Right, you disconnect memory from the tasks that are allocating it,
> and so you can't assign culpability when you need to.
> 
> OOM is one thing, but there are also CPU cycles and IO bandwidth
> consumed during reclaim.
> 
> > I think the underlying reasons behind this issue are:
> > 
> > 1) Filesystem shared by disjoint jobs.
> > 2) For job dedicated filesystems, the lifetime of the filesystem is
> > different from the lifetime of the job.
> 
> There is also the case of deleting a cgroup just to recreate it right
> after for the same job. Many job managers do this on restart right now
> - like systemd, and what we're using in our fleet. This seems
> avoidable by recycling a group for another instance of the same job.
> 
> Sharing is a more difficult discussion. If you access a page that you
> share with another cgroup, it may or may not be subject to your own or
> your buddy's memory limits. The only limit it is guaranteed to be
> subjected to is that of your parent. So One thing I could imagine is,
> instead of having a separate cgroup outside the hierarchy, we would
> reparent live pages the second they are accessed from a foreign
> cgroup. And reparent them until you reach the first common ancestor.
> 
> This way, when you mount a filesystem shared by two jobs, you can put
> them into a joint subtree, and the root level of this subtree captures
> all the memory (as well as the reclaim CPU and IO) used by the two
> jobs - the private portions and the shared portions - and doesn't make
> them the liability of jobs in the system that DON'T share the same fs.
> 
> But again, this is a useful discussion to have, but I don't quite see
> why it's relevant to Muchun's patches. They're purely an optimization.
> 
> So I'd like to clear that up first before going further.
>

I suspect a lot of the issue really is the lack of lockstepping
between a page (unmapped page cache) and the corresponding memcg's
lifecycle. When we delete a memcg, we sort of lose accounting
(depending on the inheriting parent) and ideally we want to bring back
the accounting when the page is reused in a different cgroup (almost
like first touch). I would like to look at the patches and see if they
do solve the issue that leads to zombie cgroups hanging around. In my experience,
the combination of namespaces and number of cgroups (several of which could
be zombies) does not scale well.

Balbir Singh.
Johannes Weiner March 31, 2021, 3:17 p.m. UTC | #8
On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > > >
> > > > > Since the following patchsets applied. All the kernel memory are charged
> > > > > with the new APIs of obj_cgroup.
> > > > >
> > > > >         [v17,00/19] The new cgroup slab memory controller
> > > > >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > >
> > > > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > > > it exists at a larger scale and is causing recurring problems in the real
> > > > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > > > second, third, fourth, ... instance of the same job that was restarted into
> > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > > > and make page reclaim very inefficient.
> > > > >
> > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > > >
> > > > > This patchset aims to make the LRU pages to drop the reference to memory
> > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > > > of the dying cgroups will not increase if we run the following test script.
> > > > >
> > > > > ```bash
> > > > > #!/bin/bash
> > > > >
> > > > > cat /proc/cgroups | grep memory
> > > > >
> > > > > cd /sys/fs/cgroup/memory
> > > > >
> > > > > for i in range{1..500}
> > > > > do
> > > > >         mkdir test
> > > > >         echo $$ > test/cgroup.procs
> > > > >         sleep 60 &
> > > > >         echo $$ > cgroup.procs
> > > > >         echo `cat test/cgroup.procs` > cgroup.procs
> > > > >         rmdir test
> > > > > done
> > > > >
> > > > > cat /proc/cgroups | grep memory
> > > > > ```
> > > > >
> > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > Patch 2-5 are code cleanup and simplification.
> > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > 
> > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > the long running systems, the root memcg will become a dumping ground.
> > > > In addition a job running multiple times on a machine will see
> > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > reparented to the root memcg.
> > > 
> > > I agree, but also the reparenting is not the perfect thing in a combination
> > > with any memory protections (e.g. memory.low).
> > > 
> > > Imagine the following configuration:
> > > workload.slice
> > > - workload_gen_1.service   memory.min = 30G
> > > - workload_gen_2.service   memory.min = 30G
> > > - workload_gen_3.service   memory.min = 30G
> > >   ...
> > > 
> > > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > > Once the memory is getting reparented, it's not protected anymore.
> > 
> > That doesn't sound right.
> > 
> > A deleted cgroup today exerts no control over its abandoned
> > pages. css_reset() will blow out any control settings.
> 
> I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> it's memory is not protected anymore, so eventually it's getting evicted and
> re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> of course. But long-term we likely wanna re-charge such pages to new cgroups
> and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
> help and likely will complicate this change. So I'm a bit skeptical here.

We should be careful with the long-term plans.

The zombie issue is a pretty urgent concern that has caused several
production emergencies now. It needs a fix sooner rather than later.

The long-term plans of how to handle shared/reused data better will
require some time to work out. There are MANY open questions around
recharging to arbitrary foreign cgroup users. Like how to identify
accesses after the page's cgroup has been deleted: Do we need to
annotate every page cache lookup? Do we need to inject minor faults to
recharge mmapped pages? We can't wait for this discussion to complete.

I also think the objcg is helping with that direction rather than
getting in the way because:

- The old charge moving code we have for LRU pages isn't reusable
  anyway. It relies on optimistic locking to work, and may leave
  memory behind in arbitrary and unpredictable ways. After a few
  cycles, objects tend to be spread all over the place.

  The objcg provides a new synchronization scheme that will always
  work because the readside (page/object to memcg lookup) needs to be
  prepared for the memcg to change and/or die at any time.

- There isn't much point in recharging only some of the abandoned
  memory. We've tried per-allocation class reparenting and it didn't
  work out too well. Newly accounted allocations crop up faster than
  we can conjure up tailor-made reparenting schemes for them.

  The objcg provides a generic reference and reassignment scheme that
  can be used by all tracked objects.

  Importantly, once something as central as LRU pages is converted, we
  can require all new allocation tracking to use objcg from the start.

> Also, in my experience the pagecache is not the main/worst memcg reference
> holder (writeback structures are way worse). Pages are relatively large
> (in comparison to some slab objects), and rarely there is only one page pinning
> a separate memcg.

I've seen that exact thing cause zombies to pile up: one or two pages
in the old group, pinned by the next instance of a job. If the job has
a sufficiently large working set, this can pin a LOT of dead
cgroups. Is it the biggest or most problematic source of buildups?
Maybe not. But there is definitely cause to fix it.

LRU pages are also the most difficult to convert because of their rich
interactions. It's a good test of the API. If it works for those
pages, converting everybody else will be much simpler.

And again, as the core memory consumer it sets the tone of how memcg
rmapping is supposed to work for new and existing object types. This
helps align ongoing development.

> And switching to obj_cgroup doesn't completely eliminate the
> problem: we just switch from accumulating larger mem_cgroups to
> accumulating smaller obj_cgroups.

In your own words, the discrepancy between tiny objects pinning large
memcgs is a problem. objcgs are smaller than most objects, so it's not
much different as if an object were simply a few bytes bigger in size.
A memcg on the other hand is vastly bigger than most objects. It's
also composed of many allocations and so causes more fragmentation.

Another issue is that memcgs with abandoned objects need to be visited
by the reclaimer on every single reclaim walk, a hot path. The list of
objcgs on the other hand is only iterated when the cgroup is deleted,
which is not a fast path. It's also rare that parents with many dead
children are deleted at all (system.slice, workload.slice etc.)

So no, I would say for all intents and purposes, it fixes all the
problems we're having with zombie memcgs.

> With all this said, I'm not necessarily opposing the patchset, but it's
> necessary to discuss how it fits into the long-term picture.
> E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't
> we split it back into the reparenting and bytes-sized accounting parts,
> as I initially suggested. And shouldn't we move the reparenting part to
> the cgroup core level, so we could use it for other controllers
> (e.g. to fix the writeback problem).

Yes, I do think we want to generalize it. But I wouldn't say it's a
requirement for these patches, either:

- The byte-sized accounting part is one atomic_t. That's 4 bytes
  pinned unnecessarily - compared to an entire struct memcg right now
  which includes memory accounting, swap accounting, byte accounting,
  and a whole lot of other things likely not used by the stale object.

- The cgroup generalization is a good idea too, but that doesn't
  really change the callsites either. Unless you were thinking of
  renaming, but objcg seems like a good, generic fit for a name to
  describe the linkage between objects to a cgroup.

  The memcg member will need to change into something generic (a
  css_set type mapping), but that can likely be hidden behind
  page_memcg(), objcg_memcg() and similar helpers.

Both of these aspects can be improved incrementally.
Muchun Song April 1, 2021, 4:07 p.m. UTC | #9
On Wed, Mar 31, 2021 at 11:17 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > > > >
> > > > > > Since the following patchsets applied. All the kernel memory are charged
> > > > > > with the new APIs of obj_cgroup.
> > > > > >
> > > > > >         [v17,00/19] The new cgroup slab memory controller
> > > > > >         [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > > >
> > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > > > > it exists at a larger scale and is causing recurring problems in the real
> > > > > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > > > > second, third, fourth, ... instance of the same job that was restarted into
> > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > > > > and make page reclaim very inefficient.
> > > > > >
> > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > > > >
> > > > > > This patchset aims to make the LRU pages to drop the reference to memory
> > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > > > > of the dying cgroups will not increase if we run the following test script.
> > > > > >
> > > > > > ```bash
> > > > > > #!/bin/bash
> > > > > >
> > > > > > cat /proc/cgroups | grep memory
> > > > > >
> > > > > > cd /sys/fs/cgroup/memory
> > > > > >
> > > > > > for i in range{1..500}
> > > > > > do
> > > > > >         mkdir test
> > > > > >         echo $$ > test/cgroup.procs
> > > > > >         sleep 60 &
> > > > > >         echo $$ > cgroup.procs
> > > > > >         echo `cat test/cgroup.procs` > cgroup.procs
> > > > > >         rmdir test
> > > > > > done
> > > > > >
> > > > > > cat /proc/cgroups | grep memory
> > > > > > ```
> > > > > >
> > > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > > Patch 2-5 are code cleanup and simplification.
> > > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > >
> > > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > > the long running systems, the root memcg will become a dumping ground.
> > > > > In addition a job running multiple times on a machine will see
> > > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > > reparented to the root memcg.
> > > >
> > > > I agree, but also the reparenting is not the perfect thing in a combination
> > > > with any memory protections (e.g. memory.low).
> > > >
> > > > Imagine the following configuration:
> > > > workload.slice
> > > > - workload_gen_1.service   memory.min = 30G
> > > > - workload_gen_2.service   memory.min = 30G
> > > > - workload_gen_3.service   memory.min = 30G
> > > >   ...
> > > >
> > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > > > Once the memory is getting reparented, it's not protected anymore.
> > >
> > > That doesn't sound right.
> > >
> > > A deleted cgroup today exerts no control over its abandoned
> > > pages. css_reset() will blow out any control settings.
> >
> > I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> > it's memory is not protected anymore, so eventually it's getting evicted and
> > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> > of course. But long-term we likely wanna re-charge such pages to new cgroups
> > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
> > help and likely will complicate this change. So I'm a bit skeptical here.
>
> We should be careful with the long-term plans.
>
> The zombie issue is a pretty urgent concern that has caused several
> production emergencies now. It needs a fix sooner rather than later.

Thank you very much for clarifying the problem for me. I do agree
with you. This issue should be fixed ASAP. Using objcg is a good
choice, and dying objcgs should not be a problem because an objcg is
so small compared to a memcg.

Thanks.

>
> The long-term plans of how to handle shared/reused data better will
> require some time to work out. There are MANY open questions around
> recharging to arbitrary foreign cgroup users. Like how to identify
> accesses after the page's cgroup has been deleted: Do we need to
> annotate every page cache lookup? Do we need to inject minor faults to
> recharge mmapped pages? We can't wait for this discussion to complete.
>
> I also think the objcg is helping with that direction rather than
> getting in the way because:
>
> - The old charge moving code we have for LRU pages isn't reusable
>   anyway. It relies on optimistic locking to work, and may leave
>   memory behind in arbitrary and unpredictable ways. After a few
>   cycles, objects tend to be spread all over the place.
>
>   The objcg provides a new synchronization scheme that will always
>   work because the readside (page/object to memcg lookup) needs to be
>   prepared for the memcg to change and/or die at any time.
>
> - There isn't much point in recharging only some of the abandoned
>   memory. We've tried per-allocation class reparenting and it didn't
>   work out too well. Newly accounted allocations crop up faster than
>   we can conjure up tailor-made reparenting schemes for them.
>
>   The objcg provides a generic reference and reassignment scheme that
>   can be used by all tracked objects.
>
>   Importantly, once we have a central thing as LRU pages converted, we
>   can require all new allocation tracking to use objcg from the start.
>
> > Also, in my experience the pagecache is not the main/worst memcg reference
> > holder (writeback structures are way worse). Pages are relatively large
> > (in comparison to some slab objects), and rarely there is only one page pinning
> > a separate memcg.
>
> I've seen that exact thing cause zombies to pile up: one or two pages
> in the old group, pinned by the next instance of a job. If the job has
> a sufficiently large working set, this can pin a LOT of dead
> cgroups. Is it the biggest or most problematic source of buildups?
> Maybe not. But there is definitely cause to fix it.
>
> LRU pages are also the most difficult to convert because of their rich
> interactions. It's a good test of the api. If it works for those
> pages, converting everybody else will be much simpler.
>
> And again, as the core memory consumer it sets the tone of how memcg
> rmapping is supposed to work for new and existing object types. This
> helps align ongoing development.
>
> > And switching to obj_cgroup doesn't completely eliminate the
> > problem: we just switch from accumulating larger mem_cgroups to
> > accumulating smaller obj_cgroups.
>
> In your own words, the discrepancy between tiny objects pinning large
> memcgs is a problem. objcgs are smaller than most objects, so it's not
> much different as if an object were simply a few bytes bigger in size.
> A memcg on the other hand is vastly bigger than most objects. It's
> also composed of many allocations and so causes more fragmentation.
>
> Another issue is that memcgs with abandoned objects need to be visited
> by the reclaimer on every single reclaim walk, a hot path. The list of
> objcgs on the other hand is only iterated when the cgroup is deleted,
> which is not a fast path. It's also rare that parents with many dead
> children are deleted at all (system.slice, workload.slice etc.)
>
> So no, I would say for all intents and purposes, it fixes all the
> problems we're having with zombie memcgs.
>
> > With all this said, I'm not necessarily opposing the patchset, but it's
> > necessary to discuss how it fits into the long-term picture.
> > E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't
> > we split it back into the reparenting and bytes-sized accounting parts,
> > as I initially suggested. And shouldn't we move the reparenting part to
> > the cgroup core level, so we could use it for other controllers
> > (e.g. to fix the writeback problem).
>
> Yes, I do think we want to generalize it. But I wouldn't say it's a
> requirement for these patches, either:
>
> - The byte-sized accounting part is one atomic_t. That's 4 bytes
>   pinned unnecessarily - compared to an entire struct memcg right now
>   which includes memory accounting, swap accounting, byte accounting,
>   and a whole lot of other things likely not used by the stale object.
>
> - The cgroup generalization is a good idea too, but that doesn't
>   really change the callsites either. Unless you were thinking of
>   renaming, but objcg seems like a good, generic fit for a name to
>   describe the linkage between objects to a cgroup.
>
>   The memcg member will need to change into something generic (a
>   css_set type mapping), but that can likely be hidden behind
>   page_memcg(), objcg_memcg() and similar helpers.
>
> Both of these aspects can be improved incrementally.
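
For reference on the size argument above: the obj_cgroup structure introduced by
the slab controller rework is roughly the following (approximate layout as of
~v5.9 kernels); the byte-sized remainder Johannes mentions is the single
atomic_t:

```c
/* Approximate layout of the existing obj_cgroup (include/linux/memcontrol.h). */
struct obj_cgroup {
	struct percpu_ref refcnt;	/* references from charged objects/pages */
	struct mem_cgroup *memcg;	/* re-pointed to the parent on offline */
	atomic_t nr_charged_bytes;	/* byte-sized accounting remainder */
	union {
		struct list_head list;	/* entry in the memcg's objcg list, walked on offline */
		struct rcu_head rcu;
	};
};
```

A dying cgroup whose last references are stale objects then pins only this
handful of words, instead of a whole struct mem_cgroup with its per-node and
per-cpu state.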
Shakeel Butt April 1, 2021, 5:15 p.m. UTC | #10
On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
[...]
> > The zombie issue is a pretty urgent concern that has caused several
> > production emergencies now. It needs a fix sooner rather than later.
>
> Thank you very much for clarifying the problem for me. I do agree
> with you. This issue should be fixed ASAP. Using objcg is a good
> choice. Dying objcgs should not be a problem, because an objcg is so
> small compared to a memcg.
>

Just wanted to say out loud that yes this patchset will reduce the
memcg zombie issue but this is not the final destination. We should
continue the discussions on sharing/reusing scenarios.

Muchun, can you please also CC Hugh Dickins and Alex Shi in the next
version of your patchset?
Roman Gushchin April 1, 2021, 9:34 p.m. UTC | #11
On Fri, Apr 02, 2021 at 12:07:42AM +0800, Muchun Song wrote:
> On Wed, Mar 31, 2021 at 11:17 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > > > > >
> > > > > > > [...]
> > > > > >
> > > > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > > > the long running systems, the root memcg will become a dumping ground.
> > > > > > In addition a job running multiple times on a machine will see
> > > > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > > > reparented to the root memcg.
> > > > >
> > > > > I agree, but also the reparenting is not the perfect thing in a combination
> > > > > with any memory protections (e.g. memory.low).
> > > > >
> > > > > Imagine the following configuration:
> > > > > workload.slice
> > > > > - workload_gen_1.service   memory.min = 30G
> > > > > - workload_gen_2.service   memory.min = 30G
> > > > > - workload_gen_3.service   memory.min = 30G
> > > > >   ...
> > > > >
> > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > > > > Once the memory is getting reparented, it's not protected anymore.
> > > >
> > > > That doesn't sound right.
> > > >
> > > > A deleted cgroup today exerts no control over its abandoned
> > > > pages. css_reset() will blow out any control settings.
> > >
> > > I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> > > its memory is not protected anymore, so eventually it's getting evicted and
> > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> > > of course. But long-term we likely wanna re-charge such pages to new cgroups
> > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
> > > help and likely will complicate this change. So I'm a bit skeptical here.
> >
> > We should be careful with the long-term plans.
> >
> > The zombie issue is a pretty urgent concern that has caused several
> > production emergencies now. It needs a fix sooner rather than later.
> 
> Thank you very much for clarifying the problem for me. I do agree
> with you. This issue should be fixed ASAP. Using objcg is a good
> choice. Dying objcgs should not be a problem, because an objcg is so
> small compared to a memcg.

It would be nice to see some data from real-life workloads showing
an improvement in the memory usage/reclaim efficiency and also not showing
any performance degradation.

I agree that it would be nice to fix this issue, but let's not exaggerate
its urgency: it has been here for years and nothing really changed. I'm not
proposing to postpone it for years, but there were valid questions raised
about how it fits into the bigger picture (e.g. other controllers).

Indeed, we added accounting for various kinds of objects, but that is not
directly connected to LRU pages.

Thanks!
Yang Shi April 1, 2021, 10:55 p.m. UTC | #12
On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > > > >
> > > > > > [...]
> > > > >
> > > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > > the long running systems, the root memcg will become a dumping ground.
> > > > > In addition a job running multiple times on a machine will see
> > > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > > reparented to the root memcg.
> > > >
> > > > I agree, but also the reparenting is not the perfect thing in a combination
> > > > with any memory protections (e.g. memory.low).
> > > >
> > > > Imagine the following configuration:
> > > > workload.slice
> > > > - workload_gen_1.service   memory.min = 30G
> > > > - workload_gen_2.service   memory.min = 30G
> > > > - workload_gen_3.service   memory.min = 30G
> > > >   ...
> > > >
> > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > > > Once the memory is getting reparented, it's not protected anymore.
> > >
> > > That doesn't sound right.
> > >
> > > A deleted cgroup today exerts no control over its abandoned
> > > pages. css_reset() will blow out any control settings.
> >
> > I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> > its memory is not protected anymore, so eventually it's getting evicted and
> > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> > of course. But long-term we likely wanna re-charge such pages to new cgroups
> > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
> > help and likely will complicate this change. So I'm a bit skeptical here.
>
> We should be careful with the long-term plans.

Excuse me for a dumb question. I recall we did reparent LRU pages
before (in pre-4.x kernels). I vaguely recall there were some tricky
race conditions during reparenting, so we stopped doing it once the
reclaimer could reclaim from offlined memcgs. My memory may be wrong;
if so, please feel free to correct me. If my memory is right, does
that mean the race conditions are gone? Or did the new obj_cgroup
APIs make life easier?

Thanks,
Yang

>
> The zombie issue is a pretty urgent concern that has caused several
> production emergencies now. It needs a fix sooner rather than later.
>
> The long-term plans of how to handle shared/reused data better will
> require some time to work out. There are MANY open questions around
> recharging to arbitrary foreign cgroup users. Like how to identify
> accesses after the page's cgroup has been deleted: Do we need to
> annotate every page cache lookup? Do we need to inject minor faults to
> recharge mmapped pages? We can't wait for this discussion to complete.
>
> I also think the objcg is helping with that direction rather than
> getting in the way because:
>
> - The old charge moving code we have for LRU pages isn't reusable
>   anyway. It relies on optimistic locking to work, and may leave
>   memory behind in arbitrary and unpredictable ways. After a few
>   cycles, objects tend to be spread all over the place.
>
>   The objcg provides a new synchronization scheme that will always
>   work because the readside (page/object to memcg lookup) needs to be
>   prepared for the memcg to change and/or die at any time.
>
> - There isn't much point in recharging only some of the abandoned
>   memory. We've tried per-allocation class reparenting and it didn't
>   work out too well. Newly accounted allocations crop up faster than
>   we can conjure up tailor-made reparenting schemes for them.
>
>   The objcg provides a generic reference and reassignment scheme that
>   can be used by all tracked objects.
>
>   Importantly, once we have something as central as LRU pages converted, we
>   can require all new allocation tracking to use objcg from the start.
>
> > Also, in my experience the pagecache is not the main/worst memcg reference
> > holder (writeback structures are way worse). Pages are relatively large
> > (in comparison to some slab objects), and rarely there is only one page pinning
> > a separate memcg.
>
> I've seen that exact thing cause zombies to pile up: one or two pages
> in the old group, pinned by the next instance of a job. If the job has
> a sufficiently large working set, this can pin a LOT of dead
> cgroups. Is it the biggest or most problematic source of buildups?
> Maybe not. But there is definitely cause to fix it.
>
> LRU pages are also the most difficult to convert because of their rich
> interactions. It's a good test of the api. If it works for those
> pages, converting everybody else will be much simpler.
>
> And again, as the core memory consumer it sets the tone of how memcg
> rmapping is supposed to work for new and existing object types. This
> helps align ongoing development.
>
> > And switching to obj_cgroup doesn't completely eliminate the
> > problem: we just switch from accumulating larger mem_cgroups to
> > accumulating smaller obj_cgroups.
>
> In your own words, the discrepancy between tiny objects pinning large
> memcgs is a problem. objcgs are smaller than most objects, so it's not
> much different than if an object were simply a few bytes bigger in size.
> A memcg on the other hand is vastly bigger than most objects. It's
> also composed of many allocations and so causes more fragmentation.
>
> Another issue is that memcgs with abandoned objects need to be visited
> by the reclaimer on every single reclaim walk, a hot path. The list of
> objcgs on the other hand is only iterated when the cgroup is deleted,
> which is not a fast path. It's also rare that parents with many dead
> children are deleted at all (system.slice, workload.slice etc.)
>
> So no, I would say for all intents and purposes, it fixes all the
> problems we're having with zombie memcgs.
>
> > With all this said, I'm not necessarily opposing the patchset, but it's
> > necessary to discuss how it fits into the long-term picture.
> > E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't
> > we split it back into the reparenting and bytes-sized accounting parts,
> > as I initially suggested. And shouldn't we move the reparenting part to
> > the cgroup core level, so we could use it for other controllers
> > (e.g. to fix the writeback problem).
>
> Yes, I do think we want to generalize it. But I wouldn't say it's a
> requirement for these patches, either:
>
> - The byte-sized accounting part is one atomic_t. That's 4 bytes
>   pinned unnecessarily - compared to an entire struct memcg right now
>   which includes memory accounting, swap accounting, byte accounting,
>   and a whole lot of other things likely not used by the stale object.
>
> - The cgroup generalization is a good idea too, but that doesn't
>   really change the callsites either. Unless you were thinking of
>   renaming, but objcg seems like a good, generic fit for a name to
>   describe the linkage between objects and a cgroup.
>
>   The memcg member will need to change into something generic (a
>   css_set type mapping), but that can likely be hidden behind
>   page_memcg(), objcg_memcg() and similar helpers.
>
> Both of these aspects can be improved incrementally.
>
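
To make the race-condition question above a bit more concrete: the "readside
needs to be prepared for the memcg to change and/or die at any time" point
quoted above boils down to an RCU-protected lookup. Below is a minimal sketch,
assuming a page_objcg() helper like the one this series introduces (the name is
assumed here; obj_cgroup_memcg() already exists for kmem objects):

```c
#include <linux/memcontrol.h>
#include <linux/rcupdate.h>

/*
 * Sketch only: page_objcg() is an assumed helper from the series.
 * The memcg returned may already be a reparented ancestor, and it can
 * change between two calls, so any lock keyed on it has to be
 * re-validated after it is taken.
 */
static struct mem_cgroup *page_memcg_sketch(struct page *page)
{
	struct obj_cgroup *objcg;

	WARN_ON_ONCE(!rcu_read_lock_held());

	objcg = page_objcg(page);		/* assumed helper */
	if (!objcg)
		return NULL;

	return obj_cgroup_memcg(objcg);		/* may point at an ancestor */
}
```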
Muchun Song April 2, 2021, 3:14 a.m. UTC | #13
On Fri, Apr 2, 2021 at 1:15 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> [...]
> > > The zombie issue is a pretty urgent concern that has caused several
> > > production emergencies now. It needs a fix sooner rather than later.
> >
> > Thank you very much for clarifying the problem for me. I do agree
> > with you. This issue should be fixed ASAP. Using objcg is a good
> > choice. Dying objcgs should not be a problem, because an objcg is so
> > small compared to a memcg.
> >
>
> Just wanted to say out loud that yes this patchset will reduce the
> memcg zombie issue but this is not the final destination. We should
> continue the discussions on sharing/reusing scenarios.

Yeah. Reducing zombie memcgs is not the final destination,
but it is an optimization. OK, the discussions about sharing/reusing
are also welcome.

>
> Muchun, can you please also CC Hugh Dickins and Alex Shi in the next
> version of your patchset?

No problem. I will cc Alex Shi in the next version.
Muchun Song April 2, 2021, 4:03 a.m. UTC | #14
On Fri, Apr 2, 2021 at 6:55 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > > > > > >
> > > > > > > [...]
> > > > > >
> > > > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > > > the long running systems, the root memcg will become a dumping ground.
> > > > > > In addition a job running multiple times on a machine will see
> > > > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > > > reparented to the root memcg.
> > > > >
> > > > > I agree, but also the reparenting is not the perfect thing in a combination
> > > > > with any memory protections (e.g. memory.low).
> > > > >
> > > > > Imagine the following configuration:
> > > > > workload.slice
> > > > > - workload_gen_1.service   memory.min = 30G
> > > > > - workload_gen_2.service   memory.min = 30G
> > > > > - workload_gen_3.service   memory.min = 30G
> > > > >   ...
> > > > >
> > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low.
> > > > > Once the memory is getting reparented, it's not protected anymore.
> > > >
> > > > That doesn't sound right.
> > > >
> > > > A deleted cgroup today exerts no control over its abandoned
> > > > pages. css_reset() will blow out any control settings.
> > >
> > > I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> > > its memory is not protected anymore, so eventually it's getting evicted and
> > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> > > of course. But long-term we likely wanna re-charge such pages to new cgroups
> > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
> > > help and likely will complicate this change. So I'm a bit skeptical here.
> >
> > We should be careful with the long-term plans.
>
> Excuse me for a dumb question. I recall we did reparent LRU pages
> before (in pre-4.x kernels). I vaguely recall there were some tricky
> race conditions during reparenting, so we stopped doing it once the
> reclaimer could reclaim from offlined memcgs. My memory may be wrong;
> if so, please feel free to correct me. If my memory is right,

I looked through the git history. Your memory is right.

> does that mean the race conditions are gone? Or did the new obj_cgroup
> APIs make life easier?

I do not know the history of how many race conditions there
were; Johannes should be very familiar with this. From my
point of view, the obj_cgroup APIs make life easier, because
we do not need to care about the correctness of page counters
and statistics when the LRU pages are reparented. The
reparenting operation itself becomes simple: just splice the
lruvec lists onto the parent's lruvec lists (there is no need
to isolate pages one by one).
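
A highly simplified sketch of that splicing idea follows; the helper name is
hypothetical, and the real patches also have to take the lruvec lock, walk
every node, and handle concurrent isolation:

```c
#include <linux/list.h>
#include <linux/mm_inline.h>
#include <linux/mmzone.h>

/*
 * Hypothetical helper, not the actual patch: move all LRU pages of a
 * dying memcg's lruvec onto the parent's lruvec with one splice per
 * LRU list. The pages keep pointing at their objcg, and the objcg is
 * simply re-pointed to the parent, so no per-page isolation or
 * counter fixup is needed at this point.
 */
static void lruvec_reparent_lists(struct lruvec *child, struct lruvec *parent)
{
	enum lru_list lru;

	for_each_lru(lru)
		list_splice_tail_init(&child->lists[lru], &parent->lists[lru]);
}
```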

Thanks.

Johannes Weiner April 2, 2021, 5:30 p.m. UTC | #15
On Thu, Apr 01, 2021 at 10:15:45AM -0700, Shakeel Butt wrote:
> On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> [...]
> > > The zombie issue is a pretty urgent concern that has caused several
> > > production emergencies now. It needs a fix sooner rather than later.
> >
> > Thank you very much for clarifying the problem for me. I do agree
> > with you. This issue should be fixed ASAP. Using objcg is a good
> > choice. Dying objcgs should not be a problem, because an objcg is so
> > small compared to a memcg.
> >
> 
> Just wanted to say out loud that yes this patchset will reduce the
> memcg zombie issue but this is not the final destination. We should
> continue the discussions on sharing/reusing scenarios.

Absolutely. I think it's an important discussion to have.

My only concern is that Muchun's patches fix a regression, which
admittedly has built over a few years, but is a regression nonetheless
that can leave machines in undesirable states and may require reboots.

The sharing and reuse semantics on the other hand have been the same
since the beginning of cgroups. Users have had some time to structure
their requirements around these semantics :-)

If there were a concrete proposal or an RFC on the table for how
sharing and reusing could be implemented, and that proposal were in
direct conflict with the reparenting patches, I would say let's try
to figure out whether the bug could be fixed in a way that is
compatible with such an imminent change.

But we shouldn't hold up a bug fix to start planning a new feature.