mbox series

[RFC,v2,00/18] Use obj_cgroup APIs to charge the LRU pages

Message ID 20210409122959.82264-1-songmuchun@bytedance.com (mailing list archive)
Headers show
Series Use obj_cgroup APIs to charge the LRU pages | expand

Message

Muchun Song April 9, 2021, 12:29 p.m. UTC
Since the following patchsets applied. All the kernel memory are charged
with the new APIs of obj_cgroup.

	[v17,00/19] The new cgroup slab memory controller
	[v5,0/7] Use obj_cgroup APIs to charge kmem pages

But user memory allocations (LRU pages) pinning memcgs for a long time -
it exists at a larger scale and is causing recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted into
a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.

We can convert LRU pages and most other raw memcg pins to the objcg direction
to fix this problem, and then the LRU pages will not pin the memcgs.

This patchset aims to make the LRU pages to drop the reference to memory
cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
of the dying cgroups will not increase if we run the following test script.

```bash
#!/bin/bash

cat /proc/cgroups | grep memory

cd /sys/fs/cgroup/memory

for i in range{1..500}
do
	mkdir test
	echo $$ > test/cgroup.procs
	sleep 60 &
	echo $$ > cgroup.procs
	echo `cat test/cgroup.procs` > cgroup.procs
	rmdir test
done

cat /proc/cgroups | grep memory
```

Patch 1 aims to fix page charging in page replacement.
Patch 2-5 are code cleanup and simplification.
Patch 6-18 convert LRU pages pin to the objcg direction.

Any comments are welcome. Thanks.

Changlogs in RFC v2:
  1. Collect Acked-by tags by Johannes. Thanks.
  2. Rework lruvec_holds_page_lru_lock() suggested by Johannes. Thanks.
  3. Fix move_pages_to_lru().

Muchun Song (18):
  mm: memcontrol: fix page charging in page replacement
  mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
  mm: memcontrol: simplify lruvec_holds_page_lru_lock
  mm: memcontrol: simplify the logic of objcg pinning memcg
  mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
  mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
  mm: memcontrol: make lruvec lock safe when the LRU pages reparented
  mm: vmscan: remove noinline_for_stack
  mm: vmscan: rework move_pages_to_lru()
  mm: thp: introduce lock/unlock_split_queue{_irqsave}()
  mm: thp: make deferred split queue lock safe when the LRU pages
    reparented
  mm: memcontrol: make all the callers of page_memcg() safe
  mm: memcontrol: introduce memcg_reparent_ops
  mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
  mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
  mm: lru: add VM_BUG_ON_PAGE to lru maintenance function
  mm: lru: use lruvec lock to serialize memcg changes

 Documentation/admin-guide/cgroup-v1/memory.rst |   2 +-
 fs/buffer.c                                    |  13 +-
 fs/fs-writeback.c                              |  23 +-
 fs/iomap/buffered-io.c                         |   4 +-
 include/linux/memcontrol.h                     | 216 +++++----
 include/linux/mm_inline.h                      |   6 +
 mm/compaction.c                                |  40 +-
 mm/filemap.c                                   |   2 +-
 mm/huge_memory.c                               | 171 ++++++-
 mm/memcontrol.c                                | 622 ++++++++++++++++++-------
 mm/migrate.c                                   |   4 +
 mm/page-writeback.c                            |  24 +-
 mm/page_io.c                                   |   5 +-
 mm/rmap.c                                      |  14 +-
 mm/swap.c                                      |  48 +-
 mm/vmscan.c                                    |  58 ++-
 mm/workingset.c                                |   2 +-
 17 files changed, 841 insertions(+), 413 deletions(-)

Comments

Johannes Weiner April 9, 2021, 4 p.m. UTC | #1
On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

If you haven't done so yet, please make sure to explicitly test with
all three config combinations, just because the dummy abstractions for
memcg disabled or compiled out tend to be paper thin and don't always
behave the way you might expect when you do more complicated things.

Something like

boot
echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
echo 1 >/proc/sys/vm/compact_memory

should exercise this new function in a couple of important scenarios.
Johannes Weiner April 9, 2021, 4:56 p.m. UTC | #2
On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer.
> 
> Therefore, the infrastructure of objcg no longer only serves
> CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> can reuse it to charge pages.

Just an observation on this:

We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
point. It used to be an optional feature, but nowadays it's not
configurable anymore, and always on unless slob is configured.

We've also added more than just slab accounting to it, like kernel
stack pages, and it all gets disabled on slob configs just because it
doesn't support slab object tracking.

We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.

But that's beyond the scope of your patch series, so I'm also okay
with this patch here.

> We know that the LRU pages are not accounted at the root level. But the
> page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> dose not have an object cgroup. If we use obj_cgroup APIs to charge the
> LRU pages, we should set the page->memcg_data to a root object cgroup. So
> we also allocate an object cgroup for the root_mem_cgroup and introduce
> root_obj_cgroup to cache its value just like root_mem_cgroup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Overall, the patch makes sense to me. A few comments below:

> @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  	return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
>  }
>  
> -#ifdef CONFIG_MEMCG_KMEM
>  extern spinlock_t css_set_lock;
>  
> +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> +{
  > +	return objcg == root_obj_cgroup;
> +}

This function, and by extension root_obj_cgroup, aren't used by this
patch. Please move them to the patch that adds users for them.

> @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>  	percpu_ref_exit(ref);
>  	kfree_rcu(objcg, rcu);
>  }
> +#else
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&css_set_lock, flags);
> +	list_del(&objcg->list);
> +	spin_unlock_irqrestore(&css_set_lock, flags);
> +
> +	percpu_ref_exit(ref);
> +	kfree_rcu(objcg, rcu);
> +}
> +#endif

Having two separate functions for if and else is good when the else
branch is a completely empty dummy function. In this case you end up
duplicating code, so it's better to have just one function and put the
ifdef around the nr_charged_bytes handling in it.

> @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
>  	return objcg;
>  }
>  
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> -				  struct mem_cgroup *parent)
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
>  {
>  	struct obj_cgroup *objcg, *iter;
> +	struct mem_cgroup *parent;
> +
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
>  
>  	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
>  
> @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
>  	percpu_ref_kill(&objcg->refcnt);
>  }
>  
> +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg)
> +		return -ENOMEM;
> +
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +
> +	return 0;
> +}
> +
> +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> +{
> +	if (unlikely(memcg->objcg))
> +		memcg_reparent_objcgs(memcg);
> +}

It's confusing to have a 'free' function not free the object it's
called on.

But rather than search for a fitting name, I think it might be better
to just fold both of these short functions into their only callsites.

Also, since memcg->objcg is reparented, and the pointer cleared, on
offlining, when could this ever be non-NULL? This deserves a comment.

> @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	struct obj_cgroup *objcg;
>  	int memcg_id;
>  
>  	if (cgroup_memory_nokmem)
> @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
>  	if (memcg_id < 0)
>  		return memcg_id;
>  
> -	objcg = obj_cgroup_alloc();
> -	if (!objcg) {
> -		memcg_free_cache_id(memcg_id);
> -		return -ENOMEM;
> -	}
> -	objcg->memcg = memcg;
> -	rcu_assign_pointer(memcg->objcg, objcg);
> -
>  	static_branch_enable(&memcg_kmem_enabled_key);
>  
>  	memcg->kmemcg_id = memcg_id;
> @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>  	if (!parent)
>  		parent = root_mem_cgroup;
>  
> -	memcg_reparent_objcgs(memcg, parent);
> +	memcg_reparent_objcgs(memcg);

Since the objcg is no longer tied to kmem, this should move to
mem_cgroup_css_offline() instead.
Roman Gushchin April 10, 2021, 1:29 a.m. UTC | #3
On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
> 
> 	[v17,00/19] The new cgroup slab memory controller
> 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> 
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
> 
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
> 
> ```bash
> #!/bin/bash
> 
> cat /proc/cgroups | grep memory
> 
> cd /sys/fs/cgroup/memory
> 
> for i in range{1..500}
> do
> 	mkdir test
> 	echo $$ > test/cgroup.procs
> 	sleep 60 &
> 	echo $$ > cgroup.procs
> 	echo `cat test/cgroup.procs` > cgroup.procs
> 	rmdir test
> done
> 
> cat /proc/cgroups | grep memory
> ```
> 
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-18 convert LRU pages pin to the objcg direction.
> 
> Any comments are welcome. Thanks.

Indeed the problem exists for a long time and it would be nice to fix it.
However I'm against merging the patchset in the current form (there are some
nice fixes/clean-ups, which can/must be applied independently). Let me explain
my concerns:

Back to the new slab controller discussion obj_cgroup was suggested by Johannes
as a union of two concepts:
1) reparenting (basically an auto-pointer to a memcg in c++ terms)
2) byte-sized accounting

I was initially against this union because I anticipated that the reparenting
part will be useful separately. And the time told it was true.

I still think obj_cgroup API must be significantly reworked before being
applied outside of the kmem area: reparenting part must be separated
and moved to the cgroup core level to be used not only in the memcg
context but also for other controllers, which are facing similar problems.
Spilling obj_cgroup API in the current form over all memcg code will
make it more complicated and will delay it, given the amount of changes
and the number of potential code conflicts.

I'm working on the generalization of obj_cgroup API (as described above)
and expect to have some patches next week.

Thanks!
Muchun Song April 11, 2021, 5:37 a.m. UTC | #4
On Sat, Apr 10, 2021 at 12:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> > We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> > do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> > lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> > page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> > of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> > to simplify the code. We can have a single definition for this function
> > that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> > CONFIG_MEMCG.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Looks good to me.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for your review.

>
> If you haven't done so yet, please make sure to explicitly test with
> all three config combinations, just because the dummy abstractions for
> memcg disabled or compiled out tend to be paper thin and don't always
> behave the way you might expect when you do more complicated things.

I have tested. There is no problem. Thanks :-)


>
> Something like
>
> boot
> echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
> echo 1 >/proc/sys/vm/compact_memory
>
> should exercise this new function in a couple of important scenarios.
Muchun Song April 11, 2021, 6:24 a.m. UTC | #5
On Sat, Apr 10, 2021 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> > Because memory allocations pinning memcgs for a long time - it exists
> > at a larger scale and is causing recurring problems in the real world:
> > page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted
> > into a new cgroup every time. Unreclaimable dying cgroups pile up,
> > waste memory, and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg
> > direction to fix this problem, and then the page->memcg will always
> > point to an object cgroup pointer.
> >
> > Therefore, the infrastructure of objcg no longer only serves
> > CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> > objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> > can reuse it to charge pages.
>
> Just an observation on this:
>
> We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
> point. It used to be an optional feature, but nowadays it's not
> configurable anymore, and always on unless slob is configured.
>
> We've also added more than just slab accounting to it, like kernel
> stack pages, and it all gets disabled on slob configs just because it
> doesn't support slab object tracking.
>
> We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
> places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.
>
> But that's beyond the scope of your patch series, so I'm also okay
> with this patch here.
>
> > We know that the LRU pages are not accounted at the root level. But the
> > page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> > of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> > dose not have an object cgroup. If we use obj_cgroup APIs to charge the
> > LRU pages, we should set the page->memcg_data to a root object cgroup. So
> > we also allocate an object cgroup for the root_mem_cgroup and introduce
> > root_obj_cgroup to cache its value just like root_mem_cgroup.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Overall, the patch makes sense to me. A few comments below:
>
> > @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> >       return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
> >  }
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> >  extern spinlock_t css_set_lock;
> >
> > +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> > +{
>   > +   return objcg == root_obj_cgroup;
> > +}
>
> This function, and by extension root_obj_cgroup, aren't used by this
> patch. Please move them to the patch that adds users for them.

OK. Will do.

>
> > @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> >       percpu_ref_exit(ref);
> >       kfree_rcu(objcg, rcu);
> >  }
> > +#else
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > +     struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&css_set_lock, flags);
> > +     list_del(&objcg->list);
> > +     spin_unlock_irqrestore(&css_set_lock, flags);
> > +
> > +     percpu_ref_exit(ref);
> > +     kfree_rcu(objcg, rcu);
> > +}
> > +#endif
>
> Having two separate functions for if and else is good when the else
> branch is a completely empty dummy function. In this case you end up
> duplicating code, so it's better to have just one function and put the
> ifdef around the nr_charged_bytes handling in it.

Make sense. I will rework the code here.

>
> > @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> >       return objcg;
> >  }
> >
> > -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > -                               struct mem_cgroup *parent)
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> >  {
> >       struct obj_cgroup *objcg, *iter;
> > +     struct mem_cgroup *parent;
> > +
> > +     parent = parent_mem_cgroup(memcg);
> > +     if (!parent)
> > +             parent = root_mem_cgroup;
> >
> >       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> >
> > @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> >       percpu_ref_kill(&objcg->refcnt);
> >  }
> >
> > +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> > +{
> > +     struct obj_cgroup *objcg;
> > +
> > +     objcg = obj_cgroup_alloc();
> > +     if (!objcg)
> > +             return -ENOMEM;
> > +
> > +     objcg->memcg = memcg;
> > +     rcu_assign_pointer(memcg->objcg, objcg);
> > +
> > +     return 0;
> > +}
> > +
> > +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> > +{
> > +     if (unlikely(memcg->objcg))
> > +             memcg_reparent_objcgs(memcg);
> > +}
>
> It's confusing to have a 'free' function not free the object it's
> called on.
>
> But rather than search for a fitting name, I think it might be better
> to just fold both of these short functions into their only callsites.

OK. Will do.

>
> Also, since memcg->objcg is reparented, and the pointer cleared, on
> offlining, when could this ever be non-NULL? This deserves a comment.

css_alloc() failed, offlining didn't happen. In this case, memcg->objcg
could be non-NULL (Just like memcg_free_kmem() dose). I will move
memcg_obj_cgroup_alloc() to the mem_cgroup_css_online() so that
we do not need memcg_obj_cgroup_free.

>
> > @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> >  #ifdef CONFIG_MEMCG_KMEM
> >  static int memcg_online_kmem(struct mem_cgroup *memcg)
> >  {
> > -     struct obj_cgroup *objcg;
> >       int memcg_id;
> >
> >       if (cgroup_memory_nokmem)
> > @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
> >       if (memcg_id < 0)
> >               return memcg_id;
> >
> > -     objcg = obj_cgroup_alloc();
> > -     if (!objcg) {
> > -             memcg_free_cache_id(memcg_id);
> > -             return -ENOMEM;
> > -     }
> > -     objcg->memcg = memcg;
> > -     rcu_assign_pointer(memcg->objcg, objcg);
> > -
> >       static_branch_enable(&memcg_kmem_enabled_key);
> >
> >       memcg->kmemcg_id = memcg_id;
> > @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
> >       if (!parent)
> >               parent = root_mem_cgroup;
> >
> > -     memcg_reparent_objcgs(memcg, parent);
> > +     memcg_reparent_objcgs(memcg);
>
> Since the objcg is no longer tied to kmem, this should move to
> mem_cgroup_css_offline() instead.

LGTM, will do.

Thanks for your all suggestions.
Johannes Weiner April 12, 2021, 5:14 p.m. UTC | #6
On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> > 
> > 	[v17,00/19] The new cgroup slab memory controller
> > 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > 
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> > 
> > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> > 
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> > 
> > ```bash
> > #!/bin/bash
> > 
> > cat /proc/cgroups | grep memory
> > 
> > cd /sys/fs/cgroup/memory
> > 
> > for i in range{1..500}
> > do
> > 	mkdir test
> > 	echo $$ > test/cgroup.procs
> > 	sleep 60 &
> > 	echo $$ > cgroup.procs
> > 	echo `cat test/cgroup.procs` > cgroup.procs
> > 	rmdir test
> > done
> > 
> > cat /proc/cgroups | grep memory
> > ```
> > 
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-18 convert LRU pages pin to the objcg direction.
> > 
> > Any comments are welcome. Thanks.
> 
> Indeed the problem exists for a long time and it would be nice to fix it.
> However I'm against merging the patchset in the current form (there are some
> nice fixes/clean-ups, which can/must be applied independently). Let me explain
> my concerns:
> 
> Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> as a union of two concepts:
> 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> 2) byte-sized accounting
> 
> I was initially against this union because I anticipated that the reparenting
> part will be useful separately. And the time told it was true.

"The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
level is really good."

https://lore.kernel.org/lkml/20191025200020.GA8325@castle.DHCP.thefacebook.com/

If you recall, the main concern was how the byte charging interface
was added to the existing page charging interface, instead of being
layered on top of it. I suggested to do that and, since there was no
other user for the indirection pointer, just include it in the API.

It made sense at the time, and you seemed to agree. But I also agree
it makes sense to factor it out now that more users are materializing.

> I still think obj_cgroup API must be significantly reworked before being
> applied outside of the kmem area: reparenting part must be separated
> and moved to the cgroup core level to be used not only in the memcg
> context but also for other controllers, which are facing similar problems.
> Spilling obj_cgroup API in the current form over all memcg code will
> make it more complicated and will delay it, given the amount of changes
> and the number of potential code conflicts.
> 
> I'm working on the generalization of obj_cgroup API (as described above)
> and expect to have some patches next week.

Yeah, splitting the byte charging API from the reference API and
making the latter cgroup-generic makes sense. I'm looking forward to
your patches.

And yes, the conflicts between that work and Muchun's patches would be
quite large. However, most of them would come down to renames, since
the access rules and refcounting sites will remain the same, so it
shouldn't be too bad to rebase Muchun's patches on yours. And we can
continue reviewing his patches for correctness for now.

Thanks
Roman Gushchin April 12, 2021, 5:45 p.m. UTC | #7
On Mon, Apr 12, 2021 at 01:14:57PM -0400, Johannes Weiner wrote:
> On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > > 
> > > 	[v17,00/19] The new cgroup slab memory controller
> > > 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > 
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > and make page reclaim very inefficient.
> > > 
> > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > 
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > of the dying cgroups will not increase if we run the following test script.
> > > 
> > > ```bash
> > > #!/bin/bash
> > > 
> > > cat /proc/cgroups | grep memory
> > > 
> > > cd /sys/fs/cgroup/memory
> > > 
> > > for i in range{1..500}
> > > do
> > > 	mkdir test
> > > 	echo $$ > test/cgroup.procs
> > > 	sleep 60 &
> > > 	echo $$ > cgroup.procs
> > > 	echo `cat test/cgroup.procs` > cgroup.procs
> > > 	rmdir test
> > > done
> > > 
> > > cat /proc/cgroups | grep memory
> > > ```
> > > 
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-18 convert LRU pages pin to the objcg direction.
> > > 
> > > Any comments are welcome. Thanks.
> > 
> > Indeed the problem exists for a long time and it would be nice to fix it.
> > However I'm against merging the patchset in the current form (there are some
> > nice fixes/clean-ups, which can/must be applied independently). Let me explain
> > my concerns:
> > 
> > Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> > as a union of two concepts:
> > 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> > 2) byte-sized accounting
> > 
> > I was initially against this union because I anticipated that the reparenting
> > part will be useful separately. And the time told it was true.
> 
> "The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
> level is really good."
> 
> https://lore.kernel.org/lkml/20191025200020.GA8325@castle.DHCP.thefacebook.com/
> 
> If you recall, the main concern was how the byte charging interface
> was added to the existing page charging interface, instead of being
> layered on top of it. I suggested to do that and, since there was no
> other user for the indirection pointer, just include it in the API.
> 
> It made sense at the time, and you seemed to agree. But I also agree
> it makes sense to factor it out now that more users are materializing.

Agreed.

> 
> > I still think obj_cgroup API must be significantly reworked before being
> > applied outside of the kmem area: reparenting part must be separated
> > and moved to the cgroup core level to be used not only in the memcg
> > context but also for other controllers, which are facing similar problems.
> > Spilling obj_cgroup API in the current form over all memcg code will
> > make it more complicated and will delay it, given the amount of changes
> > and the number of potential code conflicts.
> > 
> > I'm working on the generalization of obj_cgroup API (as described above)
> > and expect to have some patches next week.
> 
> Yeah, splitting the byte charging API from the reference API and
> making the latter cgroup-generic makes sense. I'm looking forward to
> your patches.
> 
> And yes, the conflicts between that work and Muchun's patches would be
> quite large. However, most of them would come down to renames, since
> the access rules and refcounting sites will remain the same, so it
> shouldn't be too bad to rebase Muchun's patches on yours. And we can
> continue reviewing his patches for correctness for now.

Sounds good to me!

Thanks