diff mbox series

[v3,4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

Message ID 20200623184515.4132564-5-guro@fb.com (mailing list archive)
State New, archived
Headers show
Series mm: memcg accounting of percpu memory | expand

Commit Message

Roman Gushchin June 23, 2020, 6:45 p.m. UTC
Memory cgroups are using large chunks of percpu memory to store vmstat
data.  Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.

Because the size of memory cgroup internal structures can dramatically
exceed the size of object or page which is pinning it in the memory, it's
not a good idea to simple ignore it.  It actually breaks the isolation
between cgroups.

Let's account the consumed percpu memory to the parent cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
---
 mm/memcontrol.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Comments

Shakeel Butt June 24, 2020, 1:40 a.m. UTC | #1
On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
>
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation

*simply

> between cgroups.
>
> Let's account the consumed percpu memory to the parent cgroup.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Roman Gushchin June 24, 2020, 1:49 a.m. UTC | #2
On Tue, Jun 23, 2020 at 06:40:41PM -0700, Shakeel Butt wrote:
> On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> >
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> 
> *simply
> 
> > between cgroups.
> >
> > Let's account the consumed percpu memory to the parent cgroup.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Hello, Shakeel!

Thank you for the review of this and the previous patchsets!

Btw, I'll be completely offline till the end of the week,
so if any questions will arise around these patchsets,
I'll answer all them on Monday, Jun 29th. Thanks!
Michal Koutný July 29, 2020, 5:10 p.m. UTC | #3
Hello.

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's significant amount.

> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?

Should the charge go the cgroup _that is creating_ the new memcg?

One reason is that there are the throttling mechanisms for memory limits
and those are better exercised when the actor and its memory artefact
are the same cgroup, aren't they?

The second reason is based on the example Dlegation Containment
(Documentation/admin-guide/cgroup-v2.rst)

> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
> 
>   ~~~~~~~~~~~~~ - C0 - C00
>   ~ cgroup    ~      \ C01
>   ~ hierarchy ~
>   ~~~~~~~~~~~~~ - C1 - C10

Thanks to permissions a task running in C0 creating a cgroup in C1 would
deplete C1's supply victimizing tasks inside C1.

Thanks,
Michal
Andrew Morton Aug. 7, 2020, 4:16 a.m. UTC | #4
On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:

> Hello.
> 
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> No doubt about accounting the memory if it's significant amount.
> 
> > Let's account the consumed percpu memory to the parent cgroup.
> Why did you choose charging to the parent of the created cgroup?
> 
> Should the charge go the cgroup _that is creating_ the new memcg?
> 
> One reason is that there are the throttling mechanisms for memory limits
> and those are better exercised when the actor and its memory artefact
> are the same cgroup, aren't they?
> 
> The second reason is based on the example Dlegation Containment
> (Documentation/admin-guide/cgroup-v2.rst)
> 
> > For an example, let's assume cgroups C0 and C1 have been delegated to
> > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > all processes under C0 and C1 belong to U0::
> > 
> >   ~~~~~~~~~~~~~ - C0 - C00
> >   ~ cgroup    ~      \ C01
> >   ~ hierarchy ~
> >   ~~~~~~~~~~~~~ - C1 - C10
> 
> Thanks to permissions a task running in C0 creating a cgroup in C1 would
> deplete C1's supply victimizing tasks inside C1.

These week-old issues appear to be significant.  Roman?  Or someone
else?
Roman Gushchin Aug. 7, 2020, 4:37 a.m. UTC | #5
On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> 
> > Hello.
> > 
> > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > Because the size of memory cgroup internal structures can dramatically
> > > exceed the size of object or page which is pinning it in the memory, it's
> > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > between cgroups.
> > No doubt about accounting the memory if it's significant amount.
> > 
> > > Let's account the consumed percpu memory to the parent cgroup.
> > Why did you choose charging to the parent of the created cgroup?
> > 
> > Should the charge go the cgroup _that is creating_ the new memcg?
> > 
> > One reason is that there are the throttling mechanisms for memory limits
> > and those are better exercised when the actor and its memory artefact
> > are the same cgroup, aren't they?

Hi!

In general, yes. But in this case I think it wouldn't be a good idea:
most often cgroups are created by a centralized daemon (systemd),
which is usually located in the root cgroup. Even if it's located not in
the root cgroup, limiting it's memory will likely affect the whole system,
even if only one specific limit was reached.
If there is a containerized workload, which creates sub-cgroups,
charging it's parent cgroup is perfectly effective.

And the opposite, if we'll charge the cgroup of a process, who created
a cgroup, we'll not cover the most common case: systemd creating
cgroups for all services in the system.

> > 
> > The second reason is based on the example Dlegation Containment
> > (Documentation/admin-guide/cgroup-v2.rst)
> > 
> > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > all processes under C0 and C1 belong to U0::
> > > 
> > >   ~~~~~~~~~~~~~ - C0 - C00
> > >   ~ cgroup    ~      \ C01
> > >   ~ hierarchy ~
> > >   ~~~~~~~~~~~~~ - C1 - C10
> > 
> > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > deplete C1's supply victimizing tasks inside C1.

Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
in completely different cgroup. In this particular case there are tons of other
ways how a task from C00 can hurt C1.

> 
> These week-old issues appear to be significant.  Roman?  Or someone
> else?

Oh, I'm sorry, somehow I've missed this letter.
Thank you for pointing at it!

Thanks!
Roman Gushchin Aug. 10, 2020, 7:33 p.m. UTC | #6
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin wrote:
> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> > 
> > > Hello.
> > > 
> > > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > > Because the size of memory cgroup internal structures can dramatically
> > > > exceed the size of object or page which is pinning it in the memory, it's
> > > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > > between cgroups.
> > > No doubt about accounting the memory if it's significant amount.
> > > 
> > > > Let's account the consumed percpu memory to the parent cgroup.
> > > Why did you choose charging to the parent of the created cgroup?
> > > 
> > > Should the charge go the cgroup _that is creating_ the new memcg?
> > > 
> > > One reason is that there are the throttling mechanisms for memory limits
> > > and those are better exercised when the actor and its memory artefact
> > > are the same cgroup, aren't they?
> 
> Hi!
> 
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
> 
> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
> 
> > > 
> > > The second reason is based on the example Dlegation Containment
> > > (Documentation/admin-guide/cgroup-v2.rst)
> > > 
> > > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > > all processes under C0 and C1 belong to U0::
> > > > 
> > > >   ~~~~~~~~~~~~~ - C0 - C00
> > > >   ~ cgroup    ~      \ C01
> > > >   ~ hierarchy ~
> > > >   ~~~~~~~~~~~~~ - C1 - C10
> > > 
> > > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > > deplete C1's supply victimizing tasks inside C1.
> 
> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in completely different cgroup. In this particular case there are tons of other
> ways how a task from C00 can hurt C1.
> 
> > 
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
> 
> Oh, I'm sorry, somehow I've missed this letter.
> Thank you for pointing at it!

Hello, Michal!

Do you have concerns left here or it's good to go?

It seems that this blocking the whole percpu accounting patchset from being merged,
and I still hope it can be squeezed into 5.9.

Thank you!

Roman
Michal Koutný Aug. 11, 2020, 2:47 p.m. UTC | #7
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's located not in
> the root cgroup, limiting it's memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process
constraint, in the root too)

` root or delegated root
  ` manager-cgroup (systemd, docker, ...)
  ` [aggregation group(s)]
    ` job-group-1
    ` ...
    ` job-group-n

> If there is a containerized workload, which creates sub-cgroups,
> charging it's parent cgroup is perfectly effective.
No dispute about this in either approaches.

> And the opposite, if we'll charge the cgroup of a process, who created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation.
Or more generally, any container/cgroup manager should be charged.
Consider a leak when it wouldn't remove spent cgroups, IMO the effect
(throttling, reclaim) should be exercised on such a culprit.

I don't think the existing workload (job-group-i above) should
unnecessarily suffer when only manager is acting up. Is that different
from your idea?

> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in completely different cgroup. In this particular case there are tons of other
> ways how a task from C00 can hurt C1.
I agree with that.


If I haven't overlooked anything, this should be first case when
cgroup-related structures are accounted (please correct me).
So this is setting a precendent, if others show useful to be accounted
in the future too. I'm thinking about cpu_cgroup_css_alloc() that can
also allocate a lot (with big CPU count). The current approach would lead
situations where matching cpu and memory csses needn't to exist and that
would need special handling.


> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
Despite my concerns, I don't think this is fundamental and can't be
changed later so it doesn't prevent the inclusion in 5.9 RC1.

Regards,
Michal
Johannes Weiner Aug. 11, 2020, 3:27 p.m. UTC | #8
On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
> 
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
> 
> Let's account the consumed percpu memory to the parent cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This makes sense, and the accounting is in line with how we track and
distribute child creation quotas (cgroup.max.descendants and
cgroup.max.depth) up the cgroup tree.

I have one minor comment that isn't a dealbreaker for me:

> @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
>  
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
>  
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +					       GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>  
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
>  
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
>  
> @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
>  
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();

The disconnect between 1) requesting accounting and 2) which cgroup to
charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
handgrenade, because accounting to the current task is almost
guaranteed to be wrong if the use_memcg() annotation were to get lost
in a refactor or not make it to a new caller of the function.

The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
used elsewhere. And pretending it's an independent interface would be
overengineering. But how about the following in mem_cgroup_alloc() and
alloc_mem_cgroup_per_node_info() to document that caller relationship:

	/* We charge the parent cgroup, never the current task */
	WARN_ON_ONCE(!current->active_memcg);
Roman Gushchin Aug. 11, 2020, 4:55 p.m. UTC | #9
On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutny wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's located not in
> > the root cgroup, limiting it's memory will likely affect the whole system,
> > even if only one specific limit was reached.
> The generic scheme would be (assuming the no internal process
> constraint, in the root too)
> 
> ` root or delegated root
>   ` manager-cgroup (systemd, docker, ...)
>   ` [aggregation group(s)]
>     ` job-group-1
>     ` ...
>     ` job-group-n
> 
> > If there is a containerized workload, which creates sub-cgroups,
> > charging it's parent cgroup is perfectly effective.
> No dispute about this in either approaches.
> 
> > And the opposite, if we'll charge the cgroup of a process, who created
> > a cgroup, we'll not cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak when it wouldn't remove spent cgroups, IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.

As I said, there are 2 problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
   to "communicate" a memory shortage to the daemon.
   What we really want is to prevent creating a huge number of cgroups
   (including dying cgroups) in some specific sub-tree(s).
   OOMing the daemon or returning -ENOMEM to some random syscalls
   will not help us to reach the goal and likely will bring a bad
   experience to a user.

In a generic case I don't see how we can charge the cgroup which
creates cgroups without solving these problems first.

And if there is a very special case where we have to limit it,
we can just add an additional layer:

` root or delegated root
   ` manager-parent-cgroup-with-a-limit
     ` manager-cgroup (systemd, docker, ...)
   ` [aggregation group(s)]
     ` job-group-1
     ` ...
     ` job-group-n

> 
> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only manager is acting up. Is that different
> from your idea?
> 
> > Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> > in completely different cgroup. In this particular case there are tons of other
> > ways how a task from C00 can hurt C1.
> I agree with that.
> 
> 
> If I haven't overlooked anything, this should be first case when
> cgroup-related structures are accounted (please correct me).
> So this is setting a precendent, if others show useful to be accounted
> in the future too.

Right.

> I'm thinking about cpu_cgroup_css_alloc() that can
> also allocate a lot (with big CPU count). The current approach would lead
> situations where matching cpu and memory csses needn't to exist and that
> would need special handling.

I'd definitely charge the parent cgroup in all similar cases.

> 
> 
> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant.  Roman?  Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental and can't be
> changed later so it doesn't prevent the inclusion in 5.9 RC1.

Thank you!
Roman Gushchin Aug. 11, 2020, 5:06 p.m. UTC | #10
On Tue, Aug 11, 2020 at 11:27:37AM -0400, Johannes Weiner wrote:
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> > 
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> > 
> > Let's account the consumed percpu memory to the parent cgroup.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you!

> 
> This makes sense, and the accounting is in line with how we track and
> distribute child creation quotas (cgroup.max.descendants and
> cgroup.max.depth) up the cgroup tree.
> 
> I have one minor comment that isn't a dealbreaker for me:
> 
> > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> >  	if (!pn)
> >  		return 1;
> >  
> > -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_local) {
> >  		kfree(pn);
> >  		return 1;
> >  	}
> >  
> > -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> > +					       GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_cpu) {
> >  		free_percpu(pn->lruvec_stat_local);
> >  		kfree(pn);
> > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >  		goto fail;
> >  	}
> >  
> > -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_local)
> >  		goto fail;
> >  
> > -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_percpu)
> >  		goto fail;
> >  
> > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >  	struct mem_cgroup *memcg;
> >  	long error = -ENOMEM;
> >  
> > +	memalloc_use_memcg(parent);
> >  	memcg = mem_cgroup_alloc();
> > +	memalloc_unuse_memcg();
> 
> The disconnect between 1) requesting accounting and 2) which cgroup to
> charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
> handgrenade, because accounting to the current task is almost
> guaranteed to be wrong if the use_memcg() annotation were to get lost
> in a refactor or not make it to a new caller of the function.
> 
> The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
> used elsewhere. And pretending it's an independent interface would be
> overengineering. But how about the following in mem_cgroup_alloc() and
> alloc_mem_cgroup_per_node_info() to document that caller relationship:
> 
> 	/* We charge the parent cgroup, never the current task */
> 	WARN_ON_ONCE(!current->active_memcg);

I have nothing against.

Andrew, can you, please, squash the following diff into the patch?

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 130093bdf74b..e25f2db7e61c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
        if (!pn)
                return 1;
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
                                                 GFP_KERNEL_ACCOUNT);
        if (!pn->lruvec_stat_local) {
@@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
                goto fail;
        }
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
                                                GFP_KERNEL_ACCOUNT);
        if (!memcg->vmstats_local)
Michal Koutný Aug. 11, 2020, 6:32 p.m. UTC | #11
On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most
common case).

> 2) OOMing or failing some random memory allocations is a bad way
>    to "communicate" a memory shortage to the daemon.
>    What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...

>    (including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?

>    OOMing the daemon or returning -ENOMEM to some random syscalls
>    will not help us to reach the goal and likely will bring a bad
>    experience to a user.
If we reach the situation when memory for cgroup operations is tight,
it'll disappoint the user either way.
My premise is that a running workload is more valuable than the
accompanying manager.

> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various
kernel threads doing deferred work. Here it's promoted to user processes
managing cgroups.

> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
> 
> ` root or delegated root
>    ` manager-parent-cgroup-with-a-limit
>      ` manager-cgroup (systemd, docker, ...)
>    ` [aggregation group(s)]
>      ` job-group-1
>      ` ...
>      ` job-group-n
If the charge goes to the parent of created cgroup (job-cgroup-i here),
then the layer adds nothing. Am I missing something?

> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate the controllers on the unified hierarchy, which is
fine IMO.) Then the order of enabling controllers on a subtree (e.g.
cpu,memory vs memory,cpu) by the manager would yield different charging.
This seems wrong^W confusing to me. 


Thanks,
Michal
Roman Gushchin Aug. 11, 2020, 7:32 p.m. UTC | #12
On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, it's better.

> 
> > 2) OOMing or failing some random memory allocations is a bad way
> >    to "communicate" a memory shortage to the daemon.
> >    What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups, it can't limit
the number of dying cgroups.

> 
> >    (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine, which has no memory
pressure for some time (e.g. is idle or running a workload with small
working set). Periodically running services creating a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is getting consumed by dying cgroups and their percpu data.
Getting rid of it and reclaiming all memory is not always possible
(percpu is getting fragmented relatively easy) and is time consuming.

If we'll set memory.high on system.slice, it will create an artificial
memory pressure once we're getting close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If don't take the indirectly pinned memory
into account, likely we won't apply enough memory pressure.

If we'll limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create trashing in init.slice, followed by OOM.
I struggle to see how it makes the life of a user better?

> 
> >    OOMing the daemon or returning -ENOMEM to some random syscalls
> >    will not help us to reach the goal and likely will bring a bad
> >    experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources and help to get rid of accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well
as some other random operations from being performed). Most likely
the only way to "fix" this for a user will be to reboot the machine.

> 
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
> 
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> > 
> > ` root or delegated root
> >    ` manager-parent-cgroup-with-a-limit
> >      ` manager-cgroup (systemd, docker, ...)
> >    ` [aggregation group(s)]
> >      ` job-group-1
> >      ` ...
> >      ` job-group-n
> If the charge goes to the parent of created cgroup (job-cgroup-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.

> 
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!
Michal Koutný Aug. 12, 2020, 4:28 p.m. UTC | #13
On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin <guro@fb.com> wrote:
> If we'll limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create trashing in init.slice, followed by OOM.
> I struggle to see how it makes the life of a user better?
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources and help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live
memcgs.

When the issue are the dying memcgs (c), they were effectively released
by their creator but are pinned by whatever remained after their life
(LRU pages, slab->obj_cgroups). Since these pins were created _from
within_ such a child (c), they're most readily removable by reclaiming
(hierarchically) close to c. (It'd be achievable by limiting the lowest
common ancestor of manager and its product (typically root) but that is
more clumsy and less effective.)

This is the reasoning that justifies the remote charge.

Thanks!
Michal
Naresh Kamboju Aug. 13, 2020, 9:16 a.m. UTC | #14
The kernel warnings  were noticed on linux next 20200813 while booting
on arm64, arm, x86_64 and i386.

metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e6d113aca646fb6a92b237340109237fd7a9c770
  git describe: next-20200813
  make_kernelversion: 5.8.0
  kernel-config:
https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 130093bdf74b..e25f2db7e61c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>         if (!pn)
>                 return 1;
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);
> +
>         pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
>                                                  GFP_KERNEL_ACCOUNT);
>         if (!pn->lruvec_stat_local) {
> @@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                 goto fail;
>         }
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);

[    0.217404] ------------[ cut here ]------------
[    0.218038] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5226
mem_cgroup_css_alloc+0x680/0x740
[    0.219188] Modules linked in:
[    0.219597] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-next-20200813 #1
[    0.220187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.221190] EIP: mem_cgroup_css_alloc+0x680/0x740
[    0.222190] Code: d6 17 5d ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74
26 00 b8 58 39 6a d1 e8 fe 94 55 ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d
74 26 00 <0f> 0b e9 01 fa ff ff 8d b4 26 00 00 00 00 66 90 bb f4 ff ff
ff ba
[    0.223188] EAX: 00000000 EBX: d13666c0 ECX: 00000cc0 EDX: 0000ffff
[    0.224187] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.225188] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.226190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.227195] Call Trace:
[    0.227882]  ? _cond_resched+0x17/0x30
[    0.228195]  cgroup_init_subsys+0x66/0x12a
[    0.229193]  cgroup_init+0x118/0x323
[    0.230194]  start_kernel+0x43c/0x47d
[    0.231193]  i386_start_kernel+0x48/0x4a
[    0.232194]  startup_32_smp+0x164/0x168
[    0.233195] ---[ end trace dfcf9be7b40caf05 ]---
[    0.2342#
08] ------------[ cut here ]------------
[    0.235192] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5141
mem_cgroup_css_alloc+0x718/0x740
[    0.236187] Modules linked in:
[    0.236590] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W
  5.8.0-next-20200813 #1
[    0.237190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.238194] EIP: mem_cgroup_css_alloc+0x718/0x740
[    0.239191] Code: 48 ff e9 7c fd ff ff 8d 76 00 a1 b0 14 40 d1 e9
53 fc ff ff 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00
00 00 00 <0f> 0b e9 df f9 ff ff 90 89 f8 e8 29 0c 5c ff 89 f2 b8 10 f4
40 d1
[    0.240190] EAX: 00000000 EBX: f4c0c800 ECX: 00000000 EDX: d0eab660
[    0.241189] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.242189] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.243190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.244188] Call Trace:
[    0.245191]  ? _cond_resched+0x17/0x30
[    0.245686]  cgroup_init_subsys+0x66/0x12a
[    0.246189]  cgroup_init+0x118/0x323
[    0.246654]  start_kernel+0x43c/0x47d
[    0.247189]  i386_start_kernel+0x48/0x4a
[    0.247697]  startup_32_smp+0x164/0x168
[    0.248188] ---[ end trace dfcf9be7b40caf06 ]---
[    0.248990] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[    0.249187] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0


Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20200813/testrun/3061112/suite/linux-log-parser/test/check-kernel-warning-1665815/log
Stephen Rothwell Aug. 13, 2020, 11:27 p.m. UTC | #15
Hi Naresh,

On Thu, 13 Aug 2020 14:46:51 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> The kernel warnings  were noticed on linux next 20200813 while booting
> on arm64, arm, x86_64 and i386.
> 
> metadata:
>   git branch: master
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   git commit: e6d113aca646fb6a92b237340109237fd7a9c770
>   git describe: next-20200813
>   make_kernelversion: 5.8.0
>   kernel-config:
> https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

Actually in Linus' tree.

It has been fixed today.  Thanks for reporting.
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ec2498a6cf..9f14b91700d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5069,13 +5069,15 @@  static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
+						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
 		kfree(pn);
 		return 1;
 	}
 
-	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
+	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
+					       GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_cpu) {
 		free_percpu(pn->lruvec_stat_local);
 		kfree(pn);
@@ -5149,11 +5151,13 @@  static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
 		goto fail;
 
-	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
+	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
+						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
@@ -5202,7 +5206,9 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 
+	memalloc_use_memcg(parent);
 	memcg = mem_cgroup_alloc();
+	memalloc_unuse_memcg();
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);