memcg: expose root cgroup's memory.stat
diff mbox series

Message ID 20200508170630.94406-1-shakeelb@google.com
State New
Headers show
Series
  • memcg: expose root cgroup's memory.stat
Related show

Commit Message

Shakeel Butt May 8, 2020, 5:06 p.m. UTC
One way to measure the efficiency of memory reclaim is to look at the
ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
not updated consistently at the system level and the ratio of these are
not very meaningful. The pgsteal and pgscan are updated for only global
reclaim while pgrefill gets updated for global as well as cgroup
reclaim.

Please note that this difference is only for system level vmstats. The
cgroup stats returned by memory.stat are actually consistent. The
cgroup's pgsteal contains number of reclaimed pages for global as well
as cgroup reclaim. So, one way to get the system level stats is to get
these stats from root's memory.stat, so, expose memory.stat for the root
cgroup.

	from Johannes Weiner:
	There are subtle differences between /proc/vmstat and
	memory.stat, and cgroup-aware code that wants to watch the full
	hierarchy currently has to know about these intricacies and
	translate semantics back and forth.

	Generally having the fully recursive memory.stat at the root
	level could help a broader range of usecases.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Johannes Weiner May 8, 2020, 9:44 p.m. UTC | #1
On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> One way to measure the efficiency of memory reclaim is to look at the
> ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> not updated consistently at the system level and the ratio of these are
> not very meaningful. The pgsteal and pgscan are updated for only global
> reclaim while pgrefill gets updated for global as well as cgroup
> reclaim.
> 
> Please note that this difference is only for system level vmstats. The
> cgroup stats returned by memory.stat are actually consistent. The
> cgroup's pgsteal contains number of reclaimed pages for global as well
> as cgroup reclaim. So, one way to get the system level stats is to get
> these stats from root's memory.stat, so, expose memory.stat for the root
> cgroup.
> 
> 	from Johannes Weiner:
> 	There are subtle differences between /proc/vmstat and
> 	memory.stat, and cgroup-aware code that wants to watch the full
> 	hierarchy currently has to know about these intricacies and
> 	translate semantics back and forth.
> 
> 	Generally having the fully recursive memory.stat at the root
> 	level could help a broader range of usecases.

The changelog begs the question why we don't just "fix" the
system-level stats. It may be useful to include the conclusions from
that discussion, and why there is value in keeping the stats this way.

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Shakeel Butt May 9, 2020, 2:06 p.m. UTC | #2
On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > One way to measure the efficiency of memory reclaim is to look at the
> > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > not updated consistently at the system level and the ratio of these are
> > not very meaningful. The pgsteal and pgscan are updated for only global
> > reclaim while pgrefill gets updated for global as well as cgroup
> > reclaim.
> >
> > Please note that this difference is only for system level vmstats. The
> > cgroup stats returned by memory.stat are actually consistent. The
> > cgroup's pgsteal contains number of reclaimed pages for global as well
> > as cgroup reclaim. So, one way to get the system level stats is to get
> > these stats from root's memory.stat, so, expose memory.stat for the root
> > cgroup.
> >
> >       from Johannes Weiner:
> >       There are subtle differences between /proc/vmstat and
> >       memory.stat, and cgroup-aware code that wants to watch the full
> >       hierarchy currently has to know about these intricacies and
> >       translate semantics back and forth.
> >
> >       Generally having the fully recursive memory.stat at the root
> >       level could help a broader range of usecases.
>
> The changelog begs the question why we don't just "fix" the
> system-level stats. It may be useful to include the conclusions from
> that discussion, and why there is value in keeping the stats this way.
>

Right. Andrew, can you please add the following para to the changelog?

Why not fix the stats by including both the global and cgroup reclaim
activity instead of exposing root cgroup's memory.stat? The reason is
the benefit of having metrics exposing the activity that happens
purely due to machine capacity rather than localized activity that
happens due to the limits throughout the cgroup tree. Additionally
there are userspace tools like sysstat(sar) which reads these stats to
inform about the system level reclaim activity. So, we should not
break such use-cases.

> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks a lot.
Yafang Shao May 9, 2020, 2:43 p.m. UTC | #3
On Sat, May 9, 2020 at 10:06 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > One way to measure the efficiency of memory reclaim is to look at the
> > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > not updated consistently at the system level and the ratio of these are
> > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > reclaim while pgrefill gets updated for global as well as cgroup
> > > reclaim.
> > >
> > > Please note that this difference is only for system level vmstats. The
> > > cgroup stats returned by memory.stat are actually consistent. The
> > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > cgroup.
> > >
> > >       from Johannes Weiner:
> > >       There are subtle differences between /proc/vmstat and
> > >       memory.stat, and cgroup-aware code that wants to watch the full
> > >       hierarchy currently has to know about these intricacies and
> > >       translate semantics back and forth.
> > >
> > >       Generally having the fully recursive memory.stat at the root
> > >       level could help a broader range of usecases.
> >
> > The changelog begs the question why we don't just "fix" the
> > system-level stats. It may be useful to include the conclusions from
> > that discussion, and why there is value in keeping the stats this way.
> >
>
> Right. Andrew, can you please add the following para to the changelog?
>
> Why not fix the stats by including both the global and cgroup reclaim
> activity instead of exposing root cgroup's memory.stat? The reason is
> the benefit of having metrics exposing the activity that happens
> purely due to machine capacity rather than localized activity that
> happens due to the limits throughout the cgroup tree. Additionally
> there are userspace tools like sysstat(sar) which reads these stats to
> inform about the system level reclaim activity. So, we should not
> break such use-cases.
>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thanks a lot.
Michal Hocko May 15, 2020, 8:29 a.m. UTC | #4
On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > One way to measure the efficiency of memory reclaim is to look at the
> > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > not updated consistently at the system level and the ratio of these are
> > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > reclaim while pgrefill gets updated for global as well as cgroup
> > > reclaim.
> > >
> > > Please note that this difference is only for system level vmstats. The
> > > cgroup stats returned by memory.stat are actually consistent. The
> > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > cgroup.
> > >
> > >       from Johannes Weiner:
> > >       There are subtle differences between /proc/vmstat and
> > >       memory.stat, and cgroup-aware code that wants to watch the full
> > >       hierarchy currently has to know about these intricacies and
> > >       translate semantics back and forth.

Can we have those subtle differences documented please?

> > >
> > >       Generally having the fully recursive memory.stat at the root
> > >       level could help a broader range of usecases.
> >
> > The changelog begs the question why we don't just "fix" the
> > system-level stats. It may be useful to include the conclusions from
> > that discussion, and why there is value in keeping the stats this way.
> >
> 
> Right. Andrew, can you please add the following para to the changelog?
> 
> Why not fix the stats by including both the global and cgroup reclaim
> activity instead of exposing root cgroup's memory.stat? The reason is
> the benefit of having metrics exposing the activity that happens
> purely due to machine capacity rather than localized activity that
> happens due to the limits throughout the cgroup tree. Additionally
> there are userspace tools like sysstat(sar) which reads these stats to
> inform about the system level reclaim activity. So, we should not
> break such use-cases.
> 
> > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Thanks a lot.

I was quite surprised that the patch is so simple TBH. For some reason
I've still had memories that we do not account for root memcg (likely
because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
are slightly different here. I have started looking at different stat
counters because they are not really all the same. E.g. 
- mem_cgroup_charge_statistics accounts for each memcg
- memcg_charge_kernel_stack relies on pages being associated with a
  memcg and that in turn relies on __memcg_kmem_charge_page which bails
  out on root memcg
- memcg_charge_slab (NR_SLAB*) skips over root memcg as well
- __mod_lruvec_page_state relies on page->mem_cgroup as well but this
  one is ok for paths which go through commit_charge path.

That being said we should really double check which stats are
accounted properly. At least MEMCG_KERNEL_STACK_KB won't unless I am
misreading the code.

I do not mind displaying the root's memcg stats but a) a closer look had
to be done for each counter and b) a clarification of differences from
the global vmstat counters would be really handy.
Johannes Weiner May 15, 2020, 1:24 p.m. UTC | #5
On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > not updated consistently at the system level and the ratio of these are
> > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > reclaim.
> > > >
> > > > Please note that this difference is only for system level vmstats. The
> > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > cgroup.
> > > >
> > > >       from Johannes Weiner:
> > > >       There are subtle differences between /proc/vmstat and
> > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > >       hierarchy currently has to know about these intricacies and
> > > >       translate semantics back and forth.
> 
> Can we have those subtle differences documented please?
> 
> > > >
> > > >       Generally having the fully recursive memory.stat at the root
> > > >       level could help a broader range of usecases.
> > >
> > > The changelog begs the question why we don't just "fix" the
> > > system-level stats. It may be useful to include the conclusions from
> > > that discussion, and why there is value in keeping the stats this way.
> > >
> > 
> > Right. Andrew, can you please add the following para to the changelog?
> > 
> > Why not fix the stats by including both the global and cgroup reclaim
> > activity instead of exposing root cgroup's memory.stat? The reason is
> > the benefit of having metrics exposing the activity that happens
> > purely due to machine capacity rather than localized activity that
> > happens due to the limits throughout the cgroup tree. Additionally
> > there are userspace tools like sysstat(sar) which reads these stats to
> > inform about the system level reclaim activity. So, we should not
> > break such use-cases.
> > 
> > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Thanks a lot.
> 
> I was quite surprised that the patch is so simple TBH. For some reason
> I've still had memories that we do not account for root memcg (likely
> because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> are slightly different here.

Yep, we skip the page_counter for root, but keep in mind that cgroup1
*does* have a root-level memory.stat, so (for the most part) we've
been keeping consumer stats for the root level the whole time.

> counters because they are not really all the same. E.g.
> - mem_cgroup_charge_statistics accounts for each memcg

Yep, that's heritage from cgroup1.

> - memcg_charge_kernel_stack relies on pages being associated with a
>   memcg and that in turn relies on __memcg_kmem_charge_page which bails
>   out on root memcg

You're right. It should only bypass the page_counter, but still set
page->mem_cgroup = root_mem_cgroup, just like user pages.

This counter also doesn't get exported on cgroup1, so it would indeed
be a new bug. It needs to be fixed before this patch here.

> - memcg_charge_slab (NR_SLAB*) skips over root memcg as well

Same thing with these two.
Shakeel Butt May 15, 2020, 1:44 p.m. UTC | #6
On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > not updated consistently at the system level and the ratio of these are
> > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > reclaim.
> > > > >
> > > > > Please note that this difference is only for system level vmstats. The
> > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > cgroup.
> > > > >
> > > > >       from Johannes Weiner:
> > > > >       There are subtle differences between /proc/vmstat and
> > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > >       hierarchy currently has to know about these intricacies and
> > > > >       translate semantics back and forth.
> >
> > Can we have those subtle differences documented please?
> >
> > > > >
> > > > >       Generally having the fully recursive memory.stat at the root
> > > > >       level could help a broader range of usecases.
> > > >
> > > > The changelog begs the question why we don't just "fix" the
> > > > system-level stats. It may be useful to include the conclusions from
> > > > that discussion, and why there is value in keeping the stats this way.
> > > >
> > >
> > > Right. Andrew, can you please add the following para to the changelog?
> > >
> > > Why not fix the stats by including both the global and cgroup reclaim
> > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > the benefit of having metrics exposing the activity that happens
> > > purely due to machine capacity rather than localized activity that
> > > happens due to the limits throughout the cgroup tree. Additionally
> > > there are userspace tools like sysstat(sar) which reads these stats to
> > > inform about the system level reclaim activity. So, we should not
> > > break such use-cases.
> > >
> > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > Thanks a lot.
> >
> > I was quite surprised that the patch is so simple TBH. For some reason
> > I've still had memories that we do not account for root memcg (likely
> > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > are slightly different here.
>
> Yep, we skip the page_counter for root, but keep in mind that cgroup1
> *does* have a root-level memory.stat, so (for the most part) we've
> been keeping consumer stats for the root level the whole time.
>
> > counters because they are not really all the same. E.g.
> > - mem_cgroup_charge_statistics accounts for each memcg
>
> Yep, that's heritage from cgroup1.
>
> > - memcg_charge_kernel_stack relies on pages being associated with a
> >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> >   out on root memcg
>
> You're right. It should only bypass the page_counter, but still set
> page->mem_cgroup = root_mem_cgroup, just like user pages.
>
> This counter also doesn't get exported on cgroup1, so it would indeed
> be a new bug. It needs to be fixed before this patch here.
>
> > - memcg_charge_slab (NR_SLAB*) skips over root memcg as well
>
> Same thing with these two.

Yes, we skip page_counter for root but not the stats. I will go over
all the stats and make sure we are not skipping the stats for root.
Roman Gushchin May 15, 2020, 3 p.m. UTC | #7
On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > > not updated consistently at the system level and the ratio of these are
> > > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > reclaim.
> > > > > >
> > > > > > Please note that this difference is only for system level vmstats. The
> > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > > cgroup.
> > > > > >
> > > > > >       from Johannes Weiner:
> > > > > >       There are subtle differences between /proc/vmstat and
> > > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > > >       hierarchy currently has to know about these intricacies and
> > > > > >       translate semantics back and forth.
> > >
> > > Can we have those subtle differences documented please?
> > >
> > > > > >
> > > > > >       Generally having the fully recursive memory.stat at the root
> > > > > >       level could help a broader range of usecases.
> > > > >
> > > > > The changelog begs the question why we don't just "fix" the
> > > > > system-level stats. It may be useful to include the conclusions from
> > > > > that discussion, and why there is value in keeping the stats this way.
> > > > >
> > > >
> > > > Right. Andrew, can you please add the following para to the changelog?
> > > >
> > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > the benefit of having metrics exposing the activity that happens
> > > > purely due to machine capacity rather than localized activity that
> > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > there are userspace tools like sysstat(sar) which reads these stats to
> > > > inform about the system level reclaim activity. So, we should not
> > > > break such use-cases.
> > > >
> > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > >
> > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >
> > > > Thanks a lot.
> > >
> > > I was quite surprised that the patch is so simple TBH. For some reason
> > > I've still had memories that we do not account for root memcg (likely
> > > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > > are slightly different here.
> >
> > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > *does* have a root-level memory.stat, so (for the most part) we've
> > been keeping consumer stats for the root level the whole time.
> >
> > > counters because they are not really all the same. E.g.
> > > - mem_cgroup_charge_statistics accounts for each memcg
> >
> > Yep, that's heritage from cgroup1.
> >
> > > - memcg_charge_kernel_stack relies on pages being associated with a
> > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > >   out on root memcg
> >
> > You're right. It should only bypass the page_counter, but still set
> > page->mem_cgroup = root_mem_cgroup, just like user pages.

What about kernel threads? We consider them belonging to the root memory
cgroup. Should their memory consumption being considered in root-level stats?

I'm not sure we really want it, but I guess we need to document how
kernel threads are handled.

> >
> > This counter also doesn't get exported on cgroup1, so it would indeed
> > be a new bug. It needs to be fixed before this patch here.
> >
> > > - memcg_charge_slab (NR_SLAB*) skips over root memcg as well
> >
> > Same thing with these two.
> 
> Yes, we skip page_counter for root but not the stats. I will go over
> all the stats and make sure we are not skipping the stats for root.
Shakeel Butt May 15, 2020, 5:49 p.m. UTC | #8
On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > >
> > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > > > not updated consistently at the system level and the ratio of these are
> > > > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > > reclaim.
> > > > > > >
> > > > > > > Please note that this difference is only for system level vmstats. The
> > > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > > > cgroup.
> > > > > > >
> > > > > > >       from Johannes Weiner:
> > > > > > >       There are subtle differences between /proc/vmstat and
> > > > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > > > >       hierarchy currently has to know about these intricacies and
> > > > > > >       translate semantics back and forth.
> > > >
> > > > Can we have those subtle differences documented please?
> > > >
> > > > > > >
> > > > > > >       Generally having the fully recursive memory.stat at the root
> > > > > > >       level could help a broader range of usecases.
> > > > > >
> > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > system-level stats. It may be useful to include the conclusions from
> > > > > > that discussion, and why there is value in keeping the stats this way.
> > > > > >
> > > > >
> > > > > Right. Andrew, can you please add the following para to the changelog?
> > > > >
> > > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > > the benefit of having metrics exposing the activity that happens
> > > > > purely due to machine capacity rather than localized activity that
> > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > > there are userspace tools like sysstat(sar) which reads these stats to
> > > > > inform about the system level reclaim activity. So, we should not
> > > > > break such use-cases.
> > > > >
> > > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > >
> > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > >
> > > > > Thanks a lot.
> > > >
> > > > I was quite surprised that the patch is so simple TBH. For some reason
> > > > I've still had memories that we do not account for root memcg (likely
> > > > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > > > are slightly different here.
> > >
> > > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > > *does* have a root-level memory.stat, so (for the most part) we've
> > > been keeping consumer stats for the root level the whole time.
> > >
> > > > counters because they are not really all the same. E.g.
> > > > - mem_cgroup_charge_statistics accounts for each memcg
> > >
> > > Yep, that's heritage from cgroup1.
> > >
> > > > - memcg_charge_kernel_stack relies on pages being associated with a
> > > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > > >   out on root memcg
> > >
> > > You're right. It should only bypass the page_counter, but still set
> > > page->mem_cgroup = root_mem_cgroup, just like user pages.
>
> What about kernel threads? We consider them belonging to the root memory
> cgroup. Should their memory consumption being considered in root-level stats?
>
> I'm not sure we really want it, but I guess we need to document how
> kernel threads are handled.
>

What will be the cons of updating root-level stats for kthreads?

> > >
> > > This counter also doesn't get exported on cgroup1, so it would indeed
> > > be a new bug. It needs to be fixed before this patch here.
> > >
> > > > - memcg_charge_slab (NR_SLAB*) skips over root memcg as well
> > >
> > > Same thing with these two.
> >
> > Yes, we skip page_counter for root but not the stats. I will go over
> > all the stats and make sure we are not skipping the stats for root.
Johannes Weiner May 15, 2020, 6:09 p.m. UTC | #9
On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@fb.com> wrote:
> > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > You're right. It should only bypass the page_counter, but still set
> > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> >
> > What about kernel threads? We consider them belonging to the root memory
> > cgroup. Should their memory consumption being considered in root-level stats?
> >
> > I'm not sure we really want it, but I guess we need to document how
> > kernel threads are handled.
> 
> What will be the cons of updating root-level stats for kthreads?

Should kernel threads be doing GFP_ACCOUNT allocations without
memalloc_use_memcg()? GFP_ACCOUNT implies that the memory consumption
can be significant and should be attributed to userspace activity.

If the kernel thread has no userspace entity to blame, it seems to
imply the same thing as a !GFP_ACCOUNT allocation: shared public
infrastructure, not interesting to account to any specific cgroup.

I'm not sure if we have such allocations right now. But IMO we should
not account anything from kthreads, or interrupts for that matter,
/unless/ there is a specific active_memcg that was set by the kthread
or the interrupt.
Roman Gushchin May 15, 2020, 6:09 p.m. UTC | #10
On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > > > > not updated consistently at the system level and the ratio of these are
> > > > > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > > > reclaim.
> > > > > > > >
> > > > > > > > Please note that this difference is only for system level vmstats. The
> > > > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > > > > cgroup.
> > > > > > > >
> > > > > > > >       from Johannes Weiner:
> > > > > > > >       There are subtle differences between /proc/vmstat and
> > > > > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > > > > >       hierarchy currently has to know about these intricacies and
> > > > > > > >       translate semantics back and forth.
> > > > >
> > > > > Can we have those subtle differences documented please?
> > > > >
> > > > > > > >
> > > > > > > >       Generally having the fully recursive memory.stat at the root
> > > > > > > >       level could help a broader range of usecases.
> > > > > > >
> > > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > > system-level stats. It may be useful to include the conclusions from
> > > > > > > that discussion, and why there is value in keeping the stats this way.
> > > > > > >
> > > > > >
> > > > > > Right. Andrew, can you please add the following para to the changelog?
> > > > > >
> > > > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > > > the benefit of having metrics exposing the activity that happens
> > > > > > purely due to machine capacity rather than localized activity that
> > > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > > > there are userspace tools like sysstat(sar) which reads these stats to
> > > > > > inform about the system level reclaim activity. So, we should not
> > > > > > break such use-cases.
> > > > > >
> > > > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > >
> > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > >
> > > > > > Thanks a lot.
> > > > >
> > > > > I was quite surprised that the patch is so simple TBH. For some reason
> > > > > I've still had memories that we do not account for root memcg (likely
> > > > > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > > > > are slightly different here.
> > > >
> > > > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > > > *does* have a root-level memory.stat, so (for the most part) we've
> > > > been keeping consumer stats for the root level the whole time.
> > > >
> > > > > counters because they are not really all the same. E.g.
> > > > > - mem_cgroup_charge_statistics accounts for each memcg
> > > >
> > > > Yep, that's heritage from cgroup1.
> > > >
> > > > > - memcg_charge_kernel_stack relies on pages being associated with a
> > > > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > > > >   out on root memcg
> > > >
> > > > You're right. It should only bypass the page_counter, but still set
> > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> >
> > What about kernel threads? We consider them belonging to the root memory
> > cgroup. Should their memory consumption being considered in root-level stats?
> >
> > I'm not sure we really want it, but I guess we need to document how
> > kernel threads are handled.
> >
> 
> What will be the cons of updating root-level stats for kthreads?

It makes total sense for stacks, but not much for the slab memory.
Because it's really "some part of the total slab memory, which is
accounted on the memcg level". And it comes with some performance
overhead.

I'm not really opposing any solution, just saying we need to document
what's included into this statistics and what not.

Thanks!
Shakeel Butt May 16, 2020, 12:06 a.m. UTC | #11
On Fri, May 15, 2020 at 11:09 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > > > > > not updated consistently at the system level and the ratio of these are
> > > > > > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > > > > reclaim.
> > > > > > > > >
> > > > > > > > > Please note that this difference is only for system level vmstats. The
> > > > > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > > > > > cgroup.
> > > > > > > > >
> > > > > > > > >       from Johannes Weiner:
> > > > > > > > >       There are subtle differences between /proc/vmstat and
> > > > > > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > > > > > >       hierarchy currently has to know about these intricacies and
> > > > > > > > >       translate semantics back and forth.
> > > > > >
> > > > > > Can we have those subtle differences documented please?
> > > > > >
> > > > > > > > >
> > > > > > > > >       Generally having the fully recursive memory.stat at the root
> > > > > > > > >       level could help a broader range of usecases.
> > > > > > > >
> > > > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > > > system-level stats. It may be useful to include the conclusions from
> > > > > > > > that discussion, and why there is value in keeping the stats this way.
> > > > > > > >
> > > > > > >
> > > > > > > Right. Andrew, can you please add the following para to the changelog?
> > > > > > >
> > > > > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > > > > the benefit of having metrics exposing the activity that happens
> > > > > > > purely due to machine capacity rather than localized activity that
> > > > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > > > > there are userspace tools like sysstat(sar) which reads these stats to
> > > > > > > inform about the system level reclaim activity. So, we should not
> > > > > > > break such use-cases.
> > > > > > >
> > > > > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > >
> > > > > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > >
> > > > > > > Thanks a lot.
> > > > > >
> > > > > > I was quite surprised that the patch is so simple TBH. For some reason
> > > > > > I've still had memories that we do not account for root memcg (likely
> > > > > > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > > > > > are slightly different here.
> > > > >
> > > > > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > > > > *does* have a root-level memory.stat, so (for the most part) we've
> > > > > been keeping consumer stats for the root level the whole time.
> > > > >
> > > > > > counters because they are not really all the same. E.g.
> > > > > > - mem_cgroup_charge_statistics accounts for each memcg
> > > > >
> > > > > Yep, that's heritage from cgroup1.
> > > > >
> > > > > > - memcg_charge_kernel_stack relies on pages being associated with a
> > > > > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > > > > >   out on root memcg
> > > > >
> > > > > You're right. It should only bypass the page_counter, but still set
> > > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> > >
> > > What about kernel threads? We consider them belonging to the root memory
> > > cgroup. Should their memory consumption being considered in root-level stats?
> > >
> > > I'm not sure we really want it, but I guess we need to document how
> > > kernel threads are handled.
> > >
> >
> > What will be the cons of updating root-level stats for kthreads?
>
> It makes total sense for stacks, but not much for the slab memory.
> Because it's really "some part of the total slab memory, which is
> accounted on the memcg level". And it comes with some performance
> overhead.
>
> I'm not really opposing any solution, just saying we need to document
> what's included into this statistics and what not.
>

Yes, I agree. I will explore which stats it makes sense and for which
it does not.
Shakeel Butt May 16, 2020, 12:13 a.m. UTC | #12
On Fri, May 15, 2020 at 11:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@fb.com> wrote:
> > > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > You're right. It should only bypass the page_counter, but still set
> > > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> > >
> > > What about kernel threads? We consider them belonging to the root memory
> > > cgroup. Should their memory consumption being considered in root-level stats?
> > >
> > > I'm not sure we really want it, but I guess we need to document how
> > > kernel threads are handled.
> >
> > What will be the cons of updating root-level stats for kthreads?
>
> Should kernel threads be doing GFP_ACCOUNT allocations without
> memalloc_use_memcg()? GFP_ACCOUNT implies that the memory consumption
> can be significant and should be attributed to userspace activity.
>
> If the kernel thread has no userspace entity to blame, it seems to
> imply the same thing as a !GFP_ACCOUNT allocation: shared public
> infrastructure, not interesting to account to any specific cgroup.
>
> I'm not sure if we have such allocations right now. But IMO we should
> not account anything from kthreads, or interrupts for that matter,
> /unless/ there is a specific active_memcg that was set by the kthread
> or the interrupt.

I totally agree with you but I think your response is about memory
charging in IRQ/kthread context, a topic of a parallel patch from
Zefan Li at [1].

Here we are discussing stats update for kthreads e.g. should we update
root memcg's MEMCG_KERNEL_STACK_KB stat when we allocate stack for
kernel threads?

[1] http://lkml.kernel.org/r/3a721f62-5a66-8bc5-247b-5c8b7c51c555@huawei.com
Chris Down May 16, 2020, 1:42 a.m. UTC | #13
Hey Shakeel,

Shakeel Butt writes:
>One way to measure the efficiency of memory reclaim is to look at the
>ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
>not updated consistently at the system level and the ratio of these are
>not very meaningful. The pgsteal and pgscan are updated for only global
>reclaim while pgrefill gets updated for global as well as cgroup
>reclaim.
>
>Please note that this difference is only for system level vmstats. The
>cgroup stats returned by memory.stat are actually consistent. The
>cgroup's pgsteal contains number of reclaimed pages for global as well
>as cgroup reclaim. So, one way to get the system level stats is to get
>these stats from root's memory.stat, so, expose memory.stat for the root
>cgroup.
>
>	from Johannes Weiner:
>	There are subtle differences between /proc/vmstat and
>	memory.stat, and cgroup-aware code that wants to watch the full
>	hierarchy currently has to know about these intricacies and
>	translate semantics back and forth.
>
>	Generally having the fully recursive memory.stat at the root
>	level could help a broader range of usecases.
>
>Signed-off-by: Shakeel Butt <shakeelb@google.com>
>Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

The patch looks great, thanks. To that extent you can add my ack:

Acked-by: Chris Down <chris@chrisdown.name>

One concern about the API now exposed, though: to a new cgroup v2 user this 
looks fairly dodgy as a sole stat file (except for cgroup.stat) at the root. If 
I used cgroup v2 for the first time and only saw memory.stat and cgroup.stat 
there, but for some reason io.stat and cpu.stat are not available at the root 
but are available everywhere else, I think my first thought would be that the 
cgroup v2 developers must have been on some strong stuff when they came up with 
this ;-)

Even if they're only really duplicating information available elsewhere right 
now, have you considered exposing the rest of the stat files as well so that 
the API maintains a bit more consistency? As a bonus, that also means userspace 
applications can parse in the same way from the root down.

Patch
diff mbox series

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05dcb72314b5..c300d52c07a5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6230,7 +6230,6 @@  static struct cftype memory_files[] = {
 	},
 	{
 		.name = "stat",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
 	{