[RFC] memcg: expose children memory usage for root

Message ID 20240722225306.1494878-1-shakeel.butt@linux.dev (mailing list archive)
State New

Commit Message

Shakeel Butt July 22, 2024, 10:53 p.m. UTC
The Linux kernel does not expose memory.current on the root memcg, so
applications have to traverse all the top-level memcgs to calculate the
total memory charged in the system. This is more expensive than a single
read (a directory traversal plus multiple opens and reads) and is racy
on a busy machine. As the kernel already has the needed information,
i.e. root's memory.current, why not expose that?

However, root's memory.current will have different semantics than the
non-root memory.current, as the kernel skips charging for the root, so
maybe it is better to have a differently named interface for the root:
something like memory.children_usage, only for the root memcg.

That still leaves the question of why the kernel does not expose
memory.current for the root. The historical reason was that memcg
charging was expensive, so users were given a way to bypass it by
running in the root. However, do we still want to keep this exception
today? What is stopping us from charging the root memcg as well? Of
course the root will not have limits, but the allocations will go
through memcg charging and then the memory.current of root and
non-root will have the same semantics.

This is an RFC to start a discussion on memcg charging for root.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
 mm/memcontrol.c                         | 5 +++++
 2 files changed, 11 insertions(+)

Comments

T.J. Mercier July 25, 2024, 11:12 p.m. UTC | #1
On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Linux kernel does not expose memory.current on the root memcg and there
> are applications which have to traverse all the top level memcgs to
> calculate the total memory charged in the system. This is more expensive
> (directory traversal and multiple open and reads) and is racy on a busy
> machine. As the kernel already have the needed information i.e. root's
> memory.current, why not expose that?
>
> However root's memory.current will have a different semantics than the
> non-root's memory.current as the kernel skips the charging for root, so
> maybe it is better to have a different named interface for the root.
> Something like memory.children_usage only for root memcg.
>
> Now there is still a question that why the kernel does not expose
> memory.current for the root. The historical reason was that the memcg
> charging was expensice and to provide the users to bypass the memcg
> charging by letting them run in the root. However do we still want to
> have this exception today? What is stopping us to start charging the
> root memcg as well. Of course the root will not have limits but the
> allocations will go through memcg charging and then the memory.current
> of root and non-root will have the same semantics.
>
> This is an RFC to start a discussion on memcg charging for root.

Hi Shakeel,

Since the root already has a page_counter I'm not opposed to this new
file as it doesn't increase the page_counter depth for children.
However I don't currently have any use-cases for it that wouldn't be
met by memory.stat in the root.

As far as charging, I've only ever seen kthreads and init in the root.
You have workloads that run there?

Best,
T.J.



> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
>  mm/memcontrol.c                         | 5 +++++
>  2 files changed, 11 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6c6075ed4aa5..e4afc05fd8ea 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1220,6 +1220,12 @@ PAGE_SIZE multiple when read back.
>         The total amount of memory currently being used by the cgroup
>         and its descendants.
>
> +  memory.children_usage
> +       A read-only single value file which exists only on root cgroup.
> +
> +       The total amount of memory currently being used by the
> +        descendants of the root cgroup.
> +
>    memory.min
>         A read-write single value file which exists on non-root
>         cgroups.  The default is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 960371788687..eba8cf76d3d3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4304,6 +4304,11 @@ static struct cftype memory_files[] = {
>                 .flags = CFTYPE_NOT_ON_ROOT,
>                 .read_u64 = memory_current_read,
>         },
> +       {
> +               .name = "children_usage",
> +               .flags = CFTYPE_ONLY_ON_ROOT,
> +               .read_u64 = memory_current_read,
> +       },
>         {
>                 .name = "peak",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> --
> 2.43.0
>
>
Yosry Ahmed July 25, 2024, 11:20 p.m. UTC | #2
On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Linux kernel does not expose memory.current on the root memcg and there
> are applications which have to traverse all the top level memcgs to
> calculate the total memory charged in the system. This is more expensive
> (directory traversal and multiple open and reads) and is racy on a busy
> machine. As the kernel already have the needed information i.e. root's
> memory.current, why not expose that?
>
> However root's memory.current will have a different semantics than the
> non-root's memory.current as the kernel skips the charging for root, so
> maybe it is better to have a different named interface for the root.
> Something like memory.children_usage only for root memcg.
>
> Now there is still a question that why the kernel does not expose
> memory.current for the root. The historical reason was that the memcg
> charging was expensice and to provide the users to bypass the memcg
> charging by letting them run in the root. However do we still want to
> have this exception today? What is stopping us to start charging the
> root memcg as well. Of course the root will not have limits but the
> allocations will go through memcg charging and then the memory.current
> of root and non-root will have the same semantics.
>
> This is an RFC to start a discussion on memcg charging for root.

I vaguely remember when running some netperf tests (tcp_rr?) in a
cgroup that the performance decreases considerably with every level
down the hierarchy. I am assuming that charging was a part of the
reason. If that's the case, charging the root will be similar to
moving all workloads one level down the hierarchy in terms of charging
overhead.

>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
>  mm/memcontrol.c                         | 5 +++++
>  2 files changed, 11 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6c6075ed4aa5..e4afc05fd8ea 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1220,6 +1220,12 @@ PAGE_SIZE multiple when read back.
>         The total amount of memory currently being used by the cgroup
>         and its descendants.
>
> +  memory.children_usage
> +       A read-only single value file which exists only on root cgroup.
> +
> +       The total amount of memory currently being used by the
> +        descendants of the root cgroup.
> +
>    memory.min
>         A read-write single value file which exists on non-root
>         cgroups.  The default is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 960371788687..eba8cf76d3d3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4304,6 +4304,11 @@ static struct cftype memory_files[] = {
>                 .flags = CFTYPE_NOT_ON_ROOT,
>                 .read_u64 = memory_current_read,
>         },
> +       {
> +               .name = "children_usage",
> +               .flags = CFTYPE_ONLY_ON_ROOT,
> +               .read_u64 = memory_current_read,
> +       },
>         {
>                 .name = "peak",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> --
> 2.43.0
>
>
Shakeel Butt July 26, 2024, 3:46 p.m. UTC | #3
Hi T.J.

On Thu, Jul 25, 2024 at 04:12:12PM GMT, T.J. Mercier wrote:
> On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > Linux kernel does not expose memory.current on the root memcg and there
> > are applications which have to traverse all the top level memcgs to
> > calculate the total memory charged in the system. This is more expensive
> > (directory traversal and multiple open and reads) and is racy on a busy
> > machine. As the kernel already have the needed information i.e. root's
> > memory.current, why not expose that?
> >
> > However root's memory.current will have a different semantics than the
> > non-root's memory.current as the kernel skips the charging for root, so
> > maybe it is better to have a different named interface for the root.
> > Something like memory.children_usage only for root memcg.
> >
> > Now there is still a question that why the kernel does not expose
> > memory.current for the root. The historical reason was that the memcg
> > charging was expensice and to provide the users to bypass the memcg
> > charging by letting them run in the root. However do we still want to
> > have this exception today? What is stopping us to start charging the
> > root memcg as well. Of course the root will not have limits but the
> > allocations will go through memcg charging and then the memory.current
> > of root and non-root will have the same semantics.
> >
> > This is an RFC to start a discussion on memcg charging for root.
> 
> Hi Shakeel,
> 
> Since the root already has a page_counter I'm not opposed to this new
> file as it doesn't increase the page_counter depth for children.
> However I don't currently have any use-cases for it that wouldn't be
> met by memory.stat in the root.

I think the difference would be getting a single number versus
accumulating different fields in memory.stat to get that number (memory
used by root's children), which might be a bit error prone.

> 
> As far as charging, I've only ever seen kthreads and init in the root.
> You have workloads that run there?

No workloads in root. The charging is only to make the semantics of
root's memory.current the same as its descendants'.

Thanks,
Shakeel
Shakeel Butt July 26, 2024, 3:48 p.m. UTC | #4
On Thu, Jul 25, 2024 at 04:20:45PM GMT, Yosry Ahmed wrote:
> On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > Linux kernel does not expose memory.current on the root memcg and there
> > are applications which have to traverse all the top level memcgs to
> > calculate the total memory charged in the system. This is more expensive
> > (directory traversal and multiple open and reads) and is racy on a busy
> > machine. As the kernel already have the needed information i.e. root's
> > memory.current, why not expose that?
> >
> > However root's memory.current will have a different semantics than the
> > non-root's memory.current as the kernel skips the charging for root, so
> > maybe it is better to have a different named interface for the root.
> > Something like memory.children_usage only for root memcg.
> >
> > Now there is still a question that why the kernel does not expose
> > memory.current for the root. The historical reason was that the memcg
> > charging was expensice and to provide the users to bypass the memcg
> > charging by letting them run in the root. However do we still want to
> > have this exception today? What is stopping us to start charging the
> > root memcg as well. Of course the root will not have limits but the
> > allocations will go through memcg charging and then the memory.current
> > of root and non-root will have the same semantics.
> >
> > This is an RFC to start a discussion on memcg charging for root.
> 
> I vaguely remember when running some netperf tests (tcp_rr?) in a
> cgroup that the performance decreases considerably with every level
> down the hierarchy. I am assuming that charging was a part of the
> reason. If that's the case, charging the root will be similar to
> moving all workloads one level down the hierarchy in terms of charging
> overhead.

No, the workloads running in non-root memcgs will not see any
difference. Only the workloads running in root will see charging
overhead.
Yosry Ahmed July 26, 2024, 4:25 p.m. UTC | #5
On Fri, Jul 26, 2024 at 8:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Jul 25, 2024 at 04:20:45PM GMT, Yosry Ahmed wrote:
> > On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > Linux kernel does not expose memory.current on the root memcg and there
> > > are applications which have to traverse all the top level memcgs to
> > > calculate the total memory charged in the system. This is more expensive
> > > (directory traversal and multiple open and reads) and is racy on a busy
> > > machine. As the kernel already have the needed information i.e. root's
> > > memory.current, why not expose that?
> > >
> > > However root's memory.current will have a different semantics than the
> > > non-root's memory.current as the kernel skips the charging for root, so
> > > maybe it is better to have a different named interface for the root.
> > > Something like memory.children_usage only for root memcg.
> > >
> > > Now there is still a question that why the kernel does not expose
> > > memory.current for the root. The historical reason was that the memcg
> > > charging was expensice and to provide the users to bypass the memcg
> > > charging by letting them run in the root. However do we still want to
> > > have this exception today? What is stopping us to start charging the
> > > root memcg as well. Of course the root will not have limits but the
> > > allocations will go through memcg charging and then the memory.current
> > > of root and non-root will have the same semantics.
> > >
> > > This is an RFC to start a discussion on memcg charging for root.
> >
> > I vaguely remember when running some netperf tests (tcp_rr?) in a
> > cgroup that the performance decreases considerably with every level
> > down the hierarchy. I am assuming that charging was a part of the
> > reason. If that's the case, charging the root will be similar to
> > moving all workloads one level down the hierarchy in terms of charging
> > overhead.
>
> No, the workloads running in non-root memcgs will not see any
> difference. Only the workloads running in root will see charging
> overhead.

Oh yeah we already charge the root's page counters hierarchically in
the upstream kernel, we just do not charge them if the origin of the
charge is the root itself.
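
To make that concrete, here is a toy model of the behaviour described
above. It is only a sketch and not the kernel's page_counter code:
struct toy_memcg and toy_charge() are made-up names, and locking,
limits and batching are ignored.

```c
#include <stddef.h>

/* Toy stand-in for a memcg; NOT the kernel's struct page_counter. */
struct toy_memcg {
    long usage;                 /* pages charged here and below */
    struct toy_memcg *parent;   /* NULL for the root */
};

/*
 * Charge nr_pages on behalf of a task living in origin.
 * A charge that originates in the root is skipped entirely; any other
 * charge is propagated to every ancestor, the root included.
 */
static void toy_charge(struct toy_memcg *origin, long nr_pages)
{
    struct toy_memcg *c;

    if (!origin->parent)
        return; /* origin is the root: nothing is charged */

    for (c = origin; c; c = c->parent)
        c->usage += nr_pages;
}
```

In this model the root's counter ends up holding exactly the sum of
what its descendants charged, which is the number the proposed
memory.children_usage would report.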

We also have workloads that iterate top-level memcgs to calculate the
total charged memory, so memory.children_usage for the root memcg
would help.

As for memory.current, do you have any data about how much memory is
charged to the root itself? We think of the memory charged to the root
as system overhead, while the memory charged to top-level memcgs
isn't.

So basically total_memory - root::memory.children_usage would be a
fast way to get a rough estimation of system overhead. The same would
not apply for total_memory - root::memory.current if I understand
correctly.
T.J. Mercier July 26, 2024, 4:46 p.m. UTC | #6
On Fri, Jul 26, 2024 at 8:47 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi T.J.
>
> On Thu, Jul 25, 2024 at 04:12:12PM GMT, T.J. Mercier wrote:
> > On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > Linux kernel does not expose memory.current on the root memcg and there
> > > are applications which have to traverse all the top level memcgs to
> > > calculate the total memory charged in the system. This is more expensive
> > > (directory traversal and multiple open and reads) and is racy on a busy
> > > machine. As the kernel already have the needed information i.e. root's
> > > memory.current, why not expose that?
> > >
> > > However root's memory.current will have a different semantics than the
> > > non-root's memory.current as the kernel skips the charging for root, so
> > > maybe it is better to have a different named interface for the root.
> > > Something like memory.children_usage only for root memcg.
> > >
> > > Now there is still a question that why the kernel does not expose
> > > memory.current for the root. The historical reason was that the memcg
> > > charging was expensice and to provide the users to bypass the memcg
> > > charging by letting them run in the root. However do we still want to
> > > have this exception today? What is stopping us to start charging the
> > > root memcg as well. Of course the root will not have limits but the
> > > allocations will go through memcg charging and then the memory.current
> > > of root and non-root will have the same semantics.
> > >
> > > This is an RFC to start a discussion on memcg charging for root.
> >
> > Hi Shakeel,
> >
> > Since the root already has a page_counter I'm not opposed to this new
> > file as it doesn't increase the page_counter depth for children.
> > However I don't currently have any use-cases for it that wouldn't be
> > met by memory.stat in the root.
>
> I think difference would be getting a single number versus accumulating
> different fields in memory.stat to get that number (memory used by
> root's children) which might be a bit error prone.

Yeah that makes sense, I get how it'd be nicer to do just one read in
the root instead of digging into all the children. I just meant to say
that when looking at the root, currently I only care about a
particular stat (e.g. file_mapped) instead of the whole usage.
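
As a rough sketch of that kind of single-stat lookup, assuming the
contents of memory.stat have already been read into a buffer of
"key value" lines (memstat_lookup() is a hypothetical helper name, not
an existing API):

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one "key value" line in memory.stat-style text.
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or its value cannot be parsed.
 */
static int memstat_lookup(const char *text, const char *key,
                          unsigned long long *value)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match the whole key: it must be followed by a space. */
        if (!strncmp(line, key, klen) && line[klen] == ' ')
            return sscanf(line + klen, "%llu", value) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

The check for the trailing space keeps a lookup of "file" from matching
the "file_mapped" line.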

> > As far as charging, I've only ever seen kthreads and init in the root.
> > You have workloads that run there?
>
> No workloads in root. The charging is only to make the semanctics of
> root's memory.current same as its descendants.
>
> Thanks,
> Shakeel
T.J. Mercier July 26, 2024, 4:50 p.m. UTC | #7
On Fri, Jul 26, 2024 at 9:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Jul 26, 2024 at 8:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Jul 25, 2024 at 04:20:45PM GMT, Yosry Ahmed wrote:
> > > On Mon, Jul 22, 2024 at 3:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > Linux kernel does not expose memory.current on the root memcg and there
> > > > are applications which have to traverse all the top level memcgs to
> > > > calculate the total memory charged in the system. This is more expensive
> > > > (directory traversal and multiple open and reads) and is racy on a busy
> > > > machine. As the kernel already have the needed information i.e. root's
> > > > memory.current, why not expose that?
> > > >
> > > > However root's memory.current will have a different semantics than the
> > > > non-root's memory.current as the kernel skips the charging for root, so
> > > > maybe it is better to have a different named interface for the root.
> > > > Something like memory.children_usage only for root memcg.
> > > >
> > > > Now there is still a question that why the kernel does not expose
> > > > memory.current for the root. The historical reason was that the memcg
> > > > charging was expensice and to provide the users to bypass the memcg
> > > > charging by letting them run in the root. However do we still want to
> > > > have this exception today? What is stopping us to start charging the
> > > > root memcg as well. Of course the root will not have limits but the
> > > > allocations will go through memcg charging and then the memory.current
> > > > of root and non-root will have the same semantics.
> > > >
> > > > This is an RFC to start a discussion on memcg charging for root.
> > >
> > > I vaguely remember when running some netperf tests (tcp_rr?) in a
> > > cgroup that the performance decreases considerably with every level
> > > down the hierarchy. I am assuming that charging was a part of the
> > > reason. If that's the case, charging the root will be similar to
> > > moving all workloads one level down the hierarchy in terms of charging
> > > overhead.
> >
> > No, the workloads running in non-root memcgs will not see any
> > difference. Only the workloads running in root will see charging
> > overhead.
>
> Oh yeah we already charge the root's page counters hierarchically in
> the upstream kernel, we just do not charge them if the origin of the
> charge is the root itself.
>
> We also have workloads that iterate top-level memcgs to calculate the
> total charged memory, so memory.children_usage for the root memcg
> would help.
>
> As for memory.current, do you have any data about how much memory is
> charged to the root itself?

Yeah I wonder if we'd be able to see any significant regressions for
stuff that lives there today if we were to start charging it. I can
try running a test with Android next week. I guess try_charge() is the
main thing that would need to change to allow root charges?

> We think of the memory charged to the root
> as system overhead, while the memory charged to top-level memcgs
> isn't.
>
> So basically total_memory - root::memory.children_usage would be a
> fast way to get a rough estimation of system overhead. The same would
> not apply for total_memory - root::memory.current if I understand
> correctly.
>

Shakeel Butt July 26, 2024, 5:18 p.m. UTC | #8
On Fri, Jul 26, 2024 at 09:50:49AM GMT, T.J. Mercier wrote:
> On Fri, Jul 26, 2024 at 9:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
[...]
> >
> > Oh yeah we already charge the root's page counters hierarchically in
> > the upstream kernel, we just do not charge them if the origin of the
> > charge is the root itself.
> >
> > We also have workloads that iterate top-level memcgs to calculate the
> > total charged memory, so memory.children_usage for the root memcg
> > would help.
> >
> > As for memory.current, do you have any data about how much memory is
> > charged to the root itself?
> 
> Yeah I wonder if we'd be able to see any significant regressions for
> stuff that lives there today if we were to start charging it. I can
> try running a test with Android next week. I guess try_charge() is the
> main thing that would need to change to allow root charges?
> 

It would be a bit more involved, like allocating an objcg for the root
and changing memcg_slab_post_alloc_hook() to use the root memcg if the
account flags are not present. There might be some changes needed for
list_lru and reclaim, but I have not looked deeper into it yet.
Shakeel Butt July 26, 2024, 5:30 p.m. UTC | #9
On Fri, Jul 26, 2024 at 09:25:27AM GMT, Yosry Ahmed wrote:
> On Fri, Jul 26, 2024 at 8:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
[...]
> >
> > No, the workloads running in non-root memcgs will not see any
> > difference. Only the workloads running in root will see charging
> > overhead.
> 
> Oh yeah we already charge the root's page counters hierarchically in
> the upstream kernel, we just do not charge them if the origin of the
> charge is the root itself.
> 
> We also have workloads that iterate top-level memcgs to calculate the
> total charged memory, so memory.children_usage for the root memcg
> would help.
> 
> As for memory.current, do you have any data about how much memory is
> charged to the root itself? We think of the memory charged to the root
> as system overhead, while the memory charged to top-level memcgs
> isn't.
> 
> So basically total_memory - root::memory.children_usage would be a
> fast way to get a rough estimation of system overhead. The same would
> not apply for total_memory - root::memory.current if I understand
> correctly.

Please note that root::memory.children_usage will include top-level
zombies as well (at least until LRU reparenting is done). So, for your
example, it would provide a good estimate of top-level zombie memory
through root::memory.children_usage - top_level(alive)::memory.current.
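
A minimal sketch of that subtraction, assuming the proposed
memory.children_usage value and the memory.current values of the live
top-level cgroups have already been read (the function name is
hypothetical, and the file only exists with this RFC applied):

```c
#include <stddef.h>

/*
 * Estimate the bytes held by top-level zombie memcgs as
 * children_usage minus the sum of the live top-level memory.current
 * values.
 */
static unsigned long long
estimate_top_level_zombie_bytes(unsigned long long children_usage,
                                const unsigned long long *alive_current,
                                size_t n)
{
    unsigned long long alive = 0;
    size_t i;

    for (i = 0; i < n; i++)
        alive += alive_current[i];

    /* The readings are taken at different times; clamp, do not wrap. */
    return children_usage > alive ? children_usage - alive : 0;
}
```

Because the inputs are sampled at different moments, the result is only
an estimate, hence the clamp to zero.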
Yosry Ahmed July 26, 2024, 5:43 p.m. UTC | #10
On Fri, Jul 26, 2024 at 10:30 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Jul 26, 2024 at 09:25:27AM GMT, Yosry Ahmed wrote:
> > On Fri, Jul 26, 2024 at 8:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> [...]
> > >
> > > No, the workloads running in non-root memcgs will not see any
> > > difference. Only the workloads running in root will see charging
> > > overhead.
> >
> > Oh yeah we already charge the root's page counters hierarchically in
> > the upstream kernel, we just do not charge them if the origin of the
> > charge is the root itself.
> >
> > We also have workloads that iterate top-level memcgs to calculate the
> > total charged memory, so memory.children_usage for the root memcg
> > would help.
> >
> > As for memory.current, do you have any data about how much memory is
> > charged to the root itself? We think of the memory charged to the root
> > as system overhead, while the memory charged to top-level memcgs
> > isn't.
> >
> > So basically total_memory - root::memory.children_usage would be a
> > fast way to get a rough estimation of system overhead. The same would
> > not apply for total_memory - root::memory.current if I understand
> > correctly.
>
> Please note that root::memory.children_usage will have top level zombies
> included as well (at least until lru reparenting is not done). So, for
> your example it would provide good estimation of top level zombie memory
> through root::memory.children_usage - top_level(alive)::memory.current.

Good point. The fact that it includes the top-level zombies makes it
less valuable for this use case, as zombie memory is considered system
overhead as well. So we need to iterate the top level memcgs anyway.
Shakeel Butt July 26, 2024, 6:16 p.m. UTC | #11
On Fri, Jul 26, 2024 at 10:43:31AM GMT, Yosry Ahmed wrote:
> On Fri, Jul 26, 2024 at 10:30 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Fri, Jul 26, 2024 at 09:25:27AM GMT, Yosry Ahmed wrote:
> > > On Fri, Jul 26, 2024 at 8:48 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > [...]
> > > >
> > > > No, the workloads running in non-root memcgs will not see any
> > > > difference. Only the workloads running in root will see charging
> > > > overhead.
> > >
> > > Oh yeah we already charge the root's page counters hierarchically in
> > > the upstream kernel, we just do not charge them if the origin of the
> > > charge is the root itself.
> > >
> > > We also have workloads that iterate top-level memcgs to calculate the
> > > total charged memory, so memory.children_usage for the root memcg
> > > would help.
> > >
> > > As for memory.current, do you have any data about how much memory is
> > > charged to the root itself? We think of the memory charged to the root
> > > as system overhead, while the memory charged to top-level memcgs
> > > isn't.
> > >
> > > So basically total_memory - root::memory.children_usage would be a
> > > fast way to get a rough estimation of system overhead. The same would
> > > not apply for total_memory - root::memory.current if I understand
> > > correctly.
> >
> > Please note that root::memory.children_usage will have top level zombies
> > included as well (at least until lru reparenting is not done). So, for
> > your example it would provide good estimation of top level zombie memory
> > through root::memory.children_usage - top_level(alive)::memory.current.
> 
> Good point. The fact that it includes the top-level zombies makes it
> less valuable for this use case, as zombie memory is considered system
> overhead as well. So we need to iterate the top level memcgs anyway.

Most users run systemd, which has a fixed top-level hierarchy, so this
is fine for most of them.
Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6c6075ed4aa5..e4afc05fd8ea 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1220,6 +1220,12 @@  PAGE_SIZE multiple when read back.
 	The total amount of memory currently being used by the cgroup
 	and its descendants.
 
+  memory.children_usage
+	A read-only single value file which exists only on the root cgroup.
+
+	The total amount of memory currently being used by the
+	descendants of the root cgroup.
+
   memory.min
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 960371788687..eba8cf76d3d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4304,6 +4304,11 @@  static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = memory_current_read,
 	},
+	{
+		.name = "children_usage",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_u64 = memory_current_read,
+	},
 	{
 		.name = "peak",
 		.flags = CFTYPE_NOT_ON_ROOT,