
[00/14] bpf: Allow not to charge bpf memory

Message ID: 20220319173036.23352-1-laoar.shao@gmail.com

Message

Yafang Shao March 19, 2022, 5:30 p.m. UTC
After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues for
us. For instance, the container of the loader, which loads the bpf programs
and pins them on bpffs, may restart after pinning the progs and maps. After
the restart, the pinned progs and maps no longer belong to the new container;
they actually belong to an offline memcg left by the previous generation.
That inconsistent behavior makes memory resource management for this
container troublesome.

The reason why these progs and maps have to persist across multiple
generations is that they are also used by other processes which are not in
this container. IOW, they can't be removed when this container is restarted.
To take a specific example, the bpf program for the clsact qdisc is loaded by
an agent running in a container; that agent not only loads the bpf program
but also processes the data generated by it and does some other maintenance
work.

In order to keep the charging behavior consistent, we previously considered
recharging these pinned maps and progs after the container is restarted, but
after the discussion[1] with Roman, we decided to take another direction:
don't charge them to the container in the first place. TL;DR of that
discussion: recharging is not a generic solution and it carries too much
risk.

This patchset implements the no-charge solution. Two flags are introduced in
union bpf_attr, one for bpf maps and another for bpf progs. A user who
doesn't want the memory charged to the current memcg can set these flags.
The flags are only permitted for the sys admin, as the memory will then be
accounted to the root memcg only.

Patches #1~#8 are for bpf maps. Patches #9~#12 are for bpf progs. Patches
#13 and #14 add selftests and also serve as examples of how to use the flags.

[1]. https://lwn.net/Articles/887180/ 
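
For illustration, a minimal userspace sketch of how the map flag could be
used via the raw bpf(2) syscall. The flag name is taken from the patch
titles, but its numeric value below is a placeholder; the real definition
comes from patch #1, and patches #13/#14 contain the actual selftests.

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder value; the real one is defined by patch #1 in the uapi header. */
#ifndef BPF_F_NO_CHARGE
#define BPF_F_NO_CHARGE (1U << 13)
#endif

int create_uncharged_map(void)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_type = BPF_MAP_TYPE_HASH;
        attr.key_size = sizeof(int);
        attr.value_size = sizeof(long);
        attr.max_entries = 1024;
        /* Ask that the map memory not be charged to the current memcg.
         * Needs CAP_SYS_ADMIN; the memory is accounted to the root memcg.
         */
        attr.map_flags = BPF_F_NO_CHARGE;

        return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

The prog-side flag (BPF_F_PROG_NO_CHARGE, patches #9~#12) follows the same
pattern in the BPF_PROG_LOAD attributes.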

Yafang Shao (14):
  bpf: Introduce no charge flag for bpf map
  bpf: Only sys admin can set no charge flag
  bpf: Enable no charge in map _CREATE_FLAG_MASK
  bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
  bpf: Allow no charge in bpf_map_area_alloc
  bpf: Allow no charge for allocation not at map creation time
  bpf: Allow no charge in map specific allocation
  bpf: Aggregate flags for BPF_PROG_LOAD command
  bpf: Add no charge flag for bpf prog
  bpf: Only sys admin can set no charge flag for bpf prog
  bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
  bpf: Allow no charge for bpf prog
  bpf: selftests: Add test case for BPF_F_NO_CHARGE
  bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE

 include/linux/bpf.h                           | 27 ++++++-
 include/uapi/linux/bpf.h                      | 21 +++--
 kernel/bpf/arraymap.c                         |  9 +--
 kernel/bpf/bloom_filter.c                     |  7 +-
 kernel/bpf/bpf_local_storage.c                |  8 +-
 kernel/bpf/bpf_struct_ops.c                   | 13 +--
 kernel/bpf/core.c                             | 20 +++--
 kernel/bpf/cpumap.c                           | 10 ++-
 kernel/bpf/devmap.c                           | 14 ++--
 kernel/bpf/hashtab.c                          | 14 ++--
 kernel/bpf/local_storage.c                    |  4 +-
 kernel/bpf/lpm_trie.c                         |  4 +-
 kernel/bpf/queue_stack_maps.c                 |  5 +-
 kernel/bpf/reuseport_array.c                  |  3 +-
 kernel/bpf/ringbuf.c                          | 19 ++---
 kernel/bpf/stackmap.c                         | 13 +--
 kernel/bpf/syscall.c                          | 40 +++++++---
 kernel/bpf/verifier.c                         |  2 +-
 net/core/filter.c                             |  6 +-
 net/core/sock_map.c                           |  8 +-
 net/xdp/xskmap.c                              |  9 ++-
 tools/include/uapi/linux/bpf.h                | 21 +++--
 .../selftests/bpf/map_tests/no_charg.c        | 79 +++++++++++++++++++
 .../selftests/bpf/prog_tests/no_charge.c      | 49 ++++++++++++
 24 files changed, 297 insertions(+), 108 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c

Comments

Roman Gushchin March 21, 2022, 10:52 p.m. UTC | #1
Hello, Yafang!

Thank you for continuing working on this!

On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by defaut, that causes unexpected issues for
> us. For instance, the container of the loader-which loads the bpf programs
> and pins them on bpffs-may restart after pinning the progs and maps. After
> the restart, the pinned progs and maps won't belong to the new container
> any more, while they actually belong to an offline memcg left by the
> previous generation. That inconsistent behavior will make trouble for the
> memory resource management for this container.

I'm looking at this text and increasingly feeling that it's not a bpf-specific
problem and it shouldn't be solved as one.

Is there any significant reason why the loader can't temporarily enter the
root cgroup before creating bpf maps/progs? I know it would require some changes
in the userspace code, but so would new bpf flags.
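
(For reference, a rough sketch of what such a temporary move could look like
on a cgroup v2 mount; the path and error handling are simplified assumptions,
not something taken from the series:)

#include <stdio.h>
#include <unistd.h>

/* Move the calling process into the given cgroup by writing its pid to
 * cgroup.procs. Real code would first save the original cgroup (e.g. from
 * /proc/self/cgroup) so it can move back afterwards.
 */
static int move_to_cgroup(const char *cgroup_dir)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", getpid());
        return fclose(f);
}

/* Usage:
 *      move_to_cgroup("/sys/fs/cgroup");     enter the root cgroup
 *      ... create and pin bpf maps/progs ...
 *      move_to_cgroup(saved_cgroup_path);    move back
 */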

> 
> The reason why these progs and maps have to be persistent across multiple
> generations is that these progs and maps are also used by other processes
> which are not in this container. IOW, they can't be removed when this
> container is restarted. Take a specific example, bpf program for clsact
> qdisc is loaded by a agent running in a container, which not only loads
> bpf program but also processes the data generated by this program and do
> some other maintainace things.
> 
> In order to keep the charging behavior consistent, we used to consider a
> way to recharge these pinned maps and progs again after the container is
> restarted, but after the discussion[1] with Roman, we decided to go
> another direction that don't charge them to the container in the first
> place. TL;DR about the mentioned disccussion: recharging is not a generic
> solution and it may take too much risk.
> 
> This patchset is the solution of no charge. Two flags are introduced in
> union bpf_attr, one for bpf map and another for bpf prog. The user who
> doesn't want to charge to current memcg can use these two flags. These two
> flags are only permitted for sys admin as these memory will be accounted to
> the root memcg only.

If we're going with bpf-specific flags (which I'd prefer not to), let's
define them as the way to create system-wide bpf objects which are expected
to outlive the original cgroup, with the expectation that they will be treated
as belonging to the root cgroup by any sort of existing or future resource
accounting (e.g. if we start accounting CPU used by bpf programs).

But then again: why just not create them in the root cgroup?

Thanks!
Yafang Shao March 22, 2022, 4:10 p.m. UTC | #2
On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Hello, Yafang!
>
> Thank you for continuing working on this!
>
> On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by defaut, that causes unexpected issues for
> > us. For instance, the container of the loader-which loads the bpf programs
> > and pins them on bpffs-may restart after pinning the progs and maps. After
> > the restart, the pinned progs and maps won't belong to the new container
> > any more, while they actually belong to an offline memcg left by the
> > previous generation. That inconsistent behavior will make trouble for the
> > memory resource management for this container.
>
> I'm looking at this text and increasingly feeling that it's not a bpf-specific
> problem and it shouldn't be solved as one.
>

I'm not sure whether it is a common issue or not, but I'm sure bpf has
special attributes that we should handle specifically.  I can show you an
example of why bpf is special.

A pinned bpf object is similar to a kernel module, right?
But that issue can't happen with a kernel module; it can happen only with
bpf.  The reason is that a kernel module can choose whether or not to account
the memory it allocates, e.g.
    - Account
      kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
    - Not account
      kmalloc(size, GFP_KERNEL);

Bpf has no such choice, because GFP_KERNEL is a kernel API that is not
exposed to the user.

The issue was then exposed when memcg-based accounting was forcibly enabled
for all bpf programs. That is a behavior change, and unfortunately we don't
give the user a chance to keep the old behavior unless they don't use memcg
at all...

That is not to say memcg-based accounting is bad; it is really useful, but
it needs to be improved. We can't expose __GFP_ACCOUNT to the user directly,
but we can expose a wrapper around it to the user. That's what this patchset
does, like what we have always done in bpf.
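
Conceptually, the kernel side of that wrapper boils down to something like
the sketch below (hypothetical helper name; the actual patches thread this
decision through bpf_map_area_alloc() and the other allocation sites):

/* Kernel-side fragment, for illustration only. */
static gfp_t bpf_memcg_flags(u32 map_flags)
{
        /* BPF_F_NO_CHARGE is rejected earlier for callers without
         * CAP_SYS_ADMIN; when it is set, __GFP_ACCOUNT is dropped so the
         * allocation is not charged to the current memcg and effectively
         * stays with the root memcg.
         */
        if (map_flags & BPF_F_NO_CHARGE)
                return 0;
        return __GFP_ACCOUNT;
}

/* An allocation site would then look roughly like:
 *      ptr = kmalloc(size, GFP_KERNEL | bpf_memcg_flags(map->map_flags));
 */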

> Is there any significant reason why the loader can't temporarily enter the
> root cgroup before creating bpf maps/progs? I know it will require some changes
> in the userspace code, but so do new bpf flags.
>

In our k8s environment, the user agents are deployed in a DaemonSet[1].
Temporarily entering the root cgroup before creating bpf maps/progs would
cause more trouble; for example, we would have to change the way we deploy
the user agents, which would be a big project.

[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

> >
> > The reason why these progs and maps have to be persistent across multiple
> > generations is that these progs and maps are also used by other processes
> > which are not in this container. IOW, they can't be removed when this
> > container is restarted. Take a specific example, bpf program for clsact
> > qdisc is loaded by a agent running in a container, which not only loads
> > bpf program but also processes the data generated by this program and do
> > some other maintainace things.
> >
> > In order to keep the charging behavior consistent, we used to consider a
> > way to recharge these pinned maps and progs again after the container is
> > restarted, but after the discussion[1] with Roman, we decided to go
> > another direction that don't charge them to the container in the first
> > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > solution and it may take too much risk.
> >
> > This patchset is the solution of no charge. Two flags are introduced in
> > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > doesn't want to charge to current memcg can use these two flags. These two
> > flags are only permitted for sys admin as these memory will be accounted to
> > the root memcg only.
>
> If we're going with bpf-specific flags (which I'd prefer not to), let's
> define them as the way to create system-wide bpf objects which are expected
> to outlive the original cgroup. With expectations that they will be treated
> as belonging to the root cgroup by any sort of existing or future resource
> accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
>

Now that we're talking about the CPU resource, I have some more complaints:
the cpu cgroup really does a better job than the memory cgroup. Below is why
the cpu cgroup does well:

   - CPU (Task Cgroup)
     Code: CPU time is accounted to the one who is executING this code.

   - Memory (Memory Cgroup)
     Data: memory usage is accounted to the one who allocatED this data.

Have you spotted the difference?
The difference is that CPU time is accounted to the one who is using it
(which is reasonable), while memory usage is accounted to the one who
allocated it (which is unreasonable). If we split the Data/Code into private
and shared, we can see why it is unreasonable.

  Memory Cgroup
    Private Data: private and thus accounted to one single memcg - good.
    Shared Data:  shared but accounted to one single memcg - bad.

  Task Cgroup
    Private Code: private and thus accounted to one single task group - good.
    Shared Code:  shared and thus accounted to all the task groups - good.

Pages are accounted when they are allocated rather than when they are used,
and that is why so many ridiculous things happen. But it is common sense that
we can't dynamically charge a page to the process that is accessing it,
right? So we have to handle the issues caused by shared pages case by case.

> But then again: why just not create them in the root cgroup?
>

As I explained above, that is limited by the deployment.
Roman Gushchin March 22, 2022, 7:10 p.m. UTC | #3
On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
> On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Hello, Yafang!
> >
> > Thank you for continuing working on this!
> >
> > On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > charged to the loader's memcg by defaut, that causes unexpected issues for
> > > us. For instance, the container of the loader-which loads the bpf programs
> > > and pins them on bpffs-may restart after pinning the progs and maps. After
> > > the restart, the pinned progs and maps won't belong to the new container
> > > any more, while they actually belong to an offline memcg left by the
> > > previous generation. That inconsistent behavior will make trouble for the
> > > memory resource management for this container.
> >
> > I'm looking at this text and increasingly feeling that it's not a bpf-specific
> > problem and it shouldn't be solved as one.
> >
> 
> I'm not sure whether it is a common issue or not, but I'm sure bpf has
> its special attribute that we should handle it specifically.  I can
> show you an example on why bpf is a special one.
> 
> The pinned bpf is similar to a kernel module, right?
> But that issue can't happen in a kernel module, while it can happen in
> bpf only.  The reason is that the kernel module has the choice on
> whether account the allocated memory or not, e.g.
>     - Account
>       kmalloc(size,  GFP_KERNEL | __GFP_ACCOUNT);
>    - Not Account
>       kmalloc(size, GFP_KERNEL);
> 
> While the bpf has no choice because the GFP_KERNEL is a KAPI which is
> not exposed to the user.

But if your process opens a file, creates a pipe, etc., there are also
kernel allocations happening, and the process has no control over whether
these allocations are accounted or not. The same applies to anonymous memory
and the pagecache as well, so it's not even kmem-specific.

> 
> Then the issue is exposed when the memcg-based accounting is
> forcefully enabled to all bpf programs. That is a behavior change,
> while unfortunately we don't give the user a chance to keep the old
> behavior unless they don't use memcg....
> 
> But that is not to say the memcg-based accounting is bad, while it is
> really useful, but it needs to be improved. We can't expose
> GFP_ACCOUNT to the user, but we can expose a wrapper of GFP_ACCOUNT to
> the user, that's what this patchset did, like what we always have done
> in bpf.
> 
> > Is there any significant reason why the loader can't temporarily enter the
> > root cgroup before creating bpf maps/progs? I know it will require some changes
> > in the userspace code, but so do new bpf flags.
> >
> 
> On our k8s environment, the user agents should be deployed in a
> Daemonset[1].  It will make more trouble to temporarily enter the root
> group before creating bpf maps/progs, for example we must change the
> way we used to deploy user agents, that will be a big project.

I understand; however, introducing new kernel interfaces to overcome such
things has its own downside: every introduced interface will stay pretty
much forever and we'll _have_ to support it. Kernel interfaces have a very long
life cycle, we have to admit it.

The problem you're describing - inconsistencies in the accounting of shared
regions of memory - is a generic cgroup problem, which has a configuration
solution: resource accounting and control should be performed at a stable
level, and the actual workloads can be (re)started in sub-cgroups with the
physical controllers optionally disabled.
E.g.:
			/
	workload2.slice   workload1.slice     <- accounting should be performed here
workload_gen1.scope workload_gen2.scope ...
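
(For illustration, a rough sketch of how such a layout could be set up on a
cgroup v2 mount; the paths and names mirror the diagram above and are
assumptions, not part of this thread:)

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        char buf[32];
        FILE *f;

        /* Stable level: the memory controller is active here (enabled via
         * the root's cgroup.subtree_control), so the workload's usage is
         * accounted against workload1.slice.
         */
        mkdir("/sys/fs/cgroup/workload1.slice", 0755);

        /* Restartable generation: recreated on every container restart.
         * As long as "+memory" is not enabled in the slice's
         * cgroup.subtree_control, charges from tasks in this scope land on
         * the parent slice and survive restarts.
         */
        mkdir("/sys/fs/cgroup/workload1.slice/workload_gen1.scope", 0755);

        /* Move the agent (here: ourselves) into the generation scope. */
        snprintf(buf, sizeof(buf), "%d\n", getpid());
        f = fopen("/sys/fs/cgroup/workload1.slice/workload_gen1.scope/cgroup.procs", "w");
        if (!f)
                return 1;
        fputs(buf, f);
        fclose(f);
        return 0;
}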


> 
> [1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
> 
> > >
> > > The reason why these progs and maps have to be persistent across multiple
> > > generations is that these progs and maps are also used by other processes
> > > which are not in this container. IOW, they can't be removed when this
> > > container is restarted. Take a specific example, bpf program for clsact
> > > qdisc is loaded by a agent running in a container, which not only loads
> > > bpf program but also processes the data generated by this program and do
> > > some other maintainace things.
> > >
> > > In order to keep the charging behavior consistent, we used to consider a
> > > way to recharge these pinned maps and progs again after the container is
> > > restarted, but after the discussion[1] with Roman, we decided to go
> > > another direction that don't charge them to the container in the first
> > > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > > solution and it may take too much risk.
> > >
> > > This patchset is the solution of no charge. Two flags are introduced in
> > > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > > doesn't want to charge to current memcg can use these two flags. These two
> > > flags are only permitted for sys admin as these memory will be accounted to
> > > the root memcg only.
> >
> > If we're going with bpf-specific flags (which I'd prefer not to), let's
> > define them as the way to create system-wide bpf objects which are expected
> > to outlive the original cgroup. With expectations that they will be treated
> > as belonging to the root cgroup by any sort of existing or future resource
> > accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
> >
> 
> Now that talking about the cpu resource, I have some more complaints
> that cpu cgroup does really better than memory cgroup. Below is the
> detailed information why cpu cgroup does a good job,
> 
>    - CPU
>                         Task Cgroup
>       Code          CPU time is accounted to the one who is executeING
>  this code
> 
>    - Memory
>                          Memory Cgroup
>       Data           Memory usage is accounted to the one who
> allocatED this data.
> 
> Have you found the difference?

Well, RAM is a vastly different thing than CPU :)
They have different physical properties and corresponding accounting limitations.

> The difference is that, cpu time is accounted to the one who is using
> it (that is reasonable), while memory usage is accounted to the one
> who allocated it (that is unreasonable). If we split the Data/Code
> into private and shared, we can find why it is unreasonable.
> 
>                                 Memory Cgroup
> Private Data           Private and thus accounted to one single memcg, good.
> Shared Data           Shared but accounted to one single memcg, bad.
> 
>                                 Task Cgroup
> Private Code          Private and thus accounted to one single task group, good.
> Shared Code          Shared and thus accounted to all the task groups, good.
> 
> The pages are accounted when they are allocated rather than when they
> are used, that is why so many ridiculous things happen.   But we have
> a common sense that we can’t dynamically charge the page to the
> process who is accessing it, right?  So we have to handle the issues
> caused by shared pages case by case.

The accounting of shared regions of memory is complex because of two
physical limitations:

1) The amount of (meta)data which can be used to track ownership. We expect
the memory overhead to be small in comparison to the accounted data. If a page
is used by many cgroups, even saving a single pointer to each cgroup can take
a lot of space. It's even worse for slab objects. At some point it stops making
sense: if the accounting takes more memory than the accounted memory, it's
better not to account at all.

2) CPU overhead: tracking memory usage beyond the initial allocation adds
overhead to some very hot paths. Imagine two processes mapping the same file:
the first process faults in the whole file and the second just uses the
pagecache. Currently that's very efficient. Making the second process change
the ownership information on each minor page fault would lead to a performance
regression. Think of the libc binary as this file.

That said, I'm not saying it can't be done better than it is now. But it's a
complex problem.

Thanks!
Yafang Shao March 23, 2022, 1:37 a.m. UTC | #4
On Wed, Mar 23, 2022 at 3:10 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
> > On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > Hello, Yafang!
> > >
> > > Thank you for continuing working on this!
> > >
> > > On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by defaut, that causes unexpected issues for
> > > > us. For instance, the container of the loader-which loads the bpf programs
> > > > and pins them on bpffs-may restart after pinning the progs and maps. After
> > > > the restart, the pinned progs and maps won't belong to the new container
> > > > any more, while they actually belong to an offline memcg left by the
> > > > previous generation. That inconsistent behavior will make trouble for the
> > > > memory resource management for this container.
> > >
> > > I'm looking at this text and increasingly feeling that it's not a bpf-specific
> > > problem and it shouldn't be solved as one.
> > >
> >
> > I'm not sure whether it is a common issue or not, but I'm sure bpf has
> > its special attribute that we should handle it specifically.  I can
> > show you an example on why bpf is a special one.
> >
> > The pinned bpf is similar to a kernel module, right?
> > But that issue can't happen in a kernel module, while it can happen in
> > bpf only.  The reason is that the kernel module has the choice on
> > whether account the allocated memory or not, e.g.
> >     - Account
> >       kmalloc(size,  GFP_KERNEL | __GFP_ACCOUNT);
> >    - Not Account
> >       kmalloc(size, GFP_KERNEL);
> >
> > While the bpf has no choice because the GFP_KERNEL is a KAPI which is
> > not exposed to the user.
>
> But if your process opens a file, creates a pipe etc there are also
> kernel allocations happening and the process has no control over whether
> these allocations are accounted or not. The same applies for the anonymous
> memory and pagecache as well, so it's not even kmem-specific.
>

So what is the real problem in practice?
Has anyone complained about it?

[At least, there's no behavior change in these areas.]

> >
> > Then the issue is exposed when the memcg-based accounting is
> > forcefully enabled to all bpf programs. That is a behavior change,
> > while unfortunately we don't give the user a chance to keep the old
> > behavior unless they don't use memcg....
> >
> > But that is not to say the memcg-based accounting is bad, while it is
> > really useful, but it needs to be improved. We can't expose
> > GFP_ACCOUNT to the user, but we can expose a wrapper of GFP_ACCOUNT to
> > the user, that's what this patchset did, like what we always have done
> > in bpf.
> >
> > > Is there any significant reason why the loader can't temporarily enter the
> > > root cgroup before creating bpf maps/progs? I know it will require some changes
> > > in the userspace code, but so do new bpf flags.
> > >
> >
> > On our k8s environment, the user agents should be deployed in a
> > Daemonset[1].  It will make more trouble to temporarily enter the root
> > group before creating bpf maps/progs, for example we must change the
> > way we used to deploy user agents, that will be a big project.
>
> I understand, however introducing new kernel interfaces to overcome such
> things has its own downside: every introduced interface will stay pretty
> much forever and we'll _have_ to support it. Kernel interfaces have a very long
> life cycle, we have to admit it.
>
> The problem you're describing - inconsistencies on accounting of shared regions
> of memory - is a generic cgroup problem, which has a configuration solution:
> the resource accounting and control should be performed on a stable level and
> actual workloads can be (re)started in sub-cgroups with optionally disabled
> physical controllers.
> E.g.:
>                         /
>         workload2.slice   workload1.slice     <- accounting should be performed here
> workload_gen1.scope workload_gen2.scope ...
>

I think we talked about this several days ago.

>
> >
> > [1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
> >
> > > >
> > > > The reason why these progs and maps have to be persistent across multiple
> > > > generations is that these progs and maps are also used by other processes
> > > > which are not in this container. IOW, they can't be removed when this
> > > > container is restarted. Take a specific example, bpf program for clsact
> > > > qdisc is loaded by a agent running in a container, which not only loads
> > > > bpf program but also processes the data generated by this program and do
> > > > some other maintainace things.
> > > >
> > > > In order to keep the charging behavior consistent, we used to consider a
> > > > way to recharge these pinned maps and progs again after the container is
> > > > restarted, but after the discussion[1] with Roman, we decided to go
> > > > another direction that don't charge them to the container in the first
> > > > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > > > solution and it may take too much risk.
> > > >
> > > > This patchset is the solution of no charge. Two flags are introduced in
> > > > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > > > doesn't want to charge to current memcg can use these two flags. These two
> > > > flags are only permitted for sys admin as these memory will be accounted to
> > > > the root memcg only.
> > >
> > > If we're going with bpf-specific flags (which I'd prefer not to), let's
> > > define them as the way to create system-wide bpf objects which are expected
> > > to outlive the original cgroup. With expectations that they will be treated
> > > as belonging to the root cgroup by any sort of existing or future resource
> > > accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
> > >
> >
> > Now that talking about the cpu resource, I have some more complaints
> > that cpu cgroup does really better than memory cgroup. Below is the
> > detailed information why cpu cgroup does a good job,
> >
> >    - CPU
> >                         Task Cgroup
> >       Code          CPU time is accounted to the one who is executeING
> >  this code
> >
> >    - Memory
> >                          Memory Cgroup
> >       Data           Memory usage is accounted to the one who
> > allocatED this data.
> >
> > Have you found the difference?
>
> Well, RAM is a vastly different thing than CPU :)
> They have different physical properties and corresponding accounting limitations.
>
> > The difference is that, cpu time is accounted to the one who is using
> > it (that is reasonable), while memory usage is accounted to the one
> > who allocated it (that is unreasonable). If we split the Data/Code
> > into private and shared, we can find why it is unreasonable.
> >
> >                                 Memory Cgroup
> > Private Data           Private and thus accounted to one single memcg, good.
> > Shared Data           Shared but accounted to one single memcg, bad.
> >
> >                                 Task Cgroup
> > Private Code          Private and thus accounted to one single task group, good.
> > Shared Code          Shared and thus accounted to all the task groups, good.
> >
> > The pages are accounted when they are allocated rather than when they
> > are used, that is why so many ridiculous things happen.   But we have
> > a common sense that we can’t dynamically charge the page to the
> > process who is accessing it, right?  So we have to handle the issues
> > caused by shared pages case by case.
>
> The accounting of shared regions of memory is complex because of two
> physical limitations:
>
> 1) Amount of (meta)data which we can be used to track ownership. We expect
> the memory overhead be small in comparison to the accounted data. If a page
> is used by many cgroups, even saving a single pointer to each cgroup can take
> a lot of space. Even worse for slab objects. At some point it stops making
> sense: if the accounting takes more memory than the accounted memory, it's
> better to not account at all.
>
> 2) CPU overhead: tracking memory usage beyond the initial allocation adds
> an overhead to some very hot paths. Imagine two processes mapping the same file,
> first processes faults in the whole file and the second just uses the pagecache.
> Currently it's very efficient. Causing the second process to change the ownership
> information on each minor page fault will lead to a performance regression.
> Think of libc binary as this file.
>
> That said, I'm not saying it can't be done better that now. But it's a complex
> problem.
>

Memcg-based accounting is also complex, but it has introduced a user-visible
behavior change now :(