[bpf-next,v2,00/12] bpf: Introduce selectable memcg for bpf map

Message ID: 20220818143118.17733-1-laoar.shao@gmail.com

Message

Yafang Shao Aug. 18, 2022, 2:31 p.m. UTC
In our production environment, we may load, run, and pin bpf programs and
maps in containers. For example, some of our networking bpf programs and
maps are loaded and pinned by a process running in a container in our
k8s environment. In this container there are also other user applications
running, which watch the networking configurations from remote servers and
update them on the local host, log error events, monitor the traffic, and
do some other work. Sometimes we need to update these user applications to
a new release, and in this update process we destroy the old container and
then start a new generation. In order not to interrupt the bpf programs
during the update, we pin the bpf programs and maps in bpffs. That is the
background and use case in our production environment.

After switching to memcg-based bpf memory accounting to limit the bpf
memory, we ran into some unexpected issues:
1. The memory usage is not consistent between the first generation and
new generations.
2. After the first generation is destroyed, the bpf memory can't be
limited if the bpf maps are not preallocated, because they will be
reparented.

This patchset tries to resolve these issues by introducing an
independent memcg to limit the bpf memory.

At bpf map creation, we can assign a specific memcg instead of using the
current task's memcg. That makes it flexible in containerized environments.
For example, if we want to limit the pinned bpf maps, we can use the
hierarchy below:

    Shared resources              Private resources 
                                    
     bpf-memcg                      k8s-memcg
     /        \                     /             
bpf-bar-memcg bpf-foo-memcg   srv-foo-memcg        
                  |               /        \
               (charged)     (not charged) (charged)                 
                  |           /              \
                  |          /                \
          bpf-foo-{progs,maps}              srv-foo

srv-foo loads and pins bpf-foo-{progs, maps}, but they are charged to an
independent memcg (bpf-foo-memcg) instead of srv-foo's memcg
(srv-foo-memcg).

Please note that there may be no process in bpf-foo-memcg, which means it
can currently be rmdir'ed by the root user. Meanwhile, we don't forcefully
destroy a memcg just because it has no residents, so this hierarchy is
acceptable.
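
For illustration, map creation from userspace would then look roughly like
the sketch below. The attribute name 'memcg_fd' is only a stand-in for
whatever field the uapi patch in this series actually adds:

	/* Sketch only: 'memcg_fd' stands in for the new bpf_attr field. */
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	static int create_map_in_memcg(void)
	{
		union bpf_attr attr = {};
		int cgrp_fd;

		/* fd of the memcg to charge, e.g. bpf-foo-memcg above */
		cgrp_fd = open("/sys/fs/cgroup/bpf-memcg/bpf-foo-memcg", O_RDONLY);
		if (cgrp_fd < 0)
			return -1;

		attr.map_type    = BPF_MAP_TYPE_HASH;
		attr.key_size    = sizeof(__u32);
		attr.value_size  = sizeof(__u64);
		attr.max_entries = 10240;
		attr.memcg_fd    = cgrp_fd;	/* illustrative field name */

		return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	}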

In order to make the memcg of bpf maps selectable, this patchset
introduces some memory allocation wrappers for map-related memory. These
wrappers get the memcg from the map and charge the allocated pages or
objects to it.
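
A simplified sketch of one such wrapper, based on the existing
bpf_map_kzalloc(); the helpers for getting/putting the map's memcg follow
the patch titles and may differ in detail:

	void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
	{
		struct mem_cgroup *memcg, *old_memcg;
		void *ptr;

		/* charge the memcg saved in the map, not the current task's memcg */
		memcg = bpf_map_get_memcg(map);
		old_memcg = set_active_memcg(memcg);
		ptr = kzalloc(size, flags | __GFP_ACCOUNT);
		set_active_memcg(old_memcg);
		bpf_map_put_memcg(memcg);

		return ptr;
	}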

Currently only bpf maps are supported; this can be extended to bpf progs
as well.

Observability can also be added in a next step, for example showing a bpf
map's memcg in 'bpftool map show', or even showing which maps are charged
to a specific memcg in 'bpftool cgroup show'. Furthermore, we may also show
an accurate memory size of a bpf map instead of an estimated memory size in
'bpftool map show' in the future.

v1->v2:
- cgroup1 is also supported after
  commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1"),
  so the commit log is updated accordingly.
- remove incorrect fix to mem_cgroup_put (Shakeel, Roman, Muchun)
- use cgroup_put() in bpf_map_save_memcg() (Shakeel)
- add detailed commit log for get_obj_cgroup_from_cgroup (Shakeel) 

RFC->v1:
- get rid of bpf_map container wrapper (Alexei)
- add the new field into the end of struct (Alexei)
- get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
- save memcg in bpf_map_init_from_attr
- introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
  kernel/bpf/ringbuf.c  (Andrii)

Yafang Shao (12):
  cgroup: Update the comment on cgroup_get_from_fd
  bpf: Introduce new helper bpf_map_put_memcg()
  bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  bpf: Call bpf_map_init_from_attr() immediately after map creation
  bpf: Save memcg in bpf_map_init_from_attr()
  bpf: Use scoped-based charge in bpf_map_area_alloc
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  bpf: Add return value for bpf_map_init_from_attr
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  40 +++++++++++-
 include/linux/memcontrol.h     |  11 ++++
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/arraymap.c          |  34 ++++++-----
 kernel/bpf/bloom_filter.c      |  11 +++-
 kernel/bpf/bpf_local_storage.c |  17 ++++--
 kernel/bpf/bpf_struct_ops.c    |  19 +++---
 kernel/bpf/cpumap.c            |  17 ++++--
 kernel/bpf/devmap.c            |  30 +++++----
 kernel/bpf/hashtab.c           |  26 +++++---
 kernel/bpf/local_storage.c     |  11 +++-
 kernel/bpf/lpm_trie.c          |  12 +++-
 kernel/bpf/offload.c           |  12 ++--
 kernel/bpf/queue_stack_maps.c  |  11 +++-
 kernel/bpf/reuseport_array.c   |  11 +++-
 kernel/bpf/ringbuf.c           | 104 +++++++++++++++++++++----------
 kernel/bpf/stackmap.c          |  13 ++--
 kernel/bpf/syscall.c           | 136 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c         |   2 +-
 mm/memcontrol.c                |  47 ++++++++++++++
 net/core/sock_map.c            |  30 +++++----
 net/xdp/xskmap.c               |  12 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c            |   3 +-
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/gen_loader.c     |   2 +-
 tools/lib/bpf/libbpf.c         |   2 +
 tools/lib/bpf/skel_internal.h  |   2 +-
 28 files changed, 443 insertions(+), 177 deletions(-)

Comments

Yosry Ahmed Aug. 18, 2022, 7:11 p.m. UTC | #1
On Thu, Aug 18, 2022 at 7:31 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> After commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
> we can open a cgroup1 dir as well. So let's update the comment.

I missed updating the comment in my change. Thanks for catching this.

>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/cgroup/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 5f4502a..b7d2e55 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6625,7 +6625,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
>
>  /**
>   * cgroup_get_from_fd - get a cgroup pointer from a fd
> - * @fd: fd obtained by open(cgroup2_dir)
> + * @fd: fd obtained by open(cgroup_dir)
>   *
>   * Find the cgroup from a fd which should be obtained
>   * by opening a cgroup directory.  Returns a pointer to the
> --
> 1.8.3.1
>
Tejun Heo Aug. 18, 2022, 10:20 p.m. UTC | #2
Hello,

On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting to limit the bpf
> memory, some unexpected issues jumped out at us.
> 1. The memory usage is not consistent between the first generation and
> new generations.
> 2. After the first generation is destroyed, the bpf memory can't be
> limited if the bpf maps are not preallocated, because they will be
> reparented.
> 
> This patchset tries to resolve these issues by introducing an
> independent memcg to limit the bpf memory.

memcg folks would have better informed opinions but from a generic cgroup
pov I don't think this is a good direction to take. This isn't a problem
limited to bpf progs and it doesn't make a whole lot of sense to solve this
just for bpf.

We have the exact same problem for any resource which spans multiple
instances of a service, including page cache, tmpfs instances and any other
thing which can persist longer than process lifetime. My current opinion is
that this is best solved by introducing an extra cgroup layer to represent
the persistent entity and putting the per-instance cgroup under it.

It does require reorganizing how things are organized from userspace POV but
the end result is really desirable. We get entities accurately representing
what needs to be tracked and control over the granularity of accounting and
control (e.g. folks who don't care about telling apart the current
instance's usage can simply not enable controllers at the persistent entity
level).

We surely can discuss other approaches but my current intuition is that it'd
be really difficult to come up with a better solution than layering to
introduce persistent service entities.

So, please consider the approach nacked for the time being.

Thanks.
Tejun Heo Aug. 18, 2022, 10:33 p.m. UTC | #3
On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> We have the exact same problem for any resources which span multiple
> instances of a service including page cache, tmpfs instances and any other
> thing which can persist longer than procss life time. My current opinion is

To expand a bit more on this point, once we start including page cache and
tmpfs, we now get entangled with memory reclaim which then brings in IO and
not-yet-but-eventually CPU usage. Once you start splitting the tree like
you're suggesting here, all those will break down and now we have to worry
about how to split resource accounting and control for the same entities
across two split branches of the tree, which doesn't really make any sense.

So, we *really* don't wanna paint ourselves into that kind of a corner. This
is a dead-end. Please ditch it.

Thanks.
Yafang Shao Aug. 19, 2022, 12:59 a.m. UTC | #4
On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting to limit the bpf
> > memory, some unexpected issues jumped out at us.
> > 1. The memory usage is not consistent between the first generation and
> > new generations.
> > 2. After the first generation is destroyed, the bpf memory can't be
> > limited if the bpf maps are not preallocated, because they will be
> > reparented.
> >
> > This patchset tries to resolve these issues by introducing an
> > independent memcg to limit the bpf memory.
>
> memcg folks would have better informed opinions but from generic cgroup pov
> I don't think this is a good direction to take. This isn't a problem limited
> to bpf progs and it doesn't make whole lot of sense to solve this for bpf.
>

This change is bpf specific. It doesn't refactor a whole lot of things.

> We have the exact same problem for any resources which span multiple
> instances of a service including page cache, tmpfs instances and any other
> thing which can persist longer than procss life time. My current opinion is
> that this is best solved by introducing an extra cgroup layer to represent
> the persistent entity and put the per-instance cgroup under it.
>

It is not practical on k8s, because without the persistent entity the
cgroup dir is stateless, while with it the dir becomes stateful.
Please don't keep turning a blind eye to k8s.

> It does require reorganizing how things are organized from userspace POV but
> the end result is really desirable. We get entities accurately representing
> what needs to be tracked and control over the granularity of accounting and
> control (e.g. folks who don't care about telling apart the current
> instance's usage can simply not enable controllers at the persistent entity
> level).
>

Please also think about why k8s refuses to use cgroup2.

> We surely can discuss other approaches but my current intuition is that it'd
> be really difficult to come up with a better solution than layering to
> introduce persistent service entities.
>
> So, please consider the approach nacked for the time being.
>

It doesn't make sense to nack it.
I will explain by replying to your other email.
Yafang Shao Aug. 19, 2022, 1:09 a.m. UTC | #5
On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > We have the exact same problem for any resources which span multiple
> > instances of a service including page cache, tmpfs instances and any other
> > thing which can persist longer than procss life time. My current opinion is
>
> To expand a bit more on this point, once we start including page cache and
> tmpfs, we now get entangled with memory reclaim which then brings in IO and
> not-yet-but-eventually CPU usage.

Introduce-a-new-layer vs. introduce-a-new-cgroup: which one has more overhead?

> Once you start splitting the tree like
> you're suggesting here, all those will break down and now we have to worry
> about how to split resource accounting and control for the same entities
> across two split branches of the tree, which doesn't really make any sense.
>

k8s has already been broken by the memcg accounting of bpf memory.
In case you missed it, I'll paste it below.
[0] "1. The memory usage is not consistent between the first generation and
new generations."

This issue will persist even if you introduce a new layer.

> So, we *really* don't wanna paint ourselves into that kind of a corner. This
> is a dead-end. Please ditch it.
>

It makes no sense to ditch it.
The hierarchy I described in the commit log is *one* use case of the
selectable memcg, not the *only* use case of it. If you dislike that
hierarchy, I will remove it to avoid misleading you.

Even if you introduce a new layer, you still need the selectable memcg.
For example, to avoid the issue I described in [0], you still need to
charge to the parent cgroup instead of the current cgroup.

That's why I described in the commit log that the selectable memcg is flexible.
Tejun Heo Aug. 19, 2022, 4:45 p.m. UTC | #6
Hello,

On Fri, Aug 19, 2022 at 08:59:20AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote:
> > memcg folks would have better informed opinions but from generic cgroup pov
> > I don't think this is a good direction to take. This isn't a problem limited
> > to bpf progs and it doesn't make whole lot of sense to solve this for bpf.
> 
> This change is bpf specific. It doesn't refactor a whole lot of things.

I'm not sure what point the above sentence is making. It may not change a
lot of code but it does introduce a significantly new mode of operation
which affects memcg and cgroup in general.

> > We have the exact same problem for any resources which span multiple
> > instances of a service including page cache, tmpfs instances and any other
> > thing which can persist longer than procss life time. My current opinion is
> > that this is best solved by introducing an extra cgroup layer to represent
> > the persistent entity and put the per-instance cgroup under it.
> 
> It is not practical on k8s.
> Because, before the persistent entity, the cgroup dir is stateless.
> After, it is stateful.
> Pls, don't continue keeping blind eyes on k8s.

Can you please elaborate why it isn't practical for k8s? I don't know the
details of k8s and what you wrote above is not a detailed enough technical
argument.

> > It does require reorganizing how things are organized from userspace POV but
> > the end result is really desirable. We get entities accurately representing
> > what needs to be tracked and control over the granularity of accounting and
> > control (e.g. folks who don't care about telling apart the current
> > instance's usage can simply not enable controllers at the persistent entity
> > level).
> 
> Pls.s also think about why k8s refuse to use cgroup2.

This attitude really bothers me. You aren't spelling it out fully, but
instead of engaging in the technical argument at hand, you're putting
forth conforming upstream to the current k8s assumptions and behaviors as
a requirement and then insisting that it's upstream's fault that k8s is
staying with cgroup1.

This is not an acceptable form of argument and it would be irresponsible to
grant any kind of weight to this line of reasoning. k8s may seem like the
world to you but it is one of many use cases of the upstream kernel. We all
should pay attention to the use cases and technical arguments to determine
how we chart our way forward, but being k8s or whoever else clearly isn't a
waiver to make this kind of unilateral demand.

It's okay to emphasize the gravity of the specific use case at hand but
please realize that it's one of the many factors that should be considered
and sometimes one which can and should get trumped by others.

Thanks.
Tejun Heo Aug. 19, 2022, 5:06 p.m. UTC | #7
Hello,

On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > We have the exact same problem for any resources which span multiple
> > > instances of a service including page cache, tmpfs instances and any other
> > > thing which can persist longer than procss life time. My current opinion is
> >
> > To expand a bit more on this point, once we start including page cache and
> > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > not-yet-but-eventually CPU usage.
> 
> Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?

Introducing a new layer in cgroup2 doesn't mean that any specific resource
controller is enabled, so there is no runtime overhead difference. In terms
of logical complexity, introducing a localized layer seems a lot more
straightforward than building a whole separate tree.

Note that the same applies to cgroup1 where a collapsed controller tree is
represented by simply not creating those layers in that particular
controller tree.

No matter how we cut the problem here, if we want to track these persistent
resources, we have to create a cgroup to host them somewhere. The discussion
we're having is mostly around where to put them. With your proposal, it can
be anywhere and you draw out an example where the persistent cgroups form
their own separate tree. What I'm saying is that the logical place to put it
is where the current resource consumption is and we just need to put the
persistent entity as the parent of the instances.

Flexibility, just like anything else, isn't free. Here, if we extrapolate
this approach, the cost is evidently hefty in that it doesn't generically
work with the basic resource control structure.

> > Once you start splitting the tree like
> > you're suggesting here, all those will break down and now we have to worry
> > about how to split resource accounting and control for the same entities
> > across two split branches of the tree, which doesn't really make any sense.
> 
> The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> If you ignored it, I paste it below.
> [0]"1. The memory usage is not consistent between the first generation and
> new generations."
> 
> This issue will persist even if you introduce a new layer.

Please watch your tone.

Again, this isn't a problem specific to k8s. We have the same problem with
e.g. persistent tmpfs. One idea which I'm not against is allowing specific
resources to be charged to an ancestor. We gotta think carefully about how
such charges should be granted / denied but an approach like that jibes well
with the existing hierarchical control structure and because introducing a
persistent layer does too, the combination of the two works well.

> > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > is a dead-end. Please ditch it.
> 
> It makes non-sensen to ditch it.
> Because, the hierarchy I described in the commit log is *one* use case
> of the selectable memcg, but not *the only one* use case of it. If you
> dislike that hierarchy, I will remove it to avoid misleading you.

But if you drop that, what'd be the rationale for adding what you're
proposing? Why would we want bpf memory charges to be attached to any part
of the hierarchy?

> Even if you introduce a new layer, you still need the selectable memcg.
> For example, to avoid the issue I described in [0],  you still need to
> charge to the parent cgroup instead of the current cgroup.

As I wrote above, we've been discussing exactly that. Again, I'd be a lot
more amenable to such an approach because it fits with how everything is
structured.

> That's why I described in the commit log that the selectable memcg is flexible.

Hopefully, my point on this is clear by now.

Thanks.
Yafang Shao Aug. 20, 2022, 2:25 a.m. UTC | #8
On Sat, Aug 20, 2022 at 1:06 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > > We have the exact same problem for any resources which span multiple
> > > > instances of a service including page cache, tmpfs instances and any other
> > > > thing which can persist longer than procss life time. My current opinion is
> > >
> > > To expand a bit more on this point, once we start including page cache and
> > > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > > not-yet-but-eventually CPU usage.
> >
> > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?
>
> Introducing a new layer in cgroup2 doesn't mean that any specific resource
> controller is enabled, so there is no runtime overhead difference. In terms
> of logical complexity, introducing a localized layer seems a lot more
> straightforward than building a whole separate tree.
>
> Note that the same applies to cgroup1 where collapsed controller tree is
> represented by simply not creating those layers in that particular
> controller tree.
>

No, we have observed in our production env that multi-layer cpuacct
causes an obvious performance hit due to cache misses.

> No matter how we cut the problem here, if we want to track these persistent
> resources, we have to create a cgroup to host them somewhere. The discussion
> we're having is mostly around where to put them. With your proposal, it can
> be anywhere and you draw out an example where the persistent cgroups form
> their own separate tree. What I'm saying is that the logical place to put it
> is where the current resource consumption is and we just need to put the
> persistent entity as the parent of the instances.
>
> Flexibility, just like anything else, isn't free. Here, if we extrapolate
> this approach, the cost is evidently hefty in that it doesn't generically
> work with the basic resource control structure.
>
> > > Once you start splitting the tree like
> > > you're suggesting here, all those will break down and now we have to worry
> > > about how to split resource accounting and control for the same entities
> > > across two split branches of the tree, which doesn't really make any sense.
> >
> > The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> > If you ignored it, I paste it below.
> > [0]"1. The memory usage is not consistent between the first generation and
> > new generations."
> >
> > This issue will persist even if you introduce a new layer.
>
> Please watch your tone.
>

Hm? I apologize if my words offended you.
But could you please take a serious look at the patchset before giving a
NACK? You didn't even want to know the background before you sent your NACK.

> Again, this isn't a problem specific to k8s. We have the same problem with
> e.g. persistent tmpfs. One idea which I'm not against is allowing specific
> resources to be charged to an ancestor. We gotta think carefully about how
> such charges should be granted / denied but an approach like that jives well
> with the existing hierarchical control structure and because introducing a
> persistent layer does too, the combination of the two works well.
>
> > > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > > is a dead-end. Please ditch it.
> >
> > It makes non-sensen to ditch it.
> > Because, the hierarchy I described in the commit log is *one* use case
> > of the selectable memcg, but not *the only one* use case of it. If you
> > dislike that hierarchy, I will remove it to avoid misleading you.
>
> But if you drop that, what'd be the rationale for adding what you're
> proposing? Why would we want bpf memory charges to be attached any part of
> the hierarchy?
>

I have explained it to you, but unfortunately you ignored it again.
I don't mind explaining it again.

                 Parent-memcg
                     \
                   Child-memcg (k8s pod)

The user can charge the memory to the parent directly without charging it
to the k8s pod.
Then memory.stat is consistent between different generations.

> > Even if you introduce a new layer, you still need the selectable memcg.
> > For example, to avoid the issue I described in [0],  you still need to
> > charge to the parent cgroup instead of the current cgroup.
>
> As I wrote above, we've been discussing the above. Again, I'd be a lot more
> amenable to such approach because it fits with how everything is structured.
>
> > That's why I described in the commit log that the selectable memcg is flexible.
>
> Hopefully, my point on this is clear by now.
>

Unfortunately, you didn't want to get my point.
Tejun Heo Aug. 22, 2022, 11:29 a.m. UTC | #9
(Sorry, this is a resend. I messed up the header in the first posting.)

Hello,

This thread started on a bpf-specific memory tracking change proposal and
went south, but a lot of people who would be interested are already cc'd, so
I'm hijacking it to discuss what to do w/ persistent memory usage tracking.

Cc'ing Mina and Yosry who were involved in the discussions on the similar
problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
experience than me, and Lennart for his thoughts from systemd side.

The root problem is that there are resources (almost solely memory
currently) that outlive a given instance of a, to use systemd-lingo,
service. Page cache is the most common case.

Let's say there's system.slice/hello.service. When it runs for the first
time, page cache backing its binary will be charged to hello.service.
However, when it restarts after e.g. a config change, when the initial
hello.service cgroup gets destroyed, we reparent the page cache charges to
the parent system.slice and when the second instance starts, its binary will
stay charged to system.slice. Over time, some may get reclaimed and
refaulted into the new hello.service but that's not guaranteed and most hot
pages likely won't.

The same problem exists for any memory which is not freed synchronously when
the current instance exits. While this isn't a problem for many cases, it's
not difficult to imagine situations where the amount of memory which ends up
getting pushed to the parent is significant, even clear majority, with big
page cache footprint, persistent tmpfs instances and so on, creating issues
with accounting accuracy and thus control.

I think there are two broad issues to discuss here:

[1] Can this be solved by layering the instance cgroups under a persistent
    entity cgroup?

So, instead of system.slice/hello.service, the application runs inside
something like system.slice/hello.service/hello.service.instance and the
service-level cgroup hello.service is not destroyed as long as it is
something worth tracking on the system.

The benefits are

a. While requiring changing how userland organizes the cgroup hierarchy, it is a
   straight-forward extension of the current architecture and doesn't
   require any conceptual or structural changes. All the accounting and
   control schemes work exactly the same as before. The only difference is
   that we now have a persistent entity representing each service as we want
   to track their persistent resource usages.

b. Per-instance tracking and control is optional. To me, it seems that the
   persistent resource usages would be more meaningful than per-instance and
   tracking down to the persistent usages shouldn't add noticeable runtime
   overheads while keeping per-instance process management niceties and
   allowing use cases to opt-in for per-instance resource tracking and
   control as needed.

The complications are:

a. It requires changing cgroup hierarchy in a very visible way.

b. What should be the lifetime rules for persistent cgroups? Do we keep them
   around forever or maybe they can be created on the first use and kept
   around until the service is removed from the system? When the persistent
   cgroup is removed, do we need to make sure that the remaining resource
   usages are low enough? Note that this problem exists for any approach
   that tries to track persistent usages no matter how it's done.

c. Do we need to worry about nesting overhead? Given that there's no reason
   to enable controllers w/o persistent state at the instance level and the
   nesting overhead is pretty low for memcg, this doesn't seem like a
   problem to me. If this becomes a problem, we just need to fix it.

A couple alternatives discussed are:

a. Userspace keeps reusing the same cgroup for different instances of the
   same service. This simplifies some aspects while making others more
   complicated. e.g. Determining the current instance's CPU or IO usages now
   requires the monitoring software to remember what they were when this
   instance started and calculating the deltas. Also, if some use cases want
   to distinguish persistent vs. instance usages (more on this later), this
   isn't gonna work. That said, this definitely is attractive in that it
   minimizes overt user-visible changes.

b. Memory is disassociated rather than just reparented on cgroup destruction
   and gets re-charged to the next first user. This is attractive in that it
   doesn't require any userspace changes; however, I'm not sure how this
   would work for non-pageable memory usages such as bpf maps. How would we
   detect the next first usage?


[2] Whether and how to solve first and second+ instance charge differences.

If we take the layering approach, the first instance will get charged for
all memory that it uses while the second+ instances likely won't get charged
for a lot of persistent usages. I don't think there is a consensus on
whether this needs to be solved and I don't have enough context to form a
strong opinion. memcg folks are a lot better equipped to make this decision.

Assuming this needs to be solved, here's a braindump to be taken with a big
pinch of salt:

I have a bit of a difficult time imagining a perfect solution given that
whether a given page cache page is persistent or not would be really
difficult to know (or maybe all page cache is persistent by default while
anon is not). However, the problem still seems worthwhile to consider for
big ticket items such as persistent tmpfs mounts and huge bpf maps as they
can easily make the differences really big.

If we want to solve this problem, here are options that I can think of:

a. Let userspace put the charges where they belong using the current
   mechanisms. ie. Create persistent entities in the persistent parent
   cgroup while there's no current instance.

   Pro: It won't require any major kernel or interface changes. There still
   needs to be some tweaking, such as allowing tmpfs pages to always be
   charged to the cgroup which created the instance (maybe as long as it's
   an ancestor of the faulting cgroup?) but nothing too invasive.

   Con: It may not be flexible enough.

b. Let userspace specify which cgroup to charge for some constructs like
   tmpfs and bpf maps. The key problems with this approach are

   1. How to grant/deny what can be charged where. We must ensure that a
      descendant can't move charges up or across the tree without the
      ancestors allowing it.

   2. How to specify the cgroup to charge. While specifying the target
      cgroup directly might seem like an obvious solution, it has a couple
      rather serious problems. First, if the descendant is inside a cgroup
      namespace, it might not be able to see the target cgroup at all. Second,
      it's an interface which is likely to cause misunderstandings on how it
      can be used. It's too broad an interface.

   One solution that I can think of is leveraging the resource domain
   concept which is currently only used for threaded cgroups. All memory
   usages of threaded cgroups are charged to their resource domain cgroup
   which hosts the processes for those threads. The persistent usages have a
   similar pattern, so maybe the service level cgroup can declare that it's
   the encompassing resource domain and the instance cgroup can say whether
   it's gonna charge e.g. the tmpfs instance to its own or the encompassing
   resource domain.

   This has the benefit that the user only needs to indicate its intention
   without worrying about how cgroups are composed and what their IDs are.
   It just indicates whether the given resource is persistent and if the
   cgroup hierarchy is set up for that, it gets charged that way and if not
   it can be just charged to itself.

   This is a shower thought but, if we allow nesting such domains (and maybe
   name them), we can use it for shared resources too so that co-services
   are put inside a shared slice and shared resources are pushed to the
   slice level.

This became pretty long. I obviously have a pretty strong bias towards
solving this within the current basic architecture but other than that most
of these decisions are best made by memcg folks. We can hopefully build some
consensus on the issue.

Thanks.
Shakeel Butt Aug. 22, 2022, 4:12 p.m. UTC | #10
Ccing Mina.

On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> [...]
Mina Almasry Aug. 22, 2022, 7:02 p.m. UTC | #11
On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>
> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes cgroup hiearchy, it is a
>    straight-forward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persisten states for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    miminizes overt user visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and get re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>
>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>

The problem of first and second+ instance charges is something I've
run into when dealing with tmpfs charges, and something I believe
Yosry ran into with regard to bpf as well, so AFAIU we think it's
worth addressing, but I'd also like to hear from other memcg folks
whether this is something that needs to be solved for them as well.
For us this is something we want to fix, because the memory usage of
these services is hard to predict and limit since the memory usage of
the first charger may be much greater than that of the second+.

> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    need to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>

I believe this works for us, provided memory is charged to the
persistent parent entity directly and is not charged to the child
instance. If memory is charged to the child instance first and then
reparented, I think we still have the same problem, because the first
instance is charged for this memory but the second+ instances are not,
which makes predicting the memory usage of such use cases difficult.

> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>
>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might be able to see the target cgroup at all. Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>

This is pretty much the solution I sent out for review about a year
ago and yes, it suffers from the issues you've brought up:
https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/


>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>

I think this sounds excellent and addresses our use cases. Basically
the tmpfs/bpf memory would get charged to the encompassing resource
domain cgroup rather than the instance cgroup, making the memory usage
of the first and second+ instances consistent and predictable.

Would love to hear from other memcg folks what they would think of
such an approach. I would also love to hear what kind of interface you
have in mind. Perhaps a cgroup tunable that says whether it's going to
charge the tmpfs/bpf instance to itself or to the encompassing
resource domain?

>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>
> Thanks.
>
> --
> tejun
>
Johannes Weiner Aug. 22, 2022, 9:19 p.m. UTC | #12
On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > b. Let userspace specify which cgroup to charge for some of constructs like
> >    tmpfs and bpf maps. The key problems with this approach are
> >
> >    1. How to grant/deny what can be charged where. We must ensure that a
> >       descendant can't move charges up or across the tree without the
> >       ancestors allowing it.
> >
> >    2. How to specify the cgroup to charge. While specifying the target
> >       cgroup directly might seem like an obvious solution, it has a couple
> >       rather serious problems. First, if the descendant is inside a cgroup
> >       namespace, it might be able to see the target cgroup at all. Second,
> >       it's an interface which is likely to cause misunderstandings on how it
> >       can be used. It's too broad an interface.
> >
> 
> This is pretty much the solution I sent out for review about a year
> ago and yes, it suffers from the issues you've brought up:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> 
> 
> >    One solution that I can think of is leveraging the resource domain
> >    concept which is currently only used for threaded cgroups. All memory
> >    usages of threaded cgroups are charged to their resource domain cgroup
> >    which hosts the processes for those threads. The persistent usages have a
> >    similar pattern, so maybe the service level cgroup can declare that it's
> >    the encompassing resource domain and the instance cgroup can say whether
> >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> >    resource domain.
> >
> 
> I think this sounds excellent and addresses our use cases. Basically
> the tmpfs/bpf memory would get charged to the encompassing resource
> domain cgroup rather than the instance cgroup, making the memory usage
> of the first and second+ instances consistent and predictable.
> 
> Would love to hear from other memcg folks what they would think of
> such an approach. I would also love to hear what kind of interface you
> have in mind. Perhaps a cgroup tunable that says whether it's going to
> charge the tmpfs/bpf instance to itself or to the encompassing
> resource domain?

I like this too. It makes shared charging predictable, with a coherent
resource hierarchy (congruent OOM, CPU, IO domains), and without the
need for cgroup paths in tmpfs mounts or similar.

As far as who is declaring what goes, though: if the instance groups
can declare arbitrary files/objects persistent or shared, they'd be
able to abuse this and sneak private memory past local limits and
burden the wider persistent/shared domain with it.

I'm thinking it might make more sense for the service level to declare
which objects are persistent and shared across instances.

If that's the case, we may not need a two-component interface. Just
the ability for an intermediate cgroup to say: "This object's future
memory is to be charged to me, not the instantiating cgroup."

Can we require a process in the intermediate cgroup to set up the file
or object, and use madvise/fadvise to say "charge me", before any
instances are launched?
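
Purely as an illustration of that flow (FADV_CHARGE_ME below is a made-up
placeholder, not an existing advice value; a real interface could look
quite different):

	#include <fcntl.h>
	#include <unistd.h>

	#define FADV_CHARGE_ME	1024	/* hypothetical advice value */

	/*
	 * Run by a setup process living in the service-level (intermediate)
	 * cgroup, before any instance is launched.
	 */
	static int setup_shared_file(const char *path)
	{
		int ret, fd;

		fd = open(path, O_CREAT | O_RDWR, 0600);
		if (fd < 0)
			return -1;

		/* "future memory backing this file is charged to my cgroup" */
		ret = posix_fadvise(fd, 0, 0, FADV_CHARGE_ME);
		close(fd);
		return ret;
	}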
Mina Almasry Aug. 22, 2022, 9:52 p.m. UTC | #13
On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > >    One solution that I can think of is leveraging the resource domain
> > >    concept which is currently only used for threaded cgroups. All memory
> > >    usages of threaded cgroups are charged to their resource domain cgroup
> > >    which hosts the processes for those threads. The persistent usages have a
> > >    similar pattern, so maybe the service level cgroup can declare that it's
> > >    the encompassing resource domain and the instance cgroup can say whether
> > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > >    resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this on a file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain and
all its instances need to re-use this exact same file.

Our kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.

Our other use case, which I hope to address here as well, is a
service-client relationship from [1] where the service would like to
charge per-client memory back to the client itself. In this case the
service or client can create a mount from the shared domain and pass
it to the service at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?
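
As a sketch, a per-mount variant could look like the below, assuming a
hypothetical "memcg=" tmpfs mount option that names the shared/persistent
domain (tmpfs has no such option today; it is invented for illustration):

#include <sys/mount.h>

/*
 * The container management agent mounts the shared tmpfs once, naming
 * the cgroup that all future charges should go to; every file the
 * instances later create under /mnt/shared would be charged there.
 */
int mount_shared_tmpfs(void)
{
	return mount("tmpfs", "/mnt/shared", "tmpfs", 0,
		     "size=1G,memcg=/sys/fs/cgroup/system.slice/shared");
}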

Yosry, would a proposal like so be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
Roman Gushchin Aug. 23, 2022, 3:01 a.m. UTC | #14
On Mon, Aug 22, 2022 at 05:19:50PM -0400, Johannes Weiner wrote:
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> > 
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> > 
> > 
> > >    One solution that I can think of is leveraging the resource domain
> > >    concept which is currently only used for threaded cgroups. All memory
> > >    usages of threaded cgroups are charged to their resource domain cgroup
> > >    which hosts the processes for those threads. The persistent usages have a
> > >    similar pattern, so maybe the service level cgroup can declare that it's
> > >    the encompassing resource domain and the instance cgroup can say whether
> > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > >    resource domain.
> > >
> > 
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> > 
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
> 
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
> 
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
> 
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.

I like this idea.

> 
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
> 
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

We need to think how to make this interface convenient to use.
First, these persistent resources are likely created by some agent software,
not the main workload. So the requirement to call madvise() from the
actual cgroup might be not easily achievable.

So _maybe_ something like writing a fd into cgroup.memory.resources.
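
For example (purely hypothetical, since no such cgroup interface file
exists), an agent could create or pin the shared object and then register
its fd with the service-level cgroup:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical flow: write the fd of a shared object (tmpfs file, pinned
 * bpf map, ...) into an invented "memory.resources" file of the cgroup
 * that should pick up the object's future memory charges.
 */
int register_persistent_fd(const char *cgroup_dir, int obj_fd)
{
	char path[4096], buf[16];
	int cfd, len, ret;

	snprintf(path, sizeof(path), "%s/memory.resources", cgroup_dir);
	len = snprintf(buf, sizeof(buf), "%d\n", obj_fd);

	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		return -1;
	ret = write(cfd, buf, len) == len ? 0 : -1;
	close(cfd);
	return ret;
}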

Second, it would be really useful to export the current configuration
to userspace. E.g. a user should be able to query to which cgroup the given
bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
it will create a problem for userspace programs which manage cgroups
(e.g. systemd): they should be able to restore the current configuration
from the kernel state, without "remembering" what has been configured
before.

Thanks!
Tejun Heo Aug. 23, 2022, 3:14 a.m. UTC | #15
Hello,

On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > >    One solution that I can think of is leveraging the resource domain
> > > >    concept which is currently only used for threaded cgroups. All memory
> > > >    usages of threaded cgroups are charged to their resource domain cgroup
> > > >    which hosts the processes for those threads. The persistent usages have a
> > > >    similar pattern, so maybe the service level cgroup can declare that it's
> > > >    the encompassing resource domain and the instance cgroup can say whether
> > > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > >    resource domain.
> > > >
> > > 
> > > I think this sounds excellent and addresses our use cases. Basically
> > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > domain cgroup rather than the instance cgroup, making the memory usage
> > > of the first and second+ instances consistent and predictable.
> > > 
> > > Would love to hear from other memcg folks what they would think of
> > > such an approach. I would also love to hear what kind of interface you
> > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > resource domain?
> > 
> > I like this too. It makes shared charging predictable, with a coherent
> > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > need for cgroup paths in tmpfs mounts or similar.
> > 
> > As far as who is declaring what goes, though: if the instance groups
> > can declare arbitrary files/objects persistent or shared, they'd be
> > able to abuse this and sneak private memory past local limits and
> > burden the wider persistent/shared domain with it.

My thought was that the persistent cgroup and instance cgroups should belong
to the same trust domain and system level control should be applied at the
resource domain level. The application may decide to shift between
persistent and per-instance however it wants to and may even configure
resource control at that level but all that's for its own accounting
accuracy and benefit.

> > I'm thinking it might make more sense for the service level to declare
> > which objects are persistent and shared across instances.
> 
> I like this idea.
> 
> > If that's the case, we may not need a two-component interface. Just
> > the ability for an intermediate cgroup to say: "This object's future
> > memory is to be charged to me, not the instantiating cgroup."
> > 
> > Can we require a process in the intermediate cgroup to set up the file
> > or object, and use madvise/fadvise to say "charge me", before any
> > instances are launched?
> 
> We need to think how to make this interface convenient to use.
> First, these persistent resources are likely created by some agent software,
> not the main workload. So the requirement to call madvise() from the
> actual cgroup might be not easily achievable.

So one worry that I have for this is that it requires the application itself
to be aware of cgroup topologies and restructure itself so that allocation of
those resources is factored out into something else. Maybe that's not a
huge problem but it may limit its applicability quite a bit.

If we can express all the resource constraints and structures on the cgroup
side, configured by the management agent, the application can simply e.g.
madvise whatever memory region or flag bpf maps as "these are persistent"
and the rest can be handled by the system. If the agent sets up the
environment for that, it gets accounted accordingly; otherwise, it'd behave
as if the tagging didn't exist. Asking the application to set up all its
resources in separate steps might require significant restructuring
and knowledge of how the hierarchy is set up in many cases.
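
As an illustration of the bpf side of that, the application-visible part
could be as small as a map flag; BPF_F_PERSISTENT_CHARGE below is an
invented name standing in for whatever hint the kernel would actually grow:

#include <bpf/bpf.h>

/* Invented flag, for illustration only; today's kernels would reject it. */
#define BPF_F_PERSISTENT_CHARGE		(1U << 30)

/*
 * The application only declares "this map's memory is persistent";
 * whether that ends up charged to the encompassing resource domain or
 * to the instance cgroup depends on how the agent set up the hierarchy.
 */
int create_persistent_map(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts,
		    .map_flags = BPF_F_PERSISTENT_CHARGE);

	return bpf_map_create(BPF_MAP_TYPE_HASH, "persistent_map",
			      sizeof(int), sizeof(long), 10240, &opts);
}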

> So _maybe_ something like writing a fd into cgroup.memory.resources.
> 
> Second, it would be really useful to export the current configuration
> to userspace. E.g. a user should be able to query to which cgroup the given
> bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
> it will create a problem for userspace programs which manage cgroups
> (e.g. systemd): they should be able to restore the current configuration
> from the kernel state, without "remembering" what has been configured
> before.

This too can be achieved by separating out cgroup setup and tagging specific
resources. Agent and cgroup know what each cgroup is supposed to do as they
already do now and each resource is tagged whether they're persistent or
not, so everything is always known without the agent and the application
having to explicitly share the information.

Thanks.
Yafang Shao Aug. 23, 2022, 11:08 a.m. UTC | #16
On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>

Hi,

Below is some background of kubernetes.
In kubernetes, a pod is organized as follows,

               pod
               |- Container
               |- Container

IOW, it is a two-layer unit, or a two-layer instance.
The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver.
Once the old pod is destroyed (that can happen when the user wants to
update their service), the new pod will have a different UUID.
That said, different instances will have different cgroup dir.

If we want to introduce a  persistent entity cgroup, we have to make
it a three-layer unit.

           persistent-entity
           |- pod
                 |- Container
                 |- Container

There will be some issues,
1.  The kube-apiserver must maintain the persistent-entity on each host.
     It needs a great refactor and the compatibility is also a problem
per my discussion with kubernetes experts.
2.  How to do the monitor?
     If there's only one pod under this persistent-entity, we can
easily get the memory size of  shared resources by:
         Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
    But what if it has N pods and N is dynamically changed ?
3.  What if it has more than one shared resource?
     For example, pod-foo has two shared resources A and B, pod-bar
has two shared resources A and C, and another pod has two shared
resources B and C.
     How to deploy them?
     Pls, note that we can introduce multiple-layer persistent-entity,
but which one should be the parent ?

So from my perspective, it is almost impossible.

> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes cgroup hierarchy, it is a
>    straight-forward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persistent states for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    minimizes overt user visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and get re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>

JFYI, There is a reuse path for the bpf map, see my previous RFC[1].
[1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/

>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>

Just sharing our practice.
For many of our users, it means the memcg is unreliable (at least for
observability) when the memory usage is inconsistent.
So they prefer to drop the page cache (by echoing to memory.force_empty)
when the container (pod) is destroyed, at the cost of taking time to
reload these page caches next time.  Reliability is more important
than performance.

> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    need to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>
> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>

We can add restrictions to check which memcg can be selected
(regarding the selectable memcg).
But I think it may be too early to do the restrictions, as only the
privileged user can set it.
It is the sys admin's responsibility to select a proper memcg.
That said, the selectable memcg is not going south.

>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might not be able to see the target cgroup at all.

It is not a problem. Just sharing our practice below.
$ docker run -tid --privileged \
    --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
    --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
    docker-image

The bind-mount can make it work.
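
From inside the container, the loader can then open the bind-mounted
directory to get a cgroup fd and hand it to the map creation interface
proposed in this series (the exact bpf_attr field is defined by the
patches themselves and not spelled out here):

#include <fcntl.h>

/*
 * /bpf-memcg is the bind-mount target from the docker invocation above.
 * The returned fd is what a "selectable memcg" map-creation attribute
 * would consume, so the map's memory is charged to the dedicated bpf
 * memcg instead of the container's own memcg.
 */
int open_bpf_memcg_fd(void)
{
	return open("/bpf-memcg", O_RDONLY | O_DIRECTORY);
}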

>  Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>

As I said above, we just need some restrictions or guidance if that is
desired now.

>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>
>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>

JFYI.
I apologize in advance if my words offend you again.
Tejun Heo Aug. 23, 2022, 5:12 p.m. UTC | #17
Hello,

On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote:
> On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
> > [1] Can this be solved by layering the instance cgroups under persistent
> >     entity cgroup?
>
> Below is some background of kubernetes.
> In kubernetes, a pod is organized as follows,
> 
>                pod
>                |- Container
>                |- Container
> 
> IOW, it is a two-layer unit, or a two-layer instance.
> The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver.
> Once the old pod is destroyed (that can happen when the user wants to
> update their service), the new pod will have a different UUID.
> That said, different instances will have different cgroup dir.
> 
> If we want to introduce a  persistent entity cgroup, we have to make
> it a three-layer unit.
> 
>            persistent-entity
>            |- pod
>                  |- Container
>                  |- Container
> 
> There will be some issues,
> 1.  The kube-apiserver must maintain the persistent-entity on each host.
>      It needs a great refactor and the compatibility is also a problem
> per my discussion with kubernetes experts.

This is gonna be true for anybody. The basic strategy here should be
defining a clear boundary between system agent and applications so that
individual applications, even when they build their own subhierarchy
internally, aren't affected by system level hierarchy configuration changes.
systemd is already like that with clear delegation boundary. Our (fb)
container management is like that too. So, while this requires some
reorganization from the agent side, things like this don't create huge
backward compatibility issues involving applications. I have no idea about
k8s, so the situation may differ but in the long term at least, it'd be a
good idea to build in similar conceptual separation even if it stays with
cgroup1.

> 2.  How to do the monitor?
>      If there's only one pod under this persistent-entity, we can
> easily get the memory size of  shared resources by:
>          Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
>     But what if it has N pods and N is dynamically changed ?

There should only be one live pod instance inside the pod's persistent
cgroup, so the calculation doesn't really change.

> 3.  What if it has more than one shared resource?
>      For example, pod-foo has two shared resources A and B, pod-bar
> has two shared resources A and C, and another pod has two shared
> resources B and C.
>      How to deploy them?
>      Pls, note that we can introduce multiple-layer persistent-entity,
> but which one should be the parent ?
> 
> So from my perspective, it is almost impossible.

Yeah, this is a different problem and cgroup has never been good at tracking
resources shared across multiple groups. There is always tension between
overhead, complexity and accuracy and the tradeoff re. resource sharing has
almost always been towards the former two.

Naming specific resources and designating them as being shared and
accounting them commonly somehow does make sense as an approach as we get to
avoid the biggest headaches (e.g. how to split a page cache page?) and maybe
it can even be claimed that a lot of use cases which may want cross-group
sharing can be sufficiently served by such approach.

That said, it still has to be balanced against other factors. For example,
memory pressure caused by the shared resources should affect all
participants in the sharing group in terms of both memory and IO, which
means that they'll have to stay within a nested subtree structure. This does
restrict overlapping partial sharing that you described above but the goal
is finding a reasonable tradeoff, so something has to give.

> > b. Memory is disassociated rather than just reparented on cgroup destruction
> >    and get re-charged to the next first user. This is attractive in that it
> >    doesn't require any userspace changes; however, I'm not sure how this
> >    would work for non-pageable memory usages such as bpf maps. How would we
> >    detect the next first usage?
> >
> 
> JFYI, There is a reuse path for the bpf map, see my previous RFC[1].
> [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/

I'm not a big fan of explicit recharging. It's too hairy to use, requiring
mixing system level hierarchy configuration knowledge with in-application
resource handling. There should be clear isolation between the two. This is
also what leads to namespace and visibility issues. Ideally, these should be
handled by structuring the resource hierarchy correctly from the get-go.

...
> > b. Let userspace specify which cgroup to charge for some of constructs like
> >    tmpfs and bpf maps. The key problems with this approach are
> >
> >    1. How to grant/deny what can be charged where. We must ensure that a
> >       descendant can't move charges up or across the tree without the
> >       ancestors allowing it.
> >
> 
> We can add restrictions to check which memcg can be selected
> (regarding the selectable memcg).
> But I think it may be too early to do the restrictions, as only the
> privileged user can set it.
> It is the sys admin's responsibility to select a proper memcg.
> That said, the selectable memcg is not going south.

I generally tend towards shaping interfaces more carefully. We can of course
add a do-whatever-you-want interface and declare that it's for root only but
even that becomes complicated with things like userns as we're finding out
in different areas, but again, nothing is free and approaches like that
often bring more longterm headaches than the problems they solve.

> >    2. How to specify the cgroup to charge. While specifying the target
> >       cgroup directly might seem like an obvious solution, it has a couple
> >       rather serious problems. First, if the descendant is inside a cgroup
> >       namespace, it might not be able to see the target cgroup at all.
> 
> It is not a problem. Just sharing our practice below.
> $ docker run -tid --privileged \
>     --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
>     --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
>     docker-image
> 
> The bind-mount can make it work.

Not mount namespace. cgroup namespace which is used to isolate the cgroup
subtree that's visible to each container. Please take a look at
cgroup_namespaces(7).

> >  Second,
> >       it's an interface which is likely to cause misunderstandings on how it
> >       can be used. It's too broad an interface.
> >
> 
> As I said above, we just need some restrictions or guidance if that is
> desired now.

I'm really unlikely to go down that path because as I said before I believe
that that's a road to long term headaches and already creates immediate
problems in terms of how it'd interact with other resources. No matter what
approach we may choose, it has to work with the existing resource hierarchy.
Otherwise, we end up in a situation where we have to split accounting and
control of other controllers too which makes little sense for non-memory
controllers (not that it's great for memory in the first place).

Thanks.
Yafang Shao Aug. 24, 2022, 11:57 a.m. UTC | #18
On Wed, Aug 24, 2022 at 1:12 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote:
> > On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
> > > [1] Can this be solved by layering the instance cgroups under persistent
> > >     entity cgroup?
> >
> > Below is some background of kubernetes.
> > In kubernetes, a pod is organized as follows,
> >
> >                pod
> >                |- Container
> >                |- Container
> >
> > IOW, it is a two-layer unit, or a two-layer instance.
> > The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver.
> > Once the old pod is destroyed (that can happen when the user wants to
> > update their service), the new pod will have a different UUID.
> > That said, different instances will have different cgroup dir.
> >
> > If we want to introduce a  persistent entity cgroup, we have to make
> > it a three-layer unit.
> >
> >            persistent-entity
> >            |- pod
> >                  |- Container
> >                  |- Container
> >
> > There will be some issues,
> > 1.  The kube-apiserver must maintain the persistent-entity on each host.
> >      It needs a great refactor and the compatibility is also a problem
> > per my discussion with kubernetes experts.
>
> This is gonna be true for anybody. The basic strategy here should be
> defining a clear boundary between system agent and applications so that
> individual applications, even when they build their own subhierarchy
> internally, aren't affected by system level hierarchy configuration changes.
> systemd is already like that with clear delegation boundary. Our (fb)
> container management is like that too. So, while this requires some
> reorganization from the agent side, things like this don't create huge
> backward compatibility issues involving applications. I have no idea about
> k8s, so the situation may differ but in the long term at least, it'd be a
> good idea to build in similar conceptual separation even if it stays with
> cgroup1.
>
> > 2.  How to do the monitor?
> >      If there's only one pod under this persistent-entity, we can
> > easily get the memory size of  shared resources by:
> >          Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
> >     But what if it has N pods and N is dynamically changed ?
>
> There should only be one live pod instance inside the pod's persistent
> cgroup, so the calculation doesn't really change.
>
> > 3.  What if it has more than one shared resource?
> >      For example, pod-foo has two shared resources A and B, pod-bar
> > has two shared resources A and C, and another pod has two shared
> > resources B and C.
> >      How to deploy them?
> >      Pls, note that we can introduce multiple-layer persistent-entity,
> > but which one should be the parent ?
> >
> > So from my perspective, it is almost impossible.
>
> Yeah, this is a different problem and cgroup has never been good at tracking
> resources shared across multiple groups. There is always tension between
> overhead, complexity and accuracy and the tradeoff re. resource sharing has
> almost always been towards the former two.
>
> Naming specific resources and designating them as being shared and
> accounting them commonly somehow does make sense as an approach as we get to
> avoid the biggest headaches (e.g. how to split a page cache page?) and maybe
> it can even be claimed that a lot of use cases which may want cross-group
> sharing can be sufficiently served by such approach.
>
> That said, it still has to be balanced against other factors. For example,
> memory pressure caused by the shared resources should affect all
> participants in the sharing group in terms of both memory and IO, which
> means that they'll have to stay within a nested subtree structure. This does
> restrict overlapping partial sharing that you described above but the goal
> is finding a reasonable tradeoff, so something has to give.
>
> > > b. Memory is disassociated rather than just reparented on cgroup destruction
> > >    and get re-charged to the next first user. This is attractive in that it
> > >    doesn't require any userspace changes; however, I'm not sure how this
> > >    would work for non-pageable memory usages such as bpf maps. How would we
> > >    detect the next first usage?
> > >
> >
> > JFYI, There is a reuse path for the bpf map, see my previous RFC[1].
> > [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/
>
> I'm not a big fan of explicit recharging. It's too hairy to use, requiring
> mixing system level hierarchy configuration knowledge with in-application
> resource handling. There should be clear isolation between the two. This is
> also what leads to namespace and visibility issues. Ideally, these should be
> handled by structuring the resource hierarchy correctly from the get-go.
>
> ...
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> >
> > We can add restrictions to check which memcg can be selected
> > (regarding the selectable memcg).
> > But I think it may be too early to do the restrictions, as only the
> > privileged user can set it.
> > It is the sys admin's responsibility to select a proper memcg.
> > That said, the selectable memcg is not going south.
>
> I generally tend towards shaping interfaces more carefully. We can of course
> add a do-whatever-you-want interface and declare that it's for root only but
> even that becomes complicated with things like userns as we're finding out
> in different areas, but again, nothing is free and approaches like that
> often bring more longterm headaches than the problems they solve.
>
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all.
> >
> > It is not a problem. Just sharing our practice below.
> > $ docker run -tid --privileged \
> >     --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
> >     --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
> >     docker-image
> >
> > The bind-mount can make it work.
>
> Not mount namespace. cgroup namespace which is used to isolate the cgroup
> subtree that's visible to each container. Please take a look at
> cgroup_namespaces(7).
>

IIUC, we can get the target cgroup with a relative path if cgroup
namespace is enabled, for example ../bpf, right?
(If not, then I think we can extend it.)

> > >  Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > As I said above, we just need some restrictions or guidance if that is
> > desired now.
>
> I'm really unlikely to go down that path because as I said before I believe
> that that's a road to long term headaches and already creates immediate
> problems in terms of how it'd interact with other resources. No matter what
> approach we may choose, it has to work with the existing resource hierarchy.

Unfortunately, the bpf map has already broken, and is still breaking,
the existing resource hierarchy.
What I'm doing is improving or even fixing this breakage.

The reason I say it is breaking the existing resource hierarchy is
that the bpf map is improperly treated as a process while it is really
a shared resource.

A bpf map can be written to by processes running in other memcgs, but
the memory allocated by those writes won't be charged to the writer's
memcg; it will be charged to the bpf map's memcg.

That's why I try to convey to you that what I'm trying to fix is a
bpf-specific issue.

> Otherwise, we end up in a situation where we have to split accounting and
> control of other controllers too which makes little sense for non-memory
> controllers (not that it's great for memory in the first place).
>

If the resource hierarchy is what concerns you, then I think it can be
addressed with an additional check along these lines,

if (cgroup_is_not_ancestor(cgrp))
    return -EINVAL;
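
Fleshed out a little, the kernel side of that check could look roughly
like the below. The helpers used (cgroup_get_from_fd(),
cgroup_is_descendant(), cgroup_get_e_css(), mem_cgroup_from_css()) exist
in today's tree; the function name and where it would hook into map
creation are illustrative only:

#include <linux/cgroup.h>
#include <linux/memcontrol.h>
#include <linux/err.h>

/*
 * Sketch: resolve a user-supplied cgroup fd into a memcg, accepting only
 * ancestors of the caller's own cgroup. The caller would css_put() the
 * returned memcg's css when done with it.
 */
static struct mem_cgroup *bpf_map_resolve_memcg(int cgroup_fd)
{
	struct cgroup_subsys_state *css;
	struct cgroup *cgrp;
	bool allowed;

	cgrp = cgroup_get_from_fd(cgroup_fd);
	if (IS_ERR(cgrp))
		return ERR_CAST(cgrp);

	rcu_read_lock();
	allowed = cgroup_is_descendant(task_dfl_cgroup(current), cgrp);
	rcu_read_unlock();
	if (!allowed) {
		cgroup_put(cgrp);
		return ERR_PTR(-EINVAL);
	}

	css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
	cgroup_put(cgrp);
	return mem_cgroup_from_css(css);
}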
Mina Almasry Aug. 24, 2022, 7:02 p.m. UTC | #19
On Mon, Aug 22, 2022 at 8:14 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > > >    One solution that I can think of is leveraging the resource domain
> > > > >    concept which is currently only used for threaded cgroups. All memory
> > > > >    usages of threaded cgroups are charged to their resource domain cgroup
> > > > >    which hosts the processes for those threads. The persistent usages have a
> > > > >    similar pattern, so maybe the service level cgroup can declare that it's
> > > > >    the encompassing resource domain and the instance cgroup can say whether
> > > > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > > >    resource domain.
> > > > >
> > > >
> > > > I think this sounds excellent and addresses our use cases. Basically
> > > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > > domain cgroup rather than the instance cgroup, making the memory usage
> > > > of the first and second+ instances consistent and predictable.
> > > >
> > > > Would love to hear from other memcg folks what they would think of
> > > > such an approach. I would also love to hear what kind of interface you
> > > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > > resource domain?
> > >
> > > I like this too. It makes shared charging predictable, with a coherent
> > > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > > need for cgroup paths in tmpfs mounts or similar.
> > >
> > > As far as who is declaring what goes, though: if the instance groups
> > > can declare arbitrary files/objects persistent or shared, they'd be
> > > able to abuse this and sneak private memory past local limits and
> > > burden the wider persistent/shared domain with it.
>
> My thought was that the persistent cgroup and instance cgroups should belong
> to the same trust domain and system level control should be applied at the
> resource domain level. The application may decide to shift between
> persistent and per-instance however it wants to and may even configure
> resource control at that level but all that's for its own accounting
> accuracy and benefit.
>
> > > I'm thinking it might make more sense for the service level to declare
> > > which objects are persistent and shared across instances.
> >
> > I like this idea.
> >
> > > If that's the case, we may not need a two-component interface. Just
> > > the ability for an intermediate cgroup to say: "This object's future
> > > memory is to be charged to me, not the instantiating cgroup."
> > >
> > > Can we require a process in the intermediate cgroup to set up the file
> > > or object, and use madvise/fadvise to say "charge me", before any
> > > instances are launched?
> >
> > We need to think how to make this interface convenient to use.
> > First, these persistent resources are likely created by some agent software,
> > not the main workload. So the requirement to call madvise() from the
> > actual cgroup might be not easily achievable.
>
> So one worry that I have for this is that it requires the application itself
> to be aware of cgroup topologies and restructure itself so that allocation of
> those resources is factored out into something else. Maybe that's not a
> huge problem but it may limit its applicability quite a bit.
>

I agree with this point 100%. The interfaces being discussed here
require existing applications to restructure themselves, which I don't
imagine will be very useful for us at least.

> If we can express all the resource constraints and structures on the cgroup
> side, configured by the management agent, the application can simply e.g.
> madvise whatever memory region or flag bpf maps as "these are persistent"
> and the rest can be handled by the system. If the agent sets up the
> environment for that, it gets accounted accordingly; otherwise, it'd behave
> as if the tagging didn't exist. Asking the application to set up all its
> resources in separate steps might require significant restructuring
> and knowledge of how the hierarchy is set up in many cases.
>

I don't know if this level of granularity is needed with a madvise()
or such. The kernel knows whether resources are persistent due to the
nature of the resource. For example a shared tmpfs file is a resource
that is persistent and not cleaned up after the process using it dies,
but private memory is. madvise(PERSISTENT) on private memory would not
make sense, and I don't think madvise(NOT_PERSISTENT) on tmpfs-backed
memory region would make sense. Also, this requires adding madvise()
hints in userspace code to leverage this.

> > So _maybe_ something like writing a fd into cgroup.memory.resources.
> >

Sorry, I don't see this being useful - to us at least - if it is an
fd-based interface. It needs to support marking entire tmpfs mounts as
persistent. The reasoning is as Tejun alludes to: our container
management agent generally sets up the container hierarchy for a job
and also sets up the filesystem mounts the job requests. This is
generally because the job doesn't run as root and doesn't bother with
mount namespaces. Thus, our jobs are well-trained in receiving mounts
set up for them from our container management agent. Jobs are _not_
well-trained in receiving an fd from the container management agent,
and restructuring our jobs/services for such an interface will not be
feasible I think.

This applies to us but I imagine it is common practice for the
container management agent to set up mounts for the jobs to use,
rather than provide the job with an fd or collection of fds.

> > Second, it would be really useful to export the current configuration
> > to userspace. E.g. a user should be able to query to which cgroup the given
> > bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
> > it will create a problem for userspace programs which manage cgroups
> > (e.g. systemd): they should be able to restore the current configuration
> > from the kernel state, without "remembering" what has been configured
> > before.
>
> This too can be achieved by separating out cgroup setup and tagging specific
> resources. Agent and cgroup know what each cgroup is supposed to do as they
> already do now and each resource is tagged whether they're persistent or
> not, so everything is always known without the agent and the application
> having to explicitly share the information.
>
> Thanks.
>
> --
> tejun
Tejun Heo Aug. 25, 2022, 5:59 p.m. UTC | #20
Hello,

On Wed, Aug 24, 2022 at 12:02:04PM -0700, Mina Almasry wrote:
> > If we can express all the resource constraints and structures on the cgroup
> > side, configured by the management agent, the application can simply e.g.
> > madvise whatever memory region or flag bpf maps as "these are persistent"
> > and the rest can be handled by the system. If the agent sets up the
> > environment for that, it gets accounted accordingly; otherwise, it'd behave
> > as if the tagging didn't exist. Asking the application to set up all its
> > resources in separate steps might require significant restructuring
> > and knowledge of how the hierarchy is set up in many cases.
> 
> I don't know if this level of granularity is needed with a madvise()
> or such. The kernel knows whether resources are persistent due to the
> nature of the resource. For example a shared tmpfs file is a resource
> that is persistent and not cleaned up after the process using it dies,
> but private memory is. madvise(PERSISTENT) on private memory would not
> make sense, and I don't think madvise(NOT_PERSISTENT) on tmpfs-backed
> memory region would make sense. Also, this requires adding madvise()
> hints in userspace code to leverage this.

I haven't thought hard about what the hinting interface should be like. The
default assumptions would be that page cache belongs to the persistent
domain and anon belongs to the instance (mm folks, please correct me if I'm
off the rails here), but I can imagine situations where that doesn't
necessarily hold - like temp files which get unlinked on instance shutdown.

In terms of hint granularity, more coarse-grained (e.g. file, mount,
whatever) seems to make sense but again I haven't thought too hard on it.
That said, as long as the default behavior is reasonable, I think adding
some hinting calls in the application is acceptable. It doesn't require any
structural changes and the additions would be for its own benefit of more
accurate accounting and control. That makes sense to me.

One unfortunate effect this will have is that we'll be actively putting
resources into intermediate cgroups. This already happens today but if we
support persistent domains, it's gonna be a lot more prevalent and we'll
need to update e.g. iocost to support IOs coming out of intermediate
cgroups. This kinda sucks because we don't even have knobs to control self
vs. children distributions. Oh well...

Thanks.