
[RFC,0/8] memory recharging for offline memcgs

Message ID 20230720070825.992023-1-yosryahmed@google.com (mailing list archive)

Message

Yosry Ahmed July 20, 2023, 7:08 a.m. UTC
This patch series implements the proposal in LSF/MM/BPF 2023 conference
for reducing offline/zombie memcgs by memory recharging [1]. The main
difference is that this series focuses on recharging and does not
include eviction of any memory charged to offline memcgs.

Two methods of recharging are proposed:

(a) Recharging of mapped folios.

When a memcg is offlined, queue an asynchronous worker that will walk
the lruvec of the offline memcg and try to recharge any mapped folios to
the memcg of one of the processes mapping the folio. The main assumption
is that a process mapping the folio is the "rightful" owner of the
memory.

Currently, this is only supported for evictable folios, as the
unevictable lru is imaginary and we cannot iterate the folios on it. A
separate proposal [2] was made to revive the unevictable lru, which
would allow recharging of unevictable folios.
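To make (a) more concrete, here is a rough, hedged sketch of what such an
offline-time worker could look like. This is not the code in the series:
the recharge_work member and recharge_folio_to_mapper() are invented names
for illustration only (the series uses its own helpers, e.g.
lruvec_for_each_list() added in patch 2), and all locking/isolation is
omitted:

#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/workqueue.h>

/* Illustrative sketch only, not the code in this series. */
static void memcg_recharge_mapped_folios(struct work_struct *work)
{
	/* recharge_work is an invented member for this sketch */
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						recharge_work);
	int nid;

	for_each_node(nid) {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
		enum lru_list lru;

		/* only the evictable LRUs; the unevictable one cannot be walked */
		for_each_evictable_lru(lru) {
			struct folio *folio;

			/* lru_lock / folio isolation omitted for brevity */
			list_for_each_entry(folio, &lruvec->lists[lru], lru) {
				if (!folio_mapped(folio))
					continue;
				/*
				 * Invented helper: try to move the charge to
				 * the memcg of one process mapping the folio.
				 * Best effort; failures are simply skipped.
				 */
				recharge_folio_to_mapper(folio);
			}
		}
	}
}

/* Queued from the memcg offline path, e.g.:
 *	queue_work(system_unbound_wq, &memcg->recharge_work);
 */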

(b) Deferred recharging of folios.

For folios that are unmapped, or mapped but we fail to recharge them
with (a), we rely on deferred recharging. Simply put, any time a folio
is accessed or dirtied by a userspace process, and that folio is charged
to an offline memcg, we will try to recharge it to the memcg of the
process accessing the folio. Again, we assume this process should be the
"rightful" owner of the memory. This is also done asynchronously to avoid
slowing down the data access path.
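For (b), the diffstat shows small hooks in mm/swap.c and mm/page-writeback.c.
As a hedged sketch of the shape of such a hook (maybe_recharge_on_access()
and deferred_recharge_folio() are made-up names, not the series' helpers;
RCU/refcounting elided):

#include <linux/memcontrol.h>

/* Illustrative sketch only; not the series' actual hook. */
static inline void maybe_recharge_on_access(struct folio *folio)
{
	struct mem_cgroup *memcg = folio_memcg(folio);

	/* nothing to do if the folio is charged to an online memcg */
	if (!memcg || mem_cgroup_online(memcg))
		return;

	/*
	 * The folio is charged to an offline memcg. Defer the actual
	 * recharge to an asynchronous worker so the access/dirty path is
	 * not slowed down; the target is the memcg of the task touching
	 * the folio. deferred_recharge_folio() is an invented name.
	 */
	deferred_recharge_folio(folio);
}

A hook of this shape would be called from places like folio_mark_accessed()
and folio_mark_dirty().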

In both cases, we never OOM kill the recharging target if it goes above
limit. This is to avoid OOM killing a process an arbitrary amount of
time after it started using memory. This is a conservative policy that
can be revisited later.

The patches in this series are divided as follows:
- Patches 1 & 2 are preliminary refactoring and helper introductions.
- Patches 3 to 5 implement (a) and (b) above.
- Patches 6 & 7 add stats, a sysctl, and a config option.
- Patch 8 is a selftest.

[1]https://lore.kernel.org/linux-mm/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/
[2]https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@google.com/

Yosry Ahmed (8):
  memcg: refactor updating memcg->moving_account
  mm: vmscan: add lruvec_for_each_list() helper
  memcg: recharge mapped folios when a memcg is offlined
  memcg: support deferred memcg recharging
  memcg: recharge folios when accessed or dirtied
  memcg: add stats for offline memcgs recharging
  memcg: add sysctl and config option to control memory recharging
  selftests: cgroup: test_memcontrol: add a selftest for memcg
    recharging

 include/linux/memcontrol.h                    |  14 +
 include/linux/swap.h                          |   8 +
 include/linux/vm_event_item.h                 |   5 +
 kernel/sysctl.c                               |  11 +
 mm/Kconfig                                    |  12 +
 mm/memcontrol.c                               | 376 +++++++++++++++++-
 mm/page-writeback.c                           |   2 +
 mm/swap.c                                     |   2 +
 mm/vmscan.c                                   |  28 ++
 mm/vmstat.c                                   |   6 +-
 tools/testing/selftests/cgroup/cgroup_util.c  |  14 +
 tools/testing/selftests/cgroup/cgroup_util.h  |   1 +
 .../selftests/cgroup/test_memcontrol.c        | 310 +++++++++++++++
 13 files changed, 784 insertions(+), 5 deletions(-)

Comments

Johannes Weiner July 20, 2023, 3:35 p.m. UTC | #1
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
> 
> Two methods of recharging are proposed:
> 
> (a) Recharging of mapped folios.
> 
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
> 
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
> 
> (b) Deferred recharging of folios.
> 
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.

I'm super skeptical of this proposal.

Recharging *might* be the most desirable semantics from a user pov,
but only if it applies consistently to the whole memory footprint.
There is no mention of slab allocations such as inodes, dentries,
network buffers etc. which can be a significant part of a cgroup's
footprint. These are currently reparented. I don't think doing one
thing with half of the memory, and a totally different thing with the
other half upon cgroup deletion is going to be acceptable semantics.

It appears this also brings back the reliability issue that caused us
to deprecate charge moving. The recharge path has trylocks, LRU
isolation attempts, GFP_ATOMIC allocations. These introduce a variable
error rate into the relocation process, which causes pages that should
belong to the same domain to be scattered around all over the place.
It also means that zombie pinning still exists, but it's now even more
influenced by timing and race conditions, and so less predictable.

There are two issues being conflated here:

a) the problem of zombie cgroups, and

b) who controls resources that outlive the control domain.

For a), reparenting is still the most reasonable proposal. It's
reliable for one, but it also fixes the problem fully within the
established, user-facing semantics: resources that belong to a cgroup
also hierarchically belong to all ancestral groups; if those resources
outlive the last-level control domain, they continue to belong to the
parents. This is how it works today, and this is how it continues to
work with reparenting. The only difference is that those resources no
longer pin a dead cgroup anymore, but instead are physically linked to
the next online ancestor. Since dead cgroups have no effective control
parameters anymore, this is semantically equivalent - it's just a more
memory efficient implementation of the same exact thing.

b) is a discussion totally separate from this. We can argue what we
want this behavior to be, but I'd argue strongly that whatever we do
here should apply to all resources managed by the controller equally.

It could also be argued that if you don't want to lose control over a
set of resources, then maybe don't delete their control domain while
they are still alive and in use. For example, when restarting a
workload, and the new instance is expected to have largely the same
workingset, consider reusing the cgroup instead of making a new one.

For the zombie problem, I think we should merge Muchun's patches
ASAP. They've been proposed several times, they have Roman's reviews
and acks, and they do not change user-facing semantics. There is no
good reason not to merge them.
Tejun Heo July 20, 2023, 7:57 p.m. UTC | #2
On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.

Or just create a nesting layer so that there's a cgroup which represents the
persistent resources and a nested cgroup instance inside representing the
current instance.
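
A minimal sketch of such a layout, assuming cgroup2 is mounted at
/sys/fs/cgroup (paths are made up for illustration): the parent cgroup holds
the persistent resources, and only the nested child is torn down and
recreated per instance, so anything that outlives the instance stays
accounted to (and limited by) the parent instead of escaping to the root.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	/* parent cgroup representing the persistent resources */
	mkdir("/sys/fs/cgroup/app", 0755);

	/* nested cgroup for the current instance; on restart, remove and
	 * recreate only this one */
	mkdir("/sys/fs/cgroup/app/instance", 0755);

	/* move the workload (here: ourselves) into the instance cgroup */
	FILE *f = fopen("/sys/fs/cgroup/app/instance/cgroup.procs", "w");

	if (!f)
		return 1;
	fprintf(f, "%d\n", (int)getpid());
	fclose(f);
	return 0;
}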

Thanks.
Yosry Ahmed July 20, 2023, 9:33 p.m. UTC | #3
On Thu, Jul 20, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> I'm super skeptical of this proposal.

I expected this :)

>
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.

I think, as you say, recharging has the most desirable semantics
because the charge is maintained where it *should* be (with who is
actually using it). We simply cannot do that for kernel memory,
because we have no way of attributing it to a user. On the other hand,
we *can* attribute user memory to a user. Consistency is great, but
our inability to do (arguably) the right thing for one type of memory,
doesn't mean we shouldn't do it when we can. I would also argue that
user memory (anon/file pages) would commonly be the larger portion of
memory on a machine compared to kernel memory (e.g. slab).

>
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process,

Recharging is naturally best effort, because it's non-disruptive.
After a memcg dies, the kernel continuously tries to move the charges
away from it on every chance it gets. If it fails one time that's
fine, there will be other chances. Compared to the status quo, it is
definitely better than just leaving all the memory behind with the
zombie memcg. I would argue that over time (and accesses), most/all
memory should eventually get recharged. If not, something is not
working correctly, or a wrong assumption is being made.

> which causes pages that should
> belong to the same domain to be scattered around all over the place.

I strongly disagree with this point. Ideally, yes, memory charged to a
memcg would belong to the same domain. In practice, due to the first
touch charging semantics, this is far from the truth. For anonymous
memory, sure, they all belong to the same domain (mostly), the process
they belong to. But most of anonymous memory will go away when the
process dies anyway, the problem is mostly with shared resources (e.g.
file, tmpfs, ..). With file/tmpfs memory, the charging behavior is
random. The first memcg that touches a page gets charged for it.
Consequently, the file/tmpfs memory charged to a memcg would be a
mixture of pages from different files in different mounts, definitely
not a single domain. Perhaps with some workloads, where each memcg is
accessing different files, most memory charged to a memcg will belong
to the same domain, but in this case, recharging wouldn't move it away
anyway.

> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.

It still exists, but it is improved. The kernel tries to move charges
away from zombies on every chance it gets instead of doing nothing
about it. It is less predictable, can't argue about this, but it can't
get worse, only better.

>
> There are two issues being conflated here:
>
> a) the problem of zombie cgroups, and
>
> b) who controls resources that outlive the control domain.
>
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.

I agree that reparenting is more deterministic and reliable, but there
are two major flaws off the top of my head:

(1) If a memcg touches a page one time and gets charged for it, the
charge is stuck in its hierarchy forever. It can get reparented, but
it will never be charged to whoever is actually using it again, unless
it is reclaimed and refaulted (in some cases).

Consider this hierarchy:
       root
      /    \
     A      B
             \
              C

Consider a case where memcg C touches a library file once, and gets
charged for some memory, and then dies. The memory gets reparented to
memcg B. Meanwhile, memcg A is continuously using the memory that
memcg B is charged for. memcg B would be indefinitely taxed by memcg
A. The only way out is if memcg B hits its limit, and the pages get
reclaimed, and then refaulted and recharged to memcg A. In some cases
(e.g. tmpfs), even then the memory would still get charged to memcg B.
There is no way to get rid of the charge until the resource itself is
freed.

This problem exists today, even without reparenting, with the
difference being that the charge will remain with C instead of B.
Recharging offers a better alternative where the charge will be
correctly moved to A, the "rightful" owner.

(2) In the above scenario, when memcg B dies, the memory will be
reparented to the root. That's even worse. Now memcg A is using memory
that is not accounted for anywhere, essentially an accounting leak.
From an admin perspective, the memory charged to root is system
overhead, it is lost capacity. For long-living systems, as memcgs are
created and destroyed for different workloads, memory will keep
accumulating at the root. The machine will keep leaking capacity over
time, and accounting becomes less and less accurate as more memory
becomes charged to the root.

>
> b) is a discussion totally separate from this.

I would argue that the zombie problem is (at least partially) an
artifact of the shared/sticky resources problem. If all resources are
used by one memcg and do not outlive it, we wouldn't have zombies.

> We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.

User memory and kernel memory are very different in nature. Ideally
yeah, we want to treat all resources equally. But user memory is
naturally more attributable to users and easier to account correctly
than kernel memory.

>
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use.

This is easier said than done :) As I mentioned earlier, the charging
semantics are inherently indeterministic for shared resources (e.g.
file/tmpfs). The user cannot control or monitor which resources belong
to which control domain. Each memcg in the system could be charged for
one page from each file in a shared library for all that matters :)

> For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.

In a large fleet with many different jobs getting rescheduled and
restarted on different machines, it's really hard in practice to do
so. We can keep the same cgroup if the same workload is being
restarted on the same machine, sure, but most of the time there's a
new workload arriving or so. We can't reuse containers in this case.

>
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.

There are some, which I pointed out above.

All in all, I understand where you are coming from. Your concerns are
valid. Recharging is not a perfect approach, but it is arguably the
best we can do at this point. Being indeterministic sucks, but our
charging semantics are inherently indeterministic anyway.
Yosry Ahmed July 20, 2023, 9:34 p.m. UTC | #4
On Thu, Jul 20, 2023 at 12:57 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote:
> > It could also be argued that if you don't want to lose control over a
> > set of resources, then maybe don't delete their control domain while
> > they are still alive and in use. For example, when restarting a
> > workload, and the new instance is expected to have largely the same
> > workingset, consider reusing the cgroup instead of making a new one.
>
> Or just create a nesting layer so that there's a cgroup which represents the
> persistent resources and a nested cgroup instance inside representing the
> current instance.

In practice it is not easy to know exactly which resources are shared
and used by which cgroups, especially in a large dynamic environment.

>
> Thanks.
>
> --
> tejun
Tejun Heo July 20, 2023, 10:11 p.m. UTC | #5
Hello,

On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > Or just create a nesting layer so that there's a cgroup which represents the
> > persistent resources and a nested cgroup instance inside representing the
> > current instance.
> 
> In practice it is not easy to know exactly which resources are shared
> and used by which cgroups, especially in a large dynamic environment.

Yeah, that only covers when resource persistence is confined in a known
scope. That said, I have a hard time seeing how recharding once after cgroup
destruction can be a solution for the situations you describe. What if A
touches it once first, B constantly uses it but C only very occasionally and
after A dies C ends up owning it due to timing. This is very much possible
in a large dynamic environment but neither the initial or final situation is
satisfactory.

To solve the problems you're describing, you actually would have to
guarantee that memory pages are charged to the current majority user (or
maybe even spread across current active users). Maybe it can be argued that
this is a step towards that but it's a very partial step and at least would
need a technically viable direction that this development can follow.

On its own, AFAICS, I'm not sure the scope of problems it can actually solve
is justifiably greater than what can be achieved with simple nesting.

Thanks.
Yosry Ahmed July 20, 2023, 10:23 p.m. UTC | #6
On Thu, Jul 20, 2023 at 3:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > > Or just create a nesting layer so that there's a cgroup which represents the
> > > persistent resources and a nested cgroup instance inside representing the
> > > current instance.
> >
> > In practice it is not easy to know exactly which resources are shared
> > and used by which cgroups, especially in a large dynamic environment.
>
> Yeah, that only covers when resource persistence is confined in a known
> scope. That said, I have a hard time seeing how recharding once after cgroup
> destruction can be a solution for the situations you describe. What if A
> touches it once first, B constantly uses it but C only very occasionally and
> after A dies C ends up owning it due to timing. This is very much possible
> in a large dynamic environment but neither the initial or final situation is
> satisfactory.

That is indeed possible, but it would be more likely that the charge
is moved to B. As I said, it's not perfect, but it is an improvement
over what we have today. Even if C ends up owning it, it's better than
staying with the dead A.

>
> To solve the problems you're describing, you actually would have to
> guarantee that memory pages are charged to the current majority user (or
> maybe even spread across current active users). Maybe it can be argued that
> this is a step towards that but it's a very partial step and at least would
> need a technically viable direction that this development can follow.

Right, that would be a much larger effort (arguably memcg v3 ;) ).
This proposal is focused on the painful artifact of the sharing/sticky
resources problem: zombie memcgs. We can extend the automatic charge
movement semantics later to cover more cases or be smarter, or ditch
the existing charging semantics completely and start over with
sharing/stickiness in mind. Either way, that would be a long-term
effort. There is a problem that exists today though that ideally can
be fixed/improved by this proposal.

>
> On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> is justifiably greater than what can be achieved with simple nesting.

In our use case nesting is not a viable option. As I said, in a large
fleet where a lot of different workloads are dynamically being
scheduled on different machines, and where there is no way of knowing
what resources are being shared among what workloads, and even if we
do, it wouldn't be constant, it's very difficult to construct the
hierarchy with nesting to keep the resources confined.

Keep in mind that the environment is dynamic, workloads are constantly
coming and going. Even if we find the perfect nesting to appropriately
scope resources, some rescheduling may render the hierarchy obsolete
and require us to start over.

>
> Thanks.
>
> --
> tejun
Tejun Heo July 20, 2023, 10:31 p.m. UTC | #7
Hello,

On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > is justifiably greater than what can be achieved with simple nesting.
> 
> In our use case nesting is not a viable option. As I said, in a large
> fleet where a lot of different workloads are dynamically being
> scheduled on different machines, and where there is no way of knowing
> what resources are being shared among what workloads, and even if we
> do, it wouldn't be constant, it's very difficult to construct the
> hierarchy with nesting to keep the resources confined.

Hmm... so, usually, the problems we see are resources that are persistent
across different instances of the same application as they may want to share
large chunks of memory like on-memory cache. I get that machines get
different dynamic jobs but unrelated jobs usually don't share huge amount of
memory at least in our case. The sharing across them comes down to things
like some common library pages which don't really account for much these
days.

> Keep in mind that the environment is dynamic, workloads are constantly
> coming and going. Even if find the perfect nesting to appropriately
> scope resources, some rescheduling may render the hierarchy obsolete
> and require us to start over.

Can you please go into more details on how much memory is shared for what
across unrelated dynamic workloads? That sounds different from other use
cases.

Thanks.
T.J. Mercier July 20, 2023, 11:24 p.m. UTC | #8
On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > > is justifiably greater than what can be achieved with simple nesting.
> >
> > In our use case nesting is not a viable option. As I said, in a large
> > fleet where a lot of different workloads are dynamically being
> > scheduled on different machines, and where there is no way of knowing
> > what resources are being shared among what workloads, and even if we
> > do, it wouldn't be constant, it's very difficult to construct the
> > hierarchy with nesting to keep the resources confined.
>
> Hmm... so, usually, the problems we see are resources that are persistent
> across different instances of the same application as they may want to share
> large chunks of memory like on-memory cache. I get that machines get
> different dynamic jobs but unrelated jobs usually don't share huge amount of
> memory at least in our case. The sharing across them comes down to things
> like some common library pages which don't really account for much these
> days.
>
This has also been my experience in terms of bytes of memory that are
incorrectly charged (because they're charged to a zombie), but that is
because memcg doesn't currently track the large shared allocations in
my case (primarily dma-buf). The greater issue I've seen so far is the
number of zombie cgroups that can accumulate over time. But my
understanding is that both of these two problems are currently
significant for Yosry's case.
Tejun Heo July 20, 2023, 11:33 p.m. UTC | #9
Hello,

On Thu, Jul 20, 2023 at 04:24:02PM -0700, T.J. Mercier wrote:
> > Hmm... so, usually, the problems we see are resources that are persistent
> > across different instances of the same application as they may want to share
> > large chunks of memory like on-memory cache. I get that machines get
> > different dynamic jobs but unrelated jobs usually don't share huge amount of
> > memory at least in our case. The sharing across them comes down to things
> > like some common library pages which don't really account for much these
> > days.
> >
> This has also been my experience in terms of bytes of memory that are
> incorrectly charged (because they're charged to a zombie), but that is
> because memcg doesn't currently track the large shared allocations in
> my case (primarily dma-buf). The greater issue I've seen so far is the
> number of zombie cgroups that can accumulate over time. But my
> understanding is that both of these two problems are currently
> significant for Yosry's case.

memcg already does reparenting of slab pages to lower the number of dying
cgroups and maybe it makes sense to expand that to user memory too. One
related thing is that if those reparented pages are written to, that's gonna
break IO isolation w/ blk-iocost because iocost currently bypasses IOs from
intermediate cgroups to root but we can fix that. Anyways, that's something
pretty different from what's proposed here. Reparenting, I think, is a lot
less controversial.

Thanks.
Roman Gushchin July 21, 2023, 12:02 a.m. UTC | #10
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
> 
> Two methods of recharging are proposed:
> 
> (a) Recharging of mapped folios.
> 
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
> 
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
> 
> (b) Deferred recharging of folios.
> 
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.

Unfortunately I have to agree with Johannes, Tejun and others who are not big
fans of this approach.

Lazy recharging leads to an interesting phenomenon: the memory usage of a
running workload may suddenly go up only because some other workload was
terminated and its memory is now being recharged. I find it confusing. It
also makes it hard to set up limits and/or guarantees.

In general, I don't think we can handle shared memory well without getting rid
of "whoever allocates a page, pays the full price" policy and making a shared
ownership a fully supported concept. Of course, it's a huge work and I believe
the only way we can achieve it is to compromise on the granularity of the
accounting. Will the resulting system be better in the real life, it's hard to
say in advance.

Thanks!
Yosry Ahmed July 21, 2023, 12:07 a.m. UTC | #11
On Thu, Jul 20, 2023 at 5:02 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> > This patch series implements the proposal in LSF/MM/BPF 2023 conference
> > for reducing offline/zombie memcgs by memory recharging [1]. The main
> > difference is that this series focuses on recharging and does not
> > include eviction of any memory charged to offline memcgs.
> >
> > Two methods of recharging are proposed:
> >
> > (a) Recharging of mapped folios.
> >
> > When a memcg is offlined, queue an asynchronous worker that will walk
> > the lruvec of the offline memcg and try to recharge any mapped folios to
> > the memcg of one of the processes mapping the folio. The main assumption
> > is that a process mapping the folio is the "rightful" owner of the
> > memory.
> >
> > Currently, this is only supported for evictable folios, as the
> > unevictable lru is imaginary and we cannot iterate the folios on it. A
> > separate proposal [2] was made to revive the unevictable lru, which
> > would allow recharging of unevictable folios.
> >
> > (b) Deferred recharging of folios.
> >
> > For folios that are unmapped, or mapped but we fail to recharge them
> > with (a), we rely on deferred recharging. Simply put, any time a folio
> > is accessed or dirtied by a userspace process, and that folio is charged
> > to an offline memcg, we will try to recharge it to the memcg of the
> > process accessing the folio. Again, we assume this process should be the
> > "rightful" owner of the memory. This is also done asynchronously to avoid
> > slowing down the data access path.
>
> Unfortunately I have to agree with Johannes, Tejun and others who are not big
> fans of this approach.
>
> Lazy recharging leads to an interesting phenomena: a memory usage of a running
> workload may suddenly go up only because some other workload is terminated and
> now it's memory is being recharged. I find it confusing. It also makes hard
> to set up limits and/or guarantees.

This can happen today.

If memcg A starts accessing some memory and gets charged for it, and
then memcg B also accesses it, it will not be charged for it. If at a
later point memcg A runs into reclaim and the memory is freed, then when
memcg B tries to access it, its usage will suddenly go up as well,
because some other workload experienced reclaim. This is a very
similar scenario, only instead of reclaim, the memcg was offlined. As
a matter of fact, it's common to try to free up a memcg before
removing it (by lowering the limit or using memory.reclaim). In that
case, the net result would be exactly the same -- with the difference
being that recharging will avoid freeing the memory and faulting it
back in.

>
> In general, I don't think we can handle shared memory well without getting rid
> of "whoever allocates a page, pays the full price" policy and making a shared
> ownership a fully supported concept. Of course, it's a huge work and I believe
> the only way we can achieve it is to compromise on the granularity of the
> accounting. Will the resulting system be better in the real life, it's hard to
> say in advance.
>
> Thanks!
Yosry Ahmed July 21, 2023, 6:15 p.m. UTC | #12
On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote:
> > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> > > is justifiably greater than what can be achieved with simple nesting.
> >
> > In our use case nesting is not a viable option. As I said, in a large
> > fleet where a lot of different workloads are dynamically being
> > scheduled on different machines, and where there is no way of knowing
> > what resources are being shared among what workloads, and even if we
> > do, it wouldn't be constant, it's very difficult to construct the
> > hierarchy with nesting to keep the resources confined.
>
> Hmm... so, usually, the problems we see are resources that are persistent
> across different instances of the same application as they may want to share
> large chunks of memory like on-memory cache. I get that machines get
> different dynamic jobs but unrelated jobs usually don't share huge amount of

I am digging deeper to get more information for you. One thing I know
now is that different instances of the same job are contained within a
common parent, and we even use our previously proposed memcg= mount
option for tmpfs to charge their shared resources to a common parent.
So restarting tasks is not the problem we are seeing.

> memory at least in our case. The sharing across them comes down to things
> like some common library pages which don't really account for much these
> days.

Keep in mind that even a single page charged to a memcg and used by
another memcg is sufficient to result in a zombie memcg.

>
> > Keep in mind that the environment is dynamic, workloads are constantly
> > coming and going. Even if find the perfect nesting to appropriately
> > scope resources, some rescheduling may render the hierarchy obsolete
> > and require us to start over.
>
> Can you please go into more details on how much memory is shared for what
> across unrelated dynamic workloads? That sounds different from other use
> cases.

I am trying to collect more information from our fleet, but the
application restarting in a different cgroup is not what is happening
in our case. It is not easy to find out exactly what is going on on
machines and where the memory is coming from due to the
indeterministic nature of charging. The goal of this proposal is to
let the kernel handle leftover memory in zombie memcgs because it is
not always obvious to userspace what's going on (like it's not obvious
to me now where exactly is the sharing happening :) ).

One thing to note is that in some cases, maybe a userspace bug or
failed cleanup is a reason for the zombie memcgs. Ideally, this
wouldn't happen, but it would be nice to have a fallback mechanism in
the kernel if it does.

>
> Thanks.
>
> --
> tejun
Tejun Heo July 21, 2023, 6:26 p.m. UTC | #13
Hello,

On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > memory at least in our case. The sharing across them comes down to things
> > like some common library pages which don't really account for much these
> > days.
> 
> Keep in mind that even a single page charged to a memcg and used by
> another memcg is sufficient to result in a zombie memcg.

I mean, yeah, that's a separate issue or rather a subset which isn't all
that controversial. That can be deterministically solved by reparenting to
the parent like how slab is handled. I think the "deterministic" part is
important here. As you said, even a single page can pin a dying cgroup.

> > > Keep in mind that the environment is dynamic, workloads are constantly
> > > coming and going. Even if find the perfect nesting to appropriately
> > > scope resources, some rescheduling may render the hierarchy obsolete
> > > and require us to start over.
> >
> > Can you please go into more details on how much memory is shared for what
> > across unrelated dynamic workloads? That sounds different from other use
> > cases.
> 
> I am trying to collect more information from our fleet, but the
> application restarting in a different cgroup is not what is happening
> in our case. It is not easy to find out exactly what is going on on

This is the point that Johannes raised but I don't think the current
proposal would make things more deterministic. From what I can see, it
actually pushes it towards even less predictability. Currently, yeah, some
pages may end up in cgroups which aren't the majority user but it at least
is clear how that would happen. The proposed change adds layers of
indeterministic behaviors on top. I don't think that's the direction we want
to go.

> machines and where the memory is coming from due to the
> indeterministic nature of charging. The goal of this proposal is to
> let the kernel handle leftover memory in zombie memcgs because it is
> not always obvious to userspace what's going on (like it's not obvious
> to me now where exactly is the sharing happening :) ).
>
> One thing to note is that in some cases, maybe a userspace bug or
> failed cleanup is a reason for the zombie memcgs. Ideally, this
> wouldn't happen, but it would be nice to have a fallback mechanism in
> the kernel if it does.

I'm not disagreeing on that. Our handling of pages owned by dying cgroups
isn't great but I don't think the proposed change is an acceptable solution.

Thanks.
Yosry Ahmed July 21, 2023, 6:47 p.m. UTC | #14
On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > memory at least in our case. The sharing across them comes down to things
> > > like some common library pages which don't really account for much these
> > > days.
> >
> > Keep in mind that even a single page charged to a memcg and used by
> > another memcg is sufficient to result in a zombie memcg.
>
> I mean, yeah, that's a separate issue or rather a subset which isn't all
> that controversial. That can be deterministically solved by reparenting to
> the parent like how slab is handled. I think the "deterministic" part is
> important here. As you said, even a single page can pin a dying cgroup.

There are serious flaws with reparenting that I mentioned above. We do
it for kernel memory, but that's because we really have no other
choice. Oftentimes the memory is not reclaimable and we cannot find an
owner for it. This doesn't mean it's the right answer for user memory.

The semantics are new compared to normal charging (as opposed to
recharging, as I explain below). There is an extra layer of
indirection that we did not (as far as I know) measure the impact of.
Parents end up with pages that they never used and we have no
observability into where they came from. Most importantly, over time
user memory will keep accumulating at the root, reducing the accuracy
and usefulness of accounting, effectively an accounting leak and
reduction of capacity. Memory that is not attributed to any user, aka
system overhead.

>
> > > > Keep in mind that the environment is dynamic, workloads are constantly
> > > > coming and going. Even if find the perfect nesting to appropriately
> > > > scope resources, some rescheduling may render the hierarchy obsolete
> > > > and require us to start over.
> > >
> > > Can you please go into more details on how much memory is shared for what
> > > across unrelated dynamic workloads? That sounds different from other use
> > > cases.
> >
> > I am trying to collect more information from our fleet, but the
> > application restarting in a different cgroup is not what is happening
> > in our case. It is not easy to find out exactly what is going on on
>
> This is the point that Johannes raised but I don't think the current
> proposal would make things more deterministic. From what I can see, it
> actually pushes it towards even less predictability. Currently, yeah, some
> pages may end up in cgroups which aren't the majority user but it at least
> is clear how that would happen. The proposed change adds layers of
> indeterministic behaviors on top. I don't think that's the direction we want
> to go.

I believe recharging is being mis-framed here :)

Recharging semantics are not new; they are a shortcut to a process that
is already happening, focused on offline memcgs. Let's take a
step back.

It is common practice (at least in my knowledge) to try to reclaim
memory from a cgroup before deleting it (by lowering the limit or
using memory.reclaim). Reclaim heuristics are biased towards
reclaiming memory from offline cgroups. After the memory is reclaimed,
if it is used again by a different process, it will be refaulted and
charged again (aka recharged) to the new user's memcg.
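For concreteness, a minimal sketch of that practice (the cgroup path is made
up; memory.reclaim accepts a byte count to try to reclaim):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* try to proactively reclaim up to 1G from the cgroup... */
	FILE *f = fopen("/sys/fs/cgroup/job/memory.reclaim", "w");

	if (f) {
		fputs("1G\n", f);
		fclose(f);
	}
	/* ...then remove the (now mostly empty) control domain */
	rmdir("/sys/fs/cgroup/job");
	return 0;
}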

What recharging is doing is *not* anything new. It is effectively
doing what reclaim + refault would do above, with an efficient
shortcut. It avoids the unnecessary fault, avoids disrupting the
workload that will access the memory after it is reclaimed, and cleans
up zombie memcgs memory faster than reclaim would. Moreover, it works
for memory that may not be reclaimable (e.g. because of lack of swap).

All the indeterministic behaviors in recharging are exactly the
indeterministic behaviors in reclaim. It is very similar. We iterate
the lrus, try to isolate and lock folios, etc. This is what reclaim
does. Recharging is basically lightweight reclaim + charging again (as
opposed to fully reclaiming the memory then refaulting it).

We are not introducing new indeterminism or charging semantics.
Recharging does exactly what would happen when we reclaim zombie
memory. It is just more efficient and accelerated.

>
> > machines and where the memory is coming from due to the
> > indeterministic nature of charging. The goal of this proposal is to
> > let the kernel handle leftover memory in zombie memcgs because it is
> > not always obvious to userspace what's going on (like it's not obvious
> > to me now where exactly is the sharing happening :) ).
> >
> > One thing to note is that in some cases, maybe a userspace bug or
> > failed cleanup is a reason for the zombie memcgs. Ideally, this
> > wouldn't happen, but it would be nice to have a fallback mechanism in
> > the kernel if it does.
>
> I'm not disagreeing on that. Our handling of pages owned by dying cgroups
> isn't great but I don't think the proposed change is an acceptable solution.

 I hope the above arguments change your mind :)

>
> Thanks.
>
> --
> tejun
Tejun Heo July 21, 2023, 7:18 p.m. UTC | #15
Hello,

On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > memory at least in our case. The sharing across them comes down to things
> > > > like some common library pages which don't really account for much these
> > > > days.
> > >
> > > Keep in mind that even a single page charged to a memcg and used by
> > > another memcg is sufficient to result in a zombie memcg.
> >
> > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > that controversial. That can be deterministically solved by reparenting to
> > the parent like how slab is handled. I think the "deterministic" part is
> > important here. As you said, even a single page can pin a dying cgroup.
> 
> There are serious flaws with reparenting that I mentioned above. We do
> it for kernel memory, but that's because we really have no other
> choice. Oftentimes the memory is not reclaimable and we cannot find an
> owner for it. This doesn't mean it's the right answer for user memory.
> 
> The semantics are new compared to normal charging (as opposed to
> recharging, as I explain below). There is an extra layer of
> indirection that we did not (as far as I know) measure the impact of.
> Parents end up with pages that they never used and we have no
> observability into where it came from. Most importantly, over time
> user memory will keep accumulating at the root, reducing the accuracy
> and usefulness of accounting, effectively an accounting leak and
> reduction of capacity. Memory that is not attributed to any user, aka
> system overhead.

That really sounds like the setup is missing cgroup layers tracking
persistent resources. Most of the problems you describe can be solved by
adding cgroup layers at the right spots which would usually align with the
logical structure of the system, right?

...
> I believe recharging is being mis-framed here :)
> 
> Recharging semantics are not new, it is a shortcut to a process that
> is already happening that is focused on offline memcgs. Let's take a
> step back.

Yeah, it does sound better when viewed that way. I'm still not sure what
extra problems it solves tho. We experienced similar problems but AFAIK all
of them came down to needing the appropriate hierarchical structure to
capture how resources are being used on systems.

Thanks.
Yosry Ahmed July 21, 2023, 8:37 p.m. UTC | #16
On Fri, Jul 21, 2023 at 12:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > > memory at least in our case. The sharing across them comes down to things
> > > > > like some common library pages which don't really account for much these
> > > > > days.
> > > >
> > > > Keep in mind that even a single page charged to a memcg and used by
> > > > another memcg is sufficient to result in a zombie memcg.
> > >
> > > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > > that controversial. That can be deterministically solved by reparenting to
> > > the parent like how slab is handled. I think the "deterministic" part is
> > > important here. As you said, even a single page can pin a dying cgroup.
> >
> > There are serious flaws with reparenting that I mentioned above. We do
> > it for kernel memory, but that's because we really have no other
> > choice. Oftentimes the memory is not reclaimable and we cannot find an
> > owner for it. This doesn't mean it's the right answer for user memory.
> >
> > The semantics are new compared to normal charging (as opposed to
> > recharging, as I explain below). There is an extra layer of
> > indirection that we did not (as far as I know) measure the impact of.
> > Parents end up with pages that they never used and we have no
> > observability into where it came from. Most importantly, over time
> > user memory will keep accumulating at the root, reducing the accuracy
> > and usefulness of accounting, effectively an accounting leak and
> > reduction of capacity. Memory that is not attributed to any user, aka
> > system overhead.
>
> That really sounds like the setup is missing cgroup layers tracking
> persistent resources. Most of the problems you describe can be solved by
> adding cgroup layers at the right spots which would usually align with the
> logical structure of the system, right?

It is difficult to track down all persistent/shareable resources and
find their users, especially when both the resources and the users are
changing dynamically. A simple example is the text of a shared
library, or sidecar processes that run alongside different workloads and
need to have their usage charged to the workload, but may have memory
that is shared with or outlives any single workload. For those cases
there is no layering that would work. More
practically, sometimes userspace just doesn't even know what exactly
is being shared by whom.

>
> ...
> > I believe recharging is being mis-framed here :)
> >
> > Recharging semantics are not new, it is a shortcut to a process that
> > is already happening that is focused on offline memcgs. Let's take a
> > step back.
>
> Yeah, it does sound better when viewed that way. I'm still not sure what
> extra problems it solves tho. We experienced similar problems but AFAIK all
> of them came down to needing the appropriate hierarchical structure to
> capture how resources are being used on systems.

It solves the problem of zombie memcgs and unaccounted memory. It is
great that in some cases an appropriate hierarchy structure fixes the
problem by accurately capturing how resources are being shared, but in
some cases it's not as straightforward. Recharging attempts to fix the
problem in a way that is more consistent with current semantics and
more appealing than reparenting in terms of rightful ownership.

Some systems are not rebooted for months. Can you imagine how much
memory can be accumulated at the root (escaping all accounting) over
months of reparenting?

>
> Thanks.
>
> --
> tejun
Johannes Weiner July 21, 2023, 8:44 p.m. UTC | #17
On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > memory at least in our case. The sharing across them comes down to things
> > > > like some common library pages which don't really account for much these
> > > > days.
> > >
> > > Keep in mind that even a single page charged to a memcg and used by
> > > another memcg is sufficient to result in a zombie memcg.
> >
> > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > that controversial. That can be deterministically solved by reparenting to
> > the parent like how slab is handled. I think the "deterministic" part is
> > important here. As you said, even a single page can pin a dying cgroup.
> 
> There are serious flaws with reparenting that I mentioned above. We do
> it for kernel memory, but that's because we really have no other
> choice. Oftentimes the memory is not reclaimable and we cannot find an
> owner for it. This doesn't mean it's the right answer for user memory.
> 
> The semantics are new compared to normal charging (as opposed to
> recharging, as I explain below). There is an extra layer of
> indirection that we did not (as far as I know) measure the impact of.
> Parents end up with pages that they never used and we have no
> observability into where it came from. Most importantly, over time
> user memory will keep accumulating at the root, reducing the accuracy
> and usefulness of accounting, effectively an accounting leak and
> reduction of capacity. Memory that is not attributed to any user, aka
> system overhead.

Reparenting has been the behavior since the first iteration of cgroups
in the kernel. The initial implementation would loop over the LRUs and
reparent pages synchronously during rmdir. This had some locking
issues, so we switched to the current implementation of just leaving
the zombie memcg behind but neutralizing its controls.

Thanks to Roman's objcg abstraction, we can now go back to the old
implementation of directly moving pages up to avoid the zombies.

However, these were pure implementation changes. The user-visible
semantics never varied: when you delete a cgroup, any leftover
resources are subject to control by the remaining parent cgroups.
Don't remove control domains if you still need to control resources.
But none of this is new or would change in any way! Neutralizing
controls of a zombie cgroup results in the same behavior and
accounting as linking the pages to the parent cgroup's LRU!

The only thing that's new is the zombie cgroups. We can fix that by
effectively going back to the earlier implementation, but thanks to
objcg without the locking problems.
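
For context, a simplified sketch of the objcg indirection being referred to
(it is already used for slab/kernel objects, and Muchun's patches extend it
to LRU folios; see include/linux/memcontrol.h for the real definition, which
has a few more fields):

/*
 * A charged object points to an obj_cgroup rather than directly to a memcg,
 * so reparenting only has to redirect obj_cgroup->memcg to the parent --
 * no per-page updates, which is what avoids the old locking problems.
 */
struct obj_cgroup {
	struct percpu_ref refcnt;
	struct mem_cgroup *memcg;	/* switched to the parent on offline */
	atomic_t nr_charged_bytes;
	/* ... */
};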

I just wanted to address this, because your description/framing of
reparenting strikes me as quite wrong.
Yosry Ahmed July 21, 2023, 8:59 p.m. UTC | #18
On Fri, Jul 21, 2023 at 1:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > > > memory at least in our case. The sharing across them comes down to things
> > > > > like some common library pages which don't really account for much these
> > > > > days.
> > > >
> > > > Keep in mind that even a single page charged to a memcg and used by
> > > > another memcg is sufficient to result in a zombie memcg.
> > >
> > > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > > that controversial. That can be deterministically solved by reparenting to
> > > the parent like how slab is handled. I think the "deterministic" part is
> > > important here. As you said, even a single page can pin a dying cgroup.
> >
> > There are serious flaws with reparenting that I mentioned above. We do
> > it for kernel memory, but that's because we really have no other
> > choice. Oftentimes the memory is not reclaimable and we cannot find an
> > owner for it. This doesn't mean it's the right answer for user memory.
> >
> > The semantics are new compared to normal charging (as opposed to
> > recharging, as I explain below). There is an extra layer of
> > indirection that we did not (as far as I know) measure the impact of.
> > Parents end up with pages that they never used and we have no
> > observability into where it came from. Most importantly, over time
> > user memory will keep accumulating at the root, reducing the accuracy
> > and usefulness of accounting, effectively an accounting leak and
> > reduction of capacity. Memory that is not attributed to any user, aka
> > system overhead.
>
> Reparenting has been the behavior since the first iteration of cgroups
> in the kernel. The initial implementation would loop over the LRUs and
> reparent pages synchronously during rmdir. This had some locking
> issues, so we switched to the current implementation of just leaving
> the zombie memcg behind but neutralizing its controls.

Thanks for the context.

>
> Thanks to Roman's objcg abstraction, we can now go back to the old
> implementation of directly moving pages up to avoid the zombies.
>
> However, these were pure implementation changes. The user-visible
> semantics never varied: when you delete a cgroup, any leftover
> resources are subject to control by the remaining parent cgroups.
> Don't remove control domains if you still need to control resources.
> But none of this is new or would change in any way!

The problem is that you cannot fully monitor or control all the
resources charged to a control domain. The example of common shared
libraries stands: the pages are charged on a first-touch basis. You
can't easily control it or monitor who is charged for what exactly.
Even if you can find out, is the answer to leave the cgroup alive
forever because it is charged for a shared resource?

> Neutralizing
> controls of a zombie cgroup results in the same behavior and
> accounting as linking the pages to the parent cgroup's LRU!
>
> The only thing that's new is the zombie cgroups. We can fix that by
> effectively going back to the earlier implementation, but thanks to
> objcg without the locking problems.
>
> I just wanted to address this, because your description/framing of
> reparenting strikes me as quite wrong.

Thanks for the context, and sorry if my framing was inaccurate. I was
more focused on the in-kernel semantics rather than user-visible
semantics. Nonetheless, with today's status or with reparenting, once
the memory is at the root level (whether reparented to the root level,
or in a zombie memcg whose parent is root), the memory has effectively
escaped accounting. This is not a new problem that reparenting would
introduce, but it's a problem that recharging is trying to fix that
reparenting won't.

As I outlined above, the semantics of recharging are not new, they are
equivalent to reclaiming and refaulting the memory in a more
accelerated/efficient manner. The indeterminism in recharging is very
similar to reclaiming and refaulting.

What do you think?
Michal Hocko Aug. 1, 2023, 9:54 a.m. UTC | #19
[Sorry for being late to this discussion]

On Thu 20-07-23 11:35:15, Johannes Weiner wrote:
[...]
> I'm super skeptical of this proposal.

Agreed.
 
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
> 
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process, which causes pages that should
> belong to the same domain to be scattered around all over the place.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
> 
> There are two issues being conflated here:
> 
> a) the problem of zombie cgroups, and
> 
> b) who controls resources that outlive the control domain.
> 
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources no
> longer pin a dead cgroup anymore, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a more
> memory efficient implementation of the same exact thing.
> 
> b) is a discussion totally separate from this. We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
> 
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
> 
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.

Yes, fully agreed on both points. The problem with zombies is real but
reparenting should address it for a large part. Ownership is a different
problem. We have discussed that at LSFMM this year and in the past as
well I believe. What we probably need is a concept of taking an
ownership of the memory (something like madvise(MADV_OWN, range) or
fadvise for fd based resources). This would allow the caller to take
ownership of the said resource (like memcg charge of it).
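
Purely to illustrate the idea (MADV_OWN does not exist; the constant and
helper below are made up), such an interface might be used like this:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_OWN
#define MADV_OWN 100	/* made-up value, only for this sketch */
#endif

/* Ask the kernel to move the memcg charge of [addr, addr + len) to the
 * calling task's cgroup. Hypothetical interface, not implemented. */
static int claim_memcg_ownership(void *addr, size_t len)
{
	return madvise(addr, len, MADV_OWN);
}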

I understand that would require some changes to existing workloads.
Whatever the interface will be, it has to be explicit otherwise we
are hitting problems with unaccounted resources that are sitting without
any actual ownership and a nondeterministic, time-dependent hopping
over owners. In other words, nobody should be able to drop
responsibility of any object while it is still consuming resources.