[RFC,v5,0/3] Add memory.max.effective for application's allocators

Message ID 20240606152232.20253-1-mkoutny@suse.com (mailing list archive)

Message

Michal Koutný June 6, 2024, 3:22 p.m. UTC
Some applications use memory cgroup limits to scale their own memory
needs. Reading memory.max of the application's own cgroup is not
sufficient because ancestors may impose tighter limits. The application
could traverse upwards to find the tightest limit, but this does not
work in a cgroup namespace, where the view of the cgroup hierarchy is
incomplete and the limit may be imposed from outside the namespace.
Additionally, applications should respond to limit changes.
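
For illustration, the upward traversal mentioned above could look roughly
like the following sketch. It assumes cgroup v2 mounted at /sys/fs/cgroup
and a fully visible hierarchy; the function names are made up for this
example. Inside a cgroup namespace the walk stops at the namespace root, so
limits set above it are silently missed, which is the gap this series is
about.

```python
import math
from pathlib import Path


def read_limit(path: Path) -> float:
    """Parse a cgroup v2 limit file: 'max' means unlimited, otherwise bytes."""
    try:
        text = path.read_text().strip()
    except FileNotFoundError:
        # The root cgroup has no memory.max; the file is also absent where
        # the memory controller is not enabled. Treat both as "no limit".
        return math.inf
    return math.inf if text == "max" else int(text)


def tightest_memory_max(cgroup_dir: str, cgroup_root: str = "/sys/fs/cgroup") -> float:
    """Walk from cgroup_dir up to cgroup_root and return the smallest memory.max."""
    root = Path(cgroup_root).resolve()
    current = Path(cgroup_dir).resolve()
    tightest = math.inf
    while True:
        tightest = min(tightest, read_limit(current / "memory.max"))
        # Stop at the visible cgroup root (or at "/" if cgroup_dir is not under it).
        if current == root or current == current.parent:
            break
        current = current.parent
    return tightest
```

The result is math.inf when no visible level sets a limit.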

(cgroup v1 used memory.stat:hierarchical_memory_limit to report the
value but there's no such counterpart in cgroup v2 memory.stat.)

Introduce a new memcg attribute file that contains the effective value
of the memory limit for a given cgroup (following the cpuset.cpus.effective
pattern) and that sends notifications, like memory.events, when the
effective limit changes.
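
From the application side, consuming such a file might look like the sketch
below. It is only an assumption that the proposed memory.max.effective would
hold a single value ("max" or a byte count) and would signal changes the way
other cgroup v2 files such as memory.events do, i.e. as a POLLPRI event on
an open file descriptor. The function names are hypothetical.

```python
import math
import os
import select


def parse_limit(text: str) -> float:
    """'max' means unlimited; anything else is a byte count."""
    text = text.strip()
    return math.inf if text == "max" else int(text)


def wait_for_limit_change(path, timeout_ms=None):
    """Block until the file signals a change; return the new limit, or None on timeout.

    Assumption: the file reports modifications as POLLPRI, like memory.events.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 64)  # read the current value before waiting for the next event
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        if not poller.poll(timeout_ms):  # an empty list means the timeout expired
            return None
        os.lseek(fd, 0, os.SEEK_SET)  # rewind to re-read the updated value
        return parse_limit(os.read(fd, 64).decode())
    finally:
        os.close(fd)
```

An allocator would typically call this in a loop from a helper thread and
resize its heap or caches whenever a new value is returned.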

Reasons for RFC:
1) Should global limit be included? (And respond to memory hotplug?)
2) Is swap.max.effective needed? (in v2 without memsw accounting)
3) Should memory.high be also handled?
4) What would be an alternative?

My answers to RFC:

1) No (there's no memory.max in global root memcg)
2) No (app doesn't have full control of memory that's swapped out)
3) No (scaling the allocator against the "soft" limit could end up in
   dynamics that are difficult to reason about and administer)
4)
- PSI (too obscure for traditional users but better semantics for limit
  shrinking)
- memory.stat field (like v1 but separate attribute is better for
  notifications, cpuset precedent)

Changes from v4 (https://lore.kernel.org/r/ZcvlhOZ4VBEX9raZ@host1.jankratochvil.net)
- split the patch for swap.max.effective
- add Documentation/
- reword commit messages
- add notification support

Michal Koutný (3):
  memcg: Add memory.max.effective attribute
  memcg: Add memory.swap.max.effective like hierarchical_memsw_limit
  memcg: Notify on memory.max.effective changes

 Documentation/admin-guide/cgroup-v2.rst |  6 ++++
 include/linux/memcontrol.h              |  2 ++
 mm/memcontrol.c                         | 46 +++++++++++++++++++++++++
 3 files changed, 54 insertions(+)


base-commit: 2df0193e62cf887f373995fb8a91068562784adc

Comments

Roman Gushchin June 6, 2024, 6:15 p.m. UTC | #1
On Thu, Jun 06, 2024 at 05:22:29PM +0200, Michal Koutný wrote:
> Some applications use memory cgroup limits to scale their own memory
> needs. Reading memory.max of the application's own cgroup is not
> sufficient because ancestors may impose tighter limits. The application
> could traverse upwards to find the tightest limit, but this does not
> work in a cgroup namespace, where the view of the cgroup hierarchy is
> incomplete and the limit may be imposed from outside the namespace.
> Additionally, applications should respond to limit changes.

If the goal is to detect how much memory it would be possible to allocate,
I'm not sure that knowing all memory.max limits higher in the hierarchy
really buys anything without knowing the actual usage and the potential
for memory reclaim across the entire tree.

E.g.:

A (max = 100G)
| \
B  C

C's effective max will come out as 100G, but if B.anon_usage = 100G and
there is no swap, the actual number is 0.
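
The arithmetic behind this example can be sketched as follows. The helper
names and the (limit, usage) representation are made up for illustration,
and the headroom figure deliberately ignores reclaimable memory and swap,
so it is a rough lower bound rather than what the kernel would actually
allow.

```python
GiB = 1 << 30


def effective_max(levels):
    """Tightest memory.max on the path from root to leaf.

    levels: list of (limit, usage) pairs, limit is None for 'max' (unlimited).
    """
    finite = [limit for limit, _ in levels if limit is not None]
    return min(finite) if finite else None


def headroom(levels):
    """Smallest (limit - hierarchical usage) on the path, never below zero."""
    room = [limit - usage for limit, usage in levels if limit is not None]
    return max(0, min(room)) if room else None


# Path from A down to C: A is limited to 100G and already charged with B's
# 100G of anonymous memory; C itself has no limit and no usage.
path_to_c = [(100 * GiB, 100 * GiB), (None, 0)]
print(effective_max(path_to_c) // GiB)  # 100
print(headroom(path_to_c))              # 0
```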

But if it's more about exploring the "invisible" part of the cgroup
tree configuration, it makes sense to me.
Not sure about the naming, maybe something like memory.tree.max
or memory.parent.max or even memory.hierarchical.max is a better fit.

Thanks!

Jan Kratochvil Aug. 17, 2024, 6 a.m. UTC | #2
On Fri, 07 Jun 2024 02:15:00 +0800, Roman Gushchin wrote:
> If the goal is to detect how much memory it would be possible to allocate,
> I'm not sure that knowing all memory.max limits higher in the hierarchy
> really buys anything without knowing the actual usage and the potential
> for memory reclaim across the entire tree.
> 
> E.g.:
> 
> A (max = 100G)
> | \
> B  C
> 
> C's effective max will come out as 100G, but if B.anon_usage = 100G and
> there is no swap, the actual number is 0.

Yes, it would be better to subtract the used memory from ancestor (and thus
even current) cgroups. The original use case of this feature is for cloud
nodes running a single Java JVM where the sibling cgroups are not an issue.


Jan Kratochvil

Michal Koutný Aug. 19, 2024, 4:42 p.m. UTC | #3
Hello.

On Sat, Aug 17, 2024 at 02:00:15PM GMT, Jan Kratochvil <jkratochvil@azul.com> wrote:
> Yes, it would be better to subtract the used memory from ancestor (and thus
> even current) cgroups.

Then it becomes a more dynamic characteristic and it leads to
calculations of available memory. I'm sharing a link [1] for completeness
and to avoid repeating that discussion (it ended up with no
memory.stat:avail).


> The original use case of this feature is for cloud nodes running a
> single Java JVM where the sibling cgroups are not an issue.

IIUC, it's a tree like this:

        O
      / | \
     A  B  C	// B:memory.max < O:memory.max
        |
       ...
        |
        W	// workload

This picture made me realize that the memory controller may not even be
enabled all the way down from B to W, i.e. W would have no
memory.max.effective, IOW a memory.* attribute would not be the right
place for such a value. That would even apply in the apparently
purposeful case if there was a cgroup NS boundary between B and W.

(At least in the proposed implementation, the memory.* file would have to be
decoupled from the memory controller, similarly to e.g. cpu.stat:usage_usec.)

Jan, do I get the tree shape right? Are B and W in different cgroup
namespaces?

Thanks,
Michal

[1] https://lore.kernel.org/all/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/