
[RFC,0/4] memory tiering fairness by per-cgroup control of promotion and demotion

Message ID 20240920221202.1734227-1-kaiyang2@cs.cmu.edu

Message

Kaiyang Zhao Sept. 20, 2024, 10:11 p.m. UTC
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Currently, Linux has no concept of fairness in memory tiering. Depending on
the memory usage and access patterns of other colocated applications, an
application cannot know how much memory it will get in each tier, nor how much
its performance will suffer or benefit as a result.

Fairness is, however, important in a multi-tenant system. For example, an
application may need to meet a tail latency requirement, which can be difficult
to satisfy without a guaranteed amount of frequently accessed pages in top-tier
memory. Similarly, an application may want to declare a minimum throughput when
running on a system for capacity planning purposes, but without fairness
controls in memory tiering its throughput can fluctuate wildly as other
applications come and go on the system.

In this proposal, we amend the memory.low control in memcg to also protect a
cgroup’s memory usage in top-tier memory. The low protection for top-tier
memory is derived from memory.low, scaled by the ratio of top-tier memory to
total memory on the system, and is then applied when reclaiming top-tier
memory. Promotion by NUMA balancing is also throttled, by reducing the scanning
window, when top-tier memory is contended and the cgroup is over its
protection.

Experiments with microbenchmarks exhibiting a range of memory access patterns
and memory sizes confirmed that when top-tier memory is contended, the system
moves toward a stable memory distribution in which each cgroup’s usage of local
DRAM converges to its protected amount.

One notable missing part in the patches is determining which NUMA nodes have
top-tier memory; currently they hardcode node 0 as top-tier memory and node 1
as a CPU-less node backed by CXL memory. We’re working on removing this
assumption and applying the protection to all top-tier nodes in the system.
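For what it’s worth, the kernel already exposes node_is_toptier() in
include/linux/memory-tiers.h, so one possible direction (a sketch only, not
what the patches currently do) is to sum memory over all top-tier nodes rather
than assuming node 0:

#include <linux/memory-tiers.h>
#include <linux/nodemask.h>
#include <linux/mmzone.h>

/*
 * Sketch: compute the amount of top-tier memory by iterating over online
 * nodes instead of hardcoding node 0. Helper name is illustrative.
 */
static unsigned long toptier_present_pages(void)
{
	unsigned long pages = 0;
	int nid;

	for_each_online_node(nid)
		if (node_is_toptier(nid))
			pages += node_present_pages(nid);

	return pages;
}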

Your feedback is greatly appreciated!

Kaiyang Zhao (4):
  Add get_cgroup_local_usage for estimating the top-tier memory usage
  calculate memory.low for the local node and track its usage
  use memory.low local node protection for local node reclaim
  reduce NUMA balancing scan size of cgroups over their local memory.low

 include/linux/memcontrol.h   | 25 ++++++++-----
 include/linux/page_counter.h | 16 ++++++---
 kernel/sched/fair.c          | 54 +++++++++++++++++++++++++---
 mm/hugetlb_cgroup.c          |  4 +--
 mm/memcontrol.c              | 68 ++++++++++++++++++++++++++++++------
 mm/page_counter.c            | 52 +++++++++++++++++++++------
 mm/vmscan.c                  | 19 +++++++---
 7 files changed, 192 insertions(+), 46 deletions(-)

Comments

Kaiyang Zhao Nov. 8, 2024, 7:01 p.m. UTC | #1
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.

The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at tunable hotness levels. It allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second per
pass). The other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017,
which uses about 14GB of memory in total. We apply memory bandwidth limits
(1 Gbps of memory bandwidth per logical core) and mitigate LLC contention by
setting a cpuset for each cgroup.

Case 1: omnetpp running alone, without the microbenchmark.
It can use all local memory and faces no resource contention. This is the
optimal case.
Avg rate reported by SPEC = 84.7

Case 2: The two workloads stacked, without the fairness patches, with the
microbenchmark started first.
Avg = 62.7 (-25.9%)

Case 3: Set memory.low = 19GB for both workloads. This provides enough local
low protection to cover the entire memory usage of omnetpp (19GB scaled by the
256/320 top-tier ratio is roughly 15GB of local protection, above omnetpp’s
~14GB footprint).
Avg = 75.3 (-11.1%)
Analysis: omnetpp still has significant CXL memory usage (up to 3GB) by the
time it finishes, because its NUMA hint faults only trigger for a few seconds
of the ~20 minute runtime. Due to the short runtime of the workload and how
tiering currently works, it finishes before its memory usage converges to the
point where all of it is local. However, this is still a significant
improvement over case 2.

Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference from case 1)
Analysis: by setting both memory.low and memory.high, the microbenchmark’s
usage of local memory is essentially provisioned. Therefore, even if the
microbenchmark starts first, when omnetpp starts it can get all the local
memory it needs from the very beginning and achieve near non-colocated
performance.
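For reference, the case-4 configuration amounts to writes like the following
against the cgroup v2 interface. The mount point and cgroup names below are
made up for illustration; "microbench" and "omnetpp" stand for the two cgroups
described above:

/* Illustrative only: apply the case-4 limits via the cgroup v2 files. */
#include <stdio.h>

static int write_cg_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* low protection for both cgroups (scaled to local memory by the patches) */
	write_cg_file("/sys/fs/cgroup/microbench/memory.low", "19G");
	write_cg_file("/sys/fs/cgroup/omnetpp/memory.low", "19G");
	/* cap the microbenchmark so local memory is effectively provisioned */
	write_cg_file("/sys/fs/cgroup/microbench/memory.high", "257G");
	return 0;
}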

We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.