Message ID: 20240920221202.1734227-1-kaiyang2@cs.cmu.edu (mailing list archive)
Series: memory tiering fairness by per-cgroup control of promotion and demotion
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.
The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at tunable hotness levels. It allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second per
pass). The other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017,
which uses about 14GB of memory in total. We apply memory bandwidth limits (1
Gbps per logical core) and mitigate LLC contention by setting a cpuset for each
cgroup.
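
As a reference for the setup, below is a minimal userspace sketch of the
cpuset-based CPU partitioning; it assumes cgroup v2 is mounted at
/sys/fs/cgroup, that the two cgroups (named "micro" and "omnetpp" here purely
for illustration) already exist with the cpuset controller enabled, and that
the CPU ranges are hypothetical:

/* Sketch only: pin each workload's cgroup to its own set of cores. */
#include <stdio.h>

static int write_cg(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Illustrative CPU split to keep the workloads off shared cores. */
	write_cg("/sys/fs/cgroup/micro/cpuset.cpus", "0-31");
	write_cg("/sys/fs/cgroup/omnetpp/cpuset.cpus", "32-95");
	return 0;
}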
Case 1: omnetpp running without the microbenchmark.
It can use all local memory and runs without resource contention. This is the
optimal case.
Avg rate reported by SPEC = 84.7
Case 2: The two workloads stacked, without the fairness patches, with the
microbenchmark started first.
Avg = 62.7 (-25.9%)
Case 3: Set memory.low = 19GB for both workloads. This is enough local-memory
low protection to cover the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)
Analysis: omnetpp still uses a significant amount of CXL memory (up to 3GB) by
the time it finishes, because its hint faults only trigger for a few seconds of
the ~20 minute runtime. Due to the short runtime of the workload and how
tiering currently works, it finishes before its memory usage converges to the
point where all of its memory is local. However, this is still a significant
improvement over case 2.
Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference from case 1)
Analysis: by setting both memory.low and memory.high, the microbenchmark's use
of local memory is essentially provisioned. Therefore, even if the
microbenchmark starts first, omnetpp can get all of its memory in local DRAM
from the very beginning and achieve near non-colocated performance.
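
For concreteness, here is a minimal userspace sketch of how the case 4 knobs
could be written through the cgroup v2 filesystem; the mount point
/sys/fs/cgroup and the cgroup names "micro" and "omnetpp" are assumptions for
illustration, not part of the patches:

/* Sketch only: apply the case 4 memory.low / memory.high settings. */
#include <stdio.h>

static int write_bytes(const char *path, unsigned long long bytes)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%llu\n", bytes);
	return fclose(f);
}

int main(void)
{
	const unsigned long long GB = 1ULL << 30;

	/* Low protection for both workloads (cases 3 and 4). */
	write_bytes("/sys/fs/cgroup/micro/memory.low", 19 * GB);
	write_bytes("/sys/fs/cgroup/omnetpp/memory.low", 19 * GB);

	/* Cap the microbenchmark so local memory is left for omnetpp. */
	write_bytes("/sys/fs/cgroup/micro/memory.high", 257 * GB);
	return 0;
}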
We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.
From: Kaiyang Zhao <kaiyang2@cs.cmu.edu>

Currently in Linux, there is no concept of fairness in memory tiering.
Depending on the memory usage and access patterns of other colocated
applications, an application cannot be sure of how much memory in which tier
it will get, and how much its performance will suffer or benefit.

Fairness is, however, important in a multi-tenant system. For example, an
application may need to meet a certain tail latency requirement, which can be
difficult to satisfy without a certain amount of frequently accessed pages in
top-tier memory. Similarly, an application may want to declare a minimum
throughput when running on a system for capacity planning purposes, but
without fairness controls in memory tiering its throughput can fluctuate
wildly as other applications come and go on the system.

In this proposal, we amend the memory.low control in memcg to protect a
cgroup's memory usage in top-tier memory. The low protection for top-tier
memory is scaled proportionally to the ratio of top-tier memory to total
memory on the system, and is then applied to reclaim from top-tier memory (a
small userspace sketch of this scaling follows the diffstat below). Promotion
by NUMA balancing is also throttled, through a reduced scanning window, when
top-tier memory is contended and the cgroup is over its protection.

Experiments we did with microbenchmarks exhibiting a range of memory access
patterns and memory sizes confirmed that when top-tier memory is contended,
the system moves towards a stable memory distribution where each cgroup's
memory usage in local DRAM converges to the protected amount.

One notable missing part in the patches is determining which NUMA nodes have
top-tier memory; currently they hardcode node 0 as the top-tier node and node
1 as a CPU-less node backed by CXL memory. We're working on removing this
artifact and correctly applying the protection to the top-tier nodes in the
system.

Your feedback is greatly appreciated!

Kaiyang Zhao (4):
  Add get_cgroup_local_usage for estimating the top-tier memory usage
  calculate memory.low for the local node and track its usage
  use memory.low local node protection for local node reclaim
  reduce NUMA balancing scan size of cgroups over their local memory.low

 include/linux/memcontrol.h   | 25 ++++++++-----
 include/linux/page_counter.h | 16 ++++++---
 kernel/sched/fair.c          | 54 +++++++++++++++++++++++++---
 mm/hugetlb_cgroup.c          |  4 +--
 mm/memcontrol.c              | 68 ++++++++++++++++++++++++++++++------
 mm/page_counter.c            | 52 +++++++++++++++++++++------
 mm/vmscan.c                  | 19 +++++++---
 7 files changed, 192 insertions(+), 46 deletions(-)
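
As referenced above, a minimal userspace sketch of the proportional scaling of
memory.low to a top-tier ("local") protection. The helper name and the use of
the test system's 256GB DRAM + 64GB CXL sizes are illustrative assumptions,
not code from the patches:

/* Sketch only: local_low = memory.low * top_tier_bytes / total_bytes. */
#include <stdio.h>
#include <stdint.h>

static uint64_t scale_local_low(uint64_t memory_low,
				uint64_t top_tier_bytes, uint64_t total_bytes)
{
	/* 128-bit intermediate avoids overflow for byte-sized inputs. */
	return (uint64_t)((unsigned __int128)memory_low * top_tier_bytes /
			  total_bytes);
}

int main(void)
{
	const uint64_t GB = 1ULL << 30;

	/* 19GB memory.low on the 256GB DRAM + 64GB CXL test system. */
	uint64_t local_low = scale_local_low(19 * GB, 256 * GB, 320 * GB);

	/* Prints ~15.2 GB, which still covers omnetpp's ~14GB footprint. */
	printf("local low protection: %.1f GB\n", (double)local_low / GB);
	return 0;
}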