Message ID: 20231129032154.3710765-4-yosryahmed@google.com (mailing list archive)
State:      New
Series:     mm: memcg: subtree stats flushing and thresholds
On Wed, Nov 29, 2023 at 03:21:51AM +0000, Yosry Ahmed wrote:
> A global counter for the magnitude of memcg stats updates is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The usual scheme is followed: percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters remove this as a blocker to doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept; adding
> a cgroup to the rstat tree involves a parent loop, and so does charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
>
> (1) Running 22 instances of netperf on a 44 cpu machine with
> hyperthreading disabled. All instances are run in a level 2 cgroup, as
> well as netserver:
> # netserver -6
> # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Averaging 20 runs, the numbers are as follows:
> Base: 40198.0 mbps
> Patched: 38629.7 mbps (-3.9%)
>
> The regression is minimal, especially for 22 instances in the same
> cgroup sharing all ancestors (so updating the same atomics).
>
> (2) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in the page_fault3 test) previously detected a 25.9%
> regression for a change in the stats update path [1].
> These are the numbers from 10 runs (+ is good) on a machine with 256 cpus:
>
> LABEL                         | MEAN        | MEDIAN      | STDDEV      |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops   |             |             |             |
> (A) base                      | 270249.164  | 265437.000  | 13451.836   |
> (B) patched                   | 261368.709  | 255725.000  | 13394.767   |
>                               | -3.29%      | -3.66%      |             |
> page_fault1_per_thread_ops    |             |             |             |
> (A) base                      | 242111.345  | 239737.000  | 10026.031   |
> (B) patched                   | 237057.109  | 235305.000  | 9769.687    |
>                               | -2.09%      | -1.85%      |             |
> page_fault1_scalability       |             |             |             |
> (A) base                      | 0.034387    | 0.035168    | 0.0018283   |
> (B) patched                   | 0.033988    | 0.034573    | 0.0018056   |
>                               | -1.16%      | -1.69%      |             |
> page_fault2_per_process_ops   |             |             |             |
> (A) base                      | 203561.836  | 203301.000  | 2550.764    |
> (B) patched                   | 197195.945  | 197746.000  | 2264.263    |
>                               | -3.13%      | -2.73%      |             |
> page_fault2_per_thread_ops    |             |             |             |
> (A) base                      | 171046.473  | 170776.000  | 1509.679    |
> (B) patched                   | 166626.327  | 166406.000  | 768.753     |
>                               | -2.58%      | -2.56%      |             |
> page_fault2_scalability       |             |             |             |
> (A) base                      | 0.054026    | 0.053821    | 0.00062121  |
> (B) patched                   | 0.053329    | 0.05306     | 0.00048394  |
>                               | -1.29%      | -1.41%      |             |
> page_fault3_per_process_ops   |             |             |             |
> (A) base                      | 1295807.782 | 1297550.000 | 5907.585    |
> (B) patched                   | 1275579.873 | 1273359.000 | 8759.160    |
>                               | -1.56%      | -1.86%      |             |
> page_fault3_per_thread_ops    |             |             |             |
> (A) base                      | 391234.164  | 390860.000  | 1760.720    |
> (B) patched                   | 377231.273  | 376369.000  | 1874.971    |
>                               | -3.58%      | -3.71%      |             |
> page_fault3_scalability       |             |             |             |
> (A) base                      | 0.60369     | 0.60072     | 0.0083029   |
> (B) patched                   | 0.61733     | 0.61544     | 0.009855    |
>                               | +2.26%      | +2.45%      |             |
>
> All regressions seem to be minimal, and within the normal variance for
> the benchmark. The fix for [1] assumed that 3% is noise (and there were
> no further practical complaints), so hopefully this means that such
> variations in these microbenchmarks do not reflect on practical
> workloads.
>
> (3) I also ran stress-ng in a nested cgroup and did not observe any
> obvious regressions.
>
> [1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>

Acked-by: Shakeel Butt <shakeelb@google.com>
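As an aside, to make the update-side scheme described above easier to follow in isolation, here is a small standalone userspace model (not kernel code, just a sketch). The node struct, BATCH, NCPUS and the single pcpu_updates counter per node are simplifications standing in for struct mem_cgroup, MEMCG_CHARGE_BATCH, num_online_cpus() and the real per-CPU counter; the atomic64_t is modeled as a plain long.

/* Standalone model of the per-memcg update threshold scheme (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

#define BATCH 64	/* stand-in for MEMCG_CHARGE_BATCH */
#define NCPUS 4		/* stand-in for num_online_cpus() */

struct node {
	struct node *parent;
	long pcpu_updates;	/* models the per-CPU stats_updates counter */
	long total_updates;	/* models the per-memcg atomic64_t total */
};

/* Models memcg_should_flush_stats(): enough pending updates to justify a flush? */
static int should_flush(struct node *n)
{
	return n->total_updates > (long)BATCH * NCPUS;
}

/* Models the ancestor walk the patch adds to memcg_rstat_updated(). */
static void rstat_updated(struct node *n, int val)
{
	if (!val)
		return;
	for (; n; n = n->parent) {
		n->pcpu_updates += abs(val);
		if (n->pcpu_updates < BATCH)
			continue;
		/* Once a node is already flush-able, skip the (modeled) atomic add. */
		if (!should_flush(n))
			n->total_updates += n->pcpu_updates;
		n->pcpu_updates = 0;
	}
}

int main(void)
{
	struct node root = { 0 };
	struct node child = { .parent = &root };

	for (int i = 0; i < 1000; i++)
		rstat_updated(&child, 1);

	printf("child flush-able: %d, root flush-able: %d\n",
	       should_flush(&child), should_flush(&root));
	return 0;
}

The point of the sketch is the cost argument in (b) above: the common case is one plain counter bump per ancestor, the shared total is touched at most once per BATCH updates, and it stops being touched entirely once the node already qualifies for a flush.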
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf05b97c1e824..93b483b379aa1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -631,6 +631,9 @@ struct memcg_vmstats_percpu {
 	/* Cgroup1: threshold notifications & softlimit tree updates */
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
+
+	/* Stats updates since the last flush */
+	unsigned int		stats_updates;
 };
 
 struct memcg_vmstats {
@@ -645,6 +648,9 @@ struct memcg_vmstats {
 	/* Pending child counts during tree propagation */
 	long			state_pending[MEMCG_NR_STAT];
 	unsigned long		events_pending[NR_MEMCG_EVENTS];
+
+	/* Stats updates since the last flush */
+	atomic64_t		stats_updates;
 };
 
 /*
@@ -664,9 +670,7 @@ struct memcg_vmstats {
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
@@ -693,26 +697,37 @@ static void memcg_stats_unlock(void)
 	preempt_enable_nested();
 }
 
+
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+	return atomic64_read(&memcg->vmstats->stats_updates) >
+		MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+	int cpu = smp_processor_id();
 	unsigned int x;
 
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+					  abs(val));
+
+		if (x < MEMCG_CHARGE_BATCH)
+			continue;
 
-	x = __this_cpu_add_return(stats_updates, abs(val));
-	if (x > MEMCG_CHARGE_BATCH) {
 		/*
-		 * If stats_flush_threshold exceeds the threshold
-		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
-		 * is redundant and simply adds overhead in atomic update.
+		 * If @memcg is already flush-able, increasing stats_updates is
+		 * redundant. Avoid the overhead of the atomic update.
 		 */
-		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
-			atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
-		__this_cpu_write(stats_updates, 0);
+		if (!memcg_should_flush_stats(memcg))
+			atomic64_add(x, &memcg->vmstats->stats_updates);
+		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
 	}
 }
 
@@ -731,13 +746,12 @@ static void do_flush_stats(void)
 
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+	if (memcg_should_flush_stats(root_mem_cgroup))
 		do_flush_stats();
 }
 
@@ -751,8 +765,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	/*
-	 * Always flush here so that flushing in latency-sensitive paths is
-	 * as cheap as possible.
+	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
+	 * in latency-sensitive paths is as cheap as possible.
 	 */
 	do_flush_stats();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5809,6 +5823,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			}
 		}
 	}
+	statc->stats_updates = 0;
+	/* We are in a per-cpu loop here, only do the atomic write once */
+	if (atomic64_read(&memcg->vmstats->stats_updates))
+		atomic64_set(&memcg->vmstats->stats_updates, 0);
 }
 
 #ifdef CONFIG_MMU
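To attach rough numbers to cost (c) from the quoted commit message, here is a minimal standalone calculation. The CPU and memcg counts are made-up example values, and it assumes MEMCG_CHARGE_BATCH is 64 (its current definition); only the scaling relationship comes from the patch.

/* Worst-case pending-update magnitude before the threshold flush triggers. */
#include <stdio.h>

int main(void)
{
	const long cpus = 256;		/* example machine size (assumption) */
	const long memcgs = 1000;	/* example number of memcgs (assumption) */
	const long batch = 64;		/* assumes MEMCG_CHARGE_BATCH == 64 */

	printf("old global bound:      %ld\n", cpus * batch);		/* 16384 */
	printf("new per-memcg bound:   %ld\n", cpus * batch);		/* 16384 */
	printf("new system-wide bound: %ld\n", cpus * batch * memcgs);	/* 16384000 */
	return 0;
}

In other words, each memcg keeps the same bound the system used to have globally, but summed over all memcgs the system-wide bound can be NR_MEMCGS times larger; the commit message argues this is acceptable given the similar percpu-stock charge error and the 2s periodic flusher.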