Message ID | 20211001190040.48086-1-shakeelb@google.com (mailing list archive) |
---|---|
State | New |
Series | [v2,1/2] memcg: flush stats only if updated |
On Fri, Oct 01, 2021 at 12:00:39PM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> In this patch we kept the stats update codepath very minimal and let the
> stats reader side to flush the stats only when the updates are over a
> specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).

BTW, a noob question -- are the updates always single page sized?

This is motivated by apples vs oranges comparison since the
        nr_cpus * MEMCG_CHARGE_BATCH
suggests what the expected error could be in pages (bytes). But it's mostly
wrong since: a) uncertain single-page updates, b) various counter
updates summed together. I wonder whether the formula can serve to
provide at least some (upper) estimate.

> +static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
> +{
> +        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> +        if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
> +                atomic_inc(&stats_flush_threshold);
> +}

Neat trick! (I guess there are no benchmarks complaining about this (yet)).

Overall,
Reviewed-by: Michal Koutný <mkoutny@suse.com>
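To put rough numbers on the bound being questioned here, the following is a stand-alone user-space sketch, not kernel code; the MEMCG_CHARGE_BATCH value of 32, the 4 KiB page size, and the 64-CPU machine are illustrative assumptions rather than anything stated in this thread:

#include <stdio.h>

/* Illustrative sketch: the reader-side flush only triggers after roughly
 * nr_cpus * MEMCG_CHARGE_BATCH update events. If every event were a single
 * 4 KiB page the lag would be bounded in pages, but a hugepage LRU update
 * or a byte-sized slab counter changes the "event" unit, which is the
 * apples-vs-oranges concern raised above.
 */
#define MEMCG_CHARGE_BATCH 32UL   /* assumed kernel default */
#define PAGE_SIZE_B 4096UL        /* assumed 4 KiB pages */

int main(void)
{
        unsigned long nr_cpus = 64;   /* example machine */
        unsigned long events = nr_cpus * MEMCG_CHARGE_BATCH;

        printf("flush threshold: %lu update events\n", events);
        printf("lag if every event is one 4K page:   %lu KiB\n",
               events * PAGE_SIZE_B / 1024);
        printf("lag if every event is a 2M hugepage: %lu KiB\n",
               events * 512 * PAGE_SIZE_B / 1024);
        return 0;
}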
Hi Michal,

On Wed, Oct 13, 2021 at 11:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Oct 01, 2021 at 12:00:39PM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> > In this patch we kept the stats update codepath very minimal and let the
> > stats reader side to flush the stats only when the updates are over a
> > specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
>
> BTW, a noob question -- are the updates always single page sized?
>
> This is motivated by apples vs oranges comparison since the
>         nr_cpus * MEMCG_CHARGE_BATCH
> suggests what the expected error could be in pages (bytes). But it's mostly
> wrong since: a) uncertain single-page updates, b) various counter
> updates summed together. I wonder whether the formula can serve to
> provide at least some (upper) estimate.
>

Thanks for your review. This forces me to think more on this because each
update is not necessarily a single page sized update, e.g. adding a hugepage
to an LRU.

Though I think the error is time bounded by 2 seconds, mathematically the
error can be large within those 2 seconds.

What do you think of the following change? It will bound the error better
within the 2 seconds window.


From e87a36eedd02b0d10d8f66f83833bd6e2bae17b8 Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeelb@google.com>
Date: Thu, 14 Oct 2021 08:49:06 -0700
Subject: [PATCH] Better bounds on the stats error

---
 mm/memcontrol.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f1d9c028897..e5d5c850a521 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -626,14 +626,20 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_SPINLOCK(stats_flush_lock);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_PER_CPU(int, stats_diff);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 
-static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
+static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+        unsigned int x;
+
         cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
-        if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
-                atomic_inc(&stats_flush_threshold);
+
+        x = abs(__this_cpu_add_return(stats_diff, val));
+        if (x > MEMCG_CHARGE_BATCH) {
+                atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
+                __this_cpu_write(stats_diff, 0);
+        }
 }
 
 static void __mem_cgroup_flush_stats(void)
@@ -672,7 +678,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
                 return;
 
         __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
-        memcg_rstat_updated(memcg);
+        memcg_rstat_updated(memcg, val);
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -705,7 +711,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
         /* Update lruvec */
         __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
-        memcg_rstat_updated(memcg);
+        memcg_rstat_updated(memcg, val);
 }
 
 /**
@@ -807,7 +813,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
                 return;
 
         __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
-        memcg_rstat_updated(memcg);
+        memcg_rstat_updated(memcg, val);
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
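To see how the proposed stats_diff accounting behaves, here is a minimal single-threaded user-space model of the logic in the patch above (per-CPU variables and atomics are collapsed into plain integers; it mirrors the patch only loosely and is just a sketch):

#include <stdio.h>
#include <stdlib.h>

#define MEMCG_CHARGE_BATCH 32

static int stats_diff;                  /* models the per-CPU counter */
static int stats_flush_threshold;       /* models the global atomic   */

static void rstat_updated(int val)
{
        int x;

        /* Accumulate the signed magnitude of the update ... */
        stats_diff += val;
        x = abs(stats_diff);
        /* ... and convert it into "batches pending" once it is big enough. */
        if (x > MEMCG_CHARGE_BATCH) {
                stats_flush_threshold += x / MEMCG_CHARGE_BATCH;
                stats_diff = 0;
        }
}

int main(void)
{
        rstat_updated(1);       /* single page charge: stays below the batch */
        rstat_updated(512);     /* 2M hugepage (512 pages): diff hits 513, threshold += 16 */
        rstat_updated(-1);      /* single page uncharge */
        printf("stats_flush_threshold = %d, stats_diff = %d\n",
               stats_flush_threshold, stats_diff);
        return 0;
}

The point of the model is only that the threshold now advances in proportion to the page-sized error rather than the raw number of update events.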
On Thu, Oct 14, 2021 at 09:31:46AM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> Thanks for your review. This forces me to think more on this because each
> update is not necessarily a single page sized update, e.g. adding a hugepage
> to an LRU.

Aha, hugepages...

(I also noticed that on the opposite end of the size scale are
NR_SLAB_{UN,}RECLAIMABLE_B; the complementary problem to a too big error
would be too frequent flushes.)

> Though I think the error is time bounded by 2 seconds, mathematically the
> error can be large within those 2 seconds.

Honestly, I can't tell how much the (transient) errors in various
node_stat_item entries will or won't affect MM behavior. But having some
guards on it sounds practical when problems to troubleshoot arise.

> What do you think of the following change? It will bound the error
> better within the 2 seconds window.
> [...]
> -static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
> +static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> {
> +        unsigned int x;
> +
>         cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> -        if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
> -                atomic_inc(&stats_flush_threshold);
> +
> +        x = abs(__this_cpu_add_return(stats_diff, val));
> +        if (x > MEMCG_CHARGE_BATCH) {
> +                atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
> +                __this_cpu_write(stats_diff, 0);
> +        }
> }

Looks better wrt meaningful error calculation (and hopefully still doesn't
add too much to the hot path).

> @@ -807,7 +813,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>                 return;
>
>         __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> -        memcg_rstat_updated(memcg);
> +        memcg_rstat_updated(memcg, val);

s/val/count/

(Just thinking loudly.) At one moment I thought it could effectively even be
s/val/0/, since events aren't(?) inputs for reclaim calculations. But with
the introduced stats_diff it may happen that stats_diff flickers (its abs
value) within MEMCG_CHARGE_BATCH and stats_flush_threshold is never
incremented, basically disabling the periodic flush. Therefore the events
must also increment stats_diff.
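To make the flicker concern concrete, here is a tiny stand-alone user-space illustration (again only a sketch of the proposed stats_diff arithmetic, with the per-CPU and atomic details dropped):

#include <stdio.h>
#include <stdlib.h>

#define MEMCG_CHARGE_BATCH 32

int main(void)
{
        int stats_diff = 0, stats_flush_threshold = 0;

        /* A million perfectly balanced charge/uncharge pairs: abs(stats_diff)
         * never exceeds 1, so it never crosses MEMCG_CHARGE_BATCH and the
         * flush threshold stays at zero, which is the point that events must
         * also feed the counter in some form.
         */
        for (long i = 0; i < 2000000; i++) {
                int val = (i & 1) ? -1 : +1;
                int x;

                stats_diff += val;
                x = abs(stats_diff);
                if (x > MEMCG_CHARGE_BATCH) {
                        stats_flush_threshold += x / MEMCG_CHARGE_BATCH;
                        stats_diff = 0;
                }
        }

        printf("after 2M updates: stats_flush_threshold = %d\n",
               stats_flush_threshold);  /* prints 0 */
        return 0;
}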
On Thu, 14 Oct 2021 09:31:46 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> Hi Michal,
>
> On Wed, Oct 13, 2021 at 11:01 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Fri, Oct 01, 2021 at 12:00:39PM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> > > In this patch we kept the stats update codepath very minimal and let the
> > > stats reader side to flush the stats only when the updates are over a
> > > specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
> >
> > BTW, a noob question -- are the updates always single page sized?
> >
> > This is motivated by apples vs oranges comparison since the
> >         nr_cpus * MEMCG_CHARGE_BATCH
> > suggests what the expected error could be in pages (bytes). But it's mostly
> > wrong since: a) uncertain single-page updates, b) various counter
> > updates summed together. I wonder whether the formula can serve to
> > provide at least some (upper) estimate.
> >
>
> Thanks for your review. This forces me to think more on this because each
> update is not necessarily a single page sized update, e.g. adding a hugepage
> to an LRU.
>
> Though I think the error is time bounded by 2 seconds, mathematically the
> error can be large within those 2 seconds.

Sounds significant?

> What do you think of the following change? It will bound the error better
> within the 2 seconds window.

This didn't seem to go anywhere. I'll send "memcg: flush stats only if
updated" Linuswards, but please remember to resurrect this idea soonish
(this month?) if you think such a change is desirable.
On Thu, Nov 4, 2021 at 2:27 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 14 Oct 2021 09:31:46 -0700 Shakeel Butt <shakeelb@google.com> wrote:
>
> > Hi Michal,
> >
> > On Wed, Oct 13, 2021 at 11:01 AM Michal Koutný <mkoutny@suse.com> wrote:
> > >
> > > On Fri, Oct 01, 2021 at 12:00:39PM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> > > > In this patch we kept the stats update codepath very minimal and let the
> > > > stats reader side to flush the stats only when the updates are over a
> > > > specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
> > >
> > > BTW, a noob question -- are the updates always single page sized?
> > >
> > > This is motivated by apples vs oranges comparison since the
> > >         nr_cpus * MEMCG_CHARGE_BATCH
> > > suggests what the expected error could be in pages (bytes). But it's mostly
> > > wrong since: a) uncertain single-page updates, b) various counter
> > > updates summed together. I wonder whether the formula can serve to
> > > provide at least some (upper) estimate.
> > >
> >
> > Thanks for your review. This forces me to think more on this because each
> > update is not necessarily a single page sized update, e.g. adding a hugepage
> > to an LRU.
> >
> > Though I think the error is time bounded by 2 seconds, mathematically the
> > error can be large within those 2 seconds.
>
> Sounds significant?

Yes, it can be.

> > What do you think of the following change? It will bound the error better
> > within the 2 seconds window.
>
> This didn't seem to go anywhere. I'll send "memcg: flush stats only if
> updated" Linuswards, but please remember to resurrect this idea soonish
> (this month?) if you think such a change is desirable.
>

Yes, I will follow up on this soon.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7c9d5703700e..25f55636ca37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -103,11 +103,6 @@ static bool do_memsw_account(void)
         return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap;
 }
 
-/* memcg and lruvec stats flushing */
-static void flush_memcg_stats_dwork(struct work_struct *w);
-static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
-
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
@@ -613,6 +608,56 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
         return mz;
 }
 
+/*
+ * memcg and lruvec stats flushing
+ *
+ * Many codepaths leading to stats update or read are performance sensitive and
+ * adding stats flushing in such codepaths is not desirable. So, to optimize the
+ * flushing the kernel does:
+ *
+ * 1) Periodically and asynchronously flush the stats every 2 seconds to not let
+ *    rstat update tree grow unbounded.
+ *
+ * 2) Flush the stats synchronously on reader side only when there are more than
+ *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
+ *    will let stats be out of sync by at most (MEMCG_CHARGE_BATCH * nr_cpus) but
+ *    only for 2 seconds due to (1).
+ */
+static void flush_memcg_stats_dwork(struct work_struct *w);
+static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
+static DEFINE_SPINLOCK(stats_flush_lock);
+static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+
+static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
+{
+        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+        if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
+                atomic_inc(&stats_flush_threshold);
+}
+
+static void __mem_cgroup_flush_stats(void)
+{
+        if (!spin_trylock(&stats_flush_lock))
+                return;
+
+        cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+        atomic_set(&stats_flush_threshold, 0);
+        spin_unlock(&stats_flush_lock);
+}
+
+void mem_cgroup_flush_stats(void)
+{
+        if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+                __mem_cgroup_flush_stats();
+}
+
+static void flush_memcg_stats_dwork(struct work_struct *w)
+{
+        mem_cgroup_flush_stats();
+        queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -625,7 +670,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
                 return;
 
         __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
-        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+        memcg_rstat_updated(memcg);
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -653,10 +698,12 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
         memcg = pn->memcg;
 
         /* Update memcg */
-        __mod_memcg_state(memcg, idx, val);
+        __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
         /* Update lruvec */
         __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
+
+        memcg_rstat_updated(memcg);
 }
 
 /**
@@ -758,7 +805,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
                 return;
 
         __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
-        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+        memcg_rstat_updated(memcg);
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -5342,21 +5389,6 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
         memcg_wb_domain_size_changed(memcg);
 }
 
-void mem_cgroup_flush_stats(void)
-{
-        if (!spin_trylock(&stats_flush_lock))
-                return;
-
-        cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
-        spin_unlock(&stats_flush_lock);
-}
-
-static void flush_memcg_stats_dwork(struct work_struct *w)
-{
-        mem_cgroup_flush_stats();
-        queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
-}
-
 static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
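For readers who want to experiment with the thresholding scheme outside the kernel, here is a minimal single-threaded user-space model of the flush logic in the patch (per-CPU counters, atomics, rstat, and the 2-second deferrable work are all faked or elided; the names mirror the patch but this is only a sketch, not the kernel implementation):

#include <stdio.h>

#define MEMCG_CHARGE_BATCH 32

static unsigned int stats_updates;
static int stats_flush_threshold;
static unsigned int nr_cpus = 8;        /* pretend machine size */
static unsigned long flushes;

static void memcg_rstat_updated(void)
{
        /* Cheap update path: bump a counter, no flushing here. */
        if (!(++stats_updates % MEMCG_CHARGE_BATCH))
                stats_flush_threshold++;
}

static void mem_cgroup_flush_stats(void)
{
        /* Reader path: flush only if enough updates have accumulated. */
        if (stats_flush_threshold > (int)nr_cpus) {
                flushes++;              /* stands in for the rstat flush */
                stats_flush_threshold = 0;
        }
}

int main(void)
{
        for (int i = 0; i < 100000; i++) {
                memcg_rstat_updated();
                if (i % 1000 == 0)
                        mem_cgroup_flush_stats();
        }
        printf("%lu flushes for 100000 updates\n", flushes);
        return 0;
}

The model shows the intended trade-off: the update path stays a counter increment, and a flush is only attempted once roughly nr_cpus * MEMCG_CHARGE_BATCH updates have accumulated or the periodic worker fires.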