Message ID | 172070450139.2992819.13210624094367257881.stgit@firesoul (mailing list archive) |
---|---|
State | New |
Series | [V7,1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes |
Production results below. - On AMD CPU with 12 NUMA nodes and 192 cores. Using this bpftrace (looong) oneliner: sudo bpftrace -e ' tracepoint:cgroup:cgroup_rstat_locked {@cnt[probe]=count(); if (args->contended) { @lock_contended["after_obtaining_lock", args->level]=count(); } else { if (args->cpu == -1) { @lock_contended["no_ongoing_flusher", args->level]} }} tracepoint:cgroup:cgroup_rstat_lock_contended {@cnt[probe]=count(); if (args->cpu == -1) {@lock_contended["normal", args->level] = count();} else {@lock_contended["yield", args->level] = count();}} kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()} tracepoint:cgroup:cgroup_ongoing_flusher {@start[tid]=nsecs; if (args->race) { printf(" - XXXXXXX[%d]: Handled RACE(%d) for ongoing flusher\n", tid, args->race); @ongoing_flusher_cnt["handled_race"]=count(); } } tracepoint:cgroup:cgroup_ongoing_flusher_wait { $now=nsecs; $start=@start[tid]; $diff=$now-$start; @res=hist(args->res); @wait=hist($diff); @cnt[probe]=count(); @ongoing_flusher_cnt[comm]=count(); @ongoing_flusher_cnt["all"]=count(); $r=args->race; if ($r) { printf(" - Success[%d]: Handled RACE(%d) for ongoing flusher\n", tid, $r) } } interval:s:1 {time("%H:%M:%S "); print(@ongoing_flusher_cnt); print(@cnt); print(@lock_contended); clear (@cnt); clear (@ongoing_flusher_cnt); clear (@lock_contended); } END {clear(@start)}' On 11/07/2024 15.28, Jesper Dangaard Brouer wrote: > Avoid lock contention on the global cgroup rstat lock caused by kswapd > starting on all NUMA nodes simultaneously. At Cloudflare, we observed > massive issues due to kswapd and the specific mem_cgroup_flush_stats() > call inlined in shrink_node, which takes the rstat lock. > > On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, > we noted severe lock contention on the rstat lock. This contention > causes 12 CPUs to waste cycles spinning every time kswapd runs. > Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are > burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily > due to spinning on the rstat lock. > > Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds > causing all kswapdN threads to wake up simultaneously. The kswapd thread > invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush > operation as part of its work. This results in kernel self-induced rstat > lock contention by waking up all kswapd threads simultaneously. Leveraging > this detail: balance_pgdat() have NULL value in target_mem_cgroup, this > cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup. > > To avoid this kind of thundering herd problem, kernel previously had a > "stats_flush_ongoing" concept, but this was removed as part of commit > 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch > reintroduce and generalized the concept to apply to all users of cgroup > rstat, not just memcg. > > If there is an ongoing rstat flush, and current cgroup is a descendant, > then it is unnecessary to do the flush. For callers to still see updated > stats, wait for ongoing flusher to complete before returning, but add > timeout as stats are already inaccurate given updaters keeps running. > > Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). 
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> > --- > V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/ > V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/ > V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/ > V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/ > V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/ > V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/ > RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/ > > include/linux/cgroup-defs.h | 2 + > kernel/cgroup/rstat.c | 95 ++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 85 insertions(+), 12 deletions(-) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index b36690ca0d3f..a33b37514c29 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -548,6 +548,8 @@ struct cgroup { > #ifdef CONFIG_BPF_SYSCALL > struct bpf_local_storage __rcu *bpf_cgrp_storage; > #endif > + /* completion queue for cgrp_rstat_ongoing_flusher */ > + struct completion flush_done; > > /* All ancestors including self */ > struct cgroup *ancestors[]; > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index fb8b49437573..fe2a81a310bb 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -2,6 +2,7 @@ > #include "cgroup-internal.h" > > #include <linux/sched/cputime.h> > +#include <linux/completion.h> > > #include <linux/bpf.h> > #include <linux/btf.h> > @@ -11,6 +12,7 @@ > > static DEFINE_SPINLOCK(cgroup_rstat_lock); > static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); > +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL; > > static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); > > @@ -279,17 +281,32 @@ __bpf_hook_end(); > * value -1 is used when obtaining the main lock else this is the CPU > * number processed last. 
> */ > -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop) > +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop) > +{ > + bool locked; > + > + locked = spin_trylock_irq(&cgroup_rstat_lock); > + if (!locked) > + trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true); > + else > + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false); > + > + return locked; > +} > + > +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop, > + bool check_contention) > __acquires(&cgroup_rstat_lock) > { > - bool contended; > + bool locked = false; > > - contended = !spin_trylock_irq(&cgroup_rstat_lock); > - if (contended) { > - trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended); > + if (check_contention) > + locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop); > + > + if (!locked) { > spin_lock_irq(&cgroup_rstat_lock); > + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention); > } > - trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended); > } > > static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > spin_unlock_irq(&cgroup_rstat_lock); > } > > +#define MAX_WAIT msecs_to_jiffies(100) > +/* Trylock helper that also checks for on ongoing flusher */ > +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) > +{ > + struct cgroup *ongoing; > + bool locked; > + > + /* Check if ongoing flusher is already taking care of this, if > + * we are a descendant skip work, but wait for ongoing flusher > + * to complete work. > + */ > +retry: > + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); > + if (ongoing && cgroup_is_descendant(cgrp, ongoing)) { > + wait_for_completion_interruptible_timeout( > + &ongoing->flush_done, MAX_WAIT); > + /* TODO: Add tracepoint here */ > + return false; > + } > + > + locked = __cgroup_rstat_trylock(cgrp, -1); > + if (!locked) { > + /* Contended: Handle loosing race for ongoing flusher */ > + if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher)) > + goto retry; This race do happen in production: - XXXXXXX[1040]: Handled RACE(1) for ongoing flusher - Success[1040]: Handled RACE(1) for ongoing flusher - XXXXXXX[1038]: Handled RACE(1) for ongoing flusher - Success[1038]: Handled RACE(1) for ongoing flusher - XXXXXXX[1040]: Handled RACE(1) for ongoing flusher - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher - Success[1039]: Handled RACE(1) for ongoing flusher - Success[1040]: Handled RACE(1) for ongoing flusher - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher - Success[1047]: Handled RACE(1) for ongoing flusher - Success[1044]: Handled RACE(1) for ongoing flusher - Success[1045]: Handled RACE(1) for ongoing flusher - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher - XXXXXXX[1042]: Handled RACE(1) for ongoing flusher - XXXXXXX[1037]: Handled RACE(1) for ongoing flusher - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher - Success[1043]: Handled RACE(1) for ongoing flusher - Success[1037]: Handled RACE(1) for ongoing flusher - Success[1039]: Handled RACE(1) for ongoing flusher - Success[1044]: Handled RACE(1) for ongoing flusher - Success[1042]: Handled RACE(1) for ongoing flusher - Success[1047]: Handled RACE(1) for ongoing flusher - 
XXXXXXX[1038]: Handled RACE(1) for ongoing flusher - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher - Success[1039]: Handled RACE(1) for ongoing flusher - Success[1038]: Handled RACE(1) for ongoing flusher - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher - Success[1048]: Handled RACE(1) for ongoing flusher - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher - XXXXXXX[1046]: Handled RACE(1) for ongoing flusher - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher - Success[1045]: Handled RACE(1) for ongoing flusher - Success[1043]: Handled RACE(1) for ongoing flusher - Success[1047]: Handled RACE(1) for ongoing flusher - Success[1046]: Handled RACE(1) for ongoing flusher - Success[1048]: Handled RACE(1) for ongoing flusher - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher - XXXXXXX[1042]: Handled RACE(1) for ongoing flusher - XXXXXXX[1041]: Handled RACE(1) for ongoing flusher - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher - Success[1045]: Handled RACE(1) for ongoing flusher - Success[1041]: Handled RACE(1) for ongoing flusher - Success[1042]: Handled RACE(1) for ongoing flusher - Success[1043]: Handled RACE(1) for ongoing flusher - Success[1044]: Handled RACE(1) for ongoing flusher - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher - Success[1048]: Handled RACE(1) for ongoing flusher 07:42:02 @ongoing_flusher_cnt[kswapd11]: 7 @ongoing_flusher_cnt[kswapd4]: 7 @ongoing_flusher_cnt[kswapd10]: 7 @ongoing_flusher_cnt[kswapd9]: 8 @ongoing_flusher_cnt[kswapd6]: 8 @ongoing_flusher_cnt[kswapd3]: 8 @ongoing_flusher_cnt[kswapd8]: 8 @ongoing_flusher_cnt[kswapd7]: 9 @ongoing_flusher_cnt[kswapd1]: 9 @ongoing_flusher_cnt[kswapd0]: 9 @ongoing_flusher_cnt[kswapd5]: 9 @ongoing_flusher_cnt[kswapd2]: 9 @ongoing_flusher_cnt[handled_race]: 27 @ongoing_flusher_cnt[all]: 98 @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 27 @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 37 @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 45 @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 98 @lock_contended[normal, 0]: 27 This is a clean (meaning no cadvisor interference) example of kswapd starting simultaniously on many NUMA nodes, that in 27 out of 98 cases hit the race (which is handled in V6 and V7). The BPF "cnt" maps are getting cleared every second, so this approximates per sec numbers. This patch reduce pressure on the lock, but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full flushes approx 37 per sec (every 27 ms). On the positive side ongoing_flusher mitigation stopped 98 per sec of these. In this clean kswapd case the patch removes the lock contention issue for kswapd. The lock_contended cases 27 seems to be all related to handled_race cases 27. The remaning high flush rate should also be addressed, and we should also work on aproaches to limit this like my ealier proposal[1]. 
[1] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/#Z31kernel:cgroup:rstat.c > + > + __cgroup_rstat_lock(cgrp, -1, false); > + } > + /* Obtained lock, record this cgrp as the ongoing flusher */ > + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); > + if (!ongoing) { > + reinit_completion(&cgrp->flush_done); > + WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp); > + } > + return true; /* locked */ > +} > + > +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp) > +{ > + /* Detect if we are the ongoing flusher */ > + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) { > + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL); > + complete_all(&cgrp->flush_done); > + } > + __cgroup_rstat_unlock(cgrp, -1); > +} > + > /* see cgroup_rstat_flush() */ > static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) > @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > __cgroup_rstat_unlock(cgrp, cpu); > if (!cond_resched()) > cpu_relax(); > - __cgroup_rstat_lock(cgrp, cpu); > + __cgroup_rstat_lock(cgrp, cpu, true); > } > } > } > @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) > { > might_sleep(); > > - __cgroup_rstat_lock(cgrp, -1); > + if (!cgroup_rstat_trylock_flusher(cgrp)) > + return; > + > cgroup_rstat_flush_locked(cgrp); > - __cgroup_rstat_unlock(cgrp, -1); > + cgroup_rstat_unlock_flusher(cgrp); > } > > /** > @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) > __acquires(&cgroup_rstat_lock) > { > might_sleep(); > - __cgroup_rstat_lock(cgrp, -1); > - cgroup_rstat_flush_locked(cgrp); > + > + if (cgroup_rstat_trylock_flusher(cgrp)) > + cgroup_rstat_flush_locked(cgrp); > + else > + __cgroup_rstat_lock(cgrp, -1, false); > } > > /** > @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) > void cgroup_rstat_flush_release(struct cgroup *cgrp) > __releases(&cgroup_rstat_lock) > { > - __cgroup_rstat_unlock(cgrp, -1); > + cgroup_rstat_unlock_flusher(cgrp); > } > > int cgroup_rstat_init(struct cgroup *cgrp) > @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) > u64_stats_init(&rstatc->bsync); > } > > + init_completion(&cgrp->flush_done); > + > return 0; > } > > >
On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: > > Avoid lock contention on the global cgroup rstat lock caused by kswapd > starting on all NUMA nodes simultaneously. At Cloudflare, we observed > massive issues due to kswapd and the specific mem_cgroup_flush_stats() > call inlined in shrink_node, which takes the rstat lock. > > On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, > we noted severe lock contention on the rstat lock. This contention > causes 12 CPUs to waste cycles spinning every time kswapd runs. > Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are > burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily > due to spinning on the rstat lock. > > Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds > causing all kswapdN threads to wake up simultaneously. The kswapd thread > invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush > operation as part of its work. This results in kernel self-induced rstat > lock contention by waking up all kswapd threads simultaneously. Leveraging > this detail: balance_pgdat() have NULL value in target_mem_cgroup, this > cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup. > > To avoid this kind of thundering herd problem, kernel previously had a > "stats_flush_ongoing" concept, but this was removed as part of commit > 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch > reintroduce and generalized the concept to apply to all users of cgroup > rstat, not just memcg. > > If there is an ongoing rstat flush, and current cgroup is a descendant, > then it is unnecessary to do the flush. For callers to still see updated > stats, wait for ongoing flusher to complete before returning, but add > timeout as stats are already inaccurate given updaters keeps running. > > Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). > Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Thanks for working on this, Jesper! I love the data you collected here! I think the commit subject and message should be changed to better describe the patch. This is a patch that exclusively modifies cgroup code, yet the subject is about kswapd. This change affects all users of rstat flushing. I think a better subject would be: "cgroup/rstat: avoid flushing if there is an ongoing overlapping flush" or similar. The commit message should first describe the cgroup change, and then use kswapd as a brief example/illustration of how the problem manifests in practice. You should also include a brief summary of the numbers you collected from prod. 
> --- > V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/ > V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/ > V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/ > V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/ > V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/ > V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/ > RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/ > > include/linux/cgroup-defs.h | 2 + > kernel/cgroup/rstat.c | 95 ++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 85 insertions(+), 12 deletions(-) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index b36690ca0d3f..a33b37514c29 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -548,6 +548,8 @@ struct cgroup { > #ifdef CONFIG_BPF_SYSCALL > struct bpf_local_storage __rcu *bpf_cgrp_storage; > #endif > + /* completion queue for cgrp_rstat_ongoing_flusher */ > + struct completion flush_done; > > /* All ancestors including self */ > struct cgroup *ancestors[]; > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index fb8b49437573..fe2a81a310bb 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -2,6 +2,7 @@ > #include "cgroup-internal.h" > > #include <linux/sched/cputime.h> > +#include <linux/completion.h> > > #include <linux/bpf.h> > #include <linux/btf.h> > @@ -11,6 +12,7 @@ > > static DEFINE_SPINLOCK(cgroup_rstat_lock); > static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); > +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL; > > static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); > > @@ -279,17 +281,32 @@ __bpf_hook_end(); > * value -1 is used when obtaining the main lock else this is the CPU > * number processed last. > */ > -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop) > +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop) > +{ > + bool locked; > + > + locked = spin_trylock_irq(&cgroup_rstat_lock); > + if (!locked) > + trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true); > + else > + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false); > + > + return locked; > +} > + > +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop, > + bool check_contention) > __acquires(&cgroup_rstat_lock) > { > - bool contended; > + bool locked = false; > > - contended = !spin_trylock_irq(&cgroup_rstat_lock); > - if (contended) { > - trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended); > + if (check_contention) > + locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop); > + > + if (!locked) { > spin_lock_irq(&cgroup_rstat_lock); > + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention); > } > - trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended); > } I will let Tejun and others weigh in about what tracepoints we should have going forward and how they should be structured (which also includes patch 2). I see the tremendous value they have to collect data from prod, but I do not see similar existing tracepoints, and I am not sure if the branching here could have an effect when the tracepoints are off. 
> > static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > spin_unlock_irq(&cgroup_rstat_lock); > } > > +#define MAX_WAIT msecs_to_jiffies(100) > +/* Trylock helper that also checks for on ongoing flusher */ > +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) > +{ > + struct cgroup *ongoing; > + bool locked; > + > + /* Check if ongoing flusher is already taking care of this, if nit: I think commonly the comment would start on a new line after /*. > + * we are a descendant skip work, but wait for ongoing flusher > + * to complete work. > + */ > +retry: > + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); > + if (ongoing && cgroup_is_descendant(cgrp, ongoing)) { > + wait_for_completion_interruptible_timeout( > + &ongoing->flush_done, MAX_WAIT); > + /* TODO: Add tracepoint here */ > + return false; > + } > + > + locked = __cgroup_rstat_trylock(cgrp, -1); > + if (!locked) { > + /* Contended: Handle loosing race for ongoing flusher */ nit: losing > + if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher)) > + goto retry; > + > + __cgroup_rstat_lock(cgrp, -1, false); > + } > + /* Obtained lock, record this cgrp as the ongoing flusher */ Do we want a comment here to explain why there could be an existing ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's not super obvious. > + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); > + if (!ongoing) { > + reinit_completion(&cgrp->flush_done); > + WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp); > + } > + return true; /* locked */ Would it be better to explain the return value of the function in the comment above it? > +} > + > +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp) > +{ > + /* Detect if we are the ongoing flusher */ I think this is a bit obvious. 
> + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) { > + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL); > + complete_all(&cgrp->flush_done); > + } > + __cgroup_rstat_unlock(cgrp, -1); > +} > + > /* see cgroup_rstat_flush() */ > static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) > @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > __cgroup_rstat_unlock(cgrp, cpu); > if (!cond_resched()) > cpu_relax(); > - __cgroup_rstat_lock(cgrp, cpu); > + __cgroup_rstat_lock(cgrp, cpu, true); > } > } > } > @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) > { > might_sleep(); > > - __cgroup_rstat_lock(cgrp, -1); > + if (!cgroup_rstat_trylock_flusher(cgrp)) > + return; > + > cgroup_rstat_flush_locked(cgrp); > - __cgroup_rstat_unlock(cgrp, -1); > + cgroup_rstat_unlock_flusher(cgrp); > } > > /** > @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) > __acquires(&cgroup_rstat_lock) > { > might_sleep(); > - __cgroup_rstat_lock(cgrp, -1); > - cgroup_rstat_flush_locked(cgrp); > + > + if (cgroup_rstat_trylock_flusher(cgrp)) > + cgroup_rstat_flush_locked(cgrp); > + else > + __cgroup_rstat_lock(cgrp, -1, false); > } > > /** > @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) > void cgroup_rstat_flush_release(struct cgroup *cgrp) > __releases(&cgroup_rstat_lock) > { > - __cgroup_rstat_unlock(cgrp, -1); > + cgroup_rstat_unlock_flusher(cgrp); > } > > int cgroup_rstat_init(struct cgroup *cgrp) > @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) > u64_stats_init(&rstatc->bsync); > } > > + init_completion(&cgrp->flush_done); > + > return 0; > } > > >
[..]
>
> This is a clean (meaning no cadvisor interference) example of kswapd
> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> hit the race (which is handled in V6 and V7).
>
> The BPF "cnt" maps are getting cleared every second, so this
> approximates per sec numbers. This patch reduce pressure on the lock,
> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> flushes approx 37 per sec (every 27 ms). On the positive side
> ongoing_flusher mitigation stopped 98 per sec of these.
>
> In this clean kswapd case the patch removes the lock contention issue
> for kswapd. The lock_contended cases 27 seems to be all related to
> handled_race cases 27.
>
> The remaning high flush rate should also be addressed, and we should
> also work on aproaches to limit this like my ealier proposal[1].

I honestly don't think a high number of flushes is a problem on its own
as long as we are not spending too much time flushing, especially when
we have magnitude-based thresholding so we know there is something to
flush (although it may not be relevant to what we are doing).

If we keep observing a lot of lock contention, one thing that I thought
about is to have a variant of spin_lock with a timeout. This limits the
flushing latency, instead of limiting the number of flushes (which I
believe is the wrong metric to optimize).

It also seems to me that we are doing a flush each 27ms, and your
proposed threshold was once per 50ms. It doesn't seem like a
fundamental difference.

I am also wondering how many more flushes could be skipped if we handle
the case of multiple ongoing flushers (whether by using a mutex, or
making it a per-cgroup property as I suggested earlier).
On 7/16/24 20:35, Yosry Ahmed wrote:
> [..]
>>
>> This is a clean (meaning no cadvisor interference) example of kswapd
>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
>> hit the race (which is handled in V6 and V7).
>>
>> The BPF "cnt" maps are getting cleared every second, so this
>> approximates per sec numbers. This patch reduce pressure on the lock,
>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>> flushes approx 37 per sec (every 27 ms). On the positive side
>> ongoing_flusher mitigation stopped 98 per sec of these.
>>
>> In this clean kswapd case the patch removes the lock contention issue
>> for kswapd. The lock_contended cases 27 seems to be all related to
>> handled_race cases 27.
>>
>> The remaning high flush rate should also be addressed, and we should
>> also work on aproaches to limit this like my ealier proposal[1].
> I honestly don't think a high number of flushes is a problem on its
> own as long as we are not spending too much time flushing, especially
> when we have magnitude-based thresholding so we know there is
> something to flush (although it may not be relevant to what we are
> doing).
>
> If we keep observing a lot of lock contention, one thing that I
> thought about is to have a variant of spin_lock with a timeout. This
> limits the flushing latency, instead of limiting the number of flushes
> (which I believe is the wrong metric to optimize).

Except for semaphore, none of our locking primitives allow for a timeout
parameter. For sleeping locks, I don't think it is hard to add variants
with timeout parameter, but not the spinning locks.

Cheers,
Longman
On 17/07/2024 02.30, Yosry Ahmed wrote: > On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: >> >> Avoid lock contention on the global cgroup rstat lock caused by kswapd >> starting on all NUMA nodes simultaneously. At Cloudflare, we observed >> massive issues due to kswapd and the specific mem_cgroup_flush_stats() >> call inlined in shrink_node, which takes the rstat lock. >> >> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, >> we noted severe lock contention on the rstat lock. This contention >> causes 12 CPUs to waste cycles spinning every time kswapd runs. >> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are >> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily >> due to spinning on the rstat lock. >> >> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds >> causing all kswapdN threads to wake up simultaneously. The kswapd thread >> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush >> operation as part of its work. This results in kernel self-induced rstat >> lock contention by waking up all kswapd threads simultaneously. Leveraging >> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this >> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup. >> >> To avoid this kind of thundering herd problem, kernel previously had a >> "stats_flush_ongoing" concept, but this was removed as part of commit >> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch >> reintroduce and generalized the concept to apply to all users of cgroup >> rstat, not just memcg. >> >> If there is an ongoing rstat flush, and current cgroup is a descendant, >> then it is unnecessary to do the flush. For callers to still see updated >> stats, wait for ongoing flusher to complete before returning, but add >> timeout as stats are already inaccurate given updaters keeps running. >> >> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). >> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Thanks for working on this, Jesper! I love the data you collected here! > > I think the commit subject and message should be changed to better > describe the patch. This is a patch that exclusively modifies cgroup > code, yet the subject is about kswapd. This change affects all users > of rstat flushing. > > I think a better subject would be: > "cgroup/rstat: avoid flushing if there is an ongoing overlapping > flush" or similar. > > The commit message should first describe the cgroup change, and then > use kswapd as a brief example/illustration of how the problem > manifests in practice. You should also include a brief summary of the > numbers you collected from prod. > I agree that description should be updated as patch have evolved, and you suggestions make sense, thanks. But I'm not sure we are finished evolving this patch, because the scheme of allowing all sub-cgroups to become the ongoing flusher (and only avoiding lock-spinning when cgroup_is_descendant), is causing issues in production. We are using cadvisor in production for collecting metrics, which walks all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat (doing 3 flushes of same cgroup in short timespan), but individual cgroups don't overlap much to benefit from ongoing_flusher scheme. 
Below is (1 sec) production data, where cadvisor and kswapd collide: > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1 > @ongoing_flusher_cnt[kswapd5]: 3 > @ongoing_flusher_cnt[kswapd7]: 4 > @ongoing_flusher_cnt[kswapd6]: 4 > @ongoing_flusher_cnt[kswapd8]: 4 > @ongoing_flusher_cnt[kswapd9]: 5 > @ongoing_flusher_cnt[kswapd2]: 5 > @ongoing_flusher_cnt[kswapd10]: 5 > @ongoing_flusher_cnt[kswapd3]: 5 > @ongoing_flusher_cnt[kswapd11]: 5 > @ongoing_flusher_cnt[kswapd0]: 5 > @ongoing_flusher_cnt[kswapd4]: 6 > @ongoing_flusher_cnt[kswapd1]: 7 > @ongoing_flusher_cnt[cadvisor]: 9 For cadvisor ongoing_flusher only helps 9 times to avoid flush. > @ongoing_flusher_cnt[handled_race]: 18 > @ongoing_flusher_cnt[all]: 61 Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes. > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61 > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462 > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032 This is bad: Lock is contended 9032 time within this 1 sec period. Below, lock_contended cases is captured in more detail. > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435 > @lock_contended[normal, 4]: 1 > @lock_contended[normal, 1]: 2 > @lock_contended[normal, 3]: 6 > @lock_contended[normal, 2]: 15 > @lock_contended[yield, 4]: 49 > @lock_contended[after_obtaining_lock, 4]: 49 > @lock_contended[normal, 0]: 59 The "normal" lock_contended for level 0, meaning the cgroup root is contended 59 times within this 1 sec period. > @lock_contended[after_obtaining_lock, 1]: 117 > @lock_contended[yield, 1]: 118 > @lock_contended[yield, 3]: 258 > @lock_contended[after_obtaining_lock, 3]: 258 > @lock_contended[after_obtaining_lock, 2]: 946 > @lock_contended[yield, 2]: 946 Lock contention for 'yielded' case for level 2. Goes crazy with 946/sec. > @lock_contended[yield, 0]: 7579 Lock contention for 'yielded' case for level 0, the root. Is really crazy with 7579/sec lock spin cases. > @lock_contended[after_obtaining_lock, 0]: 7579 IMHO this shows that, due to lock yielding, the scheme of ongoing_flusher for sub-trees only cause issues. IMHO we should go back to only doing ongoing_flusher for root-cgroup. There is really low chance of sub-trees flushes being concurrent enough to benefit from this, and it cause issues and needs (ugly) race handling. Further more sub-tree flushing doesn't take as long time as level 0 flushing, which further lower the chance of concurrent flushes. Let's get some quick data on flush times from production, to support my claim: The bpftrace onliner: sudo bpftrace -e ' tracepoint:cgroup:cgroup_rstat_locked { if (args->cpu == -1) { @start[tid]=nsecs; } @cnt[probe]=count(); if (args->contended) { @lock_contended["after_obtaining_lock", args->level]=count(); }} tracepoint:cgroup:cgroup_rstat_unlock { if (args->cpu == -1) { $now=nsecs; $start=@start[tid]; $diff=$now-$start; @locked_time_level[args->level]=hist($diff); } @cnt[probe]=count()} kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()} interval:s:1 {time("%H:%M:%S "); print(@cnt); print(@lock_contended); print(@locked_time_level); clear (@cnt); clear (@lock_contended); } END {clear(@start)}' Time below is in nanosec. 
@locked_time_level[0]: [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | @locked_time_level[1]: [4K, 8K) 6 |@@@@ | [8K, 16K) 65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16K, 32K) 52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 23 |@@@@@@@@@@@@@@@@@@ | [64K, 128K) 15 |@@@@@@@@@@@@ | [128K, 256K) 10 |@@@@@@@@ | [256K, 512K) 6 |@@@@ | [512K, 1M) 15 |@@@@@@@@@@@@ | [1M, 2M) 2 |@ | [2M, 4M) 14 |@@@@@@@@@@@ | [4M, 8M) 6 |@@@@ | [8M, 16M) 7 |@@@@@ | [16M, 32M) 1 | | @locked_time_level[2]: [2K, 4K) 1 | | [4K, 8K) 160 |@@@@@@@@@ | [8K, 16K) 733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 191 |@@@@@@@@@@@ | [64K, 128K) 115 |@@@@@@ | [128K, 256K) 61 |@@@ | [256K, 512K) 70 |@@@@ | [512K, 1M) 59 |@@@ | [1M, 2M) 27 |@ | [2M, 4M) 9 | | @locked_time_level[3]: [1K, 2K) 3 | | [2K, 4K) 2 | | [4K, 8K) 5 | | [8K, 16K) 147 |@@@@@@ | [16K, 32K) 1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 266 |@@@@@@@@@@@ | [64K, 128K) 199 |@@@@@@@@ | [128K, 256K) 146 |@@@@@@ | [256K, 512K) 124 |@@@@@ | [512K, 1M) 17 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 1 | | @locked_time_level[4]: [4K, 8K) 2 |@@ | [8K, 16K) 17 |@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 4 |@@@@@ | --Jesper >> --- >> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/ >> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/ >> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/ >> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/ >> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/ >> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/ >> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/ >> >> include/linux/cgroup-defs.h | 2 + >> kernel/cgroup/rstat.c | 95 ++++++++++++++++++++++++++++++++++++++----- >> 2 files changed, 85 insertions(+), 12 deletions(-) >> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index b36690ca0d3f..a33b37514c29 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -548,6 +548,8 @@ struct cgroup { >> #ifdef CONFIG_BPF_SYSCALL >> struct bpf_local_storage __rcu *bpf_cgrp_storage; >> #endif >> + /* completion queue for cgrp_rstat_ongoing_flusher */ >> + struct completion flush_done; >> >> /* All ancestors including self */ >> struct cgroup *ancestors[]; >> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c >> index fb8b49437573..fe2a81a310bb 100644 >> --- a/kernel/cgroup/rstat.c >> +++ b/kernel/cgroup/rstat.c >> @@ -2,6 +2,7 @@ >> #include "cgroup-internal.h" >> >> #include <linux/sched/cputime.h> >> +#include <linux/completion.h> >> >> #include <linux/bpf.h> >> #include <linux/btf.h> >> @@ -11,6 +12,7 @@ >> >> static DEFINE_SPINLOCK(cgroup_rstat_lock); >> static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); >> +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL; >> >> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu); >> >> @@ -279,17 +281,32 @@ __bpf_hook_end(); >> * value -1 is used when obtaining the main lock else this is the CPU >> * 
number processed last. >> */ >> -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop) >> +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop) >> +{ >> + bool locked; >> + >> + locked = spin_trylock_irq(&cgroup_rstat_lock); >> + if (!locked) >> + trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true); >> + else >> + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false); >> + >> + return locked; >> +} >> + >> +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop, >> + bool check_contention) >> __acquires(&cgroup_rstat_lock) >> { >> - bool contended; >> + bool locked = false; >> >> - contended = !spin_trylock_irq(&cgroup_rstat_lock); >> - if (contended) { >> - trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended); >> + if (check_contention) >> + locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop); >> + >> + if (!locked) { >> spin_lock_irq(&cgroup_rstat_lock); >> + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention); >> } >> - trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended); >> } > > I will let Tejun and others weigh in about what tracepoints we should > have going forward and how they should be structured (which also > includes patch 2). I see the tremendous value they have to collect > data from prod, but I do not see similar existing tracepoints, and I > am not sure if the branching here could have an effect when the > tracepoints are off. > >> >> static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) >> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) >> spin_unlock_irq(&cgroup_rstat_lock); >> } >> >> +#define MAX_WAIT msecs_to_jiffies(100) >> +/* Trylock helper that also checks for on ongoing flusher */ >> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) >> +{ >> + struct cgroup *ongoing; >> + bool locked; >> + >> + /* Check if ongoing flusher is already taking care of this, if > > nit: I think commonly the comment would start on a new line after /*. > >> + * we are a descendant skip work, but wait for ongoing flusher >> + * to complete work. >> + */ >> +retry: >> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >> + if (ongoing && cgroup_is_descendant(cgrp, ongoing)) { >> + wait_for_completion_interruptible_timeout( >> + &ongoing->flush_done, MAX_WAIT); >> + /* TODO: Add tracepoint here */ >> + return false; >> + } >> + >> + locked = __cgroup_rstat_trylock(cgrp, -1); >> + if (!locked) { >> + /* Contended: Handle loosing race for ongoing flusher */ > > nit: losing > >> + if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher)) >> + goto retry; >> + >> + __cgroup_rstat_lock(cgrp, -1, false); >> + } >> + /* Obtained lock, record this cgrp as the ongoing flusher */ > > Do we want a comment here to explain why there could be an existing > ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's > not super obvious. > >> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >> + if (!ongoing) { >> + reinit_completion(&cgrp->flush_done); >> + WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp); >> + } >> + return true; /* locked */ > > Would it be better to explain the return value of the function in the > comment above it? > >> +} >> + >> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp) >> +{ >> + /* Detect if we are the ongoing flusher */ > > I think this is a bit obvious. 
> >> + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) { >> + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL); >> + complete_all(&cgrp->flush_done); >> + } >> + __cgroup_rstat_unlock(cgrp, -1); >> +} >> + >> /* see cgroup_rstat_flush() */ >> static void cgroup_rstat_flush_locked(struct cgroup *cgrp) >> __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) >> @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) >> __cgroup_rstat_unlock(cgrp, cpu); >> if (!cond_resched()) >> cpu_relax(); >> - __cgroup_rstat_lock(cgrp, cpu); >> + __cgroup_rstat_lock(cgrp, cpu, true); >> } >> } >> } >> @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) >> { >> might_sleep(); >> >> - __cgroup_rstat_lock(cgrp, -1); >> + if (!cgroup_rstat_trylock_flusher(cgrp)) >> + return; >> + >> cgroup_rstat_flush_locked(cgrp); >> - __cgroup_rstat_unlock(cgrp, -1); >> + cgroup_rstat_unlock_flusher(cgrp); >> } >> >> /** >> @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) >> __acquires(&cgroup_rstat_lock) >> { >> might_sleep(); >> - __cgroup_rstat_lock(cgrp, -1); >> - cgroup_rstat_flush_locked(cgrp); >> + >> + if (cgroup_rstat_trylock_flusher(cgrp)) >> + cgroup_rstat_flush_locked(cgrp); >> + else >> + __cgroup_rstat_lock(cgrp, -1, false); >> } >> >> /** >> @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) >> void cgroup_rstat_flush_release(struct cgroup *cgrp) >> __releases(&cgroup_rstat_lock) >> { >> - __cgroup_rstat_unlock(cgrp, -1); >> + cgroup_rstat_unlock_flusher(cgrp); >> } >> >> int cgroup_rstat_init(struct cgroup *cgrp) >> @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp) >> u64_stats_init(&rstatc->bsync); >> } >> >> + init_completion(&cgrp->flush_done); >> + >> return 0; >> } >> >> >>
On Tue, Jul 16, 2024 at 8:00 PM Waiman Long <longman@redhat.com> wrote:
>
> On 7/16/24 20:35, Yosry Ahmed wrote:
> > [..]
> >>
> >> This is a clean (meaning no cadvisor interference) example of kswapd
> >> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> >> hit the race (which is handled in V6 and V7).
> >>
> >> The BPF "cnt" maps are getting cleared every second, so this
> >> approximates per sec numbers. This patch reduce pressure on the lock,
> >> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >> flushes approx 37 per sec (every 27 ms). On the positive side
> >> ongoing_flusher mitigation stopped 98 per sec of these.
> >>
> >> In this clean kswapd case the patch removes the lock contention issue
> >> for kswapd. The lock_contended cases 27 seems to be all related to
> >> handled_race cases 27.
> >>
> >> The remaning high flush rate should also be addressed, and we should
> >> also work on aproaches to limit this like my ealier proposal[1].
> > I honestly don't think a high number of flushes is a problem on its
> > own as long as we are not spending too much time flushing, especially
> > when we have magnitude-based thresholding so we know there is
> > something to flush (although it may not be relevant to what we are
> > doing).
> >
> > If we keep observing a lot of lock contention, one thing that I
> > thought about is to have a variant of spin_lock with a timeout. This
> > limits the flushing latency, instead of limiting the number of flushes
> > (which I believe is the wrong metric to optimize).
>
> Except for semaphore, none of our locking primitives allow for a timeout
> parameter. For sleeping locks, I don't think it is hard to add variants
> with timeout parameter, but not the spinning locks.

Thanks for pointing this out. I am assuming a mutex with a timeout will
also address the priority inversion problem that Shakeel was talking
about AFAICT.
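For illustration, a minimal sketch of the "mutex with a timeout" idea
discussed above. The kernel has no mutex_lock_timeout() today (as Waiman
notes), so this approximates it with a bounded mutex_trylock() loop; the
function name, the 1ms backoff and the timeout budget are assumptions
made up for this sketch, not existing API and not part of the patch:

/* Sketch: bound how long a flusher may wait for a (hypothetical) rstat
 * mutex before giving up, so the caller can skip the flush instead.
 */
static bool cgroup_rstat_lock_timeout(struct mutex *lock, unsigned int timeout_ms)
{
	unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);

	do {
		if (mutex_trylock(lock))
			return true;	/* acquired within the latency budget */
		msleep(1);		/* sleep instead of spinning */
	} while (time_before(jiffies, deadline));

	return false;			/* timed out: caller can skip the flush */
}

This bounds the flushing latency rather than the number of flushes, and
the timeout caps how long any waiter can be held up, which is how Yosry
expects it to relate to the priority-inversion concern raised by Shakeel.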
[..] > > I agree that description should be updated as patch have evolved, and > you suggestions make sense, thanks. > > But I'm not sure we are finished evolving this patch, because the scheme > of allowing all sub-cgroups to become the ongoing flusher (and only > avoiding lock-spinning when cgroup_is_descendant), is causing issues in > production. Sure, I just wanted to point out the the commit log should be updated going forward, not implying that we are done here :) > > We are using cadvisor in production for collecting metrics, which walks > all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat > (doing 3 flushes of same cgroup in short timespan), but individual > cgroups don't overlap much to benefit from ongoing_flusher scheme. > > Below is (1 sec) production data, where cadvisor and kswapd collide: > > > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1 > > @ongoing_flusher_cnt[kswapd5]: 3 > > @ongoing_flusher_cnt[kswapd7]: 4 > > @ongoing_flusher_cnt[kswapd6]: 4 > > @ongoing_flusher_cnt[kswapd8]: 4 > > @ongoing_flusher_cnt[kswapd9]: 5 > > @ongoing_flusher_cnt[kswapd2]: 5 > > @ongoing_flusher_cnt[kswapd10]: 5 > > @ongoing_flusher_cnt[kswapd3]: 5 > > @ongoing_flusher_cnt[kswapd11]: 5 > > @ongoing_flusher_cnt[kswapd0]: 5 > > @ongoing_flusher_cnt[kswapd4]: 6 > > @ongoing_flusher_cnt[kswapd1]: 7 > > @ongoing_flusher_cnt[cadvisor]: 9 > > For cadvisor ongoing_flusher only helps 9 times to avoid flush. > > > @ongoing_flusher_cnt[handled_race]: 18 > > @ongoing_flusher_cnt[all]: 61 > > Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes. > > > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61 > > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462 > > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032 > > This is bad: Lock is contended 9032 time within this 1 sec period. > Below, lock_contended cases is captured in more detail. > > > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435 > > @lock_contended[normal, 4]: 1 > > @lock_contended[normal, 1]: 2 > > @lock_contended[normal, 3]: 6 > > @lock_contended[normal, 2]: 15 > > @lock_contended[yield, 4]: 49 > > @lock_contended[after_obtaining_lock, 4]: 49 > > @lock_contended[normal, 0]: 59 > > The "normal" lock_contended for level 0, > meaning the cgroup root is contended 59 times within this 1 sec period. > > > @lock_contended[after_obtaining_lock, 1]: 117 > > @lock_contended[yield, 1]: 118 > > @lock_contended[yield, 3]: 258 > > @lock_contended[after_obtaining_lock, 3]: 258 > > @lock_contended[after_obtaining_lock, 2]: 946 > > @lock_contended[yield, 2]: 946 > > Lock contention for 'yielded' case for level 2. > Goes crazy with 946/sec. > > > @lock_contended[yield, 0]: 7579 > > Lock contention for 'yielded' case for level 0, the root. > Is really crazy with 7579/sec lock spin cases. > > > @lock_contended[after_obtaining_lock, 0]: 7579 > > > IMHO this shows that, due to lock yielding, the scheme of > ongoing_flusher for sub-trees only cause issues. Just to make sure I understand correctly, allowing non-root to become ongoing_flushers is problematic because we only support a single ongoing_flusher, so if we have a subsequent root flush, it won't be the ongoing_flusher and we won't benefit from skipping other flushes. Correct? I honestly think the problem here is not that we support ongoing_flusher for subtrees, which imo is nice to have at a low additional cost, and does have its benefits (see below). 
I think the problem is that lock yielding means that we can have multiple ongoing flushers, while the current scheme only records one. What I think we should be doing is either supporting multiple ongoing_flushers (by replacing the global variable with a per-cgroup flag, perhaps), or biting the bullet and start using a mutex to drop lock yielding. If there aren't any concerns beyond priority inversion (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I suggested could fix this issue (and also put a bound on the flushing latency in general)? (I suppose the latter is preferrable) > > IMHO we should go back to only doing ongoing_flusher for root-cgroup. > There is really low chance of sub-trees flushes being concurrent enough > to benefit from this, and it cause issues and needs (ugly) race handling. > > Further more sub-tree flushing doesn't take as long time as level 0 > flushing, which further lower the chance of concurrent flushes. I agree that the race handling is not pretty, and we can try to improve the implementation or handle the race in other ways. However, I think that skipping flushes for subtrees is valuable. I am not sure about the cgroup arrangement in your use case, but we do have cgroups with a lot of tasks in them (and/or child cgroups). If there is memory pressure (or hit the cgroup limit), they all may go into direct reclaim concurrently, so skipping flushes could really be beneficial. Of course, if the difference in complexity is not justified, we can go back to only supporting root cgroups for ongoing_flusher for now. But as I mentioned in the v4 email thread, some of the complexity may still remain anyway as we have multiple root cgroups in v1. > > Let's get some quick data on flush times from production, to support my > claim: Thanks for the data. I agree that in general root flushes will be a lot more expensive than subtree flushes, but keep in mind that the data may look really different depends on the cgroup setup. As I mentioned, I think we should still support subtree flushes unless the difference in complexity is not justified. > > The bpftrace onliner: > sudo bpftrace -e ' > tracepoint:cgroup:cgroup_rstat_locked { > if (args->cpu == -1) { @start[tid]=nsecs; } > @cnt[probe]=count(); > if (args->contended) { > @lock_contended["after_obtaining_lock", args->level]=count(); > }} > tracepoint:cgroup:cgroup_rstat_unlock { > if (args->cpu == -1) { > $now=nsecs; $start=@start[tid]; $diff=$now-$start; > @locked_time_level[args->level]=hist($diff); > } > @cnt[probe]=count()} > kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()} > interval:s:1 {time("%H:%M:%S "); > print(@cnt); > print(@lock_contended); > print(@locked_time_level); > clear (@cnt); > clear (@lock_contended); } > END {clear(@start)}' > > Time below is in nanosec. 
> > @locked_time_level[0]: > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ > | > [8M, 16M) 860 > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ > | > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ > | > > > @locked_time_level[1]: > [4K, 8K) 6 |@@@@ > | > [8K, 16K) 65 > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [16K, 32K) 52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ > | > [32K, 64K) 23 |@@@@@@@@@@@@@@@@@@ > | > [64K, 128K) 15 |@@@@@@@@@@@@ > | > [128K, 256K) 10 |@@@@@@@@ > | > [256K, 512K) 6 |@@@@ > | > [512K, 1M) 15 |@@@@@@@@@@@@ > | > [1M, 2M) 2 |@ > | > [2M, 4M) 14 |@@@@@@@@@@@ > | > [4M, 8M) 6 |@@@@ > | > [8M, 16M) 7 |@@@@@ > | > [16M, 32M) 1 | > | > > > @locked_time_level[2]: > [2K, 4K) 1 | > | > [4K, 8K) 160 |@@@@@@@@@ > | > [8K, 16K) 733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ > | > [16K, 32K) 901 > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [32K, 64K) 191 |@@@@@@@@@@@ > | > [64K, 128K) 115 |@@@@@@ > | > [128K, 256K) 61 |@@@ > | > [256K, 512K) 70 |@@@@ > | > [512K, 1M) 59 |@@@ > | > [1M, 2M) 27 |@ > | > [2M, 4M) 9 | > | > > > @locked_time_level[3]: > [1K, 2K) 3 | > | > [2K, 4K) 2 | > | > [4K, 8K) 5 | > | > [8K, 16K) 147 |@@@@@@ > | > [16K, 32K) 1222 > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [32K, 64K) 266 |@@@@@@@@@@@ > | > [64K, 128K) 199 |@@@@@@@@ > | > [128K, 256K) 146 |@@@@@@ > | > [256K, 512K) 124 |@@@@@ > | > [512K, 1M) 17 | > | > [1M, 2M) 0 | > | > [2M, 4M) 0 | > | > [4M, 8M) 1 | > | > > > @locked_time_level[4]: > [4K, 8K) 2 |@@ > | > [8K, 16K) 17 |@@@@@@@@@@@@@@@@@@@@@@ > | > [16K, 32K) 40 > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [32K, 64K) 4 |@@@@@ > | > > --Jesper >
On 17/07/2024 02.35, Yosry Ahmed wrote:
> [..]
>>
>>
>> This is a clean (meaning no cadvisor interference) example of kswapd
>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
>> hit the race (which is handled in V6 and V7).
>>
>> The BPF "cnt" maps are getting cleared every second, so this
>> approximates per sec numbers. This patch reduce pressure on the lock,
>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>> flushes approx 37 per sec (every 27 ms). On the positive side
>> ongoing_flusher mitigation stopped 98 per sec of these.
>>
>> In this clean kswapd case the patch removes the lock contention issue
>> for kswapd. The lock_contended cases 27 seems to be all related to
>> handled_race cases 27.
>>
>> The remaning high flush rate should also be addressed, and we should
>> also work on aproaches to limit this like my ealier proposal[1].
>
> I honestly don't think a high number of flushes is a problem on its
> own as long as we are not spending too much time flushing, especially
> when we have magnitude-based thresholding so we know there is
> something to flush (although it may not be relevant to what we are
> doing).
>

We are "spending too much time flushing", see below.

> If we keep observing a lot of lock contention, one thing that I
> thought about is to have a variant of spin_lock with a timeout. This
> limits the flushing latency, instead of limiting the number of flushes
> (which I believe is the wrong metric to optimize).
>
> It also seems to me that we are doing a flush each 27ms, and your
> proposed threshold was once per 50ms. It doesn't seem like a
> fundamental difference.
>

Looking at the production numbers for the time the lock is held for
level 0:

@locked_time_level[0]:
[4M, 8M)    623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8M, 16M)   860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)  295 |@@@@@@@@@@@@@@@@@                                    |
[32M, 64M)  275 |@@@@@@@@@@@@@@@@                                     |

The time is in nanosec, so M corresponds to ms (milliseconds).

With 36 flushes per second (as shown earlier) this is a flush every
27.7ms. It is not unreasonable (from the above data) that a flush also
spends 27ms, which means that we spend a full CPU second flushing.
That is spending too much time flushing.

This roughly 1 sec of CPU usage for kswapd is also quite clear in the
attached grafana graph from when the server was rebooted into this V7
kernel.

I chose 50ms because at the time I saw flushes taking around 30ms, and I
view the flush time as queue service-time. When the arrival rate is
faster than the service time, a queue will form. So choosing 50ms as the
arrival rate gave me some headroom. As I mentioned earlier, optimally
this threshold should be dynamically measured.

--Jesper
On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: > > > > > On 17/07/2024 02.35, Yosry Ahmed wrote: > > [..] > >> > >> > >> This is a clean (meaning no cadvisor interference) example of kswapd > >> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases > >> hit the race (which is handled in V6 and V7). > >> > >> The BPF "cnt" maps are getting cleared every second, so this > >> approximates per sec numbers. This patch reduce pressure on the lock, > >> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full > >> flushes approx 37 per sec (every 27 ms). On the positive side > >> ongoing_flusher mitigation stopped 98 per sec of these. > >> > >> In this clean kswapd case the patch removes the lock contention issue > >> for kswapd. The lock_contended cases 27 seems to be all related to > >> handled_race cases 27. > >> > >> The remaning high flush rate should also be addressed, and we should > >> also work on aproaches to limit this like my ealier proposal[1]. > > > > I honestly don't think a high number of flushes is a problem on its > > own as long as we are not spending too much time flushing, especially > > when we have magnitude-based thresholding so we know there is > > something to flush (although it may not be relevant to what we are > > doing). > > > > We are "spending too much time flushing" see below. > > > If we keep observing a lot of lock contention, one thing that I > > thought about is to have a variant of spin_lock with a timeout. This > > limits the flushing latency, instead of limiting the number of flushes > > (which I believe is the wrong metric to optimize). > > > > It also seems to me that we are doing a flush each 27ms, and your > > proposed threshold was once per 50ms. It doesn't seem like a > > fundamental difference. > > > > > Looking at the production numbers for the time the lock is held for level 0: > > @locked_time_level[0]: > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > The time is in nanosec, so M corresponds to ms (milliseconds). > > With 36 flushes per second (as shown earlier) this is a flush every > 27.7ms. It is not unreasonable (from above data) that the flush time > also spend 27ms, which means that we spend a full CPU second flushing. > That is spending too much time flushing. > > This around 1 sec CPU usage for kswapd is also quite clear in the > attached grafana graph for when server was rebooted into this V7 kernel. > > I choose 50ms because at the time I saw flush taking around 30ms, and I > view the flush time as queue service-time. When arrival-rate is faster > than service-time, then a queue will form. So, choosing 50ms as > arrival-rate gave me some headroom. As I mentioned earlier, optimally > this threshold should be dynamically measured. Thanks for the data. Yeah this doesn't look good. Does it make sense to just throttle flushers at some point to increase the chances of coalescing multiple flushers? Otherwise I think it makes sense in this case to ratelimit flushing in general. Although instead of just checking how much time elapsed since the last flush, can we use something like __ratelimit()? This will make sure that we skip flushes when we actually have a high rate of flushing over a period of time, not because two flushes happened to be requested in close succession and the flushing rate is generally low. > > --Jesper
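For reference, a minimal sketch of the __ratelimit() approach suggested
above; the state name, interval and burst values are invented for
illustration and are not part of the patch:

#include <linux/ratelimit.h>

/* Allow bursts of up to 20 flushes per second (illustrative numbers);
 * past that, __ratelimit() returns 0 and the caller could skip or defer
 * the flush until the interval expires.
 */
static DEFINE_RATELIMIT_STATE(cgroup_flush_rs, HZ, 20);

static bool cgroup_rstat_flush_allowed(void)
{
	return __ratelimit(&cgroup_flush_rs);
}

Unlike a fixed "at most one flush per 50ms" check, this only starts
suppressing once the sustained rate over the interval exceeds the burst,
so two flushes that merely happen to land close together are not
penalized.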
On 17/07/2024 18.31, Yosry Ahmed wrote: > [..] >> >> I agree that description should be updated as patch have evolved, and >> you suggestions make sense, thanks. >> >> But I'm not sure we are finished evolving this patch, because the scheme >> of allowing all sub-cgroups to become the ongoing flusher (and only >> avoiding lock-spinning when cgroup_is_descendant), is causing issues in >> production. > > Sure, I just wanted to point out the the commit log should be updated > going forward, not implying that we are done here :) > >> >> We are using cadvisor in production for collecting metrics, which walks >> all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat >> (doing 3 flushes of same cgroup in short timespan), but individual >> cgroups don't overlap much to benefit from ongoing_flusher scheme. >> >> Below is (1 sec) production data, where cadvisor and kswapd collide: >> >> > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1 >> > @ongoing_flusher_cnt[kswapd5]: 3 >> > @ongoing_flusher_cnt[kswapd7]: 4 >> > @ongoing_flusher_cnt[kswapd6]: 4 >> > @ongoing_flusher_cnt[kswapd8]: 4 >> > @ongoing_flusher_cnt[kswapd9]: 5 >> > @ongoing_flusher_cnt[kswapd2]: 5 >> > @ongoing_flusher_cnt[kswapd10]: 5 >> > @ongoing_flusher_cnt[kswapd3]: 5 >> > @ongoing_flusher_cnt[kswapd11]: 5 >> > @ongoing_flusher_cnt[kswapd0]: 5 >> > @ongoing_flusher_cnt[kswapd4]: 6 >> > @ongoing_flusher_cnt[kswapd1]: 7 >> > @ongoing_flusher_cnt[cadvisor]: 9 >> >> For cadvisor ongoing_flusher only helps 9 times to avoid flush. >> >> > @ongoing_flusher_cnt[handled_race]: 18 >> > @ongoing_flusher_cnt[all]: 61 >> >> Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes. >> >> > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61 >> > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462 >> > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032 >> >> This is bad: Lock is contended 9032 time within this 1 sec period. >> Below, lock_contended cases is captured in more detail. >> >> > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435 >> > @lock_contended[normal, 4]: 1 >> > @lock_contended[normal, 1]: 2 >> > @lock_contended[normal, 3]: 6 >> > @lock_contended[normal, 2]: 15 >> > @lock_contended[yield, 4]: 49 >> > @lock_contended[after_obtaining_lock, 4]: 49 >> > @lock_contended[normal, 0]: 59 >> >> The "normal" lock_contended for level 0, >> meaning the cgroup root is contended 59 times within this 1 sec period. >> >> > @lock_contended[after_obtaining_lock, 1]: 117 >> > @lock_contended[yield, 1]: 118 >> > @lock_contended[yield, 3]: 258 >> > @lock_contended[after_obtaining_lock, 3]: 258 >> > @lock_contended[after_obtaining_lock, 2]: 946 >> > @lock_contended[yield, 2]: 946 >> >> Lock contention for 'yielded' case for level 2. >> Goes crazy with 946/sec. >> >> > @lock_contended[yield, 0]: 7579 >> >> Lock contention for 'yielded' case for level 0, the root. >> Is really crazy with 7579/sec lock spin cases. >> >> > @lock_contended[after_obtaining_lock, 0]: 7579 >> >> >> IMHO this shows that, due to lock yielding, the scheme of >> ongoing_flusher for sub-trees only cause issues. > > Just to make sure I understand correctly, allowing non-root to become > ongoing_flushers is problematic because we only support a single > ongoing_flusher, so if we have a subsequent root flush, it won't be > the ongoing_flusher and we won't benefit from skipping other flushes. > Correct? > Yes, basically... 
keep in mind, what will happen is that a subsequent root flush will busy-wait/spin on getting the lock, which, in case of a yield from the ongoing flusher, will result in multiple flushers being active despite our efforts to have a single ongoing flusher. > I honestly think the problem here is not that we support > ongoing_flusher for subtrees, which imo is nice to have at a low > additional cost, and does have its benefits (see below). I think the > problem is that lock yielding means that we can have multiple ongoing > flushers, while the current scheme only records one. > Yes, the yielding is causing multiple flushers to be active at the same time. > What I think we should be doing is either supporting multiple > ongoing_flushers (by replacing the global variable with a per-cgroup > flag, perhaps), or biting the bullet and start using a mutex to drop > lock yielding. If there aren't any concerns beyond priority inversion > (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I > suggested could fix this issue (and also put a bound on the flushing > latency in general)? > The mutex_lock_timeout() is an interesting idea, but not a primitive we have available today, right? p.s. I have 5 prod machines running with the mutex change, and I've not gotten any SRE complaints. > > (I suppose the latter is preferrable) > >> >> IMHO we should go back to only doing ongoing_flusher for root-cgroup. >> There is really low chance of sub-trees flushes being concurrent enough >> to benefit from this, and it cause issues and needs (ugly) race handling. >> >> Further more sub-tree flushing doesn't take as long time as level 0 >> flushing, which further lower the chance of concurrent flushes. > > I agree that the race handling is not pretty, and we can try to > improve the implementation or handle the race in other ways. However, > I think that skipping flushes for subtrees is valuable. I am not sure > about the cgroup arrangement in your use case, but we do have cgroups > with a lot of tasks in them (and/or child cgroups). If there is memory > pressure (or hit the cgroup limit), they all may go into direct > reclaim concurrently, so skipping flushes could really be beneficial. > > Of course, if the difference in complexity is not justified, we can go > back to only supporting root cgroups for ongoing_flusher for now. But > as I mentioned in the v4 email thread, some of the complexity may > still remain anyway as we have multiple root cgroups in v1. > Having an incremental step with "only supporting root cgroups for ongoing_flusher for now" is a good step forward IMHO. As you can see in the grafana plot, this would be a significant production improvement on its own, as it avoids wasting CPU resources spinning on the lock. Being able to have multiple root cgroups, due to v1, does pose an implementation problem. Only having a single root would allow a race-free cmpxchg scheme (see the sketch after this message). Would it be reasonable to only support v2? If so, how can I match on this? >> >> Let's get some quick data on flush times from production, to support my >> claim: > > Thanks for the data. I agree that in general root flushes will be a > lot more expensive than subtree flushes, but keep in mind that the > data may look really different depends on the cgroup setup. As I > mentioned, I think we should still support subtree flushes unless the > difference in complexity is not justified. > It would be really valuable if you could provide production data on the lock-hold times, just like I did with the bpftrace script below...
Is that possible, please? >> >> The bpftrace onliner: >> sudo bpftrace -e ' >> tracepoint:cgroup:cgroup_rstat_locked { >> if (args->cpu == -1) { @start[tid]=nsecs; } >> @cnt[probe]=count(); >> if (args->contended) { >> @lock_contended["after_obtaining_lock", args->level]=count(); >> }} >> tracepoint:cgroup:cgroup_rstat_unlock { >> if (args->cpu == -1) { >> $now=nsecs; $start=@start[tid]; $diff=$now-$start; >> @locked_time_level[args->level]=hist($diff); >> } >> @cnt[probe]=count()} >> kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()} >> interval:s:1 {time("%H:%M:%S "); >> print(@cnt); >> print(@lock_contended); >> print(@locked_time_level); >> clear (@cnt); >> clear (@lock_contended); } >> END {clear(@start)}' >> >> Time below is in nanosec. >> >> @locked_time_level[0]: >> [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >> | >> [8M, 16M) 860 >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ >> | >> [32M, 64M) 275 |@@@@@@@@@@@@@@@@ >> | >> >> >> @locked_time_level[1]: >> [4K, 8K) 6 |@@@@ >> | >> [8K, 16K) 65 >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [16K, 32K) 52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >> | >> [32K, 64K) 23 |@@@@@@@@@@@@@@@@@@ >> | >> [64K, 128K) 15 |@@@@@@@@@@@@ >> | >> [128K, 256K) 10 |@@@@@@@@ >> | >> [256K, 512K) 6 |@@@@ >> | >> [512K, 1M) 15 |@@@@@@@@@@@@ >> | >> [1M, 2M) 2 |@ >> | >> [2M, 4M) 14 |@@@@@@@@@@@ >> | >> [4M, 8M) 6 |@@@@ >> | >> [8M, 16M) 7 |@@@@@ >> | >> [16M, 32M) 1 | >> | >> >> >> @locked_time_level[2]: >> [2K, 4K) 1 | >> | >> [4K, 8K) 160 |@@@@@@@@@ >> | >> [8K, 16K) 733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ >> | >> [16K, 32K) 901 >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [32K, 64K) 191 |@@@@@@@@@@@ >> | >> [64K, 128K) 115 |@@@@@@ >> | >> [128K, 256K) 61 |@@@ >> | >> [256K, 512K) 70 |@@@@ >> | >> [512K, 1M) 59 |@@@ >> | >> [1M, 2M) 27 |@ >> | >> [2M, 4M) 9 | >> | >> >> >> @locked_time_level[3]: >> [1K, 2K) 3 | >> | >> [2K, 4K) 2 | >> | >> [4K, 8K) 5 | >> | >> [8K, 16K) 147 |@@@@@@ >> | >> [16K, 32K) 1222 >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [32K, 64K) 266 |@@@@@@@@@@@ >> | >> [64K, 128K) 199 |@@@@@@@@ >> | >> [128K, 256K) 146 |@@@@@@ >> | >> [256K, 512K) 124 |@@@@@ >> | >> [512K, 1M) 17 | >> | >> [1M, 2M) 0 | >> | >> [2M, 4M) 0 | >> | >> [4M, 8M) 1 | >> | >> >> >> @locked_time_level[4]: >> [4K, 8K) 2 |@@ >> | >> [8K, 16K) 17 |@@@@@@@@@@@@@@@@@@@@@@ >> | >> [16K, 32K) 40 >> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [32K, 64K) 4 |@@@@@ >> | >> >> --Jesper >>
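For illustration, not part of the thread or the patch: a minimal sketch of the race-free cmpxchg scheme for a single (cgroup v2) root that Jesper mentions above. It reuses cgrp_rstat_ongoing_flusher, the flush_done completion and MAX_WAIT from the patch; the helper name is hypothetical, and the descendant check plus completion re-init from the patch are left out for brevity.

/* Hypothetical, simplified: atomically claim the ongoing-flusher slot. */
static bool cgroup_rstat_claim_root_flusher(struct cgroup *root)
{
	struct cgroup *ongoing;

	/* Succeeds only when no flusher is currently recorded */
	ongoing = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, root);
	if (!ongoing)
		return true;	/* caller now owns the flush */

	/* Another root flush is in flight: wait (bounded) and skip ours */
	wait_for_completion_interruptible_timeout(&ongoing->flush_done,
						  MAX_WAIT);
	return false;
}

On the success path the flusher would clear the pointer and complete_all(&root->flush_done) when done, as the patch already does.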
[..] > > What I think we should be doing is either supporting multiple > > ongoing_flushers (by replacing the global variable with a per-cgroup > > flag, perhaps), or biting the bullet and start using a mutex to drop > > lock yielding. If there aren't any concerns beyond priority inversion > > (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I > > suggested could fix this issue (and also put a bound on the flushing > > latency in general)? > > > > The mutex_lock_timeout is an interesting idea, but not a primitive we > have available today, right? We don't, but Waiman said it shouldn't be difficult to add: https://lore.kernel.org/lkml/623e62c5-3045-4dca-9f2c-ed15b8d3bad8@redhat.com/ > > p.s. I have 5 prod machines running with mutex change, and I've not > gotten any SRE complaints. That's nice! > > > > (I suppose the latter is preferrable) > > > >> > >> IMHO we should go back to only doing ongoing_flusher for root-cgroup. > >> There is really low chance of sub-trees flushes being concurrent enough > >> to benefit from this, and it cause issues and needs (ugly) race handling. > >> > >> Further more sub-tree flushing doesn't take as long time as level 0 > >> flushing, which further lower the chance of concurrent flushes. > > > > I agree that the race handling is not pretty, and we can try to > > improve the implementation or handle the race in other ways. However, > > I think that skipping flushes for subtrees is valuable. I am not sure > > about the cgroup arrangement in your use case, but we do have cgroups > > with a lot of tasks in them (and/or child cgroups). If there is memory > > pressure (or hit the cgroup limit), they all may go into direct > > reclaim concurrently, so skipping flushes could really be beneficial. > > > > Of course, if the difference in complexity is not justified, we can go > > back to only supporting root cgroups for ongoing_flusher for now. But > > as I mentioned in the v4 email thread, some of the complexity may > > still remain anyway as we have multiple root cgroups in v1. > > > > Having an incremental step with "only supporting root cgroups for > ongoing_flusher for now" is a good step forward IMHO. > As you could see in grafana plot, this would be a significant production > improvement on its own, as it avoids wasting CPU resources spinning on > the lock. I am not opposed to this at all. All I am saying is, if we need to handle most complexity anyway due to multiple root cgroups in v1, then might as well support subtrees too. > > Being able to have multiple root cgroups, due to in v1, does pose an > implementation problem. Only having a single root, would allow to have > a race-free cmpxchg scheme. > Would it be reasonable to only support v2? The rstat code has so far been v1/v2 agnostic AFAICT, and Google is still using cgroup v1, so I naturally prefer we keep supporting both v1 and v2 going forward. > If so, how can I match on this? cgroup_on_dfl() is basically testing if a cgroup is on v2, but I really want to keep v1 included if possible :/ > > >> > >> Let's get some quick data on flush times from production, to support my > >> claim: > > > > Thanks for the data. I agree that in general root flushes will be a > > lot more expensive than subtree flushes, but keep in mind that the > > data may look really different depends on the cgroup setup. As I > > mentioned, I think we should still support subtree flushes unless the > > difference in complexity is not justified. 
> > > > It would be really valuable if you could provide production data on the > lock-hold times, just like I did with below bpftrace script... > Is that possible, please? Unfortunately, we don't have the infrastructure to do this :/ But again, I am not objecting to only supporting root cgroups as ongoing flushers for now if there is a justifiable complexity difference. So this shouldn't be a blocker.
On 17/07/2024 18.49, Yosry Ahmed wrote: > On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: >> >> >> On 17/07/2024 02.35, Yosry Ahmed wrote: >>> [..] >>>> >>>> >>>> This is a clean (meaning no cadvisor interference) example of kswapd >>>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases >>>> hit the race (which is handled in V6 and V7). >>>> >>>> The BPF "cnt" maps are getting cleared every second, so this >>>> approximates per sec numbers. This patch reduce pressure on the lock, >>>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full >>>> flushes approx 37 per sec (every 27 ms). On the positive side >>>> ongoing_flusher mitigation stopped 98 per sec of these. >>>> >>>> In this clean kswapd case the patch removes the lock contention issue >>>> for kswapd. The lock_contended cases 27 seems to be all related to >>>> handled_race cases 27. >>>> >>>> The remaning high flush rate should also be addressed, and we should >>>> also work on aproaches to limit this like my ealier proposal[1]. >>> >>> I honestly don't think a high number of flushes is a problem on its >>> own as long as we are not spending too much time flushing, especially >>> when we have magnitude-based thresholding so we know there is >>> something to flush (although it may not be relevant to what we are >>> doing). >>> >> >> We are "spending too much time flushing" see below. >> >>> If we keep observing a lot of lock contention, one thing that I >>> thought about is to have a variant of spin_lock with a timeout. This >>> limits the flushing latency, instead of limiting the number of flushes >>> (which I believe is the wrong metric to optimize). >>> >>> It also seems to me that we are doing a flush each 27ms, and your >>> proposed threshold was once per 50ms. It doesn't seem like a >>> fundamental difference. >>> >> >> >> Looking at the production numbers for the time the lock is held for level 0: >> >> @locked_time_level[0]: >> [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | >> [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | >> [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | >> >> The time is in nanosec, so M corresponds to ms (milliseconds). >> >> With 36 flushes per second (as shown earlier) this is a flush every >> 27.7ms. It is not unreasonable (from above data) that the flush time >> also spend 27ms, which means that we spend a full CPU second flushing. >> That is spending too much time flushing. >> >> This around 1 sec CPU usage for kswapd is also quite clear in the >> attached grafana graph for when server was rebooted into this V7 kernel. >> >> I choose 50ms because at the time I saw flush taking around 30ms, and I >> view the flush time as queue service-time. When arrival-rate is faster >> than service-time, then a queue will form. So, choosing 50ms as >> arrival-rate gave me some headroom. As I mentioned earlier, optimally >> this threshold should be dynamically measured. > > Thanks for the data. Yeah this doesn't look good. > > Does it make sense to just throttle flushers at some point to increase > the chances of coalescing multiple flushers? > > Otherwise I think it makes sense in this case to ratelimit flushing in > general. Although instead of just checking how much time elapsed since > the last flush, can we use something like __ratelimit()? 
> > This will make sure that we skip flushes when we actually have a high > rate of flushing over a period of time, not because two flushes > happened to be requested in close succession and the flushing rate is > generally low. > I really think "time elapsed since the last flush" is the right solution here. We *do* want to catch the case you describe: "two flushes happened to be requested in close succession and the flushing rate is generally low." (That is, after this patch fixes the lock contention triggered by kswapd.) The remaining problem with kswapd is that those flushes that don't "collide" on the lock will be flushed in close succession. And we likely have a generally low flushing rate until kswapd starts up. Some production data from a "slow" period where only kswapd is active: 05:59:32 @ongoing_flusher_cnt[kswapd11]: 1 @ongoing_flusher_cnt[kswapd7]: 1 @ongoing_flusher_cnt[kswapd3]: 1 @ongoing_flusher_cnt[kswapd5]: 1 @ongoing_flusher_cnt[kswapd10]: 1 @ongoing_flusher_cnt[kswapd6]: 2 @ongoing_flusher_cnt[kswapd8]: 2 @ongoing_flusher_cnt[kswapd1]: 2 @ongoing_flusher_cnt[kswapd9]: 2 @ongoing_flusher_cnt[kswapd0]: 2 @ongoing_flusher_cnt[kswapd2]: 2 @ongoing_flusher_cnt[kswapd4]: 2 @ongoing_flusher_cnt[handled_race]: 2 @ongoing_flusher_cnt[all]: 14 @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2 @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10 @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43 @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51 We observe that the ongoing_flusher scheme saves/avoids 14 of the flushes, which is great, but we still have 43 flushes in this period. I think only kswapd is doing the flushing here, so I claim 41 flushes are likely unnecessary (as there are indications of 2 kswapd startups in the period). Also observe that some of the kswapdNN processes only have ongoing_flusher_cnt 1, which indicates they didn't fully "collide" with an ongoing flusher, while others have 2. --Jesper
On Thu, Jul 18, 2024 at 1:12 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: > > > > On 17/07/2024 18.49, Yosry Ahmed wrote: > > On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: > >> > >> > >> On 17/07/2024 02.35, Yosry Ahmed wrote: > >>> [..] > >>>> > >>>> > >>>> This is a clean (meaning no cadvisor interference) example of kswapd > >>>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases > >>>> hit the race (which is handled in V6 and V7). > >>>> > >>>> The BPF "cnt" maps are getting cleared every second, so this > >>>> approximates per sec numbers. This patch reduce pressure on the lock, > >>>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full > >>>> flushes approx 37 per sec (every 27 ms). On the positive side > >>>> ongoing_flusher mitigation stopped 98 per sec of these. > >>>> > >>>> In this clean kswapd case the patch removes the lock contention issue > >>>> for kswapd. The lock_contended cases 27 seems to be all related to > >>>> handled_race cases 27. > >>>> > >>>> The remaning high flush rate should also be addressed, and we should > >>>> also work on aproaches to limit this like my ealier proposal[1]. > >>> > >>> I honestly don't think a high number of flushes is a problem on its > >>> own as long as we are not spending too much time flushing, especially > >>> when we have magnitude-based thresholding so we know there is > >>> something to flush (although it may not be relevant to what we are > >>> doing). > >>> > >> > >> We are "spending too much time flushing" see below. > >> > >>> If we keep observing a lot of lock contention, one thing that I > >>> thought about is to have a variant of spin_lock with a timeout. This > >>> limits the flushing latency, instead of limiting the number of flushes > >>> (which I believe is the wrong metric to optimize). > >>> > >>> It also seems to me that we are doing a flush each 27ms, and your > >>> proposed threshold was once per 50ms. It doesn't seem like a > >>> fundamental difference. > >>> > >> > >> > >> Looking at the production numbers for the time the lock is held for level 0: > >> > >> @locked_time_level[0]: > >> [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > >> [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > >> [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > >> [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > >> > >> The time is in nanosec, so M corresponds to ms (milliseconds). > >> > >> With 36 flushes per second (as shown earlier) this is a flush every > >> 27.7ms. It is not unreasonable (from above data) that the flush time > >> also spend 27ms, which means that we spend a full CPU second flushing. > >> That is spending too much time flushing. > >> > >> This around 1 sec CPU usage for kswapd is also quite clear in the > >> attached grafana graph for when server was rebooted into this V7 kernel. > >> > >> I choose 50ms because at the time I saw flush taking around 30ms, and I > >> view the flush time as queue service-time. When arrival-rate is faster > >> than service-time, then a queue will form. So, choosing 50ms as > >> arrival-rate gave me some headroom. As I mentioned earlier, optimally > >> this threshold should be dynamically measured. > > > > Thanks for the data. Yeah this doesn't look good. > > > > Does it make sense to just throttle flushers at some point to increase > > the chances of coalescing multiple flushers? > > > > Otherwise I think it makes sense in this case to ratelimit flushing in > > general. 
Although instead of just checking how much time elapsed since > > the last flush, can we use something like __ratelimit()? > > > > This will make sure that we skip flushes when we actually have a high > > rate of flushing over a period of time, not because two flushes > > happened to be requested in close succession and the flushing rate is > > generally low. > > > > I really think "time elapsed since the last flush" is the right solution > here. As, we *do* want to catch the case you describe "two flushes > happened to be requested in close succession and the flushing rate is > generally low." > > (After this patch fixing the lock contention triggered by kswapd). > The remaining problem with kswapd is that those flushes that doesn't > "collide" on the lock, will be flushing in close succession. And we > likely have a generally low flushing rate, until kswapd starts up. I do not disagree, I think I may have misrepresented my intention. Time since last flush is essentially ratelimiting to 2 flushes per (let's say) 50 ms. So if we only have 2 flushes in an entire second, but they happen to occur <50ms apart, the latter will be skipped (arguably unnecessarily). What I am trying to say is to make it more generalized/less-granular, and apply ratelimiting when we are actually spending a large portion of our time flushing. For example, cap it at 20 flushes per 1s. I think __ratelimit() does exactly that, although it's currently mostly used to ratelimit warnings, but I think we can reuse it here. Ideally, of course, the number would be dynamic as you described before, but we can start with a reasonable static value based on data. What we do when the limit is hit could be changed later as well. We can just skip flushes initially, and later we can be smarter about it and throttle them in a way that makes them collide more effectively. > > Some production data from a "slow" period where only kswapd is active: > > 05:59:32 @ongoing_flusher_cnt[kswapd11]: 1 > @ongoing_flusher_cnt[kswapd7]: 1 > @ongoing_flusher_cnt[kswapd3]: 1 > @ongoing_flusher_cnt[kswapd5]: 1 > @ongoing_flusher_cnt[kswapd10]: 1 > @ongoing_flusher_cnt[kswapd6]: 2 > @ongoing_flusher_cnt[kswapd8]: 2 > @ongoing_flusher_cnt[kswapd1]: 2 > @ongoing_flusher_cnt[kswapd9]: 2 > @ongoing_flusher_cnt[kswapd0]: 2 > @ongoing_flusher_cnt[kswapd2]: 2 > @ongoing_flusher_cnt[kswapd4]: 2 > @ongoing_flusher_cnt[handled_race]: 2 > @ongoing_flusher_cnt[all]: 14 > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2 > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10 > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43 > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51 > > We observe that ongoing_flusher scheme saves/avoids 14 of the flushes > great, but we still have 43 flushes in this period. I think only kswapd > is doing the flushing here, so I claim 41 flushes are likely unnecessary > (as there are indication of 2 kswapd startups in the period). Also > observe that some of the kswapdNN processes only have > ongoing_flusher_cnt 1, which indicate they didn't fully "collide" with > an ongoing flusher, while others have 2. > > --Jesper
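For illustration, not part of the thread: a minimal sketch of the __ratelimit()-based cap Yosry describes (at most 20 flushes per second), using the existing <linux/ratelimit.h> helpers. The names are hypothetical, and where such a check would sit, plus how force-flush cases would bypass it, is left open in the discussion.

#include <linux/ratelimit.h>

/* Hypothetical: allow at most 20 flushes per second overall */
static DEFINE_RATELIMIT_STATE(cgroup_rstat_flush_rs, 1 * HZ, 20);

static bool cgroup_rstat_flush_allowed(void)
{
	/* __ratelimit() returns true while we are under the burst budget
	 * for the current interval; callers that can tolerate slightly
	 * stale stats would skip the flush when it returns false.
	 */
	return __ratelimit(&cgroup_rstat_flush_rs);
}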
Hi Jesper, On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > [...] > > > Looking at the production numbers for the time the lock is held for level 0: > > @locked_time_level[0]: > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > Is it possible to get the above histogram for other levels as well? I know this is 12 numa node machine, how many total CPUs are there? > The time is in nanosec, so M corresponds to ms (milliseconds). > > With 36 flushes per second (as shown earlier) this is a flush every > 27.7ms. It is not unreasonable (from above data) that the flush time > also spend 27ms, which means that we spend a full CPU second flushing. > That is spending too much time flushing. One idea to further reduce this time is more fine grained flush skipping. At the moment we either skip the whole flush or not. How about we make this decision per-cpu? We already have per-cpu updates data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu.
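For illustration, not part of the thread: a rough sketch of the per-cpu skip Shakeel suggests, shown on the memcg side where per-cpu update counts already exist. It assumes the existing memcg_vmstats_percpu->stats_updates counter and MEMCG_CHARGE_BATCH from the memcg code; the helper is hypothetical, and (as discussed in the reply below) the open question is where such a check could live so the per-cpu updated tree stays consistent.

/* Hypothetical: only treat this CPU as worth flushing if its pending
 * update magnitude crossed the charge-batch threshold.
 */
static bool memcg_cpu_flush_worthwhile(struct mem_cgroup *memcg, int cpu)
{
	struct memcg_vmstats_percpu *statc;

	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
	return READ_ONCE(statc->stats_updates) >= MEMCG_CHARGE_BATCH;
}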
On Thu, Jul 18, 2024 at 5:41 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > Hi Jesper, > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > [...] > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > @locked_time_level[0]: > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > Is it possible to get the above histogram for other levels as well? I > know this is 12 numa node machine, how many total CPUs are there? > > > The time is in nanosec, so M corresponds to ms (milliseconds). > > > > With 36 flushes per second (as shown earlier) this is a flush every > > 27.7ms. It is not unreasonable (from above data) that the flush time > > also spend 27ms, which means that we spend a full CPU second flushing. > > That is spending too much time flushing. > > One idea to further reduce this time is more fine grained flush > skipping. At the moment we either skip the whole flush or not. How > about we make this decision per-cpu? We already have per-cpu updates > data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu. Good idea. I think we would need a per-subsystem callback to decide whether we want to flush the cgroup or not. This needs to happen in the core rstat flushing code (not the memcg flushing code), as we need to make sure we do not remove the cgroup from the per-cpu updated tree if we don't flush it. More generally, I think we should be able to have a "force" flush API that skips all optimizations and ensures that a flush occurs. I think this will be needed in the cgroup_rstat_exit() path, where stats of a cgroup being freed must be propagated to its parent, no matter how insignificant they may be, to avoid inconsistencies.
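For illustration, not part of the thread: the rough shape of the two ideas above, both purely hypothetical additions that do not exist in the kernel today. One is a per-subsystem hook the rstat core could use to ask whether a css has anything worth flushing on a given CPU; the other is a force-flush entry point that bypasses every skipping heuristic, e.g. for cgroup_rstat_exit().

/* Hypothetical hook, imagined as a new member of struct cgroup_subsys:
 * return true if @css has updates on @cpu significant enough to flush.
 */
bool (*css_rstat_cpu_needs_flush)(struct cgroup_subsys_state *css, int cpu);

/* Hypothetical variant of cgroup_rstat_flush() that never skips, for
 * callers where dropping updates would be an inconsistency (a dying
 * cgroup's stats must reach its parent) rather than mere staleness.
 */
void cgroup_rstat_flush_force(struct cgroup *cgrp);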
On 19/07/2024 02.40, Shakeel Butt wrote: > Hi Jesper, > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: >> > [...] >> >> >> Looking at the production numbers for the time the lock is held for level 0: >> >> @locked_time_level[0]: >> [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | >> [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >> [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | >> [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | >> > > Is it possible to get the above histogram for other levels as well? Data from other levels available in [1]: [1] https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ IMHO the data shows we will get most out of skipping level-0 root-cgroup flushes. Copy-pasted here (adj whitespaces hoping it better fit email): Time below is in nanosec: - M -> ms - K -> usec @locked_time_level[0]: [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | @locked_time_level[1]: [4K, 8K) 6 |@@@@ | [8K, 16K) 65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16K, 32K) 52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 23 |@@@@@@@@@@@@@@@@@@ | [64K, 128K) 15 |@@@@@@@@@@@@ | [128K, 256K) 10 |@@@@@@@@ | [256K, 512K) 6 |@@@@ | [512K, 1M) 15 |@@@@@@@@@@@@ | [1M, 2M) 2 |@ | [2M, 4M) 14 |@@@@@@@@@@@ | [4M, 8M) 6 |@@@@ | [8M, 16M) 7 |@@@@@ | [16M, 32M) 1 | | @locked_time_level[2]: [2K, 4K) 1 | | [4K, 8K) 160 |@@@@@@@@@ | [8K, 16K) 733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 191 |@@@@@@@@@@@ | [64K, 128K) 115 |@@@@@@ | [128K, 256K) 61 |@@@ | [256K, 512K) 70 |@@@@ | [512K, 1M) 59 |@@@ | [1M, 2M) 27 |@ | [2M, 4M) 9 | | @locked_time_level[3]: [1K, 2K) 3 | | [2K, 4K) 2 | | [4K, 8K) 5 | | [8K, 16K) 147 |@@@@@@ | [16K, 32K) 1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 266 |@@@@@@@@@@@ | [64K, 128K) 199 |@@@@@@@@ | [128K, 256K) 146 |@@@@@@ | [256K, 512K) 124 |@@@@@ | [512K, 1M) 17 | | [1M, 2M) 0 | | [2M, 4M) 0 | | [4M, 8M) 1 | | @locked_time_level[4]: [4K, 8K) 2 |@@ | [8K, 16K) 17 |@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 4 |@@@@@ | > I know this is 12 numa node machine, how many total CPUs are there? 192 CPU cores (incl HyperThreaded). --Jesper
On 17/07/2024 02.30, Yosry Ahmed wrote: > On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote: >> >> Avoid lock contention on the global cgroup rstat lock caused by kswapd >> starting on all NUMA nodes simultaneously. At Cloudflare, we observed >> massive issues due to kswapd and the specific mem_cgroup_flush_stats() >> call inlined in shrink_node, which takes the rstat lock. >> >> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, >> we noted severe lock contention on the rstat lock. This contention >> causes 12 CPUs to waste cycles spinning every time kswapd runs. >> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are >> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily >> due to spinning on the rstat lock. >> >> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds >> causing all kswapdN threads to wake up simultaneously. The kswapd thread >> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush >> operation as part of its work. This results in kernel self-induced rstat >> lock contention by waking up all kswapd threads simultaneously. Leveraging >> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this >> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup. >> >> To avoid this kind of thundering herd problem, kernel previously had a >> "stats_flush_ongoing" concept, but this was removed as part of commit >> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch >> reintroduce and generalized the concept to apply to all users of cgroup >> rstat, not just memcg. >> >> If there is an ongoing rstat flush, and current cgroup is a descendant, >> then it is unnecessary to do the flush. For callers to still see updated >> stats, wait for ongoing flusher to complete before returning, but add >> timeout as stats are already inaccurate given updaters keeps running. >> >> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). >> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Thanks for working on this, Jesper! I love the data you collected here! > > I think the commit subject and message should be changed to better > describe the patch. This is a patch that exclusively modifies cgroup > code, yet the subject is about kswapd. This change affects all users > of rstat flushing. > > I think a better subject would be: > "cgroup/rstat: avoid flushing if there is an ongoing overlapping > flush" or similar. > Took this for V8. https://lore.kernel.org/all/172139415725.3084888.13770938453137383953.stgit@firesoul > The commit message should first describe the cgroup change, and then > use kswapd as a brief example/illustration of how the problem > manifests in practice. You should also include a brief summary of the > numbers you collected from prod. 
> Update desc in V8 >> --- >> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/ >> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/ >> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/ >> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/ >> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/ >> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/ >> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/ >> >> include/linux/cgroup-defs.h | 2 + >> kernel/cgroup/rstat.c | 95 ++++++++++++++++++++++++++++++++++++++----- >> 2 files changed, 85 insertions(+), 12 deletions(-) >> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h [...] >> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c >> index fb8b49437573..fe2a81a310bb 100644 >> --- a/kernel/cgroup/rstat.c >> +++ b/kernel/cgroup/rstat.c [...] >> static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) >> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) >> spin_unlock_irq(&cgroup_rstat_lock); >> } >> >> +#define MAX_WAIT msecs_to_jiffies(100) >> +/* Trylock helper that also checks for on ongoing flusher */ >> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp) >> +{ >> + struct cgroup *ongoing; >> + bool locked; >> + >> + /* Check if ongoing flusher is already taking care of this, if > > nit: I think commonly the comment would start on a new line after /*. > We use this comment style in networking code. I've updated it to follow this subsystem. >> + * we are a descendant skip work, but wait for ongoing flusher >> + * to complete work. >> + */ >> +retry: >> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >> + if (ongoing && cgroup_is_descendant(cgrp, ongoing)) { >> + wait_for_completion_interruptible_timeout( >> + &ongoing->flush_done, MAX_WAIT); >> + /* TODO: Add tracepoint here */ >> + return false; >> + } >> + >> + locked = __cgroup_rstat_trylock(cgrp, -1); >> + if (!locked) { >> + /* Contended: Handle loosing race for ongoing flusher */ > > nit: losing > Thanks for catching this subtle wording issue. >> + if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher)) >> + goto retry; >> + >> + __cgroup_rstat_lock(cgrp, -1, false); >> + } >> + /* Obtained lock, record this cgrp as the ongoing flusher */ > > Do we want a comment here to explain why there could be an existing > ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's > not super obvious. Extended this in V8. > >> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher); >> + if (!ongoing) { >> + reinit_completion(&cgrp->flush_done); >> + WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp); >> + } >> + return true; /* locked */ > > Would it be better to explain the return value of the function in the > comment above it? > Fixed this in V8. >> +} >> + >> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp) >> +{ >> + /* Detect if we are the ongoing flusher */ > > I think this is a bit obvious. > True, removed comment. >> + if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) { >> + WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL); >> + complete_all(&cgrp->flush_done); >> + } >> + __cgroup_rstat_unlock(cgrp, -1); >> +} >> + [...] Thanks for going through and commenting on the code! :-) --Jesper
On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > Hi Jesper, > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > [...] > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > @locked_time_level[0]: > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > Is it possible to get the above histogram for other levels as well? > > Data from other levels available in [1]: > [1] > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > flushes. > Thanks a lot of the data. Are all or most of these locked_time_level[0] from kswapds? This just motivates me to strongly push the ratelimited flush patch of mine (which would be orthogonal to your patch series). Shakeel
On Thu, Jul 18, 2024 at 08:11:32PM GMT, Yosry Ahmed wrote: > On Thu, Jul 18, 2024 at 5:41 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > Hi Jesper, > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > [...] > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > @locked_time_level[0]: > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > Is it possible to get the above histogram for other levels as well? I > > know this is 12 numa node machine, how many total CPUs are there? > > > > > The time is in nanosec, so M corresponds to ms (milliseconds). > > > > > > With 36 flushes per second (as shown earlier) this is a flush every > > > 27.7ms. It is not unreasonable (from above data) that the flush time > > > also spend 27ms, which means that we spend a full CPU second flushing. > > > That is spending too much time flushing. > > > > One idea to further reduce this time is more fine grained flush > > skipping. At the moment we either skip the whole flush or not. How > > about we make this decision per-cpu? We already have per-cpu updates > > data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu. > > Good idea. > > I think we would need a per-subsystem callback to decide whether we > want to flush the cgroup or not. This needs to happen in the core > rstat flushing code (not the memcg flushing code), as we need to make > sure we do not remove the cgroup from the per-cpu updated tree if we > don't flush it. Unless we have per-subsystem update tree, I don't think per-subsystem callback would work or we would be flushing all if any subsystem wants it. Anyways we can discuss when we have data that it really helps. > > More generally, I think we should be able to have a "force" flush API > that skips all optimizations and ensures that a flush occurs. I think > this will be needed in the cgroup_rstat_exit() path, where stats of a > cgroup being freed must be propagated to its parent, no matter how > insignificant they may be, to avoid inconsistencies. Agree.
On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > > Hi Jesper, > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > > > [...] > > > > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > > > @locked_time_level[0]: > > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > > > > Is it possible to get the above histogram for other levels as well? > > > > Data from other levels available in [1]: > > [1] > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > > flushes. > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0] > from kswapds? This just motivates me to strongly push the ratelimited > flush patch of mine (which would be orthogonal to your patch series). Jesper and I were discussing a better ratelimiting approach, whether it's measuring the time since the last flush, or only skipping if we have a lot of flushes in a specific time frame (using __ratelimit()). I believe this would be better than the current memcg ratelimiting approach, and we can remove the latter. WDYT? > > Shakeel
On 20/07/2024 06.52, Yosry Ahmed wrote: > On Fri, Jul 19, 2024 at 9:52 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >>> >>> On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: >>>> >>>> >>>> On 19/07/2024 02.40, Shakeel Butt wrote: >>>>> Hi Jesper, >>>>> >>>>> On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: >>>>>> >>>>> [...] >>>>>> >>>>>> >>>>>> Looking at the production numbers for the time the lock is held for level 0: >>>>>> >>>>>> @locked_time_level[0]: >>>>>> [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | >>>>>> [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| >>>>>> [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | >>>>>> [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | >>>>>> >>>>> >>>>> Is it possible to get the above histogram for other levels as well? >>>> >>>> Data from other levels available in [1]: >>>> [1] >>>> https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ >>>> >>>> IMHO the data shows we will get most out of skipping level-0 root-cgroup >>>> flushes. >>>> >>> >>> Thanks a lot of the data. Are all or most of these locked_time_level[0] >>> from kswapds? This just motivates me to strongly push the ratelimited >>> flush patch of mine (which would be orthogonal to your patch series). >> There are also others flushing level 0. Extended the bpftrace oneliner to also capture the process 'comm' name. (I reduced 'kworker' to one entry in below, e.g pattern 'kworker/u392:19'). grep 'level\[' out01.bpf_oneliner_locked_time | awk -F/ '{print $1}' | sort | uniq @locked_time_level[0, cadvisor]: @locked_time_level[0, consul]: @locked_time_level[0, kswapd0]: @locked_time_level[0, kswapd10]: @locked_time_level[0, kswapd11]: @locked_time_level[0, kswapd1]: @locked_time_level[0, kswapd2]: @locked_time_level[0, kswapd3]: @locked_time_level[0, kswapd4]: @locked_time_level[0, kswapd5]: @locked_time_level[0, kswapd6]: @locked_time_level[0, kswapd7]: @locked_time_level[0, kswapd8]: @locked_time_level[0, kswapd9]: @locked_time_level[0, kworker @locked_time_level[0, lassen]: @locked_time_level[0, thunderclap-san]: @locked_time_level[0, xdpd]: @locked_time_level[1, cadvisor]: @locked_time_level[2, cadvisor]: @locked_time_level[2, kworker @locked_time_level[2, memory-saturati]: @locked_time_level[2, systemd]: @locked_time_level[2, thread-saturati]: @locked_time_level[3, cadvisor]: @locked_time_level[3, cat]: @locked_time_level[3, kworker @locked_time_level[3, memory-saturati]: @locked_time_level[3, systemd]: @locked_time_level[3, thread-saturati]: @locked_time_level[4, cadvisor]: >> Jesper and I were discussing a better ratelimiting approach, whether >> it's measuring the time since the last flush, or only skipping if we >> have a lot of flushes in a specific time frame (using __ratelimit()). >> I believe this would be better than the current memcg ratelimiting >> approach, and we can remove the latter. >> >> WDYT? > > Forgot to link this: > https://lore.kernel.org/lkml/CAJD7tkZ5nxoa7aCpAix1bYOoYiLVfn+aNkq7jmRAZqsxruHYLw@mail.gmail.com/ > I agree that ratelimiting is orthogonal to this patch, and that we really need to address this in follow up patchset. The proposed mem_cgroup_flush_stats_ratelimited patch[1] helps, but is limited to memory area. I'm proposing a more generic solution in [2] that helps all users of rstat. 
It is time-based, because it makes sense to observe the time it takes to flush root (the service time), and then limit how soon afterwards another flusher can run (limiting the arrival rate). From practical queue theory we intuitively know that we should keep the arrival rate below the service rate, else a queue builds up. --Jesper [1] "memcg: use ratelimited stats flush in the reclaim" - https://lore.kernel.org/all/20240615081257.3945587-1-shakeel.butt@linux.dev/ [2] "cgroup/rstat: introduce ratelimited rstat flushing" - https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
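For illustration, not part of the thread: putting rough numbers from this thread into the queueing argument above. A root flush (the service step) takes about 27-30ms, so the service rate is roughly 33-37 flushes/sec. Capping arrivals at one flush per 50ms bounds the arrival rate at 20/sec, giving a utilization of roughly 27/50 ≈ 0.55, comfortably below 1, so no sustained queue of waiting flushers builds up. Without such a cap, 12 kswapd threads waking together push the arrival rate well above the service rate, which is exactly when spinning/queueing on the lock takes off.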
On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote: > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > > > Hi Jesper, > > > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > [...] > > > > > > > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > > > > > @locked_time_level[0]: > > > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > > > > > > > Is it possible to get the above histogram for other levels as well? > > > > > > Data from other levels available in [1]: > > > [1] > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > > > flushes. > > > > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0] > > from kswapds? This just motivates me to strongly push the ratelimited > > flush patch of mine (which would be orthogonal to your patch series). > > Jesper and I were discussing a better ratelimiting approach, whether > it's measuring the time since the last flush, or only skipping if we > have a lot of flushes in a specific time frame (using __ratelimit()). > I believe this would be better than the current memcg ratelimiting > approach, and we can remove the latter. > > WDYT? The last statement gives me the impression that you are trying to fix something that is not broken. The current ratelimiting users are ok, the issue is with the sync flushers. Or maybe you are suggesting that the new ratelimiting will be used for all sync flushers and current ratelimiting users and the new ratelimiting will make a good tradeoff between the accuracy and potential flush stall?
On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote: > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > > > > Hi Jesper, > > > > > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > > > > > > > @locked_time_level[0]: > > > > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > > > > > > > > > > Is it possible to get the above histogram for other levels as well? > > > > > > > > Data from other levels available in [1]: > > > > [1] > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > > > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > > > > flushes. > > > > > > > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0] > > > from kswapds? This just motivates me to strongly push the ratelimited > > > flush patch of mine (which would be orthogonal to your patch series). > > > > Jesper and I were discussing a better ratelimiting approach, whether > > it's measuring the time since the last flush, or only skipping if we > > have a lot of flushes in a specific time frame (using __ratelimit()). > > I believe this would be better than the current memcg ratelimiting > > approach, and we can remove the latter. > > > > WDYT? > > The last statement gives me the impression that you are trying to fix > something that is not broken. The current ratelimiting users are ok, the > issue is with the sync flushers. Or maybe you are suggesting that the new > ratelimiting will be used for all sync flushers and current ratelimiting > users and the new ratelimiting will make a good tradeoff between the > accuracy and potential flush stall? The latter. Basically the idea is to have more informed and generic ratelimiting logic in the core rstat flushing code (e.g. using __ratelimit()), which would apply to ~all flushers*. Then, we ideally wouldn't need mem_cgroup_flush_stats_ratelimited() at all. *The obvious exception is the force flushing case we discussed for cgroup_rstat_exit(). In fact, I think we need that even with the ongoing flusher optimization, because I think there is a slight chance that a flush is missed. It wouldn't be problematic for other flushers, but it certainly can be for cgroup_rstat_exit() as the stats will be completely dropped. The scenario I have in mind is: - CPU 1 starts a flush of cgroup A. Flushing complete, but waiters are not woke up yet. - CPU 2 updates the stats of cgroup A after it is flushed by CPU 1. - CPU 3 calls cgroup_rstat_exit(), sees the ongoing flusher and waits. - CPU 1 wakes up the waiters. - CPU 3 proceeds to destroy cgroup A, and the updates made by CPU 2 are lost.
On Mon, Jul 22, 2024 at 01:12:35PM GMT, Yosry Ahmed wrote: > On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote: > > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > > > > > Hi Jesper, > > > > > > > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > > > > > > > > > @locked_time_level[0]: > > > > > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > > > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > > > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > > > > > > > > > > > > > Is it possible to get the above histogram for other levels as well? > > > > > > > > > > Data from other levels available in [1]: > > > > > [1] > > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > > > > > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > > > > > flushes. > > > > > > > > > > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0] > > > > from kswapds? This just motivates me to strongly push the ratelimited > > > > flush patch of mine (which would be orthogonal to your patch series). > > > > > > Jesper and I were discussing a better ratelimiting approach, whether > > > it's measuring the time since the last flush, or only skipping if we > > > have a lot of flushes in a specific time frame (using __ratelimit()). > > > I believe this would be better than the current memcg ratelimiting > > > approach, and we can remove the latter. > > > > > > WDYT? > > > > The last statement gives me the impression that you are trying to fix > > something that is not broken. The current ratelimiting users are ok, the > > issue is with the sync flushers. Or maybe you are suggesting that the new > > ratelimiting will be used for all sync flushers and current ratelimiting > > users and the new ratelimiting will make a good tradeoff between the > > accuracy and potential flush stall? > > The latter. Basically the idea is to have more informed and generic > ratelimiting logic in the core rstat flushing code (e.g. using > __ratelimit()), which would apply to ~all flushers*. Then, we ideally > wouldn't need mem_cgroup_flush_stats_ratelimited() at all. > I wonder if we really need a universal ratelimit. As you noted below there are cases where we want exact stats and then we know there are cases where accurate stats are not needed but they are very performance sensitive. Aiming to have a solution which will ignore such differences might be a futile effort. > *The obvious exception is the force flushing case we discussed for > cgroup_rstat_exit(). > > In fact, I think we need that even with the ongoing flusher > optimization, because I think there is a slight chance that a flush is > missed. It wouldn't be problematic for other flushers, but it > certainly can be for cgroup_rstat_exit() as the stats will be > completely dropped. > > The scenario I have in mind is: > - CPU 1 starts a flush of cgroup A. Flushing complete, but waiters are > not woke up yet. 
> - CPU 2 updates the stats of cgroup A after it is flushed by CPU 1. > - CPU 3 calls cgroup_rstat_exit(), sees the ongoing flusher and waits. > - CPU 1 wakes up the waiters. > - CPU 3 proceeds to destroy cgroup A, and the updates made by CPU 2 are lost.
On Mon, Jul 22, 2024 at 02:32:03PM GMT, Shakeel Butt wrote: > On Mon, Jul 22, 2024 at 01:12:35PM GMT, Yosry Ahmed wrote: > > On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote: > > > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote: > > > > > > > Hi Jesper, > > > > > > > > > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > > > > > > > > > > > @locked_time_level[0]: > > > > > > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > > > > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > > > > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > > > > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > > > > > > > > > > > > > > > > Is it possible to get the above histogram for other levels as well? > > > > > > > > > > > > Data from other levels available in [1]: > > > > > > [1] > > > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/ > > > > > > > > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup > > > > > > flushes. > > > > > > > > > > > > > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0] > > > > > from kswapds? This just motivates me to strongly push the ratelimited > > > > > flush patch of mine (which would be orthogonal to your patch series). > > > > > > > > Jesper and I were discussing a better ratelimiting approach, whether > > > > it's measuring the time since the last flush, or only skipping if we > > > > have a lot of flushes in a specific time frame (using __ratelimit()). > > > > I believe this would be better than the current memcg ratelimiting > > > > approach, and we can remove the latter. > > > > > > > > WDYT? > > > > > > The last statement gives me the impression that you are trying to fix > > > something that is not broken. The current ratelimiting users are ok, the > > > issue is with the sync flushers. Or maybe you are suggesting that the new > > > ratelimiting will be used for all sync flushers and current ratelimiting > > > users and the new ratelimiting will make a good tradeoff between the > > > accuracy and potential flush stall? > > > > The latter. Basically the idea is to have more informed and generic > > ratelimiting logic in the core rstat flushing code (e.g. using > > __ratelimit()), which would apply to ~all flushers*. Then, we ideally > > wouldn't need mem_cgroup_flush_stats_ratelimited() at all. > > > > I wonder if we really need a universal ratelimit. As you noted below > there are cases where we want exact stats and then we know there are > cases where accurate stats are not needed but they are very performance > sensitive. Aiming to have a solution which will ignore such differences > might be a futile effort. > BTW I am not against it. If we can achieve this with minimal regression and maintainence burden then it would be preferable.
On Mon, Jul 22, 2024 at 3:59 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 22, 2024 at 02:32:03PM GMT, Shakeel Butt wrote:
> [...]
> > I wonder if we really need a universal ratelimit. As you noted below
> > there are cases where we want exact stats and then we know there are
> > cases where accurate stats are not needed but they are very performance
> > sensitive. Aiming to have a solution which will ignore such differences
> > might be a futile effort.
> >
>
> BTW I am not against it. If we can achieve this with minimal regression
> and maintainence burden then it would be preferable.

It is possible that it is a futile effort, but if it works, the memcg
flushing interface will be much better and we don't have to evaluate
whether ratelimiting is needed on a case-by-case basis.

According to Jesper's data, allowing a flush every 50ms at most may be
reasonable, which means we can ratelimit the flushes to 20 flushes per
second or similar. I think on average, this should provide enough
accuracy for most use cases, and it should also reduce flushes in the
cases that Jesper presented.

It's probably worth a try, especially that it does not involve changing
user visible ABIs so we can always go back to what we have today.
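To make the __ratelimit() idea concrete, a core-rstat ratelimit could look roughly like the sketch below. This is purely hypothetical and not part of this series; it bolts DEFINE_RATELIMIT_STATE onto the pre-series cgroup_rstat_flush() path using the ~20 flushes/sec figure from the discussion above, and deliberately ignores the open question of callers that need exact stats:

/* Hypothetical sketch only; not part of this patch series. */
#include <linux/ratelimit.h>

/* Allow at most one flush per 50ms, i.e. roughly 20 flushes per second. */
static DEFINE_RATELIMIT_STATE(cgroup_rstat_flush_rs, HZ / 20, 1);

__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
{
        might_sleep();

        /* Over the limit: skip this flush and accept slightly stale stats. */
        if (!__ratelimit(&cgroup_rstat_flush_rs))
                return;

        __cgroup_rstat_lock(cgrp, -1);
        cgroup_rstat_flush_locked(cgrp);
        __cgroup_rstat_unlock(cgrp, -1);
}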
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b36690ca0d3f..a33b37514c29 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -548,6 +548,8 @@ struct cgroup {
 #ifdef CONFIG_BPF_SYSCALL
        struct bpf_local_storage __rcu *bpf_cgrp_storage;
 #endif
+       /* completion queue for cgrp_rstat_ongoing_flusher */
+       struct completion flush_done;
 
        /* All ancestors including self */
        struct cgroup *ancestors[];
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index fb8b49437573..fe2a81a310bb 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -2,6 +2,7 @@
 #include "cgroup-internal.h"
 
 #include <linux/sched/cputime.h>
+#include <linux/completion.h>
 
 #include <linux/bpf.h>
 #include <linux/btf.h>
@@ -11,6 +12,7 @@
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -279,17 +281,32 @@ __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop)
+{
+       bool locked;
+
+       locked = spin_trylock_irq(&cgroup_rstat_lock);
+       if (!locked)
+               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true);
+       else
+               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false);
+
+       return locked;
+}
+
+static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
+                                      bool check_contention)
        __acquires(&cgroup_rstat_lock)
 {
-       bool contended;
+       bool locked = false;
 
-       contended = !spin_trylock_irq(&cgroup_rstat_lock);
-       if (contended) {
-               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
+       if (check_contention)
+               locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
+
+       if (!locked) {
                spin_lock_irq(&cgroup_rstat_lock);
+               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention);
        }
-       trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
 static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
@@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
        spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+#define MAX_WAIT       msecs_to_jiffies(100)
+/* Trylock helper that also checks for an ongoing flusher */
+static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
+{
+       struct cgroup *ongoing;
+       bool locked;
+
+       /* Check if ongoing flusher is already taking care of this, if
+        * we are a descendant skip work, but wait for ongoing flusher
+        * to complete work.
+        */
+retry:
+       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
+       if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
+               wait_for_completion_interruptible_timeout(
+                       &ongoing->flush_done, MAX_WAIT);
+               /* TODO: Add tracepoint here */
+               return false;
+       }
+
+       locked = __cgroup_rstat_trylock(cgrp, -1);
+       if (!locked) {
+               /* Contended: Handle losing race for ongoing flusher */
+               if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
+                       goto retry;
+
+               __cgroup_rstat_lock(cgrp, -1, false);
+       }
+       /* Obtained lock, record this cgrp as the ongoing flusher */
+       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
+       if (!ongoing) {
+               reinit_completion(&cgrp->flush_done);
+               WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
+       }
+       return true; /* locked */
+}
+
+static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
+{
+       /* Detect if we are the ongoing flusher */
+       if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
+               WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
+               complete_all(&cgrp->flush_done);
+       }
+       __cgroup_rstat_unlock(cgrp, -1);
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
        __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
                        __cgroup_rstat_unlock(cgrp, cpu);
                        if (!cond_resched())
                                cpu_relax();
-                       __cgroup_rstat_lock(cgrp, cpu);
+                       __cgroup_rstat_lock(cgrp, cpu, true);
                }
        }
 }
@@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
        might_sleep();
 
-       __cgroup_rstat_lock(cgrp, -1);
+       if (!cgroup_rstat_trylock_flusher(cgrp))
+               return;
+
        cgroup_rstat_flush_locked(cgrp);
-       __cgroup_rstat_unlock(cgrp, -1);
+       cgroup_rstat_unlock_flusher(cgrp);
 }
 
 /**
@@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
        __acquires(&cgroup_rstat_lock)
 {
        might_sleep();
-       __cgroup_rstat_lock(cgrp, -1);
-       cgroup_rstat_flush_locked(cgrp);
+
+       if (cgroup_rstat_trylock_flusher(cgrp))
+               cgroup_rstat_flush_locked(cgrp);
+       else
+               __cgroup_rstat_lock(cgrp, -1, false);
 }
 
 /**
@@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 void cgroup_rstat_flush_release(struct cgroup *cgrp)
        __releases(&cgroup_rstat_lock)
 {
-       __cgroup_rstat_unlock(cgrp, -1);
+       cgroup_rstat_unlock_flusher(cgrp);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)
@@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
                u64_stats_init(&rstatc->bsync);
        }
 
+       init_completion(&cgrp->flush_done);
+
        return 0;
 }
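The wait logic in cgroup_rstat_trylock_flusher() above leans on the kernel completion API. The snippet below is a minimal sketch (not taken from the patch) of that handshake, spelling out what the MAX_WAIT timeout means for waiters; it assumes init_completion() ran once at setup time, as the patch does in cgroup_rstat_init():

/* Sketch only: the completion handshake behind the ongoing-flusher logic. */
#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static void waiter_side(struct completion *flush_done)
{
        long ret;

        ret = wait_for_completion_interruptible_timeout(flush_done,
                                                        msecs_to_jiffies(100));
        if (ret > 0) {
                /* Flusher called complete_all(); ret is the jiffies left. */
        } else if (ret == 0) {
                /* Timed out: stop waiting and accept possibly stale stats. */
        } else if (ret == -ERESTARTSYS) {
                /* Interrupted by a signal before the flusher finished. */
        }
}

static void flusher_side(struct completion *flush_done)
{
        reinit_completion(flush_done); /* re-arm for this flush cycle */
        /* ... perform the flush ... */
        complete_all(flush_done);      /* wake every waiter at once */
}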
Avoid lock contention on the global cgroup rstat lock caused by kswapd
starting on all NUMA nodes simultaneously. At Cloudflare, we observed
massive issues due to kswapd and the specific mem_cgroup_flush_stats()
call inlined in shrink_node, which takes the rstat lock.

On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
we noted severe lock contention on the rstat lock. This contention
causes 12 CPUs to waste cycles spinning every time kswapd runs.
Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
due to spinning on the rstat lock.

To help reviewers follow the code: __alloc_pages_slowpath calls
wake_all_kswapds, causing all kswapdN threads to wake up simultaneously.
Each kswapd thread invokes shrink_node (via balance_pgdat), triggering
the cgroup rstat flush operation as part of its work. This results in
kernel self-induced rstat lock contention by waking up all kswapd
threads simultaneously. Leveraging this detail: balance_pgdat() has a
NULL value in target_mem_cgroup, which causes mem_cgroup_flush_stats()
to flush with root_mem_cgroup.

To avoid this kind of thundering herd problem, the kernel previously had
a "stats_flush_ongoing" concept, but this was removed as part of commit
7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
reintroduces and generalizes the concept to apply to all users of cgroup
rstat, not just memcg.

If there is an ongoing rstat flush, and the current cgroup is a
descendant, then it is unnecessary to do the flush. For callers to still
see updated stats, wait for the ongoing flusher to complete before
returning, but add a timeout as stats are already inaccurate given that
updaters keep running.

Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/

 include/linux/cgroup-defs.h |  2 +
 kernel/cgroup/rstat.c       | 95 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 85 insertions(+), 12 deletions(-)