
[V7,1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Message ID 172070450139.2992819.13210624094367257881.stgit@firesoul (mailing list archive)

Commit Message

Jesper Dangaard Brouer July 11, 2024, 1:28 p.m. UTC
Avoid lock contention on the global cgroup rstat lock caused by kswapd
starting on all NUMA nodes simultaneously. At Cloudflare, we observed
massive issues due to kswapd and the specific mem_cgroup_flush_stats()
call inlined in shrink_node, which takes the rstat lock.

On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
we noted severe lock contention on the rstat lock. This contention
causes 12 CPUs to waste cycles spinning every time kswapd runs.
Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
due to spinning on the rstat lock.

To help reviewers follow the code: __alloc_pages_slowpath() calls
wake_all_kswapds(), causing all kswapdN threads to wake up simultaneously.
Each kswapd thread invokes shrink_node() (via balance_pgdat()), triggering
the cgroup rstat flush operation as part of its work. Waking all kswapd
threads at once thus results in kernel self-induced rstat lock contention.
A relevant detail: balance_pgdat() has a NULL target_mem_cgroup, which
causes mem_cgroup_flush_stats() to flush with root_mem_cgroup.
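
For illustration only (not part of the patch), the call path described
above looks roughly like this:

  __alloc_pages_slowpath()
    wake_all_kswapds()                 - wakes kswapdN on every NUMA node
      kswapd() -> balance_pgdat()      - target_mem_cgroup == NULL
        shrink_node()
          mem_cgroup_flush_stats(NULL) - NULL target means root_mem_cgroup
            cgroup_rstat_flush()       - takes the global cgroup_rstat_lock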

To avoid this kind of thundering herd problem, the kernel previously had a
"stats_flush_ongoing" concept, but this was removed as part of commit
7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
reintroduces and generalizes the concept to apply to all users of cgroup
rstat, not just memcg.

If there is an ongoing rstat flush and the current cgroup is a descendant
of the ongoing flusher, then doing the flush is unnecessary. So that
callers still see updated stats, wait for the ongoing flusher to complete
before returning, but with a timeout, as the stats are already inaccurate
given that updaters keep running.

Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/

 include/linux/cgroup-defs.h |    2 +
 kernel/cgroup/rstat.c       |   95 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 85 insertions(+), 12 deletions(-)

Comments

Jesper Dangaard Brouer July 16, 2024, 8:42 a.m. UTC | #1
Production results below.
  - On an AMD CPU with 12 NUMA nodes and 192 cores.

Using this bpftrace (looong) oneliner:

  sudo bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_locked {@cnt[probe]=count();
    if (args->contended) {
      @lock_contended["after_obtaining_lock", args->level]=count();
    } else {
      if (args->cpu == -1) { @lock_contended["no_ongoing_flusher", args->level]}
    }}
   tracepoint:cgroup:cgroup_rstat_lock_contended {@cnt[probe]=count();
    if (args->cpu == -1) {@lock_contended["normal", args->level] = count();}
    else {@lock_contended["yield", args->level] = count();}}
   kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()}
   tracepoint:cgroup:cgroup_ongoing_flusher {@start[tid]=nsecs;
   if (args->race) {
     printf(" - XXXXXXX[%d]: Handled RACE(%d) for ongoing flusher\n", 
tid, args->race);
     @ongoing_flusher_cnt["handled_race"]=count(); }
   }
   tracepoint:cgroup:cgroup_ongoing_flusher_wait {
    $now=nsecs; $start=@start[tid]; $diff=$now-$start;
    @res=hist(args->res);
    @wait=hist($diff);
    @cnt[probe]=count();
    @ongoing_flusher_cnt[comm]=count();
    @ongoing_flusher_cnt["all"]=count();
    $r=args->race;
    if ($r) {
     printf(" - Success[%d]: Handled RACE(%d) for ongoing flusher\n", 
tid, $r) }
    }
   interval:s:1 {time("%H:%M:%S ");
    print(@ongoing_flusher_cnt);
    print(@cnt);
    print(@lock_contended);
    clear (@cnt);
    clear (@ongoing_flusher_cnt);
    clear (@lock_contended); }
   END {clear(@start)}'

On 11/07/2024 15.28, Jesper Dangaard Brouer wrote:
> Avoid lock contention on the global cgroup rstat lock caused by kswapd
> starting on all NUMA nodes simultaneously. At Cloudflare, we observed
> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
> call inlined in shrink_node, which takes the rstat lock.
> 
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
> we noted severe lock contention on the rstat lock. This contention
> causes 12 CPUs to waste cycles spinning every time kswapd runs.
> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
> due to spinning on the rstat lock.
> 
> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds
> causing all kswapdN threads to wake up simultaneously. The kswapd thread
> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush
> operation as part of its work. This results in kernel self-induced rstat
> lock contention by waking up all kswapd threads simultaneously. Leveraging
> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this
> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup.
> 
> To avoid this kind of thundering herd problem, kernel previously had a
> "stats_flush_ongoing" concept, but this was removed as part of commit
> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
> reintroduce and generalized the concept to apply to all users of cgroup
> rstat, not just memcg.
> 
> If there is an ongoing rstat flush, and current cgroup is a descendant,
> then it is unnecessary to do the flush. For callers to still see updated
> stats, wait for ongoing flusher to complete before returning, but add
> timeout as stats are already inaccurate given updaters keeps running.
> 
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/
> 
>   include/linux/cgroup-defs.h |    2 +
>   kernel/cgroup/rstat.c       |   95 ++++++++++++++++++++++++++++++++++++++-----
>   2 files changed, 85 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index b36690ca0d3f..a33b37514c29 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -548,6 +548,8 @@ struct cgroup {
>   #ifdef CONFIG_BPF_SYSCALL
>   	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
>   #endif
> +	/* completion queue for cgrp_rstat_ongoing_flusher */
> +	struct completion flush_done;
>   
>   	/* All ancestors including self */
>   	struct cgroup *ancestors[];
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index fb8b49437573..fe2a81a310bb 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -2,6 +2,7 @@
>   #include "cgroup-internal.h"
>   
>   #include <linux/sched/cputime.h>
> +#include <linux/completion.h>
>   
>   #include <linux/bpf.h>
>   #include <linux/btf.h>
> @@ -11,6 +12,7 @@
>   
>   static DEFINE_SPINLOCK(cgroup_rstat_lock);
>   static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
>   
>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>   
> @@ -279,17 +281,32 @@ __bpf_hook_end();
>    * value -1 is used when obtaining the main lock else this is the CPU
>    * number processed last.
>    */
> -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
> +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop)
> +{
> +	bool locked;
> +
> +	locked = spin_trylock_irq(&cgroup_rstat_lock);
> +	if (!locked)
> +		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true);
> +	else
> +		trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false);
> +
> +	return locked;
> +}
> +
> +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
> +				       bool check_contention)
>   	__acquires(&cgroup_rstat_lock)
>   {
> -	bool contended;
> +	bool locked = false;
>   
> -	contended = !spin_trylock_irq(&cgroup_rstat_lock);
> -	if (contended) {
> -		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> +	if (check_contention)
> +		locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
> +
> +	if (!locked) {
>   		spin_lock_irq(&cgroup_rstat_lock);
> +		trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention);
>   	}
> -	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>   }
>   
>   static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>   	spin_unlock_irq(&cgroup_rstat_lock);
>   }
>   
> +#define MAX_WAIT	msecs_to_jiffies(100)
> +/* Trylock helper that also checks for on ongoing flusher */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> +{
> +	struct cgroup *ongoing;
> +	bool locked;
> +
> +	/* Check if ongoing flusher is already taking care of this, if
> +	 * we are a descendant skip work, but wait for ongoing flusher
> +	 * to complete work.
> +	 */
> +retry:
> +	ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +	if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
> +		wait_for_completion_interruptible_timeout(
> +			&ongoing->flush_done, MAX_WAIT);
> +		/* TODO: Add tracepoint here */
> +		return false;
> +	}
> +
> +	locked = __cgroup_rstat_trylock(cgrp, -1);
> +	if (!locked) {
> +		/* Contended: Handle loosing race for ongoing flusher */
> +		if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
> +			goto retry;

This race does happen in production:

  - XXXXXXX[1040]: Handled RACE(1) for ongoing flusher
  - Success[1040]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1038]: Handled RACE(1) for ongoing flusher
  - Success[1038]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1040]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher
  - Success[1039]: Handled RACE(1) for ongoing flusher
  - Success[1040]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher
  - Success[1047]: Handled RACE(1) for ongoing flusher
  - Success[1044]: Handled RACE(1) for ongoing flusher
  - Success[1045]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1042]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1037]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher
  - Success[1043]: Handled RACE(1) for ongoing flusher
  - Success[1037]: Handled RACE(1) for ongoing flusher
  - Success[1039]: Handled RACE(1) for ongoing flusher
  - Success[1044]: Handled RACE(1) for ongoing flusher
  - Success[1042]: Handled RACE(1) for ongoing flusher
  - Success[1047]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1038]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1039]: Handled RACE(1) for ongoing flusher
  - Success[1039]: Handled RACE(1) for ongoing flusher
  - Success[1038]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher
  - Success[1048]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1047]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1046]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher
  - Success[1045]: Handled RACE(1) for ongoing flusher
  - Success[1043]: Handled RACE(1) for ongoing flusher
  - Success[1047]: Handled RACE(1) for ongoing flusher
  - Success[1046]: Handled RACE(1) for ongoing flusher
  - Success[1048]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1045]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1043]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1042]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1041]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1044]: Handled RACE(1) for ongoing flusher
  - Success[1045]: Handled RACE(1) for ongoing flusher
  - Success[1041]: Handled RACE(1) for ongoing flusher
  - Success[1042]: Handled RACE(1) for ongoing flusher
  - Success[1043]: Handled RACE(1) for ongoing flusher
  - Success[1044]: Handled RACE(1) for ongoing flusher
  - XXXXXXX[1048]: Handled RACE(1) for ongoing flusher
  - Success[1048]: Handled RACE(1) for ongoing flusher
07:42:02 @ongoing_flusher_cnt[kswapd11]: 7
@ongoing_flusher_cnt[kswapd4]: 7
@ongoing_flusher_cnt[kswapd10]: 7
@ongoing_flusher_cnt[kswapd9]: 8
@ongoing_flusher_cnt[kswapd6]: 8
@ongoing_flusher_cnt[kswapd3]: 8
@ongoing_flusher_cnt[kswapd8]: 8
@ongoing_flusher_cnt[kswapd7]: 9
@ongoing_flusher_cnt[kswapd1]: 9
@ongoing_flusher_cnt[kswapd0]: 9
@ongoing_flusher_cnt[kswapd5]: 9
@ongoing_flusher_cnt[kswapd2]: 9
@ongoing_flusher_cnt[handled_race]: 27
@ongoing_flusher_cnt[all]: 98
@cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 27
@cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 37
@cnt[tracepoint:cgroup:cgroup_rstat_locked]: 45
@cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 98
@lock_contended[normal, 0]: 27


This is a clean (meaning no cadvisor interference) example of kswapd
starting simultaneously on many NUMA nodes, which in 27 out of 98 cases
hit the race (handled in V6 and V7).

The BPF "cnt" maps are getting cleared every second, so this
approximates per sec numbers.  This patch reduce pressure on the lock,
but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
flushes approx 37 per sec (every 27 ms). On the positive side
ongoing_flusher mitigation stopped 98 per sec of these.

In this clean kswapd case, the patch removes the lock contention issue
for kswapd. The 27 lock_contended cases all seem to be related to the
27 handled_race cases.

The remaining high flush rate should also be addressed, and we should
also work on approaches to limit it, like my earlier proposal [1].

  [1] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/#Z31kernel:cgroup:rstat.c
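
For reference, a minimal sketch of the kind of time-based limit discussed
in [1] (illustrative only, not the code from that proposal; the 50 ms
threshold and the names root_flush_ratelimited/last_root_flush_jiffies
are assumptions):

  /* Illustrative only: skip a root-level flush if one ran very recently */
  static unsigned long last_root_flush_jiffies;   /* hypothetical state */

  static bool root_flush_ratelimited(void)
  {
  	unsigned long interval = msecs_to_jiffies(50);

  	if (time_before(jiffies, READ_ONCE(last_root_flush_jiffies) + interval))
  		return true;	/* too soon: skip this flush */

  	WRITE_ONCE(last_root_flush_jiffies, jiffies);
  	return false;
  }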


> +
> +		__cgroup_rstat_lock(cgrp, -1, false);
> +	}
> +	/* Obtained lock, record this cgrp as the ongoing flusher */
> +	ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +	if (!ongoing) {
> +		reinit_completion(&cgrp->flush_done);
> +		WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
> +	}
> +	return true; /* locked */
> +}
> +
> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
> +{
> +	/* Detect if we are the ongoing flusher */
> +	if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
> +		WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
> +		complete_all(&cgrp->flush_done);
> +	}
> +	__cgroup_rstat_unlock(cgrp, -1);
> +}
> +
>   /* see cgroup_rstat_flush() */
>   static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>   	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
> @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>   			__cgroup_rstat_unlock(cgrp, cpu);
>   			if (!cond_resched())
>   				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> +			__cgroup_rstat_lock(cgrp, cpu, true);
>   		}
>   	}
>   }
> @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>   {
>   	might_sleep();
>   
> -	__cgroup_rstat_lock(cgrp, -1);
> +	if (!cgroup_rstat_trylock_flusher(cgrp))
> +		return;
> +
>   	cgroup_rstat_flush_locked(cgrp);
> -	__cgroup_rstat_unlock(cgrp, -1);
> +	cgroup_rstat_unlock_flusher(cgrp);
>   }
>   
>   /**
> @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>   	__acquires(&cgroup_rstat_lock)
>   {
>   	might_sleep();
> -	__cgroup_rstat_lock(cgrp, -1);
> -	cgroup_rstat_flush_locked(cgrp);
> +
> +	if (cgroup_rstat_trylock_flusher(cgrp))
> +		cgroup_rstat_flush_locked(cgrp);
> +	else
> +		__cgroup_rstat_lock(cgrp, -1, false);
>   }
>   
>   /**
> @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>   void cgroup_rstat_flush_release(struct cgroup *cgrp)
>   	__releases(&cgroup_rstat_lock)
>   {
> -	__cgroup_rstat_unlock(cgrp, -1);
> +	cgroup_rstat_unlock_flusher(cgrp);
>   }
>   
>   int cgroup_rstat_init(struct cgroup *cgrp)
> @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>   		u64_stats_init(&rstatc->bsync);
>   	}
>   
> +	init_completion(&cgrp->flush_done);
> +
>   	return 0;
>   }
>   
> 
>
Yosry Ahmed July 17, 2024, 12:30 a.m. UTC | #2
On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> Avoid lock contention on the global cgroup rstat lock caused by kswapd
> starting on all NUMA nodes simultaneously. At Cloudflare, we observed
> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
> call inlined in shrink_node, which takes the rstat lock.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
> we noted severe lock contention on the rstat lock. This contention
> causes 12 CPUs to waste cycles spinning every time kswapd runs.
> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
> due to spinning on the rstat lock.
>
> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds
> causing all kswapdN threads to wake up simultaneously. The kswapd thread
> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush
> operation as part of its work. This results in kernel self-induced rstat
> lock contention by waking up all kswapd threads simultaneously. Leveraging
> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this
> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup.
>
> To avoid this kind of thundering herd problem, kernel previously had a
> "stats_flush_ongoing" concept, but this was removed as part of commit
> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
> reintroduce and generalized the concept to apply to all users of cgroup
> rstat, not just memcg.
>
> If there is an ongoing rstat flush, and current cgroup is a descendant,
> then it is unnecessary to do the flush. For callers to still see updated
> stats, wait for ongoing flusher to complete before returning, but add
> timeout as stats are already inaccurate given updaters keeps running.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Thanks for working on this, Jesper! I love the data you collected here!

I think the commit subject and message should be changed to better
describe the patch. This is a patch that exclusively modifies cgroup
code, yet the subject is about kswapd. This change affects all users
of rstat flushing.

I think a better subject would be:
"cgroup/rstat: avoid flushing if there is an ongoing overlapping
flush" or similar.

The commit message should first describe the cgroup change, and then
use kswapd as a brief example/illustration of how the problem
manifests in practice. You should also include a brief summary of the
numbers you collected from prod.

> ---
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/
>
>  include/linux/cgroup-defs.h |    2 +
>  kernel/cgroup/rstat.c       |   95 ++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 85 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index b36690ca0d3f..a33b37514c29 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -548,6 +548,8 @@ struct cgroup {
>  #ifdef CONFIG_BPF_SYSCALL
>         struct bpf_local_storage __rcu  *bpf_cgrp_storage;
>  #endif
> +       /* completion queue for cgrp_rstat_ongoing_flusher */
> +       struct completion flush_done;
>
>         /* All ancestors including self */
>         struct cgroup *ancestors[];
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index fb8b49437573..fe2a81a310bb 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -2,6 +2,7 @@
>  #include "cgroup-internal.h"
>
>  #include <linux/sched/cputime.h>
> +#include <linux/completion.h>
>
>  #include <linux/bpf.h>
>  #include <linux/btf.h>
> @@ -11,6 +12,7 @@
>
>  static DEFINE_SPINLOCK(cgroup_rstat_lock);
>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
>
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>
> @@ -279,17 +281,32 @@ __bpf_hook_end();
>   * value -1 is used when obtaining the main lock else this is the CPU
>   * number processed last.
>   */
> -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
> +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop)
> +{
> +       bool locked;
> +
> +       locked = spin_trylock_irq(&cgroup_rstat_lock);
> +       if (!locked)
> +               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true);
> +       else
> +               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false);
> +
> +       return locked;
> +}
> +
> +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
> +                                      bool check_contention)
>         __acquires(&cgroup_rstat_lock)
>  {
> -       bool contended;
> +       bool locked = false;
>
> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
> -       if (contended) {
> -               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> +       if (check_contention)
> +               locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
> +
> +       if (!locked) {
>                 spin_lock_irq(&cgroup_rstat_lock);
> +               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention);
>         }
> -       trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>  }

I will let Tejun and others weigh in about what tracepoints we should
have going forward and how they should be structured (which also
includes patch 2). I see the tremendous value they have to collect
data from prod, but I do not see similar existing tracepoints, and I
am not sure if the branching here could have an effect when the
tracepoints are off.

>
>  static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>         spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> +#define MAX_WAIT       msecs_to_jiffies(100)
> +/* Trylock helper that also checks for on ongoing flusher */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> +{
> +       struct cgroup *ongoing;
> +       bool locked;
> +
> +       /* Check if ongoing flusher is already taking care of this, if

nit: I think commonly the comment would start on a new line after /*.

> +        * we are a descendant skip work, but wait for ongoing flusher
> +        * to complete work.
> +        */
> +retry:
> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
> +               wait_for_completion_interruptible_timeout(
> +                       &ongoing->flush_done, MAX_WAIT);
> +               /* TODO: Add tracepoint here */
> +               return false;
> +       }
> +
> +       locked = __cgroup_rstat_trylock(cgrp, -1);
> +       if (!locked) {
> +               /* Contended: Handle loosing race for ongoing flusher */

nit: losing

> +               if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
> +                       goto retry;
> +
> +               __cgroup_rstat_lock(cgrp, -1, false);
> +       }
> +       /* Obtained lock, record this cgrp as the ongoing flusher */

Do we want a comment here to explain why there could be an existing
ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's
not super obvious.

> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +       if (!ongoing) {
> +               reinit_completion(&cgrp->flush_done);
> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
> +       }
> +       return true; /* locked */

Would it be better to explain the return value of the function in the
comment above it?

> +}
> +
> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
> +{
> +       /* Detect if we are the ongoing flusher */

I think this is a bit obvious.

> +       if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
> +               complete_all(&cgrp->flush_done);
> +       }
> +       __cgroup_rstat_unlock(cgrp, -1);
> +}
> +
>  /* see cgroup_rstat_flush() */
>  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>         __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
> @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>                         __cgroup_rstat_unlock(cgrp, cpu);
>                         if (!cond_resched())
>                                 cpu_relax();
> -                       __cgroup_rstat_lock(cgrp, cpu);
> +                       __cgroup_rstat_lock(cgrp, cpu, true);
>                 }
>         }
>  }
> @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>  {
>         might_sleep();
>
> -       __cgroup_rstat_lock(cgrp, -1);
> +       if (!cgroup_rstat_trylock_flusher(cgrp))
> +               return;
> +
>         cgroup_rstat_flush_locked(cgrp);
> -       __cgroup_rstat_unlock(cgrp, -1);
> +       cgroup_rstat_unlock_flusher(cgrp);
>  }
>
>  /**
> @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>         __acquires(&cgroup_rstat_lock)
>  {
>         might_sleep();
> -       __cgroup_rstat_lock(cgrp, -1);
> -       cgroup_rstat_flush_locked(cgrp);
> +
> +       if (cgroup_rstat_trylock_flusher(cgrp))
> +               cgroup_rstat_flush_locked(cgrp);
> +       else
> +               __cgroup_rstat_lock(cgrp, -1, false);
>  }
>
>  /**
> @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>  void cgroup_rstat_flush_release(struct cgroup *cgrp)
>         __releases(&cgroup_rstat_lock)
>  {
> -       __cgroup_rstat_unlock(cgrp, -1);
> +       cgroup_rstat_unlock_flusher(cgrp);
>  }
>
>  int cgroup_rstat_init(struct cgroup *cgrp)
> @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>                 u64_stats_init(&rstatc->bsync);
>         }
>
> +       init_completion(&cgrp->flush_done);
> +
>         return 0;
>  }
>
>
>
Yosry Ahmed July 17, 2024, 12:35 a.m. UTC | #3
[..]
>
>
> This is a clean (meaning no cadvisor interference) example of kswapd
> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> hit the race (which is handled in V6 and V7).
>
> The BPF "cnt" maps are getting cleared every second, so this
> approximates per sec numbers.  This patch reduce pressure on the lock,
> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> flushes approx 37 per sec (every 27 ms). On the positive side
> ongoing_flusher mitigation stopped 98 per sec of these.
>
> In this clean kswapd case the patch removes the lock contention issue
> for kswapd. The lock_contended cases 27 seems to be all related to
> handled_race cases 27.
>
> The remaning high flush rate should also be addressed, and we should
> also work on aproaches to limit this like my ealier proposal[1].

I honestly don't think a high number of flushes is a problem on its
own as long as we are not spending too much time flushing, especially
when we have magnitude-based thresholding so we know there is
something to flush (although it may not be relevant to what we are
doing).

If we keep observing a lot of lock contention, one thing that I
thought about is to have a variant of spin_lock with a timeout. This
limits the flushing latency, instead of limiting the number of flushes
(which I believe is the wrong metric to optimize).

It also seems to me that we are doing a flush every 27 ms, and your
proposed threshold was once per 50 ms. That doesn't seem like a
fundamental difference.

I am also wondering how many more flushes could be skipped if we
handle the case of multiple ongoing flushers (whether by using a
mutex, or making it a per-cgroup property as I suggested earlier).
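
For what it's worth, a very rough sketch of the per-cgroup property idea
(purely illustrative; the rstat_flush_ongoing field and the helper name
are hypothetical and not part of this patch):

  /*
   * Sketch only: let each cgroup record an in-flight flush, so several
   * non-overlapping subtree flushers can be "ongoing" at the same time.
   */
  static bool cgroup_flush_covered_by_ongoing(struct cgroup *cgrp)
  {
  	struct cgroup *iter;

  	/* If cgrp or any ancestor is mid-flush, our stats are being handled */
  	for (iter = cgrp; iter; iter = cgroup_parent(iter))
  		if (READ_ONCE(iter->rstat_flush_ongoing))  /* hypothetical field */
  			return true;

  	return false;
  }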
Waiman Long July 17, 2024, 3 a.m. UTC | #4
On 7/16/24 20:35, Yosry Ahmed wrote:
> [..]
>>
>> This is a clean (meaning no cadvisor interference) example of kswapd
>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
>> hit the race (which is handled in V6 and V7).
>>
>> The BPF "cnt" maps are getting cleared every second, so this
>> approximates per sec numbers.  This patch reduce pressure on the lock,
>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>> flushes approx 37 per sec (every 27 ms). On the positive side
>> ongoing_flusher mitigation stopped 98 per sec of these.
>>
>> In this clean kswapd case the patch removes the lock contention issue
>> for kswapd. The lock_contended cases 27 seems to be all related to
>> handled_race cases 27.
>>
>> The remaning high flush rate should also be addressed, and we should
>> also work on aproaches to limit this like my ealier proposal[1].
> I honestly don't think a high number of flushes is a problem on its
> own as long as we are not spending too much time flushing, especially
> when we have magnitude-based thresholding so we know there is
> something to flush (although it may not be relevant to what we are
> doing).
>
> If we keep observing a lot of lock contention, one thing that I
> thought about is to have a variant of spin_lock with a timeout. This
> limits the flushing latency, instead of limiting the number of flushes
> (which I believe is the wrong metric to optimize).

Except for semaphores, none of our locking primitives allow for a timeout 
parameter. For sleeping locks, I don't think it is hard to add variants 
with a timeout parameter, but not for the spinning locks.
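
As an aside, a rough sketch of what the one existing timeout-capable
primitive would look like if used as a flush lock (illustrative only,
not a proposal; the rstat_flush_* names are made up, while
DEFINE_SEMAPHORE()/down_timeout() are existing kernel APIs):

  #include <linux/semaphore.h>

  static DEFINE_SEMAPHORE(rstat_flush_sem, 1);	/* used as a binary lock */

  static bool rstat_flush_lock_timeout(long timeout_jiffies)
  {
  	/* down_timeout() returns 0 on success, -ETIME if the timeout expired */
  	return down_timeout(&rstat_flush_sem, timeout_jiffies) == 0;
  }

  static void rstat_flush_unlock(void)
  {
  	up(&rstat_flush_sem);
  }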

Cheers,
Longman
Jesper Dangaard Brouer July 17, 2024, 7:32 a.m. UTC | #5
On 17/07/2024 02.30, Yosry Ahmed wrote:
> On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> Avoid lock contention on the global cgroup rstat lock caused by kswapd
>> starting on all NUMA nodes simultaneously. At Cloudflare, we observed
>> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
>> call inlined in shrink_node, which takes the rstat lock.
>>
>> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
>> we noted severe lock contention on the rstat lock. This contention
>> causes 12 CPUs to waste cycles spinning every time kswapd runs.
>> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
>> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
>> due to spinning on the rstat lock.
>>
>> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds
>> causing all kswapdN threads to wake up simultaneously. The kswapd thread
>> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush
>> operation as part of its work. This results in kernel self-induced rstat
>> lock contention by waking up all kswapd threads simultaneously. Leveraging
>> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this
>> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup.
>>
>> To avoid this kind of thundering herd problem, kernel previously had a
>> "stats_flush_ongoing" concept, but this was removed as part of commit
>> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
>> reintroduce and generalized the concept to apply to all users of cgroup
>> rstat, not just memcg.
>>
>> If there is an ongoing rstat flush, and current cgroup is a descendant,
>> then it is unnecessary to do the flush. For callers to still see updated
>> stats, wait for ongoing flusher to complete before returning, but add
>> timeout as stats are already inaccurate given updaters keeps running.
>>
>> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Thanks for working on this, Jesper! I love the data you collected here!
> 
> I think the commit subject and message should be changed to better
> describe the patch. This is a patch that exclusively modifies cgroup
> code, yet the subject is about kswapd. This change affects all users
> of rstat flushing.
> 
> I think a better subject would be:
> "cgroup/rstat: avoid flushing if there is an ongoing overlapping
> flush" or similar.
> 
> The commit message should first describe the cgroup change, and then
> use kswapd as a brief example/illustration of how the problem
> manifests in practice. You should also include a brief summary of the
> numbers you collected from prod.
> 

I agree that the description should be updated as the patch has evolved,
and your suggestions make sense, thanks.

But I'm not sure we are finished evolving this patch, because the scheme
of allowing all sub-cgroups to become the ongoing flusher (and only
avoiding lock-spinning when cgroup_is_descendant()) is causing issues in
production.

We are using cadvisor in production for collecting metrics; it walks
all the cgroups reading stat files, e.g. io.stat, cpu.stat and memory.stat
(doing 3 flushes of the same cgroup in a short timespan), but the
individual cgroups don't overlap enough to benefit from the
ongoing_flusher scheme.

Below is 1 sec of production data, where cadvisor and kswapd collide:

 > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1
 > @ongoing_flusher_cnt[kswapd5]: 3
 > @ongoing_flusher_cnt[kswapd7]: 4
 > @ongoing_flusher_cnt[kswapd6]: 4
 > @ongoing_flusher_cnt[kswapd8]: 4
 > @ongoing_flusher_cnt[kswapd9]: 5
 > @ongoing_flusher_cnt[kswapd2]: 5
 > @ongoing_flusher_cnt[kswapd10]: 5
 > @ongoing_flusher_cnt[kswapd3]: 5
 > @ongoing_flusher_cnt[kswapd11]: 5
 > @ongoing_flusher_cnt[kswapd0]: 5
 > @ongoing_flusher_cnt[kswapd4]: 6
 > @ongoing_flusher_cnt[kswapd1]: 7
 > @ongoing_flusher_cnt[cadvisor]: 9

For cadvisor, ongoing_flusher only helps avoid the flush 9 times.

 > @ongoing_flusher_cnt[handled_race]: 18
 > @ongoing_flusher_cnt[all]: 61

Our ongoing_flusher scheme detects overlap and avoids 61 out of 462 flushes.

 > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61
 > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462
 > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032

This is bad: the lock is contended 9032 times within this 1 sec period.
Below, the lock_contended cases are captured in more detail.

 > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435
 > @lock_contended[normal, 4]: 1
 > @lock_contended[normal, 1]: 2
 > @lock_contended[normal, 3]: 6
 > @lock_contended[normal, 2]: 15
 > @lock_contended[yield, 4]: 49
 > @lock_contended[after_obtaining_lock, 4]: 49
 > @lock_contended[normal, 0]: 59

The "normal" lock_contended for level 0,
meaning the cgroup root is contended 59 times within this 1 sec period.

 > @lock_contended[after_obtaining_lock, 1]: 117
 > @lock_contended[yield, 1]: 118
 > @lock_contended[yield, 3]: 258
 > @lock_contended[after_obtaining_lock, 3]: 258
 > @lock_contended[after_obtaining_lock, 2]: 946
 > @lock_contended[yield, 2]: 946

Lock contention for 'yielded' case for level 2.
Goes crazy with 946/sec.

 > @lock_contended[yield, 0]: 7579

Lock contention for 'yielded' case for level 0, the root.
Is really crazy with 7579/sec lock spin cases.

 > @lock_contended[after_obtaining_lock, 0]: 7579


IMHO this shows that, due to lock yielding, the scheme of
ongoing_flusher for sub-trees only causes issues.

IMHO we should go back to only doing ongoing_flusher for the root cgroup.
There is a really low chance of sub-tree flushes being concurrent enough
to benefit from this, and it causes issues and needs (ugly) race handling.
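
Roughly what I mean, as a sketch on top of this patch (illustrative only;
the helper name is made up, the cgrp_rstat_ongoing_flusher and flush_done
names come from the patch itself):

  /* Sketch: only record an ongoing flusher when flushing the root cgroup */
  static void cgroup_rstat_record_flusher(struct cgroup *cgrp)
  {
  	if (cgroup_parent(cgrp))	/* not the root: skip the optimization */
  		return;

  	if (!READ_ONCE(cgrp_rstat_ongoing_flusher)) {
  		reinit_completion(&cgrp->flush_done);
  		WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
  	}
  }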

Furthermore, sub-tree flushing doesn't take as long as level 0 flushing,
which further lowers the chance of concurrent flushes.

Let's get some quick data on flush times from production, to support my
claim:

The bpftrace one-liner:
  sudo bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_locked {
    if (args->cpu == -1) { @start[tid]=nsecs; }
    @cnt[probe]=count();
    if (args->contended) {
      @lock_contended["after_obtaining_lock", args->level]=count();
    }}
   tracepoint:cgroup:cgroup_rstat_unlock {
    if (args->cpu == -1) {
     $now=nsecs; $start=@start[tid]; $diff=$now-$start;
     @locked_time_level[args->level]=hist($diff);
    }
    @cnt[probe]=count()}
   kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()}
   interval:s:1 {time("%H:%M:%S ");
    print(@cnt);
    print(@lock_contended);
    print(@locked_time_level);
    clear (@cnt);
    clear (@lock_contended); }
   END {clear(@start)}'

Times below are in nanoseconds.

@locked_time_level[0]:
[4M, 8M)             623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8M, 16M)            860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)           295 |@@@@@@@@@@@@@@@@@                                    |
[32M, 64M)           275 |@@@@@@@@@@@@@@@@                                     |


@locked_time_level[1]:
[4K, 8K)               6 |@@@@                                                 |
[8K, 16K)             65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)            52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           |
[32K, 64K)            23 |@@@@@@@@@@@@@@@@@@                                   |
[64K, 128K)           15 |@@@@@@@@@@@@                                         |
[128K, 256K)          10 |@@@@@@@@                                             |
[256K, 512K)           6 |@@@@                                                 |
[512K, 1M)            15 |@@@@@@@@@@@@                                         |
[1M, 2M)               2 |@                                                    |
[2M, 4M)              14 |@@@@@@@@@@@                                          |
[4M, 8M)               6 |@@@@                                                 |
[8M, 16M)              7 |@@@@@                                                |
[16M, 32M)             1 |                                                     |


@locked_time_level[2]:
[2K, 4K)               1 |                                                     |
[4K, 8K)             160 |@@@@@@@@@                                            |
[8K, 16K)            733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[16K, 32K)           901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           191 |@@@@@@@@@@@                                          |
[64K, 128K)          115 |@@@@@@                                               |
[128K, 256K)          61 |@@@                                                  |
[256K, 512K)          70 |@@@@                                                 |
[512K, 1M)            59 |@@@                                                  |
[1M, 2M)              27 |@                                                    |
[2M, 4M)               9 |                                                     |


@locked_time_level[3]:
[1K, 2K)               3 |                                                     |
[2K, 4K)               2 |                                                     |
[4K, 8K)               5 |                                                     |
[8K, 16K)            147 |@@@@@@                                               |
[16K, 32K)          1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           266 |@@@@@@@@@@@                                          |
[64K, 128K)          199 |@@@@@@@@                                             |
[128K, 256K)         146 |@@@@@@                                               |
[256K, 512K)         124 |@@@@@                                                |
[512K, 1M)            17 |                                                     |
[1M, 2M)               0 |                                                     |
[2M, 4M)               0 |                                                     |
[4M, 8M)               1 |                                                     |


@locked_time_level[4]:
[4K, 8K)               2 |@@                                                   |
[8K, 16K)             17 |@@@@@@@@@@@@@@@@@@@@@@                               |
[16K, 32K)            40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)             4 |@@@@@                                                |

--Jesper

>> ---
>> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
>> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
>> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
>> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
>> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
>> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
>> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/
>>
>>   include/linux/cgroup-defs.h |    2 +
>>   kernel/cgroup/rstat.c       |   95 ++++++++++++++++++++++++++++++++++++++-----
>>   2 files changed, 85 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index b36690ca0d3f..a33b37514c29 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -548,6 +548,8 @@ struct cgroup {
>>   #ifdef CONFIG_BPF_SYSCALL
>>          struct bpf_local_storage __rcu  *bpf_cgrp_storage;
>>   #endif
>> +       /* completion queue for cgrp_rstat_ongoing_flusher */
>> +       struct completion flush_done;
>>
>>          /* All ancestors including self */
>>          struct cgroup *ancestors[];
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index fb8b49437573..fe2a81a310bb 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -2,6 +2,7 @@
>>   #include "cgroup-internal.h"
>>
>>   #include <linux/sched/cputime.h>
>> +#include <linux/completion.h>
>>
>>   #include <linux/bpf.h>
>>   #include <linux/btf.h>
>> @@ -11,6 +12,7 @@
>>
>>   static DEFINE_SPINLOCK(cgroup_rstat_lock);
>>   static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>> +static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
>>
>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>>
>> @@ -279,17 +281,32 @@ __bpf_hook_end();
>>    * value -1 is used when obtaining the main lock else this is the CPU
>>    * number processed last.
>>    */
>> -static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>> +static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop)
>> +{
>> +       bool locked;
>> +
>> +       locked = spin_trylock_irq(&cgroup_rstat_lock);
>> +       if (!locked)
>> +               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true);
>> +       else
>> +               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false);
>> +
>> +       return locked;
>> +}
>> +
>> +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
>> +                                      bool check_contention)
>>          __acquires(&cgroup_rstat_lock)
>>   {
>> -       bool contended;
>> +       bool locked = false;
>>
>> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
>> -       if (contended) {
>> -               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
>> +       if (check_contention)
>> +               locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
>> +
>> +       if (!locked) {
>>                  spin_lock_irq(&cgroup_rstat_lock);
>> +               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention);
>>          }
>> -       trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>>   }
> 
> I will let Tejun and others weigh in about what tracepoints we should
> have going forward and how they should be structured (which also
> includes patch 2). I see the tremendous value they have to collect
> data from prod, but I do not see similar existing tracepoints, and I
> am not sure if the branching here could have an effect when the
> tracepoints are off.
> 
>>
>>   static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>          spin_unlock_irq(&cgroup_rstat_lock);
>>   }
>>
>> +#define MAX_WAIT       msecs_to_jiffies(100)
>> +/* Trylock helper that also checks for on ongoing flusher */
>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
>> +{
>> +       struct cgroup *ongoing;
>> +       bool locked;
>> +
>> +       /* Check if ongoing flusher is already taking care of this, if
> 
> nit: I think commonly the comment would start on a new line after /*.
> 
>> +        * we are a descendant skip work, but wait for ongoing flusher
>> +        * to complete work.
>> +        */
>> +retry:
>> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
>> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
>> +               wait_for_completion_interruptible_timeout(
>> +                       &ongoing->flush_done, MAX_WAIT);
>> +               /* TODO: Add tracepoint here */
>> +               return false;
>> +       }
>> +
>> +       locked = __cgroup_rstat_trylock(cgrp, -1);
>> +       if (!locked) {
>> +               /* Contended: Handle loosing race for ongoing flusher */
> 
> nit: losing
> 
>> +               if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
>> +                       goto retry;
>> +
>> +               __cgroup_rstat_lock(cgrp, -1, false);
>> +       }
>> +       /* Obtained lock, record this cgrp as the ongoing flusher */
> 
> Do we want a comment here to explain why there could be an existing
> ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's
> not super obvious.
> 
>> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
>> +       if (!ongoing) {
>> +               reinit_completion(&cgrp->flush_done);
>> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
>> +       }
>> +       return true; /* locked */
> 
> Would it be better to explain the return value of the function in the
> comment above it?
> 
>> +}
>> +
>> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
>> +{
>> +       /* Detect if we are the ongoing flusher */
> 
> I think this is a bit obvious.
> 
>> +       if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
>> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
>> +               complete_all(&cgrp->flush_done);
>> +       }
>> +       __cgroup_rstat_unlock(cgrp, -1);
>> +}
>> +
>>   /* see cgroup_rstat_flush() */
>>   static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>>          __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>> @@ -328,7 +392,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>>                          __cgroup_rstat_unlock(cgrp, cpu);
>>                          if (!cond_resched())
>>                                  cpu_relax();
>> -                       __cgroup_rstat_lock(cgrp, cpu);
>> +                       __cgroup_rstat_lock(cgrp, cpu, true);
>>                  }
>>          }
>>   }
>> @@ -350,9 +414,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>>   {
>>          might_sleep();
>>
>> -       __cgroup_rstat_lock(cgrp, -1);
>> +       if (!cgroup_rstat_trylock_flusher(cgrp))
>> +               return;
>> +
>>          cgroup_rstat_flush_locked(cgrp);
>> -       __cgroup_rstat_unlock(cgrp, -1);
>> +       cgroup_rstat_unlock_flusher(cgrp);
>>   }
>>
>>   /**
>> @@ -368,8 +434,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>>          __acquires(&cgroup_rstat_lock)
>>   {
>>          might_sleep();
>> -       __cgroup_rstat_lock(cgrp, -1);
>> -       cgroup_rstat_flush_locked(cgrp);
>> +
>> +       if (cgroup_rstat_trylock_flusher(cgrp))
>> +               cgroup_rstat_flush_locked(cgrp);
>> +       else
>> +               __cgroup_rstat_lock(cgrp, -1, false);
>>   }
>>
>>   /**
>> @@ -379,7 +448,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>>   void cgroup_rstat_flush_release(struct cgroup *cgrp)
>>          __releases(&cgroup_rstat_lock)
>>   {
>> -       __cgroup_rstat_unlock(cgrp, -1);
>> +       cgroup_rstat_unlock_flusher(cgrp);
>>   }
>>
>>   int cgroup_rstat_init(struct cgroup *cgrp)
>> @@ -401,6 +470,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>>                  u64_stats_init(&rstatc->bsync);
>>          }
>>
>> +       init_completion(&cgrp->flush_done);
>> +
>>          return 0;
>>   }
>>
>>
>>
Yosry Ahmed July 17, 2024, 4:05 p.m. UTC | #6
On Tue, Jul 16, 2024 at 8:00 PM Waiman Long <longman@redhat.com> wrote:
>
> On 7/16/24 20:35, Yosry Ahmed wrote:
> > [..]
> >>
> >> This is a clean (meaning no cadvisor interference) example of kswapd
> >> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> >> hit the race (which is handled in V6 and V7).
> >>
> >> The BPF "cnt" maps are getting cleared every second, so this
> >> approximates per sec numbers.  This patch reduce pressure on the lock,
> >> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >> flushes approx 37 per sec (every 27 ms). On the positive side
> >> ongoing_flusher mitigation stopped 98 per sec of these.
> >>
> >> In this clean kswapd case the patch removes the lock contention issue
> >> for kswapd. The lock_contended cases 27 seems to be all related to
> >> handled_race cases 27.
> >>
> >> The remaning high flush rate should also be addressed, and we should
> >> also work on aproaches to limit this like my ealier proposal[1].
> > I honestly don't think a high number of flushes is a problem on its
> > own as long as we are not spending too much time flushing, especially
> > when we have magnitude-based thresholding so we know there is
> > something to flush (although it may not be relevant to what we are
> > doing).
> >
> > If we keep observing a lot of lock contention, one thing that I
> > thought about is to have a variant of spin_lock with a timeout. This
> > limits the flushing latency, instead of limiting the number of flushes
> > (which I believe is the wrong metric to optimize).
>
> Except for semaphore, none of our locking primitives allow for a timeout
> parameter. For sleeping locks, I don't think it is hard to add variants
> with timeout parameter, but not the spinning locks.

Thanks for pointing this out. I am assuming a mutex with a timeout
will also address the priority inversion problem that Shakeel was
talking about AFAICT.
Yosry Ahmed July 17, 2024, 4:31 p.m. UTC | #7
[..]
>
> I agree that description should be updated as patch have evolved, and
> you suggestions make sense, thanks.
>
> But I'm not sure we are finished evolving this patch, because the scheme
> of allowing all sub-cgroups to become the ongoing flusher (and only
> avoiding lock-spinning when cgroup_is_descendant), is causing issues in
> production.

Sure, I just wanted to point out the the commit log should be updated
going forward, not implying that we are done here :)

>
> We are using cadvisor in production for collecting metrics, which walks
> all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat
> (doing 3 flushes of same cgroup in short timespan), but individual
> cgroups don't overlap much to benefit from ongoing_flusher scheme.
>
> Below is (1 sec) production data, where cadvisor and kswapd collide:
>
>  > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1
>  > @ongoing_flusher_cnt[kswapd5]: 3
>  > @ongoing_flusher_cnt[kswapd7]: 4
>  > @ongoing_flusher_cnt[kswapd6]: 4
>  > @ongoing_flusher_cnt[kswapd8]: 4
>  > @ongoing_flusher_cnt[kswapd9]: 5
>  > @ongoing_flusher_cnt[kswapd2]: 5
>  > @ongoing_flusher_cnt[kswapd10]: 5
>  > @ongoing_flusher_cnt[kswapd3]: 5
>  > @ongoing_flusher_cnt[kswapd11]: 5
>  > @ongoing_flusher_cnt[kswapd0]: 5
>  > @ongoing_flusher_cnt[kswapd4]: 6
>  > @ongoing_flusher_cnt[kswapd1]: 7
>  > @ongoing_flusher_cnt[cadvisor]: 9
>
> For cadvisor ongoing_flusher only helps 9 times to avoid flush.
>
>  > @ongoing_flusher_cnt[handled_race]: 18
>  > @ongoing_flusher_cnt[all]: 61
>
> Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes.
>
>  > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61
>  > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462
>  > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032
>
> This is bad: Lock is contended 9032 time within this 1 sec period.
> Below, lock_contended cases is captured in more detail.
>
>  > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435
>  > @lock_contended[normal, 4]: 1
>  > @lock_contended[normal, 1]: 2
>  > @lock_contended[normal, 3]: 6
>  > @lock_contended[normal, 2]: 15
>  > @lock_contended[yield, 4]: 49
>  > @lock_contended[after_obtaining_lock, 4]: 49
>  > @lock_contended[normal, 0]: 59
>
> The "normal" lock_contended for level 0,
> meaning the cgroup root is contended 59 times within this 1 sec period.
>
>  > @lock_contended[after_obtaining_lock, 1]: 117
>  > @lock_contended[yield, 1]: 118
>  > @lock_contended[yield, 3]: 258
>  > @lock_contended[after_obtaining_lock, 3]: 258
>  > @lock_contended[after_obtaining_lock, 2]: 946
>  > @lock_contended[yield, 2]: 946
>
> Lock contention for 'yielded' case for level 2.
> Goes crazy with 946/sec.
>
>  > @lock_contended[yield, 0]: 7579
>
> Lock contention for 'yielded' case for level 0, the root.
> Is really crazy with 7579/sec lock spin cases.
>
>  > @lock_contended[after_obtaining_lock, 0]: 7579
>
>
> IMHO this shows that, due to lock yielding, the scheme of
> ongoing_flusher for sub-trees only cause issues.

Just to make sure I understand correctly, allowing non-root to become
ongoing_flushers is problematic because we only support a single
ongoing_flusher, so if we have a subsequent root flush, it won't be
the ongoing_flusher and we won't benefit from skipping other flushes.
Correct?

I honestly think the problem here is not that we support
ongoing_flusher for subtrees, which imo is nice to have at a low
additional cost, and does have its benefits (see below). I think the
problem is that lock yielding means that we can have multiple ongoing
flushers, while the current scheme only records one.

What I think we should be doing is either supporting multiple
ongoing_flushers (by replacing the global variable with a per-cgroup
flag, perhaps), or biting the bullet and start using a mutex to drop
lock yielding. If there aren't any concerns beyond priority inversion
(raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I
suggested could fix this issue (and also put a bound on the flushing
latency in general)?

(I suppose the latter is preferable)

>
> IMHO we should go back to only doing ongoing_flusher for root-cgroup.
> There is really low chance of sub-trees flushes being concurrent enough
> to benefit from this, and it cause issues and needs (ugly) race handling.
>
> Further more sub-tree flushing doesn't take as long time as level 0
> flushing, which further lower the chance of concurrent flushes.

I agree that the race handling is not pretty, and we can try to
improve the implementation or handle the race in other ways. However,
I think that skipping flushes for subtrees is valuable. I am not sure
about the cgroup arrangement in your use case, but we do have cgroups
with a lot of tasks in them (and/or child cgroups). If there is memory
pressure (or hit the cgroup limit), they all may go into direct
reclaim concurrently, so skipping flushes could really be beneficial.

Of course, if the difference in complexity is not justified, we can go
back to only supporting root cgroups for ongoing_flusher for now. But
as I mentioned in the v4 email thread, some of the complexity may
still remain anyway as we have multiple root cgroups in v1.

>
> Let's get some quick data on flush times from production, to support my
> claim:

Thanks for the data. I agree that in general root flushes will be a
lot more expensive than subtree flushes, but keep in mind that the
data may look really different depends on the cgroup setup. As I
mentioned, I think we should still support subtree flushes unless the
difference in complexity is not justified.

>
> The bpftrace onliner:
>   sudo bpftrace -e '
>    tracepoint:cgroup:cgroup_rstat_locked {
>     if (args->cpu == -1) { @start[tid]=nsecs; }
>     @cnt[probe]=count();
>     if (args->contended) {
>       @lock_contended["after_obtaining_lock", args->level]=count();
>     }}
>    tracepoint:cgroup:cgroup_rstat_unlock {
>     if (args->cpu == -1) {
>      $now=nsecs; $start=@start[tid]; $diff=$now-$start;
>      @locked_time_level[args->level]=hist($diff);
>     }
>     @cnt[probe]=count()}
>    kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()}
>    interval:s:1 {time("%H:%M:%S ");
>     print(@cnt);
>     print(@lock_contended);
>     print(@locked_time_level);
>     clear (@cnt);
>     clear (@lock_contended); }
>    END {clear(@start)}'
>
> Time below is in nanosec.
>
> @locked_time_level[0]:
> [4M, 8M)      623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [8M, 16M)     860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16M, 32M)    295 |@@@@@@@@@@@@@@@@@      |
> [32M, 64M)    275 |@@@@@@@@@@@@@@@@      |
>
>
> @locked_time_level[1]:
> [4K, 8K)        6 |@@@@      |
> [8K, 16K)      65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)     52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32K, 64K)     23 |@@@@@@@@@@@@@@@@@@      |
> [64K, 128K)    15 |@@@@@@@@@@@@      |
> [128K, 256K)   10 |@@@@@@@@      |
> [256K, 512K)    6 |@@@@      |
> [512K, 1M)     15 |@@@@@@@@@@@@      |
> [1M, 2M)        2 |@      |
> [2M, 4M)       14 |@@@@@@@@@@@      |
> [4M, 8M)        6 |@@@@      |
> [8M, 16M)       7 |@@@@@      |
> [16M, 32M)      1 |      |
>
>
> @locked_time_level[2]:
> [2K, 4K)        1 |      |
> [4K, 8K)      160 |@@@@@@@@@      |
> [8K, 16K)     733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [16K, 32K)    901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)    191 |@@@@@@@@@@@      |
> [64K, 128K)   115 |@@@@@@      |
> [128K, 256K)   61 |@@@      |
> [256K, 512K)   70 |@@@@      |
> [512K, 1M)     59 |@@@      |
> [1M, 2M)       27 |@      |
> [2M, 4M)        9 |      |
>
>
> @locked_time_level[3]:
> [1K, 2K)        3 |      |
> [2K, 4K)        2 |      |
> [4K, 8K)        5 |      |
> [8K, 16K)     147 |@@@@@@      |
> [16K, 32K)   1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)    266 |@@@@@@@@@@@      |
> [64K, 128K)   199 |@@@@@@@@      |
> [128K, 256K)  146 |@@@@@@      |
> [256K, 512K)  124 |@@@@@      |
> [512K, 1M)     17 |      |
> [1M, 2M)        0 |      |
> [2M, 4M)        0 |      |
> [4M, 8M)        1 |      |
>
>
> @locked_time_level[4]:
> [4K, 8K)        2 |@@      |
> [8K, 16K)      17 |@@@@@@@@@@@@@@@@@@@@@@      |
> [16K, 32K)     40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)      4 |@@@@@      |
>
> --Jesper
>
Jesper Dangaard Brouer July 17, 2024, 4:36 p.m. UTC | #8
On 17/07/2024 02.35, Yosry Ahmed wrote:
> [..]
>>
>>
>> This is a clean (meaning no cadvisor interference) example of kswapd
>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
>> hit the race (which is handled in V6 and V7).
>>
>> The BPF "cnt" maps are getting cleared every second, so this
>> approximates per sec numbers.  This patch reduce pressure on the lock,
>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>> flushes approx 37 per sec (every 27 ms). On the positive side
>> ongoing_flusher mitigation stopped 98 per sec of these.
>>
>> In this clean kswapd case the patch removes the lock contention issue
>> for kswapd. The lock_contended cases 27 seems to be all related to
>> handled_race cases 27.
>>
>> The remaning high flush rate should also be addressed, and we should
>> also work on aproaches to limit this like my ealier proposal[1].
> 
> I honestly don't think a high number of flushes is a problem on its
> own as long as we are not spending too much time flushing, especially
> when we have magnitude-based thresholding so we know there is
> something to flush (although it may not be relevant to what we are
> doing).
> 

We are "spending too much time flushing" see below.

> If we keep observing a lot of lock contention, one thing that I
> thought about is to have a variant of spin_lock with a timeout. This
> limits the flushing latency, instead of limiting the number of flushes
> (which I believe is the wrong metric to optimize).
> 
> It also seems to me that we are doing a flush each 27ms, and your
> proposed threshold was once per 50ms. It doesn't seem like a
> fundamental difference.
> 


Looking at the production numbers for the time the lock is held for level 0:

@locked_time_level[0]:
[4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
[32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |

The time is in nanosec, so M corresponds to ms (milliseconds).

With 36 flushes per second (as shown earlier) this is a flush every
27.7ms.  It is not unreasonable (from the above data) that a flush also
takes around 27ms, which means we spend a full CPU second flushing.
That is spending too much time flushing.

This roughly 1 sec of CPU usage for kswapd is also quite clear in the
attached grafana graph from when the server was rebooted into this V7 kernel.

I chose 50ms because at the time I saw flushes taking around 30ms, and I
view the flush time as the queue service time.  When the arrival rate is
faster than the service rate, a queue will form.  So, choosing 50ms as the
arrival interval gave me some headroom.  As I mentioned earlier, this
threshold should optimally be measured dynamically.
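
A minimal sketch of such a time-based gate (the 50ms value and the
'rstat_last_flush_end' variable are illustrative, not part of the posted
patch):

  /* Sketch: skip a new root flush if one finished less than 50ms ago.
   * The flusher updates rstat_last_flush_end while it still holds the
   * global rstat lock, so a plain READ_ONCE() is enough here.
   */
  static u64 rstat_last_flush_end;	/* in ns */

  static bool rstat_root_flush_due(void)
  {
  	u64 now = ktime_get_ns();

  	return now - READ_ONCE(rstat_last_flush_end) > 50 * NSEC_PER_MSEC;
  }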

--Jesper
Yosry Ahmed July 17, 2024, 4:49 p.m. UTC | #9
On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
>
> On 17/07/2024 02.35, Yosry Ahmed wrote:
> > [..]
> >>
> >>
> >> This is a clean (meaning no cadvisor interference) example of kswapd
> >> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> >> hit the race (which is handled in V6 and V7).
> >>
> >> The BPF "cnt" maps are getting cleared every second, so this
> >> approximates per sec numbers.  This patch reduce pressure on the lock,
> >> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >> flushes approx 37 per sec (every 27 ms). On the positive side
> >> ongoing_flusher mitigation stopped 98 per sec of these.
> >>
> >> In this clean kswapd case the patch removes the lock contention issue
> >> for kswapd. The lock_contended cases 27 seems to be all related to
> >> handled_race cases 27.
> >>
> >> The remaning high flush rate should also be addressed, and we should
> >> also work on aproaches to limit this like my ealier proposal[1].
> >
> > I honestly don't think a high number of flushes is a problem on its
> > own as long as we are not spending too much time flushing, especially
> > when we have magnitude-based thresholding so we know there is
> > something to flush (although it may not be relevant to what we are
> > doing).
> >
>
> We are "spending too much time flushing" see below.
>
> > If we keep observing a lot of lock contention, one thing that I
> > thought about is to have a variant of spin_lock with a timeout. This
> > limits the flushing latency, instead of limiting the number of flushes
> > (which I believe is the wrong metric to optimize).
> >
> > It also seems to me that we are doing a flush each 27ms, and your
> > proposed threshold was once per 50ms. It doesn't seem like a
> > fundamental difference.
> >
>
>
> Looking at the production numbers for the time the lock is held for level 0:
>
> @locked_time_level[0]:
> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>
> The time is in nanosec, so M corresponds to ms (milliseconds).
>
> With 36 flushes per second (as shown earlier) this is a flush every
> 27.7ms.  It is not unreasonable (from above data) that the flush time
> also spend 27ms, which means that we spend a full CPU second flushing.
> That is spending too much time flushing.
>
> This around 1 sec CPU usage for kswapd is also quite clear in the
> attached grafana graph for when server was rebooted into this V7 kernel.
>
> I choose 50ms because at the time I saw flush taking around 30ms, and I
> view the flush time as queue service-time.  When arrival-rate is faster
> than service-time, then a queue will form.  So, choosing 50ms as
> arrival-rate gave me some headroom.  As I mentioned earlier, optimally
> this threshold should be dynamically measured.

Thanks for the data. Yeah this doesn't look good.

Does it make sense to just throttle flushers at some point to increase
the chances of coalescing multiple flushers?

Otherwise I think it makes sense in this case to ratelimit flushing in
general. Although instead of just checking how much time elapsed since
the last flush, can we use something like __ratelimit()?

This will make sure that we skip flushes when we actually have a high
rate of flushing over a period of time, not because two flushes
happened to be requested in close succession and the flushing rate is
generally low.

>
> --Jesper
Jesper Dangaard Brouer July 17, 2024, 6:17 p.m. UTC | #10
On 17/07/2024 18.31, Yosry Ahmed wrote:
> [..]
>>
>> I agree that description should be updated as patch have evolved, and
>> you suggestions make sense, thanks.
>>
>> But I'm not sure we are finished evolving this patch, because the scheme
>> of allowing all sub-cgroups to become the ongoing flusher (and only
>> avoiding lock-spinning when cgroup_is_descendant), is causing issues in
>> production.
> 
> Sure, I just wanted to point out the the commit log should be updated
> going forward, not implying that we are done here :)
> 
>>
>> We are using cadvisor in production for collecting metrics, which walks
>> all the cgroups reading stat files e.g. io.stat, cpu.stat , memory.stat
>> (doing 3 flushes of same cgroup in short timespan), but individual
>> cgroups don't overlap much to benefit from ongoing_flusher scheme.
>>
>> Below is (1 sec) production data, where cadvisor and kswapd collide:
>>
>>   > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1
>>   > @ongoing_flusher_cnt[kswapd5]: 3
>>   > @ongoing_flusher_cnt[kswapd7]: 4
>>   > @ongoing_flusher_cnt[kswapd6]: 4
>>   > @ongoing_flusher_cnt[kswapd8]: 4
>>   > @ongoing_flusher_cnt[kswapd9]: 5
>>   > @ongoing_flusher_cnt[kswapd2]: 5
>>   > @ongoing_flusher_cnt[kswapd10]: 5
>>   > @ongoing_flusher_cnt[kswapd3]: 5
>>   > @ongoing_flusher_cnt[kswapd11]: 5
>>   > @ongoing_flusher_cnt[kswapd0]: 5
>>   > @ongoing_flusher_cnt[kswapd4]: 6
>>   > @ongoing_flusher_cnt[kswapd1]: 7
>>   > @ongoing_flusher_cnt[cadvisor]: 9
>>
>> For cadvisor ongoing_flusher only helps 9 times to avoid flush.
>>
>>   > @ongoing_flusher_cnt[handled_race]: 18
>>   > @ongoing_flusher_cnt[all]: 61
>>
>> Our ongoing_flusher scheme detects overlap and avoid 61 out of 462 flushes.
>>
>>   > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61
>>   > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462
>>   > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032
>>
>> This is bad: Lock is contended 9032 time within this 1 sec period.
>> Below, lock_contended cases is captured in more detail.
>>
>>   > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435
>>   > @lock_contended[normal, 4]: 1
>>   > @lock_contended[normal, 1]: 2
>>   > @lock_contended[normal, 3]: 6
>>   > @lock_contended[normal, 2]: 15
>>   > @lock_contended[yield, 4]: 49
>>   > @lock_contended[after_obtaining_lock, 4]: 49
>>   > @lock_contended[normal, 0]: 59
>>
>> The "normal" lock_contended for level 0,
>> meaning the cgroup root is contended 59 times within this 1 sec period.
>>
>>   > @lock_contended[after_obtaining_lock, 1]: 117
>>   > @lock_contended[yield, 1]: 118
>>   > @lock_contended[yield, 3]: 258
>>   > @lock_contended[after_obtaining_lock, 3]: 258
>>   > @lock_contended[after_obtaining_lock, 2]: 946
>>   > @lock_contended[yield, 2]: 946
>>
>> Lock contention for 'yielded' case for level 2.
>> Goes crazy with 946/sec.
>>
>>   > @lock_contended[yield, 0]: 7579
>>
>> Lock contention for 'yielded' case for level 0, the root.
>> Is really crazy with 7579/sec lock spin cases.
>>
>>   > @lock_contended[after_obtaining_lock, 0]: 7579
>>
>>
>> IMHO this shows that, due to lock yielding, the scheme of
>> ongoing_flusher for sub-trees only cause issues.
> 
> Just to make sure I understand correctly, allowing non-root to become
> ongoing_flushers is problematic because we only support a single
> ongoing_flusher, so if we have a subsequent root flush, it won't be
> the ongoing_flusher and we won't benefit from skipping other flushes.
> Correct?
> 

Yes, basically... keep in mind, what will happen is that a subsequent root
flush will busy-wait/spin on getting the lock, which, in case of a yield
from the ongoing flusher, will result in multiple flushers being active,
despite our efforts to have a single ongoing flusher.


> I honestly think the problem here is not that we support
> ongoing_flusher for subtrees, which imo is nice to have at a low
> additional cost, and does have its benefits (see below). I think the
> problem is that lock yielding means that we can have multiple ongoing
> flushers, while the current scheme only records one.
> 

Yes, the yielding is causing multiple flushers to be active at the same time.


> What I think we should be doing is either supporting multiple
> ongoing_flushers (by replacing the global variable with a per-cgroup
> flag, perhaps), or biting the bullet and start using a mutex to drop
> lock yielding. If there aren't any concerns beyond priority inversion
> (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I
> suggested could fix this issue (and also put a bound on the flushing
> latency in general)?
> 

The mutex_lock_timeout is an interesting idea, but not a primitive we
have available today, right?

p.s. I have 5 prod machines running with the mutex change, and I've not
gotten any SRE complaints.


> (I suppose the latter is preferrable)
> 
>>
>> IMHO we should go back to only doing ongoing_flusher for root-cgroup.
>> There is really low chance of sub-trees flushes being concurrent enough
>> to benefit from this, and it cause issues and needs (ugly) race handling.
>>
>> Further more sub-tree flushing doesn't take as long time as level 0
>> flushing, which further lower the chance of concurrent flushes.
> 
> I agree that the race handling is not pretty, and we can try to
> improve the implementation or handle the race in other ways. However,
> I think that skipping flushes for subtrees is valuable. I am not sure
> about the cgroup arrangement in your use case, but we do have cgroups
> with a lot of tasks in them (and/or child cgroups). If there is memory
> pressure (or hit the cgroup limit), they all may go into direct
> reclaim concurrently, so skipping flushes could really be beneficial.
> 
> Of course, if the difference in complexity is not justified, we can go
> back to only supporting root cgroups for ongoing_flusher for now. But
> as I mentioned in the v4 email thread, some of the complexity may
> still remain anyway as we have multiple root cgroups in v1.
> 

Having an incremental step with "only supporting root cgroups for
ongoing_flusher for now" is a good step forward IMHO.
As you could see in the grafana plot, this would be a significant production
improvement on its own, as it avoids wasting CPU resources spinning on
the lock.

Having multiple root cgroups, as we do in cgroup v1, does pose an
implementation problem.  Only having a single root would allow a
race-free cmpxchg scheme.
Would it be reasonable to only support v2?
If so, how can I match on this?
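
A minimal sketch of what such a single-root cmpxchg scheme could look like
(names are illustrative, not from the posted patch):

  /* Sketch: with only one root cgroup, a single pointer plus cmpxchg()
   * is enough to elect the ongoing flusher without races.
   */
  static struct cgroup *ongoing_root_flusher;	/* NULL when no flush runs */

  static bool root_flusher_try_claim(struct cgroup *root)
  {
  	return cmpxchg(&ongoing_root_flusher, NULL, root) == NULL;
  }

  static void root_flusher_release(struct cgroup *root)
  {
  	WARN_ON_ONCE(cmpxchg(&ongoing_root_flusher, root, NULL) != root);
  }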

>>
>> Let's get some quick data on flush times from production, to support my
>> claim:
> 
> Thanks for the data. I agree that in general root flushes will be a
> lot more expensive than subtree flushes, but keep in mind that the
> data may look really different depends on the cgroup setup. As I
> mentioned, I think we should still support subtree flushes unless the
> difference in complexity is not justified.
> 

It would be really valuable if you could provide production data on the
lock-hold times, just like I did with the bpftrace script below...
Is that possible, please?

>>
>> The bpftrace onliner:
>>    sudo bpftrace -e '
>>     tracepoint:cgroup:cgroup_rstat_locked {
>>      if (args->cpu == -1) { @start[tid]=nsecs; }
>>      @cnt[probe]=count();
>>      if (args->contended) {
>>        @lock_contended["after_obtaining_lock", args->level]=count();
>>      }}
>>     tracepoint:cgroup:cgroup_rstat_unlock {
>>      if (args->cpu == -1) {
>>       $now=nsecs; $start=@start[tid]; $diff=$now-$start;
>>       @locked_time_level[args->level]=hist($diff);
>>      }
>>      @cnt[probe]=count()}
>>     kfunc:cgroup_rstat_flush_locked {@cnt[probe]=count()}
>>     interval:s:1 {time("%H:%M:%S ");
>>      print(@cnt);
>>      print(@lock_contended);
>>      print(@locked_time_level);
>>      clear (@cnt);
>>      clear (@lock_contended); }
>>     END {clear(@start)}'
>>
>> Time below is in nanosec.
>>
>> @locked_time_level[0]:
>> [4M, 8M)      623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>> [8M, 16M)     860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [16M, 32M)    295 |@@@@@@@@@@@@@@@@@      |
>> [32M, 64M)    275 |@@@@@@@@@@@@@@@@      |
>>
>>
>> @locked_time_level[1]:
>> [4K, 8K)        6 |@@@@      |
>> [8K, 16K)      65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [16K, 32K)     52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>> [32K, 64K)     23 |@@@@@@@@@@@@@@@@@@      |
>> [64K, 128K)    15 |@@@@@@@@@@@@      |
>> [128K, 256K)   10 |@@@@@@@@      |
>> [256K, 512K)    6 |@@@@      |
>> [512K, 1M)     15 |@@@@@@@@@@@@      |
>> [1M, 2M)        2 |@      |
>> [2M, 4M)       14 |@@@@@@@@@@@      |
>> [4M, 8M)        6 |@@@@      |
>> [8M, 16M)       7 |@@@@@      |
>> [16M, 32M)      1 |      |
>>
>>
>> @locked_time_level[2]:
>> [2K, 4K)        1 |      |
>> [4K, 8K)      160 |@@@@@@@@@      |
>> [8K, 16K)     733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>> [16K, 32K)    901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [32K, 64K)    191 |@@@@@@@@@@@      |
>> [64K, 128K)   115 |@@@@@@      |
>> [128K, 256K)   61 |@@@      |
>> [256K, 512K)   70 |@@@@      |
>> [512K, 1M)     59 |@@@      |
>> [1M, 2M)       27 |@      |
>> [2M, 4M)        9 |      |
>>
>>
>> @locked_time_level[3]:
>> [1K, 2K)        3 |      |
>> [2K, 4K)        2 |      |
>> [4K, 8K)        5 |      |
>> [8K, 16K)     147 |@@@@@@      |
>> [16K, 32K)   1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [32K, 64K)    266 |@@@@@@@@@@@      |
>> [64K, 128K)   199 |@@@@@@@@      |
>> [128K, 256K)  146 |@@@@@@      |
>> [256K, 512K)  124 |@@@@@      |
>> [512K, 1M)     17 |      |
>> [1M, 2M)        0 |      |
>> [2M, 4M)        0 |      |
>> [4M, 8M)        1 |      |
>>
>>
>> @locked_time_level[4]:
>> [4K, 8K)        2 |@@      |
>> [8K, 16K)      17 |@@@@@@@@@@@@@@@@@@@@@@      |
>> [16K, 32K)     40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [32K, 64K)      4 |@@@@@      |
>>
>> --Jesper
>>
Yosry Ahmed July 17, 2024, 6:43 p.m. UTC | #11
[..]
> > What I think we should be doing is either supporting multiple
> > ongoing_flushers (by replacing the global variable with a per-cgroup
> > flag, perhaps), or biting the bullet and start using a mutex to drop
> > lock yielding. If there aren't any concerns beyond priority inversion
> > (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I
> > suggested could fix this issue (and also put a bound on the flushing
> > latency in general)?
> >
>
> The mutex_lock_timeout is an interesting idea, but not a primitive we
> have available today, right?

We don't, but Waiman said it shouldn't be difficult to add:
https://lore.kernel.org/lkml/623e62c5-3045-4dca-9f2c-ed15b8d3bad8@redhat.com/
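
For reference, a sketch of how such a primitive might be used; the
mutex_lock_timeout() signature below is purely assumed, since the function
does not exist upstream at the time of this discussion:

  /* Sketch only: assume mutex_lock_timeout() returns 0 on success and
   * -ETIMEDOUT if the lock could not be taken within the given jiffies.
   */
  static DEFINE_MUTEX(cgroup_rstat_mutex);	/* would replace the spinlock */

  static bool cgroup_rstat_lock_bounded(void)
  {
  	if (!mutex_lock_timeout(&cgroup_rstat_mutex, msecs_to_jiffies(100)))
  		return true;
  	/* Timed out: stats are approximate anyway, so the caller may
  	 * return without flushing rather than wait indefinitely.
  	 */
  	return false;
  }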

>
> p.s. I have 5 prod machines running with mutex change, and I've not
> gotten any SRE complaints.

That's nice!

>
>
> > (I suppose the latter is preferrable)
> >
> >>
> >> IMHO we should go back to only doing ongoing_flusher for root-cgroup.
> >> There is really low chance of sub-trees flushes being concurrent enough
> >> to benefit from this, and it cause issues and needs (ugly) race handling.
> >>
> >> Further more sub-tree flushing doesn't take as long time as level 0
> >> flushing, which further lower the chance of concurrent flushes.
> >
> > I agree that the race handling is not pretty, and we can try to
> > improve the implementation or handle the race in other ways. However,
> > I think that skipping flushes for subtrees is valuable. I am not sure
> > about the cgroup arrangement in your use case, but we do have cgroups
> > with a lot of tasks in them (and/or child cgroups). If there is memory
> > pressure (or hit the cgroup limit), they all may go into direct
> > reclaim concurrently, so skipping flushes could really be beneficial.
> >
> > Of course, if the difference in complexity is not justified, we can go
> > back to only supporting root cgroups for ongoing_flusher for now. But
> > as I mentioned in the v4 email thread, some of the complexity may
> > still remain anyway as we have multiple root cgroups in v1.
> >
>
> Having an incremental step with "only supporting root cgroups for
> ongoing_flusher for now" is a good step forward IMHO.
> As you could see in grafana plot, this would be a significant production
> improvement on its own, as it avoids wasting CPU resources spinning on
> the lock.

I am not opposed to this at all. All I am saying is, if we need to
handle most of the complexity anyway due to multiple root cgroups in v1,
then we might as well support subtrees too.

>
> Being able to have multiple root cgroups, due to in v1, does pose an
> implementation problem.  Only having a single root, would allow to have
> a race-free cmpxchg scheme.
> Would it be reasonable to only support v2?

The rstat code has so far been v1/v2 agnostic AFAICT, and Google is
still using cgroup v1, so I naturally prefer we keep supporting both
v1 and v2 going forward.

> If so, how can I match on this?

cgroup_on_dfl() is basically testing if a cgroup is on v2, but I
really want to keep v1 included if possible :/

>
> >>
> >> Let's get some quick data on flush times from production, to support my
> >> claim:
> >
> > Thanks for the data. I agree that in general root flushes will be a
> > lot more expensive than subtree flushes, but keep in mind that the
> > data may look really different depends on the cgroup setup. As I
> > mentioned, I think we should still support subtree flushes unless the
> > difference in complexity is not justified.
> >
>
> It would be really valuable if you could provide production data on the
> lock-hold times, just like I did with below bpftrace script...
> Is that possible, please?

Unfortunately, we don't have the infrastructure to do this :/

But again, I am not objecting to only supporting root cgroups as
ongoing flushers for now if there is a justifiable complexity
difference. So this shouldn't be a blocker.
Jesper Dangaard Brouer July 18, 2024, 8:12 a.m. UTC | #12
On 17/07/2024 18.49, Yosry Ahmed wrote:
> On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>>
>> On 17/07/2024 02.35, Yosry Ahmed wrote:
>>> [..]
>>>>
>>>>
>>>> This is a clean (meaning no cadvisor interference) example of kswapd
>>>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
>>>> hit the race (which is handled in V6 and V7).
>>>>
>>>> The BPF "cnt" maps are getting cleared every second, so this
>>>> approximates per sec numbers.  This patch reduce pressure on the lock,
>>>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
>>>> flushes approx 37 per sec (every 27 ms). On the positive side
>>>> ongoing_flusher mitigation stopped 98 per sec of these.
>>>>
>>>> In this clean kswapd case the patch removes the lock contention issue
>>>> for kswapd. The lock_contended cases 27 seems to be all related to
>>>> handled_race cases 27.
>>>>
>>>> The remaning high flush rate should also be addressed, and we should
>>>> also work on aproaches to limit this like my ealier proposal[1].
>>>
>>> I honestly don't think a high number of flushes is a problem on its
>>> own as long as we are not spending too much time flushing, especially
>>> when we have magnitude-based thresholding so we know there is
>>> something to flush (although it may not be relevant to what we are
>>> doing).
>>>
>>
>> We are "spending too much time flushing" see below.
>>
>>> If we keep observing a lot of lock contention, one thing that I
>>> thought about is to have a variant of spin_lock with a timeout. This
>>> limits the flushing latency, instead of limiting the number of flushes
>>> (which I believe is the wrong metric to optimize).
>>>
>>> It also seems to me that we are doing a flush each 27ms, and your
>>> proposed threshold was once per 50ms. It doesn't seem like a
>>> fundamental difference.
>>>
>>
>>
>> Looking at the production numbers for the time the lock is held for level 0:
>>
>> @locked_time_level[0]:
>> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
>> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
>> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>>
>> The time is in nanosec, so M corresponds to ms (milliseconds).
>>
>> With 36 flushes per second (as shown earlier) this is a flush every
>> 27.7ms.  It is not unreasonable (from above data) that the flush time
>> also spend 27ms, which means that we spend a full CPU second flushing.
>> That is spending too much time flushing.
>>
>> This around 1 sec CPU usage for kswapd is also quite clear in the
>> attached grafana graph for when server was rebooted into this V7 kernel.
>>
>> I choose 50ms because at the time I saw flush taking around 30ms, and I
>> view the flush time as queue service-time.  When arrival-rate is faster
>> than service-time, then a queue will form.  So, choosing 50ms as
>> arrival-rate gave me some headroom.  As I mentioned earlier, optimally
>> this threshold should be dynamically measured.
> 
> Thanks for the data. Yeah this doesn't look good.
> 
> Does it make sense to just throttle flushers at some point to increase
> the chances of coalescing multiple flushers?
> 
> Otherwise I think it makes sense in this case to ratelimit flushing in
> general. Although instead of just checking how much time elapsed since
> the last flush, can we use something like __ratelimit()?
> 
> This will make sure that we skip flushes when we actually have a high
> rate of flushing over a period of time, not because two flushes
> happened to be requested in close succession and the flushing rate is
> generally low.
> 

I really think "time elapsed since the last flush" is the right solution
here, as we *do* want to catch the case you describe: "two flushes
happened to be requested in close succession and the flushing rate is
generally low."

(After this patch fixes the lock contention triggered by kswapd.)
The remaining problem with kswapd is that those flushes that don't
"collide" on the lock will still run in close succession.  And we
likely have a generally low flushing rate, until kswapd starts up.

Some production data from a "slow" period where only kswapd is active:

05:59:32 @ongoing_flusher_cnt[kswapd11]: 1
@ongoing_flusher_cnt[kswapd7]: 1
@ongoing_flusher_cnt[kswapd3]: 1
@ongoing_flusher_cnt[kswapd5]: 1
@ongoing_flusher_cnt[kswapd10]: 1
@ongoing_flusher_cnt[kswapd6]: 2
@ongoing_flusher_cnt[kswapd8]: 2
@ongoing_flusher_cnt[kswapd1]: 2
@ongoing_flusher_cnt[kswapd9]: 2
@ongoing_flusher_cnt[kswapd0]: 2
@ongoing_flusher_cnt[kswapd2]: 2
@ongoing_flusher_cnt[kswapd4]: 2
@ongoing_flusher_cnt[handled_race]: 2
@ongoing_flusher_cnt[all]: 14
@cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2
@cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10
@cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43
@cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51

We observe that the ongoing_flusher scheme saves/avoids 14 of the flushes,
which is great, but we still have 43 flushes in this period.  I think only
kswapd is doing the flushing here, so I claim 41 flushes are likely
unnecessary (as there are indications of 2 kswapd startups in the period).
Also observe that some of the kswapdNN processes only have
ongoing_flusher_cnt 1, which indicates they didn't fully "collide" with
an ongoing flusher, while others have 2.

--Jesper
Yosry Ahmed July 18, 2024, 3:55 p.m. UTC | #13
On Thu, Jul 18, 2024 at 1:12 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 17/07/2024 18.49, Yosry Ahmed wrote:
> > On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>
> >>
> >> On 17/07/2024 02.35, Yosry Ahmed wrote:
> >>> [..]
> >>>>
> >>>>
> >>>> This is a clean (meaning no cadvisor interference) example of kswapd
> >>>> starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
> >>>> hit the race (which is handled in V6 and V7).
> >>>>
> >>>> The BPF "cnt" maps are getting cleared every second, so this
> >>>> approximates per sec numbers.  This patch reduce pressure on the lock,
> >>>> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >>>> flushes approx 37 per sec (every 27 ms). On the positive side
> >>>> ongoing_flusher mitigation stopped 98 per sec of these.
> >>>>
> >>>> In this clean kswapd case the patch removes the lock contention issue
> >>>> for kswapd. The lock_contended cases 27 seems to be all related to
> >>>> handled_race cases 27.
> >>>>
> >>>> The remaning high flush rate should also be addressed, and we should
> >>>> also work on aproaches to limit this like my ealier proposal[1].
> >>>
> >>> I honestly don't think a high number of flushes is a problem on its
> >>> own as long as we are not spending too much time flushing, especially
> >>> when we have magnitude-based thresholding so we know there is
> >>> something to flush (although it may not be relevant to what we are
> >>> doing).
> >>>
> >>
> >> We are "spending too much time flushing" see below.
> >>
> >>> If we keep observing a lot of lock contention, one thing that I
> >>> thought about is to have a variant of spin_lock with a timeout. This
> >>> limits the flushing latency, instead of limiting the number of flushes
> >>> (which I believe is the wrong metric to optimize).
> >>>
> >>> It also seems to me that we are doing a flush each 27ms, and your
> >>> proposed threshold was once per 50ms. It doesn't seem like a
> >>> fundamental difference.
> >>>
> >>
> >>
> >> Looking at the production numbers for the time the lock is held for level 0:
> >>
> >> @locked_time_level[0]:
> >> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> >> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> >> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> >>
> >> The time is in nanosec, so M corresponds to ms (milliseconds).
> >>
> >> With 36 flushes per second (as shown earlier) this is a flush every
> >> 27.7ms.  It is not unreasonable (from above data) that the flush time
> >> also spend 27ms, which means that we spend a full CPU second flushing.
> >> That is spending too much time flushing.
> >>
> >> This around 1 sec CPU usage for kswapd is also quite clear in the
> >> attached grafana graph for when server was rebooted into this V7 kernel.
> >>
> >> I choose 50ms because at the time I saw flush taking around 30ms, and I
> >> view the flush time as queue service-time.  When arrival-rate is faster
> >> than service-time, then a queue will form.  So, choosing 50ms as
> >> arrival-rate gave me some headroom.  As I mentioned earlier, optimally
> >> this threshold should be dynamically measured.
> >
> > Thanks for the data. Yeah this doesn't look good.
> >
> > Does it make sense to just throttle flushers at some point to increase
> > the chances of coalescing multiple flushers?
> >
> > Otherwise I think it makes sense in this case to ratelimit flushing in
> > general. Although instead of just checking how much time elapsed since
> > the last flush, can we use something like __ratelimit()?
> >
> > This will make sure that we skip flushes when we actually have a high
> > rate of flushing over a period of time, not because two flushes
> > happened to be requested in close succession and the flushing rate is
> > generally low.
> >
>
> I really think "time elapsed since the last flush" is the right solution
> here.  As, we *do* want to catch the case you describe "two flushes
> happened to be requested in close succession and the flushing rate is
> generally low."
>
> (After this patch fixing the lock contention triggered by kswapd).
> The remaining problem with kswapd is that those flushes that doesn't
> "collide" on the lock, will be flushing in close succession.  And we
> likely have a generally low flushing rate, until kswapd starts up.

I do not disagree; I think I may have misrepresented my intention.

Time since last flush is essentially ratelimiting to 2 flushes per
(let's say) 50 ms. So if we only have 2 flushes in an entire second,
but they happen to occur <50ms apart, the latter will be skipped
(arguably unnecessarily).

What I am trying to say is to make it more generalized/less-granular,
and apply ratelimiting when we are actually spending a large portion
of our time flushing. For example, cap it at 20 flushes per 1s. I
think __ratelimit() does exactly that, although it's currently mostly
used to ratelimit warnings, but I think we can reuse it here.

Ideally, of course, the number would be dynamic as you described
before, but we can start with a reasonable static value based on data.
What we do when the limit is hit could be changed later as well. We
can just skip flushes initially, and later we can be smarter about it
and throttle them in a way that makes them collide more effectively.
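
A sketch of what reusing __ratelimit() could look like (the 20-per-second
cap is an illustrative value, not a proposal):

  #include <linux/ratelimit.h>

  /* Sketch: allow at most 20 flushes per second; __ratelimit() returns
   * true while we are still under the limit.
   */
  static DEFINE_RATELIMIT_STATE(cgroup_rstat_flush_rs, HZ, 20);

  static bool cgroup_rstat_flush_allowed(void)
  {
  	return __ratelimit(&cgroup_rstat_flush_rs);
  }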

>
> Some production data from a "slow" period where only kswapd is active:
>
> 05:59:32 @ongoing_flusher_cnt[kswapd11]: 1
> @ongoing_flusher_cnt[kswapd7]: 1
> @ongoing_flusher_cnt[kswapd3]: 1
> @ongoing_flusher_cnt[kswapd5]: 1
> @ongoing_flusher_cnt[kswapd10]: 1
> @ongoing_flusher_cnt[kswapd6]: 2
> @ongoing_flusher_cnt[kswapd8]: 2
> @ongoing_flusher_cnt[kswapd1]: 2
> @ongoing_flusher_cnt[kswapd9]: 2
> @ongoing_flusher_cnt[kswapd0]: 2
> @ongoing_flusher_cnt[kswapd2]: 2
> @ongoing_flusher_cnt[kswapd4]: 2
> @ongoing_flusher_cnt[handled_race]: 2
> @ongoing_flusher_cnt[all]: 14
> @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2
> @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10
> @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43
> @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51
>
> We observe that ongoing_flusher scheme saves/avoids 14 of the flushes
> great, but we still have 43 flushes in this period.  I think only kswapd
> is doing the flushing here, so I claim 41 flushes are likely unnecessary
> (as there are indication of 2 kswapd startups in the period).  Also
> observe that some of the kswapdNN processes only have
> ongoing_flusher_cnt 1, which indicate they didn't fully "collide" with
> an ongoing flusher, while others have 2.
>
> --Jesper
Shakeel Butt July 19, 2024, 12:40 a.m. UTC | #14
Hi Jesper,

On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> 
[...]
> 
> 
> Looking at the production numbers for the time the lock is held for level 0:
> 
> @locked_time_level[0]:
> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> 

Is it possible to get the above histogram for other levels as well? I
know this is a 12 NUMA node machine; how many total CPUs are there?

> The time is in nanosec, so M corresponds to ms (milliseconds).
> 
> With 36 flushes per second (as shown earlier) this is a flush every
> 27.7ms.  It is not unreasonable (from above data) that the flush time
> also spend 27ms, which means that we spend a full CPU second flushing.
> That is spending too much time flushing.

One idea to further reduce this time is more fine-grained flush
skipping. At the moment we either skip the whole flush or not. How
about we make this decision per-CPU? We already have per-cpu updates
data, and if it is less than MEMCG_CHARGE_BATCH, skip the flush on that CPU.
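
A rough sketch of this idea (pcpu_updates() and the per-cpu flush helper
are hypothetical; the real per-cpu bookkeeping lives in the memcg code):

  /* Sketch of per-cpu flush skipping: only flush a cpu whose pending
   * updates exceed the batch threshold.  pcpu_updates() is a hypothetical
   * helper returning the magnitude of pending updates for (cgrp, cpu).
   */
  static void rstat_flush_percpu_filtered(struct cgroup *cgrp)
  {
  	int cpu;

  	for_each_possible_cpu(cpu) {
  		if (pcpu_updates(cgrp, cpu) < MEMCG_CHARGE_BATCH)
  			continue;	/* too little changed, skip this cpu */
  		/* hypothetical per-cpu flush of this cgroup's subtree */
  		cgroup_rstat_flush_cpu(cgrp, cpu);
  	}
  }
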
Yosry Ahmed July 19, 2024, 3:11 a.m. UTC | #15
On Thu, Jul 18, 2024 at 5:41 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi Jesper,
>
> On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> >
> [...]
> >
> >
> > Looking at the production numbers for the time the lock is held for level 0:
> >
> > @locked_time_level[0]:
> > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> >
>
> Is it possible to get the above histogram for other levels as well? I
> know this is 12 numa node machine, how many total CPUs are there?
>
> > The time is in nanosec, so M corresponds to ms (milliseconds).
> >
> > With 36 flushes per second (as shown earlier) this is a flush every
> > 27.7ms.  It is not unreasonable (from above data) that the flush time
> > also spend 27ms, which means that we spend a full CPU second flushing.
> > That is spending too much time flushing.
>
> One idea to further reduce this time is more fine grained flush
> skipping. At the moment we either skip the whole flush or not. How
> about we make this decision per-cpu? We already have per-cpu updates
> data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu.

Good idea.

I think we would need a per-subsystem callback to decide whether we
want to flush the cgroup or not. This needs to happen in the core
rstat flushing code (not the memcg flushing code), as we need to make
sure we do not remove the cgroup from the per-cpu updated tree if we
don't flush it.
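
A sketch of how such a callback could be wired into the core flush loop
(css_rstat_should_flush is hypothetical; css_rstat_flush is the existing
per-subsystem hook):

  /* Sketch: let each subsystem veto the flush of one (css, cpu) pair.
   * When the flush is skipped, the cgroup must be kept on the per-cpu
   * updated tree so the pending data is not lost.
   */
  static void rstat_flush_css_cpu(struct cgroup_subsys_state *css, int cpu)
  {
  	struct cgroup_subsys *ss = css->ss;

  	if (ss->css_rstat_should_flush &&
  	    !ss->css_rstat_should_flush(css, cpu))
  		return;		/* skipped: leave cgroup on the updated tree */

  	ss->css_rstat_flush(css, cpu);
  }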

More generally, I think we should be able to have a "force" flush API
that skips all optimizations and ensures that a flush occurs. I think
this will be needed in the cgroup_rstat_exit() path, where stats of a
cgroup being freed must be propagated to its parent, no matter how
insignificant they may be, to avoid inconsistencies.
Jesper Dangaard Brouer July 19, 2024, 7:54 a.m. UTC | #16
On 19/07/2024 02.40, Shakeel Butt wrote:
> Hi Jesper,
> 
> On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
>>
> [...]
>>
>>
>> Looking at the production numbers for the time the lock is held for level 0:
>>
>> @locked_time_level[0]:
>> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
>> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
>> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>>
> 
> Is it possible to get the above histogram for other levels as well?

Data from other levels is available in [1]:
  [1] https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/

IMHO the data shows we will get the most out of skipping level-0 root-cgroup
flushes.

Copy-pasted here (adjusted whitespace, hoping it fits email better):

Time below is in nanosec:
  - M -> ms
  - K -> usec

@locked_time_level[0]:
[4M, 8M)      623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[8M, 16M)     860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)    295 |@@@@@@@@@@@@@@@@@      |
[32M, 64M)    275 |@@@@@@@@@@@@@@@@      |

@locked_time_level[1]:
[4K, 8K)        6 |@@@@      |
[8K, 16K)      65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)     52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)     23 |@@@@@@@@@@@@@@@@@@      |
[64K, 128K)    15 |@@@@@@@@@@@@      |
[128K, 256K)   10 |@@@@@@@@      |
[256K, 512K)    6 |@@@@      |
[512K, 1M)     15 |@@@@@@@@@@@@      |
[1M, 2M)        2 |@      |
[2M, 4M)       14 |@@@@@@@@@@@      |
[4M, 8M)        6 |@@@@      |
[8M, 16M)       7 |@@@@@      |
[16M, 32M)      1 |      |

@locked_time_level[2]:
[2K, 4K)        1 |      |
[4K, 8K)      160 |@@@@@@@@@      |
[8K, 16K)     733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[16K, 32K)    901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)    191 |@@@@@@@@@@@      |
[64K, 128K)   115 |@@@@@@      |
[128K, 256K)   61 |@@@      |
[256K, 512K)   70 |@@@@      |
[512K, 1M)     59 |@@@      |
[1M, 2M)       27 |@      |
[2M, 4M)        9 |      |

@locked_time_level[3]:
[1K, 2K)        3 |      |
[2K, 4K)        2 |      |
[4K, 8K)        5 |      |
[8K, 16K)     147 |@@@@@@      |
[16K, 32K)   1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)    266 |@@@@@@@@@@@      |
[64K, 128K)   199 |@@@@@@@@      |
[128K, 256K)  146 |@@@@@@      |
[256K, 512K)  124 |@@@@@      |
[512K, 1M)     17 |      |
[1M, 2M)        0 |      |
[2M, 4M)        0 |      |
[4M, 8M)        1 |      |

@locked_time_level[4]:
[4K, 8K)        2 |@@      |
[8K, 16K)      17 |@@@@@@@@@@@@@@@@@@@@@@      |
[16K, 32K)     40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)      4 |@@@@@      |


> I know this is 12 numa node machine, how many total CPUs are there?

192 CPU cores (incl HyperThreaded).

--Jesper
Jesper Dangaard Brouer July 19, 2024, 3:07 p.m. UTC | #17
On 17/07/2024 02.30, Yosry Ahmed wrote:
> On Thu, Jul 11, 2024 at 6:28 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> Avoid lock contention on the global cgroup rstat lock caused by kswapd
>> starting on all NUMA nodes simultaneously. At Cloudflare, we observed
>> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
>> call inlined in shrink_node, which takes the rstat lock.
>>
>> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
>> we noted severe lock contention on the rstat lock. This contention
>> causes 12 CPUs to waste cycles spinning every time kswapd runs.
>> Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are
>> burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily
>> due to spinning on the rstat lock.
>>
>> Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds
>> causing all kswapdN threads to wake up simultaneously. The kswapd thread
>> invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush
>> operation as part of its work. This results in kernel self-induced rstat
>> lock contention by waking up all kswapd threads simultaneously. Leveraging
>> this detail: balance_pgdat() have NULL value in target_mem_cgroup, this
>> cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup.
>>
>> To avoid this kind of thundering herd problem, kernel previously had a
>> "stats_flush_ongoing" concept, but this was removed as part of commit
>> 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch
>> reintroduce and generalized the concept to apply to all users of cgroup
>> rstat, not just memcg.
>>
>> If there is an ongoing rstat flush, and current cgroup is a descendant,
>> then it is unnecessary to do the flush. For callers to still see updated
>> stats, wait for ongoing flusher to complete before returning, but add
>> timeout as stats are already inaccurate given updaters keeps running.
>>
>> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Thanks for working on this, Jesper! I love the data you collected here!
> 
> I think the commit subject and message should be changed to better
> describe the patch. This is a patch that exclusively modifies cgroup
> code, yet the subject is about kswapd. This change affects all users
> of rstat flushing.
> 
> I think a better subject would be:
> "cgroup/rstat: avoid flushing if there is an ongoing overlapping
> flush" or similar.
> 

Took this for V8.
 
https://lore.kernel.org/all/172139415725.3084888.13770938453137383953.stgit@firesoul

> The commit message should first describe the cgroup change, and then
> use kswapd as a brief example/illustration of how the problem
> manifests in practice. You should also include a brief summary of the
> numbers you collected from prod.
> 

Updated the description in V8.

>> ---
>> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
>> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
>> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
>> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
>> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
>> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
>> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/
>>
>>   include/linux/cgroup-defs.h |    2 +
>>   kernel/cgroup/rstat.c       |   95 ++++++++++++++++++++++++++++++++++++++-----
>>   2 files changed, 85 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
[...]
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index fb8b49437573..fe2a81a310bb 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
[...]
>>   static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>> @@ -299,6 +316,53 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>          spin_unlock_irq(&cgroup_rstat_lock);
>>   }
>>
>> +#define MAX_WAIT       msecs_to_jiffies(100)
>> +/* Trylock helper that also checks for on ongoing flusher */
>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
>> +{
>> +       struct cgroup *ongoing;
>> +       bool locked;
>> +
>> +       /* Check if ongoing flusher is already taking care of this, if
> 
> nit: I think commonly the comment would start on a new line after /*.
> 

We use this comment style in networking code.
I've updated it to follow this subsystem.


>> +        * we are a descendant skip work, but wait for ongoing flusher
>> +        * to complete work.
>> +        */
>> +retry:
>> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
>> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
>> +               wait_for_completion_interruptible_timeout(
>> +                       &ongoing->flush_done, MAX_WAIT);
>> +               /* TODO: Add tracepoint here */
>> +               return false;
>> +       }
>> +
>> +       locked = __cgroup_rstat_trylock(cgrp, -1);
>> +       if (!locked) {
>> +               /* Contended: Handle loosing race for ongoing flusher */
> 
> nit: losing
>

Thanks for catching this subtle wording issue.


>> +               if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
>> +                       goto retry;
>> +
>> +               __cgroup_rstat_lock(cgrp, -1, false);
>> +       }
>> +       /* Obtained lock, record this cgrp as the ongoing flusher */
> 
> Do we want a comment here to explain why there could be an existing
> ongoing flusher (i.e. due to multiple ongoing flushers)? I think it's
> not super obvious.

Extended this in V8.

> 
>> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
>> +       if (!ongoing) {
>> +               reinit_completion(&cgrp->flush_done);
>> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
>> +       }
>> +       return true; /* locked */
> 
> Would it be better to explain the return value of the function in the
> comment above it?
> 

Fixed this in V8.

>> +}
>> +
>> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
>> +{
>> +       /* Detect if we are the ongoing flusher */
> 
> I think this is a bit obvious.
> 

True, removed comment.

>> +       if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
>> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
>> +               complete_all(&cgrp->flush_done);
>> +       }
>> +       __cgroup_rstat_unlock(cgrp, -1);
>> +}
>> +
[...]


Thanks for going through and commenting on the code! :-)

--Jesper
Shakeel Butt July 19, 2024, 10:47 p.m. UTC | #18
On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 19/07/2024 02.40, Shakeel Butt wrote:
> > Hi Jesper,
> > 
> > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > 
> > [...]
> > > 
> > > 
> > > Looking at the production numbers for the time the lock is held for level 0:
> > > 
> > > @locked_time_level[0]:
> > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > 
> > 
> > Is it possible to get the above histogram for other levels as well?
> 
> Data from other levels available in [1]:
>  [1]
> https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> 
> IMHO the data shows we will get most out of skipping level-0 root-cgroup
> flushes.
> 

Thanks a lot for the data. Are all or most of these locked_time_level[0]
from kswapds? This just motivates me to strongly push the ratelimited
flush patch of mine (which would be orthogonal to your patch series).

Shakeel
Shakeel Butt July 19, 2024, 11:01 p.m. UTC | #19
On Thu, Jul 18, 2024 at 08:11:32PM GMT, Yosry Ahmed wrote:
> On Thu, Jul 18, 2024 at 5:41 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > Hi Jesper,
> >
> > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > >
> > [...]
> > >
> > >
> > > Looking at the production numbers for the time the lock is held for level 0:
> > >
> > > @locked_time_level[0]:
> > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > >
> >
> > Is it possible to get the above histogram for other levels as well? I
> > know this is 12 numa node machine, how many total CPUs are there?
> >
> > > The time is in nanosec, so M corresponds to ms (milliseconds).
> > >
> > > With 36 flushes per second (as shown earlier) this is a flush every
> > > 27.7ms.  It is not unreasonable (from above data) that the flush time
> > > also spend 27ms, which means that we spend a full CPU second flushing.
> > > That is spending too much time flushing.
> >
> > One idea to further reduce this time is more fine grained flush
> > skipping. At the moment we either skip the whole flush or not. How
> > about we make this decision per-cpu? We already have per-cpu updates
> > data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu.
> 
> Good idea.
> 
> I think we would need a per-subsystem callback to decide whether we
> want to flush the cgroup or not. This needs to happen in the core
> rstat flushing code (not the memcg flushing code), as we need to make
> sure we do not remove the cgroup from the per-cpu updated tree if we
> don't flush it.

Unless we have a per-subsystem update tree, I don't think a per-subsystem
callback would work, or we would be flushing all subsystems if any one of
them wants it. Anyway, we can discuss this when we have data showing that
it really helps.

> 
> More generally, I think we should be able to have a "force" flush API
> that skips all optimizations and ensures that a flush occurs. I think
> this will be needed in the cgroup_rstat_exit() path, where stats of a
> cgroup being freed must be propagated to its parent, no matter how
> insignificant they may be, to avoid inconsistencies.

Agree.
Yosry Ahmed July 20, 2024, 4:52 a.m. UTC | #20
On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> >
> >
> > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > Hi Jesper,
> > >
> > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > >
> > > [...]
> > > >
> > > >
> > > > Looking at the production numbers for the time the lock is held for level 0:
> > > >
> > > > @locked_time_level[0]:
> > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > >
> > >
> > > Is it possible to get the above histogram for other levels as well?
> >
> > Data from other levels available in [1]:
> >  [1]
> > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> >
> > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > flushes.
> >
>
> Thanks a lot of the data. Are all or most of these locked_time_level[0]
> from kswapds? This just motivates me to strongly push the ratelimited
> flush patch of mine (which would be orthogonal to your patch series).

Jesper and I were discussing a better ratelimiting approach, whether
it's measuring the time since the last flush, or only skipping if we
have a lot of flushes in a specific time frame (using __ratelimit()).
I believe this would be better than the current memcg ratelimiting
approach, and we can remove the latter.

WDYT?

>
> Shakeel
Jesper Dangaard Brouer July 20, 2024, 3:05 p.m. UTC | #21
On 20/07/2024 06.52, Yosry Ahmed wrote:
> On Fri, Jul 19, 2024 at 9:52 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 19/07/2024 02.40, Shakeel Butt wrote:
>>>>> Hi Jesper,
>>>>>
>>>>> On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
>>>>>>
>>>>> [...]
>>>>>>
>>>>>>
>>>>>> Looking at the production numbers for the time the lock is held for level 0:
>>>>>>
>>>>>> @locked_time_level[0]:
>>>>>> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
>>>>>> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>>>> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
>>>>>> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
>>>>>>
>>>>>
>>>>> Is it possible to get the above histogram for other levels as well?
>>>>
>>>> Data from other levels available in [1]:
>>>>   [1]
>>>> https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
>>>>
>>>> IMHO the data shows we will get most out of skipping level-0 root-cgroup
>>>> flushes.
>>>>
>>>
>>> Thanks a lot of the data. Are all or most of these locked_time_level[0]
>>> from kswapds? This just motivates me to strongly push the ratelimited
>>> flush patch of mine (which would be orthogonal to your patch series).
>>

There are also other processes flushing level 0.
Extended the bpftrace oneliner to also capture the process 'comm' name.
(I reduced 'kworker' to one entry below, e.g. the pattern 'kworker/u392:19'.)

grep 'level\[' out01.bpf_oneliner_locked_time | awk -F/ '{print $1}' | sort | uniq
@locked_time_level[0, cadvisor]:
@locked_time_level[0, consul]:
@locked_time_level[0, kswapd0]:
@locked_time_level[0, kswapd10]:
@locked_time_level[0, kswapd11]:
@locked_time_level[0, kswapd1]:
@locked_time_level[0, kswapd2]:
@locked_time_level[0, kswapd3]:
@locked_time_level[0, kswapd4]:
@locked_time_level[0, kswapd5]:
@locked_time_level[0, kswapd6]:
@locked_time_level[0, kswapd7]:
@locked_time_level[0, kswapd8]:
@locked_time_level[0, kswapd9]:
@locked_time_level[0, kworker
@locked_time_level[0, lassen]:
@locked_time_level[0, thunderclap-san]:
@locked_time_level[0, xdpd]:
@locked_time_level[1, cadvisor]:
@locked_time_level[2, cadvisor]:
@locked_time_level[2, kworker
@locked_time_level[2, memory-saturati]:
@locked_time_level[2, systemd]:
@locked_time_level[2, thread-saturati]:
@locked_time_level[3, cadvisor]:
@locked_time_level[3, cat]:
@locked_time_level[3, kworker
@locked_time_level[3, memory-saturati]:
@locked_time_level[3, systemd]:
@locked_time_level[3, thread-saturati]:
@locked_time_level[4, cadvisor]:


>> Jesper and I were discussing a better ratelimiting approach, whether
>> it's measuring the time since the last flush, or only skipping if we
>> have a lot of flushes in a specific time frame (using __ratelimit()).
>> I believe this would be better than the current memcg ratelimiting
>> approach, and we can remove the latter.
>>
>> WDYT?
> 
> Forgot to link this:
> https://lore.kernel.org/lkml/CAJD7tkZ5nxoa7aCpAix1bYOoYiLVfn+aNkq7jmRAZqsxruHYLw@mail.gmail.com/
> 

I agree that ratelimiting is orthogonal to this patch, and that we
really need to address this in a follow-up patchset.

The proposed mem_cgroup_flush_stats_ratelimited patch[1] helps, but is
limited to the memcg area.  I'm proposing a more generic solution in [2]
that helps all users of rstat.

It is time based because it makes sense to observe the time it takes to
flush root (the service time), and then limit how soon afterwards another
flusher is allowed to run (limiting the arrival rate). From basic queueing
theory we know that the arrival rate should stay below the service rate,
otherwise queueing builds up.

--Jesper

[1] "memcg: use ratelimited stats flush in the reclaim"
  - 
https://lore.kernel.org/all/20240615081257.3945587-1-shakeel.butt@linux.dev/

[2] "cgroup/rstat: introduce ratelimited rstat flushing"
  - 
https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
Shakeel Butt July 22, 2024, 8:02 p.m. UTC | #22
On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote:
> On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > > Hi Jesper,
> > > >
> > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > > >
> > > > [...]
> > > > >
> > > > >
> > > > > Looking at the production numbers for the time the lock is held for level 0:
> > > > >
> > > > > @locked_time_level[0]:
> > > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > > >
> > > >
> > > > Is it possible to get the above histogram for other levels as well?
> > >
> > > Data from other levels available in [1]:
> > >  [1]
> > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> > >
> > > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > > flushes.
> > >
> >
> > Thanks a lot of the data. Are all or most of these locked_time_level[0]
> > from kswapds? This just motivates me to strongly push the ratelimited
> > flush patch of mine (which would be orthogonal to your patch series).
> 
> Jesper and I were discussing a better ratelimiting approach, whether
> it's measuring the time since the last flush, or only skipping if we
> have a lot of flushes in a specific time frame (using __ratelimit()).
> I believe this would be better than the current memcg ratelimiting
> approach, and we can remove the latter.
> 
> WDYT?

The last statement gives me the impression that you are trying to fix
something that is not broken. The current ratelimiting users are ok, the
issue is with the sync flushers. Or maybe you are suggesting that the new
ratelimiting will be used for all sync flushers and current ratelimiting
users and the new ratelimiting will make a good tradeoff between the
accuracy and potential flush stall?
Yosry Ahmed July 22, 2024, 8:12 p.m. UTC | #23
On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote:
> > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> > > >
> > > >
> > > > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > > > Hi Jesper,
> > > > >
> > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > > > >
> > > > > [...]
> > > > > >
> > > > > >
> > > > > > Looking at the production numbers for the time the lock is held for level 0:
> > > > > >
> > > > > > @locked_time_level[0]:
> > > > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > > > >
> > > > >
> > > > > Is it possible to get the above histogram for other levels as well?
> > > >
> > > > Data from other levels available in [1]:
> > > >  [1]
> > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> > > >
> > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > > > flushes.
> > > >
> > >
> > > Thanks a lot of the data. Are all or most of these locked_time_level[0]
> > > from kswapds? This just motivates me to strongly push the ratelimited
> > > flush patch of mine (which would be orthogonal to your patch series).
> >
> > Jesper and I were discussing a better ratelimiting approach, whether
> > it's measuring the time since the last flush, or only skipping if we
> > have a lot of flushes in a specific time frame (using __ratelimit()).
> > I believe this would be better than the current memcg ratelimiting
> > approach, and we can remove the latter.
> >
> > WDYT?
>
> The last statement gives me the impression that you are trying to fix
> something that is not broken. The current ratelimiting users are ok, the
> issue is with the sync flushers. Or maybe you are suggesting that the new
> ratelimiting will be used for all sync flushers and current ratelimiting
> users and the new ratelimiting will make a good tradeoff between the
> accuracy and potential flush stall?

The latter. Basically the idea is to have more informed and generic
ratelimiting logic in the core rstat flushing code (e.g. using
__ratelimit()), which would apply to ~all flushers*. Then, we ideally
wouldn't need mem_cgroup_flush_stats_ratelimited() at all.

*The obvious exception is the force flushing case we discussed for
cgroup_rstat_exit().

In fact, I think we need that even with the ongoing flusher
optimization, because I think there is a slight chance that a flush is
missed. It wouldn't be problematic for other flushers, but it
certainly can be for cgroup_rstat_exit() as the stats will be
completely dropped.

The scenario I have in mind is:
- CPU 1 starts a flush of cgroup A. Flushing completes, but waiters are
not woken up yet.
- CPU 2 updates the stats of cgroup A after it is flushed by CPU 1.
- CPU 3 calls cgroup_rstat_exit(), sees the ongoing flusher and waits.
- CPU 1 wakes up the waiters.
- CPU 3 proceeds to destroy cgroup A, and the updates made by CPU 2 are lost.
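
For the cgroup_rstat_exit() case, one way to close this hole is a flush
variant that never takes the ongoing-flusher shortcut. A rough sketch,
reusing the helpers from this patch (the name cgroup_rstat_flush_force()
is hypothetical, not part of the posted series):

/* Hypothetical sketch: always take the lock and flush, ignoring
 * cgrp_rstat_ongoing_flusher, so a cgroup about to be destroyed cannot
 * lose its final updates.
 */
void cgroup_rstat_flush_force(struct cgroup *cgrp)
{
	might_sleep();

	__cgroup_rstat_lock(cgrp, -1, false);
	cgroup_rstat_flush_locked(cgrp);
	__cgroup_rstat_unlock(cgrp, -1);
}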
Shakeel Butt July 22, 2024, 9:32 p.m. UTC | #24
On Mon, Jul 22, 2024 at 01:12:35PM GMT, Yosry Ahmed wrote:
> On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote:
> > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> > > > >
> > > > >
> > > > > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > > > > Hi Jesper,
> > > > > >
> > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > > > > >
> > > > > > [...]
> > > > > > >
> > > > > > >
> > > > > > > Looking at the production numbers for the time the lock is held for level 0:
> > > > > > >
> > > > > > > @locked_time_level[0]:
> > > > > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > > > > >
> > > > > >
> > > > > > Is it possible to get the above histogram for other levels as well?
> > > > >
> > > > > Data from other levels available in [1]:
> > > > >  [1]
> > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> > > > >
> > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > > > > flushes.
> > > > >
> > > >
> > > > Thanks a lot of the data. Are all or most of these locked_time_level[0]
> > > > from kswapds? This just motivates me to strongly push the ratelimited
> > > > flush patch of mine (which would be orthogonal to your patch series).
> > >
> > > Jesper and I were discussing a better ratelimiting approach, whether
> > > it's measuring the time since the last flush, or only skipping if we
> > > have a lot of flushes in a specific time frame (using __ratelimit()).
> > > I believe this would be better than the current memcg ratelimiting
> > > approach, and we can remove the latter.
> > >
> > > WDYT?
> >
> > The last statement gives me the impression that you are trying to fix
> > something that is not broken. The current ratelimiting users are ok, the
> > issue is with the sync flushers. Or maybe you are suggesting that the new
> > ratelimiting will be used for all sync flushers and current ratelimiting
> > users and the new ratelimiting will make a good tradeoff between the
> > accuracy and potential flush stall?
> 
> The latter. Basically the idea is to have more informed and generic
> ratelimiting logic in the core rstat flushing code (e.g. using
> __ratelimit()), which would apply to ~all flushers*. Then, we ideally
> wouldn't need mem_cgroup_flush_stats_ratelimited() at all.
> 

I wonder if we really need a universal ratelimit. As you noted below
there are cases where we want exact stats and then we know there are
cases where accurate stats are not needed but they are very performance
sensitive. Aiming to have a solution which will ignore such differences
might be a futile effort.

> *The obvious exception is the force flushing case we discussed for
> cgroup_rstat_exit().
> 
> In fact, I think we need that even with the ongoing flusher
> optimization, because I think there is a slight chance that a flush is
> missed. It wouldn't be problematic for other flushers, but it
> certainly can be for cgroup_rstat_exit() as the stats will be
> completely dropped.
> 
> The scenario I have in mind is:
> - CPU 1 starts a flush of cgroup A. Flushing completes, but waiters are
> not woken up yet.
> - CPU 2 updates the stats of cgroup A after it is flushed by CPU 1.
> - CPU 3 calls cgroup_rstat_exit(), sees the ongoing flusher and waits.
> - CPU 1 wakes up the waiters.
> - CPU 3 proceeds to destroy cgroup A, and the updates made by CPU 2 are lost.
Shakeel Butt July 22, 2024, 10:58 p.m. UTC | #25
On Mon, Jul 22, 2024 at 02:32:03PM GMT, Shakeel Butt wrote:
> On Mon, Jul 22, 2024 at 01:12:35PM GMT, Yosry Ahmed wrote:
> > On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote:
> > > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> > > > > >
> > > > > >
> > > > > > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > > > > > Hi Jesper,
> > > > > > >
> > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > >
> > > > > > > >
> > > > > > > > Looking at the production numbers for the time the lock is held for level 0:
> > > > > > > >
> > > > > > > > @locked_time_level[0]:
> > > > > > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > > > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > > > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > > > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > > > > > >
> > > > > > >
> > > > > > > Is it possible to get the above histogram for other levels as well?
> > > > > >
> > > > > > Data from other levels available in [1]:
> > > > > >  [1]
> > > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> > > > > >
> > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > > > > > flushes.
> > > > > >
> > > > >
> > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0]
> > > > > from kswapds? This just motivates me to strongly push the ratelimited
> > > > > flush patch of mine (which would be orthogonal to your patch series).
> > > >
> > > > Jesper and I were discussing a better ratelimiting approach, whether
> > > > it's measuring the time since the last flush, or only skipping if we
> > > > have a lot of flushes in a specific time frame (using __ratelimit()).
> > > > I believe this would be better than the current memcg ratelimiting
> > > > approach, and we can remove the latter.
> > > >
> > > > WDYT?
> > >
> > > The last statement gives me the impression that you are trying to fix
> > > something that is not broken. The current ratelimiting users are ok, the
> > > issue is with the sync flushers. Or maybe you are suggesting that the new
> > > ratelimiting will be used for all sync flushers and current ratelimiting
> > > users and the new ratelimiting will make a good tradeoff between the
> > > accuracy and potential flush stall?
> > 
> > The latter. Basically the idea is to have more informed and generic
> > ratelimiting logic in the core rstat flushing code (e.g. using
> > __ratelimit()), which would apply to ~all flushers*. Then, we ideally
> > wouldn't need mem_cgroup_flush_stats_ratelimited() at all.
> > 
> 
> I wonder if we really need a universal ratelimit. As you noted below
> there are cases where we want exact stats and then we know there are
> cases where accurate stats are not needed but they are very performance
> sensitive. Aiming to have a solution which will ignore such differences
> might be a futile effort.
> 

BTW I am not against it. If we can achieve this with minimal regression
and maintenance burden, then it would be preferable.
Yosry Ahmed July 23, 2024, 6:24 a.m. UTC | #26
On Mon, Jul 22, 2024 at 3:59 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jul 22, 2024 at 02:32:03PM GMT, Shakeel Butt wrote:
> > On Mon, Jul 22, 2024 at 01:12:35PM GMT, Yosry Ahmed wrote:
> > > On Mon, Jul 22, 2024 at 1:02 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Fri, Jul 19, 2024 at 09:52:17PM GMT, Yosry Ahmed wrote:
> > > > > On Fri, Jul 19, 2024 at 3:48 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > > >
> > > > > > On Fri, Jul 19, 2024 at 09:54:41AM GMT, Jesper Dangaard Brouer wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 19/07/2024 02.40, Shakeel Butt wrote:
> > > > > > > > Hi Jesper,
> > > > > > > >
> > > > > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote:
> > > > > > > > >
> > > > > > > > [...]
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Looking at the production numbers for the time the lock is held for level 0:
> > > > > > > > >
> > > > > > > > > @locked_time_level[0]:
> > > > > > > > > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > > > > > > > > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > > > > > > > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > > > > > > > > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is it possible to get the above histogram for other levels as well?
> > > > > > >
> > > > > > > Data from other levels available in [1]:
> > > > > > >  [1]
> > > > > > > https://lore.kernel.org/all/8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org/
> > > > > > >
> > > > > > > IMHO the data shows we will get most out of skipping level-0 root-cgroup
> > > > > > > flushes.
> > > > > > >
> > > > > >
> > > > > > Thanks a lot of the data. Are all or most of these locked_time_level[0]
> > > > > > from kswapds? This just motivates me to strongly push the ratelimited
> > > > > > flush patch of mine (which would be orthogonal to your patch series).
> > > > >
> > > > > Jesper and I were discussing a better ratelimiting approach, whether
> > > > > it's measuring the time since the last flush, or only skipping if we
> > > > > have a lot of flushes in a specific time frame (using __ratelimit()).
> > > > > I believe this would be better than the current memcg ratelimiting
> > > > > approach, and we can remove the latter.
> > > > >
> > > > > WDYT?
> > > >
> > > > The last statement gives me the impression that you are trying to fix
> > > > something that is not broken. The current ratelimiting users are ok, the
> > > > issue is with the sync flushers. Or maybe you are suggesting that the new
> > > > ratelimiting will be used for all sync flushers and current ratelimiting
> > > > users and the new ratelimiting will make a good tradeoff between the
> > > > accuracy and potential flush stall?
> > >
> > > The latter. Basically the idea is to have more informed and generic
> > > ratelimiting logic in the core rstat flushing code (e.g. using
> > > __ratelimit()), which would apply to ~all flushers*. Then, we ideally
> > > wouldn't need mem_cgroup_flush_stats_ratelimited() at all.
> > >
> >
> > I wonder if we really need a universal ratelimit. As you noted below
> > there are cases where we want exact stats and then we know there are
> > cases where accurate stats are not needed but they are very performance
> > sensitive. Aiming to have a solution which will ignore such differences
> > might be a futile effort.
> >
>
> BTW I am not against it. If we can achieve this with minimal regression
> and maintenance burden, then it would be preferable.

It is possible that it is a futile effort, but if it works, the memcg
flushing interface will be much better and we don't have to evaluate
whether ratelimiting is needed on a case-by-case basis.

According to Jesper's data, allowing a flush every 50ms at most may be
reasonable, which means we can ratelimit flushes to roughly 20 flushes
per second. I think on average, this should provide enough
accuracy for most use cases, and it should also reduce flushes in the
cases that Jesper presented.
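
As a sketch of how that could look in the core rstat code (the wrapper
name and the interval/burst values are assumptions derived from the
~50ms figure above, not from an actual patch):

#include <linux/ratelimit.h>

/* Sketch: globally allow roughly 20 flushes per second. */
static DEFINE_RATELIMIT_STATE(rstat_flush_rs, HZ / 20, 1);

static bool cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
{
	/* Over the rate: skip this flush and let the caller use slightly
	 * stale stats. A force-flush path (e.g. for cgroup_rstat_exit())
	 * would bypass this check.
	 */
	if (!__ratelimit(&rstat_flush_rs))
		return false;

	cgroup_rstat_flush(cgrp);
	return true;
}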

It's probably worth a try, especially since it does not involve
changing user-visible ABIs, so we can always go back to what we have
today.

Patch

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b36690ca0d3f..a33b37514c29 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -548,6 +548,8 @@  struct cgroup {
 #ifdef CONFIG_BPF_SYSCALL
 	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
 #endif
+	/* completion queue for cgrp_rstat_ongoing_flusher */
+	struct completion flush_done;
 
 	/* All ancestors including self */
 	struct cgroup *ancestors[];
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index fb8b49437573..fe2a81a310bb 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -2,6 +2,7 @@ 
 #include "cgroup-internal.h"
 
 #include <linux/sched/cputime.h>
+#include <linux/completion.h>
 
 #include <linux/bpf.h>
 #include <linux/btf.h>
@@ -11,6 +12,7 @@ 
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static struct cgroup *cgrp_rstat_ongoing_flusher = NULL;
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -279,17 +281,32 @@  __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+static inline bool __cgroup_rstat_trylock(struct cgroup *cgrp, int cpu_in_loop)
+{
+	bool locked;
+
+	locked = spin_trylock_irq(&cgroup_rstat_lock);
+	if (!locked)
+		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, true);
+	else
+		trace_cgroup_rstat_locked(cgrp, cpu_in_loop, false);
+
+	return locked;
+}
+
+static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
+				       bool check_contention)
 	__acquires(&cgroup_rstat_lock)
 {
-	bool contended;
+	bool locked = false;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
-	if (contended) {
-		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
+	if (check_contention)
+		locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
+
+	if (!locked) {
 		spin_lock_irq(&cgroup_rstat_lock);
+		trace_cgroup_rstat_locked(cgrp, cpu_in_loop, check_contention);
 	}
-	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
 static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
@@ -299,6 +316,53 @@  static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+#define MAX_WAIT	msecs_to_jiffies(100)
+/* Trylock helper that also checks for an ongoing flusher */
+static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
+{
+	struct cgroup *ongoing;
+	bool locked;
+
+	/* Check if ongoing flusher is already taking care of this, if
+	 * we are a descendant skip work, but wait for ongoing flusher
+	 * to complete work.
+	 */
+retry:
+	ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
+	if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
+		wait_for_completion_interruptible_timeout(
+			&ongoing->flush_done, MAX_WAIT);
+		/* TODO: Add tracepoint here */
+		return false;
+	}
+
+	locked = __cgroup_rstat_trylock(cgrp, -1);
+	if (!locked) {
+		/* Contended: Handle losing race for ongoing flusher */
+		if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
+			goto retry;
+
+		__cgroup_rstat_lock(cgrp, -1, false);
+	}
+	/* Obtained lock, record this cgrp as the ongoing flusher */
+	ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
+	if (!ongoing) {
+		reinit_completion(&cgrp->flush_done);
+		WRITE_ONCE(cgrp_rstat_ongoing_flusher, cgrp);
+	}
+	return true; /* locked */
+}
+
+static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
+{
+	/* Detect if we are the ongoing flusher */
+	if (cgrp == READ_ONCE(cgrp_rstat_ongoing_flusher)) {
+		WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
+		complete_all(&cgrp->flush_done);
+	}
+	__cgroup_rstat_unlock(cgrp, -1);
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -328,7 +392,7 @@  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
+			__cgroup_rstat_lock(cgrp, cpu, true);
 		}
 	}
 }
@@ -350,9 +414,11 @@  __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	might_sleep();
 
-	__cgroup_rstat_lock(cgrp, -1);
+	if (!cgroup_rstat_trylock_flusher(cgrp))
+		return;
+
 	cgroup_rstat_flush_locked(cgrp);
-	__cgroup_rstat_unlock(cgrp, -1);
+	cgroup_rstat_unlock_flusher(cgrp);
 }
 
 /**
@@ -368,8 +434,11 @@  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+
+	if (cgroup_rstat_trylock_flusher(cgrp))
+		cgroup_rstat_flush_locked(cgrp);
+	else
+		__cgroup_rstat_lock(cgrp, -1, false);
 }
 
 /**
@@ -379,7 +448,7 @@  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 void cgroup_rstat_flush_release(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock)
 {
-	__cgroup_rstat_unlock(cgrp, -1);
+	cgroup_rstat_unlock_flusher(cgrp);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)
@@ -401,6 +470,8 @@  int cgroup_rstat_init(struct cgroup *cgrp)
 		u64_stats_init(&rstatc->bsync);
 	}
 
+	init_completion(&cgrp->flush_done);
+
 	return 0;
 }