From patchwork Tue Apr 16 17:51:26 2024
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 13632265
Subject: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
From: Jesper Dangaard Brouer
To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
 cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
 kernel-team@cloudflare.com, Arnaldo Carvalho de Melo,
 Sebastian Andrzej Siewior, mhocko@kernel.org
Date: Tue, 16 Apr 2024 19:51:26 +0200
Message-ID: <171328988660.3930751.17537768209042139758.stgit@firesoul>
In-Reply-To: <171328983017.3930751.9484082608778623495.stgit@firesoul>
References: <171328983017.3930751.9484082608778623495.stgit@firesoul>

This commit enhances the ability to troubleshoot the global cgroup_rstat_lock
by introducing wrapper helper functions for the lock along with associated
tracepoints. Although global, the cgroup_rstat_lock helper APIs and
tracepoints take arguments such as cgroup pointer and cpu_in_loop variable.
This adjustment is made because flushing occurs per cgroup despite the lock
being global.
Hence, when troubleshooting, it's important to identify the relevant cgroup.
The cpu_in_loop variable is necessary because the global lock may be released
within the main flushing loop that traverses CPUs. In the tracepoints, the
cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it
denotes the CPU number processed last.

The new feature in this patchset is detecting when the lock is contended. The
tracepoints are implemented with production use in mind. For minimum overhead,
attach to cgroup:cgroup_rstat_lock_contended, which is only activated when
trylock detects that the lock is contended.

A quick production check for issues can be done with this perf command:

 perf record -g -e cgroup:cgroup_rstat_lock_contended

The next natural question is how long lock contenders wait to obtain the
lock. This can be answered by measuring the time between
cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when
args->contended is set, with a bpftrace script like this:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
   tracepoint:cgroup:cgroup_rstat_locked {
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }'

Extending this to the time spent holding the lock is more expensive, as it
also covers all the non-contended cases, with a bpftrace script like this:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
   tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs;
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
   tracepoint:cgroup:cgroup_rstat_unlock {
     @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@locked_ns); }'

Signed-off-by: Jesper Dangaard Brouer
---
 include/linux/cgroup.h        |    2 +-
 include/trace/events/cgroup.h |   48 ++++++++++++++++++++++++++++++++++++
 kernel/cgroup/rstat.c         |   47 ++++++++++++++++++++++++++-------
 3 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..2150ca60394b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -690,7 +690,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(void);
+void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
 /*
  * Basic resource stats.
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index dd7d7c9efecd..13f375800135 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -204,6 +204,54 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
 	TP_ARGS(cgrp, path, val)
 );
 
+DECLARE_EVENT_CLASS(cgroup_rstat,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
+
+	TP_ARGS(cgrp, cpu_in_loop, contended),
+
+	TP_STRUCT__entry(
+		__field( int,  root        )
+		__field( int,  level       )
+		__field( u64,  id          )
+		__field( int,  cpu_in_loop )
+		__field( bool, contended   )
+	),
+
+	TP_fast_assign(
+		__entry->root = cgrp->root->hierarchy_id;
+		__entry->id = cgroup_id(cgrp);
+		__entry->level = cgrp->level;
+		__entry->cpu_in_loop = cpu_in_loop;
+		__entry->contended = contended;
+	),
+
+	TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
+		  __entry->root, __entry->id, __entry->level,
+		  __entry->cpu_in_loop, __entry->contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
 #endif /* _TRACE_CGROUP_H */
 
 /* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 07e2284bb499..ff68c904e647 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,8 @@
 #include
 #include
 
+#include
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -222,6 +224,35 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 
 __bpf_hook_end();
 
+/*
+ * Helper functions for locking cgroup_rstat_lock.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The parameter @cpu_in_loop indicates that the
+ * lock was released and re-taken when collecting data from the CPUs. The
+ * value -1 is used when obtaining the main lock; otherwise, it is the CPU
+ * number processed last.
+ */
+static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+	__acquires(&cgroup_rstat_lock)
+{
+	bool contended;
+
+	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	if (contended) {
+		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
+		spin_lock_irq(&cgroup_rstat_lock);
+	}
+	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
+}
+
+static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+	__releases(&cgroup_rstat_lock)
+{
+	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
+	spin_unlock_irq(&cgroup_rstat_lock);
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -248,10 +279,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			spin_unlock_irq(&cgroup_rstat_lock);
+			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			spin_lock_irq(&cgroup_rstat_lock);
+			__cgroup_rstat_lock(cgrp, cpu);
 		}
 	}
 }
@@ -273,9 +304,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	might_sleep();
 
-	spin_lock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_lock(cgrp, -1);
 	cgroup_rstat_flush_locked(cgrp);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 /**
@@ -291,17 +322,17 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
-	spin_lock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_lock(cgrp, -1);
 	cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
  */
-void cgroup_rstat_flush_release(void)
+void cgroup_rstat_flush_release(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock)
 {
-	spin_unlock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)
@@ -533,7 +564,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 #ifdef CONFIG_SCHED_CORE
 		forceidle_time = cgrp->bstat.forceidle_sum;
 #endif
-		cgroup_rstat_flush_release();
+		cgroup_rstat_flush_release(cgrp);
 	} else {
 		root_cgroup_cputime(&bstat);
 		usage = bstat.cputime.sum_exec_runtime;

From patchwork Tue Apr 16 17:51:33 2024
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 13632266
Subject: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
From: Jesper Dangaard Brouer
To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
 cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
 kernel-team@cloudflare.com, Arnaldo Carvalho de Melo,
 Sebastian Andrzej Siewior, mhocko@kernel.org
Date: Tue, 16 Apr 2024 19:51:33 +0200
Message-ID: <171328989335.3930751.3091577850420501533.stgit@firesoul>
In-Reply-To: <171328983017.3930751.9484082608778623495.stgit@firesoul>
References: <171328983017.3930751.9484082608778623495.stgit@firesoul>

Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock, as
introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a
spinlock"). Despite efforts in cgroup_rstat_flush_locked() to yield the lock
when necessary during the collection of per-CPU stats, this approach has led
to several scaling issues observed in production environments. Holding this
IRQ lock has caused starvation of other critical kernel functions, such as
softirq processing (e.g., timers and the network stack). Although kernel v6.8
introduced optimizations in this area, we continue to observe instances where
the spin_lock is held for 64-128 ms in production.

This patch converts cgroup_rstat_lock back to being a mutex lock. This change
is made possible thanks to the significant effort by Yosry Ahmed to eliminate
all atomic context use-cases through multiple commits, ending in 0a2dc6ac3329
("cgroup: remove cgroup_rstat_flush_atomic()"), included in kernel v6.5.

After this patch, lock contention will be less obvious, as converting this to
a mutex avoids multiple CPUs spinning while waiting for the lock, but it
doesn't remove the lock contention. It is recommended to use the tracepoints
to diagnose this.
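For instance (an illustrative sketch, not part of this patch, assuming the
tracepoints from patch 1 are available), contention events can be counted per
cgroup id with bpftrace, to see how often and for which cgroups the mutex is
found contended:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_lock_contended { @contended[args->id] = count(); }
   interval:s:10 { print(@contended); clear(@contended); }'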
Signed-off-by: Jesper Dangaard Brouer
---
 kernel/cgroup/rstat.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ff68c904e647..a90d68a7c27f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,7 @@
 
 #include
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
+static DEFINE_MUTEX(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
 {
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	contended = !mutex_trylock(&cgroup_rstat_lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		mutex_lock(&cgroup_rstat_lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
@@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	__releases(&cgroup_rstat_lock)
 {
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	mutex_unlock(&cgroup_rstat_lock);
 }
 
 /* see cgroup_rstat_flush() */
@@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 		}
 
 		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (need_resched()) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();

From patchwork Tue Apr 16 17:51:40 2024
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 13632267
Subject: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
From: Jesper Dangaard Brouer
To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
 cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
 kernel-team@cloudflare.com, Arnaldo Carvalho de Melo,
 Sebastian Andrzej Siewior, mhocko@kernel.org
Date: Tue, 16 Apr 2024 19:51:40 +0200
Message-ID: <171328990014.3930751.10674097155895405137.stgit@firesoul>
In-Reply-To: <171328983017.3930751.9484082608778623495.stgit@firesoul>
References: <171328983017.3930751.9484082608778623495.stgit@firesoul>

This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading stat
files causes cgroup rstat flushing.

In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
mem_cgroup_flush_stats_ratelimited() already limits pressure on the global
lock (cgroup_rstat_lock). As a result, reading memory-related stat files
(such as memory.stat, memory.numa_stat, zswap.current) is already a less
userspace-triggerable issue. However, other userspace users of
cgroup_rstat_flush(), such as when reading io.stat (blk-cgroup.c) and
cpu.stat, lack a similar system to limit pressure on the global lock.
Furthermore, userspace can easily trigger this issue by reading those stat
files.

Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
same cgroup) without realizing that on the kernel side, they share the same
global lock. The rate limit introduced here also helps prevent malicious
userspace applications from harming the kernel by reading these stat files
in a tight loop.

To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
similar to memcg's mem_cgroup_flush_stats_ratelimited(). Because flushing
occurs per cgroup (even though the lock remains global), a variable named
rstat_flush_last_time is introduced to track when a given cgroup was last
flushed. This variable, which contains the jiffies of the flush, shares
properties and a cache line with rstat_flush_next and is updated
simultaneously.

For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
because other data is read under the lock, but we skip the expensive
flushing if it occurred recently. Regarding io.stat, there is an opportunity
outside the lock to skip the flush, but inside the lock, we must recheck to
handle races.
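As a rough way to see the effect (a sketch, not part of this patch; it
assumes the tracepoints from patch 1 are available, and the cgroup path
below is only an example), one can read io.stat in a tight loop while
counting how often the rstat lock is actually taken for that cgroup id. With
FLUSH_TIME at 50 ms, roughly 20 lock acquisitions per second are expected
from this path instead of one per read:

 # count rstat lock acquisitions per cgroup id, printed once per second
 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_locked { @locked[args->id] = count(); }
   interval:s:1 { print(@locked); clear(@locked); }' &

 # hammer io.stat from userspace (example path, adjust to an existing cgroup)
 while true; do cat /sys/fs/cgroup/system.slice/io.stat > /dev/null; done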
Signed-off-by: Jesper Dangaard Brouer
---
 block/blk-cgroup.c          |    2 +
 include/linux/cgroup-defs.h |    1 +
 include/linux/cgroup.h      |    3 +-
 kernel/cgroup/rstat.c       |   60 ++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol.c             |    1 +
 5 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bdbb557feb5a..4156fedbb910 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1162,7 +1162,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		cgroup_rstat_flush_ratelimited(blkcg->css.cgroup);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ea48c861cd36..366dc644e075 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -509,6 +509,7 @@ struct cgroup {
 	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
 	 */
 	struct cgroup *rstat_flush_next;
+	unsigned long rstat_flush_last_time;
 
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2150ca60394b..8622b222453e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,7 +689,8 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a90d68a7c27f..8c71af67b197 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -193,6 +193,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	/* Push @root to the list first before pushing the children */
 	head = root;
 	root->rstat_flush_next = NULL;
+	root->rstat_flush_last_time = jiffies;
 	child = rstatc->updated_children;
 	rstatc->updated_children = root;
 	if (child != root)
@@ -261,12 +262,15 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
+	cgrp->rstat_flush_last_time = jiffies;
+
 	for_each_possible_cpu(cpu) {
 		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
+			pos->rstat_flush_last_time = jiffies;
 			cgroup_base_stat_flush(pos, cpu);
 			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
@@ -309,6 +313,49 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
+#define FLUSH_TIME msecs_to_jiffies(50)
+
+/**
+ * cgroup_rstat_flush_ratelimited - limit pressure on global lock (cgroup_rstat_lock)
+ * @cgrp: target cgroup
+ *
+ * Trade accuracy for less pressure on the global lock. Only use this when the
+ * caller doesn't need 100% up-to-date stats.
+ *
+ * Userspace stats tools spawn threads reading io.stat, cpu.stat and memory.stat
+ * from the same cgroup, but on the kernel side they share the same global lock.
+ * This also limits malicious userspace from hurting the kernel by reading these
+ * stat files in a tight loop.
+ *
+ * This function exits early if a flush recently happened.
+ *
+ * Returns 1 if a flush happened
+ */
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
+{
+	int res = 0;
+
+	might_sleep();
+
+	/* Outside lock: check if this flush and lock can be skipped */
+	if (time_before(jiffies,
+			READ_ONCE(cgrp->rstat_flush_last_time) + FLUSH_TIME))
+		return 0;
+
+	__cgroup_rstat_lock(cgrp, -1);
+
+	/* Recheck inside lock; only flush after enough time has passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		res = 1;
+	}
+
+	__cgroup_rstat_unlock(cgrp, -1);
+
+	return res;
+}
+
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -317,13 +364,22 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  * paired with cgroup_rstat_flush_release().
  *
  * This function may block.
+ *
+ * Returns 1 if a flush happened
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+int cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+
+	/* Only do the expensive flush after enough time has passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		return 1;
+	}
+	return 0;
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..771261612ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,7 @@ static void do_flush_stats(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
+	/* memcg has its own ratelimit layer */
 	cgroup_rstat_flush(memcg->css.cgroup);
 }