From patchwork Tue Apr 16 17:51:26 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jesper Dangaard Brouer X-Patchwork-Id: 13632261 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4988DC4345F for ; Tue, 16 Apr 2024 17:51:36 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id DBDBB6B009B; Tue, 16 Apr 2024 13:51:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D6E5B6B009C; Tue, 16 Apr 2024 13:51:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C355D6B009D; Tue, 16 Apr 2024 13:51:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id A53826B009B for ; Tue, 16 Apr 2024 13:51:35 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 3D050A0FCF for ; Tue, 16 Apr 2024 17:51:35 +0000 (UTC) X-FDA: 82016137350.27.CEC4065 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf25.hostedemail.com (Postfix) with ESMTP id 7EDCFA000C for ; Tue, 16 Apr 2024 17:51:33 +0000 (UTC) Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=dBq3Bzz0; spf=pass (imf25.hostedemail.com: domain of hawk@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=hawk@kernel.org; dmarc=pass (policy=none) header.from=kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1713289893; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=7FERMcdJ/nypr/NtJibSgI0Cn/LJnTGynFKq1MlnNQE=; b=qYfN330T+lrnm0v6oDbhpH/8BpNLbBxj2MdD9Tc3GNeJZqBsFzVybdi1/7Ns0OmQnw4tgB 2JUR9Q4sA867uWT9cI7ORT0AbXOdjX5n6wDqZ4EkRO4LDKQWZjd9xW3VKrq/hhphQzRWwC a09dK4J5MyQOsznZ67H9GTQ9H9qKVBY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1713289893; a=rsa-sha256; cv=none; b=bmeHv8oGMWb0I+cKgdaXka+euzHKDypsuD6KzjyzjIU2STVDoZ7t1UWhl7vKNdt2GhCI2B trydI6P3GQRD2t07yFCjamOmMz7jwoUTYLLzGGGJ9D9EsID9xGl0AqXhFBV4pRpSvmsPyx DKMZxsDeePUM2PMEuBfk4HxT8+FQYjs= ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=dBq3Bzz0; spf=pass (imf25.hostedemail.com: domain of hawk@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=hawk@kernel.org; dmarc=pass (policy=none) header.from=kernel.org Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 69C6061257; Tue, 16 Apr 2024 17:51:32 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7D599C113CE; Tue, 16 Apr 2024 17:51:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1713289892; bh=6AoQjMmtCpA92ORAYahs1Yjyu49n9JrorgvUAsN4z7A=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=dBq3Bzz0EYHz6B9QSkhEnwznlxP5JAAVUOO63eojPw9TTIgF3RqHAWDqPv1j32x3D xkmhmtazuEqtuuOT8YiBeCcHSwk6K+gA3L+Wi6kjmC3GH0i16vmnvB5Xt/xCHpwMmu z+MAOQsXce/0Af8YKLkJfedtDulk2hgU9LdLgXJr0xLiUIlHKUa9KM3i1HIPw64p1g 6Su3c4lVRrtAcw1srlnua2Y/o5gACkabKTgADjyYhaDuaavJ4iLOtWk257IydQkkjb DKymWfbCHI5jEW2guzlRNnok/PlUl+gV+QUhfqBCTIX47xul1EKw3nClMy4KFOd3vc 6im5vJq/eah5Q== Subject: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints From: Jesper Dangaard Brouer To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com Cc: Jesper Dangaard Brouer , netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, shakeel.butt@linux.dev, kernel-team@cloudflare.com, linux-kernel@vger.kernel.org, Arnaldo Carvalho de Melo , Sebastian Andrzej Siewior , mhocko@kernel.org Date: Tue, 16 Apr 2024 19:51:26 +0200 Message-ID: <171328988660.3930751.17537768209042139758.stgit@firesoul> In-Reply-To: <171328983017.3930751.9484082608778623495.stgit@firesoul> References: <171328983017.3930751.9484082608778623495.stgit@firesoul> User-Agent: StGit/1.5 MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 7EDCFA000C X-Stat-Signature: j8f197ski9ooo9gii91e6efeinx1bc7r X-HE-Tag: 1713289893-891590 X-HE-Meta: U2FsdGVkX1/1TZXOludwzYEPzoEh7QRu+OW31RHtlP4xMQ21+tBdkIxplMoP4qZOxF6T08KnQbKT4S7w+f/HyTb/insGgafC/2PrMbLDHHz107RJeR2ULkeLPnCwzu+iH3sn0I5ZclaHG7hvtpb+bgotDfnWeiqDrUmuWIH4+5v7Ec5MMhlWw+/wwZ63EENuXl4L8poJY4jzcJm6C9oN8/HHl7iPOcZeusy5ySxBnM0ZH+gEJNnVfe1Ec4Ckkb1VpTw4BDR4UU7hyLqV/IpjNqVwDFCjUYQpeSX+23chXqBIO2KNAsBH0jjAcnqp5PmxqdkZzey3ZydL5EIur9taQf76vqAq8cDBSEBr66EpIHHox0FpU4vrxR7cZxETV/T/g/eCs4blqwzdMqPQ+szjFVYquH/lhoz6tbLsTFLLndzouh/s/v88b8+c7Nswltag73wpbP3NVCZ3Sz+u9UnUUqwlPOZHXs85V142KVHIqVkB0N0KDey8oohaVNpMbqYWAtwWvDwCanzm70EbAKxSmNa2RhjPkSCb/aMzjeKV5997o2mhpx8YnlFdlpA6MWpAjGSD1ACm0wZGmNYGhOgWUTzEa7QMxYPh/g0umzn59kg38gH8mYX29HAtDbKGyKNOLo/SMFyZ3agnA/iNr1VNAef6ry2CfqkKZqZwKlFPQO0oqwWccpaMPIjWo9TmiqDsCIeusQTLoH/iTdK9nQKdIjYn54EhPrdNyoZpJ9bq2P+0qFf0ZP+0d8vIER545mjimwbIiTcBPHBvWwcPKWqsTarg7ZMfAZaUnD40d6/+xMHBkZ313Im8mHIr67kI/VSLKdAH4G3YJor1ZRKRWr646TBl8ZaEtzPgNIeuETT6tilW9n0PnuEBw4PghwePYEddAPLKH2liSOmEiUnlSUJVgDiJEMKGyWtzHPFHb+6ghVlicwxS7jmCRvowicTE8KOt0Zm7Un4fj/5h3YQ3v2K gU5KQeoc 7C7LlkwiLPkBGuBtBCoPwkQ6ePIIZkDGFO5pg09egCXjmrvjZvSR3Y5G1wXneGjxNF3GcNavtR7zLtaQ7Ok4AqC/QFBIp9m7SkiVPYGZOfbUGi67npi1gtYhEXZpy2acDHPJ76erNo+VW5U6Pl3mpED+JW970RzsQl++4oumq5c8Hbkhz3rhLiHOPXod/HXOFEzbAE2l191ZaE7h11VA/yR7XNqKruWBePUNzxSQ1T03/YyLgrCA9+gXMiYU0DKA4sy/6zzXtykHtPsObrF3yJoOHg9M9YTTRv/ljAYwPBNyw0AI= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints. Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as cgroup pointer and cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last. The new feature in this patchset is detecting when lock is contended. The tracepoints are implemented with production in mind. For minimum overhead attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects lock is contended. A quick production check for issues could be done via this perf commands: perf record -g -e cgroup:cgroup_rstat_lock_contended Next natural question would be asking how long time do lock contenders wait for obtaining the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }' Extending with time spend holding the lock will be more expensive as this also looks at all the non-contended cases. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs; if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} tracepoint:cgroup:cgroup_rstat_unlock { @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);} interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }' Signes-off-by: Jesper Dangaard Brouer --- include/linux/cgroup.h | 2 +- include/trace/events/cgroup.h | 48 +++++++++++++++++++++++++++++++++++++++++ kernel/cgroup/rstat.c | 47 +++++++++++++++++++++++++++++++++------- 3 files changed, 88 insertions(+), 9 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 34aaf0e87def..2150ca60394b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -690,7 +690,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen) void cgroup_rstat_updated(struct cgroup *cgrp, int cpu); void cgroup_rstat_flush(struct cgroup *cgrp); void cgroup_rstat_flush_hold(struct cgroup *cgrp); -void cgroup_rstat_flush_release(void); +void cgroup_rstat_flush_release(struct cgroup *cgrp); /* * Basic resource stats. diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h index dd7d7c9efecd..13f375800135 100644 --- a/include/trace/events/cgroup.h +++ b/include/trace/events/cgroup.h @@ -204,6 +204,54 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen, TP_ARGS(cgrp, path, val) ); +DECLARE_EVENT_CLASS(cgroup_rstat, + + TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended), + + TP_ARGS(cgrp, cpu_in_loop, contended), + + TP_STRUCT__entry( + __field( int, root ) + __field( int, level ) + __field( u64, id ) + __field( int, cpu_in_loop ) + __field( bool, contended ) + ), + + TP_fast_assign( + __entry->root = cgrp->root->hierarchy_id; + __entry->id = cgroup_id(cgrp); + __entry->level = cgrp->level; + __entry->cpu_in_loop = cpu_in_loop; + __entry->contended = contended; + ), + + TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d", + __entry->root, __entry->id, __entry->level, + __entry->cpu_in_loop, __entry->contended) +); + +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended, + + TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), + + TP_ARGS(cgrp, cpu, contended) +); + +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked, + + TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), + + TP_ARGS(cgrp, cpu, contended) +); + +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock, + + TP_PROTO(struct cgroup *cgrp, int cpu, bool contended), + + TP_ARGS(cgrp, cpu, contended) +); + #endif /* _TRACE_CGROUP_H */ /* This part must be outside protection */ diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 07e2284bb499..ff68c904e647 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -7,6 +7,8 @@ #include #include +#include + static DEFINE_SPINLOCK(cgroup_rstat_lock); static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); @@ -222,6 +224,35 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp, __bpf_hook_end(); +/* + * Helper functions for locking cgroup_rstat_lock. + * + * This makes it easier to diagnose locking issues and contention in + * production environments. The parameter @cpu_in_loop indicate lock + * was released and re-taken when collection data from the CPUs. The + * value -1 is used when obtaining the main lock else this is the CPU + * number processed last. + */ +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop) + __acquires(&cgroup_rstat_lock) +{ + bool contended; + + contended = !spin_trylock_irq(&cgroup_rstat_lock); + if (contended) { + trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended); + spin_lock_irq(&cgroup_rstat_lock); + } + trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended); +} + +static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) + __releases(&cgroup_rstat_lock) +{ + trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false); + spin_unlock_irq(&cgroup_rstat_lock); +} + /* see cgroup_rstat_flush() */ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) @@ -248,10 +279,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) /* play nice and yield if necessary */ if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { - spin_unlock_irq(&cgroup_rstat_lock); + __cgroup_rstat_unlock(cgrp, cpu); if (!cond_resched()) cpu_relax(); - spin_lock_irq(&cgroup_rstat_lock); + __cgroup_rstat_lock(cgrp, cpu); } } } @@ -273,9 +304,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) { might_sleep(); - spin_lock_irq(&cgroup_rstat_lock); + __cgroup_rstat_lock(cgrp, -1); cgroup_rstat_flush_locked(cgrp); - spin_unlock_irq(&cgroup_rstat_lock); + __cgroup_rstat_unlock(cgrp, -1); } /** @@ -291,17 +322,17 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) __acquires(&cgroup_rstat_lock) { might_sleep(); - spin_lock_irq(&cgroup_rstat_lock); + __cgroup_rstat_lock(cgrp, -1); cgroup_rstat_flush_locked(cgrp); } /** * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold() */ -void cgroup_rstat_flush_release(void) +void cgroup_rstat_flush_release(struct cgroup *cgrp) __releases(&cgroup_rstat_lock) { - spin_unlock_irq(&cgroup_rstat_lock); + __cgroup_rstat_unlock(cgrp, -1); } int cgroup_rstat_init(struct cgroup *cgrp) @@ -533,7 +564,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq) #ifdef CONFIG_SCHED_CORE forceidle_time = cgrp->bstat.forceidle_sum; #endif - cgroup_rstat_flush_release(); + cgroup_rstat_flush_release(cgrp); } else { root_cgroup_cputime(&bstat); usage = bstat.cputime.sum_exec_runtime;