From patchwork Wed May 1 14:04:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 13650770
Subject: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
From: Jesper Dangaard Brouer <hawk@kernel.org>
To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
 cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, linux-mm@kvack.org,
 shakeel.butt@linux.dev, kernel-team@cloudflare.com,
 Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior
Date: Wed, 01 May 2024 16:04:11 +0200
Message-ID: <171457225108.4159924.12821205549807669839.stgit@firesoul>
User-Agent: StGit/1.5
MIME-Version: 1.0

This closely resembles the helpers added for the global cgroup_rstat_lock
in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and
tracepoints"). This change is for the per-CPU lock cgroup_rstat_cpu_lock.

Based on production workloads, we observe that the fast-path "update"
function cgroup_rstat_updated() is invoked around 3 million times per
sec, while the "flush" function cgroup_rstat_flush_locked(), walking each
possible CPU, can see periodic spikes of 700 invocations/sec. For this
reason, the tracepoints are split into normal and fastpath versions for
this per-CPU lock. This makes it feasible for production systems to
continuously monitor the non-fastpath tracepoints to detect lock
contention issues. The reason for monitoring is that the lock disables
IRQs, which can disturb e.g. softirq processing on the local CPUs
involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g.
once converted to a mutex), this per-CPU lock becomes the next bottleneck
that can introduce latency variations.
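For the high-frequency fast-path side, continuous monitoring is not
practical at the invocation rates above, but the fastpath tracepoints can
be sampled in short bursts. A minimal sketch that only counts fast-path
events per second (the tracepoint names are those added by this patch;
the exact bpftrace invocation is illustrative):

bpftrace -e '
        tracepoint:cgroup:cgroup_rstat_cpu_lock_contended_fastpath { @contended = count(); }
        tracepoint:cgroup:cgroup_rstat_cpu_locked_fastpath { @locked = count(); }
        interval:s:1 { time("%H:%M:%S ");
                print(@contended); print(@locked);
                clear(@contended); clear(@locked); }'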
A practical bpftrace script for monitoring contention latency:

bpftrace -e '
        tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
                @start[tid] = nsecs; @cnt[probe] = count(); }
        tracepoint:cgroup:cgroup_rstat_cpu_locked {
                if (args->contended) {
                        @wait_ns = hist(nsecs - @start[tid]);
                        delete(@start[tid]); }
                @cnt[probe] = count(); }
        interval:s:1 { time("%H:%M:%S ");
                print(@wait_ns); print(@cnt); clear(@cnt); }'

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
 kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 108 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index 13f375800135..0b95865a90f3 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -206,15 +206,15 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
 
 DECLARE_EVENT_CLASS(cgroup_rstat,
 
-	TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
 
-	TP_ARGS(cgrp, cpu_in_loop, contended),
+	TP_ARGS(cgrp, cpu, contended),
 
 	TP_STRUCT__entry(
 		__field(	int,		root			)
 		__field(	int,		level			)
 		__field(	u64,		id			)
-		__field(	int,		cpu_in_loop		)
+		__field(	int,		cpu			)
 		__field(	bool,		contended		)
 	),
 
@@ -222,15 +222,16 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
 		__entry->root = cgrp->root->hierarchy_id;
 		__entry->id = cgroup_id(cgrp);
 		__entry->level = cgrp->level;
-		__entry->cpu_in_loop = cpu_in_loop;
+		__entry->cpu = cpu;
 		__entry->contended = contended;
 	),
 
-	TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
+	TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d",
 		  __entry->root, __entry->id, __entry->level,
-		  __entry->cpu_in_loop, __entry->contended)
+		  __entry->cpu, __entry->contended)
 );
 
+/* Related to global: cgroup_rstat_lock */
 DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
 
 	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
@@ -252,6 +253,49 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
 	TP_ARGS(cgrp, cpu, contended)
 );
 
+/* Related to per CPU: cgroup_rstat_cpu_lock */
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
 #endif /* _TRACE_CGROUP_H */
 
 /* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 52e3b0ed1cee..fb8b49437573 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
 	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
 }
 
+/*
+ * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The parameter @fast_path determines the
+ * tracepoints being added, allowing us to diagnose "flush" related
+ * operations without handling high-frequency fast-path "update" events.
+ */
+static __always_inline
+unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+				     struct cgroup *cgrp, const bool fast_path)
+{
+	unsigned long flags;
+	bool contended;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
+	if (contended) {
+		if (fast_path)
+			trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
+		else
+			trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+
+		raw_spin_lock_irqsave(cpu_lock, flags);
+	}
+
+	if (fast_path)
+		trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
+	else
+		trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
+
+	return flags;
+}
+
+static __always_inline
+void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
+			      struct cgroup *cgrp, unsigned long flags,
+			      const bool fast_path)
+{
+	if (fast_path)
+		trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
+	else
+		trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
+
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+}
+
 /**
  * cgroup_rstat_updated - keep track of updated rstat_cpu
  * @cgrp: target cgroup
@@ -44,7 +98,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
 		return;
 
-	raw_spin_lock_irqsave(cpu_lock, flags);
+	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
 	while (true) {
@@ -72,7 +126,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 		cgrp = parent;
 	}
 
-	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
 }
 
 /**
@@ -153,15 +207,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	struct cgroup *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
-	 */
-	raw_spin_lock_irqsave(cpu_lock, flags);
+	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -198,7 +244,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
 
 	return head;
 }
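
Usage note (illustrative, not part of the patch): the script in the
description measures lock wait time; lock hold time on the non-fastpath
"flush" side can be estimated the same way by pairing the locked/unlock
tracepoints. A minimal sketch, assuming this patch is applied:

bpftrace -e '
        tracepoint:cgroup:cgroup_rstat_cpu_locked { @locked[tid] = nsecs; }
        tracepoint:cgroup:cgroup_rstat_cpu_unlock {
                if (@locked[tid]) {
                        @hold_ns = hist(nsecs - @locked[tid]);
                        delete(@locked[tid]); }}
        interval:s:1 { time("%H:%M:%S "); print(@hold_ns); }'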