From patchwork Sat Jul 13 16:58:44 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Neeraj Upadhyay X-Patchwork-Id: 13732454 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3070513B287; Sat, 13 Jul 2024 16:59:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889970; cv=none; b=ODyCFYn9qhdtTmUNmKVaUqSDD2YNumlwCgGpR+M+iLHT8yhuNjnbECYpEp+5X5A6GHEfcdKr7xGHuVXrhTLCW15j5Lyd7lIKs3Oxcg3GaDRel5sxUm4jfplyaSWD+oMpIzpPgMEPoaDwLGUuk3L8h/K6PcP0KGl9itkqhkIok+E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889970; c=relaxed/simple; bh=KYBad74cZVInqYWF85EGV9C9g4LucoMsE/dNGXtFggU=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=tAbv4iLS/TnftM/3AYyV+P3TsrUPCeC5cpplkFeVKqp3dVTOlgsdOrxSr9rM4pDVGhAOCY/ULwWoLNRUF4OpixBAv2hpg6h01l75e0zjYq9qx/AR43E1gbdVHJLg+KAJSlMS/jp2Lgx9zwkU8whBwsJBuEXJjTxCLMkcusa3Hrw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=ivG7ua0f; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="ivG7ua0f" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 95F9BC32781; Sat, 13 Jul 2024 16:59:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720889969; bh=KYBad74cZVInqYWF85EGV9C9g4LucoMsE/dNGXtFggU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=ivG7ua0fflJ90aOgBFrUgWD74Fli9Tt2UpLfAAa5BmG38tcX1yzE5dT0+R8fu2o8Q v7wM+vzKoqBt0Gix3uFgjZ7aOkTpjzY2SADh+SlSjjo/H76d+rBGCib/FS4tigubCM WjRwsOBipxMb78kh8kTzJDo43CN+CRuHfh1uT5tWY/nVC5mbl/vzPVL2/k9HC+okQI 5Kp4jqK99HflrmHzA4iowWdyzZmgEP3lt9rzq0Iss775MaR2/nfT2qEAtss45A3Z29 XHOefIVdkUHIN2fry/uTVYjbq693nsIr8HSmctvU+oT4umRVeY6uyjZBHCwlImIXG5 PSJWPtxTsZwgw== From: neeraj.upadhyay@kernel.org To: linux-kernel@vger.kernel.org Cc: rcu@vger.kernel.org, kernel-team@meta.com, rostedt@goodmis.org, mingo@kernel.org, peterz@infradead.org, paulmck@kernel.org, leobras@redhat.com, imran.f.khan@oracle.com, riel@surriel.com, neeraj.upadhyay@kernel.org Subject: [PATCH 1/3] locking/csd_lock: Print large numbers as negatives Date: Sat, 13 Jul 2024 22:28:44 +0530 Message-Id: <20240713165846.216174-1-neeraj.upadhyay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240713165642.GA215331@neeraj.linux> References: <20240713165642.GA215331@neeraj.linux> Precedence: bulk X-Mailing-List: rcu@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: "Paul E. McKenney" The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are printed in nanoseconds as unsigned long longs, which is a bit obtuse for human readers when timing bugs result in negative CSD-lock hold times. Yes, there are some people to whom it is immediately obvious that 18446744073709551615 is really -1, but for the rest of us... Therefore, print these numbers as signed long longs, making the negative hold times immediately apparent. Reported-by: Rik van Riel Signed-off-by: Paul E. McKenney Cc: Imran Khan Cc: Ingo Molnar Cc: Leonardo Bras Cc: "Peter Zijlstra (Intel)" Cc: Rik van Riel Signed-off-by: Neeraj Upadhyay Reviewed-by: Rik van Riel --- kernel/smp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/smp.c b/kernel/smp.c index f085ebcdf9e7..81f7083a53e2 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -248,8 +248,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ /* How long since this CSD lock was stuck. */ ts_delta = ts2 - ts0; - pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, + pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n", + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta, cpu, csd->func, csd->info); /* * If the CSD lock is still stuck after 5 minutes, it is unlikely From patchwork Sat Jul 13 16:58:45 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Neeraj Upadhyay X-Patchwork-Id: 13732455 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 114CB13B287; Sat, 13 Jul 2024 16:59:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889977; cv=none; b=R+zXDOlUNYjd3qddZYjGIIA0w+yCVvVTyqha7ZtgFkYrNLr7ad172Qsujuo8TUNPBBlRZF+lJe/wVApnY6kGsR7waGJ5P+NPQu45REclTbutbNA2vSVMrgATxrCIkOrWGzydzA6VJmwOut5745CpqPL6v12FLrjDuM6vvGbhQAQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889977; c=relaxed/simple; bh=TcTKuYb/T3TiNi95YK/ATJr+aH6f3xo82T3OOyXeLDk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=iXFCf/4TiGT9zrEH4FG1QJyHpFOlMMez51q8VuM0/KYX3cKZtwzndo+nCv0UNbNCrQJ+ih6Ajk+A+LM4p2QJ0xfoEnCVOEHlK3OX4Tw3531SUJdwcch3rGuOiZhOklt5YpdKeZYOIeX/Sh9VTL4iigHCHmcOuIPRPHZttDScOvc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=K0fbpG8q; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="K0fbpG8q" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 6D130C32781; Sat, 13 Jul 2024 16:59:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720889976; bh=TcTKuYb/T3TiNi95YK/ATJr+aH6f3xo82T3OOyXeLDk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=K0fbpG8qQirQk8feTGjpFY02fyuc0mVKjAKBkKo/N+FJTtWMj0Ruvbsfagt9LH5RH XqEQzqx4NWcZ6ho0utqjKWCGG0YFJc498P5cEte+5x79z/gFdvM/lvkB1ghr0P0Y84 YDO0qF7EghCSEHxjDKcKEI71WSL/deSJRzYIsw6g+EuYHpkQL3AnipOGXdd65dSQYa LQHPG7bU1EE9uB6WYZLG0rdo7brKuk2LXGkQMqXoMS3BEWzEsqXIlIHce7UchkV8WP NdDweL6GtNxEWw2BawsIuTY5JhatYtg2lF4EXBva2pULVkOCLVcj+Emp4cx6IWpK9W BelPT7o+yuDRw== From: neeraj.upadhyay@kernel.org To: linux-kernel@vger.kernel.org Cc: rcu@vger.kernel.org, kernel-team@meta.com, rostedt@goodmis.org, mingo@kernel.org, peterz@infradead.org, paulmck@kernel.org, leobras@redhat.com, imran.f.khan@oracle.com, riel@surriel.com, neeraj.upadhyay@kernel.org Subject: [PATCH 2/3] locking/csd_lock: Provide an indication of ongoing CSD-lock stall Date: Sat, 13 Jul 2024 22:28:45 +0530 Message-Id: <20240713165846.216174-2-neeraj.upadhyay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240713165642.GA215331@neeraj.linux> References: <20240713165642.GA215331@neeraj.linux> Precedence: bulk X-Mailing-List: rcu@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: "Paul E. McKenney" If a CSD-lock stall goes on long enough, it will cause an RCU CPU stall warning. This additional warning provides much additional console-log traffic and little additional information. Therefore, provide a new csd_lock_is_stuck() function that returns true if there is an ongoing CSD-lock stall. This function will be used by the RCU CPU stall warnings to provide a one-line indication of the stall when this function returns true. Signed-off-by: Paul E. McKenney Cc: Imran Khan Cc: Ingo Molnar Cc: Leonardo Bras Cc: "Peter Zijlstra (Intel)" Cc: Rik van Riel Signed-off-by: Neeraj Upadhyay --- include/linux/smp.h | 6 ++++++ kernel/smp.c | 16 ++++++++++++++++ 2 files changed, 22 insertions(+) diff --git a/include/linux/smp.h b/include/linux/smp.h index fcd61dfe2af3..3871bd32018f 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -294,4 +294,10 @@ int smpcfd_prepare_cpu(unsigned int cpu); int smpcfd_dead_cpu(unsigned int cpu); int smpcfd_dying_cpu(unsigned int cpu); +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG +bool csd_lock_is_stuck(void); +#else +static inline bool csd_lock_is_stuck(void) { return false; } +#endif + #endif /* __LINUX_SMP_H */ diff --git a/kernel/smp.c b/kernel/smp.c index 81f7083a53e2..c3e8241e9cbb 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -207,6 +207,19 @@ static int csd_lock_wait_getcpu(call_single_data_t *csd) return -1; } +static atomic_t n_csd_lock_stuck; + +/** + * csd_lock_is_stuck - Has a CSD-lock acquisition been stuck too long? + * + * Returns @true if a CSD-lock acquisition is stuck and has been stuck + * long enough for a "non-responsive CSD lock" message to be printed. + */ +bool csd_lock_is_stuck(void) +{ + return !!atomic_read(&n_csd_lock_stuck); +} + /* * Complain if too much time spent waiting. Note that only * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU, @@ -228,6 +241,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in cpu = csd_lock_wait_getcpu(csd); pr_alert("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock.\n", *bug_id, raw_smp_processor_id(), cpu); + atomic_dec(&n_csd_lock_stuck); return true; } @@ -251,6 +265,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n", firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta, cpu, csd->func, csd->info); + if (firsttime) + atomic_dec(&n_csd_lock_stuck); /* * If the CSD lock is still stuck after 5 minutes, it is unlikely * to become unstuck. Use a signed comparison to avoid triggering From patchwork Sat Jul 13 16:58:46 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Neeraj Upadhyay X-Patchwork-Id: 13732456 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 863D213B287; Sat, 13 Jul 2024 16:59:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889981; cv=none; b=Hc6bjOcjuUUjrmYP4/qOgOLLdZ/AI0/tUk1ppIBhzGreeii+2hHrxoEqF893DjoaJEknmV71NMALSpDWhFLhDd77n/d7r0ft3csOGgfrYLKt7oHYwDVEoHTpUOonhN4i70HATIL9CZR5bUsz+gdlZmyQxz7c8v+UJSiLnMm0zng= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720889981; c=relaxed/simple; bh=GLt0F51yTmhEBaXCELRw+d8hd96JhxGIjc0n17/Yv3I=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=VU+KqG/sEdzEFSua/3zQUeyGauOqNboWr4+JHC+EsufQRh2arWNprgQsUWSsPUVm64YpVridbuGK64xj9Gur/cBDM5q+Ne9vq30y+rY90Kk3T2bZpiGLULTZNImYtb9jdIgzOd/cwylKrlzbq2NYrOZzF4RRkzdnel8X/8X57aM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=V7MLAuAn; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="V7MLAuAn" Received: by smtp.kernel.org (Postfix) with ESMTPSA id EFFC0C32786; Sat, 13 Jul 2024 16:59:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720889981; bh=GLt0F51yTmhEBaXCELRw+d8hd96JhxGIjc0n17/Yv3I=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=V7MLAuAnvCXX7QlIoZtI9Tpdpflj+8SslgEiDYbw1AIZ1Kwfhc5Z8ajOtwUQ0AkKR xgihewkRHSNVJazZLBsKGPgjev0BJV19it0IjWO4VdfItGIitVbTv4KtECdbIfktAN ZihTRB+jOJB4vF6mTVWBIufkR59BjxWuWFBlMtcaPcKs8iEdQTAXwrTBN6BuybKRnk BJ1FXnY/ifiiMv9Fi+6ATJ/x5KcXKADn3+mShFN65F/TvlTFPROVly//fwJz6tV5/3 dPSCLOu40dZC2pfMPAqcmSHQdKkqsVRfBS/J70lnR+bG5Z01yUBg87Nir8cW/CXA2x TI12pfDpvtUiA== From: neeraj.upadhyay@kernel.org To: linux-kernel@vger.kernel.org Cc: rcu@vger.kernel.org, kernel-team@meta.com, rostedt@goodmis.org, mingo@kernel.org, peterz@infradead.org, paulmck@kernel.org, leobras@redhat.com, imran.f.khan@oracle.com, riel@surriel.com, neeraj.upadhyay@kernel.org Subject: [PATCH 3/3] locking/csd-lock: Use backoff for repeated reports of same incident Date: Sat, 13 Jul 2024 22:28:46 +0530 Message-Id: <20240713165846.216174-3-neeraj.upadhyay@kernel.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240713165642.GA215331@neeraj.linux> References: <20240713165642.GA215331@neeraj.linux> Precedence: bulk X-Mailing-List: rcu@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: "Paul E. McKenney" Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y kernels are emitted at five-second intervals. Although this has proven to be a good time interval for the first diagnostic, if the target CPU keeps interrupts disabled for way longer than five seconds, the ratio of useful new information to pointless repetition increases considerably. Therefore, back off the time period for repeated reports of the same incident, increasing linearly with the number of reports and logarithmicly with the number of online CPUs. [ paulmck: Apply Dan Carpenter feedback. ] Signed-off-by: Paul E. McKenney Cc: Imran Khan Cc: Ingo Molnar Cc: Leonardo Bras Cc: "Peter Zijlstra (Intel)" Cc: Rik van Riel Signed-off-by: Neeraj Upadhyay Reviewed-by: Rik van Riel --- kernel/smp.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/kernel/smp.c b/kernel/smp.c index c3e8241e9cbb..80c1173ce369 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -225,7 +225,7 @@ bool csd_lock_is_stuck(void) * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU, * so waiting on other types gets much less information. */ -static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id) +static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned long *nmessages) { int cpu = -1; int cpux; @@ -248,7 +248,9 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in ts2 = sched_clock(); /* How long since we last checked for a stuck CSD lock.*/ ts_delta = ts2 - *ts1; - if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) + if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) * + (!*nmessages ? 1 : (ilog2(num_online_cpus()) / 2 + 1)) || + csd_lock_timeout_ns == 0)) return false; firsttime = !*bug_id; @@ -265,6 +267,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n", firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta, cpu, csd->func, csd->info); + (*nmessages)++; if (firsttime) atomic_dec(&n_csd_lock_stuck); /* @@ -305,12 +308,13 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in */ static void __csd_lock_wait(call_single_data_t *csd) { + unsigned long nmessages = 0; int bug_id = 0; u64 ts0, ts1; ts1 = ts0 = sched_clock(); for (;;) { - if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id)) + if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages)) break; cpu_relax(); }