[1/3] locking/csd_lock: Print large numbers as negatives

Message ID 20240713165846.216174-1-neeraj.upadhyay@kernel.org (mailing list archive)
State Superseded
Series CSD-lock diagnostics enhancements

Commit Message

Neeraj Upadhyay July 13, 2024, 4:58 p.m. UTC
From: "Paul E. McKenney" <paulmck@kernel.org>

The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are
printed in nanoseconds as unsigned long longs, which is a bit obtuse for
human readers when timing bugs result in negative CSD-lock hold times.
Yes, there are some people to whom it is immediately obvious that
18446744073709551615 is really -1, but for the rest of us...

Therefore, print these numbers as signed long longs, making the negative
hold times immediately apparent.
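
For illustration, a minimal userspace sketch (an editor's example, not
part of the patch) showing the same wraparound when a later timestamp
is subtracted from an earlier one:

#include <stdio.h>

int main(void)
{
	unsigned long long ts0 = 100, ts2 = 99;	/* ts2 < ts0: clock went backward */
	unsigned long long delta = ts2 - ts0;	/* wraps to 2^64 - 1 */

	printf("%llu\n", delta);		/* prints 18446744073709551615 */
	printf("%lld\n", (long long)delta);	/* prints -1 */
	return 0;
}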

Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
---
 kernel/smp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Rik van Riel July 13, 2024, 5:14 p.m. UTC | #1
On Sat, 2024-07-13 at 22:28 +0530, neeraj.upadhyay@kernel.org wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are
> printed in nanoseconds as unsigned long longs, which is a bit obtuse
> for
> human readers when timing bugs result in negative CSD-lock hold
> times.
> Yes, there are some people to whom it is immediately obvious that
> 18446744073709551615 is really -1, but for the rest of us...
> 
To clarify the report a little bit: it appears that, on some CPU
models, sched_clock() values occasionally jump backward on the
same CPU.

Looking at the number of systems where this happened over time,
leaving out the exact numbers, the distribution looks something
like this:
- 1 day:     N systems
- 3 days:   3N systems
- 1 week:   7N systems
- 1 month: 26N systems
- 90 days: 72N systems

This does not appear to be a case of a few systems with bad
hardware where it happens constantly to the same systems; rather,
many systems experience it occasionally, and then not again for
months.

The systems in question advertise CONSTANT_TSC, NONSTOP_TSC,
and generally seem to have stable, nonstop, monotonic TSC
values, but sometimes the values go back in time a little bit.
The cycles_2_ns data does not appear to change during the
episodes of sched_clock() going backward.

The csd_lock code is not the only thing that breaks when the
sched_clock values go backward, but it seems to be the best
detection mechanism we have for this right now.
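
As a purely hypothetical sketch (not part of this series; the
function and the last_clock variable are invented for illustration),
a per-CPU check along these lines could flag such an event directly:

/* Hypothetical: warn when sched_clock() goes backward on this CPU. */
static DEFINE_PER_CPU(u64, last_clock);

/* Caller must have preemption disabled. */
static void check_sched_clock_monotonic(void)
{
	u64 now = sched_clock();
	u64 *last = this_cpu_ptr(&last_clock);

	if ((s64)(now - *last) < 0)
		pr_warn("sched_clock went backward by %lld ns on CPU %d\n",
			(s64)(*last - now), raw_smp_processor_id());
	*last = now;
}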

I don't know whether adding more detection of this issue would
increase the number of systems where backwards sched_clock is
observed.

Many of the systems with backwards-going TSC values seem to
encounter a burst of these events over some time period, end up
getting rebooted, and then behave well for months afterward.

> Reported-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Imran Khan <imran.f.khan@oracle.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Leonardo Bras <leobras@redhat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> Cc: Rik van Riel <riel@surriel.com>
> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
> 
Reviewed-by: Rik van Riel <riel@surriel.com>

Patch

diff --git a/kernel/smp.c b/kernel/smp.c
index f085ebcdf9e7..81f7083a53e2 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -248,8 +248,8 @@  static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
 	/* How long since this CSD lock was stuck. */
 	ts_delta = ts2 - ts0;
-	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
+	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %lld ns for CPU#%02d %pS(%ps).\n",
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), (s64)ts_delta,
 		 cpu, csd->func, csd->info);
 	/*
 	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
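
With this change, a timing bug shows up as an obviously negative hold
time. As a purely hypothetical example (CPU numbers and function name
invented, plugged into the format string above), a report that
previously read:

 csd: Detected non-responsive CSD lock (#1) on CPU#2, waiting 18446744073709551615 ns for CPU#03 some_func(some_func).

would instead read:

 csd: Detected non-responsive CSD lock (#1) on CPU#2, waiting -1 ns for CPU#03 some_func(some_func).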