diff mbox series

hung_task: Allow printing warnings every check interval

Message ID 20190724170249.9644-1-dima@arista.com (mailing list archive)
State New, archived
Headers show
Series hung_task: Allow printing warnings every check interval | expand

Commit Message

Dmitry Safonov July 24, 2019, 5:02 p.m. UTC
Hung task detector has one timeout and has two associated actions on it:
- issuing warnings with names and stacks of blocked tasks
- panic()

We want switches to panic (and reboot) if there's a task
in uninterruptible sleep for some minutes - at that moment something
ugly has happened and the box needs a reboot.
But we also want to detect conditions that are "out of range"
or approaching the point of failure. Under such conditions we want
to issue an "early warning" of an impending failure, minutes before
the switch is going to panic.

Those "early warnings" serve a purpose while monitoring the network
infrastructure. Those are also valuable on post-mortem analysis, when
the logs from userspace applications aren't enough.
Furthermore, we have a test pool of long-running duts that are
constantly under close to real-world load for weeks. And such early
warnings allowed to figure out some bottle necks without much engineer
work intervention.

There are also not yet upstream patches for other kinds of "early
warnings" as prints whenever a mutex/semaphore is released after being
held for long time, but those patches are much more intricate and have
their runtime cost.

It seems rather easy to add printing tasks and their stacks for
notification and debugging purposes into hung task detector without
complicating the code or major cost (prints are with KERN_INFO loglevel
and so don't go on console, only into dmesg log).

Since commit a2e514453861 ("kernel/hung_task.c: allow to set checking
interval separately from timeout") it's possible to set checking
interval for hung task detector with `hung_task_check_interval_secs`.

Provide `hung_task_interval_warnings` sysctl that allows printing
hung tasks every detection interval. It's not ratelimited, so the root
should be cautious configuring it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Vasiliy Khoruzhick <vasilykh@arista.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 20 ++++++++-
 include/linux/sched/sysctl.h                |  1 +
 kernel/hung_task.c                          | 50 ++++++++++++++-------
 kernel/sysctl.c                             |  8 ++++
 4 files changed, 62 insertions(+), 17 deletions(-)

Comments

Tetsuo Handa July 25, 2019, 10:38 a.m. UTC | #1
On 2019/07/25 2:02, Dmitry Safonov wrote:
> Hung task detector has one timeout and has two associated actions on it:
> - issuing warnings with names and stacks of blocked tasks
> - panic()
> 
> We want switches to panic (and reboot) if there's a task
> in uninterruptible sleep for some minutes - at that moment something
> ugly has happened and the box needs a reboot.
> But we also want to detect conditions that are "out of range"
> or approaching the point of failure. Under such conditions we want
> to issue an "early warning" of an impending failure, minutes before
> the switch is going to panic.

Can't we do it by extending sysctl_hung_task_panic to accept values larger
than 1, and decrease by one when at least one thread was reported by each
check_hung_uninterruptible_tasks() check, and call panic() when
sysctl_hung_task_panic reached to 0 (or maybe 1 is simpler) ?

Hmm, might have the same problem regarding how/when to reset the counter.
If some userspace process can reset the counter, such process can trigger
SysRq-c when some period expired...

> It seems rather easy to add printing tasks and their stacks for
> notification and debugging purposes into hung task detector without
> complicating the code or major cost (prints are with KERN_INFO loglevel
> and so don't go on console, only into dmesg log).

Well, I don't think so. Might be noisy for systems without "quiet" kernel
command line option, and we can't pass KERN_DEBUG to e.g. sched_show_task()...
Dmitry Safonov July 25, 2019, 2:25 p.m. UTC | #2
On 7/25/19 11:38 AM, Tetsuo Handa wrote:
> On 2019/07/25 2:02, Dmitry Safonov wrote:
>> Hung task detector has one timeout and has two associated actions on it:
>> - issuing warnings with names and stacks of blocked tasks
>> - panic()
>>
>> We want switches to panic (and reboot) if there's a task
>> in uninterruptible sleep for some minutes - at that moment something
>> ugly has happened and the box needs a reboot.
>> But we also want to detect conditions that are "out of range"
>> or approaching the point of failure. Under such conditions we want
>> to issue an "early warning" of an impending failure, minutes before
>> the switch is going to panic.
> 
> Can't we do it by extending sysctl_hung_task_panic to accept values larger
> than 1, and decrease by one when at least one thread was reported by each
> check_hung_uninterruptible_tasks() check, and call panic() when
> sysctl_hung_task_panic reached to 0 (or maybe 1 is simpler) ?
> 
> Hmm, might have the same problem regarding how/when to reset the counter.
> If some userspace process can reset the counter, such process can trigger
> SysRq-c when some period expired...

Yes, also current distributions already using the counter to print
warnings number of times and then silently ignore. I.e., on my Arch
Linux setup:
hung_task_warnings:10

>> It seems rather easy to add printing tasks and their stacks for
>> notification and debugging purposes into hung task detector without
>> complicating the code or major cost (prints are with KERN_INFO loglevel
>> and so don't go on console, only into dmesg log).
> 
> Well, I don't think so. Might be noisy for systems without "quiet" kernel
> command line option, and we can't pass KERN_DEBUG to e.g. sched_show_task()...

Yes, that's why it's disabled by default (=0).
I tend to agree that printing with KERN_DEBUG may be better, but in my
point of view the patch isn't enough justification for patching
sched_show_task() and show_stack().

Thanks,
          Dmitry
Tetsuo Handa July 26, 2019, 11:29 a.m. UTC | #3
On 2019/07/25 23:25, Dmitry Safonov wrote:
> Yes, also current distributions already using the counter to print
> warnings number of times and then silently ignore. I.e., on my Arch
> Linux setup:
> hung_task_warnings:10

You can propose changing the default value of hung_task_warnings to -1.

Current patch might be inconvenient because printk() from hung_task_warning(t, false)
fails to go to consoles when that "t" was blocked for more than "timeout" seconds, for

	if (sysctl_hung_task_panic) {
		console_verbose();
		hung_task_show_lock = true;
		hung_task_call_panic = true;
	}

path which is intended to force printk() to go to consoles is ignored by

	/* Don't print warings twice */
	if (!sysctl_hung_task_interval_warnings)
		hung_task_warning(t, true);

when panic() should be called. (The vmcore would contain the printk() output which
was not sent to consoles if kdump is configured. But vmcore is not always available.)

> Yes, that's why it's disabled by default (=0).
> I tend to agree that printing with KERN_DEBUG may be better, but in my
> point of view the patch isn't enough justification for patching
> sched_show_task() and show_stack().

You can propose sched_show_task_log_lvl() and show_stack_log_lvl() like show_trace_log_lvl().

I think that sysctl_hung_task_interval_warnings should not be decremented automatically.
I guess that that variable should become a boolean which controls whether to report threads
(with KERN_DEBUG level) which was blocked for more than sysctl_hung_task_check_interval_secs
seconds (or a tristate which also controls whether the report should be sent to consoles
(because KERN_DEBUG level likely prevents sending to consoles)), and
hung_task_warning(t, false) should be called like

	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ)) {
		if (sysctl_hung_task_interval_warnings)
			hung_task_warning(t, false);
		return;
	}

rather than

	if (sysctl_hung_task_interval_warnings)
		hung_task_warning(t, false);
	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
		return;

.
Tetsuo Handa July 26, 2019, 1:28 p.m. UTC | #4
On 2019/07/26 20:29, Tetsuo Handa wrote:
> On 2019/07/25 23:25, Dmitry Safonov wrote:
>> Yes, also current distributions already using the counter to print
>> warnings number of times and then silently ignore. I.e., on my Arch
>> Linux setup:
>> hung_task_warnings:10
> 
> You can propose changing the default value of hung_task_warnings to -1.
> 
> Current patch might be inconvenient because printk() from hung_task_warning(t, false)
> fails to go to consoles when that "t" was blocked for more than "timeout" seconds, for
> 
> 	if (sysctl_hung_task_panic) {
> 		console_verbose();
> 		hung_task_show_lock = true;
> 		hung_task_call_panic = true;
> 	}
> 
> path which is intended to force printk() to go to consoles is ignored by
> 
> 	/* Don't print warings twice */
> 	if (!sysctl_hung_task_interval_warnings)
> 		hung_task_warning(t, true);
> 
> when panic() should be called. (The vmcore would contain the printk() output which
> was not sent to consoles if kdump is configured. But vmcore is not always available.)
> 
>> Yes, that's why it's disabled by default (=0).
>> I tend to agree that printing with KERN_DEBUG may be better, but in my
>> point of view the patch isn't enough justification for patching
>> sched_show_task() and show_stack().
> 
> You can propose sched_show_task_log_lvl() and show_stack_log_lvl() like show_trace_log_lvl().
> 
> I think that sysctl_hung_task_interval_warnings should not be decremented automatically.
> I guess that that variable should become a boolean which controls whether to report threads
> (with KERN_DEBUG level) which was blocked for more than sysctl_hung_task_check_interval_secs
> seconds (or a tristate which also controls whether the report should be sent to consoles
> (because KERN_DEBUG level likely prevents sending to consoles)), and
> hung_task_warning(t, false) should be called like
> 
> 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ)) {
> 		if (sysctl_hung_task_interval_warnings)
> 			hung_task_warning(t, false);
> 		return;
> 	}
> 
> rather than
> 
> 	if (sysctl_hung_task_interval_warnings)
> 		hung_task_warning(t, false);
> 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
> 		return;
> 
> .
> 


Well, another direction is to disassociate sysctl_hung_task_panic from
sysctl_hung_task_timeout_secs. Since nobody would want to call panic() when
a thread was blocked for only one second, allow sysctl_hung_task_panic to
specify larger than 1, and interpret it as sysctl_hung_task_timeout_secs for
calling panic(). Roughly speaking:

-	if (sysctl_hung_task_panic) {
+	unsigned long panic_timeout = READ_ONCE(sysctl_hung_task_panic)
+	if (panic_timeout == 1 || (panic_timeout > 1 &&
+	     (jiffies - t->last_switch_time) / HZ >= panic_timeout)) {
 		console_verbose();
 		hung_task_show_lock = true;
 		hung_task_call_panic = true;
 	}

If use of different loglevel is not a requirement for you, this would be the simplest.
Dmitry Safonov July 26, 2019, 1:58 p.m. UTC | #5
On 7/26/19 2:28 PM, Tetsuo Handa wrote:
> On 2019/07/26 20:29, Tetsuo Handa wrote:
>> On 2019/07/25 23:25, Dmitry Safonov wrote:
>>> Yes, also current distributions already using the counter to print
>>> warnings number of times and then silently ignore. I.e., on my Arch
>>> Linux setup:
>>> hung_task_warnings:10
>>
>> You can propose changing the default value of hung_task_warnings to -1.
>>
>> Current patch might be inconvenient because printk() from hung_task_warning(t, false)
>> fails to go to consoles when that "t" was blocked for more than "timeout" seconds, for
>>
>> 	if (sysctl_hung_task_panic) {
>> 		console_verbose();
>> 		hung_task_show_lock = true;
>> 		hung_task_call_panic = true;
>> 	}
>>
>> path which is intended to force printk() to go to consoles is ignored by
>>
>> 	/* Don't print warings twice */
>> 	if (!sysctl_hung_task_interval_warnings)
>> 		hung_task_warning(t, true);
>>
>> when panic() should be called. (The vmcore would contain the printk() output which
>> was not sent to consoles if kdump is configured. But vmcore is not always available.)

Fair enough.

>>> Yes, that's why it's disabled by default (=0).
>>> I tend to agree that printing with KERN_DEBUG may be better, but in my
>>> point of view the patch isn't enough justification for patching
>>> sched_show_task() and show_stack().
>>
>> You can propose sched_show_task_log_lvl() and show_stack_log_lvl() like show_trace_log_lvl().

I'll try, not sure how well it will go..

>>
>> I think that sysctl_hung_task_interval_warnings should not be decremented automatically.
>> I guess that that variable should become a boolean which controls whether to report threads
>> (with KERN_DEBUG level) which was blocked for more than sysctl_hung_task_check_interval_secs
>> seconds (or a tristate which also controls whether the report should be sent to consoles
>> (because KERN_DEBUG level likely prevents sending to consoles)), and
>> hung_task_warning(t, false) should be called like
>>
>> 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ)) {
>> 		if (sysctl_hung_task_interval_warnings)
>> 			hung_task_warning(t, false);
>> 		return;
>> 	}
>>
>> rather than
>>
>> 	if (sysctl_hung_task_interval_warnings)
>> 		hung_task_warning(t, false);
>> 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
>> 		return;

Good point, will do.

> Well, another direction is to disassociate sysctl_hung_task_panic from
> sysctl_hung_task_timeout_secs. Since nobody would want to call panic() when
> a thread was blocked for only one second, allow sysctl_hung_task_panic to
> specify larger than 1, and interpret it as sysctl_hung_task_timeout_secs for
> calling panic(). Roughly speaking:
> 
> -	if (sysctl_hung_task_panic) {
> +	unsigned long panic_timeout = READ_ONCE(sysctl_hung_task_panic)
> +	if (panic_timeout == 1 || (panic_timeout > 1 &&
> +	     (jiffies - t->last_switch_time) / HZ >= panic_timeout)) {
>  		console_verbose();
>  		hung_task_show_lock = true;
>  		hung_task_call_panic = true;
>  	}
> 
> If use of different loglevel is not a requirement for you, this would be the simplest.

No, we consider such messages as notifications/warnings, rather than
complete failures. So, it would be better to hide them from console.
They're also not rate-limited which is a bummer with slow serial
consoles that we've on some devices (9600).

Thanks for the review,
          Dmitry
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 032c7cd3cede..2e36620ec1e4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -45,6 +45,7 @@  show up in /proc/sys/kernel:
 - hung_task_timeout_secs
 - hung_task_check_interval_secs
 - hung_task_warnings
+- hung_task_interval_warnings
 - hyperv_record_panic_msg
 - kexec_load_disabled
 - kptr_restrict
@@ -383,14 +384,29 @@  Possible values to set are in range {0..LONG_MAX/HZ}.
 hung_task_warnings:
 ===================
 
-The maximum number of warnings to report. During a check interval
-if a hung task is detected, this value is decreased by 1.
+The maximum number of warnings to report. If after timeout a hung
+task is present, this value is decreased by 1 every check interval,
+producing a warning.
 When this value reaches 0, no more warnings will be reported.
 This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 -1: report an infinite number of warnings.
 
 
+hung_task_interval_warnings:
+===================
+
+The same as hung_task_warnings, but set the number of interval
+warnings to be issued about detected hung tasks during check
+interval. That will produce warnings *before* the timeout happens.
+If a hung task is detected during check interval, this value is
+decreased by 1. When this value reaches 0, only timeout warnings
+will be reported.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+-1: report an infinite number of check interval warnings.
+
+
 hyperv_record_panic_msg:
 ========================
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..89f55e914673 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -12,6 +12,7 @@  extern unsigned int  sysctl_hung_task_panic;
 extern unsigned long sysctl_hung_task_timeout_secs;
 extern unsigned long sysctl_hung_task_check_interval_secs;
 extern int sysctl_hung_task_warnings;
+extern int sysctl_hung_task_interval_warnings;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 					 void __user *buffer,
 					 size_t *lenp, loff_t *ppos);
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 14a625c16cb3..cd971eef8226 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -49,6 +49,7 @@  unsigned long __read_mostly sysctl_hung_task_timeout_secs = CONFIG_DEFAULT_HUNG_
 unsigned long __read_mostly sysctl_hung_task_check_interval_secs;
 
 int __read_mostly sysctl_hung_task_warnings = 10;
+int __read_mostly sysctl_hung_task_interval_warnings;
 
 static int __read_mostly did_panic;
 static bool hung_task_show_lock;
@@ -85,6 +86,34 @@  static struct notifier_block panic_block = {
 	.notifier_call = hung_task_panic,
 };
 
+static void hung_task_warning(struct task_struct *t, bool timeout)
+{
+	const char *loglevel = timeout ? KERN_ERR : KERN_INFO;
+	const char *path;
+	int *warnings;
+
+	if (timeout) {
+		warnings = &sysctl_hung_task_warnings;
+		path = "hung_task_timeout_secs";
+	} else {
+		warnings = &sysctl_hung_task_interval_warnings;
+		path = "hung_task_interval_secs";
+	}
+
+	if (*warnings > 0)
+		--*warnings;
+
+	printk("%sINFO: task %s:%d blocked for more than %ld seconds.\n",
+	       loglevel, t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
+	printk("%s      %s %s %.*s\n",
+		loglevel, print_tainted(), init_utsname()->release,
+		(int)strcspn(init_utsname()->version, " "),
+		init_utsname()->version);
+	printk("%s\"echo 0 > /proc/sys/kernel/%s\" disables this message.\n",
+		loglevel, path);
+	sched_show_task(t);
+}
+
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
 	unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -109,6 +138,9 @@  static void check_hung_task(struct task_struct *t, unsigned long timeout)
 		t->last_switch_time = jiffies;
 		return;
 	}
+	if (sysctl_hung_task_interval_warnings)
+		hung_task_warning(t, false);
+
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
 		return;
 
@@ -120,22 +152,10 @@  static void check_hung_task(struct task_struct *t, unsigned long timeout)
 		hung_task_call_panic = true;
 	}
 
-	/*
-	 * Ok, the task did not get scheduled for more than 2 minutes,
-	 * complain:
-	 */
 	if (sysctl_hung_task_warnings) {
-		if (sysctl_hung_task_warnings > 0)
-			sysctl_hung_task_warnings--;
-		pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
-		       t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
-		pr_err("      %s %s %.*s\n",
-			print_tainted(), init_utsname()->release,
-			(int)strcspn(init_utsname()->version, " "),
-			init_utsname()->version);
-		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
-			" disables this message.\n");
-		sched_show_task(t);
+		/* Don't print warings twice */
+		if (!sysctl_hung_task_interval_warnings)
+			hung_task_warning(t, true);
 		hung_task_show_lock = true;
 	}
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..f12888971d66 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1147,6 +1147,14 @@  static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &neg_one,
 	},
+	{
+		.procname	= "hung_task_interval_warnings",
+		.data		= &sysctl_hung_task_interval_warnings,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &neg_one,
+	},
 #endif
 #ifdef CONFIG_RT_MUTEXES
 	{