From patchwork Thu Feb 27 21:16:11 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Simmons
X-Patchwork-Id: 11410873
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2FF231580
	for ; Thu, 27 Feb 2020 21:49:02 +0000 (UTC)
Received: from pdx1-mailman02.dreamhost.com (pdx1-mailman02.dreamhost.com [64.90.62.194])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 18BDF24690
	for ; Thu, 27 Feb 2020 21:49:02 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 18BDF24690
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lustre-devel-bounces@lists.lustre.org
Received: from pdx1-mailman02.dreamhost.com (localhost [IPv6:::1])
	by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id 3F26834A55B;
	Thu, 27 Feb 2020 13:39:26 -0800 (PST)
X-Original-To: lustre-devel@lists.lustre.org
Delivered-To: lustre-devel-lustre.org@pdx1-mailman02.dreamhost.com
Received: from smtp3.ccs.ornl.gov (smtp3.ccs.ornl.gov [160.91.203.39])
	by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id F2216348868
	for ; Thu, 27 Feb 2020 13:20:54 -0800 (PST)
Received: from star.ccs.ornl.gov (star.ccs.ornl.gov [160.91.202.134])
	by smtp3.ccs.ornl.gov (Postfix) with ESMTP id 2FFB19190;
	Thu, 27 Feb 2020 16:18:19 -0500 (EST)
Received: by star.ccs.ornl.gov (Postfix, from userid 2004)
	id 2EF9146C; Thu, 27 Feb 2020 16:18:19 -0500 (EST)
From: James Simmons
To: Andreas Dilger, Oleg Drokin, NeilBrown
Date: Thu, 27 Feb 2020 16:16:11 -0500
Message-Id: <1582838290-17243-504-git-send-email-jsimmons@infradead.org>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
References: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
Subject: [lustre-devel] [PATCH 503/622] lustre: ptlrpc: fix watchdog ratelimit logic
X-BeenThere: lustre-devel@lists.lustre.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "For discussing Lustre software development."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Cc: Lustre Development List
MIME-Version: 1.0
Errors-To: lustre-devel-bounces@lists.lustre.org
Sender: "lustre-devel"

From: Andreas Dilger

The ptlrpc-level watchdog ratelimiting is broken. The kernel prints:

  mdt00_009: service thread pid 18935 was inactive for 72s.
  Watchdog stack traces are limited to 3 per 300s, skipping...

even though there hasn't been any stack trace printed before.
It looks like the __ratelimit() return value is backward from what
one would expect from normal English grammar, namely that if
__ratelimit() returns true the action should NOT be limited.

Fix the logic checking the __ratelimit() return value, and add a
check in sanity test_422 (which forces a service thread timeout) to
ensure that the watchdog sometimes prints a full stack.

Fixes: aeaf46886c7b ("lustre: ptlrpc: add watchdog for ptlrpc service threads")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12838
Lustre-commit: 594c79f2f855 ("LU-12838 ptlrpc: fix watchdog ratelimit logic")
Signed-off-by: Andreas Dilger
Reviewed-on: https://review.whamcloud.com/36409
Reviewed-by: James Simmons
Reviewed-by: Neil Brown
Reviewed-by: Oleg Drokin
Signed-off-by: James Simmons
---
 fs/lustre/ptlrpc/service.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index b2a33a3..fe0e108 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2067,7 +2067,8 @@ static void ptlrpc_watchdog_fire(struct work_struct *w)
 	s64 ms_lapse = ktime_ms_delta(ktime_get(), thread->t_touched);
 	u32 ms_frac = do_div(ms_lapse, MSEC_PER_SEC);
 
-	if (!__ratelimit(&watchdog_limit)) {
+	/* ___ratelimit() returns true if the action is NOT ratelimited */
+	if (__ratelimit(&watchdog_limit)) {
 		/* below message is checked in sanity-quota.sh test_6,18 */
 		LCONSOLE_WARN("%s: service thread pid %u was inactive for %llu.%.03u seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:\n",
 			      thread->t_task->comm, thread->t_task->pid,
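
For readers who want to see the inverted check in isolation, here is a minimal
userspace sketch. It is not the kernel implementation: struct model_ratelimit,
model_ratelimit() and the two dumps_stack_*() helpers are hypothetical names
made up for illustration, and the fixed-window logic is a simplification. The
only thing it is meant to mirror is the return convention described in the
commit message (true means the action is NOT ratelimited), using the
"3 per 300s" limit quoted there.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a ratelimit state (not kernel API). */
struct model_ratelimit {
	long interval;	/* window length, in seconds */
	int burst;	/* actions allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* actions allowed so far in this window */
	bool started;	/* false until the first call opens a window */
};

/*
 * Same return convention as described for __ratelimit() in the commit
 * message: true means "go ahead, NOT ratelimited", false means "suppress".
 */
bool model_ratelimit(struct model_ratelimit *rs, long now)
{
	assert(rs);

	/* Open a fresh window on first use or once the old one has expired. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->started = true;
		rs->begin = now;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/* Pre-patch condition: the full stack dump was guarded by the negation. */
bool dumps_stack_before_patch(struct model_ratelimit *rs, long now)
{
	return !model_ratelimit(rs, now);
}

/* Post-patch condition: dump the stack while the limiter still allows it. */
bool dumps_stack_after_patch(struct model_ratelimit *rs, long now)
{
	return model_ratelimit(rs, now);
}
```

With interval = 300 and burst = 3, the pre-patch predicate is false on the very
first call, which corresponds to the "skipping..." message appearing before any
stack trace had been printed. The post-patch predicate allows three dumps and
then suppresses further ones until the window rolls over.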