From patchwork Tue Aug 22 01:54:39 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nicholas Piggin X-Patchwork-Id: 9914089 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 688E5600C5 for ; Tue, 22 Aug 2017 01:56:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 561DD2881D for ; Tue, 22 Aug 2017 01:56:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4A90328829; Tue, 22 Aug 2017 01:56:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.6 required=2.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED, DKIM_VALID, FREEMAIL_FROM, RCVD_IN_DNSWL_LOW autolearn=unavailable version=3.3.1 Received: from bombadil.infradead.org (bombadil.infradead.org [65.50.211.133]) (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 807CB2881D for ; Tue, 22 Aug 2017 01:56:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20170209; h=Sender: Content-Transfer-Encoding:Content-Type:MIME-Version:Cc:List-Subscribe: List-Help:List-Post:List-Archive:List-Unsubscribe:List-Id:Message-Id:Date: Subject:To:From:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:In-Reply-To: References:List-Owner; bh=8ZMkAl6rQRf+ZeUjjqoewg+70zSZHcGA7a7Ux+sYcLM=; b=nqg AETfUZnLPJcFWXfzWUBhi5nYN7Uyv7mXOlOxMzXeFNNi54KgzXT3yDJYHq5Rqo8cfq2/TawkAY+go K+uZ9hBRWu/afM/WYhashvlyI+sWAemqied4Y6NR5xaGOsf1dLmQo7geytBV2H4uNK/oT2+zAYtT6 zlxZ7JUoCZq1pbk8oOhfltBdUU/DN3FOkG1DU49vkfJX23WrqpT+UB7ZL0pcxsNo15wBrpHXlhax4 VPx5j0B/h4fLnzIAaom1rGkqSKhnVwdBOMCiETK4sruT00vcbDOJ32qt+ZpB5nvl8sqddpj5dfELm Mf2w/gC8VqUnyY2rdiGkmbkfIZAGO7g==; Received: from localhost ([127.0.0.1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.87 #1 (Red Hat Linux)) id 1djyQ9-0000NA-AQ; Tue, 22 Aug 2017 01:55:57 +0000 Received: from mail-pg0-x241.google.com ([2607:f8b0:400e:c05::241]) by bombadil.infradead.org with esmtps (Exim 4.87 #1 (Red Hat Linux)) id 1djyQ5-0008DQ-PX for linux-arm-kernel@lists.infradead.org; Tue, 22 Aug 2017 01:55:55 +0000 Received: by mail-pg0-x241.google.com with SMTP id t3so5719501pgt.5 for ; Mon, 21 Aug 2017 18:55:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id; bh=QnBo48oYEQ9eVJuff8X2Z4etQcECbVDfmjGHK3NU9kk=; b=uQ6JV4jI+sKq1JOaWmfGpfrxrmSxwEZlQjHG5t0hLW1qhYuAYKYrReI+wEXJu7fn57 fn5Di6T8JNe3OxKguSRuvAyPMmz1W2kMEFfrQiAUkLBk9mQ9Cz0ngYR/OAOhdb72gnJe ZR21hsQj6uth8RGCDR4r9dvi6kzBNk88/TRhd6q0+K4wbhyMJIlwzm6L9dlSPFo/+1me HfUFbvzOK1ot7/A1kYtz6OgT5zxGUWpWkKki0wcN8X6hCKUIAMQ/1gimlmOAuqqHGqcF 11L9VQ7zQAdlCtVBb5M1nWmXEHyupmTHfvQu19iGtuQRSiPEjfolnhSddfVsvjcppewo 5ZFg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=QnBo48oYEQ9eVJuff8X2Z4etQcECbVDfmjGHK3NU9kk=; b=WdaAYVXOPm2cSugnfO0KBVv3txjKKF5EI8b5kAP8sDpYmjMbHXR3n8sSCOgw8NLw9f NdpovRbCuKLniRNO8YQLQtdQRy9A2BND/dF3u0mZAX6RiuZ3mAoJDbeUrbGkNo3jY3x6 PEk8U9MYtIq4Q38hMTjIkmWOK91GC/Syb/WTl2uA49mN6SRoyX2Rj8IGtOV49nrUPlN9 4KxJDMS4EBPKsx2LYsSFV8+hWxjynUroZx0VqAyBDBZyhXJOl/uRxl3oK+l9uZaPHt3y 4TPGwF2Xl39Hk6oLjoV6kqnDRrmtjWeYK5uIax6pXs8oz2R5RtrE0Zm48MKLB8tFGiaF shHw== X-Gm-Message-State: AHYfb5gJMT4AmdxSN8qT6pnCDFJiB0tU3XQQqesNdSrJ8mB6m4DrLqx2 Bevzkvh3QrSUoQ== X-Received: by 10.98.155.26 with SMTP id r26mr3254070pfd.43.1503366931781; Mon, 21 Aug 2017 18:55:31 -0700 (PDT) Received: from roar.au.ibm.com (203-219-56-202.tpgi.com.au. [203.219.56.202]) by smtp.gmail.com with ESMTPSA id u12sm23454449pgq.24.2017.08.21.18.55.24 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 21 Aug 2017 18:55:30 -0700 (PDT) From: Nicholas Piggin To: akpm@linux-foundation.org, John Stultz , Thomas Gleixner , Stephen Boyd Subject: [PATCH resend] timers: Fix excessive granularity of new timers after a nohz idle Date: Tue, 22 Aug 2017 11:54:39 +1000 Message-Id: <20170822015439.11263-1-npiggin@gmail.com> X-Mailer: git-send-email 2.13.3 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20170821_185553_867690_D5D4B2E4 X-CRM114-Status: GOOD ( 23.13 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: dzickus@redhat.com, sfr@canb.auug.org.au, mpe@ellerman.id.au, torvalds@linux-foundation.org, linuxarm@huawei.com, Nicholas Piggin , linux-kernel@vger.kernel.org, abdhalee@linux.vnet.ibm.com, Jonathan.Cameron@huawei.com, sparclinux@vger.kernel.org, paulmck@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org, linux-arm-kernel@lists.infradead.org MIME-Version: 1.0 Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org X-Virus-Scanned: ClamAV using ClamSMTP When a timer base is idle, it is forwarded when a new timer is added to ensure that granularity does not become excessive. When not idle, the timer tick is expected to increment the base. However there are several problems: - If an existing timer is modified, the base is forwarded only after the index is calculated. - The base is not forwarded by add_timer_on. - There is a window after a timer is restarted from a nohz ide, after it is marked not-idle and before the timer tick on this CPU, where a timer may be added but the ancient base does not get forwarded. These result in excessive granularity (a 1 jiffy timeout can blow out to 100s of jiffies), which cause the rcu lockup detector to trigger, among other things. Fix this by keeping track of whether the timer base has been idle since it was last run or forwarded, and if so then forward it before adding a new timer. There is still a problem where the mod_timer optimization where it's modified with the same expiry time can result in excessive granularity relative to the new shorter interval. That is not addressed by this patch because checking base->was_idle would increase overhead and it's a rather special case (you could reason that the caller should not expect change in absolute expiry time due to such an operation). So that is noted as a comment. As well as fixing the visible RCU softlockup failures, I tested an idle system (with no lockup watchdogs running) and traced all non-deferrable timer expiries for 1000s, and analysed wakeup latency relative to requested latency. 1.0 means we slept for as many jiffies as requested, 2.0 means we slept 2x the time (this suffers jiffies round-up skew at low absolute times): max avg std upstream 506.0 1.20 4.68 patched 2.0 1.08 0.15 This was noticed due to the lockup detector Kconfig changes dropping it out of people's .configs. When the lockup detectors are enabled, no CPU can go idle for longer than 4 seconds, which limits the granularity errors. Sub-optimal timer behaviour is observable on a smaller scale: max avg std upstream 9.0 1.05 0.19 patched 2.0 1.04 0.11 Tested-by: David Miller Signed-off-by: Nicholas Piggin --- Hi Andrew, I would have preferred to get comments from the timer maintainers, but they've been busy or away for the past copule of weeks. Perhaps you would consider carrying it until then? Thanks, Nick kernel/time/timer.c | 44 ++++++++++++++++++++++++++++++++++++-------- 1 file changed, 36 insertions(+), 8 deletions(-) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 8f5d1bf18854..dd7be9fe6839 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -203,6 +203,7 @@ struct timer_base { bool migration_enabled; bool nohz_active; bool is_idle; + bool was_idle; /* was it idle since last run/fwded */ DECLARE_BITMAP(pending_map, WHEEL_SIZE); struct hlist_head vectors[WHEEL_SIZE]; } ____cacheline_aligned; @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags) static inline void forward_timer_base(struct timer_base *base) { - unsigned long jnow = READ_ONCE(jiffies); + unsigned long jnow; /* - * We only forward the base when it's idle and we have a delta between - * base clock and jiffies. + * We only forward the base when we are idle or have just come out + * of idle (was_idle logic), and have a delta between base clock + * and jiffies. In the common case, run_timers will take care of it. */ - if (!base->is_idle || (long) (jnow - base->clk) < 2) + if (likely(!base->was_idle)) + return; + + jnow = READ_ONCE(jiffies); + base->was_idle = base->is_idle; + if ((long)(jnow - base->clk) < 2) return; /* @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) * same array bucket then just return: */ if (timer_pending(timer)) { + /* + * The downside of this optimization is that it can result in + * larger granularity than you would get from adding a new + * timer with this expiry. Would a timer flag for networking + * be appropriate, then we can try to keep expiry of general + * timers within ~1/8th of their interval? + */ if (timer->expires == expires) return 1; @@ -948,6 +962,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) * dequeue/enqueue dance. */ base = lock_timer_base(timer, &flags); + forward_timer_base(base); clk = base->clk; idx = calc_wheel_index(expires, clk); @@ -964,6 +979,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) } } else { base = lock_timer_base(timer, &flags); + forward_timer_base(base); } ret = detach_if_pending(timer, base, false); @@ -991,12 +1007,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only) raw_spin_lock(&base->lock); WRITE_ONCE(timer->flags, (timer->flags & ~TIMER_BASEMASK) | base->cpu); + forward_timer_base(base); } } - /* Try to forward a stale timer base clock */ - forward_timer_base(base); - timer->expires = expires; /* * If 'idx' was calculated above and the base time did not advance @@ -1112,6 +1126,7 @@ void add_timer_on(struct timer_list *timer, int cpu) WRITE_ONCE(timer->flags, (timer->flags & ~TIMER_BASEMASK) | cpu); } + forward_timer_base(base); debug_activate(timer, timer->expires); internal_add_timer(base, timer); @@ -1499,8 +1514,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem) /* * If we expect to sleep more than a tick, mark the base idle: */ - if ((expires - basem) > TICK_NSEC) + if ((expires - basem) > TICK_NSEC) { + base->was_idle = true; base->is_idle = true; + } } raw_spin_unlock(&base->lock); @@ -1611,6 +1628,17 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h) { struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]); + /* + * was_idle must be cleared before running timers so that any timer + * functions that call mod_timer will not try to forward the base. + * + * The deferrable base does not do idle tracking at all, so we do + * not forward it. This can result in very large variations in + * granularity for deferrable timers, but they can be deferred for + * long periods due to idle. + */ + base->was_idle = false; + __run_timers(base); if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active) __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));