From patchwork Fri Dec 15 16:16:08 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Doug Smythies X-Patchwork-Id: 10115473 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 4E1016019C for ; Fri, 15 Dec 2017 16:16:18 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3D8732917E for ; Fri, 15 Dec 2017 16:16:18 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2EDA72A035; Fri, 15 Dec 2017 16:16:18 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A1AB12917E for ; Fri, 15 Dec 2017 16:16:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932479AbdLOQQQ (ORCPT ); Fri, 15 Dec 2017 11:16:16 -0500 Received: from cmta17.telus.net ([209.171.16.90]:35496 "EHLO cmta17.telus.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932396AbdLOQQP (ORCPT ); Fri, 15 Dec 2017 11:16:15 -0500 Received: from dougxps ([173.180.45.4]) by cmsmtp with SMTP id PseeemR3G5YhBPsege3iWf; Fri, 15 Dec 2017 09:16:14 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=telus.net; s=neo; t=1513354574; bh=t96DTM6QiTisa1ZOKwxjMLp9BZonuJGV4oRDWhaJq3I=; h=From:To:Cc:References:In-Reply-To:Subject:Date; b=W8m3QCYLZnYUfwG9/6WdYuyP3dl6fkFePmVydjGIqoS6ilv2NvE/68keexauH/RYn 8zdZB6WDcKhYrEVRoKQH7wwsIr4rLUc/AUj2crJ0unfJW+PpndoA15iJ4OpKLZ3PfR wrv4Qouys8yoVgbcIF/9oY88KsUz7PMGEONeageiGE7DxIcmBnr+roWw+EAleLf/1J 5M7RKV+BW/9KPGRZ8QN67+k02WMafPiNaOxH1g77JiMGCva5Mon8lhWu1ujjCaX2dK XACwqguz7EdJX80duZw+BT41sLceuVnqEo32BGx/Sk/DwLqG8bsnunIzW40I6m91l8 NJHIbmNjG8K3A== X-Authority-Analysis: v=2.2 cv=eOz8tDh1 c=1 sm=1 tr=0 a=zJWegnE7BH9C0Gl4FFgQyA==:117 a=zJWegnE7BH9C0Gl4FFgQyA==:17 a=Pyq9K9CWowscuQLKlpiwfMBGOR0=:19 a=IkcTkHD0fZMA:10 a=CTQnQQS24jOZT-lbIJEA:9 a=h4GMM_4rBtR8CirA:21 a=xmTEqy5tSgOBMVlm:21 a=QEXdDO2ut3YA:10 From: "Doug Smythies" To: "'Rafael J. Wysocki'" Cc: "'Len Brown'" , "'Yu Chen'" , =?utf-8?Q?'Marcus_H=C3=A4hnel'?= , "'Daniel Hackenberg'" , =?utf-8?Q?'Robert_Sch=C3=B6ne'?= , , "'Rafael J. Wysocki'" , "'Alex Shi'" , "'Ingo Molnar'" , "'Rik van Riel'" , "'Daniel Lezcano'" , "'Nicholas Piggin'" , "'Linux PM'" , "'Peter Zijlstra'" , "'Thomas Gleixner'" , "'Thomas Ilsche'" , "Doug Smythies" References: <000101d34938$da740870$8f5c1950$@net> <000801d34a78$cdd27890$697769b0$@net> <002c01d35d0f$8b0416f0$a10c44d0$@net> <000a01d3654a$c4996990$4dcc3cb0$@net> <17b7bd4f-6b4d-2561-8895-f1dd32f9a171@tu-dresden.de> <25c78d66-9c2a-012c-6574-a687519d4820@tu-dresden.de> Pqt7eQIAF7WlePqtGetBgu In-Reply-To: Pqt7eQIAF7WlePqtGetBgu Subject: RE: [PATCH] cpuidle: Allow menu governor to enter deeper sleep states after some time Date: Fri, 15 Dec 2017 08:16:08 -0800 Message-ID: <001b01d375c0$07738c20$165aa460$@net> MIME-Version: 1.0 X-Mailer: Microsoft Office Outlook 12.0 Content-Language: en-ca Thread-Index: AdN1sDnmxitouDXoSE6XF3m3z1ktNQAB5VZw X-CMAE-Envelope: MS4wfPzGdC04Q3qKh0r1RkEyViPoAEAMZ3wuSVgyw8HfJnSSws9kxl/RbaIO1Jb/VnIrB4iWYGgp3YlUYyNR4jh/e37JVLAKUhAM5VfCKhstLh5a2dIw2jyI /10f+iIwf47Rb2ffeZzvsGsLdyw1vL304jivHG3vvrtZ3LS8KrW9fdhXnJS7UM56ijsUyuuloXRVt1Tk6hRzq2cw3WZ1sdzp9CdKHe2sxmEEaW0YTeExSD+G nLrnB5dTo/ydNmD6ZS594Scq0+qHeySoAHf/Et7NUZjWKLw6v6XcRkEq0VejWP7qNyevWR7h4RseoypLhQtKZLMWxjSmq4SGJK8QVU5M620JdZ0jEYNHl8/z EXiTG29i13jngrDkCMkYXX/bG4Y/Jmd8EN6pc06kcuxijiiQPV2Vft5utFk20gaek8i1M/0sPo7/MVMfXJ+SFjTlfHfqN89rlXwGK1GXGR2NEYnv4aKRew3p B7WlGQmOyv+tPOWJSoweR/DHjHYqpUDHsAbwf1YQ7s55c3Wm0w4+BsVU+cZKbEWX6Go2vbNPmY+MfkoVCpWcZQIClLQZfhdr76t/4sGw64CYSixWDYafIlNK JzJylu7iY0fV1npHRn5CexMd97i2FAtVC+awno6DAUD9hqEo7jIpveGIvBbE/MhoP7pxPeC0CBRGSKI9bPK/KcALp9gHwrTqrPUL+i17K6QPWgCHAWSVXwW/ J6GOULECK2eihL1D045JSzb/8UykgpV7 Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 2017.12.15 06:23 Rafael J. Wysocki wrote: > On Fri, Dec 15, 2017 at 11:44 AM, Thomas Ilsche wrote: > >> we just saw very bad Powernightmares on a freshly installed dual socket 36 >> core During the gap in this thread, I have seen some pretty bad ones also, but not to the magnitude that you do. > Can you please refrain from using marketing language for describing > technical issues? Ugh oh. I have just changed my web notes on this to use that term. >> This issue is probably wasting a couple Megawatts globally, I agree that there appears to be significant energy that can be saved. >> I like to reinforce the offer >> to help develop and test a patch, Me also, however and as previously mentioned, I am stuck. >> We acknowledge your suggestion to keep the scheduling ticks running instead >> of using an additional fallback timer. Still, we need a solution with >> manageable code changes without tightly coupling different software layers. >> Unfortunately, the tick-timer is disabled early in the outer idle-loop in >> do_idle(), but the C-state decision is taken in the governor. >> >> In your opinion, what is the best solution? >> (a) re-enable the tick-timer? (not a very clean solution) >> (b) moving the tick-timer disabling? > > Yes. > >> (could be very intrusive in terms of LOC, >> we have no overview on which assumptions would break if we do this) > > Regardless, this is the long-term way to go IMO not just to address > this issue, but also to avoid unneeded overhead of stopping the tick. > > The tick should only be stopped when we know for a fact that it is > worth the overhead which is only when we know that the target > residency of the idle state we want to enter is longer than the time > to the next tick. Note that this doesn't depend on the idle governor > in use at all. > > And since this is the long-term way to go anyway, why don't we just do > it right away? Maybe that would also solve the C0 problem, I don't know. >> >> Regarding the C0/poll_idle problem, what is suitable as abort criterion: >> (a) interrupt counts >> (b) tick_nohz_get_sleep_length >> (c) tick_sched->idle_expires > > Can you please remind me what that problem is exactly? The C0 version of the problem is what I have been investigating for a long time now. In an off-list e-mail, Yu claims to have found the root issue. I disagree that it explains all the long poll_idle scenarios I have found. Obviously, I would test any patch when available. The problem is within the main loop of drivers/cpuidle/poll_state.c: while (!need_resched()) cpu_relax(); because not all isr's set the need_resched flag, and therefore it can end up sitting in that loop for multiple seconds, instead of the expected few uSecs. You may or may not recall that my main test scenario for this is a 100% load on one CPU. Typically I observe something like for every unit of time properly spent in idle state 0, the most energy costly idle state, roughly 60,000 units of time are spent incorrectly in idle state 0, wasting energy. Testing needs to done over hours, because the issue tends to come in clusters being aggravated by some sort of beat frequencies between events. My proposed solution was this: However, I do not like that I am disabling and re-enabling local interrupts in some loop, albeit only once per 1024 main loops. I do not know if there is another way to get the same information without disabling interrupts. I have not been able to figure out how to replace my "100" microseconds arbitrary exit condition with some useful, machine specific exit criteria. The root issue for me, is that I am not very familiar with the idle code. ... Doug diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c index 7416b16..4d17d3d 100644 --- a/drivers/cpuidle/poll_state.c +++ b/drivers/cpuidle/poll_state.c @@ -5,16 +5,31 @@ */ #include +#include #include #include static int __cpuidle poll_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { + unsigned int next_timer_us, i; + local_irq_enable(); if (!current_set_polling_and_test()) { - while (!need_resched()) + while (!need_resched()){ cpu_relax(); + + /* Occasionally check for a new and long expected residency time. */ + if (!(i++ % 1024)) { + local_irq_disable(); + next_timer_us = ktime_to_us(tick_nohz_get_sleep_length()); + local_irq_enable(); + /* need a better way to get threshold, including large margin */ + /* We are only trying to catch really bad cases here. */ + if (next_timer_us > 100) + break; + } + } } current_clr_polling();