From patchwork Fri Apr 20 16:00:13 2018
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 10353119
X-Patchwork-Delegate: geert@linux-m68k.org
Date: Fri, 20 Apr 2018 18:00:13 +0200
From: Vincent Guittot
To: Niklas Söderlund, Heiner Kallweit
Cc: Peter Zijlstra, "Paul E. McKenney", Ingo Molnar, linux-kernel,
 linux-renesas-soc@vger.kernel.org
Subject: Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update
 blocked load when newly idle")
Message-ID: <20180420160013.GA13769@linaro.org>
References: <20180412091822.GG12256@bigcity.dyn.berto.se>
 <20180412111519.GH12256@bigcity.dyn.berto.se>
 <20180412133031.GA551@linaro.org>
 <20180412223904.GJ12256@bigcity.dyn.berto.se>

Hi Heiner and Niklas,

On Saturday 14 Apr 2018 at 13:24:20 (+0200), Vincent Guittot wrote:
> Hi Niklas,
>
> On 13 April 2018 at 00:39, Niklas Söderlund wrote:
> > Hi Vincent,
> >
> > Thanks for helping trying to figure this out.
> >
> > On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote:
> >
> > [snip]
> >
> >>
> >> I'd like to narrow the problem a bit more with the 2 patches above. Can you try
> >> them separately on top of c18bb396d3d261eb ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >> and check if one of them fixes the problem?
> >
> > I tried your suggested changes based on top of c18bb396d3d261eb.
> >
> >>
> >> (They should apply on linux-next as well)
> >>
> >> The first patch always kicks ilb instead of doing ilb on the local CPU before entering idle
> >>
> >> ---
> >>  kernel/sched/fair.c | 3 +--
> >>  1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 0951d1c..b21925b 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >>  	 * candidate for ilb instead of waking up another idle CPU.
> >>  	 * Kick an normal ilb if we failed to do the update.
> >>  	 */
> >> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >> -		kick_ilb(NOHZ_STATS_KICK);
> >> +	kick_ilb(NOHZ_STATS_KICK);
> >>  	raw_spin_lock(&this_rq->lock);
> >>  }
> >
> > This change doesn't seem to affect the issue. I can still get the single
> > ssh session and the system to lock up by hitting the return key. And
> > opening a second ssh session immediately unblocks both the first ssh
> > session and the serial console. And I can still trigger the console
> > warning by just letting the system be once it locks up. I do have to,
> > just as before, reset the system a few times to trigger the issue.
>
> Your results are similar to Heiner's. The problem is still there
> even if we only kick ilb, which mainly sends a reschedule IPI to the
> other CPU if it is idle.
>

Would it be possible to record some traces of the problem to get a better view
of what happens? I have a small patch that adds some traces in the functions
that seem to create the problem.

---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c..a951464 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9606,6 +9606,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 	 */
 	WRITE_ONCE(nohz.has_blocked, 0);
 
+	trace_printk("_nohz_idle_balance cpu %d idle %d flag %x", this_cpu, idle, flags);
+
 	/*
 	 * Ensures that if we miss the CPU, we must see the has_blocked
 	 * store from nohz_balance_enter_idle().
@@ -9680,6 +9682,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
 	if (likely(update_next_balance))
 		nohz.next_balance = next_balance;
 
+	trace_printk("_nohz_idle_balance return %d", ret);
+
 	return ret;
 }
 
@@ -9732,6 +9736,8 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
+	trace_printk("nohz_newidle_balance start update");
+
 	raw_spin_unlock(&this_rq->lock);
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs