From patchwork Thu Feb 21 06:16:51 2013
X-Patchwork-Submitter: Jason Liu
X-Patchwork-Id: 2170591
Date: Thu, 21 Feb 2013 14:16:51 +0800
Subject: Re: too many timer retries happen when doing local timer switch with broadcast timer
From: Jason Liu
To: Thomas Gleixner
Cc: LKML, linux-arm-kernel@lists.infradead.org

2013/2/20 Thomas Gleixner :
> On Wed, 20 Feb 2013, Jason Liu wrote:
>> void arch_idle(void)
>> {
>>         ....
>>         clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>>
>>         enter_the_wait_mode();
>>
>>         clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>> }
>>
>> When the broadcast timer interrupt arrives (this interrupt just wakes up
>> the ARM core, and the core has no chance to handle it since the local
>> irq is disabled.
>> In fact it is disabled in cpu_idle() of arch/arm/kernel/process.c),
>> the broadcast timer interrupt will wake up the CPU and run:
>>
>> clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>>     -> tick_broadcast_oneshot_control(...);
>>         -> tick_program_event(dev->next_event, 1);
>>             -> tick_dev_program_event(dev, expires, force);
>>                 ->
>>                 for (i = 0;;) {
>>                         int ret = clockevents_program_event(dev, expires, now);
>>                         if (!ret || !force)
>>                                 return ret;
>>
>>                         dev->retries++;
>>                         ....
>>                         now = ktime_get();
>>                         expires = ktime_add_ns(now, dev->min_delta_ns);
>>                 }
>>
>> and in clockevents_program_event(dev, expires, now):
>>
>>         delta = ktime_to_ns(ktime_sub(expires, now));
>>
>>         if (delta <= 0)
>>                 return -ETIME;
>>
>> When the bc timer interrupt arrives, the last local timer has expired
>> too. So clockevents_program_event() will return -ETIME, which causes
>> the dev->retries++ when retrying to program the already expired timer.
>>
>> Even in the worst case, after re-programming the expired timer, the
>> CPU enters idle again quickly before the re-programmed timer expires,
>> which will make the system ping-pong forever,
>
> That's nonsense.

I don't think so.

> The timer IPI brings the core out of the deep idle state.
>
> So after returning from enter_wait_mode() and after calling
> clockevents_notify() it returns from arch_idle() to cpu_idle().
>
> In cpu_idle() interrupts are reenabled, so the timer IPI handler is
> invoked. That calls the event_handler of the per cpu local clockevent
> device (the one which stops in C3). That ends up in the generic timer
> code which expires timers and reprograms the local clock event device
> with the next pending timer.
>
> So you cannot go idle again, before the expired timers of this event
> are handled and their callbacks invoked.

That's true for the CPUs which do not respond to the global timer
interrupt.
Take our platform as an example: we have 4 CPUs (CPU0, CPU1, CPU2, CPU3).
The global timer device keeps running even in deep idle mode, so it can be
used as the broadcast timer device. The interrupt of this device is raised
only to CPU0 when the timer expires; CPU0 then broadcasts the timer IPI to
the other CPUs which are in deep idle mode.

So for CPU1, CPU2 and CPU3 you are right: the timer IPI brings them out of
the idle state. After running clockevents_notify() they return from
arch_idle() to cpu_idle(), then local_irq_enable(); the IPI handler is
invoked, handles the expired timers and re-programs the next pending
timer.

But that's not true for CPU0. The flow for CPU0 is: the global timer
interrupt wakes up CPU0, which then calls:

clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);

which does cpumask_clear_cpu(cpu, tick_get_broadcast_oneshot_mask()) in
tick_broadcast_oneshot_control(). After returning from
clockevents_notify(), it returns to cpu_idle() from arch_idle(), then
local_irq_enable(). CPU0 now responds to the global timer interrupt and
calls the interrupt handler tick_handle_oneshot_broadcast():

static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
{
        struct tick_device *td;
        ktime_t now, next_event;
        int cpu;

        raw_spin_lock(&tick_broadcast_lock);
again:
        dev->next_event.tv64 = KTIME_MAX;
        next_event.tv64 = KTIME_MAX;
        cpumask_clear(to_cpumask(tmpmask));
        now = ktime_get();
        /* Find all expired events */
        for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
                td = &per_cpu(tick_cpu_device, cpu);
                if (td->evtdev->next_event.tv64 <= now.tv64)
                        cpumask_set_cpu(cpu, to_cpumask(tmpmask));
                else if (td->evtdev->next_event.tv64 < next_event.tv64)
                        next_event.tv64 = td->evtdev->next_event.tv64;
        }

        /*
         * Wakeup the cpus which have an expired event.
         */
        tick_do_broadcast(to_cpumask(tmpmask));
        ...
}

Since CPU0 has been removed from tick_get_broadcast_oneshot_mask(), and if
CPU1/2/3 all stay in idle with no expired timers, tmpmask will be empty
when tick_do_broadcast() is called:

static void tick_do_broadcast(struct cpumask *mask)
{
        int cpu = smp_processor_id();
        struct tick_device *td;

        /*
         * Check, if the current cpu is in the mask
         */
        if (cpumask_test_cpu(cpu, mask)) {
                cpumask_clear_cpu(cpu, mask);
                td = &per_cpu(tick_cpu_device, cpu);
                td->evtdev->event_handler(td->evtdev);
        }

        if (!cpumask_empty(mask)) {
                /*
                 * It might be necessary to actually check whether the devices
                 * have different broadcast functions. For now, just use the
                 * one of the first device. This works as long as we have this
                 * misfeature only on x86 (lapic)
                 */
                td = &per_cpu(tick_cpu_device, cpumask_first(mask));
                td->evtdev->broadcast(mask);
        }
}

If the mask is empty, tick_do_broadcast() does nothing and returns, which
makes CPU0 enter idle again quickly, and then the system ping-pongs there:

switch to bc timer -> wait -> bc timer expires -> wakeup -> switch to loc timer
        ^                                                            |
        |-----< enter idle <- reprogram the expired loc timer <------|

We did see this happen on the system: it makes the system stuck there with
no sched-tick handler running on CPU0. And the easy way to reproduce it is:

/* hack code: just for experiment */

We need to fix this common issue. Do you have a good idea about how to fix it?

> Now the reprogramming itself is a different issue.
>
> It's necessary as we have no information whether the wakeup was caused
> by the timer IPI or by something else.
>
> We might try to be clever and store the pending IPI in
> tick_handle_oneshot_broadcast() and avoid the reprogramming in that
> case. Completely untested patch below.

Thanks for the patch.
> Thanks,
>
>        tglx
>
> Index: linux-2.6/kernel/time/tick-broadcast.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/tick-broadcast.c
> +++ linux-2.6/kernel/time/tick-broadcast.c
> @@ -29,6 +29,7 @@
>  static struct tick_device tick_broadcast_device;
>  /* FIXME: Use cpumask_var_t. */
>  static DECLARE_BITMAP(tick_broadcast_mask, NR_CPUS);
> +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS);
>  static DECLARE_BITMAP(tmpmask, NR_CPUS);
>  static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
>  static int tick_broadcast_force;
> @@ -417,9 +418,10 @@ again:
>         /* Find all expired events */
>         for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
>                 td = &per_cpu(tick_cpu_device, cpu);
> -               if (td->evtdev->next_event.tv64 <= now.tv64)
> +               if (td->evtdev->next_event.tv64 <= now.tv64) {
>                         cpumask_set_cpu(cpu, to_cpumask(tmpmask));
> -               else if (td->evtdev->next_event.tv64 < next_event.tv64)
> +                       set_bit(cpu, tick_broadcast_pending);
> +               } else if (td->evtdev->next_event.tv64 < next_event.tv64)
>                         next_event.tv64 = td->evtdev->next_event.tv64;
>         }
>
> @@ -493,7 +495,8 @@ void tick_broadcast_oneshot_control(unsi
>                 cpumask_clear_cpu(cpu,
>                                   tick_get_broadcast_oneshot_mask());
>                 clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
> -               if (dev->next_event.tv64 != KTIME_MAX)
> +               if (dev->next_event.tv64 != KTIME_MAX &&
> +                   !test_and_clear_bit(cpu, tick_broadcast_pending))
>                         tick_program_event(dev->next_event, 1);
>         }
>  }

I have tested the patch: it can fix the retries on CPU1/2/3, but not on
CPU0. See cpu0: retries: 39042

root@~$ cat /proc/timer_list
Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
[...]
Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           3
 next_event:     3295010000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        39042

Tick Device: mode:     1
Per CPU device: 1
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     3295050000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        1

Tick Device: mode:     1
Per CPU device: 2
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     10740407770000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        0

Tick Device: mode:     1
Per CPU device: 3
Clock Event Device: local_timer
 max_delta_ns:   8624432320
 min_delta_ns:   1000
 mult:           2138893713
 shift:          32
 mode:           1
 next_event:     10737578200000000 nsecs
 set_next_event: twd_set_next_event
 set_mode:       twd_set_mode
 event_handler:  hrtimer_interrupt
 retries:        0

Jason Liu

The hack code used for the experiment, for reference:

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f113755..d142802 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -493,8 +493,12 @@ void tick_broadcast_oneshot_control(unsigned long reason)
                        cpumask_clear_cpu(cpu,
                                          tick_get_broadcast_oneshot_mask());
                        clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
-                       if (dev->next_event.tv64 != KTIME_MAX)
-                               tick_program_event(dev->next_event, 1);
+                       if (dev->next_event.tv64 != KTIME_MAX) {
+                               if (cpu)
+                                       tick_program_event(dev->next_event, 1);
+                               else
+                                       tick_program_event(ktime_add_ns(dev->next_event, 100000), 1);
+                       }
                }
        }