| Message ID | alpine.LFD.2.02.1302211436300.22263@ionos (mailing list archive) |
|---|---|
| State | New, archived |
Thomas,

On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>>> Now your explanation makes sense.
>>>
>>> I have no fast solution for this, but I think that I have an idea how
>>> to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>
After looking at the thread, I tried to see the issue on OMAP and could
see the same issue as Jason.

Your patch fixes the retries on both CPUs on my dual core machine. So
you use my tested by if you need one.

Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

Thanks for the patch. And thanks to Jason for spotting the issue.

Regards,
Santosh
On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> > find below a completely untested patch, which should address that issue.
> >
> After looking at the thread, I tried to see the issue on OMAP and could
> see the same issue as Jason.

That's interesting. We have the same issue on x86 since 2007 and
nobody noticed ever. It's basically the same problem there, but it
seems that on x86 getting out of those low power states is way slower
than the minimal reprogramming delta which is used to enforce the
local timer to fire after the wakeup.

I'm still amazed that as Jason stated a 1us reprogramming delta is
sufficient to get this ping-pong going. I somehow doubt that, but
maybe ARM is really that fast :)

> Your patch fixes the retries on both CPUs on my dual core machine. So
> you use my tested by if you need one.

They are always welcome.

> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
>
> Thanks for the patch.
> And thanks to Jason for spotting the issue.

And for coping with my initial inability to parse his report!

Thanks,

	tglx
Thomas,

On Friday 22 February 2013 03:49 AM, Thomas Gleixner wrote:
> On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
>> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
>>> find below a completely untested patch, which should address that issue.
>>>
>> After looking at the thread, I tried to see the issue on OMAP and could
>> see the same issue as Jason.
>
> That's interesting. We have the same issue on x86 since 2007 and
> nobody noticed ever. It's basically the same problem there, but it
> seems that on x86 getting out of those low power states is way slower
> than the minimal reprogramming delta which is used to enforce the
> local timer to fire after the wakeup.
>
> I'm still amazed that as Jason stated a 1us reprogramming delta is
> sufficient to get this ping-pong going. I somehow doubt that, but
> maybe ARM is really that fast :)
>
>> Your patch fixes the retries on both CPUs on my dual core machine. So
>> you use my tested by if you need one.
>
> They are always welcome.
>
BTW, Lorenzo off-list mentioned to me about warning in boot-up
which I missed while testing your patch. It will take bit more
time for me to look into it and hence thought of reporting it.
[ 2.186126] ------------[ cut here ]------------
[ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501 tick_broadcast_oneshot_control+0x1c0/0x21c()
[ 2.200622] Modules linked in:
[ 2.203826] [<c001bfe4>] (unwind_backtrace+0x0/0xf0) from [<c0047d6c>] (warn_slowpath_common+0x4c/0x64)
[ 2.213653] [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) from [<c0047da0>] (warn_slowpath_null+0x1c/0x24)
[ 2.223754] [<c0047da0>] (warn_slowpath_null+0x1c/0x24) from [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c)
[ 2.234924] [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) from [<c00928dc>] (tick_notify+0x23c/0x42c)
[ 2.245666] [<c00928dc>] (tick_notify+0x23c/0x42c) from [<c0539a3c>] (notifier_call_chain+0x44/0x84)
[ 2.255218] [<c0539a3c>] (notifier_call_chain+0x44/0x84) from [<c0071068>] (raw_notifier_call_chain+0x18/0x20)
[ 2.265686] [<c0071068>] (raw_notifier_call_chain+0x18/0x20) from [<c0091c70>] (clockevents_notify+0x2c/0x174)
[ 2.276123] [<c0091c70>] (clockevents_notify+0x2c/0x174) from [<c0035294>] (omap_enter_idle_smp+0x3c/0x120)
[ 2.286315] [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) from [<c042e504>] (cpuidle_enter+0x14/0x18)
[ 2.295928] [<c042e504>] (cpuidle_enter+0x14/0x18) from [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0)
[ 2.305389] [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) from [<c042eb20>] (cpuidle_idle_call+0xe0/0x328)
[ 2.315307] [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) from [<c0015100>] (cpu_idle+0x8c/0x11c)
[ 2.324401] [<c0015100>] (cpu_idle+0x8c/0x11c) from [<c073d7ac>] (start_kernel+0x2b0/0x300)
[ 2.333129] ---[ end trace 6fe1f7b4606a9e20 ]---
On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> BTW, Lorenzo off-list mentioned to me about warning in boot-up
> which I missed while testing your patch. It will take bit more
> time for me to look into it and hence thought of reporting it.
>
> [ 2.186126] ------------[ cut here ]------------
> [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> tick_broadcast_oneshot_control+0x1c0/0x21c()

Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
Thomas,

2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>> > Now your explanation makes sense.
>> >
>> > I have no fast solution for this, but I think that I have an idea how
>> > to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>

I have tested the below patch, but run into the following warnings:

[ 713.340593] ------------[ cut here ]------------
[ 713.345238] WARNING: at /linux-2.6-imx/kernel/time/tick-broadcast.c:497 tick_broadcast_oneshot_control+0x210/0x254()
[ 713.359240] Modules linked in:
[ 713.362332] [<8004b2cc>] (unwind_backtrace+0x0/0xf8) from [<80079764>] (warn_slowpath_common+0x54/0x64)
[ 713.371740] [<80079764>] (warn_slowpath_common+0x54/0x64) from [<80079790>] (warn_slowpath_null+0x1c/0x24)
[ 713.381408] [<80079790>] (warn_slowpath_null+0x1c/0x24) from [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254)
[ 713.392120] [<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254) from [<800a48b0>] (tick_notify+0xd4/0x1f8)
[ 713.402317] [<800a48b0>] (tick_notify+0xd4/0x1f8) from [<8009b154>] (notifier_call_chain+0x4c/0x8c)
[ 713.411377] [<8009b154>] (notifier_call_chain+0x4c/0x8c) from [<8009b24c>] (raw_notifier_call_chain+0x18/0x20)
[ 713.421392] [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) from [<800a3d9c>] (clockevents_notify+0x30/0x198)
[ 713.431417] [<800a3d9c>] (clockevents_notify+0x30/0x198) from [<80052f58>] (arch_idle_multi_core+0x4c/0xc4)
[ 713.441175] [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) from [<80044f68>] (default_idle+0x20/0x28)
[ 713.450322] [<80044f68>] (default_idle+0x20/0x28) from [<800455c0>] (cpu_idle+0xc8/0xfc)
[ 713.458425] [<800455c0>] (cpu_idle+0xc8/0xfc) from [<80008984>] (start_kernel+0x260/0x294)
[ 713.466701] [<80008984>] (start_kernel+0x260/0x294) from [<10008040>] (0x10008040)

So, the code here which brings the warnings:

void tick_broadcast_oneshot_control(unsigned long reason)
{
	...
	WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
}

>
> Thanks,
>
>	tglx
>
> Index: linux-2.6/kernel/time/tick-broadcast.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/tick-broadcast.c
> +++ linux-2.6/kernel/time/tick-broadcast.c
> @@ -397,6 +397,8 @@ int tick_resume_broadcast(void)
>
>  /* FIXME: use cpumask_var_t. */
>  static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS);
> +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS);
> +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS);
>
>  /*
>   * Exposed for debugging: see timer_list.c
> @@ -453,12 +455,24 @@ again:
>  	/* Find all expired events */
>  	for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
>  		td = &per_cpu(tick_cpu_device, cpu);
> -		if (td->evtdev->next_event.tv64 <= now.tv64)
> +		if (td->evtdev->next_event.tv64 <= now.tv64) {
>  			cpumask_set_cpu(cpu, to_cpumask(tmpmask));
> -		else if (td->evtdev->next_event.tv64 < next_event.tv64)
> +			/*
> +			 * Mark the remote cpu in the pending mask, so
> +			 * it can avoid reprogramming the cpu local
> +			 * timer in tick_broadcast_oneshot_control().
> +			 */
> +			set_bit(cpu, tick_broadcast_pending);
> +		} else if (td->evtdev->next_event.tv64 < next_event.tv64)
>  			next_event.tv64 = td->evtdev->next_event.tv64;
>  	}
>
> +	/* Take care of enforced broadcast requests */
> +	for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) {
> +		set_bit(cpu, tmpmask);
> +		clear_bit(cpu, tick_force_broadcast_mask);
> +	}
> +
>  	/*
>  	 * Wakeup the cpus which have an expired event.
>  	 */
> @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi
>  	struct clock_event_device *bc, *dev;
>  	struct tick_device *td;
>  	unsigned long flags;
> +	ktime_t now;
>  	int cpu;
>
>  	/*
> @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi
>
>  	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>  	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
> +		WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending));
> +		WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
>  		if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
>  			cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
>  			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
> @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi
>  			cpumask_clear_cpu(cpu,
>  					  tick_get_broadcast_oneshot_mask());
>  			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
> -			if (dev->next_event.tv64 != KTIME_MAX)
> -				tick_program_event(dev->next_event, 1);
> +			if (dev->next_event.tv64 == KTIME_MAX)
> +				goto out;
> +			/*
> +			 * The cpu handling the broadcast timer marked
> +			 * this cpu in the broadcast pending mask and
> +			 * fired the broadcast IPI. So we are going to
> +			 * handle the expired event anyway via the
> +			 * broadcast IPI handler. No need to reprogram
> +			 * the timer with an already expired event.
> +			 */
> +			if (test_and_clear_bit(cpu, tick_broadcast_pending))
> +				goto out;
> +			/*
> +			 * If the pending bit is not set, then we are
> +			 * either the CPU handling the broadcast
> +			 * interrupt or we got woken by something else.
> +			 *
> +			 * We are not longer in the broadcast mask, so
> +			 * if the cpu local expiry time is already
> +			 * reached, we would reprogram the cpu local
> +			 * timer with an already expired event.
> +			 *
> +			 * This can lead to a ping-pong when we return
> +			 * to idle and therefor rearm the broadcast
> +			 * timer before the cpu local timer was able
> +			 * to fire. This happens because the forced
> +			 * reprogramming makes sure that the event
> +			 * will happen in the future and depending on
> +			 * the min_delta setting this might be far
> +			 * enough out that the ping-pong starts.
> +			 *
> +			 * If the cpu local next_event has expired
> +			 * then we know that the broadcast timer
> +			 * next_event has expired as well and
> +			 * broadcast is about to be handled. So we
> +			 * avoid reprogramming and enforce that the
> +			 * broadcast handler, which did not run yet,
> +			 * will invoke the cpu local handler.
> +			 *
> +			 * We cannot call the handler directly from
> +			 * here, because we might be in a NOHZ phase
> +			 * and we did not go through the irq_enter()
> +			 * nohz fixups.
> +			 */
> +			now = ktime_get();
> +			if (dev->next_event.tv64 <= now.tv64) {
> +				set_bit(cpu, tick_force_broadcast_mask);
> +				goto out;
> +			}
> +			/*
> +			 * We got woken by something else. Reprogram
> +			 * the cpu local timer device.
> +			 */
> +			tick_program_event(dev->next_event, 1);
> 		}
> 	}
> +out:
> 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
> }
On Thu, Feb 21, 2013 at 10:19:18PM +0000, Thomas Gleixner wrote:
> On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
> > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> > > find below a completely untested patch, which should address that issue.
> > >
> > After looking at the thread, I tried to see the issue on OMAP and could
> > see the same issue as Jason.
>
> That's interesting. We have the same issue on x86 since 2007 and
> nobody noticed ever. It's basically the same problem there, but it
> seems that on x86 getting out of those low power states is way slower
> than the minimal reprogramming delta which is used to enforce the
> local timer to fire after the wakeup.
>
> I'm still amazed that as Jason stated a 1us reprogramming delta is
> sufficient to get this ping-pong going. I somehow doubt that, but
> maybe ARM is really that fast :)

It also depends on when the idle driver exits broadcast mode. Certainly,
if that's the last thing it does before enabling IRQs, that might help
trigger the issue.

I am still a bit sceptical myself too, so I will take advantage of Thomas'
knowledge on the subject, which is way deeper than mine BTW, to ask a
question. The thread started with a subject "too many retries...." and
here I have a doubt.

If the fix is not applied, on the CPU affine to the broadcast timer it
is _normal_ to have local timer retries, since the CPU is
setting/forcing the local timer to fire after a min_delta_ns every time
the expired event was local to the CPU affine to the broadcast timer.
The problem, supposedly, is that the timer has not enough time (sorry
for the mouthful) to expire (fire) before IRQs are disabled and the
thread goes back to idle again.

This means that we should notice a mismatch between the number of
broadcast timer IRQs and local timer IRQs on the CPU affine to the
broadcast timer IRQ (granted, we also have to take into account
broadcast timer IRQs meant to service (through IPI) a local timer
expired on a CPU which is not the one running the broadcast IRQ
handler, and "normal" local timer IRQs as well). I am not sure the
sheer number of retries is a symptom of the problem happening, but I
might well be mistaken, so I am asking.

For certain, with the fix applied, lots of duplicate IRQs on the CPU
affine to the broadcast timer are eliminated, since the local timer is
not reprogrammed anymore (before the fix, basically the broadcast timer
was firing an IRQ that did nothing, since the CPU was already out of
broadcast mode by the time the broadcast handler was running; the real
job was carried out in the local timer handler).

> > Your patch fixes the retries on both CPUs on my dual core machine. So
> > you use my tested by if you need one.
>
> They are always welcome.
>
> > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

You can add mine too, once we fix the WARN_ON_ONCE mentioned in
Santosh's reply.

Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
On Friday 22 February 2013 03:54 PM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>> which I missed while testing your patch. It will take bit more
>> time for me to look into it and hence thought of reporting it.
>>
>> [ 2.186126] ------------[ cut here ]------------
>> [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
The force broadcast mask one, coming from below:
"WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));"

Regards,
Santosh
On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > which I missed while testing your patch. It will take bit more
> > time for me to look into it and hence thought of reporting it.
> >
> > [ 2.186126] ------------[ cut here ]------------
> > [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?

It is the tick_force_broadcast_mask, and I think that's because on all
systems we are testing, the broadcast timer IRQ is a thundering herd:
all CPUs get out of idle at once and try to get out of broadcast mode
at more or less the same time.

Lorenzo
On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote:
> On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
>> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>>> which I missed while testing your patch. It will take bit more
>>> time for me to look into it and hence thought of reporting it.
>>>
>>> [ 2.186126] ------------[ cut here ]------------
>>> [ 2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>>
>> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
> It is the tick_force_broadcast_mask and I think that's because on all
> systems we are testing, the broadcast timer IRQ is a thundering herd,
> all CPUs get out of idle at once and try to get out of broadcast mode
> at more or less the same time.
>
So the issue comes up only with idle states where the CPUs wake up
more or less at the same time, as Lorenzo mentioned. I have two
platforms where I could test the patch, and I see the issue on only
one of them. Yesterday I didn't notice the warning because it wasn't
seen on that platform :-)

OMAP4 idle entry and exit in the deep state is staggered between CPUs,
and hence the warning isn't seen. On OMAP5, though, there is an
additional C-state where idle entry/exit isn't staggered between CPUs,
and I see the issue in that case. Actually the broadcast code doesn't
expect such behavior from the CPUs, since only the broadcast-affine
CPU should wake up and the rest of the CPUs should be woken up by the
broadcast IPIs.

Regards,
Santosh
Index: linux-2.6/kernel/time/tick-broadcast.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-broadcast.c
+++ linux-2.6/kernel/time/tick-broadcast.c
@@ -397,6 +397,8 @@ int tick_resume_broadcast(void)
 
 /* FIXME: use cpumask_var_t. */
 static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS);
+static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS);
+static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS);
 
 /*
  * Exposed for debugging: see timer_list.c
@@ -453,12 +455,24 @@ again:
 	/* Find all expired events */
 	for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
 		td = &per_cpu(tick_cpu_device, cpu);
-		if (td->evtdev->next_event.tv64 <= now.tv64)
+		if (td->evtdev->next_event.tv64 <= now.tv64) {
 			cpumask_set_cpu(cpu, to_cpumask(tmpmask));
-		else if (td->evtdev->next_event.tv64 < next_event.tv64)
+			/*
+			 * Mark the remote cpu in the pending mask, so
+			 * it can avoid reprogramming the cpu local
+			 * timer in tick_broadcast_oneshot_control().
+			 */
+			set_bit(cpu, tick_broadcast_pending);
+		} else if (td->evtdev->next_event.tv64 < next_event.tv64)
 			next_event.tv64 = td->evtdev->next_event.tv64;
 	}
 
+	/* Take care of enforced broadcast requests */
+	for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) {
+		set_bit(cpu, tmpmask);
+		clear_bit(cpu, tick_force_broadcast_mask);
+	}
+
 	/*
 	 * Wakeup the cpus which have an expired event.
 	 */
@@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
+	ktime_t now;
 	int cpu;
 
 	/*
@@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
+		WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending));
+		WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
 		if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
 			cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
@@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi
 			cpumask_clear_cpu(cpu,
 					  tick_get_broadcast_oneshot_mask());
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
-			if (dev->next_event.tv64 != KTIME_MAX)
-				tick_program_event(dev->next_event, 1);
+			if (dev->next_event.tv64 == KTIME_MAX)
+				goto out;
+			/*
+			 * The cpu handling the broadcast timer marked
+			 * this cpu in the broadcast pending mask and
+			 * fired the broadcast IPI. So we are going to
+			 * handle the expired event anyway via the
+			 * broadcast IPI handler. No need to reprogram
+			 * the timer with an already expired event.
+			 */
+			if (test_and_clear_bit(cpu, tick_broadcast_pending))
+				goto out;
+			/*
+			 * If the pending bit is not set, then we are
+			 * either the CPU handling the broadcast
+			 * interrupt or we got woken by something else.
+			 *
+			 * We are not longer in the broadcast mask, so
+			 * if the cpu local expiry time is already
+			 * reached, we would reprogram the cpu local
+			 * timer with an already expired event.
+			 *
+			 * This can lead to a ping-pong when we return
+			 * to idle and therefor rearm the broadcast
+			 * timer before the cpu local timer was able
+			 * to fire. This happens because the forced
+			 * reprogramming makes sure that the event
+			 * will happen in the future and depending on
+			 * the min_delta setting this might be far
+			 * enough out that the ping-pong starts.
+			 *
+			 * If the cpu local next_event has expired
+			 * then we know that the broadcast timer
+			 * next_event has expired as well and
+			 * broadcast is about to be handled. So we
+			 * avoid reprogramming and enforce that the
+			 * broadcast handler, which did not run yet,
+			 * will invoke the cpu local handler.
+			 *
+			 * We cannot call the handler directly from
+			 * here, because we might be in a NOHZ phase
+			 * and we did not go through the irq_enter()
+			 * nohz fixups.
+			 */
+			now = ktime_get();
+			if (dev->next_event.tv64 <= now.tv64) {
+				set_bit(cpu, tick_force_broadcast_mask);
+				goto out;
+			}
+			/*
+			 * We got woken by something else. Reprogram
+			 * the cpu local timer device.
+			 */
+			tick_program_event(dev->next_event, 1);
		}
	}
+out:
	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }