
too many timer retries happen when doing local timer switch with broadcast timer

Message ID alpine.LFD.2.02.1302211436300.22263@ionos (mailing list archive)
State New, archived

Commit Message

Thomas Gleixner Feb. 21, 2013, 1:48 p.m. UTC
Jason,

On Thu, 21 Feb 2013, Jason Liu wrote:
> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
> > Now your explanation makes sense.
> >
> > I have no fast solution for this, but I think that I have an idea how
> > to fix it. Stay tuned.
> 
> Thanks Thomas, wait for your fix. :)

find below a completely untested patch, which should address that issue.

Sigh. Stopping the cpu local timer in deep idle is known to be an
idiotic design decision for 10+ years. I'll never understand why ARM
vendors insist on implementing the same stupidity over and over.

Thanks,

	tglx

Comments

Santosh Shilimkar Feb. 21, 2013, 3:12 p.m. UTC | #1
Thomas,

On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>>> Now your explanation makes sense.
>>>
>>> I have no fast solution for this, but I think that I have an idea how
>>> to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>
After looking at the thread, I tried to see the issue on OMAP and could
see the same issue as Jason.

Your patch fixes the retries on both CPUs on my dual core machine, so
you can use my Tested-by if you need one.
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

Thanks for the patch.
And thanks to Jason for spotting the issue.

Regards,
Santosh
Thomas Gleixner Feb. 21, 2013, 10:19 p.m. UTC | #2
On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> > find below a completely untested patch, which should address that issue.
> > 
> After looking at the thread, I tried to see the issue on OMAP and could
> see the same issue as Jason.

That's interesting. We have the same issue on x86 since 2007 and
nobody noticed ever. It's basically the same problem there, but it
seems that on x86 getting out of those low power states is way slower
than the minimal reprogramming delta which is used to enforce the
local timer to fire after the wakeup. 

I'm still amazed that as Jason stated a 1us reprogramming delta is
sufficient to get this ping-pong going. I somehow doubt that, but
maybe ARM is really that fast :)

> Your patch fixes the retries on both CPUs on my dual core machine. So
> you use my tested by if you need one.

They are always welcome.

> Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
> 
> Thanks for the patch.
> And thanks to Jason for spotting the issue.

And for coping with my initial inability to parse his report!

Thanks,

	tglx
Santosh Shilimkar Feb. 22, 2013, 10:07 a.m. UTC | #3
Thomas,

On Friday 22 February 2013 03:49 AM, Thomas Gleixner wrote:
> On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
>> On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
>>> find below a completely untested patch, which should address that issue.
>>>
>> After looking at the thread, I tried to see the issue on OMAP and could
>> see the same issue as Jason.
>
> That's interesting. We have the same issue on x86 since 2007 and
> nobody noticed ever. It's basically the same problem there, but it
> seems that on x86 getting out of those low power states is way slower
> than the minimal reprogramming delta which is used to enforce the
> local timer to fire after the wakeup.
>
> I'm still amazed that as Jason stated a 1us reprogramming delta is
> sufficient to get this ping-pong going. I somehow doubt that, but
> maybe ARM is really that fast :)
>
>> Your patch fixes the retries on both CPUs on my dual core machine. So
>> you use my tested by if you need one.
>
> They are always welcome.
>
BTW, Lorenzo mentioned to me off-list a warning during boot-up
which I missed while testing your patch. It will take a bit more
time for me to look into it, hence I thought of reporting it.

[    2.186126] ------------[ cut here ]------------
[    2.190979] WARNING: at kernel/time/tick-broadcast.c:501 
tick_broadcast_oneshot_control+0x1c0/0x21c()
[    2.200622] Modules linked in:
[    2.203826] [<c001bfe4>] (unwind_backtrace+0x0/0xf0) from 
[<c0047d6c>] (warn_slowpath_common+0x4c/0x64)
[    2.213653] [<c0047d6c>] (warn_slowpath_common+0x4c/0x64) from 
[<c0047da0>] (warn_slowpath_null+0x1c/0x24)
[    2.223754] [<c0047da0>] (warn_slowpath_null+0x1c/0x24) from 
[<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c)
[    2.234924] [<c009336c>] (tick_broadcast_oneshot_control+0x1c0/0x21c) 
from [<c00928dc>] (tick_notify+0x23c/0x42c)
[    2.245666] [<c00928dc>] (tick_notify+0x23c/0x42c) from [<c0539a3c>] 
(notifier_call_chain+0x44/0x84)
[    2.255218] [<c0539a3c>] (notifier_call_chain+0x44/0x84) from 
[<c0071068>] (raw_notifier_call_chain+0x18/0x20)
[    2.265686] [<c0071068>] (raw_notifier_call_chain+0x18/0x20) from 
[<c0091c70>] (clockevents_notify+0x2c/0x174)
[    2.276123] [<c0091c70>] (clockevents_notify+0x2c/0x174) from 
[<c0035294>] (omap_enter_idle_smp+0x3c/0x120)
[    2.286315] [<c0035294>] (omap_enter_idle_smp+0x3c/0x120) from 
[<c042e504>] (cpuidle_enter+0x14/0x18)
[    2.295928] [<c042e504>] (cpuidle_enter+0x14/0x18) from [<c042ef14>] 
(cpuidle_wrap_enter+0x34/0xa0)
[    2.305389] [<c042ef14>] (cpuidle_wrap_enter+0x34/0xa0) from 
[<c042eb20>] (cpuidle_idle_call+0xe0/0x328)
[    2.315307] [<c042eb20>] (cpuidle_idle_call+0xe0/0x328) from 
[<c0015100>] (cpu_idle+0x8c/0x11c)
[    2.324401] [<c0015100>] (cpu_idle+0x8c/0x11c) from [<c073d7ac>] 
(start_kernel+0x2b0/0x300)
[    2.333129] ---[ end trace 6fe1f7b4606a9e20 ]---
Thomas Gleixner Feb. 22, 2013, 10:24 a.m. UTC | #4
On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> BTW, Lorenzo off-list mentioned to me about warning in boot-up
> which I missed while testing your patch. It will take bit more
> time for me to look into it and hence thought of reporting it.
> 
> [    2.186126] ------------[ cut here ]------------
> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> tick_broadcast_oneshot_control+0x1c0/0x21c()

Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
Jason Liu Feb. 22, 2013, 10:26 a.m. UTC | #5
Thomas,

2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
> Jason,
>
> On Thu, 21 Feb 2013, Jason Liu wrote:
>> 2013/2/21 Thomas Gleixner <tglx@linutronix.de>:
>> > Now your explanation makes sense.
>> >
>> > I have no fast solution for this, but I think that I have an idea how
>> > to fix it. Stay tuned.
>>
>> Thanks Thomas, wait for your fix. :)
>
> find below a completely untested patch, which should address that issue.
>

I have tested the patch below, but ran into the following warning:

[  713.340593] ------------[ cut here ]------------
[  713.345238] WARNING: at
/linux-2.6-imx/kernel/time/tick-broadcast.c:497
tick_broadcast_oneshot_control+0x210/0x254()
[  713.359240] Modules linked in:
[  713.362332] [<8004b2cc>] (unwind_backtrace+0x0/0xf8) from
[<80079764>] (warn_slowpath_common+0x54/0x64)
[  713.371740] [<80079764>] (warn_slowpath_common+0x54/0x64) from
[<80079790>] (warn_slowpath_null+0x1c/0x24)
[  713.381408] [<80079790>] (warn_slowpath_null+0x1c/0x24) from
[<800a5320>] (tick_broadcast_oneshot_control+0x210/0x254)
[  713.392120] [<800a5320>]
(tick_broadcast_oneshot_control+0x210/0x254) from [<800a48b0>]
(tick_notify+0xd4/0x1f8)
[  713.402317] [<800a48b0>] (tick_notify+0xd4/0x1f8) from [<8009b154>]
(notifier_call_chain+0x4c/0x8c)
[  713.411377] [<8009b154>] (notifier_call_chain+0x4c/0x8c) from
[<8009b24c>] (raw_notifier_call_chain+0x18/0x20)
[  713.421392] [<8009b24c>] (raw_notifier_call_chain+0x18/0x20) from
[<800a3d9c>] (clockevents_notify+0x30/0x198)
[  713.431417] [<800a3d9c>] (clockevents_notify+0x30/0x198) from
[<80052f58>] (arch_idle_multi_core+0x4c/0xc4)
[  713.441175] [<80052f58>] (arch_idle_multi_core+0x4c/0xc4) from
[<80044f68>] (default_idle+0x20/0x28)
[  713.450322] [<80044f68>] (default_idle+0x20/0x28) from [<800455c0>]
(cpu_idle+0xc8/0xfc)
[  713.458425] [<800455c0>] (cpu_idle+0xc8/0xfc) from [<80008984>]
(start_kernel+0x260/0x294)
[  713.466701] [<80008984>] (start_kernel+0x260/0x294) from
[<10008040>] (0x10008040)

So, the code here which triggers the warning:

void tick_broadcast_oneshot_control(unsigned long reason)
{
	...
	WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
}

>
> Thanks,
>
>         tglx
>
> Index: linux-2.6/kernel/time/tick-broadcast.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/tick-broadcast.c
> +++ linux-2.6/kernel/time/tick-broadcast.c
> @@ -397,6 +397,8 @@ int tick_resume_broadcast(void)
>
>  /* FIXME: use cpumask_var_t. */
>  static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS);
> +static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS);
> +static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS);
>
>  /*
>   * Exposed for debugging: see timer_list.c
> @@ -453,12 +455,24 @@ again:
>         /* Find all expired events */
>         for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
>                 td = &per_cpu(tick_cpu_device, cpu);
> -               if (td->evtdev->next_event.tv64 <= now.tv64)
> +               if (td->evtdev->next_event.tv64 <= now.tv64) {
>                         cpumask_set_cpu(cpu, to_cpumask(tmpmask));
> -               else if (td->evtdev->next_event.tv64 < next_event.tv64)
> +                       /*
> +                        * Mark the remote cpu in the pending mask, so
> +                        * it can avoid reprogramming the cpu local
> +                        * timer in tick_broadcast_oneshot_control().
> +                        */
> +                       set_bit(cpu, tick_broadcast_pending);
> +               } else if (td->evtdev->next_event.tv64 < next_event.tv64)
>                         next_event.tv64 = td->evtdev->next_event.tv64;
>         }
>
> +       /* Take care of enforced broadcast requests */
> +       for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) {
> +               set_bit(cpu, tmpmask);
> +               clear_bit(cpu, tick_force_broadcast_mask);
> +       }
> +
>         /*
>          * Wakeup the cpus which have an expired event.
>          */
> @@ -494,6 +508,7 @@ void tick_broadcast_oneshot_control(unsi
>         struct clock_event_device *bc, *dev;
>         struct tick_device *td;
>         unsigned long flags;
> +       ktime_t now;
>         int cpu;
>
>         /*
> @@ -518,6 +533,8 @@ void tick_broadcast_oneshot_control(unsi
>
>         raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>         if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
> +               WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending));
> +               WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
>                 if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
>                         cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
>                         clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
> @@ -529,10 +546,63 @@ void tick_broadcast_oneshot_control(unsi
>                         cpumask_clear_cpu(cpu,
>                                           tick_get_broadcast_oneshot_mask());
>                         clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
> -                       if (dev->next_event.tv64 != KTIME_MAX)
> -                               tick_program_event(dev->next_event, 1);
> +                       if (dev->next_event.tv64 == KTIME_MAX)
> +                               goto out;
> +                       /*
> +                        * The cpu handling the broadcast timer marked
> +                        * this cpu in the broadcast pending mask and
> +                        * fired the broadcast IPI. So we are going to
> +                        * handle the expired event anyway via the
> +                        * broadcast IPI handler. No need to reprogram
> +                        * the timer with an already expired event.
> +                        */
> +                       if (test_and_clear_bit(cpu, tick_broadcast_pending))
> +                               goto out;
> +                       /*
> +                        * If the pending bit is not set, then we are
> +                        * either the CPU handling the broadcast
> +                        * interrupt or we got woken by something else.
> +                        *
> +                        * We are no longer in the broadcast mask, so
> +                        * if the cpu local expiry time is already
> +                        * reached, we would reprogram the cpu local
> +                        * timer with an already expired event.
> +                        *
> +                        * This can lead to a ping-pong when we return
> +                        * to idle and therefore rearm the broadcast
> +                        * timer before the cpu local timer was able
> +                        * to fire. This happens because the forced
> +                        * reprogramming makes sure that the event
> +                        * will happen in the future and depending on
> +                        * the min_delta setting this might be far
> +                        * enough out that the ping-pong starts.
> +                        *
> +                        * If the cpu local next_event has expired
> +                        * then we know that the broadcast timer
> +                        * next_event has expired as well and
> +                        * broadcast is about to be handled. So we
> +                        * avoid reprogramming and enforce that the
> +                        * broadcast handler, which did not run yet,
> +                        * will invoke the cpu local handler.
> +                        *
> +                        * We cannot call the handler directly from
> +                        * here, because we might be in a NOHZ phase
> +                        * and we did not go through the irq_enter()
> +                        * nohz fixups.
> +                        */
> +                       now = ktime_get();
> +                       if (dev->next_event.tv64 <= now.tv64) {
> +                               set_bit(cpu, tick_force_broadcast_mask);
> +                               goto out;
> +                       }
> +                       /*
> +                        * We got woken by something else. Reprogram
> +                        * the cpu local timer device.
> +                        */
> +                       tick_program_event(dev->next_event, 1);
>                 }
>         }
> +out:
>         raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
>  }
>
Lorenzo Pieralisi Feb. 22, 2013, 10:28 a.m. UTC | #6
On Thu, Feb 21, 2013 at 10:19:18PM +0000, Thomas Gleixner wrote:
> On Thu, 21 Feb 2013, Santosh Shilimkar wrote:
> > On Thursday 21 February 2013 07:18 PM, Thomas Gleixner wrote:
> > > find below a completely untested patch, which should address that issue.
> > > 
> > After looking at the thread, I tried to see the issue on OMAP and could
> > see the same issue as Jason.
> 
> That's interesting. We have the same issue on x86 since 2007 and
> nobody noticed ever. It's basically the same problem there, but it
> seems that on x86 getting out of those low power states is way slower
> than the minimal reprogramming delta which is used to enforce the
> local timer to fire after the wakeup. 
> 
> I'm still amazed that as Jason stated a 1us reprogramming delta is
> sufficient to get this ping-pong going. I somehow doubt that, but
> maybe ARM is really that fast :)

It also depends on when the idle driver exits broadcast mode.
Certainly if that's the last thing it does before enabling IRQs, that
might help trigger the issue.

I am still a bit sceptical myself, and I'll take advantage of Thomas's
knowledge on the subject, which is way deeper than mine BTW, to ask a
question. The thread started with the subject "too many retries...." and
here I have a doubt. If the fix is not applied, then on the CPU affine to
the broadcast timer it is _normal_ to have local timer retries, since
the CPU is setting/forcing the local timer to fire after a min_delta_ns every
time the expired event was local to the CPU affine to the broadcast timer.

The problem, supposedly, is that the timer does not have enough time (sorry
for the mouthful) to expire (fire) before IRQs are disabled and the idle
thread goes back to idle again. This means that we should notice a mismatch
between the number of broadcast timer IRQs and local timer IRQs on the CPU
affine to the broadcast timer IRQ (granted, we also have to take into
account broadcast timer IRQs meant to service (through an IPI) a local timer
that expired on a CPU which is not the one running the broadcast IRQ handler,
as well as "normal" local timer IRQs).

I am not sure the sheer number of retries is a symptom of the problem
happening, but I might well be mistaken so I am asking.

For certain, with the fix applied, lots of duplicate IRQs on the CPU
affine to the broadcast timer are eliminated, since the local timer is
not reprogrammed anymore (before the fix, the broadcast timer was
basically firing an IRQ that did nothing, since the CPU was already out
of broadcast mode by the time the broadcast handler ran; the real job
was carried out in the local timer handler).

> 
> > Your patch fixes the retries on both CPUs on my dual core machine. So
> > you use my tested by if you need one.
> 
> They are always welcome.
> 
> > Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>

You can add mine too; we should fix the WARN_ON_ONCE mentioned in Santosh's
reply.

Tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Santosh Shilimkar Feb. 22, 2013, 10:30 a.m. UTC | #7
On Friday 22 February 2013 03:54 PM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>> which I missed while testing your patch. It will take bit more
>> time for me to look into it and hence thought of reporting it.
>>
>> [    2.186126] ------------[ cut here ]------------
>> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
The force broadcast mask one coming from below.
"WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));"

Regards,
Santosh
Lorenzo Pieralisi Feb. 22, 2013, 10:31 a.m. UTC | #8
On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > which I missed while testing your patch. It will take bit more
> > time for me to look into it and hence thought of reporting it.
> > 
> > [    2.186126] ------------[ cut here ]------------
> > [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > tick_broadcast_oneshot_control+0x1c0/0x21c()
> 
> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?

It is the tick_force_broadcast_mask, and I think that's because on all
the systems we are testing, the broadcast timer IRQ is a thundering herd:
all CPUs get out of idle at once and try to get out of broadcast mode
at more or less the same time.

Lorenzo
Santosh Shilimkar Feb. 22, 2013, 11:02 a.m. UTC | #9
On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote:
> On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
>> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
>>> BTW, Lorenzo off-list mentioned to me about warning in boot-up
>>> which I missed while testing your patch. It will take bit more
>>> time for me to look into it and hence thought of reporting it.
>>>
>>> [    2.186126] ------------[ cut here ]------------
>>> [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
>>> tick_broadcast_oneshot_control+0x1c0/0x21c()
>>
>> Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
>
> It is the tick_force_broadcast_mask and I think that's because on all
> systems we are testing, the broadcast timer IRQ is a thundering herd,
> all CPUs get out of idle at once and try to get out of broadcast mode
> at more or less the same time.
>
So the issue comes up only when an idle state is used where the CPUs wake
up more or less at the same time, as Lorenzo mentioned. I have two platforms
where I could test the patch, and I see the issue on only one of them.

Yesterday I didn't notice the warning because it wasn't seen on that
platform :-) OMAP4 idle entry and exit in a deep state is staggered
between CPUs and hence the warning isn't seen. On OMAP5 though,
there is an additional C-state where idle entry/exit for the CPUs
isn't staggered, and I see the issue in that case.

Actually the broadcast code doesn't expect such behavior from the
CPUs, since only the broadcast-affine CPU should wake up and the
rest of the CPUs should be woken up by the broadcast IPIs.

Regards,
Santosh

Patch

Index: linux-2.6/kernel/time/tick-broadcast.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-broadcast.c
+++ linux-2.6/kernel/time/tick-broadcast.c
@@ -397,6 +397,8 @@  int tick_resume_broadcast(void)
 
 /* FIXME: use cpumask_var_t. */
 static DECLARE_BITMAP(tick_broadcast_oneshot_mask, NR_CPUS);
+static DECLARE_BITMAP(tick_broadcast_pending, NR_CPUS);
+static DECLARE_BITMAP(tick_force_broadcast_mask, NR_CPUS);
 
 /*
  * Exposed for debugging: see timer_list.c
@@ -453,12 +455,24 @@  again:
 	/* Find all expired events */
 	for_each_cpu(cpu, tick_get_broadcast_oneshot_mask()) {
 		td = &per_cpu(tick_cpu_device, cpu);
-		if (td->evtdev->next_event.tv64 <= now.tv64)
+		if (td->evtdev->next_event.tv64 <= now.tv64) {
 			cpumask_set_cpu(cpu, to_cpumask(tmpmask));
-		else if (td->evtdev->next_event.tv64 < next_event.tv64)
+			/*
+			 * Mark the remote cpu in the pending mask, so
+			 * it can avoid reprogramming the cpu local
+			 * timer in tick_broadcast_oneshot_control().
+			 */
+			set_bit(cpu, tick_broadcast_pending);
+		} else if (td->evtdev->next_event.tv64 < next_event.tv64)
 			next_event.tv64 = td->evtdev->next_event.tv64;
 	}
 
+	/* Take care of enforced broadcast requests */
+	for_each_cpu(cpu, to_cpumask(tick_force_broadcast_mask)) {
+		set_bit(cpu, tmpmask);
+		clear_bit(cpu, tick_force_broadcast_mask);
+	}
+
 	/*
 	 * Wakeup the cpus which have an expired event.
 	 */
@@ -494,6 +508,7 @@  void tick_broadcast_oneshot_control(unsi
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
+	ktime_t now;
 	int cpu;
 
 	/*
@@ -518,6 +533,8 @@  void tick_broadcast_oneshot_control(unsi
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
+		WARN_ON_ONCE(test_bit(cpu, tick_broadcast_pending));
+		WARN_ON_ONCE(test_bit(cpu, tick_force_broadcast_mask));
 		if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
 			cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
@@ -529,10 +546,63 @@  void tick_broadcast_oneshot_control(unsi
 			cpumask_clear_cpu(cpu,
 					  tick_get_broadcast_oneshot_mask());
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
-			if (dev->next_event.tv64 != KTIME_MAX)
-				tick_program_event(dev->next_event, 1);
+			if (dev->next_event.tv64 == KTIME_MAX)
+				goto out;
+			/*
+			 * The cpu handling the broadcast timer marked
+			 * this cpu in the broadcast pending mask and
+			 * fired the broadcast IPI. So we are going to
+			 * handle the expired event anyway via the
+			 * broadcast IPI handler. No need to reprogram
+			 * the timer with an already expired event.
+			 */
+			if (test_and_clear_bit(cpu, tick_broadcast_pending))
+				goto out;
+			/*
+			 * If the pending bit is not set, then we are
+			 * either the CPU handling the broadcast
+			 * interrupt or we got woken by something else.
+			 *
+			 * We are no longer in the broadcast mask, so
+			 * if the cpu local expiry time is already
+			 * reached, we would reprogram the cpu local
+			 * timer with an already expired event.
+			 *
+			 * This can lead to a ping-pong when we return
+			 * to idle and therefore rearm the broadcast
+			 * timer before the cpu local timer was able
+			 * to fire. This happens because the forced
+			 * reprogramming makes sure that the event
+			 * will happen in the future and depending on
+			 * the min_delta setting this might be far
+			 * enough out that the ping-pong starts.
+			 *
+			 * If the cpu local next_event has expired
+			 * then we know that the broadcast timer
+			 * next_event has expired as well and
+			 * broadcast is about to be handled. So we
+			 * avoid reprogramming and enforce that the
+			 * broadcast handler, which did not run yet,
+			 * will invoke the cpu local handler.
+			 *
+			 * We cannot call the handler directly from
+			 * here, because we might be in a NOHZ phase
+			 * and we did not go through the irq_enter()
+			 * nohz fixups.
+			 */
+			now = ktime_get();
+			if (dev->next_event.tv64 <= now.tv64) {
+				set_bit(cpu, tick_force_broadcast_mask);
+				goto out;
+			}
+			/*
+			 * We got woken by something else. Reprogram
+			 * the cpu local timer device.
+			 */
+			tick_program_event(dev->next_event, 1);
 		}
 	}
+out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }