diff mbox

too many timer retries happen when do local timer swtich with broadcast timer

Message ID alpine.LFD.2.02.1302221212500.22263@ionos (mailing list archive)
State New, archived
Headers show

Commit Message

Thomas Gleixner Feb. 22, 2013, 12:07 p.m. UTC
On Fri, 22 Feb 2013, Santosh Shilimkar wrote:

> On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote:
> > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > > > which I missed while testing your patch. It will take bit more
> > > > time for me to look into it and hence thought of reporting it.
> > > > 
> > > > [    2.186126] ------------[ cut here ]------------
> > > > [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > > > tick_broadcast_oneshot_control+0x1c0/0x21c()
> > > 
> > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
> > 
> > It is the tick_force_broadcast_mask and I think that's because on all
> > systems we are testing, the broadcast timer IRQ is a thundering herd,
> > all CPUs get out of idle at once and try to get out of broadcast mode
> > at more or less the same time.
> > 
> So the issue comes ups only when the idle state used where CPU wakeup
> more or less at same time as Lorenzo mentioned. I have two platforms
> where I could test the patch and see the issue only with one platform.
> 
> Yesterday I didn't notice the warning because it wasn't seen on that
> platform :-) OMAP4 idle entry and exit in deep state is staggered
> between CPUs and hence the warning isn't seen. On OMAP5 though,
> there is an additional C-state where idle entry/exit for CPU
> isn't staggered and I see the issue in that case.
> 
> Actually the broad-cast code doesn't expect such a behavior
> from CPUs since only the broad-cast affine CPU should wake
> up and rest of the CPU should be woken up by the broad-cast
> IPIs.

That's what I feared. We might have the same issue on x86, depending
on the cpu model.

But thinking more about it. It's actually not a real problem, just
pointless burning of cpu cycles.

So on the CPU which gets woken along with the target CPU of the
broadcast the following happens:

  deep_idle()
			<-- spurious wakeup
  broadcast_exit()
    set forced bit
  
  enable interrupts
    
			<-- Nothing happens

  disable interrupts

  broadcast_enter()
			<-- Here we observe the forced bit is set
  deep_idle()

Now after that the target CPU of the broadcast runs the broadcast
handler and finds the other CPU in both the broadcast and the forced
mask, sends the IPI and stuff gets back to normal.

So it's not actually harmful, just more evidence for the theory, that
hardware designers have access to very special drug supplies.

Now we could make use of that and avoid going deep idle just to come
back right away via the IPI. Unfortunately the notification thingy has
no return value, but we can fix that.

To confirm that theory, could you please try the hack below and add
some instrumentation (trace_printk)?

Thanks,

	tglx

Comments

Lorenzo Pieralisi Feb. 22, 2013, 2:48 p.m. UTC | #1
On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> 
> > On Friday 22 February 2013 04:01 PM, Lorenzo Pieralisi wrote:
> > > On Fri, Feb 22, 2013 at 10:24:00AM +0000, Thomas Gleixner wrote:
> > > > On Fri, 22 Feb 2013, Santosh Shilimkar wrote:
> > > > > BTW, Lorenzo off-list mentioned to me about warning in boot-up
> > > > > which I missed while testing your patch. It will take bit more
> > > > > time for me to look into it and hence thought of reporting it.
> > > > > 
> > > > > [    2.186126] ------------[ cut here ]------------
> > > > > [    2.190979] WARNING: at kernel/time/tick-broadcast.c:501
> > > > > tick_broadcast_oneshot_control+0x1c0/0x21c()
> > > > 
> > > > Which one is that? tick_broadcast_pending or tick_force_broadcast_mask ?
> > > 
> > > It is the tick_force_broadcast_mask and I think that's because on all
> > > systems we are testing, the broadcast timer IRQ is a thundering herd,
> > > all CPUs get out of idle at once and try to get out of broadcast mode
> > > at more or less the same time.
> > > 
> > So the issue comes ups only when the idle state used where CPU wakeup
> > more or less at same time as Lorenzo mentioned. I have two platforms
> > where I could test the patch and see the issue only with one platform.
> > 
> > Yesterday I didn't notice the warning because it wasn't seen on that
> > platform :-) OMAP4 idle entry and exit in deep state is staggered
> > between CPUs and hence the warning isn't seen. On OMAP5 though,
> > there is an additional C-state where idle entry/exit for CPU
> > isn't staggered and I see the issue in that case.
> > 
> > Actually the broad-cast code doesn't expect such a behavior
> > from CPUs since only the broad-cast affine CPU should wake
> > up and rest of the CPU should be woken up by the broad-cast
> > IPIs.
> 
> That's what I feared. We might have the same issue on x86, depending
> on the cpu model.
> 
> But thinking more about it. It's actually not a real problem, just
> pointless burning of cpu cycles.
> 
> So on the CPU which gets woken along with the target CPU of the
> broadcast the following happens:
> 
>   deep_idle()
> 			<-- spurious wakeup
>   broadcast_exit()
>     set forced bit
>   
>   enable interrupts
>     
> 			<-- Nothing happens
> 
>   disable interrupts
> 
>   broadcast_enter()
> 			<-- Here we observe the forced bit is set
>   deep_idle()
> 
> Now after that the target CPU of the broadcast runs the broadcast
> handler and finds the other CPU in both the broadcast and the forced
> mask, sends the IPI and stuff gets back to normal.
> 
> So it's not actually harmful, just more evidence for the theory, that
> hardware designers have access to very special drug supplies.
> 
> Now we could make use of that and avoid going deep idle just to come
> back right away via the IPI. Unfortunately the notification thingy has
> no return value, but we can fix that.
> 
> To confirm that theory, could you please try the hack below and add
> some instrumentation (trace_printk)?

Applied, and it looks like that's exactly why the warning triggers, at least
on the platform I am testing on which is a dual-cluster ARM testchip.

There is a still time window though where the CPU (the IPI target) can get
back to idle (tick_broadcast_pending still not set) before the CPU target of
the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
tick_broadcast_pending), or am I missing something ?
It is a corner case, granted. Best thing would be to check pending IRQs in the
idle driver back-end (or have always-on local timers :-)).

Thanks,
Lorenzo
Thomas Gleixner Feb. 22, 2013, 3:03 p.m. UTC | #2
On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
> > Now we could make use of that and avoid going deep idle just to come
> > back right away via the IPI. Unfortunately the notification thingy has
> > no return value, but we can fix that.
> > 
> > To confirm that theory, could you please try the hack below and add
> > some instrumentation (trace_printk)?
> 
> Applied, and it looks like that's exactly why the warning triggers, at least
> on the platform I am testing on which is a dual-cluster ARM testchip.
> 
> There is a still time window though where the CPU (the IPI target) can get
> back to idle (tick_broadcast_pending still not set) before the CPU target of
> the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
> tick_broadcast_pending), or am I missing something ?

Well, the tick_broadcast_pending bit is uninteresting if the
force_broadcast bit is set. Because if that bit is set we know for
sure, that we got woken with the cpu which gets the broadcast timer
and raced back to idle before the broadcast handler managed to
send the IPI.

If we did not get woken before the broadcast IPI then the pending bit
is set when we exit the broadcast mode.

> It is a corner case, granted. Best thing would be to check pending IRQs in the
> idle driver back-end (or have always-on local timers :-)).

The latter is definitely the only sane solution.

Thanks,

	tglx
Lorenzo Pieralisi Feb. 22, 2013, 3:26 p.m. UTC | #3
On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
> > > Now we could make use of that and avoid going deep idle just to come
> > > back right away via the IPI. Unfortunately the notification thingy has
> > > no return value, but we can fix that.
> > > 
> > > To confirm that theory, could you please try the hack below and add
> > > some instrumentation (trace_printk)?
> > 
> > Applied, and it looks like that's exactly why the warning triggers, at least
> > on the platform I am testing on which is a dual-cluster ARM testchip.
> > 
> > There is a still time window though where the CPU (the IPI target) can get
> > back to idle (tick_broadcast_pending still not set) before the CPU target of
> > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
> > tick_broadcast_pending), or am I missing something ?
> 
> Well, the tick_broadcast_pending bit is uninteresting if the
> force_broadcast bit is set. Because if that bit is set we know for
> sure, that we got woken with the cpu which gets the broadcast timer
> and raced back to idle before the broadcast handler managed to
> send the IPI.

Gah, my bad sorry, I mixed things up. I thought

tick_check_broadcast_pending()

was checking against the tick_broadcast_pending mask not

tick_force_broadcast_mask

as it correctly does.

All clear now.

Thanks a lot,
Lorenzo
Thomas Gleixner Feb. 22, 2013, 6:52 p.m. UTC | #4
On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote:
> > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
> > > > Now we could make use of that and avoid going deep idle just to come
> > > > back right away via the IPI. Unfortunately the notification thingy has
> > > > no return value, but we can fix that.
> > > > 
> > > > To confirm that theory, could you please try the hack below and add
> > > > some instrumentation (trace_printk)?
> > > 
> > > Applied, and it looks like that's exactly why the warning triggers, at least
> > > on the platform I am testing on which is a dual-cluster ARM testchip.
> > > 
> > > There is a still time window though where the CPU (the IPI target) can get
> > > back to idle (tick_broadcast_pending still not set) before the CPU target of
> > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
> > > tick_broadcast_pending), or am I missing something ?
> > 
> > Well, the tick_broadcast_pending bit is uninteresting if the
> > force_broadcast bit is set. Because if that bit is set we know for
> > sure, that we got woken with the cpu which gets the broadcast timer
> > and raced back to idle before the broadcast handler managed to
> > send the IPI.
> 
> Gah, my bad sorry, I mixed things up. I thought
> 
> tick_check_broadcast_pending()
> 
> was checking against the tick_broadcast_pending mask not
> 
> tick_force_broadcast_mask

Yep, that's a misnomer. I just wanted to make sure that my theory is
correct. I need to think about the real solution some more.

We have two alternatives:

1) Make the clockevents_notify function have a return value.

2) Add something like the hack I gave you with a proper name.

The latter has the beauty, that we just need to modify the platform
independent idle code instead of going down to every callsite of the
clockevents_notify thing.

Thanks,

	tglx
Santosh Shilimkar Feb. 25, 2013, 6:12 a.m. UTC | #5
On Saturday 23 February 2013 12:22 AM, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
>> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote:
>>> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
>>>> On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
>>>>> Now we could make use of that and avoid going deep idle just to come
>>>>> back right away via the IPI. Unfortunately the notification thingy has
>>>>> no return value, but we can fix that.
>>>>>
>>>>> To confirm that theory, could you please try the hack below and add
>>>>> some instrumentation (trace_printk)?
>>>>
>>>> Applied, and it looks like that's exactly why the warning triggers, at least
>>>> on the platform I am testing on which is a dual-cluster ARM testchip.
>>>>
I too confirm that the warnings cause is same.

>>>> There is a still time window though where the CPU (the IPI target) can get
>>>> back to idle (tick_broadcast_pending still not set) before the CPU target of
>>>> the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
>>>> tick_broadcast_pending), or am I missing something ?
>>>
>>> Well, the tick_broadcast_pending bit is uninteresting if the
>>> force_broadcast bit is set. Because if that bit is set we know for
>>> sure, that we got woken with the cpu which gets the broadcast timer
>>> and raced back to idle before the broadcast handler managed to
>>> send the IPI.
>>
>> Gah, my bad sorry, I mixed things up. I thought
>>
>> tick_check_broadcast_pending()
>>
>> was checking against the tick_broadcast_pending mask not
>>
>> tick_force_broadcast_mask
>
> Yep, that's a misnomer. I just wanted to make sure that my theory is
> correct. I need to think about the real solution some more.
>
> We have two alternatives:
>
> 1) Make the clockevents_notify function have a return value.
>
> 2) Add something like the hack I gave you with a proper name.
>
> The latter has the beauty, that we just need to modify the platform
> independent idle code instead of going down to every callsite of the
> clockevents_notify thing.
>
I agree that 2 is better alternative to avoid multiple changes.
Whichever alternative you choose, will be happy to test it :)

Regards,
Santosh
Jason Liu Feb. 25, 2013, 6:38 a.m. UTC | #6
Thomas,

2013/2/23 Thomas Gleixner <tglx@linutronix.de>:
> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
>> On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote:
>> > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
>> > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
>> > > > Now we could make use of that and avoid going deep idle just to come
>> > > > back right away via the IPI. Unfortunately the notification thingy has
>> > > > no return value, but we can fix that.
>> > > >
>> > > > To confirm that theory, could you please try the hack below and add
>> > > > some instrumentation (trace_printk)?
>> > >
>> > > Applied, and it looks like that's exactly why the warning triggers, at least
>> > > on the platform I am testing on which is a dual-cluster ARM testchip.
>> > >
>> > > There is a still time window though where the CPU (the IPI target) can get
>> > > back to idle (tick_broadcast_pending still not set) before the CPU target of
>> > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
>> > > tick_broadcast_pending), or am I missing something ?
>> >
>> > Well, the tick_broadcast_pending bit is uninteresting if the
>> > force_broadcast bit is set. Because if that bit is set we know for
>> > sure, that we got woken with the cpu which gets the broadcast timer
>> > and raced back to idle before the broadcast handler managed to
>> > send the IPI.
>>
>> Gah, my bad sorry, I mixed things up. I thought
>>
>> tick_check_broadcast_pending()
>>
>> was checking against the tick_broadcast_pending mask not
>>
>> tick_force_broadcast_mask
>
> Yep, that's a misnomer. I just wanted to make sure that my theory is
> correct. I need to think about the real solution some more.

I have applied your patch and tested, there is NO warning at all then.
I think your theory is correct.

>
> We have two alternatives:
>
> 1) Make the clockevents_notify function have a return value.
>
> 2) Add something like the hack I gave you with a proper name.
>
> The latter has the beauty, that we just need to modify the platform
> independent idle code instead of going down to every callsite of the
> clockevents_notify thing.

I prefer the solution 2).

>
> Thanks,
>
>         tglx
Lorenzo Pieralisi Feb. 25, 2013, 1:34 p.m. UTC | #7
On Fri, Feb 22, 2013 at 06:52:14PM +0000, Thomas Gleixner wrote:
> On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> > On Fri, Feb 22, 2013 at 03:03:02PM +0000, Thomas Gleixner wrote:
> > > On Fri, 22 Feb 2013, Lorenzo Pieralisi wrote:
> > > > On Fri, Feb 22, 2013 at 12:07:30PM +0000, Thomas Gleixner wrote:
> > > > > Now we could make use of that and avoid going deep idle just to come
> > > > > back right away via the IPI. Unfortunately the notification thingy has
> > > > > no return value, but we can fix that.
> > > > > 
> > > > > To confirm that theory, could you please try the hack below and add
> > > > > some instrumentation (trace_printk)?
> > > > 
> > > > Applied, and it looks like that's exactly why the warning triggers, at least
> > > > on the platform I am testing on which is a dual-cluster ARM testchip.
> > > > 
> > > > There is a still time window though where the CPU (the IPI target) can get
> > > > back to idle (tick_broadcast_pending still not set) before the CPU target of
> > > > the broadcast has a chance to run tick_handle_oneshot_broadcast (and set
> > > > tick_broadcast_pending), or am I missing something ?
> > > 
> > > Well, the tick_broadcast_pending bit is uninteresting if the
> > > force_broadcast bit is set. Because if that bit is set we know for
> > > sure, that we got woken with the cpu which gets the broadcast timer
> > > and raced back to idle before the broadcast handler managed to
> > > send the IPI.
> > 
> > Gah, my bad sorry, I mixed things up. I thought
> > 
> > tick_check_broadcast_pending()
> > 
> > was checking against the tick_broadcast_pending mask not
> > 
> > tick_force_broadcast_mask
> 
> Yep, that's a misnomer. I just wanted to make sure that my theory is
> correct. I need to think about the real solution some more.
> 
> We have two alternatives:
> 
> 1) Make the clockevents_notify function have a return value.
> 
> 2) Add something like the hack I gave you with a proper name.
> 
> The latter has the beauty, that we just need to modify the platform
> independent idle code instead of going down to every callsite of the
> clockevents_notify thing.

Thank you.

I am not sure (1) would buy us anything compared to (2) and as you said we
would end up patching all callsites so (2) is preferred.

As I mentioned, we can even just apply your fixes and leave platform specific
code deal with this optimization, at the end of the day idle driver has
just to check pending IRQs/wake-up sources (which would cover all IRQs not
just TIMER IPI) if and when it has to start time consuming operations like
cache cleaning to enter deep idle. If it goes into a shallow C-state so be it.

On x86 I think it is HW/FW that prevents C-state entering if IRQs are pending,
and on ARM it is likely to happen too, so I am just saying you should not
bother if you think the code becomes too hairy to justify this change.

Thank you very much for the fixes and your help,
Lorenzo
diff mbox

Patch

Index: linux-2.6/arch/arm/kernel/process.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/process.c
+++ linux-2.6/arch/arm/kernel/process.c
@@ -199,7 +199,7 @@  void cpu_idle(void)
 #ifdef CONFIG_PL310_ERRATA_769419
 			wmb();
 #endif
-			if (hlt_counter) {
+			if (hlt_counter || tick_check_broadcast_pending()) {
 				local_irq_enable();
 				cpu_relax();
 			} else if (!need_resched()) {
Index: linux-2.6/include/linux/clockchips.h
===================================================================
--- linux-2.6.orig/include/linux/clockchips.h
+++ linux-2.6/include/linux/clockchips.h
@@ -170,6 +170,12 @@  extern void tick_broadcast(const struct 
 extern int tick_receive_broadcast(void);
 #endif
 
+#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern int tick_check_broadcast_pending(void);
+#else
+static inline int tick_check_broadcast_pending(void) { return 0; }
+#endif
+
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void clockevents_notify(unsigned long reason, void *arg);
 #else
Index: linux-2.6/kernel/time/tick-broadcast.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-broadcast.c
+++ linux-2.6/kernel/time/tick-broadcast.c
@@ -418,6 +418,15 @@  static int tick_broadcast_set_event(ktim
 	return clockevents_program_event(bc, expires, force);
 }
 
+/*
+ * Called before going idle with interrupts disabled. Checks whether a
+ * broadcast event from the other core is about to happen.
+ */
+int tick_check_broadcast_pending(void)
+{
+	return test_bit(smp_processor_id(), tick_force_broadcast_mask);
+}
+
 int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {
 	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);