CONFIG_NO_HZ + CONFIG_CPU_IDLE freeze the system (Was Re: [PATCH] acpi : remove power from acpi_processor_cx structure)

Message ID 504A2D73.3010702@linaro.org (mailing list archive)
State New, archived

Commit Message

John Stultz Sept. 7, 2012, 5:22 p.m. UTC
On 09/07/2012 07:20 AM, Daniel Lezcano wrote:
> On 09/06/2012 11:18 PM, Rafael J. Wysocki wrote:
>> On Thursday, September 06, 2012, Daniel Lezcano wrote:
>>> On 09/06/2012 10:04 PM, Rafael J. Wysocki wrote:
>>>> On Thursday, September 06, 2012, Daniel Lezcano wrote:
>>>>> On 09/06/2012 09:54 AM, Daniel Lezcano wrote:
>>>>> I ran into this issue because NETCONSOLE is set; disabling it allowed
>>>>> me to go further.
>>>>>
>>>>> Unfortunately, I am now facing random freezes on the system, which
>>>>> seem to be related to CONFIG_NO_HZ=y and CONFIG_CPU_IDLE=y.
>>>>>
>>>>> Disabling either of them makes the freezes disappear.
>>>>>
>>>>> Is it a known issue?
>>>> Well, there are systems having problems with this configuration, but they
>>>> should be exceptional.  What system is that?
>>> It is a T61p laptop with a Core 2 Duo T9500. Nothing exceptional, I
>>> believe. Maybe someone else has hit the same issue?
>> Is it a regression for you?
> Yes, I think so. The issue appeared between v3.5 and v3.6-rc1.
>
> It is not easy to reproduce, but after some digging, it seems
> to have been introduced by this commit:
>
> 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1 is the first bad commit
> commit 1e75fa8be9fb61e1af46b5b3b176347a4c958ca1
> Author: John Stultz <john.stultz@linaro.org>
> Date:   Fri Jul 13 01:21:53 2012 -0400
>
>      time: Condense timekeeper.xtime into xtime_sec
>
>      The timekeeper struct has a xtime_nsec, which keeps the
>      sub-nanosecond remainder.  This ends up being somewhat
>      duplicative of the timekeeper.xtime.tv_nsec value, and we
>      have to do extra work to keep them apart, copying the full
>      nsec portion out and back in over and over.
>
>      This patch simplifies some of the logic by taking the timekeeper
>      xtime value and splitting it into timekeeper.xtime_sec and
>      reuses the timekeeper.xtime_nsec for the sub-second portion
>      (stored in higher res shifted nanoseconds).
>
>      This simplifies some of the accumulation logic. And will
>      allow for more accurate timekeeping once the vsyscall code
>      is updated to use the shifted nanosecond remainder.
>
>      Signed-off-by: John Stultz <john.stultz@linaro.org>
>      Reviewed-by: Ingo Molnar <mingo@kernel.org>
>      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>      Cc: Richard Cochran <richardcochran@gmail.com>
>      Cc: Prarit Bhargava <prarit@redhat.com>
>      Link: http://lkml.kernel.org/r/1342156917-25092-5-git-send-email-john.stultz@linaro.org
>      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> :040000 040000 4d6541ac1f6075d7adee1eef494b31a0cbda0934 dc5708bc738af695f092bf822809b13a1da104b6 M	kernel
>
> How to reproduce: on a T61p laptop with a Core 2 Duo, I boot the
> kernel into busybox and wait a few minutes before typing something on the
> console. At that point nothing appears on the console, but the
> characters are echoed several seconds later (could be 1, 5, or 10 secs
> or more).
>
> That happens when both CONFIG_CPU_IDLE and CONFIG_NO_HZ are set. With
> either one disabled, the issue does not appear.

Thanks for bisecting this down and the heads up!

Right off I can't see what might be causing this.  Bunch of questions:

Is this a 32 or 64 bit kernel?

By your description above, it sounds like the system is still 
functioning, but there's just a high latency for key-input. Is that right?

Are other things on the system happening slowly?

Does generating interrupts by hitting/holding down the ctrl key make the 
system respond faster?

Is there any dmesg output near when it occurs?

If you don't wait that minute after boot before typing anything, does it 
still trigger later? (or is it tied to early boot?)

On a whim, does the patch below avoid the problem?

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Daniel Lezcano Sept. 7, 2012, 9:35 p.m. UTC | #1
On 09/07/2012 07:22 PM, John Stultz wrote:
> Is this a 32 or 64 bit kernel?

It is a 32 bit kernel.

> By your description above, it sounds like the system is still
> functioning, but there's just a high latency for key-input. Is that right?

Yes, that's correct, but it is not only that. During the freeze, I can't ping
the host. Once the output is echoed, ping works again.

But if I ping the host continuously, it does not freeze and the console
echoes without problems.

> Are other things on the system happening slowly?

I have a very minimal system, but at first glance nothing else is slow
while it is not frozen.

> Does generating interrupts by hitting/holding down the ctrl key make the
> system respond faster?

no.

> Is there any dmesg output near when it occurs?

no.

> If you don't wait that minute after boot before typing anything, does it
> still trigger later? (or is it tied to early boot?)

That depends; it can happen immediately or later. It is more or less
random.

> On a whim, does the patch below avoid the problem?

Nope, same issue :/

Thanks
  -- Daniel

John Stultz Sept. 10, 2012, 5:14 p.m. UTC | #2
On 09/07/2012 02:35 PM, Daniel Lezcano wrote:
>> Is this a 32 or 64 bit kernel?
> It is a 32 bit kernel.

Thanks for your answers! Has this been seen on 3.6-rc4+ kernels?
There were a few casting fixes that landed in 3.6-rc4 that would affect
32-bit systems.

In the meantime, I'll try to reproduce on my T61. If you could send me 
your .config, I'd appreciate it.

thanks!
-john

Daniel Lezcano Sept. 10, 2012, 7:45 p.m. UTC | #3
On 09/10/2012 07:14 PM, John Stultz wrote:
> Thanks for your answers! Has this been seen on 3.6-rc4+ kernels?
> There were a few casting fixes that landed in 3.6-rc4 that would
> affect 32-bit systems.

Ok, I have to check that. Unfortunately not before Wednesday.

>
> In the meantime, I'll try to reproduce on my T61. If you could send me
> your .config, I'd appreciate it.

http://pastebin.com/qSxqfdDK

The header of the config file says v3.5-rc7 because the file comes out
of the git bisect. If you use this config file with the latest
kernel, it should reproduce the problem.

Let me know whether you are able to reproduce the problem.

Thanks
-- Daniel

Patch

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 34e5eac..2fa0e52 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1179,6 +1179,7 @@  static void update_wall_time(void)
  	timekeeping_adjust(tk, offset);
  
  
+#if 0
  	/*
  	* Store only full nanoseconds into xtime_nsec after rounding
  	* it up and add the remainder to the error difference.
@@ -1192,6 +1193,7 @@  static void update_wall_time(void)
  	tk->xtime_nsec -= remainder;
  	tk->xtime_nsec += 1ULL << tk->shift;
  	tk->ntp_error += remainder << tk->ntp_error_shift;
+#endif
  
  	/*
  	 * Finally, make sure that after the rounding