Message ID | 20220505015814.3727692-1-rui.zhang@intel.com (mailing list archive) |
---|---|
Series | PM: Solution for S0ix failure caused by PCH overheating |
On 05.05.22 03:58, Zhang Rui wrote:
> On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> PCH thermal sensor that monitors the PCH temperature and blocks the system
> from entering S0ix in case it overheats.
>
> Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> temperature above threshold") introduces a delay loop to cool the
> temperature down for this purpose.
>
> However, in practice, we found that the time it takes to cool the PCH down
> below threshold highly depends on the initial PCH temperature when the
> delay starts, as well as the ambient temperature.
>
> This patch series has been tested on the same Dell XPS 9360 laptop and
> S0ix is 100% achieved across 1000+ s2idle iterations.
>

Hi,

what is the user experience if this ever triggers? At that stage the
system will appear to be suspended to an external observer, won't it?
So in effect you'd have a system that spontaneously wakes up, won't you?

Regards
Oliver

On Thu, May 5, 2022 at 10:23 AM Oliver Neukum <oneukum@suse.com> wrote:
>
> On 05.05.22 03:58, Zhang Rui wrote:
> > On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> > PCH thermal sensor that monitors the PCH temperature and blocks the system
> > from entering S0ix in case it overheats.
> >
> > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> > temperature above threshold") introduces a delay loop to cool the
> > temperature down for this purpose.
> >
> > However, in practice, we found that the time it takes to cool the PCH down
> > below threshold highly depends on the initial PCH temperature when the
> > delay starts, as well as the ambient temperature.
> >
> > This patch series has been tested on the same Dell XPS 9360 laptop and
> > S0ix is 100% achieved across 1000+ s2idle iterations.
> >
> Hi,
>
> what is the user experience if this ever triggers? At that stage the
> system will appear to be suspended to an external observer, won't it?
> So in effect you'd have a system that spontaneously wakes up, won't you?

No, you won't.

It will just go ahead and reach S0ix when it can. It will only wake
up if there's a legitimate wakeup event in the meantime.

Hi, Neukum,

Thanks for your response, I missed your original reply in my Inbox.

On Thu, 2022-05-05 at 14:02 +0200, Rafael J. Wysocki wrote:
> On Thu, May 5, 2022 at 10:23 AM Oliver Neukum <oneukum@suse.com> wrote:
> >
> > On 05.05.22 03:58, Zhang Rui wrote:
> > > On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> > > PCH thermal sensor that monitors the PCH temperature and blocks
> > > the system from entering S0ix in case it overheats.
> > >
> > > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due
> > > to PCH temperature above threshold") introduces a delay loop to
> > > cool the temperature down for this purpose.
> > >
> > > However, in practice, we found that the time it takes to cool the
> > > PCH down below threshold highly depends on the initial PCH
> > > temperature when the delay starts, as well as the ambient
> > > temperature.
> > >
> > > This patch series has been tested on the same Dell XPS 9360
> > > laptop and S0ix is 100% achieved across 1000+ s2idle iterations.
> > >
> >
> > Hi,
> >
> > what is the user experience if this ever triggers? At that stage
> > the system will appear to be suspended to an external observer,
> > won't it?
> > So in effect you'd have a system that spontaneously wakes up, won't
> > you?
>
> No, you won't.
>
> It will just go ahead and reach S0ix when it can. It will only wake
> up if there's a legitimate wakeup event in the meantime.

Please correct me if I misunderstand your question, Oliver.

Without the patch, the system becomes suspended and stays in PCx.
With the patch, the system first stays in PCx while suspending (during
the intel_pch_thermal driver's cooling delays), and then becomes
suspended and stays in S0ix.

thanks,
rui

On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@intel.com> wrote:
>
> On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> PCH thermal sensor that monitors the PCH temperature and blocks the system
> from entering S0ix in case it overheats.
>
> Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> temperature above threshold") introduces a delay loop to cool the
> temperature down for this purpose.
>
> However, in practice, we found that the time it takes to cool the PCH down
> below threshold highly depends on the initial PCH temperature when the
> delay starts, as well as the ambient temperature.
>
> For example, on a Dell XPS 9360 laptop, the problem can be triggered
> 1. when it is suspended with heavy workload running.
> or
> 2. when it is moved from New Hampshire to Florida.
>
> In these cases, the 1 second delay is not sufficient. As a result, the
> system stays in a shallower power state like PCx instead of S0ix, and
> drains the battery without the user noticing.
>
> In this patch series, we first fix the problem in patch 1/7 ~ 3/7, by
> 1. expanding the default overall cooling delay timeout to 60 seconds.
> 2. making sure the temperature is below threshold rather than equal to it.
> 3. moving the delay to the .suspend_noirq phase instead, in order to
>    a) do the cooling when the system is in a more quiescent state
>    b) be aware of wakeup events during the long delay, because some wakeup
>       events (ACPI Power button Press, USB mouse, etc) become valid only
>       in .suspend_noirq phase and later.
>
> However, this potentially long delay introduces a problem for our suspend
> stress automation test, because the delay makes it hard to predict how
> much time it takes to suspend the system.
> As we want to do as many suspend iterations as possible in limited time,
> setting a 60+ second RTC alarm for a suspend which usually takes less
> than 1 second is far beyond overkill.
>
> Thus, in patch 4/7 ~ 7/7, an RTC driver hook is introduced, which cancels
> the armed RTC alarm at the beginning of suspend and then rearms the RTC
> alarm with a short interval (say, 2 seconds) right before the system is
> suspended.
>
> By running
>  # echo 2 > /sys/module/rtc_cmos/parameters/rtc_wake_override_sec
> before suspend, the system can be resumed by RTC alarm right after it is
> suspended, no matter how much time the suspend really takes.
>
> This patch series has been tested on the same Dell XPS 9360 laptop and
> S0ix is 100% achieved across 1000+ s2idle iterations.

Overall, the first three patches in the series can go in without the
rest, so let's put them into a separate series.

Patch [4/7] doesn't depend on the first three ones, so it can go in by itself.

Patch [5/7] is to be dropped anyway as per the earlier discussion.

Patch [6/7] is only needed to apply patch [7/7] which is controversial.

I think that we can drop or defer patches [6-7/7] for now.

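A minimal user-space sketch of the cooling delay described in the quoted cover letter (items 1 to 3) might look like the following. It is not the intel_pch_thermal driver code. The function name, its callable parameters and the 100 ms polling step are assumptions made for illustration; only the 60-second default timeout, the strict below-threshold comparison and the check for wakeup events during the delay are taken from the cover letter.

```python
from typing import Callable


def wait_for_cooling(
    read_temp_c: Callable[[], float],
    threshold_c: float,
    wakeup_pending: Callable[[], bool],
    sleep_ms: Callable[[int], None],
    timeout_ms: int = 60_000,  # 60 s overall timeout, as in the cover letter
    step_ms: int = 100,        # polling step: an assumption for this sketch
) -> str:
    """Poll until the temperature is strictly below the threshold.

    Returns "cooled", "wakeup" (a wakeup event arrived during the delay)
    or "timeout" (the overall delay budget was used up).
    """
    waited_ms = 0
    while True:
        # Strictly below: a reading equal to the threshold is not enough.
        if read_temp_c() < threshold_c:
            return "cooled"
        # Stay responsive to wakeup events during a potentially long delay.
        if wakeup_pending():
            return "wakeup"
        if waited_ms >= timeout_ms:
            return "timeout"
        sleep_ms(step_ms)
        waited_ms += step_ms


if __name__ == "__main__":
    # Hypothetical readings: the sensor cools by one degree per poll.
    readings = iter([52.0, 51.0, 50.0, 49.0])
    naps = []
    outcome = wait_for_cooling(
        lambda: next(readings), 50.0, lambda: False, naps.append
    )
    print(outcome, "after", len(naps), "polling sleeps")
```

Passing the sensor, the wakeup check and the sleep function in as callables keeps the sketch testable without hardware; in the kernel these would be a register read, a wakeup-pending check and a real sleep.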
On 17/05/2022 17:11:05+0200, Rafael J. Wysocki wrote:
> On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@intel.com> wrote:
> >
> > On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> > PCH thermal sensor that monitors the PCH temperature and blocks the system
> > from entering S0ix in case it overheats.
> >
> > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> > temperature above threshold") introduces a delay loop to cool the
> > temperature down for this purpose.
> >
> > However, in practice, we found that the time it takes to cool the PCH down
> > below threshold highly depends on the initial PCH temperature when the
> > delay starts, as well as the ambient temperature.
> >
> > For example, on a Dell XPS 9360 laptop, the problem can be triggered
> > 1. when it is suspended with heavy workload running.
> > or
> > 2. when it is moved from New Hampshire to Florida.
> >
> > In these cases, the 1 second delay is not sufficient. As a result, the
> > system stays in a shallower power state like PCx instead of S0ix, and
> > drains the battery without the user noticing.
> >
> > In this patch series, we first fix the problem in patch 1/7 ~ 3/7, by
> > 1. expanding the default overall cooling delay timeout to 60 seconds.
> > 2. making sure the temperature is below threshold rather than equal to it.
> > 3. moving the delay to the .suspend_noirq phase instead, in order to
> >    a) do the cooling when the system is in a more quiescent state
> >    b) be aware of wakeup events during the long delay, because some wakeup
> >       events (ACPI Power button Press, USB mouse, etc) become valid only
> >       in .suspend_noirq phase and later.
> >
> > However, this potentially long delay introduces a problem for our suspend
> > stress automation test, because the delay makes it hard to predict how
> > much time it takes to suspend the system.
> > As we want to do as many suspend iterations as possible in limited time,
> > setting a 60+ second RTC alarm for a suspend which usually takes less
> > than 1 second is far beyond overkill.
> >
> > Thus, in patch 4/7 ~ 7/7, an RTC driver hook is introduced, which cancels
> > the armed RTC alarm at the beginning of suspend and then rearms the RTC
> > alarm with a short interval (say, 2 seconds) right before the system is
> > suspended.
> >
> > By running
> >  # echo 2 > /sys/module/rtc_cmos/parameters/rtc_wake_override_sec
> > before suspend, the system can be resumed by RTC alarm right after it is
> > suspended, no matter how much time the suspend really takes.
> >
> > This patch series has been tested on the same Dell XPS 9360 laptop and
> > S0ix is 100% achieved across 1000+ s2idle iterations.
>
> Overall, the first three patches in the series can go in without the
> rest, so let's put them into a separate series.
>
> Patch [4/7] doesn't depend on the first three ones, so it can go in by itself.
>
> Patch [5/7] is to be dropped anyway as per the earlier discussion.
>
> Patch [6/7] is only needed to apply patch [7/7] which is controversial.
>
> I think that we can drop or defer patches [6-7/7] for now.

I don't think 7/7 is really useful in the upstream kernel, so I don't
plan to apply it.

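For context on what patch 7/7 is trying to achieve, the timing argument in the quoted cover letter can be illustrated with a small model. This is only a back-of-the-envelope sketch: the function name and the simplifying assumption that a fixed alarm always fires after suspend has completed are mine, not part of the series. The only names taken from the thread are the 60+ second alarm, the 2-second override and the idea of re-arming the alarm right before the system is suspended.

```python
from typing import Optional, Tuple


def iteration_times(
    suspend_s: float,
    alarm_s: float,
    override_s: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (seconds spent suspended, total seconds for one iteration).

    suspend_s  : time the suspend path takes, including any cooling delay
    alarm_s    : RTC alarm interval armed before suspend starts
    override_s : if set, models rtc_wake_override_sec: the armed alarm is
                 cancelled at the start of suspend and re-armed with this
                 interval right before the system is suspended
    """
    if override_s is not None:
        # The wakeup is relative to the end of the suspend path, so the
        # iteration length tracks the real suspend time.
        return override_s, suspend_s + override_s
    if alarm_s < suspend_s:
        # Simplifying assumption of this sketch: the fixed alarm must
        # outlast the suspend path, which is why it has to be 60+ seconds.
        raise ValueError("fixed alarm would fire before suspend completes")
    # The wakeup is relative to the start of suspend, so a fast suspend
    # leaves the machine idle until the long alarm expires.
    return alarm_s - suspend_s, alarm_s


if __name__ == "__main__":
    print(iteration_times(suspend_s=1, alarm_s=65))                 # fixed alarm
    print(iteration_times(suspend_s=1, alarm_s=65, override_s=2))   # override
    print(iteration_times(suspend_s=40, alarm_s=65, override_s=2))  # slow cooling
```

Under this model a fixed alarm makes every iteration cost the full alarm interval, while the override makes the cost follow the actual suspend time, which is the throughput concern raised for the stress test.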
Hi, Rafael,

On Tue, 2022-05-17 at 17:11 +0200, Rafael J. Wysocki wrote:
> On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@intel.com> wrote:
> >
> > On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> > PCH thermal sensor that monitors the PCH temperature and blocks the
> > system from entering S0ix in case it overheats.
> >
> > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to
> > PCH temperature above threshold") introduces a delay loop to cool
> > the temperature down for this purpose.
> >
> > However, in practice, we found that the time it takes to cool the
> > PCH down below threshold highly depends on the initial PCH
> > temperature when the delay starts, as well as the ambient
> > temperature.
> >
> > For example, on a Dell XPS 9360 laptop, the problem can be
> > triggered
> > 1. when it is suspended with heavy workload running.
> > or
> > 2. when it is moved from New Hampshire to Florida.
> >
> > In these cases, the 1 second delay is not sufficient. As a result,
> > the system stays in a shallower power state like PCx instead of
> > S0ix, and drains the battery without the user noticing.
> >
> > In this patch series, we first fix the problem in patch 1/7 ~ 3/7,
> > by
> > 1. expanding the default overall cooling delay timeout to 60
> >    seconds.
> > 2. making sure the temperature is below threshold rather than equal
> >    to it.
> > 3. moving the delay to the .suspend_noirq phase instead, in order to
> >    a) do the cooling when the system is in a more quiescent state
> >    b) be aware of wakeup events during the long delay, because some
> >       wakeup events (ACPI Power button Press, USB mouse, etc)
> >       become valid only in .suspend_noirq phase and later.
> >
> > However, this potentially long delay introduces a problem for our
> > suspend stress automation test, because the delay makes it hard to
> > predict how much time it takes to suspend the system.
> > As we want to do as many suspend iterations as possible in limited
> > time, setting a 60+ second RTC alarm for a suspend which usually
> > takes less than 1 second is far beyond overkill.
> >
> > Thus, in patch 4/7 ~ 7/7, an RTC driver hook is introduced, which
> > cancels the armed RTC alarm at the beginning of suspend and then
> > rearms the RTC alarm with a short interval (say, 2 seconds) right
> > before the system is suspended.
> >
> > By running
> >  # echo 2 > /sys/module/rtc_cmos/parameters/rtc_wake_override_sec
> > before suspend, the system can be resumed by RTC alarm right after
> > it is suspended, no matter how much time the suspend really takes.
> >
> > This patch series has been tested on the same Dell XPS 9360 laptop
> > and S0ix is 100% achieved across 1000+ s2idle iterations.
>
> Overall, the first three patches in the series can go in without the
> rest, so let's put them into a separate series.
>
> Patch [4/7] doesn't depend on the first three ones, so it can go in
> by itself.
>
> Patch [5/7] is to be dropped anyway as per the earlier discussion.
>
> Patch [6/7] is only needed to apply patch [7/7] which is
> controversial.
>
> I think that we can drop or defer patches [6-7/7] for now.

This all sounds reasonable to me. I will resend them separately.

-rui