Message ID: 20160318173551.675ff6e5@annuminas.surriel.com (mailing list archive)
State: Superseded, archived
Delegated to: Rafael Wysocki
On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> This is a bit of a big hammer, but since the polling loop already uses
> full CPU power anyway, this could be harmless, and solve your problem.

Yes, it would be good to test this.

No, I don't like it.

If we are to break out of the polling loop after a specific time, we
shouldn't be using polling at all in the first place.

So I very much prefer to compare data->next_timer_us with the target
residency of C1 instead of doing this.
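For readers following along in the code, the comparison suggested here would live in the menu governor's state selection rather than in poll_idle(). A minimal sketch of the idea, using stand-in structures with only the relevant fields (the real ones are menu_device.next_timer_us in drivers/cpuidle/governors/menu.c and cpuidle_state.target_residency in include/linux/cpuidle.h); the exact condition in any eventual patch may differ:

/* Stand-in structures carrying only the fields relevant here. */
struct state_sketch {
	unsigned int target_residency;	/* microseconds, as in the 4.5 kernel */
};

struct menu_sketch {
	unsigned int next_timer_us;	/* time until the next timer event */
};

/*
 * The alternative, roughly: only pick the polling state (state 0) when
 * the next timer is closer than C1's target residency; otherwise C1 is
 * expected to pay for itself, so select it instead of spinning.
 */
static int polling_worthwhile(const struct menu_sketch *data,
			      const struct state_sketch *c1)
{
	return data->next_timer_us < c1->target_residency;
}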
On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> wrote:
> >
> > On Fri, 18 Mar 2016 11:32:28 -0700
> > "Doug Smythies" <dsmythies@telus.net> wrote:
> >
> > > Energy:
> > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > >
> > > Trace data (very crude summary):
> > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > (idle state 0)
> > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > CPU load (not all idle state 0)
> >
> > This is a bit of a big hammer, but since the polling loop already
> > uses full CPU power anyway, this could be harmless, and solve your
> > problem.
>
> Yes, it would be good to test this.
>
> No, I don't like it.
>
> If we are to break out of the polling loop after a specific time, we
> shouldn't be using polling at all in the first place.
>
> So I very much prefer to compare data->next_timer_us with the target
> residency of C1 instead of doing this.

The problem is that data->next_timer_us does not tell us when the
next network packet is about to arrive, though.

This could cause issues when dealing with a fast stream of network
traffic, which get_typical_interval() would be able to deal with.
On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
> On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> > wrote:
> > >
> > > On Fri, 18 Mar 2016 11:32:28 -0700
> > > "Doug Smythies" <dsmythies@telus.net> wrote:
> > >
> > > > Energy:
> > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > >
> > > > Trace data (very crude summary):
> > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > > (idle state 0)
> > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > > CPU load (not all idle state 0)
> > >
> > > This is a bit of a big hammer, but since the polling loop already
> > > uses full CPU power anyway, this could be harmless, and solve your
> > > problem.
> >
> > Yes, it would be good to test this.
> >
> > No, I don't like it.
> >
> > If we are to break out of the polling loop after a specific time, we
> > shouldn't be using polling at all in the first place.
> >
> > So I very much prefer to compare data->next_timer_us with the target
> > residency of C1 instead of doing this.
>
> The problem is that data->next_timer_us does not tell us
> when the next network packet is about to arrive, though.
>
> This could cause issues when dealing with a fast stream
> of network traffic, which get_typical_interval() would
> be able to deal with.

If it predicted things correctly for small intervals, that is.

It doesn't seem to be able to do that, however.

You seem to want to trade energy consumption for performance with a
very narrow use case in mind, and that by changing behavior that's
been there pretty much forever, so by now it really has become the
expected one.

Needless to say, I'm not a fan of changes like that.
On Fri, 2016-03-18 at 23:09 +0100, Rafael J. Wysocki wrote:
> On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
> >
> > On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
> > >
> > > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> > > wrote:
> > > >
> > > > On Fri, 18 Mar 2016 11:32:28 -0700
> > > > "Doug Smythies" <dsmythies@telus.net> wrote:
> > > >
> > > > > Energy:
> > > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
> > > > >
> > > > > Trace data (very crude summary):
> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
> > > > > (idle state 0)
> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
> > > > > CPU load (not all idle state 0)
> > > >
> > > > This is a bit of a big hammer, but since the polling loop already
> > > > uses full CPU power anyway, this could be harmless, and solve your
> > > > problem.
> > >
> > > Yes, it would be good to test this.
> > >
> > > No, I don't like it.
> > >
> > > If we are to break out of the polling loop after a specific time, we
> > > shouldn't be using polling at all in the first place.
> > >
> > > So I very much prefer to compare data->next_timer_us with the target
> > > residency of C1 instead of doing this.
> >
> > The problem is that data->next_timer_us does not tell us
> > when the next network packet is about to arrive, though.
> >
> > This could cause issues when dealing with a fast stream
> > of network traffic, which get_typical_interval() would
> > be able to deal with.
>
> If it predicted things correctly for small intervals, that is.
>
> It doesn't seem to be able to do that, however.

It appears to get it right the vast majority of the time; it is just
that when it gets it wrong in a tiny minority of the cases, the
penalty for polling is really high.

> You seem to want to trade energy consumption for performance with a
> very narrow use case in mind, and that by changing behavior that's
> been there pretty much forever, so by now it really has become the
> expected one.
>
> Needless to say, I'm not a fan of changes like that.

I am not sure it is a narrow use case, and I suspect cases with over
half a million wakeups a second are not going to be all that unusual.

The same thing applies not just to network traffic, but also to tasks
ping-ponging info between CPUs, e.g. because some tasks are pinned,
or because we are dealing with a controller thread and a larger
number of workers.
On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 18 Mar 2016 11:32:28 -0700
> "Doug Smythies" <dsmythies@telus.net> wrote:
>
>> Energy:
>> Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>>
>> Trace data (very crude summary):
>> Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
>> Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less, CPU load (not all idle state 0)
>
> This is a bit of a big hammer, but since the polling loop already uses
> full CPU power anyway, this could be harmless, and solve your problem.
>
> ---8<---
>
> Subject: cpuidle: break out of idle polling loop after HLT threshold
>
> Staying in the poll loop can be beneficial for workloads that
> wake back up very, very quickly, but staying in the poll loop
> for too long can waste a lot of power.
>
> Break out of the idle polling loop if the system has spent more
> time in the loop than the threshold for entering the next power
> saving mode.
>
> The poll loop uses full CPU power already, so this will not
> increase the power use of the poll loop.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  drivers/cpuidle/driver.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
> index 389ade4572be..a5dfeafd243f 100644
> --- a/drivers/cpuidle/driver.c
> +++ b/drivers/cpuidle/driver.c
> @@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
>  static int poll_idle(struct cpuidle_device *dev,
>  		struct cpuidle_driver *drv, int index)
>  {
> +	/*
> +	 * Polling is state 0; break out of the polling loop after the
> +	 * threshold for the next power state has passed. Doing this several
> +	 * times in a row should cause the menu governor to immediately
> +	 * select a deeper power state.
> +	 */
> +	u64 limit = drv->states[1].target_residency * NSEC_PER_USEC;
> +	ktime_t start = ktime_get();
> +	int i = 0;
> +
>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		while (!need_resched())
> +		while (!need_resched()) {
> +			ktime_t now;
>  			cpu_relax();
> +
> +			/* Occasionally check for a timeout. */
> +			if (!(i % 1000)) {
> +				now = ktime_get();
> +				if (ktime_to_ns(ktime_sub(now, start)) > limit)
> +					break;
> +			}
> +		}
>  	}
>  	current_clr_polling();

I'm not sure how this is going to help, but maybe I'm missing something.

Say we hit the timeout and break out of the loop. We get back to
cpuidle_enter_state() and from there (via a couple of irrelevant
steps) to cpuidle_idle_call() and to cpu_idle_loop(). Then, since we
are still idle, the while (1) loop continues and we land in
menu_select() again. That will choose polling again (because now
there's less time to the next timer than it was last time, so if it
didn't choose C1 then, it won't do that now either) and we land in
poll_idle() again.

So is there anything I'm overlooking?
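The objection is easier to follow with the loop written out. A heavily simplified sketch of the control flow being described, with placeholder stand-ins for the real kernel functions (the actual ones live in kernel/sched/idle.c and drivers/cpuidle/ and take a cpuidle device/driver):

/* Placeholder stand-ins for the real kernel functions named above. */
static int menu_select(void)             { return 0; }  /* governor: picks an idle-state index; 0 == polling */
static void cpuidle_enter_state(int idx) { (void)idx; } /* enters the state, returns on wakeup or timeout */
static int cpu_is_idle(void)             { return 1; }

/*
 * Simplified shape of cpu_idle_loop(): if poll_idle() (state 0) breaks
 * out because of the proposed timeout, control returns here, the CPU is
 * still idle, and menu_select() -- now seeing an even closer next timer
 * than before -- picks the polling state again, so nothing is gained
 * unless the governor's prediction itself changes.
 */
static void cpu_idle_loop_sketch(void)
{
	while (cpu_is_idle()) {
		int index = menu_select();
		cpuidle_enter_state(index);
	}
}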
On Fri, Mar 18, 2016 at 11:28 PM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 2016-03-18 at 23:09 +0100, Rafael J. Wysocki wrote:
>> On Friday, March 18, 2016 05:52:49 PM Rik van Riel wrote:
>> >
>> > On Fri, 2016-03-18 at 22:48 +0100, Rafael J. Wysocki wrote:
>> > >
>> > > On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
>> > > wrote:
>> > > >
>> > > > On Fri, 18 Mar 2016 11:32:28 -0700
>> > > > "Doug Smythies" <dsmythies@telus.net> wrote:
>> > > >
>> > > > > Energy:
>> > > > > Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
>> > > > >
>> > > > > Trace data (very crude summary):
>> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load
>> > > > > (idle state 0)
>> > > > > Kernel 4.5-rc7-rjw10-rvr5: ~183 long durations at high, but less,
>> > > > > CPU load (not all idle state 0)
>> > > >
>> > > > This is a bit of a big hammer, but since the polling loop already
>> > > > uses full CPU power anyway, this could be harmless, and solve your
>> > > > problem.
>> > >
>> > > Yes, it would be good to test this.
>> > >
>> > > No, I don't like it.
>> > >
>> > > If we are to break out of the polling loop after a specific time,
>> > > we shouldn't be using polling at all in the first place.
>> > >
>> > > So I very much prefer to compare data->next_timer_us with the
>> > > target residency of C1 instead of doing this.
>> >
>> > The problem is that data->next_timer_us does not tell us
>> > when the next network packet is about to arrive, though.
>> >
>> > This could cause issues when dealing with a fast stream
>> > of network traffic, which get_typical_interval() would
>> > be able to deal with.
>>
>> If it predicted things correctly for small intervals, that is.
>>
>> It doesn't seem to be able to do that, however.
>
> It appears to get it right the vast majority of the time; it is
> just that when it gets it wrong in a tiny minority of the cases,
> the penalty for polling is really high.

The overall penalty is what matters, though (and it is cumulative,
unfortunately).

>> You seem to want to trade energy consumption for performance with a
>> very narrow use case in mind, and that by changing behavior that's
>> been there pretty much forever, so by now it really has become the
>> expected one.
>>
>> Needless to say, I'm not a fan of changes like that.
>
> I am not sure it is a narrow use case, and I suspect cases with
> over half a million wakeups a second are not going to be all that
> unusual.

Well, the definition of "unusual" is somewhat elusive. :-)

> The same thing applies not just to network traffic, but also to
> tasks ping-ponging info between CPUs, e.g. because some tasks are
> pinned, or because we are dealing with a controller thread and a
> larger number of workers.

All true, but still a 10% energy consumption increase is rather a big
deal.

I'd like to hear if there's any difference from going to nanosecond
time scale in the first place at this point.
On Fri, 2016-03-18 at 23:56 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
> wrote:
> >
> > [...]
> >
> > This is a bit of a big hammer, but since the polling loop already
> > uses full CPU power anyway, this could be harmless, and solve your
> > problem.
> >
> > [...]
>
> I'm not sure how this is going to help, but maybe I'm missing
> something.
>
> Say we hit the timeout and break out of the loop. We get back to
> cpuidle_enter_state() and from there (via a couple of irrelevant
> steps) to cpuidle_idle_call() and to cpu_idle_loop(). Then, since we
> are still idle, the while (1) loop continues and we land in
> menu_select() again. That will choose polling again (because now
> there's less time to the next timer than it was last time, so if it
> didn't choose C1 then, it won't do that now either) and we land in
> poll_idle() again.
>
> So is there anything I'm overlooking?

After about three microseconds of the above, get_typical_interval()
will return an interval larger than the C1 target residency, or fail
to find a typical interval, because 3 of the 8 intervals will exceed
the C1 target residency at that point.

Only repeated very fast wakeups will lead to polling, and just a few
periods of polling timing out will lead to the system reverting to
HLT.

In other words, the system will only use polling while it appears to
be beneficial, and should snap out of that mode very quickly.

The kernel test robot also showed over 6% improvement in (IIRC) the
wikibench test, with polling being done a little more often.

Other parts of the kernel do pretty much anything for 2% extra
performance, so if we can find a way to preserve that 6%, without
hurting power usage significantly, I suspect it may be worthwhile.
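The mechanism being relied on here is the menu governor's detection of a "typical" recent idle interval: it keeps the last eight observed intervals and only trusts their average when they cluster tightly. Below is a simplified illustration of that idea, not the exact kernel routine (the real get_typical_interval() iteratively discards outliers and uses its own acceptance threshold); it is only meant to show why a few timed-out polls in the eight-sample window push the estimate back above the C1 target residency:

#include <limits.h>

#define INTERVALS 8

/*
 * Simplified stand-in for the menu governor's typical-interval logic:
 * average the last eight observed idle lengths (in microseconds) and
 * only report a "typical" value when the spread around the average is
 * small.  A handful of long, timed-out polling periods in the window
 * inflates both the average and the variance, so the governor stops
 * predicting sub-C1 wakeups and falls back to C1 or deeper states.
 */
static unsigned int typical_interval_us(const unsigned int us[INTERVALS])
{
	unsigned long long sum = 0, var = 0;
	unsigned int avg;
	int i;

	for (i = 0; i < INTERVALS; i++)
		sum += us[i];
	avg = sum / INTERVALS;

	for (i = 0; i < INTERVALS; i++) {
		long long d = (long long)us[i] - avg;
		var += d * d;
	}
	var /= INTERVALS;

	/* Rough acceptance test; the kernel's exact threshold differs. */
	if (var > ((unsigned long long)avg * avg) / 36)
		return UINT_MAX;	/* no typical interval found */

	return avg;
}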
On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com> wrote:
> On Fri, 2016-03-18 at 23:56 +0100, Rafael J. Wysocki wrote:
>> On Fri, Mar 18, 2016 at 10:35 PM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > [...]
>> >
>> > This is a bit of a big hammer, but since the polling loop already
>> > uses full CPU power anyway, this could be harmless, and solve your
>> > problem.
>> >
>> > [...]
>>
>> I'm not sure how this is going to help, but maybe I'm missing
>> something.
>>
>> Say we hit the timeout and break out of the loop. We get back to
>> cpuidle_enter_state() and from there (via a couple of irrelevant
>> steps) to cpuidle_idle_call() and to cpu_idle_loop(). Then, since we
>> are still idle, the while (1) loop continues and we land in
>> menu_select() again. That will choose polling again (because now
>> there's less time to the next timer than it was last time, so if it
>> didn't choose C1 then, it won't do that now either) and we land in
>> poll_idle() again.
>>
>> So is there anything I'm overlooking?
>
> After about three microseconds of the above, get_typical_interval()
> will return an interval larger than the C1 target residency, or
> fail to find a typical interval, because 3 of the 8 intervals will
> exceed the C1 target residency at that point.
>
> Only repeated very fast wakeups will lead to polling, and just
> a few periods of polling timing out will lead to the system
> reverting to HLT.
>
> In other words, the system will only use polling while it appears
> to be beneficial, and should snap out of that mode very quickly.

OK, thanks!

> The kernel test robot also showed over 6% improvement in (IIRC)
> the wikibench test, with polling being done a little more often.
>
> Other parts of the kernel do pretty much anything for 2% extra
> performance, so if we can find a way to preserve that 6%, without
> hurting power usage significantly, I suspect it may be worthwhile.

Well. OK. That is a worthy goal, but things in this thread look a bit
too ad hoc to me, to be honest.

I'd like to take one step at a time, and in such a way that
improvements are actually confirmed in every step.

Does this sound reasonable?

If so, I first would like to restore the energy consumption in Doug's
case at least to the level from before commit a9ceb78bc75c
(cpuidle,menu: use interactivity_req to disable polling). That's what
I'd like to do for 4.6, possibly before the merge window is over.

After 4.6-rc1, in the 4.7 cycle, we can think about improving
performance here and maybe sacrificing some energy for that, but in a
controlled way, as I said above.

What do you think?
On Sat, 2016-03-19 at 03:06 +0100, Rafael J. Wysocki wrote:
> On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com>
> wrote:
> >
> > The kernel test robot also showed over 6% improvement in (IIRC)
> > the wikibench test, with polling being done a little more often.
> >
> > Other parts of the kernel do pretty much anything for 2% extra
> > performance, so if we can find a way to preserve that 6%, without
> > hurting power usage significantly, I suspect it may be worthwhile.
>
> Well. OK. That is a worthy goal, but things in this thread look a
> bit too ad hoc to me, to be honest.
>
> I'd like to take one step at a time, and in such a way that
> improvements are actually confirmed in every step.
>
> Does this sound reasonable?
>
> If so, I first would like to restore the energy consumption in
> Doug's case at least to the level from before commit a9ceb78bc75c
> (cpuidle,menu: use interactivity_req to disable polling). That's
> what I'd like to do for 4.6, possibly before the merge window is
> over.
>
> After 4.6-rc1, in the 4.7 cycle, we can think about improving
> performance here and maybe sacrificing some energy for that, but
> in a controlled way, as I said above.
>
> What do you think?

That sounds perfectly reasonable to me.

Count me in for any efforts to improve performance with the cpuidle
code, and hopefully allow more people to enable C-states on their
servers.
On Sat, Mar 19, 2016 at 3:17 AM, Rik van Riel <riel@redhat.com> wrote:
> On Sat, 2016-03-19 at 03:06 +0100, Rafael J. Wysocki wrote:
>> On Sat, Mar 19, 2016 at 2:53 AM, Rik van Riel <riel@redhat.com>
>> wrote:
>> >
>> > The kernel test robot also showed over 6% improvement in (IIRC)
>> > the wikibench test, with polling being done a little more often.
>> >
>> > Other parts of the kernel do pretty much anything for 2% extra
>> > performance, so if we can find a way to preserve that 6%, without
>> > hurting power usage significantly, I suspect it may be worthwhile.
>>
>> Well. OK. That is a worthy goal, but things in this thread look a
>> bit too ad hoc to me, to be honest.
>>
>> I'd like to take one step at a time, and in such a way that
>> improvements are actually confirmed in every step.
>>
>> Does this sound reasonable?
>>
>> If so, I first would like to restore the energy consumption in
>> Doug's case at least to the level from before commit a9ceb78bc75c
>> (cpuidle,menu: use interactivity_req to disable polling). That's
>> what I'd like to do for 4.6, possibly before the merge window is
>> over.
>>
>> After 4.6-rc1, in the 4.7 cycle, we can think about improving
>> performance here and maybe sacrificing some energy for that, but
>> in a controlled way, as I said above.
>>
>> What do you think?
>
> That sounds perfectly reasonable to me.
>
> Count me in for any efforts to improve performance with the cpuidle
> code, and hopefully allow more people to enable C-states on their
> servers.

OK, great!

I'm going to send a patch for Doug to test that I'd like to apply for
4.6 (on top of
http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=e132b9b3bc7f19e9b158e42b323881d5dee5ecf3),
maybe not today (because I'm falling asleep now), but likely tomorrow
or on Sunday.

If that restores things for him, I'll apply it and that will be the
starting point for future changes.
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 389ade4572be..a5dfeafd243f 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -181,10 +181,29 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 static int poll_idle(struct cpuidle_device *dev,
 		struct cpuidle_driver *drv, int index)
 {
+	/*
+	 * Polling is state 0; break out of the polling loop after the
+	 * threshold for the next power state has passed. Doing this several
+	 * times in a row should cause the menu governor to immediately
+	 * select a deeper power state.
+	 */
+	u64 limit = drv->states[1].target_residency * NSEC_PER_USEC;
+	ktime_t start = ktime_get();
+	int i = 0;
+
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		while (!need_resched())
+		while (!need_resched()) {
+			ktime_t now;
 			cpu_relax();
+
+			/* Occasionally check for a timeout. */
+			if (!(i % 1000)) {
+				now = ktime_get();
+				if (ktime_to_ns(ktime_sub(now, start)) > limit)
+					break;
+			}
+		}
 	}
 	current_clr_polling();
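One detail worth noting when reading the diff above: i is initialized to 0 but never incremented inside the loop, so !(i % 1000) is true on every iteration and the check is not actually rate-limited; ktime_get() ends up being called on each pass through the loop. The excerpt below sketches the loop body with the counter advancing, assuming a rate-limited check was the intent; no such revision appears in this thread, and only the changed section of poll_idle() is shown:

	local_irq_enable();
	if (!current_set_polling_and_test()) {
		unsigned int i = 0;

		while (!need_resched()) {
			cpu_relax();

			/* Re-read the clock only once every ~1000 spins. */
			if (!(i++ % 1000)) {
				ktime_t now = ktime_get();

				if (ktime_to_ns(ktime_sub(now, start)) > limit)
					break;
			}
		}
	}
	current_clr_polling();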