Message ID | 4137867.C4jYrWdt8n@aspire.rjw.lan (mailing list archive)
---|---
State | Superseded, archived
On Monday, March 12, 2018 10:36:27 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> If poll_idle() is allowed to spin until need_resched() returns 'true',
> it may actually spin for a much longer time than expected by the idle
> governor, since set_tsk_need_resched() is not always called by the
> timer interrupt handler. If that happens, the CPU may spend much
> more time than anticipated in the "polling" state.
>
> To prevent that from happening, limit the time of the spinning loop
> in poll_idle().
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
>
> ---
>  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpuidle/poll_state.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/poll_state.c
> +++ linux-pm/drivers/cpuidle/poll_state.c
> @@ -5,16 +5,31 @@
>   */
>
>  #include <linux/cpuidle.h>
> +#include <linux/ktime.h>
>  #include <linux/sched.h>
>  #include <linux/sched/idle.h>
>
> +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> +#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
> +
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> +	ktime_t start = ktime_get();
> +
>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		while (!need_resched())
> +		unsigned int time_check_counter = 0;
> +
> +		while (!need_resched()) {
>  			cpu_relax();
> +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> +				continue;
> +
> +			time_check_counter = 0;
> +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> +				break;
> +		}
>  	}
>  	current_clr_polling();

No comments, so I'm assuming no objections or concerns.

I've seen reports telling me that this patch alone may reduce the CPU
package power by as much as 30% sometimes.

Thanks!
On Mon, Mar 12, 2018 at 10:36:27AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> If poll_idle() is allowed to spin until need_resched() returns 'true',
> it may actually spin for a much longer time than expected by the idle
> governor, since set_tsk_need_resched() is not always called by the
> timer interrupt handler. If that happens, the CPU may spend much
> more time than anticipated in the "polling" state.
>
> To prevent that from happening, limit the time of the spinning loop
> in poll_idle().
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
>
> ---
>  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/cpuidle/poll_state.c
> ===================================================================
> --- linux-pm.orig/drivers/cpuidle/poll_state.c
> +++ linux-pm/drivers/cpuidle/poll_state.c
> @@ -5,16 +5,31 @@
>   */
>
>  #include <linux/cpuidle.h>
> +#include <linux/ktime.h>
>  #include <linux/sched.h>
>  #include <linux/sched/idle.h>
>
> +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> +#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
> +
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> +	ktime_t start = ktime_get();

I would recommend not using ktime_get(); imagine the 'joy' if that
happens to be the HPET.

>  	local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> +		unsigned int time_check_counter = 0;
> +
> +		while (!need_resched()) {
>  			cpu_relax();
> +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> +				continue;
> +
> +			time_check_counter = 0;
> +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> +				break;
> +		}
>  	}
>  	current_clr_polling();

Since the idle loop is strictly per-cpu, you can use regular
sched_clock() here.
Something like:

	u64 start = sched_clock();

	local_irq_enable();
	if (!current_set_polling_and_test()) {
		while (!need_resched()) {
			cpu_relax();

			if (sched_clock() - start > POLL_IDLE_TIME_LIMIT)
				break;
		}
	}
	current_clr_polling();

On x86 we don't have to use that time_check_counter thing, sched_clock()
is really cheap; not sure if it makes sense on other platforms.
On Wednesday, March 14, 2018 1:04:50 PM CET Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 10:36:27AM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If poll_idle() is allowed to spin until need_resched() returns 'true',
> > it may actually spin for a much longer time than expected by the idle
> > governor, since set_tsk_need_resched() is not always called by the
> > timer interrupt handler. If that happens, the CPU may spend much
> > more time than anticipated in the "polling" state.
> >
> > To prevent that from happening, limit the time of the spinning loop
> > in poll_idle().
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > -> v2: After additional testing reduce POLL_IDLE_TIME_CHECK_COUNT to 1000.
> >
> > ---
> >  drivers/cpuidle/poll_state.c |   17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/cpuidle/poll_state.c
> > ===================================================================
> > --- linux-pm.orig/drivers/cpuidle/poll_state.c
> > +++ linux-pm/drivers/cpuidle/poll_state.c
> > @@ -5,16 +5,31 @@
> >   */
> >
> >  #include <linux/cpuidle.h>
> > +#include <linux/ktime.h>
> >  #include <linux/sched.h>
> >  #include <linux/sched/idle.h>
> >
> > +#define POLL_IDLE_TIME_CHECK_COUNT	1000
> > +#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
> > +
> >  static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >  			       struct cpuidle_driver *drv, int index)
> >  {
> > +	ktime_t start = ktime_get();
>
> I would recommend not using ktime_get(); imagine the 'joy' if that
> happens to be the HPET.
>
> >  	local_irq_enable();
> >  	if (!current_set_polling_and_test()) {
> > +		unsigned int time_check_counter = 0;
> > +
> > +		while (!need_resched()) {
> >  			cpu_relax();
> > +			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
> > +				continue;
> > +
> > +			time_check_counter = 0;
> > +			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
> > +				break;
> > +		}
> >  	}
> >  	current_clr_polling();
>
> Since the idle loop is strictly per-cpu, you can use regular
> sched_clock() here.
>
> Something like:
>
> 	u64 start = sched_clock();
>
> 	local_irq_enable();
> 	if (!current_set_polling_and_test()) {
> 		while (!need_resched()) {
> 			cpu_relax();
>
> 			if (sched_clock() - start > POLL_IDLE_TIME_LIMIT)
> 				break;
> 		}
> 	}
> 	current_clr_polling();

Good idea!

> On x86 we don't have to use that time_check_counter thing, sched_clock()
> is really cheap, not sure if it makes sense on other platforms.

Let's do it the way you suggested; if there are issues with it, we can
still add the counter later.

Thanks!
On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:

> On x86 we don't have to use that time_check_counter thing,
> sched_clock() is really cheap, not sure if it makes sense on other
> platforms.

Are you sure? I saw a 5-10% increase in CPU use, for a constant query
rate to a memcache style workload, with v3 of this patch.
On Thu, Mar 22, 2018 at 12:32:18PM -0400, Rik van Riel wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>
> > On x86 we don't have to use that time_check_counter thing,
> > sched_clock() is really cheap, not sure if it makes sense on other
> > platforms.
>
> Are you sure? I saw a 5-10% increase in CPU use, for a constant query
> rate to a memcache style workload, with v3 of this patch.

Well, I expected it to be really cheap, but if you have numbers saying
otherwise, clearly I'll need to adjust expectations ;-)
On 2018.03.22 09:32 Rik van Riel wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>
>> On x86 we don't have to use that time_check_counter thing,
>> sched_clock() is really cheap, not sure if it makes sense on other
>> platforms.
>
> Are you sure? I saw a 5-10% increase in CPU use, for a constant query
> rate to a memcache style workload, with v3 of this patch.

I would very much like to be able to repeat your test results.
However, I am not sure what you mean by "memcache style workload".
Is there a test you can point me to? Say a Phoronix type test, for
example.

All of my tests with the V3 of this patch have been fine.

... Doug
On 2018.03.22 12:12 Doug Smythies wrote:
> On 2018.03.22 09:32 Rik van Riel wrote:
>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>
>>> On x86 we don't have to use that time_check_counter thing,
>>> sched_clock() is really cheap, not sure if it makes sense on other
>>> platforms.
>>
>> Are you sure? I saw a 5-10% increase in CPU use, for a constant query
>> rate to a memcache style workload, with v3 of this patch.
>
> I would very much like to be able to repeat your test results.
> However, I am not sure what you mean by "memcache style workload".
> Is there a test you can point me to? Say a Phoronix type test, for
> example.
>
> All of my tests with the V3 of this patch have been fine.

What is the difference between sched_clock() talked about herein,
and local_clock() used in the patch?

I'm not sure how good it is, but I made a test. I didn't believe
the results, so I did it 3 times.

V7.3 is as from the git branch.
V7.3p is plus the patch adding the counter loop to poll_state.c.

The test is a tight loop (about 19600 loops per second) running on all
8 CPUs. I cannot seem to get my system to use idle state 0, so I
disabled idle states 1 and 2 to force use of idle state 0.

V7.3 uses a processor package power of 62.5 watts.
V7.3p uses a processor package power of 53.4 watts, or 14.6% less power.

The loop times do not change.
The idle state 0 residency per unit time does not change.

... Doug
On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2018.03.22 12:12 Doug Smythies wrote:
>> On 2018.03.22 09:32 Rik van Riel wrote:
>>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>>
>>>> On x86 we don't have to use that time_check_counter thing,
>>>> sched_clock() is really cheap, not sure if it makes sense on other
>>>> platforms.
>>>
>>> Are you sure? I saw a 5-10% increase in CPU use, for a constant query
>>> rate to a memcache style workload, with v3 of this patch.
>>
>> I would very much like to be able to repeat your test results.
>> However, I am not sure what you mean by "memcache style workload".
>> Is there a test you can point me to? Say a Phoronix type test, for
>> example.
>>
>> All of my tests with the V3 of this patch have been fine.
>
> What is the difference between sched_clock() talked about herein,
> and local_clock() used in the patch?

It is almost the same (modulo offset) unless sched_clock_stable()
returns 'false'.

> I'm not sure how good it is but I made a test. I didn't believe
> the results, so I did it 3 times.
>
> V7.3 is as from the git branch.
> V7.3p is plus the patch adding the counter loop to poll_state.c
>
> The test is a tight loop (about 19600 loops per second) running
> on all 8 CPUs. I can not seem to get my system to use Idle State
> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>
> V7.3 uses a processor package power of 62.5 Watts
> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>
> The loop times do not change.
> The Idle state 0 residency per unit time does not change.

OK, so this means that the results should improve for Rik with this
patch too. :-)

Thanks!
On Fri, Mar 23, 2018 at 9:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
>> On 2018.03.22 12:12 Doug Smythies wrote:

...[snip]...

>> What is the difference between sched_clock() talked about herein,
>> and local_clock() used in the patch?
>
> It is almost the same (modulo offset) unless sched_clock_stable()
> returns 'false'.
>
>> I'm not sure how good it is but I made a test. I didn't believe
>> the results, so I did it 3 times.
>>
>> V7.3 is as from the git branch.
>> V7.3p is plus the patch adding the counter loop to poll_state.c
>>
>> The test is a tight loop (about 19600 loops per second) running
>> on all 8 CPUs. I can not seem to get my system to use Idle State
>> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>>
>> V7.3 uses a processor package power of 62.5 Watts
>> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>>
>> The loop times do not change.
>> The Idle state 0 residency per unit time does not change.
>
> OK, so this means that the results should improve for Rik with this
> patch too. :-)

BTW, can you possibly check how much of a difference it makes to
reduce POLL_IDLE_COUNT in the patch to, say, 500 or even more?

The lower it is, the less noise it will introduce AFAICS.
On 2018.03.23 02:08 Rafael J. Wysocki wrote:
> On Fri, Mar 23, 2018 at 9:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Mar 23, 2018 at 4:19 AM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2018.03.22 12:12 Doug Smythies wrote:

...[snip]...

>>
>>> I'm not sure how good it is but I made a test. I didn't believe
>>> the results, so I did it 3 times.
>>>
>>> V7.3 is as from the git branch.
>>> V7.3p is plus the patch adding the counter loop to poll_state.c
>>>
>>> The test is a tight loop (about 19600 loops per second) running
>>> on all 8 CPUs. I can not seem to get my system to use Idle State
>>> 0, so I disabled Idle States 1 and 2 to force use of Idle State 0.
>>>
>>> V7.3 uses a processor package power of 62.5 Watts
>>> V7.3p uses a processor package power of 53.4 Watts, or 14.6% less power.
>>>
>>> The loop times do not change.
>>> The Idle state 0 residency per unit time does not change.
>>
>> OK, so this means that the results should improve for Rik with this
>> patch too. :-)

I hope so.

> BTW, can you possibly check how much of a difference it makes to
> reduce POLL_IDLE_COUNT in the patch to, say, 500 or even more?
>
> The lower it is, the less noise it will introduce AFAICS.

Well, we would expect the curve to be something like a typical 1/x curve:

  Power = 53.4 + k1 / (k2 * POLL_IDLE_COUNT + k3)

I did some runs and did a crude fit:

  Power ~= 53.4 + 35 / (POLL_IDLE_COUNT + 3)

and then calculated an allowed error from that. A count of 100 gives
back only 0.64% of the power, and so I suggest it would be a reasonable
number. That being said, my test is quite crude, and we should first
see what others, including Rik, get.

These two graphs might help explain what I did:

http://fast.smythies.com/v73p_vary_count.png
http://fast.smythies.com/v73p_extra_power.png

It is just my opinion, but I think users with very stringent idle state
0 exit latency requirements should test with POLL_IDLE_COUNT set to 1.
Then they know the worst case works, whereas they might not hit it at
1/POLL_IDLE_COUNT probability. Once happy that the worst case works,
use the nominal (T.B.D.) POLL_IDLE_COUNT for the power savings.

... Doug
On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com> wrote:
> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>
>> On x86 we don't have to use that time_check_counter thing,
>> sched_clock() is really cheap, not sure if it makes sense on other
>> platforms.
>
> Are you sure? I saw a 5-10% increase in CPU use, for a constant query
> rate to a memcache style workload, with v3 of this patch.

I think I know what's going on.

Increased utilization with the same amount of work per time unit (and
I guess that's the case, given the lack of specific information about
the workload) means more non-idle time with respect to total time, and
that implies reduced frequency (e.g. less turbo).

Now, combine that with Doug's observation that limiting the rate of
local_clock() invocations in the poll_idle() loop significantly reduces
power draw during experiments on his system, and with the other one
that in both cases local_clock() ends up being rdtsc() (most likely).

What this implies to me is that invoking rdtsc() at a high rate on
multiple logical CPUs in parallel causes chips to get hot. Actually,
they may get so hot that they hit power/thermal (e.g. RAPL) limits and
get their frequency reduced as a result.

Limiting the rate of local_clock() invocations obviously avoids this
issue.
On Tue, 2018-03-27 at 18:42 +0200, Rafael J. Wysocki wrote:
> On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com>
> wrote:
> > On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
> >
> > > On x86 we don't have to use that time_check_counter thing,
> > > sched_clock() is really cheap, not sure if it makes sense on
> > > other platforms.
> >
> > Are you sure? I saw a 5-10% increase in CPU use, for a constant
> > query rate to a memcache style workload, with v3 of this patch.
>
> I think I know what's going on.

I ran my tests wrong, and the script never propagated errors back to
me. Sigh.

However, the poll_idle() that reads the TSC at a reduced rate seems to
perform better than the one that reads the TSC every time it goes
around the loop.

The size of the idle loop seems to make a slight difference, too.
Having just one cpu_relax() in the entire loop seems to be better than
having them all over the place.
On Tue, Mar 27, 2018 at 8:02 PM, Rik van Riel <riel@surriel.com> wrote:
> On Tue, 2018-03-27 at 18:42 +0200, Rafael J. Wysocki wrote:
>> On Thu, Mar 22, 2018 at 5:32 PM, Rik van Riel <riel@surriel.com>
>> wrote:
>>> On Wed, 2018-03-14 at 13:04 +0100, Peter Zijlstra wrote:
>>>
>>>> On x86 we don't have to use that time_check_counter thing,
>>>> sched_clock() is really cheap, not sure if it makes sense on other
>>>> platforms.
>>>
>>> Are you sure? I saw a 5-10% increase in CPU use, for a constant
>>> query rate to a memcache style workload, with v3 of this patch.
>>
>> I think I know what's going on.
>
> I ran my tests wrong, and the script never propagated errors back to
> me. Sigh.

Again, no worries. :-)

> However, the poll_idle() that reads the TSC at a reduced rate seems to
> perform better than the one that reads the TSC every time it goes
> around the loop.

OK

> The size of the idle loop seems to make a slight difference, too.
> Having just one cpu_relax() in the entire loop seems to be better than
> having them all over the place.

OK

Thanks for the above observations, they match my understanding of
what's happening.

I'll resend the patch to read the TSC at a reduced rate with a proper
changelog, and I'll tentatively add it to my pm-cpuidle branch.

Thanks!
Index: linux-pm/drivers/cpuidle/poll_state.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/poll_state.c
+++ linux-pm/drivers/cpuidle/poll_state.c
@@ -5,16 +5,31 @@
  */
 
 #include <linux/cpuidle.h>
+#include <linux/ktime.h>
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 
+#define POLL_IDLE_TIME_CHECK_COUNT	1000
+#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
+
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
+	ktime_t start = ktime_get();
+
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		while (!need_resched())
+		unsigned int time_check_counter = 0;
+
+		while (!need_resched()) {
 			cpu_relax();
+			if (time_check_counter++ < POLL_IDLE_TIME_CHECK_COUNT)
+				continue;
+
+			time_check_counter = 0;
+			if (ktime_sub(ktime_get(), start) > POLL_IDLE_TIME_LIMIT)
+				break;
+		}
 	}
 	current_clr_polling();