| Message ID | 20090519073942.GA10864@sli10-desk.sh.intel.com (mailing list archive) |
|---|---|
| State | Rejected, archived |
On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote:
> ACPI 4.0 defines processor aggregator device. The device can notify OS to idle
> some CPUs to save power. This isn't to hot remove cpus, but just makes cpus
> idle.
>
> This patch adds one API to change cpuset top group's cpus. If we want to
> make one cpu idle, simply remove the cpu from cpuset top group's cpu list,
> then all tasks will be migrate to other cpus, and other tasks will not be
> migrated to this cpu again. No functional changes.
>
> We will use this API in new ACPI processor aggregator device driver later.

I don't think so. There really is a lot more to do than move processes
about.

Furthermore, I object to being able to remove online cpus from the top
cpuset, that just doesn't make sense.

I'd suggest using hotplug for this.

NAK

> Signed-off-by: Shaohua Li<shaohua.li@intel.com>
> ---
>  include/linux/cpuset.h |    5 +++++
>  kernel/cpuset.c        |   27 ++++++++++++++++++++++++---
>  2 files changed, 29 insertions(+), 3 deletions(-)
>
> Index: linux/kernel/cpuset.c
> ===================================================================
> --- linux.orig/kernel/cpuset.c	2009-05-12 16:27:16.000000000 +0800
> +++ linux/kernel/cpuset.c	2009-05-19 10:05:36.000000000 +0800
> @@ -929,14 +929,14 @@ static void update_tasks_cpumask(struct
>   * @buf: buffer of cpu numbers written to this cpuset
>   */
>  static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> -			  const char *buf)
> +			  const char *buf, bool top_ok)
>  {
>  	struct ptr_heap heap;
>  	int retval;
>  	int is_load_balanced;
>
>  	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
> -	if (cs == &top_cpuset)
> +	if (cs == &top_cpuset && !top_ok)
>  		return -EACCES;
>
>  	/*
> @@ -1496,7 +1496,7 @@ static int cpuset_write_resmask(struct c
>
>  	switch (cft->private) {
>  	case FILE_CPULIST:
> -		retval = update_cpumask(cs, trialcs, buf);
> +		retval = update_cpumask(cs, trialcs, buf, false);
>  		break;
>  	case FILE_MEMLIST:
>  		retval = update_nodemask(cs, trialcs, buf);
> @@ -1511,6 +1511,27 @@ static int cpuset_write_resmask(struct c
>  	return retval;
>  }
>
> +int cpuset_change_top_cpumask(const char *buf)
> +{
> +	int retval = 0;
> +	struct cpuset *cs = &top_cpuset;
> +	struct cpuset *trialcs;
> +
> +	if (!cgroup_lock_live_group(cs->css.cgroup))
> +		return -ENODEV;
> +
> +	trialcs = alloc_trial_cpuset(cs);
> +	if (!trialcs)
> +		return -ENOMEM;
> +
> +	retval = update_cpumask(cs, trialcs, buf, true);
> +
> +	free_trial_cpuset(trialcs);
> +	cgroup_unlock();
> +	return retval;
> +}
> +EXPORT_SYMBOL(cpuset_change_top_cpumask);
> +
>  /*
>   * These ascii lists should be read in a single call, by using a user
>   * buffer large enough to hold the entire map.  If read in smaller
> Index: linux/include/linux/cpuset.h
> ===================================================================
> --- linux.orig/include/linux/cpuset.h	2009-05-12 16:27:15.000000000 +0800
> +++ linux/include/linux/cpuset.h	2009-05-19 10:05:36.000000000 +0800
> @@ -92,6 +92,7 @@ extern void rebuild_sched_domains(void);
>
>  extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>
> +extern int cpuset_change_top_cpumask(const char *buf);
>  #else /* !CONFIG_CPUSETS */
>
>  static inline int cpuset_init_early(void) { return 0; }
> @@ -188,6 +189,10 @@ static inline void cpuset_print_task_mem
>  {
>  }
>
> +static inline int cpuset_change_top_cpumask(const char *buf)
> +{
> +	return -ENODEV;
> +}
>  #endif /* !CONFIG_CPUSETS */
>
>  #endif /* _LINUX_CPUSET_H */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
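As a rough illustration of how the proposed API was meant to be consumed, a caller such as the ACPI processor aggregator driver mentioned in the changelog might do something along the lines of the sketch below. This is not code from the posted series; the helper name force_cpu_idle() and the PAGE_SIZE buffer are illustrative assumptions.

#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/slab.h>

/* Sketch: keep @cpu idle by shrinking the top cpuset to all-online-but-@cpu. */
static int force_cpu_idle(unsigned int cpu)
{
	cpumask_var_t allowed;
	char *buf;
	int ret;

	if (!alloc_cpumask_var(&allowed, GFP_KERNEL))
		return -ENOMEM;

	/* allow every online CPU except the one we want idled */
	cpumask_copy(allowed, cpu_online_mask);
	cpumask_clear_cpu(cpu, allowed);

	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* arbitrary buffer size */
	if (!buf) {
		free_cpumask_var(allowed);
		return -ENOMEM;
	}
	cpulist_scnprintf(buf, PAGE_SIZE, allowed);

	/* shrink the top cpuset; tasks get migrated off @cpu */
	ret = cpuset_change_top_cpumask(buf);

	kfree(buf);
	free_cpumask_var(allowed);
	return ret;
}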
On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote:
> On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote:
> > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle
> > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus
> > idle.
> >
> > This patch adds one API to change cpuset top group's cpus. If we want to
> > make one cpu idle, simply remove the cpu from cpuset top group's cpu list,
> > then all tasks will be migrate to other cpus, and other tasks will not be
> > migrated to this cpu again. No functional changes.
> >
> > We will use this API in new ACPI processor aggregator device driver later.
>
> I don't think so. There really is a lot more to do than move processes
> about.

Having no processes running on the cpu is good enough for us; we don't care
about interrupts/softirqs/timers so far.

> Furthermore, I object to being able to remove online cpus from the top
> cpuset, that just doesn't make sense.
>
> I'd suggest using hotplug for this.

cpu hotplug involves too many things, and we are afraid it's not reliable.
Besides, a hot-removed cpu will sit in a dead-loop halt, which isn't efficient
for power saving. Making a hot-removed cpu enter a deep C-state has been on
the wish list for a long time, but is still not available. acpi_processor_idle
is a module, and the cpuidle governor potentially can't handle an offline cpu.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-19 at 16:48 +0800, Shaohua Li wrote: > On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote: > > On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote: > > > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle > > > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus > > > idle. > > > > > > This patch adds one API to change cpuset top group's cpus. If we want to > > > make one cpu idle, simply remove the cpu from cpuset top group's cpu list, > > > then all tasks will be migrate to other cpus, and other tasks will not be > > > migrated to this cpu again. No functional changes. > > > > > > We will use this API in new ACPI processor aggregator device driver later. > > > > I don't think so. There really is a lot more to do than move processes > > about. > no processor running is good enough for us, we don't care about interrupts/softirq/ > timers so far. Well, I don't care for this interface. > > Furthermore, I object to being able to remove online cpus from the top > > cpuset, that just doesn't make sense. > > > > I'd suggest using hotplug for this. > cpu hotplug involves too much things, and we are afraid it's not reliable. Then make it more reliable instead of providing ugly ass shit like this. > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving > efficient. To make hot removed cpu enters deep C-state is in whish list for a > long time, but still not available. The acpi_processor_idle is a module, and > cpuidle governor potentially can't handle offline cpu. Then fix that hot-unplug idle loop. I agree that the hlt thing is silly, and I've no idea why its still there, seems like a much better candidate for your efforts than this. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2009 at 04:56:04PM +0800, Peter Zijlstra wrote: > On Tue, 2009-05-19 at 16:48 +0800, Shaohua Li wrote: > > On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote: > > > On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote: > > > > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle > > > > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus > > > > idle. > > > > > > > > This patch adds one API to change cpuset top group's cpus. If we want to > > > > make one cpu idle, simply remove the cpu from cpuset top group's cpu list, > > > > then all tasks will be migrate to other cpus, and other tasks will not be > > > > migrated to this cpu again. No functional changes. > > > > > > > > We will use this API in new ACPI processor aggregator device driver later. > > > > > > I don't think so. There really is a lot more to do than move processes > > > about. > > no processor running is good enough for us, we don't care about interrupts/softirq/ > > timers so far. > > Well, I don't care for this interface. > > > > Furthermore, I object to being able to remove online cpus from the top > > > cpuset, that just doesn't make sense. > > > > > > I'd suggest using hotplug for this. > > > cpu hotplug involves too much things, and we are afraid it's not reliable. > > Then make it more reliable instead of providing ugly ass shit like this. I wonder why this is that ugly. We have a cpu_isolated_map, which is just like this. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-19 at 17:06 +0800, Shaohua Li wrote: > On Tue, May 19, 2009 at 04:56:04PM +0800, Peter Zijlstra wrote: > > On Tue, 2009-05-19 at 16:48 +0800, Shaohua Li wrote: > > > On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote: > > > > On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote: > > > > > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle > > > > > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus > > > > > idle. > > > > > > > > > > This patch adds one API to change cpuset top group's cpus. If we want to > > > > > make one cpu idle, simply remove the cpu from cpuset top group's cpu list, > > > > > then all tasks will be migrate to other cpus, and other tasks will not be > > > > > migrated to this cpu again. No functional changes. > > > > > > > > > > We will use this API in new ACPI processor aggregator device driver later. > > > > > > > > I don't think so. There really is a lot more to do than move processes > > > > about. > > > no processor running is good enough for us, we don't care about interrupts/softirq/ > > > timers so far. > > > > Well, I don't care for this interface. > > > > > > Furthermore, I object to being able to remove online cpus from the top > > > > cpuset, that just doesn't make sense. > > > > > > > > I'd suggest using hotplug for this. > > > > > cpu hotplug involves too much things, and we are afraid it's not reliable. > > > > Then make it more reliable instead of providing ugly ass shit like this. > I wonder why this is that ugly. We have a cpu_isolated_map, which is just like > this. And just as ugly -- it should die too. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-19 at 10:56 +0200, Peter Zijlstra wrote: > On Tue, 2009-05-19 at 16:48 +0800, Shaohua Li wrote: > > On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote: > > > On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote: > > > > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle > > > > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus > > > > idle. > > > > > > > > This patch adds one API to change cpuset top group's cpus. If we want to > > > > make one cpu idle, simply remove the cpu from cpuset top group's cpu list, > > > > then all tasks will be migrate to other cpus, and other tasks will not be > > > > migrated to this cpu again. No functional changes. > > > > > > > > We will use this API in new ACPI processor aggregator device driver later. > > > > > > I don't think so. There really is a lot more to do than move processes > > > about. > > no processor running is good enough for us, we don't care about interrupts/softirq/ > > timers so far. > > Well, I don't care for this interface. > > > > Furthermore, I object to being able to remove online cpus from the top > > > cpuset, that just doesn't make sense. > > > > > > I'd suggest using hotplug for this. > > > cpu hotplug involves too much things, and we are afraid it's not reliable. > > Then make it more reliable instead of providing ugly ass shit like this. OK, so perhaps I should have use different words. But the point is, we don't need a new interface to force a cpu idle. Hotplug does that. Furthermore, we should not want anything outside of that, either the cpu is there available for work, or its not -- halfway measures don't make sense. Furthermore, we already have power aware scheduling which tries to aggregate idle time on cpu/core/packages so as to maximize the idle time power savings. Use it there. > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving > > efficient. To make hot removed cpu enters deep C-state is in whish list for a > > long time, but still not available. The acpi_processor_idle is a module, and > > cpuidle governor potentially can't handle offline cpu. > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly, > and I've no idea why its still there, seems like a much better candidate > for your efforts than this. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra <peterz@infradead.org> writes:
>
> Furthermore, I object to being able to remove online cpus from the top
> cpuset, that just doesn't make sense.

Note you can already do it at boot time with isolcpus=...
So your objection seems to be a few years too late.

Shaohua's patch just makes it work at runtime too.

-Andi
On Tue, 2009-05-19 at 13:27 +0200, Andi Kleen wrote: > Peter Zijlstra <peterz@infradead.org> writes: > > > > Furthermore, I object to being able to remove online cpus from the top > > cpuset, that just doesn't make sense. > > Note you can already do it at boot time with isolated_cpus=... > So your objection seems to be a few years too late. > > Shaohua's patch just makes it work at runtime too. No it doesn't. And isolcpus should die. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Peter Zijlstra <peterz@infradead.org> [2009-05-19 12:38:58]: > On Tue, 2009-05-19 at 10:56 +0200, Peter Zijlstra wrote: > > On Tue, 2009-05-19 at 16:48 +0800, Shaohua Li wrote: > > > On Tue, May 19, 2009 at 04:40:54PM +0800, Peter Zijlstra wrote: > > > > On Tue, 2009-05-19 at 15:39 +0800, Shaohua Li wrote: > > > > > ACPI 4.0 defines processor aggregator device. The device can notify OS to idle > > > > > some CPUs to save power. This isn't to hot remove cpus, but just makes cpus > > > > > idle. > > > > > > > > > > This patch adds one API to change cpuset top group's cpus. If we want to > > > > > make one cpu idle, simply remove the cpu from cpuset top group's cpu list, > > > > > then all tasks will be migrate to other cpus, and other tasks will not be > > > > > migrated to this cpu again. No functional changes. > > > > > > > > > > We will use this API in new ACPI processor aggregator device driver later. > > > > > > > > I don't think so. There really is a lot more to do than move processes > > > > about. > > > no processor running is good enough for us, we don't care about interrupts/softirq/ > > > timers so far. > > > > Well, I don't care for this interface. > > > > > > Furthermore, I object to being able to remove online cpus from the top > > > > cpuset, that just doesn't make sense. > > > > > > > > I'd suggest using hotplug for this. > > > > > cpu hotplug involves too much things, and we are afraid it's not reliable. > > > > Then make it more reliable instead of providing ugly ass shit like this. > > OK, so perhaps I should have use different words. But the point is, we > don't need a new interface to force a cpu idle. Hotplug does that. We tried similar approaches to create idle time for power savings, but cpu hotplug interface seem to be a clean choice. There could be issues with the interface, we should fix it. Is there any other reason why cpuhotplug is 'ugly' other than its performance (speed)? I have tried few load balancer hacks to evacuate cores but not a solid design yet. It has its advantages but still needs more work. http://lkml.org/lkml/2009/5/13/173 > Furthermore, we should not want anything outside of that, either the cpu > is there available for work, or its not -- halfway measures don't make > sense. > > Furthermore, we already have power aware scheduling which tries to > aggregate idle time on cpu/core/packages so as to maximize the idle time > power savings. Use it there. Power aware scheduling can optimally accumulate idle times. Framework to create idle time to force idle cores is good and useful for power savings. Other than the speed of online/offline I do not know of any other major issue for using cpu hotplug for this purpose. > > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving > > > efficient. To make hot removed cpu enters deep C-state is in whish list for a > > > long time, but still not available. The acpi_processor_idle is a module, and > > > cpuidle governor potentially can't handle offline cpu. > > > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly, > > and I've no idea why its still there, seems like a much better candidate > > for your efforts than this. I agree with Peter. We need to make cpu hotplug save power first and later improve upon its performance. --Vaidy -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> ... the point is, we
> don't need a new interface to force a cpu idle. Hotplug does that.
>
> Furthermore, we should not want anything outside of that, either the cpu
> is there available for work, or its not -- halfway measures don't make
> sense.
>
> Furthermore, we already have power aware scheduling which tries to
> aggregate idle time on cpu/core/packages so as to maximize the idle time
> power savings. Use it there.

Some context...

In the past, server room power and thermal issues were handled either by
spending too much money to provision power and thermals for the theoretical
worst case, or by abruptly shutting off servers when hard limits were reached.

Going forward, platforms are getting smarter, measuring how much power is
drawn from the power supply, measuring the room thermals etc. so that real
dollars can be saved by deploying systems that exceed the theoretical worst
case if the power and thermal limits are enforced.

So if a server approaches a budget, the platform will notify the OS to limit
its P-states, and limit its T-states in order to draw less power.

If that is not sufficient, the platform will ask us to take processors
off-line. These are not processors that are otherwise idle -- those are
already saving as much power as they can -- these are processors that are
fully utilized.

So power-aware scheduling is moot here; this isn't the partially idle case,
this is the fully utilized case.

If power draw continues to be too high, the platform will simply ask us to
take more processors off-line.

If this dance doesn't reduce power below that required, the platform will be
shut off.

So it is sufficient to simply not schedule cpu burners on the 'idled'
processor. Interrupts should generally not matter -- and if they do, we'll
end up simply idling an additional processor.

> > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving
> > > efficient. To make hot removed cpu enters deep C-state is in whish list for a
> > > long time, but still not available. The acpi_processor_idle is a module, and
> > > cpuidle governor potentially can't handle offline cpu.
> >
> > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly,
> > and I've no idea why its still there, seems like a much better candidate
> > for your efforts than this.

CONFIG_HOTPLUG_CPU has been problematic in the past. It does more than what
we need here, so we thought a lighter-weight and lower-latency method that
simply didn't schedule to the idled cpu would suffice.

Personally, I don't think that CONFIG_HOTPLUG_CPU should exist; taking
processors on and off-line should be part of CONFIG_SMP.

A while back when I selected CONFIG_HOTPLUG_CPU from ACPI && SMP, there was a
torrent of outrage that it infringed on users' rights to save that additional
18KB of memory that CONFIG_HOTPLUG_CPU includes that SMP does not...

We are fixing the hot-unplug idle loop, but it turns out there are some
issues with it related to idle processors with interrupts disabled that don't
actually get down into the deep C-states we request:-(

So this is why you see a patch for a "halfway measure": it does what is
necessary, and does nothing more.

-Len
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 19, 2009 at 12:39 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>
> This patch adds one API to change cpuset top group's cpus. If we want to
> make one cpu idle, simply remove the cpu from cpuset top group's cpu list,
> then all tasks will be migrate to other cpus, and other tasks will not be
> migrated to this cpu again. No functional changes.
>
> +int cpuset_change_top_cpumask(const char *buf)
> +{
> +	int retval = 0;
> +	struct cpuset *cs = &top_cpuset;
> +	struct cpuset *trialcs;
> +
> +	if (!cgroup_lock_live_group(cs->css.cgroup))
> +		return -ENODEV;

top_cpuset can't possibly be dead, so a plain cgroup_lock() would be fine here.

> +
> +	trialcs = alloc_trial_cpuset(cs);
> +	if (!trialcs)
> +		return -ENOMEM;

You returned without doing a cgroup_unlock()

> +
> +	retval = update_cpumask(cs, trialcs, buf, true);

This will fail if any child cpuset is using any cpu not in the new cpumask,
since a child's cpumask must be a subset of its parent's. So this can't work
without co-ordination with userspace regarding child cpusets.

Given that, it seems simpler to do the whole thing in userspace, or just use
the existing hotplug infrastructure.

Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
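For reference, a sketch of what the function could look like with Paul's first two comments applied (plain cgroup_lock(), and an unlock on the allocation-failure path). It follows the shape of the posted patch and is not the final code; the locking issue Paul raises about child cpusets is not addressed here.

int cpuset_change_top_cpumask(const char *buf)
{
	int retval;
	struct cpuset *cs = &top_cpuset;
	struct cpuset *trialcs;

	/* top_cpuset can never be dead, so plain cgroup_lock() suffices */
	cgroup_lock();

	trialcs = alloc_trial_cpuset(cs);
	if (!trialcs) {
		retval = -ENOMEM;
		goto out_unlock;	/* the unlock missing from the original */
	}

	retval = update_cpumask(cs, trialcs, buf, true);
	free_trial_cpuset(trialcs);

out_unlock:
	cgroup_unlock();
	return retval;
}
EXPORT_SYMBOL(cpuset_change_top_cpumask);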
On Tue, 2009-05-19 at 15:01 -0400, Len Brown wrote:
> > ... the point is, we
> > don't need a new interface to force a cpu idle. Hotplug does that.
> >
> > Furthermore, we should not want anything outside of that, either the cpu
> > is there available for work, or its not -- halfway measures don't make
> > sense.
> >
> > Furthermore, we already have power aware scheduling which tries to
> > aggregate idle time on cpu/core/packages so as to maximize the idle time
> > power savings. Use it there.
>
> Some context...

<snip default story of thermal overcommit>

> > > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving
> > > > efficient. To make hot removed cpu enters deep C-state is in whish list for a
> > > > long time, but still not available. The acpi_processor_idle is a module, and
> > > > cpuidle governor potentially can't handle offline cpu.
> > >
> > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly,
> > > and I've no idea why its still there, seems like a much better candidate
> > > for your efforts than this.
>
> CONFIG_HOTPLUG_CPU has been problematic in the past.
> It does more than what we need here, so we thought
> a lighter-weight and lower-latency method that simply
> didn't schedule to the idled cpu would suffice.
> We are fixing the hotplug-unplug idle loop, but there
> turns out to be some issues with it related to idle
> processors with interrupts disabled that don't actually
> get down into the deep C-states we request:-(
>
> So this is why you see a patch for a "halfway measure",
> it does what is necessary, and does nothing more.

It's broken, it's ill-defined and it's not going to happen.

Ripping cpus out of the top cpuset might upset the cpuset configuration and
has no regard for any realtime processes. And I must take back my earlier
suggestion, hotplug is a bad solution too.

There's just too much user policy (cpuset configuration) to upset.

The IBM folks are working on a scheduler-based solution, please talk to them.
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra <peterz@infradead.org> writes:

Peter, in general the discussion would be much more fruitful if you explained
your reasoning more verbosely. I can only guess what your rationales are from
your half-sentence pronouncements.

> and has no regards for any realtime processes.

You're saying this should not be done if any realtime processes are currently
bound to a CPU that is to be temporarily removed? That sounds reasonable, and
I'm sure it could be implemented with the original patch.

> And I must take back my
> earlier suggestion, hotplug is a bad solution too.
>
> There's just too much user policy (cpuset configuration) to upset.

Could you explain that please? How does changing the top level cpuset affect
other cpu sets?

> The IBM folks are working on a scheduler based solution, please talk to
> them.

I don't claim to fully understand the scheduler, but naively, since cpusets
can already do this, adding another mechanism for it that needs to be checked
in fast paths would seem somewhat redundant?

-Andi
On Wed, 2009-05-20 at 13:58 +0200, Andi Kleen wrote: > Could you explain that please? How does changing the top level > cpuset affect other cpu sets? Suppose you have 8 cpus and created 3 cpusets: A: cpu0 - system administration stuff B: cpu1-5 - generic computational stuff C: cpu6-7 - latency critical stuff Each such set is made a load-balance domain (iow load-balancing on the top level set is disabled). Now, suppose someone thinks its a good idea to remove cpu0 because the machine is running against some thermal limit -- what will all the administration stuff (including sshd) do? Same goes for the latency critical stuff. You really want to start shrinking the generic computational capacity first. The thing is, you cannot simply rip cpus out from under a system, people might rely on them being there and have policy attached to them -- esp. people touching cpusets should know that a machine isn't configured homogeneous and any odd cpu will do. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Thanks for the explanation.

My naive reaction would be to fail if the socket to be taken out is the only
member of some cpuset. Or maybe break affinities in this case.

> You really want to start shrinking the generic computational capacity
> first.

One general issue to remember is that if you don't react to the platform
hint, the platform will likely force a lower p-state on you to not exceed the
thermal limits, making everyone slower.

(this will likely also not make your real time process happy)

So it's a bit more than a hint; it's more like a command "or else"

So it's a good idea to react, or at least make a reasonable attempt to react.

> The thing is, you cannot simply rip cpus out from under a system, people
> might rely on them being there and have policy attached to them -- esp.
> people touching cpusets should know that a machine isn't configured
> homogeneous and any odd cpu will do.

Ok, so do you think it's possible to figure out based on the cpuset graph /
real time runqueue if a socket can be taken out?

-Andi
On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote: > Thanks for the explanation. > > My naive reaction would be to fail if the socket to be taken out > is the only member of some cpuset. Or maybe break affinities in this case. Right, breaking affinities would go against the policy of the admin, I'm not sure we'd want to go there. We could start generating msgs about how we're in thermal trouble and the given configuration is obstructing counter measures etc.. Currently hot-unplug does break affinities, but that's an explicit action by the admin himself, so he gets what he asks for (and we do generate complaints in syslog about it). [ Same scenario for the HPC guys who affinity fix all their threads to specific cpus, there's really nothing you can do there. Then again such folks generally run their machines at 100% so they'd better be able to deal with their thermal peak capacity anyway. ] > > You really want to start shrinking the generic computational capacity > > first. > > One general issue to remember that if you don't react to the platform hint > the platform will likely force a lower p-state on you to not exceed > the thermal limits, making everyone slower. > > (this will likely also not make your real time process happy) Quite. > So it's a bit more than a hint; it's more like a command "or else" > > So it's a good idea to react or at least make at least a reasonable attempt > to react. Sure, does the thing give more than a: 'react now, or else' impulse? That is, can we see it coming, or will we have to deal with it when we're there? The latter also has the problem that you have to react very quickly. > > The thing is, you cannot simply rip cpus out from under a system, people > > might rely on them being there and have policy attached to them -- esp. > > people touching cpusets should know that a machine isn't configured > > homogeneous and any odd cpu will do. > > Ok, so do you think it's possible to figure out based on the cpuset > graph / real time runqueue if a socket can be taken out? Right, so all of this depends on a number of things, how frequent and how fast would these situations occur? I would think they'd be rare events, otherwise you really messed up your infrastructure. I also think reaction times should be in the seconds, otherwise you're cutting it way to close. The work IBM has been doing is centered around overloading neighbouring packages in order to keep some idle. The overload is exposed as a percentage. This works within scheduling domains, so if you carve your machine up in tiny (<= 1 package) domains its impossible to do anything (corner case, we could send cries for help syslog's way). I was hoping we could control the situation with that. But for that to work we need some gradual information in order to make that thermal<->overload feedback work. A single: idle a core now (< 'n' sec) or die, isn't really helpful. [ figuring out how to deal with RT tasks and the like is still open, the problem with SCHED_FIFO/RR is that such tasks don't give utilization numbers, so we'll have to guesstimate them based on historic behaviour. SCHED_EDF or similar future realtime bits would be much easier to deal with in this case ] -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 20, 2009 at 03:41:55PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote:
> > Thanks for the explanation.
> >
> > My naive reaction would be to fail if the socket to be taken out
> > is the only member of some cpuset. Or maybe break affinities in this case.
>
> Right, breaking affinities would go against the policy of the admin, I'm
> not sure we'd want to go there.
> We could start generating msgs about how
> we're in thermal trouble and the given configuration is obstructing
> counter measures etc..

Makes sense.

>
> Currently hot-unplug does break affinities, but that's an explicit
> action by the admin himself, so he gets what he asks for (and we do

I have some code which can do it implicitly too in mcelog (not yet out).
Basically the CPU can detect when its caches have a problem and the reaction
is then to offline the affected CPUs. But that's a very obscure case and the
alternative is to die.

> generate complaints in syslog about it).

One possible alternative would also be "weak breaking", as in remembering the
old affinities and reinstating them once the CPU becomes online again.

> [ Same scenario for the HPC guys who affinity fix all their threads to
> specific cpus, there's really nothing you can do there. Then again
> such folks generally run their machines at 100% so they'd better
> be able to deal with their thermal peak capacity anyway. ]

Yes. Same for real time. These guys are really not expected to use these
advanced power management features.

> > So it's a bit more than a hint; it's more like a command "or else"
> >
> > So it's a good idea to react or at least make at least a reasonable attempt
> > to react.
>
> Sure, does the thing give more than a: 'react now, or else' impulse?
> That is, can we see it coming, or will we have to deal with it when
> we're there?
>
> The latter also has the problem that you have to react very quickly.

My understanding is that it is a quite strong hint: "do the best you can".
So yes, doing it quickly would be good.

>
> > > The thing is, you cannot simply rip cpus out from under a system, people
> > > might rely on them being there and have policy attached to them -- esp.
> > > people touching cpusets should know that a machine isn't configured
> > > homogeneous and any odd cpu will do.
> >
> > Ok, so do you think it's possible to figure out based on the cpuset
> > graph / real time runqueue if a socket can be taken out?
>
> Right, so all of this depends on a number of things, how frequent and
> how fast would these situations occur?
>
> I would think they'd be rare events, otherwise you really messed up your

My assumption too.

> infrastructure. I also think reaction times should be in the seconds,
> otherwise you're cutting it way to close.

Yep.

> I was hoping we could control the situation with that. But for that to
> work we need some gradual information in order to make that
> thermal<->overload feedback work.
>
>
> A single: idle a core now (< 'n' sec) or die, isn't really helpful.

That's what you get unfortunately.

-Andi
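The "weak breaking" Andi floats above could, very roughly, look like the sketch below. Where the saved mask would be stored and when these helpers would run is left open; the function names are invented for illustration and do not belong to any posted patch.

#include <linux/sched.h>
#include <linux/cpumask.h>

/* Remember the task's admin-set affinity before temporarily widening it. */
static void weak_break_affinity(struct task_struct *p, struct cpumask *saved)
{
	cpumask_copy(saved, &p->cpus_allowed);	  /* remember old policy   */
	set_cpus_allowed_ptr(p, cpu_active_mask); /* let it run elsewhere  */
}

/* Reinstate the remembered affinity once the CPU is available again. */
static void restore_affinity(struct task_struct *p, const struct cpumask *saved)
{
	set_cpus_allowed_ptr(p, saved);
}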
* Len Brown <lenb@kernel.org> [2009-05-19 15:01:46]: > > ... the point is, we > > don't need a new interface to force a cpu idle. Hotplug does that. > > > > Furthermore, we should not want anything outside of that, either the cpu > > is there available for work, or its not -- halfway measures don't make > > sense. > > > > Furthermore, we already have power aware scheduling which tries to > > aggregate idle time on cpu/core/packages so as to maximize the idle time > > power savings. Use it there. > > Some context... > > In the past, server room power and thermal issues were handled > either by spending too much money to provision power and > thermals for theoretical worst case, or by abruptly shutting off > servers when hard limits were reached. > > Going forward, platforms are getting smarter, measuring how > much power is drawn from the power supply, measuring the room > thermals etc. so that real dollars can be saved by deploying > systems that exceed the theoretical worst case if the power > and thermal limits are enforced. > > So if server approaches a budget, the platform > will notify the OS to limit its P-states, and limit its T-states > in order to draw less power. > > If that is not sufficient, the platform will ask us to take > processors off-line. These are not processors that are otherwise idle > -- those are already saving as much power as they can -- > these are processors that are fully utilized. > > So power-aware scheduling is moot here, this isn't the > partially idle case, this is the fully utilized case. Hi Len, Over and above power-aware scheduling we have been exploring possibility of forcefully idle cpu for power savings. This is mostly useful in thermal case that you have mentioned and also to provide fine grain power vs performance trade-offs. Creating idle times and consolidating idle time efficiently in order to evacuate cores and packages provides a framework to exploit C-States apart from P-States and T-States that you have mentioned above. Addition of C-States control to save power and heat may make the system do more instructions at a given power/thermal constraint. Reference: http://lkml.org/lkml/2009/5/13/173 > If power draw continues to be too high, the platform > will simply ask us to take more processors off line. > > If this dance doesn't reduce power below that required, > the platform will be shut off. > > So it is sufficient to simply not schedule cpu burners > on the 'idled' processor. Interrupts should generally > not matter -- and if they do, we'll end up simply idling > an additional processor. The requirements and use cases are clear. > > > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving > > > > efficient. To make hot removed cpu enters deep C-state is in whish list for a > > > > long time, but still not available. The acpi_processor_idle is a module, and > > > > cpuidle governor potentially can't handle offline cpu. > > > > > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly, > > > and I've no idea why its still there, seems like a much better candidate > > > for your efforts than this. > > CONFIG_HOTPLUG_CPU has been problematic in the past. > It does more than what we need here, so we thought > a lighter-weight and lower-latency method that simply > didn't schedule to the idled cpu would suffice. > > Personally, I don't think that CONFIG_HOTPLUG_CPU should exist, > taking processors on and off-line should be part of CONFIG_SMP. 
>
> A while back when I selected CONFIG_HOTPLUG_CPU from ACPI && SMP,
> there was a torrent of outrage that it infringed on user's right's
> to save that additional 18KB of memory that CONFIG_HOTPLUG_CPU
> includes that SMP does not...
>
> We are fixing the hotplug-unplug idle loop, but there
> turns out to be some issues with it related to idle
> processors with interrupts disabled that don't actually
> get down into the deep C-states we request:-(

Fixing the hot-unplug idle loop will help us use the cpu-hotplug
infrastructure for many other purposes, such as power/thermal management.
Do you think there could be some workaround/solution for this in the short
term?

> So this is why you see a patch for a "halfway measure",
> it does what is necessary, and does nothing more.

Peter had detailed comments on this aspect.

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Peter Zijlstra <peterz@infradead.org> [2009-05-20 15:41:55]: > On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote: > > Thanks for the explanation. > > > > My naive reaction would be to fail if the socket to be taken out > > is the only member of some cpuset. Or maybe break affinities in this case. > > Right, breaking affinities would go against the policy of the admin, I'm > not sure we'd want to go there. We could start generating msgs about how > we're in thermal trouble and the given configuration is obstructing > counter measures etc.. > > Currently hot-unplug does break affinities, but that's an explicit > action by the admin himself, so he gets what he asks for (and we do > generate complaints in syslog about it). > > [ Same scenario for the HPC guys who affinity fix all their threads to > specific cpus, there's really nothing you can do there. Then again > such folks generally run their machines at 100% so they'd better > be able to deal with their thermal peak capacity anyway. ] > > > > You really want to start shrinking the generic computational capacity > > > first. > > > > One general issue to remember that if you don't react to the platform hint > > the platform will likely force a lower p-state on you to not exceed > > the thermal limits, making everyone slower. > > > > (this will likely also not make your real time process happy) > > Quite. > > > So it's a bit more than a hint; it's more like a command "or else" > > > > So it's a good idea to react or at least make at least a reasonable attempt > > to react. > > Sure, does the thing give more than a: 'react now, or else' impulse? > That is, can we see it coming, or will we have to deal with it when > we're there? > > The latter also has the problem that you have to react very quickly. > > > > The thing is, you cannot simply rip cpus out from under a system, people > > > might rely on them being there and have policy attached to them -- esp. > > > people touching cpusets should know that a machine isn't configured > > > homogeneous and any odd cpu will do. > > > > Ok, so do you think it's possible to figure out based on the cpuset > > graph / real time runqueue if a socket can be taken out? > > Right, so all of this depends on a number of things, how frequent and > how fast would these situations occur? > > I would think they'd be rare events, otherwise you really messed up your > infrastructure. I also think reaction times should be in the seconds, > otherwise you're cutting it way to close. > > > The work IBM has been doing is centered around overloading neighbouring > packages in order to keep some idle. The overload is exposed as a > percentage. > > This works within scheduling domains, so if you carve your machine up in > tiny (<= 1 package) domains its impossible to do anything (corner case, > we could send cries for help syslog's way). > > I was hoping we could control the situation with that. But for that to > work we need some gradual information in order to make that > thermal<->overload feedback work. The advantages of this method is to reduce load on one package and not target a particular CPU. This is less restrictive and can allow the load balancer to work out the details. Keeping a core idle on an average (over a time interval) is good enough to reduce the power and heat. Here we need not touch the RT jobs or break use space policies. We effectively reduce capacity and let the loadbalancer have the flexibility of figuring out which CPU should not be scheduled now. 
That said, this is not useful for a 'cpu cache error' case, in which case you
will have to cpu-hot-unplug anyway. You don't want any interrupts/timers to
land on an unreliable CPU.

Overloading the powersave load balancer to assume reduced capacity on some of
the packages while overloading some other packages is the core idea. The RFC
patches still need a lot of work to meet the required functionality.

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 21, 2009 at 01:36:35AM +0800, Vaidyanathan Srinivasan wrote: > * Peter Zijlstra <peterz@infradead.org> [2009-05-20 15:41:55]: > > > On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote: > > > Thanks for the explanation. > > > > > > My naive reaction would be to fail if the socket to be taken out > > > is the only member of some cpuset. Or maybe break affinities in this case. > > > > Right, breaking affinities would go against the policy of the admin, I'm > > not sure we'd want to go there. We could start generating msgs about how > > we're in thermal trouble and the given configuration is obstructing > > counter measures etc.. > > > > Currently hot-unplug does break affinities, but that's an explicit > > action by the admin himself, so he gets what he asks for (and we do > > generate complaints in syslog about it). > > > > [ Same scenario for the HPC guys who affinity fix all their threads to > > specific cpus, there's really nothing you can do there. Then again > > such folks generally run their machines at 100% so they'd better > > be able to deal with their thermal peak capacity anyway. ] > > > > > > You really want to start shrinking the generic computational capacity > > > > first. > > > > > > One general issue to remember that if you don't react to the platform hint > > > the platform will likely force a lower p-state on you to not exceed > > > the thermal limits, making everyone slower. > > > > > > (this will likely also not make your real time process happy) > > > > Quite. > > > > > So it's a bit more than a hint; it's more like a command "or else" > > > > > > So it's a good idea to react or at least make at least a reasonable attempt > > > to react. > > > > Sure, does the thing give more than a: 'react now, or else' impulse? > > That is, can we see it coming, or will we have to deal with it when > > we're there? > > > > The latter also has the problem that you have to react very quickly. > > > > > > The thing is, you cannot simply rip cpus out from under a system, people > > > > might rely on them being there and have policy attached to them -- esp. > > > > people touching cpusets should know that a machine isn't configured > > > > homogeneous and any odd cpu will do. > > > > > > Ok, so do you think it's possible to figure out based on the cpuset > > > graph / real time runqueue if a socket can be taken out? > > > > Right, so all of this depends on a number of things, how frequent and > > how fast would these situations occur? > > > > I would think they'd be rare events, otherwise you really messed up your > > infrastructure. I also think reaction times should be in the seconds, > > otherwise you're cutting it way to close. > > > > > > The work IBM has been doing is centered around overloading neighbouring > > packages in order to keep some idle. The overload is exposed as a > > percentage. > > > > This works within scheduling domains, so if you carve your machine up in > > tiny (<= 1 package) domains its impossible to do anything (corner case, > > we could send cries for help syslog's way). > > > > I was hoping we could control the situation with that. But for that to > > work we need some gradual information in order to make that > > thermal<->overload feedback work. > > The advantages of this method is to reduce load on one package and not > target a particular CPU. This is less restrictive and can allow the > load balancer to work out the details. Keeping a core idle on an > average (over a time interval) is good enough to reduce the power and > heat. 
> > Here we need not touch the RT jobs or break use space policies. We > effectively reduce capacity and let the loadbalancer have the > flexibility of figuring out which CPU should not be scheduled now. > > That said, this is not useful for a 'cpu cache error' case, in which > case you will have to cpu-hot-unplug anyway. You don't want any > interrupts/timers to land there in an unreliable CPU. > > Overloading the powersave load balancer to assume reduced capacity on > some of the packages while overloading some others packages is the > core idea. The RFC patches still need a lot of work to meet the > required functionality. So the main concern is breaking user policy, but it appears any approach (cpu hotplug/cpuset) will break user policy (affinity). I wonder how the scheduler approach can overcome this to my little scheduler knowledge. Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Shaohua Li <shaohua.li@intel.com> [2009-05-21 09:22:13]: > On Thu, May 21, 2009 at 01:36:35AM +0800, Vaidyanathan Srinivasan wrote: > > * Peter Zijlstra <peterz@infradead.org> [2009-05-20 15:41:55]: > > > > > On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote: > > > > Thanks for the explanation. > > > > > > > > My naive reaction would be to fail if the socket to be taken out > > > > is the only member of some cpuset. Or maybe break affinities in this case. > > > > > > Right, breaking affinities would go against the policy of the admin, I'm > > > not sure we'd want to go there. We could start generating msgs about how > > > we're in thermal trouble and the given configuration is obstructing > > > counter measures etc.. > > > > > > Currently hot-unplug does break affinities, but that's an explicit > > > action by the admin himself, so he gets what he asks for (and we do > > > generate complaints in syslog about it). > > > > > > [ Same scenario for the HPC guys who affinity fix all their threads to > > > specific cpus, there's really nothing you can do there. Then again > > > such folks generally run their machines at 100% so they'd better > > > be able to deal with their thermal peak capacity anyway. ] > > > > > > > > You really want to start shrinking the generic computational capacity > > > > > first. > > > > > > > > One general issue to remember that if you don't react to the platform hint > > > > the platform will likely force a lower p-state on you to not exceed > > > > the thermal limits, making everyone slower. > > > > > > > > (this will likely also not make your real time process happy) > > > > > > Quite. > > > > > > > So it's a bit more than a hint; it's more like a command "or else" > > > > > > > > So it's a good idea to react or at least make at least a reasonable attempt > > > > to react. > > > > > > Sure, does the thing give more than a: 'react now, or else' impulse? > > > That is, can we see it coming, or will we have to deal with it when > > > we're there? > > > > > > The latter also has the problem that you have to react very quickly. > > > > > > > > The thing is, you cannot simply rip cpus out from under a system, people > > > > > might rely on them being there and have policy attached to them -- esp. > > > > > people touching cpusets should know that a machine isn't configured > > > > > homogeneous and any odd cpu will do. > > > > > > > > Ok, so do you think it's possible to figure out based on the cpuset > > > > graph / real time runqueue if a socket can be taken out? > > > > > > Right, so all of this depends on a number of things, how frequent and > > > how fast would these situations occur? > > > > > > I would think they'd be rare events, otherwise you really messed up your > > > infrastructure. I also think reaction times should be in the seconds, > > > otherwise you're cutting it way to close. > > > > > > > > > The work IBM has been doing is centered around overloading neighbouring > > > packages in order to keep some idle. The overload is exposed as a > > > percentage. > > > > > > This works within scheduling domains, so if you carve your machine up in > > > tiny (<= 1 package) domains its impossible to do anything (corner case, > > > we could send cries for help syslog's way). > > > > > > I was hoping we could control the situation with that. But for that to > > > work we need some gradual information in order to make that > > > thermal<->overload feedback work. 
> >
> > The advantages of this method is to reduce load on one package and not
> > target a particular CPU. This is less restrictive and can allow the
> > load balancer to work out the details. Keeping a core idle on an
> > average (over a time interval) is good enough to reduce the power and
> > heat.
> >
> > Here we need not touch the RT jobs or break use space policies. We
> > effectively reduce capacity and let the loadbalancer have the
> > flexibility of figuring out which CPU should not be scheduled now.
> >
> > That said, this is not useful for a 'cpu cache error' case, in which
> > case you will have to cpu-hot-unplug anyway. You don't want any
> > interrupts/timers to land there in an unreliable CPU.
> >
> > Overloading the powersave load balancer to assume reduced capacity on
> > some of the packages while overloading some others packages is the
> > core idea. The RFC patches still need a lot of work to meet the
> > required functionality.
> So the main concern is breaking user policy, but it appears any approach
> (cpu hotplug/cpuset) will break user policy (affinity). I wonder how the
> scheduler approach can overcome this to my little scheduler knowledge.

In the scheduler load-balancer approach we have a notion like "run 3 tasks on
a quad core" without specifying which cpu to evacuate. So it is possible to
respect task affinity by throttling tasks so as not to run all the cores
simultaneously. Even if the system is completely loaded, we can use all CPUs
but avoid one core at a given time. The input knob is a system-wide capacity
percentage that can be reduced, and this reduced capacity, in multiples of
cores, can be uniformly spread across the system. This is a possibility with
the scheduler approach, but the current set of RFC patches is not yet there
and we do have implementation challenges.

By artificially creating overload (or under-capacity) situations, the load
balancer can avoid filling up a sched domain completely. This works at the
CPU-level and NODE-level sched domains and allows the MC/SIBLING-level
domains to balance work among the cores/threads. This is only a possibility,
and we do have implementation challenges that need lots of work.

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
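As a toy illustration of the system-wide capacity knob described above, the load balancer could be made to see only a fraction of each package's nominal capacity. The names below (sched_capacity_pct, effective_capacity) are invented for illustration and do not correspond to the RFC patches.

/*
 * Conceptual sketch only: scale the capacity the load balancer believes
 * a package has.  At 75% on a quad core, roughly one core's worth of
 * work is left unscheduled on average, without naming which core must
 * stay idle, so task affinities need not be broken.
 */
static unsigned int sched_capacity_pct = 100;	/* 100 = full capacity */

static unsigned long effective_capacity(unsigned long nominal)
{
	return nominal * sched_capacity_pct / 100;
}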
On Tue, 19 May 2009, Vaidyanathan Srinivasan wrote:
> We tried similar approaches to create idle time for power savings, but
> cpu hotplug interface seem to be a clean choice. There could be
> issues with the interface, we should fix it. Is there any other
> reason why cpuhotplug is 'ugly' other than its performance (speed)?
>
> I have tried few load balancer hacks to evacuate cores but not a solid
> design yet. It has its advantages but still needs more work.
>
> http://lkml.org/lkml/2009/5/13/173

Thanks for the pointer.

I agree with Andi, please avoid the term "throttling", since it has been used
for ages to refer to processor clock throttling -- which is actually
significantly less effective at saving energy than what you are trying to do.
(Note the word "energy" here, where the word "power" is incorrectly used in
the thread above.)

"core evacuation" is a better description, I agree, though I wonder why you
don't simply call it "forced idling", since that is what you are trying to do.

> > Furthermore, we should not want anything outside of that, either the cpu
> > is there available for work, or its not -- halfway measures don't make
> > sense.
> >
> > Furthermore, we already have power aware scheduling which tries to
> > aggregate idle time on cpu/core/packages so as to maximize the idle time
> > power savings. Use it there.
>
> Power aware scheduling can optimally accumulate idle times. Framework
> to create idle time to force idle cores is good and useful for power
> savings. Other than the speed of online/offline I do not know of any
> other major issue for using cpu hotplug for this purpose.

It sounds like you want to use this technique more often than I had in mind.
You are thinking of a warm rack, which may stay warm all day long. I am
thinking of a rack which has a theoretical power draw higher than the
provisioned electrical supply. As there is a huge difference between actual
and theoretical power draw, this saves many dollars.

So what you're looking at is more frequent use than we need, and that is fine
-- as long as you exhaust P-states first -- since forcing cores to be idle
has a more severe performance impact than running at a deeper P-state. I
didn't see P-states addressed in your thread.

> > > > Besides, a hot removed cpu will do a dead loop halt, which isn't power saving
> > > > efficient. To make hot removed cpu enters deep C-state is in whish list for a
> > > > long time, but still not available. The acpi_processor_idle is a module, and
> > > > cpuidle governor potentially can't handle offline cpu.
> > >
> > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly,
> > > and I've no idea why its still there, seems like a much better candidate
> > > for your efforts than this.
>
> I agree with Peter. We need to make cpu hotplug save power first and
> later improve upon its performance.

We do have a patch to fix the offline idle loop to save power.

We can use hotplug in the short term until something better comes along.
Yes, it will break cpusets, just like Shaohua's original patch broke them --
and that will make using it inappropriate for some customers.

While I think this mechanism is important, I don't think that a large % of
customers will deploy it. I think the ones that deploy it will do so to save
money on electrical provisioning, not on pushing the limits of their air
conditioner. So I don't expect its performance requirement to be extremely
severe.
I don't think it will justify tuning the performance of cpu-hotplug, which I don't think was ever intended to be in the performance path. -Len -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
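[Editorial illustration of the short-term hotplug route discussed above,
not code from this thread: a minimal user-space sketch that forces a core
idle through the standard sysfs CPU hotplug file
/sys/devices/system/cpu/cpuN/online. The cpu number is arbitrary, error
handling is minimal, and the program needs root.]

#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", online);
    return fclose(f);
}

int main(void)
{
    /* Evacuate cpu3, e.g. while the rack is over its power budget ... */
    set_cpu_online(3, 0);
    /* ... and bring it back once the pressure is gone. */
    return set_cpu_online(3, 1);
}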
* Len Brown <lenb@kernel.org> [2009-05-27 22:34:38]:

> On Tue, 19 May 2009, Vaidyanathan Srinivasan wrote:
>
> > We tried similar approaches to create idle time for power savings, but
> > the cpu hotplug interface seems to be a clean choice. There could be
> > issues with the interface; we should fix them. Is there any other
> > reason why cpu hotplug is 'ugly' other than its performance (speed)?
> >
> > I have tried a few load balancer hacks to evacuate cores but do not
> > have a solid design yet. It has its advantages but still needs more work.
> >
> > http://lkml.org/lkml/2009/5/13/173
>
> Thanks for the pointer.
> I agree with Andi, please avoid the term "throttling", since
> it has been used for ages to refer to processor clock throttling --
> which is actually significantly less effective at saving
> energy than what you are trying to do. (Note the word "energy"
> here; the word "power" is incorrectly used in the thread above.)

Yes, you are right. "Throttling" is used to refer to hardware methods of
slowing things down, and it is less effective at saving energy: it reduces
average power but makes the workload run much longer and consume more
energy.

> "core evacuation" is a better description, I agree, though I wonder
> why you don't simply call it "forced idling", since that is what
> you are trying to do.

Yes, core evacuation is what I propose; to make the description clear,
what we are actually doing is starving or throttling tasks in software to
create idle time.

> > > Furthermore, we should not want anything outside of that: either the
> > > cpu is there, available for work, or it's not -- halfway measures
> > > don't make sense.
> > >
> > > Furthermore, we already have power aware scheduling which tries to
> > > aggregate idle time on cpu/core/packages so as to maximize the idle
> > > time power savings. Use it there.
> >
> > Power aware scheduling can optimally accumulate idle times. A framework
> > to create idle time by forcing cores idle is good and useful for power
> > savings. Other than the speed of online/offline I do not know of any
> > other major issue with using cpu hotplug for this purpose.
>
> It sounds like you want to use this technique more often
> than I had in mind. You are thinking of a warm rack, which
> may stay warm all day long. I am thinking of a rack which
> has a theoretical power draw higher than the provisioned
> electrical supply. As there is a huge difference between
> actual and theoretical power draw, this saves many dollars.

Yes, this framework can be used more often to balance average power
consumption in systems. Exploiting the margin between theoretical limits
and practical usage will definitely save money in a data center. Present
generation power capping techniques and related infrastructure are
available to exploit this margin. Core evacuation can complement this
safety-limit mechanism by providing more fine-grained control.

> So what you're looking at is more frequent use than we need,
> and that is fine -- as long as you exhaust P-states first --
> since forcing cores to be idle has a more severe performance
> impact than running at a deeper P-state.

Yes, that is the idea. After getting all cores to the lowest P-state, we
can further cut power by forcing idle. Even when not at the lowest
P-state, forced idling of complete packages may save more power compared
to running all cores in a large system at the lowest P-state. This is
generally not the case, but the framework can be more flexible and provide
more degrees of control.
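[Editorial illustration of the ordering argued for above, not code from
any patch in this thread: a user-space sketch that first pins a cpu to its
lowest frequency through the standard cpufreq sysfs files
(cpuinfo_min_freq, scaling_max_freq) and only then, as a second step,
offlines it. The cpu number and the two-step policy are assumptions for
the example; it needs root.]

#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    char minfreq[32] = "0";
    FILE *f;

    /* Step 1: exhaust P-states -- cap cpu2's max frequency at its minimum. */
    f = fopen("/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq", "r");
    if (f) {
        fscanf(f, "%31s", minfreq);
        fclose(f);
    }
    write_sysfs("/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq",
                minfreq);

    /* Step 2: if that is still not enough, force the core idle entirely. */
    return write_sysfs("/sys/devices/system/cpu/cpu2/online", "0");
}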
> I didn't see P-states addressed in your thread.

P-states can be flexibly managed using the present cpufreq governors.
Ondemand, conservative or userspace can provide the required level of
control from userspace. Idle cores will be at the lowest P-state and
C-state in the case of the ondemand governor. Independent of the P-state,
idle cores will save power from the C-state, so the choice of cpufreq
governor does not make an impact there. In the case of busy cores, end
users can decide to pick the conservative or userspace governor before
invoking core evacuation.

The main motivation for the core evacuation framework is to provide
another degree of control to exploit C-state based power savings, apart
from P-state manipulation (for which a good framework already exists).

> > > > > Besides, a hot removed cpu will do a dead loop halt, which isn't
> > > > > power-saving efficient. Making a hot-removed cpu enter a deep
> > > > > C-state has been on the wish list for a long time, but is still
> > > > > not available. The acpi_processor_idle is a module, and the
> > > > > cpuidle governor potentially can't handle an offline cpu.
> > > >
> > > > Then fix that hot-unplug idle loop. I agree that the hlt thing is
> > > > silly, and I've no idea why it's still there; seems like a much
> > > > better candidate for your efforts than this.
> >
> > I agree with Peter. We need to make cpu hotplug save power first and
> > later improve upon its performance.
>
> We do have a patch to fix the offline idle loop to save power.

This will definitely help the objective. I have looked at Venki's patch.
We certainly need that feature even outside of the current context, where
we want to hotplug faulty CPUs or set up special system configurations in
which not all cores in a package are to be used.

> We can use hotplug in the short term until something better comes along.
> Yes, it will break cpusets, just like Shaohua's original patch broke them
> -- and that will make using it inappropriate for some customers.

It would be good to have a solution that does not affect user policy;
otherwise that will discourage its adoption and usability. But the
cpu-hotplug solution will work in the short term.

> While I think this mechanism is important, I don't think that a large %
> of customers will deploy it. I think the ones that deploy it will do so
> to save money on electrical provisioning, not on pushing the limits
> of their air conditioner. So I don't expect its performance requirement
> to be extremely severe. I don't think it will justify tuning the
> performance of cpu-hotplug, which I don't think was ever intended
> to be in the performance path.

The motivation to improve cpu-hotplug is that we have begun to find more
uses for the framework, and if there are issues, this is a good time to
fix them. Opportunities to improve performance should be explored, because
we will have to hotplug multiple CPUs to have an impact. The number of
cores in a system will become quite large, and we will always have to
hotplug multiple cpus to isolate a package for hardware faults or power
saving purposes. On a system with 4096 CPUs, perhaps 128 cores may be a
package or entity that needs to go off in bulk. We will certainly not be
dealing with online/offline of one or two cpus in such a system. This is
an extreme case and a weird example, but I hope it conveys why we should
try to improve the cpu-hotplug path.

Thanks for the detailed comments and suggestions.
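[Editorial illustration of the "evacuate a whole package in bulk" case
mentioned above, not code from this thread: a sketch that walks the
standard sysfs topology attribute physical_package_id and offlines every
cpu in one package. The package number and cpu count are hypothetical; the
per-cpu loop is exactly where hotplug latency gets multiplied by the
number of cpus in the package.]

#include <stdio.h>

static int read_int(const char *path, int *val)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%d", val) != 1)
        *val = -1;
    return fclose(f);
}

/* Offline every cpu whose physical_package_id matches 'package'. */
static void offline_package(int package, int nr_cpus)
{
    char path[96];
    int cpu, pkg;
    FILE *f;

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        if (read_int(path, &pkg) || pkg != package)
            continue;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (f) {
            fprintf(f, "0\n");
            fclose(f);
        }
    }
}

int main(void)
{
    offline_package(1, 4096);   /* package 1 on a hypothetical 4096-cpu box */
    return 0;
}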
--Vaidy