diff mbox series

Is: Default governor regardless of cpuidle driver Was: [PATCH v2] cpuidle-haltpoll: vcpu hotplug support

Message ID c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com (mailing list archive)
State RFC, archived
Headers show
Series Is: Default governor regardless of cpuidle driver Was: [PATCH v2] cpuidle-haltpoll: vcpu hotplug support | expand

Commit Message

Joao Martins Aug. 29, 2019, 5:16 p.m. UTC
On 8/29/19 4:10 PM, Joao Martins wrote:
> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
> past the online ones and thus fail to register the idle driver.
> This is because cpuidle_add_sysfs() will return with -ENODEV as a
> consequence from get_cpu_device() return no device for a non-existing
> CPU.
> 
> Instead switch to cpuidle_register_driver() and manually register each
> of the present cpus through cpuhp_setup_state() callback and future
> ones that get onlined. This mimmics similar logic that intel_idle does.
> 
> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> ---

While testing the above, I found out another issue on the haltpoll series.
But I am not sure what is best suited to cpuidle framework, hence requesting
some advise if below is a reasonable solution or something else is preferred.

Essentially after haltpoll governor got introduced and regardless of the cpuidle
driver the default governor is gonna be haltpoll for a guest (given haltpoll
governor doesn't get registered for baremetal). Right now, for a KVM guest, the
idle governors have these ratings:

 * ladder            -> 10
 * teo               -> 19
 * menu              -> 20
 * haltpoll          -> 21
 * ladder + nohz=off -> 25

When a guest is booted with MWAIT and intel_idle is probed and sucessfully
registered, we will end up with a haltpoll governor being used as opposed to
'menu' (which used to be the default case). This would prevent IIUC that other
C-states get used other than poll_state (state 0) and state 1.

Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
it doesn't look reasonable to be the default? What about using haltpoll governor
as default when haltpoll idle driver registers or modloads.

My idea to achieve the above would be to decrease the rating to 9 (before the
lowest rated governor) and retain old defaults before haltpoll. Then we would
allow a cpuidle driver to define a preferred governor to switch on idle driver
registration. Naturally all of would be ignored if overidden by
cpuidle.governor=.

The diff below the scissors line is an example of that.

Thoughts?

---------------------------------- >8 --------------------------------

From: Joao Martins <joao.m.martins@oracle.com>
Subject: [PATCH] cpuidle: switch to prefered governor on registration

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/cpuidle/cpuidle-haltpoll.c   |  1 +
 drivers/cpuidle/cpuidle.h            |  1 +
 drivers/cpuidle/driver.c             | 26 ++++++++++++++++++++++++++
 drivers/cpuidle/governor.c           |  6 +++---
 drivers/cpuidle/governors/haltpoll.c |  2 +-
 include/linux/cpuidle.h              |  3 +++
 6 files changed, 35 insertions(+), 4 deletions(-)

Comments

Marcelo Tosatti Aug. 29, 2019, 5:23 p.m. UTC | #1
On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote:
> On 8/29/19 4:10 PM, Joao Martins wrote:
> > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
> > past the online ones and thus fail to register the idle driver.
> > This is because cpuidle_add_sysfs() will return with -ENODEV as a
> > consequence from get_cpu_device() return no device for a non-existing
> > CPU.
> > 
> > Instead switch to cpuidle_register_driver() and manually register each
> > of the present cpus through cpuhp_setup_state() callback and future
> > ones that get onlined. This mimmics similar logic that intel_idle does.
> > 
> > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > ---
> 
> While testing the above, I found out another issue on the haltpoll series.
> But I am not sure what is best suited to cpuidle framework, hence requesting
> some advise if below is a reasonable solution or something else is preferred.
> 
> Essentially after haltpoll governor got introduced and regardless of the cpuidle
> driver the default governor is gonna be haltpoll for a guest (given haltpoll
> governor doesn't get registered for baremetal).

Right.

> Right now, for a KVM guest, the
> idle governors have these ratings:
> 
>  * ladder            -> 10
>  * teo               -> 19
>  * menu              -> 20
>  * haltpoll          -> 21
>  * ladder + nohz=off -> 25

Yes. PowerPC KVM guests crash currently due to the use of the haltpoll
governor (have a patch in my queue to fix this, but your solution
embraces more cases).

> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
> registered, we will end up with a haltpoll governor being used as opposed to
> 'menu' (which used to be the default case). This would prevent IIUC that other
> C-states get used other than poll_state (state 0) and state 1.
> 
> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
> it doesn't look reasonable to be the default? What about using haltpoll governor
> as default when haltpoll idle driver registers or modloads.
> 
> My idea to achieve the above would be to decrease the rating to 9 (before the
> lowest rated governor) and retain old defaults before haltpoll. Then we would
> allow a cpuidle driver to define a preferred governor to switch on idle driver
> registration. Naturally all of would be ignored if overidden by
> cpuidle.governor=.
> 
> The diff below the scissors line is an example of that.
> 
> Thoughts?

Works for me. Rafael?

> 
> ---------------------------------- >8 --------------------------------
> 
> From: Joao Martins <joao.m.martins@oracle.com>
> Subject: [PATCH] cpuidle: switch to prefered governor on registration
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/cpuidle/cpuidle-haltpoll.c   |  1 +
>  drivers/cpuidle/cpuidle.h            |  1 +
>  drivers/cpuidle/driver.c             | 26 ++++++++++++++++++++++++++
>  drivers/cpuidle/governor.c           |  6 +++---
>  drivers/cpuidle/governors/haltpoll.c |  2 +-
>  include/linux/cpuidle.h              |  3 +++
>  6 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c
> index 8baade23f8d0..88a38c3c35e4 100644
> --- a/drivers/cpuidle/cpuidle-haltpoll.c
> +++ b/drivers/cpuidle/cpuidle-haltpoll.c
> @@ -33,6 +33,7 @@ static int default_enter_idle(struct cpuidle_device *dev,
> 
>  static struct cpuidle_driver haltpoll_driver = {
>  	.name = "haltpoll",
> +	.governor = "haltpoll",
>  	.owner = THIS_MODULE,
>  	.states = {
>  		{ /* entry 0 is for polling */ },
> diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
> index d6613101af92..c046f49c1920 100644
> --- a/drivers/cpuidle/cpuidle.h
> +++ b/drivers/cpuidle/cpuidle.h
> @@ -22,6 +22,7 @@ extern void cpuidle_install_idle_handler(void);
>  extern void cpuidle_uninstall_idle_handler(void);
> 
>  /* governors */
> +extern struct cpuidle_governor *cpuidle_find_governor(const char *str);
>  extern int cpuidle_switch_governor(struct cpuidle_governor *gov);
> 
>  /* sysfs */
> diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
> index dc32f34e68d9..8b8b9d89ce58 100644
> --- a/drivers/cpuidle/driver.c
> +++ b/drivers/cpuidle/driver.c
> @@ -87,6 +87,7 @@ static inline int __cpuidle_set_driver(struct cpuidle_driver *drv)
>  #else
> 
>  static struct cpuidle_driver *cpuidle_curr_driver;
> +static struct cpuidle_governor *cpuidle_default_governor = NULL;
> 
>  /**
>   * __cpuidle_get_cpu_driver - return the global cpuidle driver pointer.
> @@ -254,12 +255,25 @@ static void __cpuidle_unregister_driver(struct
> cpuidle_driver *drv)
>   */
>  int cpuidle_register_driver(struct cpuidle_driver *drv)
>  {
> +	struct cpuidle_governor *gov;
>  	int ret;
> 
>  	spin_lock(&cpuidle_driver_lock);
>  	ret = __cpuidle_register_driver(drv);
>  	spin_unlock(&cpuidle_driver_lock);
> 
> +	if (!ret && !strlen(param_governor) && drv->governor &&
> +	    (cpuidle_get_driver() == drv)) {
> +		mutex_lock(&cpuidle_lock);
> +		gov = cpuidle_find_governor(drv->governor);
> +		if (gov) {
> +			cpuidle_default_governor = cpuidle_curr_governor;
> +			if (cpuidle_switch_governor(gov) < 0)
> +				cpuidle_default_governor = NULL;
> +		}
> +		mutex_unlock(&cpuidle_lock);
> +	}
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(cpuidle_register_driver);
> @@ -274,9 +288,21 @@ EXPORT_SYMBOL_GPL(cpuidle_register_driver);
>   */
>  void cpuidle_unregister_driver(struct cpuidle_driver *drv)
>  {
> +	bool enabled = (cpuidle_get_driver() == drv);
> +
>  	spin_lock(&cpuidle_driver_lock);
>  	__cpuidle_unregister_driver(drv);
>  	spin_unlock(&cpuidle_driver_lock);
> +
> +	if (!enabled)
> +		return;
> +
> +	mutex_lock(&cpuidle_lock);
> +	if (cpuidle_default_governor) {
> +		if (!cpuidle_switch_governor(cpuidle_default_governor))
> +			cpuidle_default_governor = NULL;
> +	}
> +	mutex_unlock(&cpuidle_lock);
>  }
>  EXPORT_SYMBOL_GPL(cpuidle_unregister_driver);
> 
> diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c
> index 2e3e14192bee..e93c11dc8304 100644
> --- a/drivers/cpuidle/governor.c
> +++ b/drivers/cpuidle/governor.c
> @@ -22,12 +22,12 @@ LIST_HEAD(cpuidle_governors);
>  struct cpuidle_governor *cpuidle_curr_governor;
> 
>  /**
> - * __cpuidle_find_governor - finds a governor of the specified name
> + * cpuidle_find_governor - finds a governor of the specified name
>   * @str: the name
>   *
>   * Must be called with cpuidle_lock acquired.
>   */
> -static struct cpuidle_governor * __cpuidle_find_governor(const char *str)
> +struct cpuidle_governor * cpuidle_find_governor(const char *str)
>  {
>  	struct cpuidle_governor *gov;
> 
> @@ -87,7 +87,7 @@ int cpuidle_register_governor(struct cpuidle_governor *gov)
>  		return -ENODEV;
> 
>  	mutex_lock(&cpuidle_lock);
> -	if (__cpuidle_find_governor(gov->name) == NULL) {
> +	if (cpuidle_find_governor(gov->name) == NULL) {
>  		ret = 0;
>  		list_add_tail(&gov->governor_list, &cpuidle_governors);
>  		if (!cpuidle_curr_governor ||
> diff --git a/drivers/cpuidle/governors/haltpoll.c
> b/drivers/cpuidle/governors/haltpoll.c
> index 797477bda486..7a703d2e0064 100644
> --- a/drivers/cpuidle/governors/haltpoll.c
> +++ b/drivers/cpuidle/governors/haltpoll.c
> @@ -133,7 +133,7 @@ static int haltpoll_enable_device(struct cpuidle_driver *drv,
> 
>  static struct cpuidle_governor haltpoll_governor = {
>  	.name =			"haltpoll",
> -	.rating =		21,
> +	.rating =		9,
>  	.enable =		haltpoll_enable_device,
>  	.select =		haltpoll_select,
>  	.reflect =		haltpoll_reflect,
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 1a9f54eb3aa1..2dc4c6b19c25 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -121,6 +121,9 @@ struct cpuidle_driver {
> 
>  	/* the driver handles the cpus in cpumask */
>  	struct cpumask		*cpumask;
> +
> +	/* preferred governor to switch at register time */
> +	const char		*governor;
>  };
> 
>  #ifdef CONFIG_CPU_IDLE
> -- 
> 2.17.1
Daniel Lezcano Aug. 29, 2019, 5:42 p.m. UTC | #2
On 29/08/2019 19:16, Joao Martins wrote:
> On 8/29/19 4:10 PM, Joao Martins wrote:
>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>> past the online ones and thus fail to register the idle driver.
>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>> consequence from get_cpu_device() return no device for a non-existing
>> CPU.
>>
>> Instead switch to cpuidle_register_driver() and manually register each
>> of the present cpus through cpuhp_setup_state() callback and future
>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>
>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> ---
> 
> While testing the above, I found out another issue on the haltpoll series.
> But I am not sure what is best suited to cpuidle framework, hence requesting
> some advise if below is a reasonable solution or something else is preferred.
> 
> Essentially after haltpoll governor got introduced and regardless of the cpuidle
> driver the default governor is gonna be haltpoll for a guest (given haltpoll
> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
> idle governors have these ratings:
> 
>  * ladder            -> 10
>  * teo               -> 19
>  * menu              -> 20
>  * haltpoll          -> 21
>  * ladder + nohz=off -> 25
> 
> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
> registered, we will end up with a haltpoll governor being used as opposed to
> 'menu' (which used to be the default case). This would prevent IIUC that other
> C-states get used other than poll_state (state 0) and state 1.
> 
> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
> it doesn't look reasonable to be the default? What about using haltpoll governor
> as default when haltpoll idle driver registers or modload.

Are the guest and host kernel the same? IOW compiled with the same
kernel config?


> My idea to achieve the above would be to decrease the rating to 9 (before the
> lowest rated governor) and retain old defaults before haltpoll. Then we would
> allow a cpuidle driver to define a preferred governor to switch on idle driver
> registration. Naturally all of would be ignored if overidden by
> cpuidle.governor=.
>
Joao Martins Aug. 29, 2019, 6:07 p.m. UTC | #3
On 8/29/19 6:42 PM, Daniel Lezcano wrote:
> On 29/08/2019 19:16, Joao Martins wrote:
>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>> past the online ones and thus fail to register the idle driver.
>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>> consequence from get_cpu_device() return no device for a non-existing
>>> CPU.
>>>
>>> Instead switch to cpuidle_register_driver() and manually register each
>>> of the present cpus through cpuhp_setup_state() callback and future
>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>
>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>> ---
>>
>> While testing the above, I found out another issue on the haltpoll series.
>> But I am not sure what is best suited to cpuidle framework, hence requesting
>> some advise if below is a reasonable solution or something else is preferred.
>>
>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
>> idle governors have these ratings:
>>
>>  * ladder            -> 10
>>  * teo               -> 19
>>  * menu              -> 20
>>  * haltpoll          -> 21
>>  * ladder + nohz=off -> 25
>>
>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>> registered, we will end up with a haltpoll governor being used as opposed to
>> 'menu' (which used to be the default case). This would prevent IIUC that other
>> C-states get used other than poll_state (state 0) and state 1.
>>
>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>> it doesn't look reasonable to be the default? What about using haltpoll governor
>> as default when haltpoll idle driver registers or modload.
> 
> Are the guest and host kernel the same? IOW compiled with the same
> kernel config?
> 
You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE):

	CONFIG_CPU_IDLE_GOV_HALTPOLL=y

And *if you are a KVM guest* it will be the default (unless using nohz=off in
which case ladder gets the highest rating -- see the listing right above).

Host will just behave differently because the haltpoll governor is checking if
it is running as kvm guest, and only registering in that case.

> 
>> My idea to achieve the above would be to decrease the rating to 9 (before the
>> lowest rated governor) and retain old defaults before haltpoll. Then we would
>> allow a cpuidle driver to define a preferred governor to switch on idle driver
>> registration. Naturally all of would be ignored if overidden by
>> cpuidle.governor=.
>>
> 
> 
> 
>
Daniel Lezcano Aug. 29, 2019, 6:28 p.m. UTC | #4
On 29/08/2019 20:07, Joao Martins wrote:
> On 8/29/19 6:42 PM, Daniel Lezcano wrote:
>> On 29/08/2019 19:16, Joao Martins wrote:
>>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>>> past the online ones and thus fail to register the idle driver.
>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>>> consequence from get_cpu_device() return no device for a non-existing
>>>> CPU.
>>>>
>>>> Instead switch to cpuidle_register_driver() and manually register each
>>>> of the present cpus through cpuhp_setup_state() callback and future
>>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>>
>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>> ---
>>>
>>> While testing the above, I found out another issue on the haltpoll series.
>>> But I am not sure what is best suited to cpuidle framework, hence requesting
>>> some advise if below is a reasonable solution or something else is preferred.
>>>
>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
>>> idle governors have these ratings:
>>>
>>>  * ladder            -> 10
>>>  * teo               -> 19
>>>  * menu              -> 20
>>>  * haltpoll          -> 21
>>>  * ladder + nohz=off -> 25
>>>
>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>>> registered, we will end up with a haltpoll governor being used as opposed to
>>> 'menu' (which used to be the default case). This would prevent IIUC that other
>>> C-states get used other than poll_state (state 0) and state 1.
>>>
>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>>> it doesn't look reasonable to be the default? What about using haltpoll governor
>>> as default when haltpoll idle driver registers or modload.
>>
>> Are the guest and host kernel the same? IOW compiled with the same
>> kernel config?
>>
> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE):
> 
> 	CONFIG_CPU_IDLE_GOV_HALTPOLL=y
> 
> And *if you are a KVM guest* it will be the default (unless using nohz=off in
> which case ladder gets the highest rating -- see the listing right above).
> 
> Host will just behave differently because the haltpoll governor is checking if
> it is running as kvm guest, and only registering in that case.

I understood the problem. Actually my question was about if the kernels
are compiled for host and guest, and can be run indifferently. In this
case a runtime detection must be done as you propose, otherwise that can
be done at config time. I pretty sure it is the former but before
thinking about the runtime side, I wanted to double check.


>>> My idea to achieve the above would be to decrease the rating to 9 (before the
>>> lowest rated governor) and retain old defaults before haltpoll. Then we would
>>> allow a cpuidle driver to define a preferred governor to switch on idle driver
>>> registration. Naturally all of would be ignored if overidden by
>>> cpuidle.governor=.
Joao Martins Aug. 29, 2019, 7:11 p.m. UTC | #5
On 8/29/19 7:28 PM, Daniel Lezcano wrote:
> On 29/08/2019 20:07, Joao Martins wrote:
>> On 8/29/19 6:42 PM, Daniel Lezcano wrote:
>>> On 29/08/2019 19:16, Joao Martins wrote:
>>>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>>>> past the online ones and thus fail to register the idle driver.
>>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>>>> consequence from get_cpu_device() return no device for a non-existing
>>>>> CPU.
>>>>>
>>>>> Instead switch to cpuidle_register_driver() and manually register each
>>>>> of the present cpus through cpuhp_setup_state() callback and future
>>>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>>>
>>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>>> ---
>>>>
>>>> While testing the above, I found out another issue on the haltpoll series.
>>>> But I am not sure what is best suited to cpuidle framework, hence requesting
>>>> some advise if below is a reasonable solution or something else is preferred.
>>>>
>>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
>>>> idle governors have these ratings:
>>>>
>>>>  * ladder            -> 10
>>>>  * teo               -> 19
>>>>  * menu              -> 20
>>>>  * haltpoll          -> 21
>>>>  * ladder + nohz=off -> 25
>>>>
>>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>>>> registered, we will end up with a haltpoll governor being used as opposed to
>>>> 'menu' (which used to be the default case). This would prevent IIUC that other
>>>> C-states get used other than poll_state (state 0) and state 1.
>>>>
>>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>>>> it doesn't look reasonable to be the default? What about using haltpoll governor
>>>> as default when haltpoll idle driver registers or modload.
>>>
>>> Are the guest and host kernel the same? IOW compiled with the same
>>> kernel config?
>>>
>> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE):
>>
>> 	CONFIG_CPU_IDLE_GOV_HALTPOLL=y
>>
>> And *if you are a KVM guest* it will be the default (unless using nohz=off in
>> which case ladder gets the highest rating -- see the listing right above).
>>
>> Host will just behave differently because the haltpoll governor is checking if
>> it is running as kvm guest, and only registering in that case.
> 
> I understood the problem. Actually my question was about if the kernels
> are compiled for host and guest, and can be run indifferently. 

/nods Correct.

> In this
> case a runtime detection must be done as you propose, otherwise that can
> be done at config time. I pretty sure it is the former but before
> thinking about the runtime side, I wanted to double check.
> 
Hmm, but even with separate kernels/configs for guest and host I think we would
still have the same issue.

What I was trying to convey is that even when running with a config solely for
KVM guests (that is different than baremetal) you can have today various ways of
idling. An Intel x86 kvm guest can have no idle driver (but arch-specific),
intel_idle (like baremetal config) and haltpoll. There are usecases for these
three, and makes sense to consolidate all.

Say you wanted to have a kvm specific config, you would still see the same
problem if you happen to compile intel_idle together with haltpoll
driver+governor. Creating two separate configs here, with and without haltpoll
for VMs doesn't sound effective for distros. Perhaps decreasing the rating of
haltpoll governor, but while a short term fix it wouldn't give much sensible
defaults without the one-off runtime switch.

Unless ofc I am missing something.

> 
>>>> My idea to achieve the above would be to decrease the rating to 9 (before the
>>>> lowest rated governor) and retain old defaults before haltpoll. Then we would
>>>> allow a cpuidle driver to define a preferred governor to switch on idle driver
>>>> registration. Naturally all of would be ignored if overidden by
>>>> cpuidle.governor=.
> 
>
Daniel Lezcano Aug. 29, 2019, 8:22 p.m. UTC | #6
On 29/08/2019 21:11, Joao Martins wrote:
> On 8/29/19 7:28 PM, Daniel Lezcano wrote:
>> On 29/08/2019 20:07, Joao Martins wrote:
>>> On 8/29/19 6:42 PM, Daniel Lezcano wrote:
>>>> On 29/08/2019 19:16, Joao Martins wrote:
>>>>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>>>>> past the online ones and thus fail to register the idle driver.
>>>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>>>>> consequence from get_cpu_device() return no device for a non-existing
>>>>>> CPU.
>>>>>>
>>>>>> Instead switch to cpuidle_register_driver() and manually register each
>>>>>> of the present cpus through cpuhp_setup_state() callback and future
>>>>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>>>>
>>>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>>>> ---
>>>>>
>>>>> While testing the above, I found out another issue on the haltpoll series.
>>>>> But I am not sure what is best suited to cpuidle framework, hence requesting
>>>>> some advise if below is a reasonable solution or something else is preferred.
>>>>>
>>>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>>>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>>>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
>>>>> idle governors have these ratings:
>>>>>
>>>>>  * ladder            -> 10
>>>>>  * teo               -> 19
>>>>>  * menu              -> 20
>>>>>  * haltpoll          -> 21
>>>>>  * ladder + nohz=off -> 25
>>>>>
>>>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>>>>> registered, we will end up with a haltpoll governor being used as opposed to
>>>>> 'menu' (which used to be the default case). This would prevent IIUC that other
>>>>> C-states get used other than poll_state (state 0) and state 1.
>>>>>
>>>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>>>>> it doesn't look reasonable to be the default? What about using haltpoll governor
>>>>> as default when haltpoll idle driver registers or modload.
>>>>
>>>> Are the guest and host kernel the same? IOW compiled with the same
>>>> kernel config?
>>>>
>>> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE):
>>>
>>> 	CONFIG_CPU_IDLE_GOV_HALTPOLL=y
>>>
>>> And *if you are a KVM guest* it will be the default (unless using nohz=off in
>>> which case ladder gets the highest rating -- see the listing right above).
>>>
>>> Host will just behave differently because the haltpoll governor is checking if
>>> it is running as kvm guest, and only registering in that case.
>>
>> I understood the problem. Actually my question was about if the kernels
>> are compiled for host and guest, and can be run indifferently. 
> 
> /nods Correct.
> 
>> In this
>> case a runtime detection must be done as you propose, otherwise that can
>> be done at config time. I pretty sure it is the former but before
>> thinking about the runtime side, I wanted to double check.
>>
> Hmm, but even with separate kernels/configs for guest and host I think we would
> still have the same issue.
> 
> What I was trying to convey is that even when running with a config solely for
> KVM guests (that is different than baremetal) you can have today various ways of
> idling. An Intel x86 kvm guest can have no idle driver (but arch-specific),
> intel_idle (like baremetal config) and haltpoll. There are usecases for these
> three, and makes sense to consolidate all.
> 
> Say you wanted to have a kvm specific config, you would still see the same
> problem if you happen to compile intel_idle together with haltpoll
> driver+governor. 

Can a guest work with an intel_idle driver?

> Creating two separate configs here, with and without haltpoll
> for VMs doesn't sound effective for distros.

Agree

> Perhaps decreasing the rating of
> haltpoll governor, but while a short term fix it wouldn't give much sensible
> defaults without the one-off runtime switch.
Joao Martins Aug. 29, 2019, 9:12 p.m. UTC | #7
On 8/29/19 9:22 PM, Daniel Lezcano wrote:
> On 29/08/2019 21:11, Joao Martins wrote:
>> On 8/29/19 7:28 PM, Daniel Lezcano wrote:
>>> On 29/08/2019 20:07, Joao Martins wrote:
>>>> On 8/29/19 6:42 PM, Daniel Lezcano wrote:
>>>>> On 29/08/2019 19:16, Joao Martins wrote:
>>>>>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>>>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>>>>>> past the online ones and thus fail to register the idle driver.
>>>>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>>>>>> consequence from get_cpu_device() return no device for a non-existing
>>>>>>> CPU.
>>>>>>>
>>>>>>> Instead switch to cpuidle_register_driver() and manually register each
>>>>>>> of the present cpus through cpuhp_setup_state() callback and future
>>>>>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>>>>>
>>>>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>>>>> ---
>>>>>>
>>>>>> While testing the above, I found out another issue on the haltpoll series.
>>>>>> But I am not sure what is best suited to cpuidle framework, hence requesting
>>>>>> some advise if below is a reasonable solution or something else is preferred.
>>>>>>
>>>>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>>>>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>>>>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the
>>>>>> idle governors have these ratings:
>>>>>>
>>>>>>  * ladder            -> 10
>>>>>>  * teo               -> 19
>>>>>>  * menu              -> 20
>>>>>>  * haltpoll          -> 21
>>>>>>  * ladder + nohz=off -> 25
>>>>>>
>>>>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>>>>>> registered, we will end up with a haltpoll governor being used as opposed to
>>>>>> 'menu' (which used to be the default case). This would prevent IIUC that other
>>>>>> C-states get used other than poll_state (state 0) and state 1.
>>>>>>
>>>>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>>>>>> it doesn't look reasonable to be the default? What about using haltpoll governor
>>>>>> as default when haltpoll idle driver registers or modload.
>>>>>
>>>>> Are the guest and host kernel the same? IOW compiled with the same
>>>>> kernel config?
>>>>>
>>>> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE):
>>>>
>>>> 	CONFIG_CPU_IDLE_GOV_HALTPOLL=y
>>>>
>>>> And *if you are a KVM guest* it will be the default (unless using nohz=off in
>>>> which case ladder gets the highest rating -- see the listing right above).
>>>>
>>>> Host will just behave differently because the haltpoll governor is checking if
>>>> it is running as kvm guest, and only registering in that case.
>>>
>>> I understood the problem. Actually my question was about if the kernels
>>> are compiled for host and guest, and can be run indifferently. 
>>
>> /nods Correct.
>>
>>> In this
>>> case a runtime detection must be done as you propose, otherwise that can
>>> be done at config time. I pretty sure it is the former but before
>>> thinking about the runtime side, I wanted to double check.
>>>
>> Hmm, but even with separate kernels/configs for guest and host I think we would
>> still have the same issue.
>>
>> What I was trying to convey is that even when running with a config solely for
>> KVM guests (that is different than baremetal) you can have today various ways of
>> idling. An Intel x86 kvm guest can have no idle driver (but arch-specific),
>> intel_idle (like baremetal config) and haltpoll. There are usecases for these
>> three, and makes sense to consolidate all.
>>
>> Say you wanted to have a kvm specific config, you would still see the same
>> problem if you happen to compile intel_idle together with haltpoll
>> driver+governor. 
> 
> Can a guest work with an intel_idle driver?
> 
Yes.

If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc,
assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel
version as host (v4.17+).

>> Creating two separate configs here, with and without haltpoll
>> for VMs doesn't sound effective for distros.
> 
> Agree
> 
>> Perhaps decreasing the rating of
>> haltpoll governor, but while a short term fix it wouldn't give much sensible
>> defaults without the one-off runtime switch.
>
Daniel Lezcano Aug. 29, 2019, 9:51 p.m. UTC | #8
On 29/08/2019 23:12, Joao Martins wrote:

[ ... ]

>>> Say you wanted to have a kvm specific config, you would still see the same
>>> problem if you happen to compile intel_idle together with haltpoll
>>> driver+governor. 
>>
>> Can a guest work with an intel_idle driver?
>>
> Yes.
> 
> If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc,
> assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel
> version as host (v4.17+).

Ok, thanks for the clarification.

>>> Creating two separate configs here, with and without haltpoll
>>> for VMs doesn't sound effective for distros.
>>
>> Agree
>>
>>> Perhaps decreasing the rating of
>>> haltpoll governor, but while a short term fix it wouldn't give much sensible
>>> defaults without the one-off runtime switch.

The rating has little meaning because each governor fits a specific
situation (server, desktop, etc...) and it would probably make sense to
remove it and add a default governor in the config file like the cpufreq.

May be I missed the point from some previous discussion but IMHO the
problem you are facing is coming from the design: there is no need to
create a halt governor but move the code inside the cpuidle-halt driver
instead and ignore the state asked by the governor and return the state
the driver entered.
Joao Martins Aug. 30, 2019, 11:07 a.m. UTC | #9
On 8/29/19 10:51 PM, Daniel Lezcano wrote:
> On 29/08/2019 23:12, Joao Martins wrote:
> 
> [ ... ]
> 
>>>> Say you wanted to have a kvm specific config, you would still see the same
>>>> problem if you happen to compile intel_idle together with haltpoll
>>>> driver+governor. 
>>>
>>> Can a guest work with an intel_idle driver?
>>>
>> Yes.
>>
>> If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc,
>> assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel
>> version as host (v4.17+).
> 
> Ok, thanks for the clarification.
> 
>>>> Creating two separate configs here, with and without haltpoll
>>>> for VMs doesn't sound effective for distros.
>>>
>>> Agree
>>>
>>>> Perhaps decreasing the rating of
>>>> haltpoll governor, but while a short term fix it wouldn't give much sensible
>>>> defaults without the one-off runtime switch.
> 
> The rating has little meaning because each governor fits a specific
> situation (server, desktop, etc...) and it would probably make sense to
> remove it and add a default governor in the config file like the cpufreq.
> 
ICYM, I had attached a patch in the first message of this thread [0] right below
the scissors mark. It's not based on config file, but it's the same thing you're
saying (IIUC) but at runtime and thus allowing a driver to state a 'preferred'
governor to switch to at idle registration -- let me know if you think that
looks a sensible approach. Note that the intent of that patch follows the
thinking of leaving all defaults as before haltpoll governor was introduced, but
once user modloads/uses cpuidle-haltpoll this governor then gets switched on.

[0] https://lore.kernel.org/kvm/c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com/

I would think a config-based preference on a governor would be good *if* one
could actually switch idle governors at runtime like you can with cpufreq -- in
case userspace wants something else other than the default. Right now we can't
do that unless you toggle 'cpuidle_sysfs_switch', or picking one at boot with
'cpuidle.governor='.

> May be I missed the point from some previous discussion but IMHO the
> problem you are facing is coming from the design: there is no need to
> create a halt governor but move the code inside the cpuidle-halt driver
> instead and ignore the state asked by the governor and return the state
> the driver entered.
> 
Marcello's original patch series (first 3 revisions to be exact) actually had
everything in the idle driver, but after some revisions (v4+) Rafael asked him
to split the logic into a governor and unify it with poll state[1].

[1]
https://lore.kernel.org/kvm/CAJZ5v0gPbSXB3r71XaT-4Q7LsiFO_UVymBwOmU8J1W5+COk_1g@mail.gmail.com/

	Joao
Rafael J. Wysocki Sept. 2, 2019, 9:55 p.m. UTC | #10
On Thu, Aug 29, 2019 at 7:24 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote:
> > On 8/29/19 4:10 PM, Joao Martins wrote:
> > > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
> > > past the online ones and thus fail to register the idle driver.
> > > This is because cpuidle_add_sysfs() will return with -ENODEV as a
> > > consequence from get_cpu_device() return no device for a non-existing
> > > CPU.
> > >
> > > Instead switch to cpuidle_register_driver() and manually register each
> > > of the present cpus through cpuhp_setup_state() callback and future
> > > ones that get onlined. This mimmics similar logic that intel_idle does.
> > >
> > > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
> > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> > > ---
> >
> > While testing the above, I found out another issue on the haltpoll series.
> > But I am not sure what is best suited to cpuidle framework, hence requesting
> > some advise if below is a reasonable solution or something else is preferred.
> >
> > Essentially after haltpoll governor got introduced and regardless of the cpuidle
> > driver the default governor is gonna be haltpoll for a guest (given haltpoll
> > governor doesn't get registered for baremetal).
>
> Right.
>
> > Right now, for a KVM guest, the
> > idle governors have these ratings:
> >
> >  * ladder            -> 10
> >  * teo               -> 19
> >  * menu              -> 20
> >  * haltpoll          -> 21
> >  * ladder + nohz=off -> 25
>
> Yes. PowerPC KVM guests crash currently due to the use of the haltpoll
> governor (have a patch in my queue to fix this, but your solution
> embraces more cases).
>
> > When a guest is booted with MWAIT and intel_idle is probed and sucessfully
> > registered, we will end up with a haltpoll governor being used as opposed to
> > 'menu' (which used to be the default case). This would prevent IIUC that other
> > C-states get used other than poll_state (state 0) and state 1.
> >
> > Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
> > it doesn't look reasonable to be the default? What about using haltpoll governor
> > as default when haltpoll idle driver registers or modloads.
> >
> > My idea to achieve the above would be to decrease the rating to 9 (before the
> > lowest rated governor) and retain old defaults before haltpoll. Then we would
> > allow a cpuidle driver to define a preferred governor to switch on idle driver
> > registration. Naturally all of would be ignored if overidden by
> > cpuidle.governor=.
> >
> > The diff below the scissors line is an example of that.
> >
> > Thoughts?
>
> Works for me. Rafael?

It works for me too, basically, except that I would rename
cpuidle_default_governor in the patch to cpuidle_prev_governor.
Rafael J. Wysocki Sept. 2, 2019, 9:58 p.m. UTC | #11
On Fri, Aug 30, 2019 at 1:09 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 8/29/19 10:51 PM, Daniel Lezcano wrote:
> > On 29/08/2019 23:12, Joao Martins wrote:
> >
> > [ ... ]
> >
> >>>> Say you wanted to have a kvm specific config, you would still see the same
> >>>> problem if you happen to compile intel_idle together with haltpoll
> >>>> driver+governor.
> >>>
> >>> Can a guest work with an intel_idle driver?
> >>>
> >> Yes.
> >>
> >> If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc,
> >> assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel
> >> version as host (v4.17+).
> >
> > Ok, thanks for the clarification.
> >
> >>>> Creating two separate configs here, with and without haltpoll
> >>>> for VMs doesn't sound effective for distros.
> >>>
> >>> Agree
> >>>
> >>>> Perhaps decreasing the rating of
> >>>> haltpoll governor, but while a short term fix it wouldn't give much sensible
> >>>> defaults without the one-off runtime switch.
> >
> > The rating has little meaning because each governor fits a specific
> > situation (server, desktop, etc...) and it would probably make sense to
> > remove it and add a default governor in the config file like the cpufreq.
> >
> ICYM, I had attached a patch in the first message of this thread [0] right below
> the scissors mark. It's not based on config file, but it's the same thing you're
> saying (IIUC) but at runtime and thus allowing a driver to state a 'preferred'
> governor to switch to at idle registration -- let me know if you think that
> looks a sensible approach. Note that the intent of that patch follows the
> thinking of leaving all defaults as before haltpoll governor was introduced, but
> once user modloads/uses cpuidle-haltpoll this governor then gets switched on.
>
> [0] https://lore.kernel.org/kvm/c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com/
>
> I would think a config-based preference on a governor would be good *if* one
> could actually switch idle governors at runtime like you can with cpufreq -- in
> case userspace wants something else other than the default. Right now we can't
> do that unless you toggle 'cpuidle_sysfs_switch', or picking one at boot with
> 'cpuidle.governor='.

FWIW, I've been thinking about getting rid of the cpuidle_sysfs_switch
command line option and always allowing user space to switch cpuidle
governors at run time.  At least I see no reason why that would not
work ATM.
Joao Martins Sept. 3, 2019, 10:13 a.m. UTC | #12
On 9/2/19 10:55 PM, Rafael J. Wysocki wrote:
> On Thu, Aug 29, 2019 at 7:24 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote:
>>> On 8/29/19 4:10 PM, Joao Martins wrote:
>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
>>>> past the online ones and thus fail to register the idle driver.
>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a
>>>> consequence from get_cpu_device() return no device for a non-existing
>>>> CPU.
>>>>
>>>> Instead switch to cpuidle_register_driver() and manually register each
>>>> of the present cpus through cpuhp_setup_state() callback and future
>>>> ones that get onlined. This mimmics similar logic that intel_idle does.
>>>>
>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>>>> ---
>>>
>>> While testing the above, I found out another issue on the haltpoll series.
>>> But I am not sure what is best suited to cpuidle framework, hence requesting
>>> some advise if below is a reasonable solution or something else is preferred.
>>>
>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle
>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll
>>> governor doesn't get registered for baremetal).
>>
>> Right.
>>
>>> Right now, for a KVM guest, the
>>> idle governors have these ratings:
>>>
>>>  * ladder            -> 10
>>>  * teo               -> 19
>>>  * menu              -> 20
>>>  * haltpoll          -> 21
>>>  * ladder + nohz=off -> 25
>>
>> Yes. PowerPC KVM guests crash currently due to the use of the haltpoll
>> governor (have a patch in my queue to fix this, but your solution
>> embraces more cases).
>>
>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully
>>> registered, we will end up with a haltpoll governor being used as opposed to
>>> 'menu' (which used to be the default case). This would prevent IIUC that other
>>> C-states get used other than poll_state (state 0) and state 1.
>>>
>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll
>>> it doesn't look reasonable to be the default? What about using haltpoll governor
>>> as default when haltpoll idle driver registers or modloads.
>>>
>>> My idea to achieve the above would be to decrease the rating to 9 (before the
>>> lowest rated governor) and retain old defaults before haltpoll. Then we would
>>> allow a cpuidle driver to define a preferred governor to switch on idle driver
>>> registration. Naturally all of would be ignored if overidden by
>>> cpuidle.governor=.
>>>
>>> The diff below the scissors line is an example of that.
>>>
>>> Thoughts?
>>
>> Works for me. Rafael?
> 
> It works for me too, basically, except that I would rename
> cpuidle_default_governor in the patch to cpuidle_prev_governor.
> 

Great! I'll send over a series with this then (splitted accordingly).

Also, In the course of hotplug/hotunplug testing, I found two small issues with
modload/modunload -- regardless of the hotplug patch. So I am gonna add that to
the series too.

	Joao
diff mbox series

Patch

diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c
index 8baade23f8d0..88a38c3c35e4 100644
--- a/drivers/cpuidle/cpuidle-haltpoll.c
+++ b/drivers/cpuidle/cpuidle-haltpoll.c
@@ -33,6 +33,7 @@  static int default_enter_idle(struct cpuidle_device *dev,

 static struct cpuidle_driver haltpoll_driver = {
 	.name = "haltpoll",
+	.governor = "haltpoll",
 	.owner = THIS_MODULE,
 	.states = {
 		{ /* entry 0 is for polling */ },
diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h
index d6613101af92..c046f49c1920 100644
--- a/drivers/cpuidle/cpuidle.h
+++ b/drivers/cpuidle/cpuidle.h
@@ -22,6 +22,7 @@  extern void cpuidle_install_idle_handler(void);
 extern void cpuidle_uninstall_idle_handler(void);

 /* governors */
+extern struct cpuidle_governor *cpuidle_find_governor(const char *str);
 extern int cpuidle_switch_governor(struct cpuidle_governor *gov);

 /* sysfs */
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index dc32f34e68d9..8b8b9d89ce58 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -87,6 +87,7 @@  static inline int __cpuidle_set_driver(struct cpuidle_driver *drv)
 #else

 static struct cpuidle_driver *cpuidle_curr_driver;
+static struct cpuidle_governor *cpuidle_default_governor = NULL;

 /**
  * __cpuidle_get_cpu_driver - return the global cpuidle driver pointer.
@@ -254,12 +255,25 @@  static void __cpuidle_unregister_driver(struct
cpuidle_driver *drv)
  */
 int cpuidle_register_driver(struct cpuidle_driver *drv)
 {
+	struct cpuidle_governor *gov;
 	int ret;

 	spin_lock(&cpuidle_driver_lock);
 	ret = __cpuidle_register_driver(drv);
 	spin_unlock(&cpuidle_driver_lock);

+	if (!ret && !strlen(param_governor) && drv->governor &&
+	    (cpuidle_get_driver() == drv)) {
+		mutex_lock(&cpuidle_lock);
+		gov = cpuidle_find_governor(drv->governor);
+		if (gov) {
+			cpuidle_default_governor = cpuidle_curr_governor;
+			if (cpuidle_switch_governor(gov) < 0)
+				cpuidle_default_governor = NULL;
+		}
+		mutex_unlock(&cpuidle_lock);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(cpuidle_register_driver);
@@ -274,9 +288,21 @@  EXPORT_SYMBOL_GPL(cpuidle_register_driver);
  */
 void cpuidle_unregister_driver(struct cpuidle_driver *drv)
 {
+	bool enabled = (cpuidle_get_driver() == drv);
+
 	spin_lock(&cpuidle_driver_lock);
 	__cpuidle_unregister_driver(drv);
 	spin_unlock(&cpuidle_driver_lock);
+
+	if (!enabled)
+		return;
+
+	mutex_lock(&cpuidle_lock);
+	if (cpuidle_default_governor) {
+		if (!cpuidle_switch_governor(cpuidle_default_governor))
+			cpuidle_default_governor = NULL;
+	}
+	mutex_unlock(&cpuidle_lock);
 }
 EXPORT_SYMBOL_GPL(cpuidle_unregister_driver);

diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c
index 2e3e14192bee..e93c11dc8304 100644
--- a/drivers/cpuidle/governor.c
+++ b/drivers/cpuidle/governor.c
@@ -22,12 +22,12 @@  LIST_HEAD(cpuidle_governors);
 struct cpuidle_governor *cpuidle_curr_governor;

 /**
- * __cpuidle_find_governor - finds a governor of the specified name
+ * cpuidle_find_governor - finds a governor of the specified name
  * @str: the name
  *
  * Must be called with cpuidle_lock acquired.
  */
-static struct cpuidle_governor * __cpuidle_find_governor(const char *str)
+struct cpuidle_governor * cpuidle_find_governor(const char *str)
 {
 	struct cpuidle_governor *gov;

@@ -87,7 +87,7 @@  int cpuidle_register_governor(struct cpuidle_governor *gov)
 		return -ENODEV;

 	mutex_lock(&cpuidle_lock);
-	if (__cpuidle_find_governor(gov->name) == NULL) {
+	if (cpuidle_find_governor(gov->name) == NULL) {
 		ret = 0;
 		list_add_tail(&gov->governor_list, &cpuidle_governors);
 		if (!cpuidle_curr_governor ||
diff --git a/drivers/cpuidle/governors/haltpoll.c
b/drivers/cpuidle/governors/haltpoll.c
index 797477bda486..7a703d2e0064 100644
--- a/drivers/cpuidle/governors/haltpoll.c
+++ b/drivers/cpuidle/governors/haltpoll.c
@@ -133,7 +133,7 @@  static int haltpoll_enable_device(struct cpuidle_driver *drv,

 static struct cpuidle_governor haltpoll_governor = {
 	.name =			"haltpoll",
-	.rating =		21,
+	.rating =		9,
 	.enable =		haltpoll_enable_device,
 	.select =		haltpoll_select,
 	.reflect =		haltpoll_reflect,
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 1a9f54eb3aa1..2dc4c6b19c25 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -121,6 +121,9 @@  struct cpuidle_driver {

 	/* the driver handles the cpus in cpumask */
 	struct cpumask		*cpumask;
+
+	/* preferred governor to switch at register time */
+	const char		*governor;
 };

 #ifdef CONFIG_CPU_IDLE