Message ID | c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
Series | Is: Default governor regardless of cpuidle driver Was: [PATCH v2] cpuidle-haltpoll: vcpu hotplug support | expand |
On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote: > On 8/29/19 4:10 PM, Joao Martins wrote: > > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus > > past the online ones and thus fail to register the idle driver. > > This is because cpuidle_add_sysfs() will return with -ENODEV as a > > consequence from get_cpu_device() return no device for a non-existing > > CPU. > > > > Instead switch to cpuidle_register_driver() and manually register each > > of the present cpus through cpuhp_setup_state() callback and future > > ones that get onlined. This mimmics similar logic that intel_idle does. > > > > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > > --- > > While testing the above, I found out another issue on the haltpoll series. > But I am not sure what is best suited to cpuidle framework, hence requesting > some advise if below is a reasonable solution or something else is preferred. > > Essentially after haltpoll governor got introduced and regardless of the cpuidle > driver the default governor is gonna be haltpoll for a guest (given haltpoll > governor doesn't get registered for baremetal). Right. > Right now, for a KVM guest, the > idle governors have these ratings: > > * ladder -> 10 > * teo -> 19 > * menu -> 20 > * haltpoll -> 21 > * ladder + nohz=off -> 25 Yes. PowerPC KVM guests crash currently due to the use of the haltpoll governor (have a patch in my queue to fix this, but your solution embraces more cases). > When a guest is booted with MWAIT and intel_idle is probed and sucessfully > registered, we will end up with a haltpoll governor being used as opposed to > 'menu' (which used to be the default case). This would prevent IIUC that other > C-states get used other than poll_state (state 0) and state 1. > > Given that haltpoll governor is largely only useful with a cpuidle-haltpoll > it doesn't look reasonable to be the default? What about using haltpoll governor > as default when haltpoll idle driver registers or modloads. > > My idea to achieve the above would be to decrease the rating to 9 (before the > lowest rated governor) and retain old defaults before haltpoll. Then we would > allow a cpuidle driver to define a preferred governor to switch on idle driver > registration. Naturally all of would be ignored if overidden by > cpuidle.governor=. > > The diff below the scissors line is an example of that. > > Thoughts? Works for me. Rafael? > > ---------------------------------- >8 -------------------------------- > > From: Joao Martins <joao.m.martins@oracle.com> > Subject: [PATCH] cpuidle: switch to prefered governor on registration > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > --- > drivers/cpuidle/cpuidle-haltpoll.c | 1 + > drivers/cpuidle/cpuidle.h | 1 + > drivers/cpuidle/driver.c | 26 ++++++++++++++++++++++++++ > drivers/cpuidle/governor.c | 6 +++--- > drivers/cpuidle/governors/haltpoll.c | 2 +- > include/linux/cpuidle.h | 3 +++ > 6 files changed, 35 insertions(+), 4 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c > index 8baade23f8d0..88a38c3c35e4 100644 > --- a/drivers/cpuidle/cpuidle-haltpoll.c > +++ b/drivers/cpuidle/cpuidle-haltpoll.c > @@ -33,6 +33,7 @@ static int default_enter_idle(struct cpuidle_device *dev, > > static struct cpuidle_driver haltpoll_driver = { > .name = "haltpoll", > + .governor = "haltpoll", > .owner = THIS_MODULE, > .states = { > { /* entry 0 is for polling */ }, > diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h > index d6613101af92..c046f49c1920 100644 > --- a/drivers/cpuidle/cpuidle.h > +++ b/drivers/cpuidle/cpuidle.h > @@ -22,6 +22,7 @@ extern void cpuidle_install_idle_handler(void); > extern void cpuidle_uninstall_idle_handler(void); > > /* governors */ > +extern struct cpuidle_governor *cpuidle_find_governor(const char *str); > extern int cpuidle_switch_governor(struct cpuidle_governor *gov); > > /* sysfs */ > diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c > index dc32f34e68d9..8b8b9d89ce58 100644 > --- a/drivers/cpuidle/driver.c > +++ b/drivers/cpuidle/driver.c > @@ -87,6 +87,7 @@ static inline int __cpuidle_set_driver(struct cpuidle_driver *drv) > #else > > static struct cpuidle_driver *cpuidle_curr_driver; > +static struct cpuidle_governor *cpuidle_default_governor = NULL; > > /** > * __cpuidle_get_cpu_driver - return the global cpuidle driver pointer. > @@ -254,12 +255,25 @@ static void __cpuidle_unregister_driver(struct > cpuidle_driver *drv) > */ > int cpuidle_register_driver(struct cpuidle_driver *drv) > { > + struct cpuidle_governor *gov; > int ret; > > spin_lock(&cpuidle_driver_lock); > ret = __cpuidle_register_driver(drv); > spin_unlock(&cpuidle_driver_lock); > > + if (!ret && !strlen(param_governor) && drv->governor && > + (cpuidle_get_driver() == drv)) { > + mutex_lock(&cpuidle_lock); > + gov = cpuidle_find_governor(drv->governor); > + if (gov) { > + cpuidle_default_governor = cpuidle_curr_governor; > + if (cpuidle_switch_governor(gov) < 0) > + cpuidle_default_governor = NULL; > + } > + mutex_unlock(&cpuidle_lock); > + } > + > return ret; > } > EXPORT_SYMBOL_GPL(cpuidle_register_driver); > @@ -274,9 +288,21 @@ EXPORT_SYMBOL_GPL(cpuidle_register_driver); > */ > void cpuidle_unregister_driver(struct cpuidle_driver *drv) > { > + bool enabled = (cpuidle_get_driver() == drv); > + > spin_lock(&cpuidle_driver_lock); > __cpuidle_unregister_driver(drv); > spin_unlock(&cpuidle_driver_lock); > + > + if (!enabled) > + return; > + > + mutex_lock(&cpuidle_lock); > + if (cpuidle_default_governor) { > + if (!cpuidle_switch_governor(cpuidle_default_governor)) > + cpuidle_default_governor = NULL; > + } > + mutex_unlock(&cpuidle_lock); > } > EXPORT_SYMBOL_GPL(cpuidle_unregister_driver); > > diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c > index 2e3e14192bee..e93c11dc8304 100644 > --- a/drivers/cpuidle/governor.c > +++ b/drivers/cpuidle/governor.c > @@ -22,12 +22,12 @@ LIST_HEAD(cpuidle_governors); > struct cpuidle_governor *cpuidle_curr_governor; > > /** > - * __cpuidle_find_governor - finds a governor of the specified name > + * cpuidle_find_governor - finds a governor of the specified name > * @str: the name > * > * Must be called with cpuidle_lock acquired. > */ > -static struct cpuidle_governor * __cpuidle_find_governor(const char *str) > +struct cpuidle_governor * cpuidle_find_governor(const char *str) > { > struct cpuidle_governor *gov; > > @@ -87,7 +87,7 @@ int cpuidle_register_governor(struct cpuidle_governor *gov) > return -ENODEV; > > mutex_lock(&cpuidle_lock); > - if (__cpuidle_find_governor(gov->name) == NULL) { > + if (cpuidle_find_governor(gov->name) == NULL) { > ret = 0; > list_add_tail(&gov->governor_list, &cpuidle_governors); > if (!cpuidle_curr_governor || > diff --git a/drivers/cpuidle/governors/haltpoll.c > b/drivers/cpuidle/governors/haltpoll.c > index 797477bda486..7a703d2e0064 100644 > --- a/drivers/cpuidle/governors/haltpoll.c > +++ b/drivers/cpuidle/governors/haltpoll.c > @@ -133,7 +133,7 @@ static int haltpoll_enable_device(struct cpuidle_driver *drv, > > static struct cpuidle_governor haltpoll_governor = { > .name = "haltpoll", > - .rating = 21, > + .rating = 9, > .enable = haltpoll_enable_device, > .select = haltpoll_select, > .reflect = haltpoll_reflect, > diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h > index 1a9f54eb3aa1..2dc4c6b19c25 100644 > --- a/include/linux/cpuidle.h > +++ b/include/linux/cpuidle.h > @@ -121,6 +121,9 @@ struct cpuidle_driver { > > /* the driver handles the cpus in cpumask */ > struct cpumask *cpumask; > + > + /* preferred governor to switch at register time */ > + const char *governor; > }; > > #ifdef CONFIG_CPU_IDLE > -- > 2.17.1
On 29/08/2019 19:16, Joao Martins wrote: > On 8/29/19 4:10 PM, Joao Martins wrote: >> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >> past the online ones and thus fail to register the idle driver. >> This is because cpuidle_add_sysfs() will return with -ENODEV as a >> consequence from get_cpu_device() return no device for a non-existing >> CPU. >> >> Instead switch to cpuidle_register_driver() and manually register each >> of the present cpus through cpuhp_setup_state() callback and future >> ones that get onlined. This mimmics similar logic that intel_idle does. >> >> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >> --- > > While testing the above, I found out another issue on the haltpoll series. > But I am not sure what is best suited to cpuidle framework, hence requesting > some advise if below is a reasonable solution or something else is preferred. > > Essentially after haltpoll governor got introduced and regardless of the cpuidle > driver the default governor is gonna be haltpoll for a guest (given haltpoll > governor doesn't get registered for baremetal). Right now, for a KVM guest, the > idle governors have these ratings: > > * ladder -> 10 > * teo -> 19 > * menu -> 20 > * haltpoll -> 21 > * ladder + nohz=off -> 25 > > When a guest is booted with MWAIT and intel_idle is probed and sucessfully > registered, we will end up with a haltpoll governor being used as opposed to > 'menu' (which used to be the default case). This would prevent IIUC that other > C-states get used other than poll_state (state 0) and state 1. > > Given that haltpoll governor is largely only useful with a cpuidle-haltpoll > it doesn't look reasonable to be the default? What about using haltpoll governor > as default when haltpoll idle driver registers or modload. Are the guest and host kernel the same? IOW compiled with the same kernel config? > My idea to achieve the above would be to decrease the rating to 9 (before the > lowest rated governor) and retain old defaults before haltpoll. Then we would > allow a cpuidle driver to define a preferred governor to switch on idle driver > registration. Naturally all of would be ignored if overidden by > cpuidle.governor=. >
On 8/29/19 6:42 PM, Daniel Lezcano wrote: > On 29/08/2019 19:16, Joao Martins wrote: >> On 8/29/19 4:10 PM, Joao Martins wrote: >>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>> past the online ones and thus fail to register the idle driver. >>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>> consequence from get_cpu_device() return no device for a non-existing >>> CPU. >>> >>> Instead switch to cpuidle_register_driver() and manually register each >>> of the present cpus through cpuhp_setup_state() callback and future >>> ones that get onlined. This mimmics similar logic that intel_idle does. >>> >>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>> --- >> >> While testing the above, I found out another issue on the haltpoll series. >> But I am not sure what is best suited to cpuidle framework, hence requesting >> some advise if below is a reasonable solution or something else is preferred. >> >> Essentially after haltpoll governor got introduced and regardless of the cpuidle >> driver the default governor is gonna be haltpoll for a guest (given haltpoll >> governor doesn't get registered for baremetal). Right now, for a KVM guest, the >> idle governors have these ratings: >> >> * ladder -> 10 >> * teo -> 19 >> * menu -> 20 >> * haltpoll -> 21 >> * ladder + nohz=off -> 25 >> >> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >> registered, we will end up with a haltpoll governor being used as opposed to >> 'menu' (which used to be the default case). This would prevent IIUC that other >> C-states get used other than poll_state (state 0) and state 1. >> >> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >> it doesn't look reasonable to be the default? What about using haltpoll governor >> as default when haltpoll idle driver registers or modload. > > Are the guest and host kernel the same? IOW compiled with the same > kernel config? > You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE): CONFIG_CPU_IDLE_GOV_HALTPOLL=y And *if you are a KVM guest* it will be the default (unless using nohz=off in which case ladder gets the highest rating -- see the listing right above). Host will just behave differently because the haltpoll governor is checking if it is running as kvm guest, and only registering in that case. > >> My idea to achieve the above would be to decrease the rating to 9 (before the >> lowest rated governor) and retain old defaults before haltpoll. Then we would >> allow a cpuidle driver to define a preferred governor to switch on idle driver >> registration. Naturally all of would be ignored if overidden by >> cpuidle.governor=. >> > > > >
On 29/08/2019 20:07, Joao Martins wrote: > On 8/29/19 6:42 PM, Daniel Lezcano wrote: >> On 29/08/2019 19:16, Joao Martins wrote: >>> On 8/29/19 4:10 PM, Joao Martins wrote: >>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>>> past the online ones and thus fail to register the idle driver. >>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>>> consequence from get_cpu_device() return no device for a non-existing >>>> CPU. >>>> >>>> Instead switch to cpuidle_register_driver() and manually register each >>>> of the present cpus through cpuhp_setup_state() callback and future >>>> ones that get onlined. This mimmics similar logic that intel_idle does. >>>> >>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>> --- >>> >>> While testing the above, I found out another issue on the haltpoll series. >>> But I am not sure what is best suited to cpuidle framework, hence requesting >>> some advise if below is a reasonable solution or something else is preferred. >>> >>> Essentially after haltpoll governor got introduced and regardless of the cpuidle >>> driver the default governor is gonna be haltpoll for a guest (given haltpoll >>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the >>> idle governors have these ratings: >>> >>> * ladder -> 10 >>> * teo -> 19 >>> * menu -> 20 >>> * haltpoll -> 21 >>> * ladder + nohz=off -> 25 >>> >>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >>> registered, we will end up with a haltpoll governor being used as opposed to >>> 'menu' (which used to be the default case). This would prevent IIUC that other >>> C-states get used other than poll_state (state 0) and state 1. >>> >>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >>> it doesn't look reasonable to be the default? What about using haltpoll governor >>> as default when haltpoll idle driver registers or modload. >> >> Are the guest and host kernel the same? IOW compiled with the same >> kernel config? >> > You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE): > > CONFIG_CPU_IDLE_GOV_HALTPOLL=y > > And *if you are a KVM guest* it will be the default (unless using nohz=off in > which case ladder gets the highest rating -- see the listing right above). > > Host will just behave differently because the haltpoll governor is checking if > it is running as kvm guest, and only registering in that case. I understood the problem. Actually my question was about if the kernels are compiled for host and guest, and can be run indifferently. In this case a runtime detection must be done as you propose, otherwise that can be done at config time. I pretty sure it is the former but before thinking about the runtime side, I wanted to double check. >>> My idea to achieve the above would be to decrease the rating to 9 (before the >>> lowest rated governor) and retain old defaults before haltpoll. Then we would >>> allow a cpuidle driver to define a preferred governor to switch on idle driver >>> registration. Naturally all of would be ignored if overidden by >>> cpuidle.governor=.
On 8/29/19 7:28 PM, Daniel Lezcano wrote: > On 29/08/2019 20:07, Joao Martins wrote: >> On 8/29/19 6:42 PM, Daniel Lezcano wrote: >>> On 29/08/2019 19:16, Joao Martins wrote: >>>> On 8/29/19 4:10 PM, Joao Martins wrote: >>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>>>> past the online ones and thus fail to register the idle driver. >>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>>>> consequence from get_cpu_device() return no device for a non-existing >>>>> CPU. >>>>> >>>>> Instead switch to cpuidle_register_driver() and manually register each >>>>> of the present cpus through cpuhp_setup_state() callback and future >>>>> ones that get onlined. This mimmics similar logic that intel_idle does. >>>>> >>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>>> --- >>>> >>>> While testing the above, I found out another issue on the haltpoll series. >>>> But I am not sure what is best suited to cpuidle framework, hence requesting >>>> some advise if below is a reasonable solution or something else is preferred. >>>> >>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle >>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll >>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the >>>> idle governors have these ratings: >>>> >>>> * ladder -> 10 >>>> * teo -> 19 >>>> * menu -> 20 >>>> * haltpoll -> 21 >>>> * ladder + nohz=off -> 25 >>>> >>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >>>> registered, we will end up with a haltpoll governor being used as opposed to >>>> 'menu' (which used to be the default case). This would prevent IIUC that other >>>> C-states get used other than poll_state (state 0) and state 1. >>>> >>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >>>> it doesn't look reasonable to be the default? What about using haltpoll governor >>>> as default when haltpoll idle driver registers or modload. >>> >>> Are the guest and host kernel the same? IOW compiled with the same >>> kernel config? >>> >> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE): >> >> CONFIG_CPU_IDLE_GOV_HALTPOLL=y >> >> And *if you are a KVM guest* it will be the default (unless using nohz=off in >> which case ladder gets the highest rating -- see the listing right above). >> >> Host will just behave differently because the haltpoll governor is checking if >> it is running as kvm guest, and only registering in that case. > > I understood the problem. Actually my question was about if the kernels > are compiled for host and guest, and can be run indifferently. /nods Correct. > In this > case a runtime detection must be done as you propose, otherwise that can > be done at config time. I pretty sure it is the former but before > thinking about the runtime side, I wanted to double check. > Hmm, but even with separate kernels/configs for guest and host I think we would still have the same issue. What I was trying to convey is that even when running with a config solely for KVM guests (that is different than baremetal) you can have today various ways of idling. An Intel x86 kvm guest can have no idle driver (but arch-specific), intel_idle (like baremetal config) and haltpoll. There are usecases for these three, and makes sense to consolidate all. Say you wanted to have a kvm specific config, you would still see the same problem if you happen to compile intel_idle together with haltpoll driver+governor. Creating two separate configs here, with and without haltpoll for VMs doesn't sound effective for distros. Perhaps decreasing the rating of haltpoll governor, but while a short term fix it wouldn't give much sensible defaults without the one-off runtime switch. Unless ofc I am missing something. > >>>> My idea to achieve the above would be to decrease the rating to 9 (before the >>>> lowest rated governor) and retain old defaults before haltpoll. Then we would >>>> allow a cpuidle driver to define a preferred governor to switch on idle driver >>>> registration. Naturally all of would be ignored if overidden by >>>> cpuidle.governor=. > >
On 29/08/2019 21:11, Joao Martins wrote: > On 8/29/19 7:28 PM, Daniel Lezcano wrote: >> On 29/08/2019 20:07, Joao Martins wrote: >>> On 8/29/19 6:42 PM, Daniel Lezcano wrote: >>>> On 29/08/2019 19:16, Joao Martins wrote: >>>>> On 8/29/19 4:10 PM, Joao Martins wrote: >>>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>>>>> past the online ones and thus fail to register the idle driver. >>>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>>>>> consequence from get_cpu_device() return no device for a non-existing >>>>>> CPU. >>>>>> >>>>>> Instead switch to cpuidle_register_driver() and manually register each >>>>>> of the present cpus through cpuhp_setup_state() callback and future >>>>>> ones that get onlined. This mimmics similar logic that intel_idle does. >>>>>> >>>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>>>> --- >>>>> >>>>> While testing the above, I found out another issue on the haltpoll series. >>>>> But I am not sure what is best suited to cpuidle framework, hence requesting >>>>> some advise if below is a reasonable solution or something else is preferred. >>>>> >>>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle >>>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll >>>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the >>>>> idle governors have these ratings: >>>>> >>>>> * ladder -> 10 >>>>> * teo -> 19 >>>>> * menu -> 20 >>>>> * haltpoll -> 21 >>>>> * ladder + nohz=off -> 25 >>>>> >>>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >>>>> registered, we will end up with a haltpoll governor being used as opposed to >>>>> 'menu' (which used to be the default case). This would prevent IIUC that other >>>>> C-states get used other than poll_state (state 0) and state 1. >>>>> >>>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >>>>> it doesn't look reasonable to be the default? What about using haltpoll governor >>>>> as default when haltpoll idle driver registers or modload. >>>> >>>> Are the guest and host kernel the same? IOW compiled with the same >>>> kernel config? >>>> >>> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE): >>> >>> CONFIG_CPU_IDLE_GOV_HALTPOLL=y >>> >>> And *if you are a KVM guest* it will be the default (unless using nohz=off in >>> which case ladder gets the highest rating -- see the listing right above). >>> >>> Host will just behave differently because the haltpoll governor is checking if >>> it is running as kvm guest, and only registering in that case. >> >> I understood the problem. Actually my question was about if the kernels >> are compiled for host and guest, and can be run indifferently. > > /nods Correct. > >> In this >> case a runtime detection must be done as you propose, otherwise that can >> be done at config time. I pretty sure it is the former but before >> thinking about the runtime side, I wanted to double check. >> > Hmm, but even with separate kernels/configs for guest and host I think we would > still have the same issue. > > What I was trying to convey is that even when running with a config solely for > KVM guests (that is different than baremetal) you can have today various ways of > idling. An Intel x86 kvm guest can have no idle driver (but arch-specific), > intel_idle (like baremetal config) and haltpoll. There are usecases for these > three, and makes sense to consolidate all. > > Say you wanted to have a kvm specific config, you would still see the same > problem if you happen to compile intel_idle together with haltpoll > driver+governor. Can a guest work with an intel_idle driver? > Creating two separate configs here, with and without haltpoll > for VMs doesn't sound effective for distros. Agree > Perhaps decreasing the rating of > haltpoll governor, but while a short term fix it wouldn't give much sensible > defaults without the one-off runtime switch.
On 8/29/19 9:22 PM, Daniel Lezcano wrote: > On 29/08/2019 21:11, Joao Martins wrote: >> On 8/29/19 7:28 PM, Daniel Lezcano wrote: >>> On 29/08/2019 20:07, Joao Martins wrote: >>>> On 8/29/19 6:42 PM, Daniel Lezcano wrote: >>>>> On 29/08/2019 19:16, Joao Martins wrote: >>>>>> On 8/29/19 4:10 PM, Joao Martins wrote: >>>>>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>>>>>> past the online ones and thus fail to register the idle driver. >>>>>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>>>>>> consequence from get_cpu_device() return no device for a non-existing >>>>>>> CPU. >>>>>>> >>>>>>> Instead switch to cpuidle_register_driver() and manually register each >>>>>>> of the present cpus through cpuhp_setup_state() callback and future >>>>>>> ones that get onlined. This mimmics similar logic that intel_idle does. >>>>>>> >>>>>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>>>>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>>>>> --- >>>>>> >>>>>> While testing the above, I found out another issue on the haltpoll series. >>>>>> But I am not sure what is best suited to cpuidle framework, hence requesting >>>>>> some advise if below is a reasonable solution or something else is preferred. >>>>>> >>>>>> Essentially after haltpoll governor got introduced and regardless of the cpuidle >>>>>> driver the default governor is gonna be haltpoll for a guest (given haltpoll >>>>>> governor doesn't get registered for baremetal). Right now, for a KVM guest, the >>>>>> idle governors have these ratings: >>>>>> >>>>>> * ladder -> 10 >>>>>> * teo -> 19 >>>>>> * menu -> 20 >>>>>> * haltpoll -> 21 >>>>>> * ladder + nohz=off -> 25 >>>>>> >>>>>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >>>>>> registered, we will end up with a haltpoll governor being used as opposed to >>>>>> 'menu' (which used to be the default case). This would prevent IIUC that other >>>>>> C-states get used other than poll_state (state 0) and state 1. >>>>>> >>>>>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >>>>>> it doesn't look reasonable to be the default? What about using haltpoll governor >>>>>> as default when haltpoll idle driver registers or modload. >>>>> >>>>> Are the guest and host kernel the same? IOW compiled with the same >>>>> kernel config? >>>>> >>>> You just need to toggle this (regardless off CONFIG_HALTPOLL_CPUIDLE): >>>> >>>> CONFIG_CPU_IDLE_GOV_HALTPOLL=y >>>> >>>> And *if you are a KVM guest* it will be the default (unless using nohz=off in >>>> which case ladder gets the highest rating -- see the listing right above). >>>> >>>> Host will just behave differently because the haltpoll governor is checking if >>>> it is running as kvm guest, and only registering in that case. >>> >>> I understood the problem. Actually my question was about if the kernels >>> are compiled for host and guest, and can be run indifferently. >> >> /nods Correct. >> >>> In this >>> case a runtime detection must be done as you propose, otherwise that can >>> be done at config time. I pretty sure it is the former but before >>> thinking about the runtime side, I wanted to double check. >>> >> Hmm, but even with separate kernels/configs for guest and host I think we would >> still have the same issue. >> >> What I was trying to convey is that even when running with a config solely for >> KVM guests (that is different than baremetal) you can have today various ways of >> idling. An Intel x86 kvm guest can have no idle driver (but arch-specific), >> intel_idle (like baremetal config) and haltpoll. There are usecases for these >> three, and makes sense to consolidate all. >> >> Say you wanted to have a kvm specific config, you would still see the same >> problem if you happen to compile intel_idle together with haltpoll >> driver+governor. > > Can a guest work with an intel_idle driver? > Yes. If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc, assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel version as host (v4.17+). >> Creating two separate configs here, with and without haltpoll >> for VMs doesn't sound effective for distros. > > Agree > >> Perhaps decreasing the rating of >> haltpoll governor, but while a short term fix it wouldn't give much sensible >> defaults without the one-off runtime switch. >
On 29/08/2019 23:12, Joao Martins wrote: [ ... ] >>> Say you wanted to have a kvm specific config, you would still see the same >>> problem if you happen to compile intel_idle together with haltpoll >>> driver+governor. >> >> Can a guest work with an intel_idle driver? >> > Yes. > > If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc, > assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel > version as host (v4.17+). Ok, thanks for the clarification. >>> Creating two separate configs here, with and without haltpoll >>> for VMs doesn't sound effective for distros. >> >> Agree >> >>> Perhaps decreasing the rating of >>> haltpoll governor, but while a short term fix it wouldn't give much sensible >>> defaults without the one-off runtime switch. The rating has little meaning because each governor fits a specific situation (server, desktop, etc...) and it would probably make sense to remove it and add a default governor in the config file like the cpufreq. May be I missed the point from some previous discussion but IMHO the problem you are facing is coming from the design: there is no need to create a halt governor but move the code inside the cpuidle-halt driver instead and ignore the state asked by the governor and return the state the driver entered.
On 8/29/19 10:51 PM, Daniel Lezcano wrote: > On 29/08/2019 23:12, Joao Martins wrote: > > [ ... ] > >>>> Say you wanted to have a kvm specific config, you would still see the same >>>> problem if you happen to compile intel_idle together with haltpoll >>>> driver+governor. >>> >>> Can a guest work with an intel_idle driver? >>> >> Yes. >> >> If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc, >> assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel >> version as host (v4.17+). > > Ok, thanks for the clarification. > >>>> Creating two separate configs here, with and without haltpoll >>>> for VMs doesn't sound effective for distros. >>> >>> Agree >>> >>>> Perhaps decreasing the rating of >>>> haltpoll governor, but while a short term fix it wouldn't give much sensible >>>> defaults without the one-off runtime switch. > > The rating has little meaning because each governor fits a specific > situation (server, desktop, etc...) and it would probably make sense to > remove it and add a default governor in the config file like the cpufreq. > ICYM, I had attached a patch in the first message of this thread [0] right below the scissors mark. It's not based on config file, but it's the same thing you're saying (IIUC) but at runtime and thus allowing a driver to state a 'preferred' governor to switch to at idle registration -- let me know if you think that looks a sensible approach. Note that the intent of that patch follows the thinking of leaving all defaults as before haltpoll governor was introduced, but once user modloads/uses cpuidle-haltpoll this governor then gets switched on. [0] https://lore.kernel.org/kvm/c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com/ I would think a config-based preference on a governor would be good *if* one could actually switch idle governors at runtime like you can with cpufreq -- in case userspace wants something else other than the default. Right now we can't do that unless you toggle 'cpuidle_sysfs_switch', or picking one at boot with 'cpuidle.governor='. > May be I missed the point from some previous discussion but IMHO the > problem you are facing is coming from the design: there is no need to > create a halt governor but move the code inside the cpuidle-halt driver > instead and ignore the state asked by the governor and return the state > the driver entered. > Marcello's original patch series (first 3 revisions to be exact) actually had everything in the idle driver, but after some revisions (v4+) Rafael asked him to split the logic into a governor and unify it with poll state[1]. [1] https://lore.kernel.org/kvm/CAJZ5v0gPbSXB3r71XaT-4Q7LsiFO_UVymBwOmU8J1W5+COk_1g@mail.gmail.com/ Joao
On Thu, Aug 29, 2019 at 7:24 PM Marcelo Tosatti <mtosatti@redhat.com> wrote: > > On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote: > > On 8/29/19 4:10 PM, Joao Martins wrote: > > > When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus > > > past the online ones and thus fail to register the idle driver. > > > This is because cpuidle_add_sysfs() will return with -ENODEV as a > > > consequence from get_cpu_device() return no device for a non-existing > > > CPU. > > > > > > Instead switch to cpuidle_register_driver() and manually register each > > > of the present cpus through cpuhp_setup_state() callback and future > > > ones that get onlined. This mimmics similar logic that intel_idle does. > > > > > > Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") > > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > > > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > > > --- > > > > While testing the above, I found out another issue on the haltpoll series. > > But I am not sure what is best suited to cpuidle framework, hence requesting > > some advise if below is a reasonable solution or something else is preferred. > > > > Essentially after haltpoll governor got introduced and regardless of the cpuidle > > driver the default governor is gonna be haltpoll for a guest (given haltpoll > > governor doesn't get registered for baremetal). > > Right. > > > Right now, for a KVM guest, the > > idle governors have these ratings: > > > > * ladder -> 10 > > * teo -> 19 > > * menu -> 20 > > * haltpoll -> 21 > > * ladder + nohz=off -> 25 > > Yes. PowerPC KVM guests crash currently due to the use of the haltpoll > governor (have a patch in my queue to fix this, but your solution > embraces more cases). > > > When a guest is booted with MWAIT and intel_idle is probed and sucessfully > > registered, we will end up with a haltpoll governor being used as opposed to > > 'menu' (which used to be the default case). This would prevent IIUC that other > > C-states get used other than poll_state (state 0) and state 1. > > > > Given that haltpoll governor is largely only useful with a cpuidle-haltpoll > > it doesn't look reasonable to be the default? What about using haltpoll governor > > as default when haltpoll idle driver registers or modloads. > > > > My idea to achieve the above would be to decrease the rating to 9 (before the > > lowest rated governor) and retain old defaults before haltpoll. Then we would > > allow a cpuidle driver to define a preferred governor to switch on idle driver > > registration. Naturally all of would be ignored if overidden by > > cpuidle.governor=. > > > > The diff below the scissors line is an example of that. > > > > Thoughts? > > Works for me. Rafael? It works for me too, basically, except that I would rename cpuidle_default_governor in the patch to cpuidle_prev_governor.
On Fri, Aug 30, 2019 at 1:09 PM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 8/29/19 10:51 PM, Daniel Lezcano wrote: > > On 29/08/2019 23:12, Joao Martins wrote: > > > > [ ... ] > > > >>>> Say you wanted to have a kvm specific config, you would still see the same > >>>> problem if you happen to compile intel_idle together with haltpoll > >>>> driver+governor. > >>> > >>> Can a guest work with an intel_idle driver? > >>> > >> Yes. > >> > >> If you use Qemu you would add '-overcommit cpu-pm=on' to try it out. ofc, > >> assuming you're on a relatively recent Qemu (v3.0+) and a fairly recent kernel > >> version as host (v4.17+). > > > > Ok, thanks for the clarification. > > > >>>> Creating two separate configs here, with and without haltpoll > >>>> for VMs doesn't sound effective for distros. > >>> > >>> Agree > >>> > >>>> Perhaps decreasing the rating of > >>>> haltpoll governor, but while a short term fix it wouldn't give much sensible > >>>> defaults without the one-off runtime switch. > > > > The rating has little meaning because each governor fits a specific > > situation (server, desktop, etc...) and it would probably make sense to > > remove it and add a default governor in the config file like the cpufreq. > > > ICYM, I had attached a patch in the first message of this thread [0] right below > the scissors mark. It's not based on config file, but it's the same thing you're > saying (IIUC) but at runtime and thus allowing a driver to state a 'preferred' > governor to switch to at idle registration -- let me know if you think that > looks a sensible approach. Note that the intent of that patch follows the > thinking of leaving all defaults as before haltpoll governor was introduced, but > once user modloads/uses cpuidle-haltpoll this governor then gets switched on. > > [0] https://lore.kernel.org/kvm/c8cf8dcc-76a3-3e15-f514-2cb9df1bbbdc@oracle.com/ > > I would think a config-based preference on a governor would be good *if* one > could actually switch idle governors at runtime like you can with cpufreq -- in > case userspace wants something else other than the default. Right now we can't > do that unless you toggle 'cpuidle_sysfs_switch', or picking one at boot with > 'cpuidle.governor='. FWIW, I've been thinking about getting rid of the cpuidle_sysfs_switch command line option and always allowing user space to switch cpuidle governors at run time. At least I see no reason why that would not work ATM.
On 9/2/19 10:55 PM, Rafael J. Wysocki wrote: > On Thu, Aug 29, 2019 at 7:24 PM Marcelo Tosatti <mtosatti@redhat.com> wrote: >> >> On Thu, Aug 29, 2019 at 06:16:05PM +0100, Joao Martins wrote: >>> On 8/29/19 4:10 PM, Joao Martins wrote: >>>> When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus >>>> past the online ones and thus fail to register the idle driver. >>>> This is because cpuidle_add_sysfs() will return with -ENODEV as a >>>> consequence from get_cpu_device() return no device for a non-existing >>>> CPU. >>>> >>>> Instead switch to cpuidle_register_driver() and manually register each >>>> of the present cpus through cpuhp_setup_state() callback and future >>>> ones that get onlined. This mimmics similar logic that intel_idle does. >>>> >>>> Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") >>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >>>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>>> --- >>> >>> While testing the above, I found out another issue on the haltpoll series. >>> But I am not sure what is best suited to cpuidle framework, hence requesting >>> some advise if below is a reasonable solution or something else is preferred. >>> >>> Essentially after haltpoll governor got introduced and regardless of the cpuidle >>> driver the default governor is gonna be haltpoll for a guest (given haltpoll >>> governor doesn't get registered for baremetal). >> >> Right. >> >>> Right now, for a KVM guest, the >>> idle governors have these ratings: >>> >>> * ladder -> 10 >>> * teo -> 19 >>> * menu -> 20 >>> * haltpoll -> 21 >>> * ladder + nohz=off -> 25 >> >> Yes. PowerPC KVM guests crash currently due to the use of the haltpoll >> governor (have a patch in my queue to fix this, but your solution >> embraces more cases). >> >>> When a guest is booted with MWAIT and intel_idle is probed and sucessfully >>> registered, we will end up with a haltpoll governor being used as opposed to >>> 'menu' (which used to be the default case). This would prevent IIUC that other >>> C-states get used other than poll_state (state 0) and state 1. >>> >>> Given that haltpoll governor is largely only useful with a cpuidle-haltpoll >>> it doesn't look reasonable to be the default? What about using haltpoll governor >>> as default when haltpoll idle driver registers or modloads. >>> >>> My idea to achieve the above would be to decrease the rating to 9 (before the >>> lowest rated governor) and retain old defaults before haltpoll. Then we would >>> allow a cpuidle driver to define a preferred governor to switch on idle driver >>> registration. Naturally all of would be ignored if overidden by >>> cpuidle.governor=. >>> >>> The diff below the scissors line is an example of that. >>> >>> Thoughts? >> >> Works for me. Rafael? > > It works for me too, basically, except that I would rename > cpuidle_default_governor in the patch to cpuidle_prev_governor. > Great! I'll send over a series with this then (splitted accordingly). Also, In the course of hotplug/hotunplug testing, I found two small issues with modload/modunload -- regardless of the hotplug patch. So I am gonna add that to the series too. Joao
diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c index 8baade23f8d0..88a38c3c35e4 100644 --- a/drivers/cpuidle/cpuidle-haltpoll.c +++ b/drivers/cpuidle/cpuidle-haltpoll.c @@ -33,6 +33,7 @@ static int default_enter_idle(struct cpuidle_device *dev, static struct cpuidle_driver haltpoll_driver = { .name = "haltpoll", + .governor = "haltpoll", .owner = THIS_MODULE, .states = { { /* entry 0 is for polling */ }, diff --git a/drivers/cpuidle/cpuidle.h b/drivers/cpuidle/cpuidle.h index d6613101af92..c046f49c1920 100644 --- a/drivers/cpuidle/cpuidle.h +++ b/drivers/cpuidle/cpuidle.h @@ -22,6 +22,7 @@ extern void cpuidle_install_idle_handler(void); extern void cpuidle_uninstall_idle_handler(void); /* governors */ +extern struct cpuidle_governor *cpuidle_find_governor(const char *str); extern int cpuidle_switch_governor(struct cpuidle_governor *gov); /* sysfs */ diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c index dc32f34e68d9..8b8b9d89ce58 100644 --- a/drivers/cpuidle/driver.c +++ b/drivers/cpuidle/driver.c @@ -87,6 +87,7 @@ static inline int __cpuidle_set_driver(struct cpuidle_driver *drv) #else static struct cpuidle_driver *cpuidle_curr_driver; +static struct cpuidle_governor *cpuidle_default_governor = NULL; /** * __cpuidle_get_cpu_driver - return the global cpuidle driver pointer. @@ -254,12 +255,25 @@ static void __cpuidle_unregister_driver(struct cpuidle_driver *drv) */ int cpuidle_register_driver(struct cpuidle_driver *drv) { + struct cpuidle_governor *gov; int ret; spin_lock(&cpuidle_driver_lock); ret = __cpuidle_register_driver(drv); spin_unlock(&cpuidle_driver_lock); + if (!ret && !strlen(param_governor) && drv->governor && + (cpuidle_get_driver() == drv)) { + mutex_lock(&cpuidle_lock); + gov = cpuidle_find_governor(drv->governor); + if (gov) { + cpuidle_default_governor = cpuidle_curr_governor; + if (cpuidle_switch_governor(gov) < 0) + cpuidle_default_governor = NULL; + } + mutex_unlock(&cpuidle_lock); + } + return ret; } EXPORT_SYMBOL_GPL(cpuidle_register_driver); @@ -274,9 +288,21 @@ EXPORT_SYMBOL_GPL(cpuidle_register_driver); */ void cpuidle_unregister_driver(struct cpuidle_driver *drv) { + bool enabled = (cpuidle_get_driver() == drv); + spin_lock(&cpuidle_driver_lock); __cpuidle_unregister_driver(drv); spin_unlock(&cpuidle_driver_lock); + + if (!enabled) + return; + + mutex_lock(&cpuidle_lock); + if (cpuidle_default_governor) { + if (!cpuidle_switch_governor(cpuidle_default_governor)) + cpuidle_default_governor = NULL; + } + mutex_unlock(&cpuidle_lock); } EXPORT_SYMBOL_GPL(cpuidle_unregister_driver); diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c index 2e3e14192bee..e93c11dc8304 100644 --- a/drivers/cpuidle/governor.c +++ b/drivers/cpuidle/governor.c @@ -22,12 +22,12 @@ LIST_HEAD(cpuidle_governors); struct cpuidle_governor *cpuidle_curr_governor; /** - * __cpuidle_find_governor - finds a governor of the specified name + * cpuidle_find_governor - finds a governor of the specified name * @str: the name * * Must be called with cpuidle_lock acquired. */ -static struct cpuidle_governor * __cpuidle_find_governor(const char *str) +struct cpuidle_governor * cpuidle_find_governor(const char *str) { struct cpuidle_governor *gov; @@ -87,7 +87,7 @@ int cpuidle_register_governor(struct cpuidle_governor *gov) return -ENODEV; mutex_lock(&cpuidle_lock); - if (__cpuidle_find_governor(gov->name) == NULL) { + if (cpuidle_find_governor(gov->name) == NULL) { ret = 0; list_add_tail(&gov->governor_list, &cpuidle_governors); if (!cpuidle_curr_governor || diff --git a/drivers/cpuidle/governors/haltpoll.c b/drivers/cpuidle/governors/haltpoll.c index 797477bda486..7a703d2e0064 100644 --- a/drivers/cpuidle/governors/haltpoll.c +++ b/drivers/cpuidle/governors/haltpoll.c @@ -133,7 +133,7 @@ static int haltpoll_enable_device(struct cpuidle_driver *drv, static struct cpuidle_governor haltpoll_governor = { .name = "haltpoll", - .rating = 21, + .rating = 9, .enable = haltpoll_enable_device, .select = haltpoll_select, .reflect = haltpoll_reflect, diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 1a9f54eb3aa1..2dc4c6b19c25 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -121,6 +121,9 @@ struct cpuidle_driver { /* the driver handles the cpus in cpumask */ struct cpumask *cpumask; + + /* preferred governor to switch at register time */ + const char *governor; }; #ifdef CONFIG_CPU_IDLE