diff mbox

[v2] cpufreq: Don't destroy/realloc policy/sysfs on hotplug/suspend

Message ID 1405052287-4744-1-git-send-email-skannan@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Saravana Kannan July 11, 2014, 4:18 a.m. UTC
The CPUfreq driver moves the cpufreq policy ownership between CPUs when
CPUs within a cluster (CPUs sharing same policy) go ONLINE/OFFLINE. When
moving policy ownership between CPUs, it also moves the cpufreq sysfs
directory between CPUs and also fixes up the symlinks of the other CPUs in
the cluster.

Also, when all the CPUs in a cluster go OFFLINE, all the sysfs nodes and
directories are deleted, the kobject is released and the policy is freed.
And when the first CPU in a cluster comes up, the policy is reallocated and
initialized, kobject is acquired, the sysfs nodes are created or symlinked,
etc.

All these steps end up creating unnecessarily complicated code and locking.
There's no real benefit to adding/removing/moving the sysfs nodes and the
policy between CPUs. Other per CPU sysfs directories like power and cpuidle
are left alone during hotplug. So there's some precedence to what this
patch is trying to do.

This patch simplifies a lot of the code and locking by removing the
adding/removing/moving of policy/sysfs/kobj and just leaves the cpufreq
directory and policy in place irrespective of whether the CPUs are
ONLINE/OFFLINE.

Leaving the policy, sysfs and kobject in place also brings these additional
benefits:
* Faster suspend/resume.
* Faster hotplug.
* Sysfs file permissions maintained across hotplug without userspace
  workarounds.
* Policy settings and governor tunables maintained across suspend/resume
  and hotplug.
* Cpufreq stats would be maintained across hotplug for all CPUs and can be
  queried even after CPU goes OFFLINE.

Change-Id: I39c395e1fee8731880c0fd7c8a9c1d83e2e4b8d0
Tested-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
---

Preliminary testing has been done. cpufreq directories are getting created
properly. Online/offline of CPUs work. Policies remain unmodifiable from
userspace when all policy CPUs are offline.

Error handling code has NOT been updated.

I've added a bunch of FIXME comments next to where I'm not sure about the
locking in the existing code. I believe most of the try_lock's were present
to prevent a deadlock between sysfs lock and the cpufreq locks. Now that
the sysfs entries are not touched after creating them, we should be able to
replace most/all of these try_lock's with a normal lock.

This patch has more room for code simplification, but I would like to get
some acks for the functionality and this code before I do further
simplification.

I should also be able to remove get_online_cpus() in the store function and
replace it with just a check for policy->governor_enabled. That should
theoretically reduce some contention between cpufreq stats check and
hotplug of unrelated CPUs.

Appreciate all the feedback.

Thanks,
Saravana

 drivers/cpufreq/cpufreq.c | 331 ++++++++++------------------------------------
 1 file changed, 69 insertions(+), 262 deletions(-)

Comments

Viresh Kumar July 11, 2014, 6:19 a.m. UTC | #1
Hi Saravana,

Thanks for trying this..

On 11 July 2014 09:48, Saravana Kannan <skannan@codeaurora.org> wrote:
> The CPUfreq driver moves the cpufreq policy ownership between CPUs when

s/driver/core

> CPUs within a cluster (CPUs sharing same policy) go ONLINE/OFFLINE. When
> moving policy ownership between CPUs, it also moves the cpufreq sysfs
> directory between CPUs and also fixes up the symlinks of the other CPUs in
> the cluster.
>
> Also, when all the CPUs in a cluster go OFFLINE, all the sysfs nodes and
> directories are deleted, the kobject is released and the policy is freed.
> And when the first CPU in a cluster comes up, the policy is reallocated and
> initialized, kobject is acquired, the sysfs nodes are created or symlinked,
> etc.
>
> All these steps end up creating unnecessarily complicated code and locking.
> There's no real benefit to adding/removing/moving the sysfs nodes and the
> policy between CPUs. Other per CPU sysfs directories like power and cpuidle
> are left alone during hotplug. So there's some precedence to what this
> patch is trying to do.
>
> This patch simplifies a lot of the code and locking by removing the
> adding/removing/moving of policy/sysfs/kobj and just leaves the cpufreq
> directory and policy in place irrespective of whether the CPUs are
> ONLINE/OFFLINE.
>
> Leaving the policy, sysfs and kobject in place also brings these additional
> benefits:
> * Faster suspend/resume.
> * Faster hotplug.
> * Sysfs file permissions maintained across hotplug without userspace
>   workarounds.
> * Policy settings and governor tunables maintained across suspend/resume
>   and hotplug.

Its already maintained during suspend/resume.

> * Cpufreq stats would be maintained across hotplug for all CPUs and can be
>   queried even after CPU goes OFFLINE.
>
> Change-Id: I39c395e1fee8731880c0fd7c8a9c1d83e2e4b8d0

remove these while sending stuff upstream..

> Tested-by: Stephen Boyd <sboyd@codeaurora.org>
> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
> ---
>
> Preliminary testing has been done. cpufreq directories are getting created
> properly. Online/offline of CPUs work. Policies remain unmodifiable from
> userspace when all policy CPUs are offline.
>
> Error handling code has NOT been updated.
>
> I've added a bunch of FIXME comments next to where I'm not sure about the
> locking in the existing code. I believe most of the try_lock's were present
> to prevent a deadlock between sysfs lock and the cpufreq locks. Now that
> the sysfs entries are not touched after creating them, we should be able to
> replace most/all of these try_lock's with a normal lock.
>
> This patch has more room for code simplification, but I would like to get
> some acks for the functionality and this code before I do further
> simplification.
>
> I should also be able to remove get_online_cpus() in the store function and
> replace it with just a check for policy->governor_enabled. That should
> theoretically reduce some contention between cpufreq stats check and
> hotplug of unrelated CPUs.

Its just too much stuff in a single patch, I can still review it as I
am very much
aware of every bit of code written here. But would be very difficult for others
to review it. These are so many cases, configuration we have to think of
and adding bugs with such a large patch is so so so easy.

>  drivers/cpufreq/cpufreq.c | 331 ++++++++++------------------------------------
>  1 file changed, 69 insertions(+), 262 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 62259d2..e350b15 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -859,16 +859,16 @@ void cpufreq_sysfs_remove_file(const struct attribute *attr)
>  }
>  EXPORT_SYMBOL(cpufreq_sysfs_remove_file);
>
> -/* symlink affected CPUs */
> +/* symlink related CPUs */
>  static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>  {
> -       unsigned int j;
> +       unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
>         int ret = 0;
>
> -       for_each_cpu(j, policy->cpus) {
> +       for_each_cpu(j, policy->related_cpus) {
>                 struct device *cpu_dev;
>
> -               if (j == policy->cpu)
> +               if (j == first_cpu)

why?

>                         continue;
>
>                 pr_debug("Adding link for CPU: %u\n", j);
> @@ -881,12 +881,16 @@ static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>         return ret;
>  }
>
> -static int cpufreq_add_dev_interface(struct cpufreq_policy *policy,
> -                                    struct device *dev)
> +static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>  {
>         struct freq_attr **drv_attr;
> +       struct device *dev;
>         int ret = 0;
>
> +       dev = get_cpu_device(cpumask_first(policy->related_cpus));
> +       if (!dev)
> +               return -EINVAL;
> +

Why?

>         /* prepare interface data */
>         ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq,
>                                    &dev->kobj, "cpufreq");
> @@ -961,60 +965,53 @@ static void cpufreq_init_policy(struct cpufreq_policy *policy)
>  }
>
>  #ifdef CONFIG_HOTPLUG_CPU
> -static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy,
> -                                 unsigned int cpu, struct device *dev)
> +static int cpufreq_change_policy_cpus(struct cpufreq_policy *policy,
> +                                 unsigned int cpu, bool add)
>  {
>         int ret = 0;
> -       unsigned long flags;
> +       unsigned int cpus;
>
> -       if (has_target()) {
> +       down_write(&policy->rwsem);
> +       cpus = cpumask_weight(policy->cpus);
> +       if (has_target() && cpus > 0) {
>                 ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
>                 if (ret) {
>                         pr_err("%s: Failed to stop governor\n", __func__);
> -                       return ret;
> +                       goto unlock;
>                 }
>         }
>
> -       down_write(&policy->rwsem);
> -
> -       write_lock_irqsave(&cpufreq_driver_lock, flags);
> -
> -       cpumask_set_cpu(cpu, policy->cpus);
> -       per_cpu(cpufreq_cpu_data, cpu) = policy;
> -       write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> +       if (add)
> +               cpumask_set_cpu(cpu, policy->cpus);
> +       else
> +               cpumask_clear_cpu(cpu, policy->cpus);
>
> -       up_write(&policy->rwsem);
> +       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> +                                       CPUFREQ_UPDATE_POLICY_CPU, policy);

This should be only called when policy->cpu is updated. And shouldn't
be called anymore and can be dropped as you might not wanna change
policy->cpu after this patch.

>
> -       if (has_target()) {
> +       cpus = cpumask_weight(policy->cpus);
> +       policy->cpu = cpumask_first(policy->cpus);

why update it at all? Also, as per your logic what if cpus == 0?

> +       if (has_target() && cpus > 0) {

Instead of > or < use cpus or !cpus.

>                 ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
>                 if (!ret)
>                         ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
>
>                 if (ret) {
>                         pr_err("%s: Failed to start governor\n", __func__);
> -                       return ret;
> +                       goto unlock;
>                 }
>         }
>
> -       return sysfs_create_link(&dev->kobj, &policy->kobj, "cpufreq");
> -}
> -#endif
> -
> -static struct cpufreq_policy *cpufreq_policy_restore(unsigned int cpu)
> -{
> -       struct cpufreq_policy *policy;
> -       unsigned long flags;
> -
> -       read_lock_irqsave(&cpufreq_driver_lock, flags);
> -
> -       policy = per_cpu(cpufreq_cpu_data_fallback, cpu);
> -
> -       read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> +       if (cpus < 1 && cpufreq_driver->stop_cpu && cpufreq_driver->setpolicy) {

Can be made 'else' part of above 'if', just need to move if(target)
inside the cpus > 1
block.

> +               cpufreq_driver->stop_cpu(policy);

Where is ->exit() gone?

> +       }
>
> -       policy->governor = NULL;
> +unlock:
> +       up_write(&policy->rwsem);
>
> -       return policy;
> +       return ret;
>  }
> +#endif
>
>  static struct cpufreq_policy *cpufreq_policy_alloc(void)
>  {
> @@ -1076,22 +1073,6 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>         kfree(policy);
>  }
>
> -static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> -{
> -       if (WARN_ON(cpu == policy->cpu))
> -               return;
> -
> -       down_write(&policy->rwsem);
> -
> -       policy->last_cpu = policy->cpu;
> -       policy->cpu = cpu;
> -
> -       up_write(&policy->rwsem);
> -
> -       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> -                       CPUFREQ_UPDATE_POLICY_CPU, policy);
> -}
> -
>  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  {
>         unsigned int j, cpu = dev->id;
> @@ -1099,9 +1080,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>         struct cpufreq_policy *policy;
>         unsigned long flags;
>         bool recover_policy = cpufreq_suspended;
> -#ifdef CONFIG_HOTPLUG_CPU
> -       struct cpufreq_policy *tpolicy;
> -#endif
>
>         if (cpu_is_offline(cpu))
>                 return 0;
> @@ -1111,55 +1089,28 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  #ifdef CONFIG_SMP
>         /* check whether a different CPU already registered this
>          * CPU because it is in the same boat. */
> +       /* FIXME: This probably needs fixing to avoid "try lock" from
> +        * returning NULL. Also, change to likely() */

I wanted to give this comment later, but that's fine ..

- First, I couldn't understand the try-lock fixme
- Second likely() would be better
- policy will not be available only while adding first CPU of every cluster on
every driver registration..
- And you aren't freeing 'struct cpufreq_policy' at all now. Would result in
Memory leak cpufreq driver is compiled as a module and inserted/removed
multiple times.

>         policy = cpufreq_cpu_get(cpu);
>         if (unlikely(policy)) {
> +               cpufreq_change_policy_cpus(policy, cpu, true);
>                 cpufreq_cpu_put(policy);
>                 return 0;
>         }

This optimization wasn't for the hotplug case, but for adding non-policy->cpu
cpus for the first time.

In your case policy->cpus would already be updated and so calling
cpufreq_change_policy_cpus() isn't required.

>  #endif
>
> +       /* FIXME: Is returning 0 the right thing to do?! Existing code */
>         if (!down_read_trylock(&cpufreq_rwsem))
>                 return 0;

Yeah, must have a better return value. But as I said, these kind of changes
must be added in separate patches.

>
> -#ifdef CONFIG_HOTPLUG_CPU
> -       /* Check if this cpu was hot-unplugged earlier and has siblings */
> -       read_lock_irqsave(&cpufreq_driver_lock, flags);
> -       list_for_each_entry(tpolicy, &cpufreq_policy_list, policy_list) {
> -               if (cpumask_test_cpu(cpu, tpolicy->related_cpus)) {
> -                       read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -                       ret = cpufreq_add_policy_cpu(tpolicy, cpu, dev);
> -                       up_read(&cpufreq_rwsem);
> -                       return ret;
> -               }
> -       }
> -       read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -#endif

And this one was for the hotplug case which you could have reused.

> -       /*
> -        * Restore the saved policy when doing light-weight init and fall back
> -        * to the full init if that fails.
> -        */
> -       policy = recover_policy ? cpufreq_policy_restore(cpu) : NULL;
> -       if (!policy) {
> -               recover_policy = false;
> -               policy = cpufreq_policy_alloc();
> -               if (!policy)
> -                       goto nomem_out;
> -       }

You might need to use it somehow.. Currently you are never doing
this:  per_cpu(cpufreq_cpu_data, cpu) = NULL;

which would result in crazy things once you try {un}registering your
driver...

> -       /*
> -        * In the resume path, since we restore a saved policy, the assignment
> -        * to policy->cpu is like an update of the existing policy, rather than
> -        * the creation of a brand new one. So we need to perform this update
> -        * by invoking update_policy_cpu().
> -        */
> -       if (recover_policy && cpu != policy->cpu)
> -               update_policy_cpu(policy, cpu);
> -       else
> -               policy->cpu = cpu;
> +       /* If we get this far, this is the first time we are adding the
> +        * policy */
> +       policy = cpufreq_policy_alloc();
> +       if (!policy)
> +               goto nomem_out;
> +       policy->cpu = cpu;
>
>         cpumask_copy(policy->cpus, cpumask_of(cpu));
> -
>         init_completion(&policy->kobj_unregister);
>         INIT_WORK(&policy->update, handle_update);
>
> @@ -1175,20 +1126,19 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>         /* related cpus should atleast have policy->cpus */
>         cpumask_or(policy->related_cpus, policy->related_cpus, policy->cpus);
>
> +       /* Weed out impossible CPUs. */
> +       cpumask_and(policy->related_cpus, policy->related_cpus,
> +                       cpu_possible_mask);

why?


Sorry but I am stopping now. I have already pointed out some issues
which make this unusable.

Over that, its way too hard to review all this in a single patch. For every
piece of line you add/remove I have to spend 10 mins thinking about
all the possible cases that were solved with this.. And if the rest of the
patch is going to fix them or not, etc.. To make it simple I did apply
your patch and had a close look at the new state of code, but its getting
tougher and tougher.

Please make sure you take care of these issues:
- suspend/resume
- hotplug
- module insert/remove
- Memory leaks
- multi cluster systems (with one and multiple CPU per cluster)
*by cluster I mean group of CPUs sharing clock line
- single cluster ones, one and multiple CPUs

Will see how V3 goes. Thanks.

--
viresh
Srivatsa S. Bhat July 11, 2014, 7:43 a.m. UTC | #2
On 07/11/2014 09:48 AM, Saravana Kannan wrote:
> The CPUfreq driver moves the cpufreq policy ownership between CPUs when
> CPUs within a cluster (CPUs sharing same policy) go ONLINE/OFFLINE. When
> moving policy ownership between CPUs, it also moves the cpufreq sysfs
> directory between CPUs and also fixes up the symlinks of the other CPUs in
> the cluster.
> 
> Also, when all the CPUs in a cluster go OFFLINE, all the sysfs nodes and
> directories are deleted, the kobject is released and the policy is freed.
> And when the first CPU in a cluster comes up, the policy is reallocated and
> initialized, kobject is acquired, the sysfs nodes are created or symlinked,
> etc.
> 
> All these steps end up creating unnecessarily complicated code and locking.
> There's no real benefit to adding/removing/moving the sysfs nodes and the
> policy between CPUs. Other per CPU sysfs directories like power and cpuidle
> are left alone during hotplug. So there's some precedence to what this
> patch is trying to do.
> 
> This patch simplifies a lot of the code and locking by removing the
> adding/removing/moving of policy/sysfs/kobj and just leaves the cpufreq
> directory and policy in place irrespective of whether the CPUs are
> ONLINE/OFFLINE.
> 
> Leaving the policy, sysfs and kobject in place also brings these additional
> benefits:
> * Faster suspend/resume.
> * Faster hotplug.
> * Sysfs file permissions maintained across hotplug without userspace
>   workarounds.
> * Policy settings and governor tunables maintained across suspend/resume
>   and hotplug.
> * Cpufreq stats would be maintained across hotplug for all CPUs and can be
>   queried even after CPU goes OFFLINE.
> 
> Change-Id: I39c395e1fee8731880c0fd7c8a9c1d83e2e4b8d0
> Tested-by: Stephen Boyd <sboyd@codeaurora.org>
> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
> ---
> 
> Preliminary testing has been done. cpufreq directories are getting created
> properly. Online/offline of CPUs work. Policies remain unmodifiable from
> userspace when all policy CPUs are offline.
> 
> Error handling code has NOT been updated.
> 
> I've added a bunch of FIXME comments next to where I'm not sure about the
> locking in the existing code. I believe most of the try_lock's were present
> to prevent a deadlock between sysfs lock and the cpufreq locks. Now that
> the sysfs entries are not touched after creating them, we should be able to
> replace most/all of these try_lock's with a normal lock.
> 
> This patch has more room for code simplification, but I would like to get
> some acks for the functionality and this code before I do further
> simplification.
> 

The idea behind this work is very welcome indeed! IMHO, there is nothing
conceptually wrong in maintaining the per-cpu sysfs files across CPU hotplug
(as long as we take care to return appropriate error codes if userspace
tries to set values using the control files of offline CPUs). So, it really
boils down to whether or not we get the implementation right; the idea itself
looks fine as of now. Hence, your efforts in making this patch(set) easier to
review will certainly help. Perhaps you can simplify the code later, but at
this point, splitting up this patch into multiple smaller, reviewable pieces
(accompanied by well-written changelogs that explain the intent) is the utmost
priority. Just like Viresh, even I had a hard time reviewing all of this in
one go.

Thank you for taking up this work!

Regards,
Srivatsa S. Bhat

> I should also be able to remove get_online_cpus() in the store function and
> replace it with just a check for policy->governor_enabled. That should
> theoretically reduce some contention between cpufreq stats check and
> hotplug of unrelated CPUs.
> 
> Appreciate all the feedback.
> 
> Thanks,
> Saravana
> 
>  drivers/cpufreq/cpufreq.c | 331 ++++++++++------------------------------------
>  1 file changed, 69 insertions(+), 262 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 62259d2..e350b15 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -859,16 +859,16 @@ void cpufreq_sysfs_remove_file(const struct attribute *attr)
>  }
>  EXPORT_SYMBOL(cpufreq_sysfs_remove_file);
> 
> -/* symlink affected CPUs */
> +/* symlink related CPUs */
>  static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>  {
> -	unsigned int j;
> +	unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
>  	int ret = 0;
> 
> -	for_each_cpu(j, policy->cpus) {
> +	for_each_cpu(j, policy->related_cpus) {
>  		struct device *cpu_dev;
> 
> -		if (j == policy->cpu)
> +		if (j == first_cpu)
>  			continue;
> 
>  		pr_debug("Adding link for CPU: %u\n", j);
> @@ -881,12 +881,16 @@ static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>  	return ret;
>  }
> 
> -static int cpufreq_add_dev_interface(struct cpufreq_policy *policy,
> -				     struct device *dev)
> +static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>  {
>  	struct freq_attr **drv_attr;
> +	struct device *dev;
>  	int ret = 0;
> 
> +	dev = get_cpu_device(cpumask_first(policy->related_cpus));
> +	if (!dev)
> +		return -EINVAL;
> +
>  	/* prepare interface data */
>  	ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq,
>  				   &dev->kobj, "cpufreq");
> @@ -961,60 +965,53 @@ static void cpufreq_init_policy(struct cpufreq_policy *policy)
>  }
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> -static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy,
> -				  unsigned int cpu, struct device *dev)
> +static int cpufreq_change_policy_cpus(struct cpufreq_policy *policy,
> +				  unsigned int cpu, bool add)
>  {
>  	int ret = 0;
> -	unsigned long flags;
> +	unsigned int cpus;
> 
> -	if (has_target()) {
> +	down_write(&policy->rwsem);
> +	cpus = cpumask_weight(policy->cpus);
> +	if (has_target() && cpus > 0) {
>  		ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
>  		if (ret) {
>  			pr_err("%s: Failed to stop governor\n", __func__);
> -			return ret;
> +			goto unlock;
>  		}
>  	}
> 
> -	down_write(&policy->rwsem);
> -
> -	write_lock_irqsave(&cpufreq_driver_lock, flags);
> -
> -	cpumask_set_cpu(cpu, policy->cpus);
> -	per_cpu(cpufreq_cpu_data, cpu) = policy;
> -	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> +	if (add)
> +		cpumask_set_cpu(cpu, policy->cpus);
> +	else
> +		cpumask_clear_cpu(cpu, policy->cpus);
> 
> -	up_write(&policy->rwsem);
> +	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> +					CPUFREQ_UPDATE_POLICY_CPU, policy);
> 
> -	if (has_target()) {
> +	cpus = cpumask_weight(policy->cpus);
> +	policy->cpu = cpumask_first(policy->cpus);
> +	if (has_target() && cpus > 0) {
>  		ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
>  		if (!ret)
>  			ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
> 
>  		if (ret) {
>  			pr_err("%s: Failed to start governor\n", __func__);
> -			return ret;
> +			goto unlock;
>  		}
>  	}
> 
> -	return sysfs_create_link(&dev->kobj, &policy->kobj, "cpufreq");
> -}
> -#endif
> -
> -static struct cpufreq_policy *cpufreq_policy_restore(unsigned int cpu)
> -{
> -	struct cpufreq_policy *policy;
> -	unsigned long flags;
> -
> -	read_lock_irqsave(&cpufreq_driver_lock, flags);
> -
> -	policy = per_cpu(cpufreq_cpu_data_fallback, cpu);
> -
> -	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> +	if (cpus < 1 && cpufreq_driver->stop_cpu && cpufreq_driver->setpolicy) {
> +		cpufreq_driver->stop_cpu(policy);
> +	}
> 
> -	policy->governor = NULL;
> +unlock:
> +	up_write(&policy->rwsem);
> 
> -	return policy;
> +	return ret;
>  }
> +#endif
> 
>  static struct cpufreq_policy *cpufreq_policy_alloc(void)
>  {
> @@ -1076,22 +1073,6 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>  	kfree(policy);
>  }
> 
> -static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> -{
> -	if (WARN_ON(cpu == policy->cpu))
> -		return;
> -
> -	down_write(&policy->rwsem);
> -
> -	policy->last_cpu = policy->cpu;
> -	policy->cpu = cpu;
> -
> -	up_write(&policy->rwsem);
> -
> -	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> -			CPUFREQ_UPDATE_POLICY_CPU, policy);
> -}
> -
>  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  {
>  	unsigned int j, cpu = dev->id;
> @@ -1099,9 +1080,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  	struct cpufreq_policy *policy;
>  	unsigned long flags;
>  	bool recover_policy = cpufreq_suspended;
> -#ifdef CONFIG_HOTPLUG_CPU
> -	struct cpufreq_policy *tpolicy;
> -#endif
> 
>  	if (cpu_is_offline(cpu))
>  		return 0;
> @@ -1111,55 +1089,28 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  #ifdef CONFIG_SMP
>  	/* check whether a different CPU already registered this
>  	 * CPU because it is in the same boat. */
> +	/* FIXME: This probably needs fixing to avoid "try lock" from
> +	 * returning NULL. Also, change to likely() */
>  	policy = cpufreq_cpu_get(cpu);
>  	if (unlikely(policy)) {
> +		cpufreq_change_policy_cpus(policy, cpu, true);
>  		cpufreq_cpu_put(policy);
>  		return 0;
>  	}
>  #endif
> 
> +	/* FIXME: Is returning 0 the right thing to do?! Existing code */
>  	if (!down_read_trylock(&cpufreq_rwsem))
>  		return 0;
> 
> -#ifdef CONFIG_HOTPLUG_CPU
> -	/* Check if this cpu was hot-unplugged earlier and has siblings */
> -	read_lock_irqsave(&cpufreq_driver_lock, flags);
> -	list_for_each_entry(tpolicy, &cpufreq_policy_list, policy_list) {
> -		if (cpumask_test_cpu(cpu, tpolicy->related_cpus)) {
> -			read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -			ret = cpufreq_add_policy_cpu(tpolicy, cpu, dev);
> -			up_read(&cpufreq_rwsem);
> -			return ret;
> -		}
> -	}
> -	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -#endif
> -
> -	/*
> -	 * Restore the saved policy when doing light-weight init and fall back
> -	 * to the full init if that fails.
> -	 */
> -	policy = recover_policy ? cpufreq_policy_restore(cpu) : NULL;
> -	if (!policy) {
> -		recover_policy = false;
> -		policy = cpufreq_policy_alloc();
> -		if (!policy)
> -			goto nomem_out;
> -	}
> -
> -	/*
> -	 * In the resume path, since we restore a saved policy, the assignment
> -	 * to policy->cpu is like an update of the existing policy, rather than
> -	 * the creation of a brand new one. So we need to perform this update
> -	 * by invoking update_policy_cpu().
> -	 */
> -	if (recover_policy && cpu != policy->cpu)
> -		update_policy_cpu(policy, cpu);
> -	else
> -		policy->cpu = cpu;
> +	/* If we get this far, this is the first time we are adding the
> +	 * policy */
> +	policy = cpufreq_policy_alloc();
> +	if (!policy)
> +		goto nomem_out;
> +	policy->cpu = cpu;
> 
>  	cpumask_copy(policy->cpus, cpumask_of(cpu));
> -
>  	init_completion(&policy->kobj_unregister);
>  	INIT_WORK(&policy->update, handle_update);
> 
> @@ -1175,20 +1126,19 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  	/* related cpus should atleast have policy->cpus */
>  	cpumask_or(policy->related_cpus, policy->related_cpus, policy->cpus);
> 
> +	/* Weed out impossible CPUs. */
> +	cpumask_and(policy->related_cpus, policy->related_cpus,
> +			cpu_possible_mask);
> +
>  	/*
>  	 * affected cpus must always be the one, which are online. We aren't
>  	 * managing offline cpus here.
>  	 */
>  	cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);
> 
> -	if (!recover_policy) {
> -		policy->user_policy.min = policy->min;
> -		policy->user_policy.max = policy->max;
> -	}
> -
>  	down_write(&policy->rwsem);
>  	write_lock_irqsave(&cpufreq_driver_lock, flags);
> -	for_each_cpu(j, policy->cpus)
> +	for_each_cpu(j, policy->related_cpus)
>  		per_cpu(cpufreq_cpu_data, j) = policy;
>  	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> 
> @@ -1243,13 +1193,11 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>  				     CPUFREQ_START, policy);
> 
> -	if (!recover_policy) {
> -		ret = cpufreq_add_dev_interface(policy, dev);
> -		if (ret)
> -			goto err_out_unregister;
> -		blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> -				CPUFREQ_CREATE_POLICY, policy);
> -	}
> +	ret = cpufreq_add_dev_interface(policy);
> +	if (ret)
> +		goto err_out_unregister;
> +	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> +			CPUFREQ_CREATE_POLICY, policy);
> 
>  	write_lock_irqsave(&cpufreq_driver_lock, flags);
>  	list_add(&policy->policy_list, &cpufreq_policy_list);
> @@ -1257,10 +1205,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
> 
>  	cpufreq_init_policy(policy);
> 
> -	if (!recover_policy) {
> -		policy->user_policy.policy = policy->policy;
> -		policy->user_policy.governor = policy->governor;
> -	}
>  	up_write(&policy->rwsem);
> 
>  	kobject_uevent(&policy->kobj, KOBJ_ADD);
> @@ -1307,100 +1251,16 @@ static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  	return __cpufreq_add_dev(dev, sif);
>  }
> 
> -static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy,
> -					   unsigned int old_cpu)
> -{
> -	struct device *cpu_dev;
> -	int ret;
> -
> -	/* first sibling now owns the new sysfs dir */
> -	cpu_dev = get_cpu_device(cpumask_any_but(policy->cpus, old_cpu));
> -
> -	sysfs_remove_link(&cpu_dev->kobj, "cpufreq");
> -	ret = kobject_move(&policy->kobj, &cpu_dev->kobj);
> -	if (ret) {
> -		pr_err("%s: Failed to move kobj: %d\n", __func__, ret);
> -
> -		down_write(&policy->rwsem);
> -		cpumask_set_cpu(old_cpu, policy->cpus);
> -		up_write(&policy->rwsem);
> -
> -		ret = sysfs_create_link(&cpu_dev->kobj, &policy->kobj,
> -					"cpufreq");
> -
> -		return -EINVAL;
> -	}
> -
> -	return cpu_dev->id;
> -}
> -
> -static int __cpufreq_remove_dev_prepare(struct device *dev,
> -					struct subsys_interface *sif)
> +static int __cpufreq_remove_dev(struct device *dev,
> +				struct subsys_interface *sif)
>  {
> -	unsigned int cpu = dev->id, cpus;
> -	int new_cpu, ret;
> +	unsigned int cpu = dev->id;
> +	int ret = 0;
>  	unsigned long flags;
>  	struct cpufreq_policy *policy;
> 
>  	pr_debug("%s: unregistering CPU %u\n", __func__, cpu);
> 
> -	write_lock_irqsave(&cpufreq_driver_lock, flags);
> -
> -	policy = per_cpu(cpufreq_cpu_data, cpu);
> -
> -	/* Save the policy somewhere when doing a light-weight tear-down */
> -	if (cpufreq_suspended)
> -		per_cpu(cpufreq_cpu_data_fallback, cpu) = policy;
> -
> -	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -
> -	if (!policy) {
> -		pr_debug("%s: No cpu_data found\n", __func__);
> -		return -EINVAL;
> -	}
> -
> -	if (has_target()) {
> -		ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> -		if (ret) {
> -			pr_err("%s: Failed to stop governor\n", __func__);
> -			return ret;
> -		}
> -	}
> -
> -	if (!cpufreq_driver->setpolicy)
> -		strncpy(per_cpu(cpufreq_cpu_governor, cpu),
> -			policy->governor->name, CPUFREQ_NAME_LEN);
> -
> -	down_read(&policy->rwsem);
> -	cpus = cpumask_weight(policy->cpus);
> -	up_read(&policy->rwsem);
> -
> -	if (cpu != policy->cpu) {
> -		sysfs_remove_link(&dev->kobj, "cpufreq");
> -	} else if (cpus > 1) {
> -		new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu);
> -		if (new_cpu >= 0) {
> -			update_policy_cpu(policy, new_cpu);
> -
> -			if (!cpufreq_suspended)
> -				pr_debug("%s: policy Kobject moved to cpu: %d from: %d\n",
> -					 __func__, new_cpu, cpu);
> -		}
> -	} else if (cpufreq_driver->stop_cpu && cpufreq_driver->setpolicy) {
> -		cpufreq_driver->stop_cpu(policy);
> -	}
> -
> -	return 0;
> -}
> -
> -static int __cpufreq_remove_dev_finish(struct device *dev,
> -				       struct subsys_interface *sif)
> -{
> -	unsigned int cpu = dev->id, cpus;
> -	int ret;
> -	unsigned long flags;
> -	struct cpufreq_policy *policy;
> -
>  	read_lock_irqsave(&cpufreq_driver_lock, flags);
>  	policy = per_cpu(cpufreq_cpu_data, cpu);
>  	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
> @@ -1410,56 +1270,11 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
>  		return -EINVAL;
>  	}
> 
> -	down_write(&policy->rwsem);
> -	cpus = cpumask_weight(policy->cpus);
> -
> -	if (cpus > 1)
> -		cpumask_clear_cpu(cpu, policy->cpus);
> -	up_write(&policy->rwsem);
> -
> -	/* If cpu is last user of policy, free policy */
> -	if (cpus == 1) {
> -		if (has_target()) {
> -			ret = __cpufreq_governor(policy,
> -					CPUFREQ_GOV_POLICY_EXIT);
> -			if (ret) {
> -				pr_err("%s: Failed to exit governor\n",
> -				       __func__);
> -				return ret;
> -			}
> -		}
> -
> -		if (!cpufreq_suspended)
> -			cpufreq_policy_put_kobj(policy);
> -
> -		/*
> -		 * Perform the ->exit() even during light-weight tear-down,
> -		 * since this is a core component, and is essential for the
> -		 * subsequent light-weight ->init() to succeed.
> -		 */
> -		if (cpufreq_driver->exit)
> -			cpufreq_driver->exit(policy);
> -
> -		/* Remove policy from list of active policies */
> -		write_lock_irqsave(&cpufreq_driver_lock, flags);
> -		list_del(&policy->policy_list);
> -		write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -
> -		if (!cpufreq_suspended)
> -			cpufreq_policy_free(policy);
> -	} else if (has_target()) {
> -		ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
> -		if (!ret)
> -			ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
> -
> -		if (ret) {
> -			pr_err("%s: Failed to start governor\n", __func__);
> -			return ret;
> -		}
> -	}
> +#ifdef CONFIG_HOTPLUG_CPU
> +	ret = cpufreq_change_policy_cpus(policy, cpu, false);
> +#endif
> 
> -	per_cpu(cpufreq_cpu_data, cpu) = NULL;
> -	return 0;
> +	return ret;
>  }
> 
>  /**
> @@ -1475,10 +1290,7 @@ static int cpufreq_remove_dev(struct device *dev, struct subsys_interface *sif)
>  	if (cpu_is_offline(cpu))
>  		return 0;
> 
> -	ret = __cpufreq_remove_dev_prepare(dev, sif);
> -
> -	if (!ret)
> -		ret = __cpufreq_remove_dev_finish(dev, sif);
> +	ret = __cpufreq_remove_dev(dev, sif);
> 
>  	return ret;
>  }
> @@ -2141,7 +1953,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  				struct cpufreq_policy *new_policy)
>  {
>  	struct cpufreq_governor *old_gov;
> -	int ret;
> +	int ret = 0;
> 
>  	pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
>  		 new_policy->cpu, new_policy->min, new_policy->max);
> @@ -2226,7 +2038,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> 
>   out:
>  	pr_debug("governor: change or update limits\n");
> -	return __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
> +	if (policy->governor_enabled)
> +		ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
> +	return ret;
>  }
> 
>  /**
> @@ -2295,19 +2109,12 @@ static int cpufreq_cpu_callback(struct notifier_block *nfb,
>  	if (dev) {
>  		switch (action & ~CPU_TASKS_FROZEN) {
>  		case CPU_ONLINE:
> +		case CPU_DOWN_FAILED:
>  			__cpufreq_add_dev(dev, NULL);
>  			break;
> 
>  		case CPU_DOWN_PREPARE:
> -			__cpufreq_remove_dev_prepare(dev, NULL);
> -			break;
> -
> -		case CPU_POST_DEAD:
> -			__cpufreq_remove_dev_finish(dev, NULL);
> -			break;
> -
> -		case CPU_DOWN_FAILED:
> -			__cpufreq_add_dev(dev, NULL);
> +			__cpufreq_remove_dev(dev, NULL);
>  			break;
>  		}
>  	}
>
Saravana Kannan July 11, 2014, 9:59 a.m. UTC | #3
Viresh Kumar wrote:
> Hi Saravana,
>
> Thanks for trying this..
>
> On 11 July 2014 09:48, Saravana Kannan <skannan@codeaurora.org> wrote:
>> The CPUfreq driver moves the cpufreq policy ownership between CPUs when
>
> s/driver/core

Will do

>
>> CPUs within a cluster (CPUs sharing same policy) go ONLINE/OFFLINE. When
>> moving policy ownership between CPUs, it also moves the cpufreq sysfs
>> directory between CPUs and also fixes up the symlinks of the other CPUs
>> in
>> the cluster.
>>
>> Also, when all the CPUs in a cluster go OFFLINE, all the sysfs nodes and
>> directories are deleted, the kobject is released and the policy is
>> freed.
>> And when the first CPU in a cluster comes up, the policy is reallocated
>> and
>> initialized, kobject is acquired, the sysfs nodes are created or
>> symlinked,
>> etc.
>>
>> All these steps end up creating unnecessarily complicated code and
>> locking.
>> There's no real benefit to adding/removing/moving the sysfs nodes and
>> the
>> policy between CPUs. Other per CPU sysfs directories like power and
>> cpuidle
>> are left alone during hotplug. So there's some precedence to what this
>> patch is trying to do.
>>
>> This patch simplifies a lot of the code and locking by removing the
>> adding/removing/moving of policy/sysfs/kobj and just leaves the cpufreq
>> directory and policy in place irrespective of whether the CPUs are
>> ONLINE/OFFLINE.
>>
>> Leaving the policy, sysfs and kobject in place also brings these
>> additional
>> benefits:
>> * Faster suspend/resume.
>> * Faster hotplug.
>> * Sysfs file permissions maintained across hotplug without userspace
>>   workarounds.
>> * Policy settings and governor tunables maintained across suspend/resume
>>   and hotplug.
>
> Its already maintained during suspend/resume.

But not across hotplug. Which is also very useful when you have 2 clusters
and one gets hotplugged offline due to thermal and then reinserted.
Userspace has to come and restore it back today. In our tree, we "stitched
up" the governor. Also, this make the suspend/resume code a tiny bit
simpler -- in the sense, it's not a special case anymore.

>> * Cpufreq stats would be maintained across hotplug for all CPUs and can
>> be
>>   queried even after CPU goes OFFLINE.
>>
>> Change-Id: I39c395e1fee8731880c0fd7c8a9c1d83e2e4b8d0
>
> remove these while sending stuff upstream..

Yeah, I always think of doing this but keep forgetting in the last minute.

>> Tested-by: Stephen Boyd <sboyd@codeaurora.org>
>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>> ---
>>
>> Preliminary testing has been done. cpufreq directories are getting
>> created
>> properly. Online/offline of CPUs work. Policies remain unmodifiable from
>> userspace when all policy CPUs are offline.
>>
>> Error handling code has NOT been updated.
>>
>> I've added a bunch of FIXME comments next to where I'm not sure about
>> the
>> locking in the existing code. I believe most of the try_lock's were
>> present
>> to prevent a deadlock between sysfs lock and the cpufreq locks. Now that
>> the sysfs entries are not touched after creating them, we should be able
>> to
>> replace most/all of these try_lock's with a normal lock.
>>
>> This patch has more room for code simplification, but I would like to
>> get
>> some acks for the functionality and this code before I do further
>> simplification.
>>
>> I should also be able to remove get_online_cpus() in the store function
>> and
>> replace it with just a check for policy->governor_enabled. That should
>> theoretically reduce some contention between cpufreq stats check and
>> hotplug of unrelated CPUs.
>
> Its just too much stuff in a single patch, I can still review it as I
> am very much
> aware of every bit of code written here. But would be very difficult for
> others
> to review it. These are so many cases, configuration we have to think of
> and adding bugs with such a large patch is so so so easy.

Actually this is the smallest bit of code that will work. Well, after I
fix suspend/resume. I'm trying to make each patch such that the tree
continues to work after it's pulled in.

Unfortunately, I'm throwing away a lot of code that it ends up with a
fairly large diff. But if you look at the actual final code, it's very
simple.

But I do see your point. I'll try to keep the patch as small as possible
to continue working and make additional improvements as additional
patches.

>>  drivers/cpufreq/cpufreq.c | 331
>> ++++++++++------------------------------------
>>  1 file changed, 69 insertions(+), 262 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index 62259d2..e350b15 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -859,16 +859,16 @@ void cpufreq_sysfs_remove_file(const struct
>> attribute *attr)
>>  }
>>  EXPORT_SYMBOL(cpufreq_sysfs_remove_file);
>>
>> -/* symlink affected CPUs */
>> +/* symlink related CPUs */
>>  static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>>  {
>> -       unsigned int j;
>> +       unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
>>         int ret = 0;
>>
>> -       for_each_cpu(j, policy->cpus) {
>> +       for_each_cpu(j, policy->related_cpus) {
>>                 struct device *cpu_dev;
>>
>> -               if (j == policy->cpu)
>> +               if (j == first_cpu)
>
> why?

The first CPU is a cluster always own the real nodes.

>>                         continue;
>>
>>                 pr_debug("Adding link for CPU: %u\n", j);
>> @@ -881,12 +881,16 @@ static int cpufreq_add_dev_symlink(struct
>> cpufreq_policy *policy)
>>         return ret;
>>  }
>>
>> -static int cpufreq_add_dev_interface(struct cpufreq_policy *policy,
>> -                                    struct device *dev)
>> +static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>>  {
>>         struct freq_attr **drv_attr;
>> +       struct device *dev;
>>         int ret = 0;
>>
>> +       dev = get_cpu_device(cpumask_first(policy->related_cpus));
>> +       if (!dev)
>> +               return -EINVAL;
>> +
>
> Why?

I'm just always adding the real nodes to the first CPU in a cluster
independent of which CPU gets added first. Makes it easier to know which
ones to symlink. See comment next to policy->cpu for full context.

>>         /* prepare interface data */
>>         ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq,
>>                                    &dev->kobj, "cpufreq");
>> @@ -961,60 +965,53 @@ static void cpufreq_init_policy(struct
>> cpufreq_policy *policy)
>>  }
>>
>>  #ifdef CONFIG_HOTPLUG_CPU
>> -static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy,
>> -                                 unsigned int cpu, struct device *dev)
>> +static int cpufreq_change_policy_cpus(struct cpufreq_policy *policy,
>> +                                 unsigned int cpu, bool add)
>>  {
>>         int ret = 0;
>> -       unsigned long flags;
>> +       unsigned int cpus;
>>
>> -       if (has_target()) {
>> +       down_write(&policy->rwsem);
>> +       cpus = cpumask_weight(policy->cpus);
>> +       if (has_target() && cpus > 0) {
>>                 ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
>>                 if (ret) {
>>                         pr_err("%s: Failed to stop governor\n",
>> __func__);
>> -                       return ret;
>> +                       goto unlock;
>>                 }
>>         }
>>
>> -       down_write(&policy->rwsem);
>> -
>> -       write_lock_irqsave(&cpufreq_driver_lock, flags);
>> -
>> -       cpumask_set_cpu(cpu, policy->cpus);
>> -       per_cpu(cpufreq_cpu_data, cpu) = policy;
>> -       write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>> +       if (add)
>> +               cpumask_set_cpu(cpu, policy->cpus);
>> +       else
>> +               cpumask_clear_cpu(cpu, policy->cpus);
>>
>> -       up_write(&policy->rwsem);
>> +       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>> +                                       CPUFREQ_UPDATE_POLICY_CPU,
>> policy);
>
> This should be only called when policy->cpu is updated. And shouldn't
> be called anymore and can be dropped as you might not wanna change
> policy->cpu after this patch.

Right. I should have reordered this to after.

But I always HAVE to send it when I add/remove a CPU. That's how I believe
cpufreq stats can keep of CPUs going offline. Oh, yeah. Which bring me to
another point. If I'm not mistaken, cpufreq stats keeps track of policy
stats. Really, it should track a CPU's stats. If it's hotplugged out, it
should count that time towards it's current freq.

So, yeah, for now, I can sent this only when policy->cpu changes. I can
fix that.

>>
>> -       if (has_target()) {
>> +       cpus = cpumask_weight(policy->cpus);
>> +       policy->cpu = cpumask_first(policy->cpus);
>
> why update it at all? Also, as per your logic what if cpus == 0?

Yeah, I didn't write it this way at first. But the governors are making
the assumption that policy->cpu is always an online CPU. So, they try to
queue work there and use data structs of that CPU (even if they free it in
the STOP event since it went offline).

Another option is to leave policy->cpu unchanged and then fix all the
governors. But this patch would get even more complicated. So, we can
leave this as is, or fix that up in a separate patch.

>> +       if (has_target() && cpus > 0) {
>
> Instead of > or < use cpus or !cpus.

Kinda personal style I guess. cpus > 0 reads like "there's more than 0
CPUs" which makes sense in English too. But sure, I can change if you
really want to.

>
>>                 ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
>>                 if (!ret)
>>                         ret = __cpufreq_governor(policy,
>> CPUFREQ_GOV_LIMITS);
>>
>>                 if (ret) {
>>                         pr_err("%s: Failed to start governor\n",
>> __func__);
>> -                       return ret;
>> +                       goto unlock;
>>                 }
>>         }
>>
>> -       return sysfs_create_link(&dev->kobj, &policy->kobj, "cpufreq");
>> -}
>> -#endif
>> -
>> -static struct cpufreq_policy *cpufreq_policy_restore(unsigned int cpu)
>> -{
>> -       struct cpufreq_policy *policy;
>> -       unsigned long flags;
>> -
>> -       read_lock_irqsave(&cpufreq_driver_lock, flags);
>> -
>> -       policy = per_cpu(cpufreq_cpu_data_fallback, cpu);
>> -
>> -       read_unlock_irqrestore(&cpufreq_driver_lock, flags);
>> +       if (cpus < 1 && cpufreq_driver->stop_cpu &&
>> cpufreq_driver->setpolicy) {
>
> Can be made 'else' part of above 'if', just need to move if(target)
> inside the cpus > 1
> block.

Sorry if I gave the impression that this patch is complete. I'm just
sending out updates so people can see where I'm going with this instead of
dumping it all in one go.

Yes, I'll refactor the if/elses once I have a logic right.

>
>> +               cpufreq_driver->stop_cpu(policy);
>
> Where is ->exit() gone?
>

Why should I exit? I don't think we need to. I'm not planning on exit()
and init() every time an entire cluster is offlined or onlined. Just stop
the governor and let if go quiet.

>> +       }
>>
>> -       policy->governor = NULL;
>> +unlock:
>> +       up_write(&policy->rwsem);
>>
>> -       return policy;
>> +       return ret;
>>  }
>> +#endif
>>
>>  static struct cpufreq_policy *cpufreq_policy_alloc(void)
>>  {
>> @@ -1076,22 +1073,6 @@ static void cpufreq_policy_free(struct
>> cpufreq_policy *policy)
>>         kfree(policy);
>>  }
>>
>> -static void update_policy_cpu(struct cpufreq_policy *policy, unsigned
>> int cpu)
>> -{
>> -       if (WARN_ON(cpu == policy->cpu))
>> -               return;
>> -
>> -       down_write(&policy->rwsem);
>> -
>> -       policy->last_cpu = policy->cpu;
>> -       policy->cpu = cpu;
>> -
>> -       up_write(&policy->rwsem);
>> -
>> -       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>> -                       CPUFREQ_UPDATE_POLICY_CPU, policy);
>> -}
>> -
>>  static int __cpufreq_add_dev(struct device *dev, struct
>> subsys_interface *sif)
>>  {
>>         unsigned int j, cpu = dev->id;
>> @@ -1099,9 +1080,6 @@ static int __cpufreq_add_dev(struct device *dev,
>> struct subsys_interface *sif)
>>         struct cpufreq_policy *policy;
>>         unsigned long flags;
>>         bool recover_policy = cpufreq_suspended;
>> -#ifdef CONFIG_HOTPLUG_CPU
>> -       struct cpufreq_policy *tpolicy;
>> -#endif
>>
>>         if (cpu_is_offline(cpu))
>>                 return 0;
>> @@ -1111,55 +1089,28 @@ static int __cpufreq_add_dev(struct device *dev,
>> struct subsys_interface *sif)
>>  #ifdef CONFIG_SMP
>>         /* check whether a different CPU already registered this
>>          * CPU because it is in the same boat. */
>> +       /* FIXME: This probably needs fixing to avoid "try lock" from
>> +        * returning NULL. Also, change to likely() */
>
> I wanted to give this comment later, but that's fine ..
>
> - First, I couldn't understand the try-lock fixme

If trylock fails when we are adding a new CPU to an existing cluster, I
should skip adding it just because someone else was holding the lock and
the try lock failed.

But I don't think trylocks are needed anymore. Maybe I can fix it in later
patches. We'll see.

> - Second likely() would be better
> - policy will not be available only while adding first CPU of every
> cluster on
> every driver registration..

Yes. I agree.

> - And you aren't freeing 'struct cpufreq_policy' at all now. Would result
> in
> Memory leak cpufreq driver is compiled as a module and inserted/removed
> multiple times.

Again, sorry, if I gave the impression this patch was done. If you don't
want me to send intermediate RFC patches, I can hold off. I'm well aware
that this is not completed yet. When I'm done, there should be no memory
leak when modules get added/removed.

>
>>         policy = cpufreq_cpu_get(cpu);
>>         if (unlikely(policy)) {
>> +               cpufreq_change_policy_cpus(policy, cpu, true);
>>                 cpufreq_cpu_put(policy);
>>                 return 0;
>>         }
>
> This optimization wasn't for the hotplug case, but for adding
> non-policy->cpu
> cpus for the first time.

Sure, but I'm making it an optimization for any time a CPU is added except
for the first time. Since I'm not always freeing and allocating policies
or moving them around, I can simplify it as such.

> In your case policy->cpus would already be updated and so calling
> cpufreq_change_policy_cpus() isn't required.

How? I don't you misunderstood the intent here.

>>  #endif
>>
>> +       /* FIXME: Is returning 0 the right thing to do?! Existing code
>> */
>>         if (!down_read_trylock(&cpufreq_rwsem))
>>                 return 0;
>
> Yeah, must have a better return value. But as I said, these kind of
> changes
> must be added in separate patches.

Yeah, that's why I didn't fix it here yet. Leaving existing bugs as is.

>>
>> -#ifdef CONFIG_HOTPLUG_CPU
>> -       /* Check if this cpu was hot-unplugged earlier and has siblings
>> */
>> -       read_lock_irqsave(&cpufreq_driver_lock, flags);
>> -       list_for_each_entry(tpolicy, &cpufreq_policy_list, policy_list)
>> {
>> -               if (cpumask_test_cpu(cpu, tpolicy->related_cpus)) {
>> -                       read_unlock_irqrestore(&cpufreq_driver_lock,
>> flags);
>> -                       ret = cpufreq_add_policy_cpu(tpolicy, cpu, dev);
>> -                       up_read(&cpufreq_rwsem);
>> -                       return ret;
>> -               }
>> -       }
>> -       read_unlock_irqrestore(&cpufreq_driver_lock, flags);
>> -#endif
>
> And this one was for the hotplug case which you could have reused.

Nope. The whole point is to remove all this complexity.

>
>> -       /*
>> -        * Restore the saved policy when doing light-weight init and
>> fall back
>> -        * to the full init if that fails.
>> -        */
>> -       policy = recover_policy ? cpufreq_policy_restore(cpu) : NULL;
>> -       if (!policy) {
>> -               recover_policy = false;
>> -               policy = cpufreq_policy_alloc();
>> -               if (!policy)
>> -                       goto nomem_out;
>> -       }
>
> You might need to use it somehow.. Currently you are never doing
> this:  per_cpu(cpufreq_cpu_data, cpu) = NULL;

Intentionally not setting it to NULL.

> which would result in crazy things once you try {un}registering your
> driver...

I haven't handled that part yet. Planning to. But it should be pretty simple.

>> -       /*
>> -        * In the resume path, since we restore a saved policy, the
>> assignment
>> -        * to policy->cpu is like an update of the existing policy,
>> rather than
>> -        * the creation of a brand new one. So we need to perform this
>> update
>> -        * by invoking update_policy_cpu().
>> -        */
>> -       if (recover_policy && cpu != policy->cpu)
>> -               update_policy_cpu(policy, cpu);
>> -       else
>> -               policy->cpu = cpu;
>> +       /* If we get this far, this is the first time we are adding the
>> +        * policy */
>> +       policy = cpufreq_policy_alloc();
>> +       if (!policy)
>> +               goto nomem_out;
>> +       policy->cpu = cpu;
>>
>>         cpumask_copy(policy->cpus, cpumask_of(cpu));
>> -
>>         init_completion(&policy->kobj_unregister);
>>         INIT_WORK(&policy->update, handle_update);
>>
>> @@ -1175,20 +1126,19 @@ static int __cpufreq_add_dev(struct device *dev,
>> struct subsys_interface *sif)
>>         /* related cpus should atleast have policy->cpus */
>>         cpumask_or(policy->related_cpus, policy->related_cpus,
>> policy->cpus);
>>
>> +       /* Weed out impossible CPUs. */
>> +       cpumask_and(policy->related_cpus, policy->related_cpus,
>> +                       cpu_possible_mask);
>
> why?

Why not? It should make the future bit ops faster? Also, it keep the
"first cpu" to a CPU that's actually possible. Also, if a CPU isn't
"possible" I wasn't sure if the cpuX directory would even get created in
the first place. That was another reason.

>
>
> Sorry but I am stopping now. I have already pointed out some issues
> which make this unusable.

Most of them aren't really bugs, but I should clarify the intent though.
Sorry about that. But yes, the patch as is is not usable. It's incomplete.

> Over that, its way too hard to review all this in a single patch. For
> every
> piece of line you add/remove I have to spend 10 mins thinking about
> all the possible cases that were solved with this.. And if the rest of the
> patch is going to fix them or not, etc.. To make it simple I did apply
> your patch and had a close look at the new state of code, but its getting
> tougher and tougher.

In this patch, I'm ONLY touching hotplug and resume related code (except
for one line). I'll give some description in my next patch on how I'm
expecting the events to be across hotplug/suspend and what happens with
the policies. Once we are on the same page on the intent of the patch, it
should be easier.

> Please make sure you take care of these issues:
> - suspend/resume
Didn't test. I expect it to be broken in v2.

> - hotplug
Tested.

> - module insert/remove
Didn't test. Expected to be broken.

> - Memory leaks
Will do.

> - multi cluster systems (with one and multiple CPU per cluster)
> *by cluster I mean group of CPUs sharing clock line
> - single cluster ones, one and multiple CPUs

I actually tested hotplug for all these cases. That's how I found the
governor issue.

>
> Will see how V3 goes. Thanks.

Thanks for taking the time to review and being open to these changes.
Appreciate the cooperation.

-Saravana
Saravana Kannan July 11, 2014, 10:02 a.m. UTC | #4
Srivatsa S. Bhat wrote:
> On 07/11/2014 09:48 AM, Saravana Kannan wrote:
>> The CPUfreq driver moves the cpufreq policy ownership between CPUs when
>> CPUs within a cluster (CPUs sharing same policy) go ONLINE/OFFLINE. When
>> moving policy ownership between CPUs, it also moves the cpufreq sysfs
>> directory between CPUs and also fixes up the symlinks of the other CPUs
>> in
>> the cluster.
>>
>> Also, when all the CPUs in a cluster go OFFLINE, all the sysfs nodes and
>> directories are deleted, the kobject is released and the policy is
>> freed.
>> And when the first CPU in a cluster comes up, the policy is reallocated
>> and
>> initialized, kobject is acquired, the sysfs nodes are created or
>> symlinked,
>> etc.
>>
>> All these steps end up creating unnecessarily complicated code and
>> locking.
>> There's no real benefit to adding/removing/moving the sysfs nodes and
>> the
>> policy between CPUs. Other per CPU sysfs directories like power and
>> cpuidle
>> are left alone during hotplug. So there's some precedence to what this
>> patch is trying to do.
>>
>> This patch simplifies a lot of the code and locking by removing the
>> adding/removing/moving of policy/sysfs/kobj and just leaves the cpufreq
>> directory and policy in place irrespective of whether the CPUs are
>> ONLINE/OFFLINE.
>>
>> Leaving the policy, sysfs and kobject in place also brings these
>> additional
>> benefits:
>> * Faster suspend/resume.
>> * Faster hotplug.
>> * Sysfs file permissions maintained across hotplug without userspace
>>   workarounds.
>> * Policy settings and governor tunables maintained across suspend/resume
>>   and hotplug.
>> * Cpufreq stats would be maintained across hotplug for all CPUs and can
>> be
>>   queried even after CPU goes OFFLINE.
>>
>> Change-Id: I39c395e1fee8731880c0fd7c8a9c1d83e2e4b8d0
>> Tested-by: Stephen Boyd <sboyd@codeaurora.org>
>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>> ---
>>
>> Preliminary testing has been done. cpufreq directories are getting
>> created
>> properly. Online/offline of CPUs work. Policies remain unmodifiable from
>> userspace when all policy CPUs are offline.
>>
>> Error handling code has NOT been updated.
>>
>> I've added a bunch of FIXME comments next to where I'm not sure about
>> the
>> locking in the existing code. I believe most of the try_lock's were
>> present
>> to prevent a deadlock between sysfs lock and the cpufreq locks. Now that
>> the sysfs entries are not touched after creating them, we should be able
>> to
>> replace most/all of these try_lock's with a normal lock.
>>
>> This patch has more room for code simplification, but I would like to
>> get
>> some acks for the functionality and this code before I do further
>> simplification.
>>
>
> The idea behind this work is very welcome indeed! IMHO, there is nothing
> conceptually wrong in maintaining the per-cpu sysfs files across CPU
> hotplug
> (as long as we take care to return appropriate error codes if userspace
> tries to set values using the control files of offline CPUs). So, it
> really
> boils down to whether or not we get the implementation right; the idea
> itself
> looks fine as of now. Hence, your efforts in making this patch(set) easier
> to
> review will certainly help. Perhaps you can simplify the code later, but
> at
> this point, splitting up this patch into multiple smaller, reviewable
> pieces
> (accompanied by well-written changelogs that explain the intent) is the
> utmost
> priority. Just like Viresh, even I had a hard time reviewing all of this
> in
> one go.
>
> Thank you for taking up this work!

Thanks for the support. I'll keep in mind to keep the patches simple and
not do unnecessary optimizations. But the first patch diff unfortunately
is going to be a bit big since it'll delete a lot of code. :( But I'll add
more detailed commit text or "cover" text in the next one. I don't want to
split up the patch so much that individual ones don't compile or boot.

Maybe after patch v3, if you guys can suggest splitting it up into chunks
that won't involve huge rewrites, I can try to do that.

-Saravana
Saravana Kannan July 11, 2014, 10:07 a.m. UTC | #5
skannan@codeaurora.org wrote:
>
> Viresh Kumar wrote:
>> Hi Saravana,
>>
>> Thanks for trying this..
>>
>> On 11 July 2014 09:48, Saravana Kannan <skannan@codeaurora.org> wrote:
>>> The CPUfreq driver moves the cpufreq policy ownership between CPUs when
>>
>> s/driver/core
>
> Will do
>

<snip>

Soooo many typos. This is what happens when I send a late night email! If
a sentence sounds incomplete, this is why.

-Saravana
Viresh Kumar July 11, 2014, 10:52 a.m. UTC | #6
On 11 July 2014 15:29,  <skannan@codeaurora.org> wrote:
> Viresh Kumar wrote:
>> On 11 July 2014 09:48, Saravana Kannan <skannan@codeaurora.org> wrote:

>>> * Policy settings and governor tunables maintained across suspend/resume
>>>   and hotplug.
>>
>> Its already maintained during suspend/resume.
>
> But not across hotplug. Which is also very useful when you have 2 clusters
> and one gets hotplugged offline due to thermal and then reinserted.
> Userspace has to come and restore it back today. In our tree, we "stitched
> up" the governor. Also, this make the suspend/resume code a tiny bit
> simpler -- in the sense, it's not a special case anymore.

Yeah, I understood that. I was just pointing that you need to mention
hotplug alone in the bullet point.

>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c

>>>  static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>>>  {
>>> -       unsigned int j;
>>> +       unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
>>>         int ret = 0;
>>>
>>> -       for_each_cpu(j, policy->cpus) {
>>> +       for_each_cpu(j, policy->related_cpus) {
>>>                 struct device *cpu_dev;
>>>
>>> -               if (j == policy->cpu)
>>> +               if (j == first_cpu)
>>
>> why?
>
> The first CPU is a cluster always own the real nodes.

What I meant was, why not use policy->cpu?

>>> +static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>>>  {
>>>         struct freq_attr **drv_attr;
>>> +       struct device *dev;
>>>         int ret = 0;
>>>
>>> +       dev = get_cpu_device(cpumask_first(policy->related_cpus));
>>> +       if (!dev)
>>> +               return -EINVAL;
>>> +
>>
>> Why?
>
> I'm just always adding the real nodes to the first CPU in a cluster
> independent of which CPU gets added first. Makes it easier to know which
> ones to symlink. See comment next to policy->cpu for full context.

Yeah, and that is the order in which CPUs will boot and cpufreq_add_dev()
will be called. So, isn't policy->cpu the right CPU always?

>>> +       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>>> +                                       CPUFREQ_UPDATE_POLICY_CPU,
>>> policy);
>>
>> This should be only called when policy->cpu is updated. And shouldn't
>> be called anymore and can be dropped as you might not wanna change
>> policy->cpu after this patch.
>
> Right. I should have reordered this to after.
>
> But I always HAVE to send it when I add/remove a CPU. That's how I believe
> cpufreq stats can keep of CPUs going offline.

No. cpufreq-stats doesn't have to care about CPUs going in/out. It just
cares about poilcy->cpu and so we notify it when that gets updated.

> Oh, yeah. Which bring me to
> another point. If I'm not mistaken, cpufreq stats keeps track of policy
> stats. Really, it should track a CPU's stats. If it's hotplugged out, it
> should count that time towards it's current freq.

Even if CPU is hotplugged out, it is getting clock from the same pll. And
so if clock for rest of the CPUs change, it changes for the hotplugged
out one as well.. Even if it is not running.

So, we must account on the new frequencies.

>>> -       if (has_target()) {
>>> +       cpus = cpumask_weight(policy->cpus);
>>> +       policy->cpu = cpumask_first(policy->cpus);
>>
>> why update it at all? Also, as per your logic what if cpus == 0?
>
> Yeah, I didn't write it this way at first. But the governors are making
> the assumption that policy->cpu is always an online CPU. So, they try to

Are you sure? I had a quick look and failed to see that..

> queue work there and use data structs of that CPU (even if they free it in
> the STOP event since it went offline).

So, it queues work on all policy->cpus, not policy->cpu. And the data structures
are just allocated with a CPU number, its fine if its offline.

And where are we freeing that stuff in STOP ?

Sorry if I am really really tired and couldn't read it correctly.

> Another option is to leave policy->cpu unchanged and then fix all the
> governors. But this patch would get even more complicated. So, we can
> leave this as is, or fix that up in a separate patch.

Since we are simplifying it here, I think we should NOT change policy->cpu
at all. It will make life simple (probably).

>>> +       if (has_target() && cpus > 0) {
>>
>> Instead of > or < use cpus or !cpus.
>
> Kinda personal style I guess. cpus > 0 reads like "there's more than 0
> CPUs" which makes sense in English too. But sure, I can change if you
> really want to.

Actually at places its compared with 0 or 1, and so thought cpus, !cpus
would be better..

> Sorry if I gave the impression that this patch is complete.

No, you didn't :)

>>> +               cpufreq_driver->stop_cpu(policy);
>>
>> Where is ->exit() gone?
>>
>
> Why should I exit? I don't think we need to. I'm not planning on exit()
> and init() every time an entire cluster is offlined or onlined. Just stop
> the governor and let if go quiet.

Okay, my driver is compiled as a module. I insert/remove it multiple
times. Only ->init() will be called multiple times, no exit ?

>>>  #ifdef CONFIG_SMP
>>>         /* check whether a different CPU already registered this
>>>          * CPU because it is in the same boat. */
>>> +       /* FIXME: This probably needs fixing to avoid "try lock" from
>>> +        * returning NULL. Also, change to likely() */
>>
>> I wanted to give this comment later, but that's fine ..
>>
>> - First, I couldn't understand the try-lock fixme
>
> If trylock fails when we are adding a new CPU to an existing cluster, I
> should skip adding it just because someone else was holding the lock and
> the try lock failed.

But that trylock thing was later in the code. How does it affect the
'if' block you commented on?

> But I don't think trylocks are needed anymore. Maybe I can fix it in later
> patches. We'll see.

I don't know yet :)

>> - And you aren't freeing 'struct cpufreq_policy' at all now. Would result
>> in
>> Memory leak cpufreq driver is compiled as a module and inserted/removed
>> multiple times.
>
> Again, sorry, if I gave the impression this patch was done. If you don't
> want me to send intermediate RFC patches, I can hold off. I'm well aware
> that this is not completed yet. When I'm done, there should be no memory
> leak when modules get added/removed.

Oh yes. I believed that atleast the basic things are all working. It even had
a Tested-by from Stephen :)

Okay, this stuff isn't THAT big. So, hold-on your patches for sometime and
send when they are almost ready.

I understand that you wanted to have some early feedback, but its already
there with you. YES we want this change to retain settings during hotplug.

The problem is, even when I didn't review it completely it took over an hour
to do the reviews I did :)

So, I would like to see something *much more* stable. Finishing can be
done later.

>>>         policy = cpufreq_cpu_get(cpu);
>>>         if (unlikely(policy)) {
>>> +               cpufreq_change_policy_cpus(policy, cpu, true);
>>>                 cpufreq_cpu_put(policy);
>>>                 return 0;
>>>         }
>>
>> This optimization wasn't for the hotplug case, but for adding
>> non-policy->cpu
>> cpus for the first time.
>
> Sure, but I'm making it an optimization for any time a CPU is added except
> for the first time. Since I'm not always freeing and allocating policies
> or moving them around, I can simplify it as such.

Hmm, but that would be an overhead of calling cpufreq_change_policy_cpus()
for every cpu on boot. Wouldn't be that nice for big servers.

>> In your case policy->cpus would already be updated and so calling
>> cpufreq_change_policy_cpus() isn't required.
>
> How? I don't you misunderstood the intent here.

Only on first boot policy->cpus would be updated. Later on you can do it
separately as it was done in existing code.

> Yeah, that's why I didn't fix it here yet. Leaving existing bugs as is.

I would prefer cleaning them up first, so that the new changes you are
making are rock solid.

>>> +       /* Weed out impossible CPUs. */
>>> +       cpumask_and(policy->related_cpus, policy->related_cpus,
>>> +                       cpu_possible_mask);
>>
>> why?
>
> Why not? It should make the future bit ops faster?

Drivers shouldn't set non-populatable CPUs to this mask.

>> Sorry but I am stopping now. I have already pointed out some issues
>> which make this unusable.
>
> Most of them aren't really bugs, but I should clarify the intent though.

I wasn't worried about most of the comments, but exit(), memory leaks,
etc..

> In this patch, I'm ONLY touching hotplug and resume related code (except
> for one line). I'll give some description in my next patch on how I'm
> expecting the events to be across hotplug/suspend and what happens with
> the policies. Once we are on the same page on the intent of the patch, it
> should be easier.

>> Please make sure you take care of these issues:
>> - suspend/resume
> Didn't test. I expect it to be broken in v2.

:)

>> - module insert/remove
> Didn't test. Expected to be broken.
>
>> - Memory leaks
> Will do.
>
>> - multi cluster systems (with one and multiple CPU per cluster)
>> *by cluster I mean group of CPUs sharing clock line
>> - single cluster ones, one and multiple CPUs
>
> I actually tested hotplug for all these cases. That's how I found the
> governor issue.
>
>>
>> Will see how V3 goes. Thanks.
>
> Thanks for taking the time to review and being open to these changes.
> Appreciate the cooperation.

Thanks Saravana.
Saravana Kannan July 12, 2014, 2:44 a.m. UTC | #7
On 07/11/2014 03:52 AM, Viresh Kumar wrote:

Just responding to one comment. The one about policy->cpu.

>
>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>
>>>>   static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
>>>>   {
>>>> -       unsigned int j;
>>>> +       unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
>>>>          int ret = 0;
>>>>
>>>> -       for_each_cpu(j, policy->cpus) {
>>>> +       for_each_cpu(j, policy->related_cpus) {
>>>>                  struct device *cpu_dev;
>>>>
>>>> -               if (j == policy->cpu)
>>>> +               if (j == first_cpu)
>>>
>>> why?
>>
>> The first CPU is a cluster always own the real nodes.
>
> What I meant was, why not use policy->cpu?
>
>>>> +static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
>>>>   {
>>>>          struct freq_attr **drv_attr;
>>>> +       struct device *dev;
>>>>          int ret = 0;
>>>>
>>>> +       dev = get_cpu_device(cpumask_first(policy->related_cpus));
>>>> +       if (!dev)
>>>> +               return -EINVAL;
>>>> +
>>>
>>> Why?
>>
>> I'm just always adding the real nodes to the first CPU in a cluster
>> independent of which CPU gets added first. Makes it easier to know which
>> ones to symlink. See comment next to policy->cpu for full context.
>
> Yeah, and that is the order in which CPUs will boot and cpufreq_add_dev()
> will be called. So, isn't policy->cpu the right CPU always?

No, the "first" cpu in a cluster doesn't need to be the first one to be 
added. An example is 2x2 cluster system where the system is booted with 
max cpus = 2 and then cpu3 could be onlined first by userspace.

>
>>>> -       if (has_target()) {
>>>> +       cpus = cpumask_weight(policy->cpus);
>>>> +       policy->cpu = cpumask_first(policy->cpus);
>>>
>>> why update it at all? Also, as per your logic what if cpus == 0?
>>
>> Yeah, I didn't write it this way at first. But the governors are making
>> the assumption that policy->cpu is always an online CPU. So, they try to
>
> Are you sure? I had a quick look and failed to see that..
>
>> queue work there and use data structs of that CPU (even if they free it in
>> the STOP event since it went offline).
>
> So, it queues work on all policy->cpus, not policy->cpu.
> And the data structures
> are just allocated with a CPU number, its fine if its offline.
>
> And where are we freeing that stuff in STOP ?
>
> Sorry if I am really really tired and couldn't read it correctly.

Yeah, it is pretty convolution. But pretty much anywhere in the gov code 
where policy->cpu is used could cause this. The specific crash I hit was 
in this code:

static void od_dbs_timer(struct work_struct *work)
{
	struct od_cpu_dbs_info_s *dbs_info =
		container_of(work, struct od_cpu_dbs_info_s, cdbs.work.work);
	unsigned int cpu = dbs_info->cdbs.cur_policy->cpu;

======= CPU is policy->cpu here.

	struct od_cpu_dbs_info_s *core_dbs_info = &per_cpu(od_cpu_dbs_info,
			cpu);

======= Picks the per CPU struct of an offline CPU

<snip>

	mutex_lock(&core_dbs_info->cdbs.timer_mutex);

======= Dies trying to lock a destroyed mutex

>
>> Another option is to leave policy->cpu unchanged and then fix all the
>> governors. But this patch would get even more complicated. So, we can
>> leave this as is, or fix that up in a separate patch.
>
> Since we are simplifying it here, I think we should NOT change policy->cpu
> at all. It will make life simple (probably).

I agree, but then I would have to fix up the governors. In the interest 
of keeping this patch small. I'll continue with what I'm doing and fix 
it up in another patch.

-Saravana
Saravana Kannan July 12, 2014, 3:06 a.m. UTC | #8
On 07/10/2014 11:19 PM, Viresh Kumar wrote:

>
> Please make sure you take care of these issues:
> - suspend/resume
> - hotplug
> - module insert/remove
Ok, I was just at the current code. Does cpufreq_unregister_driver() 
even really work correctly as it stands?

It doesn't even seem to stop any of the governors/policies before it 
just set the cpufreq_driver pointer to NULL.

So, technically my v2 patch doesn't even make anything worse when it 
comes to unregistering the cpufreq driver.

Similar issues for unregister_governor too!

-Saravana
Viresh Kumar July 14, 2014, 6:09 a.m. UTC | #9
On 12 July 2014 08:14, Saravana Kannan <skannan@codeaurora.org> wrote:

>>> I'm just always adding the real nodes to the first CPU in a cluster
>>> independent of which CPU gets added first. Makes it easier to know which
>>> ones to symlink. See comment next to policy->cpu for full context.
>>
>>
>> Yeah, and that is the order in which CPUs will boot and cpufreq_add_dev()
>> will be called. So, isn't policy->cpu the right CPU always?
>
>
> No, the "first" cpu in a cluster doesn't need to be the first one to be
> added. An example is 2x2 cluster system where the system is booted with max
> cpus = 2 and then cpu3 could be onlined first by userspace.

Because we are getting rid of much of the complexity now, I do not want
policy->cpu to keep changing. Just fix it up to the cpu for which the policy
gets created first. That's it. No more changes required. It doesn't matter at
userspace which cpu owns it as symlinks would anyway duplicate it under
every cpu.

> Yeah, it is pretty convolution. But pretty much anywhere in the gov code
> where policy->cpu is used could cause this. The specific crash I hit was in
> this code:
>
> static void od_dbs_timer(struct work_struct *work)
> {
>         struct od_cpu_dbs_info_s *dbs_info =
>                 container_of(work, struct od_cpu_dbs_info_s,
> cdbs.work.work);
>         unsigned int cpu = dbs_info->cdbs.cur_policy->cpu;
>
> ======= CPU is policy->cpu here.
>
>         struct od_cpu_dbs_info_s *core_dbs_info = &per_cpu(od_cpu_dbs_info,
>                         cpu);
>
> ======= Picks the per CPU struct of an offline CPU
>
> <snip>
>
>         mutex_lock(&core_dbs_info->cdbs.timer_mutex);
>
> ======= Dies trying to lock a destroyed mutex

I am still not getting it. Why would we get into this if policy->cpu is fixed
once at boot ?
Viresh Kumar July 14, 2014, 6:13 a.m. UTC | #10
On 12 July 2014 08:36, Saravana Kannan <skannan@codeaurora.org> wrote:
> On 07/10/2014 11:19 PM, Viresh Kumar wrote:
>
>>
>> Please make sure you take care of these issues:
>> - suspend/resume
>> - hotplug
>> - module insert/remove
>
> Ok, I was just at the current code. Does cpufreq_unregister_driver() even
> really work correctly as it stands?
>
> It doesn't even seem to stop any of the governors/policies before it just
> set the cpufreq_driver pointer to NULL.
>
> So, technically my v2 patch doesn't even make anything worse when it comes
> to unregistering the cpufreq driver.

Are you really sure about this? I have tested this *myself* earlier..

subsys_interface_unregister() should take care of stopping/freeing governor
stuff..
Saravana Kannan July 14, 2014, 7:08 p.m. UTC | #11
On 07/13/2014 11:09 PM, Viresh Kumar wrote:
> On 12 July 2014 08:14, Saravana Kannan <skannan@codeaurora.org> wrote:
>
>>>> I'm just always adding the real nodes to the first CPU in a cluster
>>>> independent of which CPU gets added first. Makes it easier to know which
>>>> ones to symlink. See comment next to policy->cpu for full context.
>>>
>>>
>>> Yeah, and that is the order in which CPUs will boot and cpufreq_add_dev()
>>> will be called. So, isn't policy->cpu the right CPU always?
>>
>>
>> No, the "first" cpu in a cluster doesn't need to be the first one to be
>> added. An example is 2x2 cluster system where the system is booted with max
>> cpus = 2 and then cpu3 could be onlined first by userspace.
>
> Because we are getting rid of much of the complexity now, I do not want
> policy->cpu to keep changing. Just fix it up to the cpu for which the policy
> gets created first. That's it. No more changes required. It doesn't matter at
> userspace which cpu owns it as symlinks would anyway duplicate it under
> every cpu.

I think you missed one my of comments in the email. I agree with what 
you are saying here. I'll just do it as a separate patch to keep this 
one simpler. I don't want to touch all the governors and other potential 
uses of policy->cpu in this patch.

>> Yeah, it is pretty convolution. But pretty much anywhere in the gov code
>> where policy->cpu is used could cause this. The specific crash I hit was in
>> this code:
>>
>> static void od_dbs_timer(struct work_struct *work)
>> {
>>          struct od_cpu_dbs_info_s *dbs_info =
>>                  container_of(work, struct od_cpu_dbs_info_s,
>> cdbs.work.work);
>>          unsigned int cpu = dbs_info->cdbs.cur_policy->cpu;
>>
>> ======= CPU is policy->cpu here.
>>
>>          struct od_cpu_dbs_info_s *core_dbs_info = &per_cpu(od_cpu_dbs_info,
>>                          cpu);
>>
>> ======= Picks the per CPU struct of an offline CPU
>>
>> <snip>
>>
>>          mutex_lock(&core_dbs_info->cdbs.timer_mutex);
>>
>> ======= Dies trying to lock a destroyed mutex
>
> I am still not getting it. Why would we get into this if policy->cpu is fixed
> once at boot ?
>

Yeah, it definitely crashes if policy->cpu if an offline cpu. Because 
the mutex would be uninitialized if it's stopped after boot or it would 
never have been initialized (depending on how you fix policy->cpu at boot).

Look at this snippet on the actual tree and it should be pretty evident.

-Saravana
Saravana Kannan July 14, 2014, 7:10 p.m. UTC | #12
On 07/13/2014 11:13 PM, Viresh Kumar wrote:
> On 12 July 2014 08:36, Saravana Kannan <skannan@codeaurora.org> wrote:
>> On 07/10/2014 11:19 PM, Viresh Kumar wrote:
>>
>>>
>>> Please make sure you take care of these issues:
>>> - suspend/resume
>>> - hotplug
>>> - module insert/remove
>>
>> Ok, I was just at the current code. Does cpufreq_unregister_driver() even
>> really work correctly as it stands?
>>
>> It doesn't even seem to stop any of the governors/policies before it just
>> set the cpufreq_driver pointer to NULL.
>>
>> So, technically my v2 patch doesn't even make anything worse when it comes
>> to unregistering the cpufreq driver.
>
> Are you really sure about this? I have tested this *myself* earlier..
>
> subsys_interface_unregister() should take care of stopping/freeing governor
> stuff..
>

I was asking this question based on looking at the code. Didn't actually 
try it -- sent it just before being done for the day. I didn't know 
about the subsys_interface_unregister() coming into play here. I'll take 
a look.

Thanks for the pointer.

-Saravana
Viresh Kumar July 15, 2014, 4:35 a.m. UTC | #13
On 15 July 2014 00:38, Saravana Kannan <skannan@codeaurora.org> wrote:
> Yeah, it definitely crashes if policy->cpu if an offline cpu. Because the
> mutex would be uninitialized if it's stopped after boot or it would never
> have been initialized (depending on how you fix policy->cpu at boot).
>
> Look at this snippet on the actual tree and it should be pretty evident.

Yeah, I missed it. So the problem is we initialize timer_mutex's for
policy->cpus. So we need to do that just for policy->cpu and also we don't
need a per-cpu timer_mutex anymore.
Saravana Kannan July 15, 2014, 5:36 a.m. UTC | #14
On 07/14/2014 09:35 PM, Viresh Kumar wrote:
> On 15 July 2014 00:38, Saravana Kannan <skannan@codeaurora.org> wrote:
>> Yeah, it definitely crashes if policy->cpu if an offline cpu. Because the
>> mutex would be uninitialized if it's stopped after boot or it would never
>> have been initialized (depending on how you fix policy->cpu at boot).
>>
>> Look at this snippet on the actual tree and it should be pretty evident.
>
> Yeah, I missed it. So the problem is we initialize timer_mutex's for
> policy->cpus. So we need to do that just for policy->cpu and also we don't
> need a per-cpu timer_mutex anymore.
>

Btw, I tried to take a stab at removing any assumption in cpufreq code 
about policy->cpu being ONLINE. There are 160 instances of those of with 
23 are in cpufreq.c

So, even if we are sure cpufreq.c is fine, it's 137 other uses spread 
across all the other files. I definitely don't want to try and fix those 
as part of this patch. Way too risky and hard to get the test coverage 
it would need. Even some of the acpi cpufreq drivers seem to be making 
this assumption.

Btw, I think v3 is done. I did some testing and it was fine. But made 
some minor changes. Will test tomorrow to make sure I didn't break 
anything with the minor changes and then send them out.

-Saravana
Viresh Kumar July 15, 2014, 5:52 a.m. UTC | #15
On 15 July 2014 11:06, Saravana Kannan <skannan@codeaurora.org> wrote:
> Btw, I tried to take a stab at removing any assumption in cpufreq code about
> policy->cpu being ONLINE. There are 160 instances of those of with 23 are in
> cpufreq.c
>
> So, even if we are sure cpufreq.c is fine, it's 137 other uses spread across
> all the other files. I definitely don't want to try and fix those as part of
> this patch. Way too risky and hard to get the test coverage it would need.
> Even some of the acpi cpufreq drivers seem to be making this assumption.

Hmm, yeah that would be an issue. So this is what you should do now:
- Left policy->cpu as it is, i.e. updated only when policy->cpu goes down.
- Just make sure sysfs nodes are untouched when any cpu goes down

> Btw, I think v3 is done. I did some testing and it was fine. But made some
> minor changes. Will test tomorrow to make sure I didn't break anything with
> the minor changes and then send them out.

Ok, just comply to the above comments.
Srivatsa S. Bhat July 15, 2014, 6:58 a.m. UTC | #16
On 07/15/2014 11:06 AM, Saravana Kannan wrote:
> On 07/14/2014 09:35 PM, Viresh Kumar wrote:
>> On 15 July 2014 00:38, Saravana Kannan <skannan@codeaurora.org> wrote:
>>> Yeah, it definitely crashes if policy->cpu if an offline cpu. Because
>>> the
>>> mutex would be uninitialized if it's stopped after boot or it would
>>> never
>>> have been initialized (depending on how you fix policy->cpu at boot).
>>>
>>> Look at this snippet on the actual tree and it should be pretty evident.
>>
>> Yeah, I missed it. So the problem is we initialize timer_mutex's for
>> policy->cpus. So we need to do that just for policy->cpu and also we
>> don't
>> need a per-cpu timer_mutex anymore.
>>
> 
> Btw, I tried to take a stab at removing any assumption in cpufreq code
> about policy->cpu being ONLINE.

Wait, allowing an offline CPU to be the policy->cpu (i.e., the CPU which is
considered as the master of the policy/group) is just absurd. If there is
no leader, there is no army. We should NOT sacrifice sane semantics for the
sake of simplifying the code.

> There are 160 instances of those of with
> 23 are in cpufreq.c
>

And that explains why. It is just *natural* to assume that the CPUs governed
by a policy are online. Especially so for the CPU which is supposed to be
the policy leader. Let us please not change that - it will become
counter-intuitive if we do so. [ The other reason is that physical hotplug
is also possible on some systems... in that case your code might make a CPU
which is not even present (but possible) as the policy->cpu.. and great 'fun'
will ensue after that ;-( ]

The goal of this patchset should be to just de-couple the sysfs files/ownership
from the policy->cpu to an extent where it doesn't matter who owns those
files, and probably make it easier to do CPU hotplug without having to
destroy and recreate the files on every hotplug operation.

This is exactly why the _implementation_ matters in this particular case -
if we can't achieve the simplification by keeping sane semantics, then we
shouldn't do the simplification!

That said, I think we should keep trying - we haven't exhausted all ideas
yet :-)

Regards,
Srivatsa S. Bhat

> So, even if we are sure cpufreq.c is fine, it's 137 other uses spread
> across all the other files. I definitely don't want to try and fix those
> as part of this patch. Way too risky and hard to get the test coverage
> it would need. Even some of the acpi cpufreq drivers seem to be making
> this assumption.
> 
> Btw, I think v3 is done. I did some testing and it was fine. But made
> some minor changes. Will test tomorrow to make sure I didn't break
> anything with the minor changes and then send them out.
>
Saravana Kannan July 15, 2014, 5:35 p.m. UTC | #17
Srivatsa S. Bhat wrote:
> On 07/15/2014 11:06 AM, Saravana Kannan wrote:
>> On 07/14/2014 09:35 PM, Viresh Kumar wrote:
>>> On 15 July 2014 00:38, Saravana Kannan <skannan@codeaurora.org> wrote:
>>>> Yeah, it definitely crashes if policy->cpu if an offline cpu. Because
>>>> the
>>>> mutex would be uninitialized if it's stopped after boot or it would
>>>> never
>>>> have been initialized (depending on how you fix policy->cpu at boot).
>>>>
>>>> Look at this snippet on the actual tree and it should be pretty
>>>> evident.
>>>
>>> Yeah, I missed it. So the problem is we initialize timer_mutex's for
>>> policy->cpus. So we need to do that just for policy->cpu and also we
>>> don't
>>> need a per-cpu timer_mutex anymore.
>>>
>>
>> Btw, I tried to take a stab at removing any assumption in cpufreq code
>> about policy->cpu being ONLINE.
>
> Wait, allowing an offline CPU to be the policy->cpu (i.e., the CPU which
> is
> considered as the master of the policy/group) is just absurd. If there is
> no leader, there is no army. We should NOT sacrifice sane semantics for
> the
> sake of simplifying the code.
>
>> There are 160 instances of those of with
>> 23 are in cpufreq.c
>>
>
> And that explains why. It is just *natural* to assume that the CPUs
> governed
> by a policy are online. Especially so for the CPU which is supposed to be
> the policy leader. Let us please not change that - it will become
> counter-intuitive if we do so. [ The other reason is that physical hotplug
> is also possible on some systems... in that case your code might make a
> CPU
> which is not even present (but possible) as the policy->cpu.. and great
> 'fun'
> will ensue after that ;-( ]
>
> The goal of this patchset should be to just de-couple the sysfs
> files/ownership
> from the policy->cpu to an extent where it doesn't matter who owns those
> files, and probably make it easier to do CPU hotplug without having to
> destroy and recreate the files on every hotplug operation.
>
> This is exactly why the _implementation_ matters in this particular case -
> if we can't achieve the simplification by keeping sane semantics, then we
> shouldn't do the simplification!
>
> That said, I think we should keep trying - we haven't exhausted all ideas
> yet :-)
>

I don't think we disagree. To summarize this topic: I tried to keep the
policy->cpu an actual online CPU so as to not break existing semantics in
this patch. Viresh asked "why not fix it at boot?". My response was to
keep it an online CPU and give it a shot in a separate patch if we really
want that. It's too risky to do that in this patch and also not a
mandatory change for this patch.

I think we can work out the details on the need to fixing policy->cpu at
boot and whether there's even a need for policy->cpu (when we already have
policy->cpus) in a separate thread after the dust settles on this one?

-Saravana
Saravana Kannan July 15, 2014, 10:47 p.m. UTC | #18
Series of patchs to simplify policy/sysfs/kobj/locking handling across
suspend/resume

The following have been tested so far on a 2x2 cluster environment:
- Boot with 2 cpus and no cpufreq driver.
- mod probe driver and see cpufreq sysfs files show up only for the 1st cluster.
- Online the rest of the 2 CPUs and have files show up correctly.
- rmmod the driver and see the files go away.
- modprobe again (or back and forth multiples times) and see it work.
- suspend/resume works as expected.
- When a cluster is offline, all read/writes to its sysfs files return an error

Saravana Kannan (2):
  cpufreq: Don't destroy/realloc policy/sysfs on hotplug/suspend
  cpufreq: Simplify and fix mutual exclusion with hotplug

 drivers/cpufreq/cpufreq.c | 432 ++++++++++++++--------------------------------
 1 file changed, 127 insertions(+), 305 deletions(-)
Viresh Kumar July 16, 2014, 5:44 a.m. UTC | #19
On 15 July 2014 12:28, Srivatsa S. Bhat <srivatsa@mit.edu> wrote:
> Wait, allowing an offline CPU to be the policy->cpu (i.e., the CPU which is
> considered as the master of the policy/group) is just absurd.

Yeah, that was as Absurd as I am :)

> The goal of this patchset should be to just de-couple the sysfs files/ownership
> from the policy->cpu to an extent where it doesn't matter who owns those
> files, and probably make it easier to do CPU hotplug without having to
> destroy and recreate the files on every hotplug operation.

I went to that Absurd idea because we thought we can skip playing with
the sysfs nodes on suspend/hotplug.

And if policy->cpu keeps changing with hotplug, we *may* have to keep
sysfs stuff moving as well. One way to avoid that is by using something
like: policy->sysfs_cpu, but wasn't sure if that's the right path to follow.

Lets see what Saravana's new patchset has for us :)
Srivatsa S. Bhat July 16, 2014, 7:44 a.m. UTC | #20
On 07/15/2014 11:05 PM, skannan@codeaurora.org wrote:
> 
> Srivatsa S. Bhat wrote:
>> On 07/15/2014 11:06 AM, Saravana Kannan wrote:
>>> On 07/14/2014 09:35 PM, Viresh Kumar wrote:
>>>> On 15 July 2014 00:38, Saravana Kannan <skannan@codeaurora.org> wrote:
>>>>> Yeah, it definitely crashes if policy->cpu if an offline cpu. Because
>>>>> the
>>>>> mutex would be uninitialized if it's stopped after boot or it would
>>>>> never
>>>>> have been initialized (depending on how you fix policy->cpu at boot).
>>>>>
>>>>> Look at this snippet on the actual tree and it should be pretty
>>>>> evident.
>>>>
>>>> Yeah, I missed it. So the problem is we initialize timer_mutex's for
>>>> policy->cpus. So we need to do that just for policy->cpu and also we
>>>> don't
>>>> need a per-cpu timer_mutex anymore.
>>>>
>>>
>>> Btw, I tried to take a stab at removing any assumption in cpufreq code
>>> about policy->cpu being ONLINE.
>>
>> Wait, allowing an offline CPU to be the policy->cpu (i.e., the CPU which
>> is
>> considered as the master of the policy/group) is just absurd. If there is
>> no leader, there is no army. We should NOT sacrifice sane semantics for
>> the
>> sake of simplifying the code.
>>
>>> There are 160 instances of those of with
>>> 23 are in cpufreq.c
>>>
>>
>> And that explains why. It is just *natural* to assume that the CPUs
>> governed
>> by a policy are online. Especially so for the CPU which is supposed to be
>> the policy leader. Let us please not change that - it will become
>> counter-intuitive if we do so. [ The other reason is that physical hotplug
>> is also possible on some systems... in that case your code might make a
>> CPU
>> which is not even present (but possible) as the policy->cpu.. and great
>> 'fun'
>> will ensue after that ;-( ]
>>
>> The goal of this patchset should be to just de-couple the sysfs
>> files/ownership
>> from the policy->cpu to an extent where it doesn't matter who owns those
>> files, and probably make it easier to do CPU hotplug without having to
>> destroy and recreate the files on every hotplug operation.
>>
>> This is exactly why the _implementation_ matters in this particular case -
>> if we can't achieve the simplification by keeping sane semantics, then we
>> shouldn't do the simplification!
>>
>> That said, I think we should keep trying - we haven't exhausted all ideas
>> yet :-)
>>
> 
> I don't think we disagree. To summarize this topic: I tried to keep the
> policy->cpu an actual online CPU so as to not break existing semantics in
> this patch. Viresh asked "why not fix it at boot?". My response was to
> keep it an online CPU and give it a shot in a separate patch if we really
> want that. It's too risky to do that in this patch and also not a
> mandatory change for this patch.
> 
> I think we can work out the details on the need to fixing policy->cpu at
> boot and whether there's even a need for policy->cpu (when we already have
> policy->cpus) in a separate thread after the dust settles on this one?
>

Sure, that sounds good!

Regards,
Srivatsa S. Bhat
Srivatsa S. Bhat July 16, 2014, 7:49 a.m. UTC | #21
On 07/16/2014 11:14 AM, Viresh Kumar wrote:
> On 15 July 2014 12:28, Srivatsa S. Bhat <srivatsa@mit.edu> wrote:
>> Wait, allowing an offline CPU to be the policy->cpu (i.e., the CPU which is
>> considered as the master of the policy/group) is just absurd.
> 
> Yeah, that was as Absurd as I am :)
> 

I have had my own share of silly ideas over the years; so don't worry, we are
all in the same boat ;-)

>> The goal of this patchset should be to just de-couple the sysfs files/ownership
>> from the policy->cpu to an extent where it doesn't matter who owns those
>> files, and probably make it easier to do CPU hotplug without having to
>> destroy and recreate the files on every hotplug operation.
> 
> I went to that Absurd idea because we thought we can skip playing with
> the sysfs nodes on suspend/hotplug.
> 
> And if policy->cpu keeps changing with hotplug, we *may* have to keep
> sysfs stuff moving as well. One way to avoid that is by using something
> like: policy->sysfs_cpu, but wasn't sure if that's the right path to follow.
>

Hmm, I understand.. Even I don't have any suggestions as of now, since I
haven't spent enough time thinking of alternatives yet.

> Lets see what Saravana's new patchset has for us :)
> 

Yep :-)

Regards,
Srivatsa S. Bhat
Saravana Kannan July 25, 2014, 1:07 a.m. UTC | #22
Series of patchs to simplify policy/sysfs/kobj/locking handling across
suspend/resume

The following have been tested so far on a 2x2 cluster environment:
- Boot with 2 cpus and no cpufreq driver.
- mod probe driver and see cpufreq sysfs files show up only for the 1st cluster.
- Online the rest of the 2 CPUs and have files show up correctly.
- rmmod the driver and see the files go away.
- modprobe again (or back and forth multiples times) and see it work.
- suspend/resume works as expected.
- When a cluster is offline, all read/writes to its sysfs files return an error

v4
- Split it up into smaller patches
- Will handle physical CPU removal correctly
- Fixed earlier mistake of deleting code under !recover_policy
- Dropped some code refactor that reuses a lot of code between add/remove
- Dropped fix for exiting hotplug race with cpufreq driver probe/rmmod
- Dropped changes will come later once this series is acked.


Saravana Kannan (5):
  cpufreq: Don't wait for CPU to going offline to restart governor
  cpufreq: Keep track of which CPU owns the kobj/sysfs nodes separately
  cpufreq: Don't destroy/realloc policy/sysfs on hotplug/suspend
  cpufreq: Properly handle physical CPU hot-add/hot-remove
  cpufreq: Delete dead code related to policy save/restore

 drivers/cpufreq/cpufreq.c | 238 ++++++++++++++++++----------------------------
 include/linux/cpufreq.h   |   1 +
 2 files changed, 93 insertions(+), 146 deletions(-)
Saravana Kannan July 29, 2014, 5:52 a.m. UTC | #23
Saravana Kannan wrote:
> Series of patchs to simplify policy/sysfs/kobj/locking handling across
> suspend/resume
>

Bump.

-Saravana
Rafael J. Wysocki July 30, 2014, 12:29 a.m. UTC | #24
On Thursday, July 24, 2014 06:07:23 PM Saravana Kannan wrote:
> Series of patchs to simplify policy/sysfs/kobj/locking handling across
> suspend/resume

I need someone to review this series for me.  Viresh or Srivatsa, preferably
both.

Thanks!

> The following have been tested so far on a 2x2 cluster environment:
> - Boot with 2 cpus and no cpufreq driver.
> - mod probe driver and see cpufreq sysfs files show up only for the 1st cluster.
> - Online the rest of the 2 CPUs and have files show up correctly.
> - rmmod the driver and see the files go away.
> - modprobe again (or back and forth multiples times) and see it work.
> - suspend/resume works as expected.
> - When a cluster is offline, all read/writes to its sysfs files return an error
> 
> v4
> - Split it up into smaller patches
> - Will handle physical CPU removal correctly
> - Fixed earlier mistake of deleting code under !recover_policy
> - Dropped some code refactor that reuses a lot of code between add/remove
> - Dropped fix for exiting hotplug race with cpufreq driver probe/rmmod
> - Dropped changes will come later once this series is acked.
> 
> 
> Saravana Kannan (5):
>   cpufreq: Don't wait for CPU to going offline to restart governor
>   cpufreq: Keep track of which CPU owns the kobj/sysfs nodes separately
>   cpufreq: Don't destroy/realloc policy/sysfs on hotplug/suspend
>   cpufreq: Properly handle physical CPU hot-add/hot-remove
>   cpufreq: Delete dead code related to policy save/restore
> 
>  drivers/cpufreq/cpufreq.c | 238 ++++++++++++++++++----------------------------
>  include/linux/cpufreq.h   |   1 +
>  2 files changed, 93 insertions(+), 146 deletions(-)
> 
>
Saravana Kannan July 31, 2014, 8:25 p.m. UTC | #25
On 07/29/2014 05:29 PM, Rafael J. Wysocki wrote:
> On Thursday, July 24, 2014 06:07:23 PM Saravana Kannan wrote:
>> Series of patchs to simplify policy/sysfs/kobj/locking handling across
>> suspend/resume
>
> I need someone to review this series for me.  Viresh or Srivatsa, preferably
> both.

Viresh/Srivatsa,

Bump. Can you guys please review this?

-Saravana
Saravana Kannan Aug. 7, 2014, 6:04 a.m. UTC | #26
Rafael J. Wysocki wrote:
> On Thursday, July 24, 2014 06:07:23 PM Saravana Kannan wrote:
>> Series of patchs to simplify policy/sysfs/kobj/locking handling across
>> suspend/resume
>
> I need someone to review this series for me.  Viresh or Srivatsa,
> preferably
> both.
>
> Thanks!

This took quite a bit of work to get to v4. It's been almost 2 weeks.
Could someone please review this? I'll get pulled into some heavy "work"
work soon. I would really prefer not to let this bit rot and later
abandoned due to lack of time on my part. Would really appreciate some
review folks. Thanks.

Regards,
Saravana

>> The following have been tested so far on a 2x2 cluster environment:
>> - Boot with 2 cpus and no cpufreq driver.
>> - mod probe driver and see cpufreq sysfs files show up only for the 1st
>> cluster.
>> - Online the rest of the 2 CPUs and have files show up correctly.
>> - rmmod the driver and see the files go away.
>> - modprobe again (or back and forth multiples times) and see it work.
>> - suspend/resume works as expected.
>> - When a cluster is offline, all read/writes to its sysfs files return
>> an error
>>
>> v4
>> - Split it up into smaller patches
>> - Will handle physical CPU removal correctly
>> - Fixed earlier mistake of deleting code under !recover_policy
>> - Dropped some code refactor that reuses a lot of code between
>> add/remove
>> - Dropped fix for exiting hotplug race with cpufreq driver probe/rmmod
>> - Dropped changes will come later once this series is acked.
>>
>>
>> Saravana Kannan (5):
>>   cpufreq: Don't wait for CPU to going offline to restart governor
>>   cpufreq: Keep track of which CPU owns the kobj/sysfs nodes separately
>>   cpufreq: Don't destroy/realloc policy/sysfs on hotplug/suspend
>>   cpufreq: Properly handle physical CPU hot-add/hot-remove
>>   cpufreq: Delete dead code related to policy save/restore
>>
>>  drivers/cpufreq/cpufreq.c | 238
>> ++++++++++++++++++----------------------------
>>  include/linux/cpufreq.h   |   1 +
>>  2 files changed, 93 insertions(+), 146 deletions(-)
>>
>>
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.
>
Viresh Kumar Oct. 16, 2014, 8:53 a.m. UTC | #27
On 25 July 2014 06:37, Saravana Kannan <skannan@codeaurora.org> wrote:
> Series of patchs to simplify policy/sysfs/kobj/locking handling across
> suspend/resume
>
> The following have been tested so far on a 2x2 cluster environment:
> - Boot with 2 cpus and no cpufreq driver.
> - mod probe driver and see cpufreq sysfs files show up only for the 1st cluster.
> - Online the rest of the 2 CPUs and have files show up correctly.
> - rmmod the driver and see the files go away.
> - modprobe again (or back and forth multiples times) and see it work.
> - suspend/resume works as expected.
> - When a cluster is offline, all read/writes to its sysfs files return an error
>
> v4
> - Split it up into smaller patches
> - Will handle physical CPU removal correctly
> - Fixed earlier mistake of deleting code under !recover_policy
> - Dropped some code refactor that reuses a lot of code between add/remove
> - Dropped fix for exiting hotplug race with cpufreq driver probe/rmmod
> - Dropped changes will come later once this series is acked.

Hi Saravana,

Any  updates on this? We might need some of this soon or should somebody
else start working on this ?
Saravana Kannan Oct. 23, 2014, 9:41 p.m. UTC | #28
On 10/16/2014 01:53 AM, Viresh Kumar wrote:
> On 25 July 2014 06:37, Saravana Kannan <skannan@codeaurora.org> wrote:
>> Series of patchs to simplify policy/sysfs/kobj/locking handling across
>> suspend/resume
>>
>> The following have been tested so far on a 2x2 cluster environment:
>> - Boot with 2 cpus and no cpufreq driver.
>> - mod probe driver and see cpufreq sysfs files show up only for the 1st cluster.
>> - Online the rest of the 2 CPUs and have files show up correctly.
>> - rmmod the driver and see the files go away.
>> - modprobe again (or back and forth multiples times) and see it work.
>> - suspend/resume works as expected.
>> - When a cluster is offline, all read/writes to its sysfs files return an error
>>
>> v4
>> - Split it up into smaller patches
>> - Will handle physical CPU removal correctly
>> - Fixed earlier mistake of deleting code under !recover_policy
>> - Dropped some code refactor that reuses a lot of code between add/remove
>> - Dropped fix for exiting hotplug race with cpufreq driver probe/rmmod
>> - Dropped changes will come later once this series is acked.
>
> Hi Saravana,
>
> Any  updates on this? We might need some of this soon or should somebody
> else start working on this ?
>

Hey,

Sorry for the delay. Got side tracked with some commercial stuff. I'm 
still invested in finishing this up. I'll try to send out something 
within a week.

I did notice (didn't read mych) the "Locking issues with cpufreq and 
sysfs" thread. I think my patches should side step most of it.

Thanks,
Saravana
diff mbox

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 62259d2..e350b15 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -859,16 +859,16 @@  void cpufreq_sysfs_remove_file(const struct attribute *attr)
 }
 EXPORT_SYMBOL(cpufreq_sysfs_remove_file);
 
-/* symlink affected CPUs */
+/* symlink related CPUs */
 static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
 {
-	unsigned int j;
+	unsigned int j, first_cpu = cpumask_first(policy->related_cpus);
 	int ret = 0;
 
-	for_each_cpu(j, policy->cpus) {
+	for_each_cpu(j, policy->related_cpus) {
 		struct device *cpu_dev;
 
-		if (j == policy->cpu)
+		if (j == first_cpu)
 			continue;
 
 		pr_debug("Adding link for CPU: %u\n", j);
@@ -881,12 +881,16 @@  static int cpufreq_add_dev_symlink(struct cpufreq_policy *policy)
 	return ret;
 }
 
-static int cpufreq_add_dev_interface(struct cpufreq_policy *policy,
-				     struct device *dev)
+static int cpufreq_add_dev_interface(struct cpufreq_policy *policy)
 {
 	struct freq_attr **drv_attr;
+	struct device *dev;
 	int ret = 0;
 
+	dev = get_cpu_device(cpumask_first(policy->related_cpus));
+	if (!dev)
+		return -EINVAL;
+
 	/* prepare interface data */
 	ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq,
 				   &dev->kobj, "cpufreq");
@@ -961,60 +965,53 @@  static void cpufreq_init_policy(struct cpufreq_policy *policy)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy,
-				  unsigned int cpu, struct device *dev)
+static int cpufreq_change_policy_cpus(struct cpufreq_policy *policy,
+				  unsigned int cpu, bool add)
 {
 	int ret = 0;
-	unsigned long flags;
+	unsigned int cpus;
 
-	if (has_target()) {
+	down_write(&policy->rwsem);
+	cpus = cpumask_weight(policy->cpus);
+	if (has_target() && cpus > 0) {
 		ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
 		if (ret) {
 			pr_err("%s: Failed to stop governor\n", __func__);
-			return ret;
+			goto unlock;
 		}
 	}
 
-	down_write(&policy->rwsem);
-
-	write_lock_irqsave(&cpufreq_driver_lock, flags);
-
-	cpumask_set_cpu(cpu, policy->cpus);
-	per_cpu(cpufreq_cpu_data, cpu) = policy;
-	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+	if (add)
+		cpumask_set_cpu(cpu, policy->cpus);
+	else
+		cpumask_clear_cpu(cpu, policy->cpus);
 
-	up_write(&policy->rwsem);
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+					CPUFREQ_UPDATE_POLICY_CPU, policy);
 
-	if (has_target()) {
+	cpus = cpumask_weight(policy->cpus);
+	policy->cpu = cpumask_first(policy->cpus);
+	if (has_target() && cpus > 0) {
 		ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
 		if (!ret)
 			ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
 
 		if (ret) {
 			pr_err("%s: Failed to start governor\n", __func__);
-			return ret;
+			goto unlock;
 		}
 	}
 
-	return sysfs_create_link(&dev->kobj, &policy->kobj, "cpufreq");
-}
-#endif
-
-static struct cpufreq_policy *cpufreq_policy_restore(unsigned int cpu)
-{
-	struct cpufreq_policy *policy;
-	unsigned long flags;
-
-	read_lock_irqsave(&cpufreq_driver_lock, flags);
-
-	policy = per_cpu(cpufreq_cpu_data_fallback, cpu);
-
-	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
+	if (cpus < 1 && cpufreq_driver->stop_cpu && cpufreq_driver->setpolicy) {
+		cpufreq_driver->stop_cpu(policy);
+	}
 
-	policy->governor = NULL;
+unlock:
+	up_write(&policy->rwsem);
 
-	return policy;
+	return ret;
 }
+#endif
 
 static struct cpufreq_policy *cpufreq_policy_alloc(void)
 {
@@ -1076,22 +1073,6 @@  static void cpufreq_policy_free(struct cpufreq_policy *policy)
 	kfree(policy);
 }
 
-static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
-{
-	if (WARN_ON(cpu == policy->cpu))
-		return;
-
-	down_write(&policy->rwsem);
-
-	policy->last_cpu = policy->cpu;
-	policy->cpu = cpu;
-
-	up_write(&policy->rwsem);
-
-	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
-			CPUFREQ_UPDATE_POLICY_CPU, policy);
-}
-
 static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 {
 	unsigned int j, cpu = dev->id;
@@ -1099,9 +1080,6 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	struct cpufreq_policy *policy;
 	unsigned long flags;
 	bool recover_policy = cpufreq_suspended;
-#ifdef CONFIG_HOTPLUG_CPU
-	struct cpufreq_policy *tpolicy;
-#endif
 
 	if (cpu_is_offline(cpu))
 		return 0;
@@ -1111,55 +1089,28 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 #ifdef CONFIG_SMP
 	/* check whether a different CPU already registered this
 	 * CPU because it is in the same boat. */
+	/* FIXME: This probably needs fixing to avoid "try lock" from
+	 * returning NULL. Also, change to likely() */
 	policy = cpufreq_cpu_get(cpu);
 	if (unlikely(policy)) {
+		cpufreq_change_policy_cpus(policy, cpu, true);
 		cpufreq_cpu_put(policy);
 		return 0;
 	}
 #endif
 
+	/* FIXME: Is returning 0 the right thing to do?! Existing code */
 	if (!down_read_trylock(&cpufreq_rwsem))
 		return 0;
 
-#ifdef CONFIG_HOTPLUG_CPU
-	/* Check if this cpu was hot-unplugged earlier and has siblings */
-	read_lock_irqsave(&cpufreq_driver_lock, flags);
-	list_for_each_entry(tpolicy, &cpufreq_policy_list, policy_list) {
-		if (cpumask_test_cpu(cpu, tpolicy->related_cpus)) {
-			read_unlock_irqrestore(&cpufreq_driver_lock, flags);
-			ret = cpufreq_add_policy_cpu(tpolicy, cpu, dev);
-			up_read(&cpufreq_rwsem);
-			return ret;
-		}
-	}
-	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
-#endif
-
-	/*
-	 * Restore the saved policy when doing light-weight init and fall back
-	 * to the full init if that fails.
-	 */
-	policy = recover_policy ? cpufreq_policy_restore(cpu) : NULL;
-	if (!policy) {
-		recover_policy = false;
-		policy = cpufreq_policy_alloc();
-		if (!policy)
-			goto nomem_out;
-	}
-
-	/*
-	 * In the resume path, since we restore a saved policy, the assignment
-	 * to policy->cpu is like an update of the existing policy, rather than
-	 * the creation of a brand new one. So we need to perform this update
-	 * by invoking update_policy_cpu().
-	 */
-	if (recover_policy && cpu != policy->cpu)
-		update_policy_cpu(policy, cpu);
-	else
-		policy->cpu = cpu;
+	/* If we get this far, this is the first time we are adding the
+	 * policy */
+	policy = cpufreq_policy_alloc();
+	if (!policy)
+		goto nomem_out;
+	policy->cpu = cpu;
 
 	cpumask_copy(policy->cpus, cpumask_of(cpu));
-
 	init_completion(&policy->kobj_unregister);
 	INIT_WORK(&policy->update, handle_update);
 
@@ -1175,20 +1126,19 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	/* related cpus should atleast have policy->cpus */
 	cpumask_or(policy->related_cpus, policy->related_cpus, policy->cpus);
 
+	/* Weed out impossible CPUs. */
+	cpumask_and(policy->related_cpus, policy->related_cpus,
+			cpu_possible_mask);
+
 	/*
 	 * affected cpus must always be the one, which are online. We aren't
 	 * managing offline cpus here.
 	 */
 	cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);
 
-	if (!recover_policy) {
-		policy->user_policy.min = policy->min;
-		policy->user_policy.max = policy->max;
-	}
-
 	down_write(&policy->rwsem);
 	write_lock_irqsave(&cpufreq_driver_lock, flags);
-	for_each_cpu(j, policy->cpus)
+	for_each_cpu(j, policy->related_cpus)
 		per_cpu(cpufreq_cpu_data, j) = policy;
 	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
 
@@ -1243,13 +1193,11 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
 				     CPUFREQ_START, policy);
 
-	if (!recover_policy) {
-		ret = cpufreq_add_dev_interface(policy, dev);
-		if (ret)
-			goto err_out_unregister;
-		blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
-				CPUFREQ_CREATE_POLICY, policy);
-	}
+	ret = cpufreq_add_dev_interface(policy);
+	if (ret)
+		goto err_out_unregister;
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+			CPUFREQ_CREATE_POLICY, policy);
 
 	write_lock_irqsave(&cpufreq_driver_lock, flags);
 	list_add(&policy->policy_list, &cpufreq_policy_list);
@@ -1257,10 +1205,6 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 
 	cpufreq_init_policy(policy);
 
-	if (!recover_policy) {
-		policy->user_policy.policy = policy->policy;
-		policy->user_policy.governor = policy->governor;
-	}
 	up_write(&policy->rwsem);
 
 	kobject_uevent(&policy->kobj, KOBJ_ADD);
@@ -1307,100 +1251,16 @@  static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	return __cpufreq_add_dev(dev, sif);
 }
 
-static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy,
-					   unsigned int old_cpu)
-{
-	struct device *cpu_dev;
-	int ret;
-
-	/* first sibling now owns the new sysfs dir */
-	cpu_dev = get_cpu_device(cpumask_any_but(policy->cpus, old_cpu));
-
-	sysfs_remove_link(&cpu_dev->kobj, "cpufreq");
-	ret = kobject_move(&policy->kobj, &cpu_dev->kobj);
-	if (ret) {
-		pr_err("%s: Failed to move kobj: %d\n", __func__, ret);
-
-		down_write(&policy->rwsem);
-		cpumask_set_cpu(old_cpu, policy->cpus);
-		up_write(&policy->rwsem);
-
-		ret = sysfs_create_link(&cpu_dev->kobj, &policy->kobj,
-					"cpufreq");
-
-		return -EINVAL;
-	}
-
-	return cpu_dev->id;
-}
-
-static int __cpufreq_remove_dev_prepare(struct device *dev,
-					struct subsys_interface *sif)
+static int __cpufreq_remove_dev(struct device *dev,
+				struct subsys_interface *sif)
 {
-	unsigned int cpu = dev->id, cpus;
-	int new_cpu, ret;
+	unsigned int cpu = dev->id;
+	int ret = 0;
 	unsigned long flags;
 	struct cpufreq_policy *policy;
 
 	pr_debug("%s: unregistering CPU %u\n", __func__, cpu);
 
-	write_lock_irqsave(&cpufreq_driver_lock, flags);
-
-	policy = per_cpu(cpufreq_cpu_data, cpu);
-
-	/* Save the policy somewhere when doing a light-weight tear-down */
-	if (cpufreq_suspended)
-		per_cpu(cpufreq_cpu_data_fallback, cpu) = policy;
-
-	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
-
-	if (!policy) {
-		pr_debug("%s: No cpu_data found\n", __func__);
-		return -EINVAL;
-	}
-
-	if (has_target()) {
-		ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
-		if (ret) {
-			pr_err("%s: Failed to stop governor\n", __func__);
-			return ret;
-		}
-	}
-
-	if (!cpufreq_driver->setpolicy)
-		strncpy(per_cpu(cpufreq_cpu_governor, cpu),
-			policy->governor->name, CPUFREQ_NAME_LEN);
-
-	down_read(&policy->rwsem);
-	cpus = cpumask_weight(policy->cpus);
-	up_read(&policy->rwsem);
-
-	if (cpu != policy->cpu) {
-		sysfs_remove_link(&dev->kobj, "cpufreq");
-	} else if (cpus > 1) {
-		new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu);
-		if (new_cpu >= 0) {
-			update_policy_cpu(policy, new_cpu);
-
-			if (!cpufreq_suspended)
-				pr_debug("%s: policy Kobject moved to cpu: %d from: %d\n",
-					 __func__, new_cpu, cpu);
-		}
-	} else if (cpufreq_driver->stop_cpu && cpufreq_driver->setpolicy) {
-		cpufreq_driver->stop_cpu(policy);
-	}
-
-	return 0;
-}
-
-static int __cpufreq_remove_dev_finish(struct device *dev,
-				       struct subsys_interface *sif)
-{
-	unsigned int cpu = dev->id, cpus;
-	int ret;
-	unsigned long flags;
-	struct cpufreq_policy *policy;
-
 	read_lock_irqsave(&cpufreq_driver_lock, flags);
 	policy = per_cpu(cpufreq_cpu_data, cpu);
 	read_unlock_irqrestore(&cpufreq_driver_lock, flags);
@@ -1410,56 +1270,11 @@  static int __cpufreq_remove_dev_finish(struct device *dev,
 		return -EINVAL;
 	}
 
-	down_write(&policy->rwsem);
-	cpus = cpumask_weight(policy->cpus);
-
-	if (cpus > 1)
-		cpumask_clear_cpu(cpu, policy->cpus);
-	up_write(&policy->rwsem);
-
-	/* If cpu is last user of policy, free policy */
-	if (cpus == 1) {
-		if (has_target()) {
-			ret = __cpufreq_governor(policy,
-					CPUFREQ_GOV_POLICY_EXIT);
-			if (ret) {
-				pr_err("%s: Failed to exit governor\n",
-				       __func__);
-				return ret;
-			}
-		}
-
-		if (!cpufreq_suspended)
-			cpufreq_policy_put_kobj(policy);
-
-		/*
-		 * Perform the ->exit() even during light-weight tear-down,
-		 * since this is a core component, and is essential for the
-		 * subsequent light-weight ->init() to succeed.
-		 */
-		if (cpufreq_driver->exit)
-			cpufreq_driver->exit(policy);
-
-		/* Remove policy from list of active policies */
-		write_lock_irqsave(&cpufreq_driver_lock, flags);
-		list_del(&policy->policy_list);
-		write_unlock_irqrestore(&cpufreq_driver_lock, flags);
-
-		if (!cpufreq_suspended)
-			cpufreq_policy_free(policy);
-	} else if (has_target()) {
-		ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);
-		if (!ret)
-			ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
-
-		if (ret) {
-			pr_err("%s: Failed to start governor\n", __func__);
-			return ret;
-		}
-	}
+#ifdef CONFIG_HOTPLUG_CPU
+	ret = cpufreq_change_policy_cpus(policy, cpu, false);
+#endif
 
-	per_cpu(cpufreq_cpu_data, cpu) = NULL;
-	return 0;
+	return ret;
 }
 
 /**
@@ -1475,10 +1290,7 @@  static int cpufreq_remove_dev(struct device *dev, struct subsys_interface *sif)
 	if (cpu_is_offline(cpu))
 		return 0;
 
-	ret = __cpufreq_remove_dev_prepare(dev, sif);
-
-	if (!ret)
-		ret = __cpufreq_remove_dev_finish(dev, sif);
+	ret = __cpufreq_remove_dev(dev, sif);
 
 	return ret;
 }
@@ -2141,7 +1953,7 @@  static int cpufreq_set_policy(struct cpufreq_policy *policy,
 				struct cpufreq_policy *new_policy)
 {
 	struct cpufreq_governor *old_gov;
-	int ret;
+	int ret = 0;
 
 	pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
 		 new_policy->cpu, new_policy->min, new_policy->max);
@@ -2226,7 +2038,9 @@  static int cpufreq_set_policy(struct cpufreq_policy *policy,
 
  out:
 	pr_debug("governor: change or update limits\n");
-	return __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
+	if (policy->governor_enabled)
+		ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
+	return ret;
 }
 
 /**
@@ -2295,19 +2109,12 @@  static int cpufreq_cpu_callback(struct notifier_block *nfb,
 	if (dev) {
 		switch (action & ~CPU_TASKS_FROZEN) {
 		case CPU_ONLINE:
+		case CPU_DOWN_FAILED:
 			__cpufreq_add_dev(dev, NULL);
 			break;
 
 		case CPU_DOWN_PREPARE:
-			__cpufreq_remove_dev_prepare(dev, NULL);
-			break;
-
-		case CPU_POST_DEAD:
-			__cpufreq_remove_dev_finish(dev, NULL);
-			break;
-
-		case CPU_DOWN_FAILED:
-			__cpufreq_add_dev(dev, NULL);
+			__cpufreq_remove_dev(dev, NULL);
 			break;
 		}
 	}