diff mbox

cpufreq: Set policy to non-NULL only after all hotplug online work is done

Message ID 1393225072-3997-1-git-send-email-skannan@codeaurora.org (mailing list archive)
State New, archived
Headers show

Commit Message

Saravana Kannan Feb. 24, 2014, 6:57 a.m. UTC
The existing code sets the per CPU policy to a non-NULL value before all
the steps performed during the hotplug online path is done. Specifically,
this is done before the policy min/max, governors, etc are initialized for
the policy.  This in turn means that calls to cpufreq_cpu_get() return a
non-NULL policy before the policy/CPU is ready to be used.

To fix this, move the update of per CPU policy to a valid value after all
the initialization steps for the policy are completed.

Example kernel panic without this fix:
[  512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020
[  512.146195] pgd = c0003000
[  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
[  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
<snip>
[  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
[  512.146312] LR is at cpufreq_update_policy+0x114/0x150
<snip>
[  512.149740] ---[ end trace f23a8defea6cd706 ]---
[  512.149761] Kernel panic - not syncing: Fatal exception
[  513.152016] CPU0: stopping
[  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W    3.10.0-gd727407-00074-g979ede8 #396
<snip>
[  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
[  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
[  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
[  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
[  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
[  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
[  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68)
[  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44)
[  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc)
[  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78)
[  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74)
[  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c)
[  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180)
[  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68)
[  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30)

In this specific case, CPU0 set's CPU1's policy->governor in
cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in
__cpufreq_governor().

Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
---
 drivers/cpufreq/cpufreq.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Srivatsa S. Bhat Feb. 24, 2014, 7:42 a.m. UTC | #1
On 02/24/2014 12:27 PM, Saravana Kannan wrote:
> The existing code sets the per CPU policy to a non-NULL value before all
> the steps performed during the hotplug online path is done. Specifically,
> this is done before the policy min/max, governors, etc are initialized for
> the policy.  This in turn means that calls to cpufreq_cpu_get() return a
> non-NULL policy before the policy/CPU is ready to be used.
> 
> To fix this, move the update of per CPU policy to a valid value after all
> the initialization steps for the policy are completed.
> 
> Example kernel panic without this fix:
> [  512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020
> [  512.146195] pgd = c0003000
> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
> <snip>
> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
> <snip>
> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
> [  512.149761] Kernel panic - not syncing: Fatal exception
> [  513.152016] CPU0: stopping
> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W    3.10.0-gd727407-00074-g979ede8 #396
> <snip>
> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68)
> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44)
> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc)
> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78)
> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74)
> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c)
> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180)
> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68)
> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30)
> 
> In this specific case, CPU0 set's CPU1's policy->governor in
> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in
> __cpufreq_governor().
> 

Whoa! That's horrible! Can you please explain how exactly you
triggered this? I'm curious...

> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
> ---
>  drivers/cpufreq/cpufreq.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index cb003a6..d5ceb43 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>  		goto err_set_policy_cpu;
>  	}
> 
> -	write_lock_irqsave(&cpufreq_driver_lock, flags);
> -	for_each_cpu(j, policy->cpus)
> -		per_cpu(cpufreq_cpu_data, j) = policy;
> -	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> -
>  	if (cpufreq_driver->get) {
>  		policy->cur = cpufreq_driver->get(policy->cpu);

If you move the per-cpu init further down, then what happens to the
cpufreq_generic_get() that gets invoked here by some of the drivers?
It will almost always fail (because policy will be NULL) and hence
CPU online will be unsuccessful (though you wont observe it because
the error code is not bubbled up to the CPU hotplug core; perhaps we
should).


IMHO, we should fix the synchronization in cpufreq_init_policy().
I notice cpufreq_update_policy() as well as __cpufreq_governor() in
your stack trace, which means cpufreq_set_policy() was involved.
cpufreq_update_policy() takes the policy->rwsem for write, whereas
cpufreq_init_policy() doesn't take that lock at all, which is clearly
wrong.

My guess is that, if we fix that locking, everything will be fine and
you won't hit the bug. Would you like to give that a shot?

Regards,
Srivatsa S. Bhat

>  		if (!policy->cur) {
> @@ -1207,6 +1202,11 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>  		policy->user_policy.governor = policy->governor;
>  	}
> 
> +	write_lock_irqsave(&cpufreq_driver_lock, flags);
> +	for_each_cpu(j, policy->cpus)
> +		per_cpu(cpufreq_cpu_data, j) = policy;
> +	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> +
>  	kobject_uevent(&policy->kobj, KOBJ_ADD);
>  	up_read(&cpufreq_rwsem);
>
Viresh Kumar Feb. 24, 2014, 8:11 a.m. UTC | #2
On 24 February 2014 13:12, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 02/24/2014 12:27 PM, Saravana Kannan wrote:
>> The existing code sets the per CPU policy to a non-NULL value before all
>> the steps performed during the hotplug online path is done. Specifically,
>> this is done before the policy min/max, governors, etc are initialized for
>> the policy.  This in turn means that calls to cpufreq_cpu_get() return a
>> non-NULL policy before the policy/CPU is ready to be used.
>>
>> To fix this, move the update of per CPU policy to a valid value after all
>> the initialization steps for the policy are completed.
>>
>> Example kernel panic without this fix:
>> [  512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020
>> [  512.146195] pgd = c0003000
>> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
>> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>> <snip>
>> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
>> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
>> <snip>
>> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
>> [  512.149761] Kernel panic - not syncing: Fatal exception
>> [  513.152016] CPU0: stopping
>> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W    3.10.0-gd727407-00074-g979ede8 #396
>> <snip>
>> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
>> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
>> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
>> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68)
>> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44)
>> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc)
>> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78)
>> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74)
>> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c)
>> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180)
>> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68)
>> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30)
>>
>> In this specific case, CPU0 set's CPU1's policy->governor in
>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in
>> __cpufreq_governor().
>>
>
> Whoa! That's horrible! Can you please explain how exactly you
> triggered this? I'm curious...

:)

Actually I didn't understood his problem very well until now. I couldn't
get from his trace that which pointer was actually set to NULL here?
And hence what caused this crash?

Also, what Saravana just wrote here didn't looked like relevant at all.
i.e.:  policy->governor being set to NULL.

So what? It was set to NULL with a reason and it shouldn't cause
any crashes I hope. And also its sort of wrong to say that CPU0
has set governor for CPU1, as governor was set for a policy and
both CPUs were part of the same policy.

>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>> ---
>>  drivers/cpufreq/cpufreq.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index cb003a6..d5ceb43 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>>               goto err_set_policy_cpu;
>>       }
>>
>> -     write_lock_irqsave(&cpufreq_driver_lock, flags);
>> -     for_each_cpu(j, policy->cpus)
>> -             per_cpu(cpufreq_cpu_data, j) = policy;
>> -     write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>> -
>>       if (cpufreq_driver->get) {
>>               policy->cur = cpufreq_driver->get(policy->cpu);
>
> If you move the per-cpu init further down, then what happens to the
> cpufreq_generic_get() that gets invoked here by some of the drivers?
> It will almost always fail (because policy will be NULL) and hence
> CPU online will be unsuccessful

Thanks, I was about to point that as well.

> (though you wont observe it because
> the error code is not bubbled up to the CPU hotplug core; perhaps we
> should).

Don't know :)

> IMHO, we should fix the synchronization in cpufreq_init_policy().
> I notice cpufreq_update_policy() as well as __cpufreq_governor() in
> your stack trace, which means cpufreq_set_policy() was involved.
> cpufreq_update_policy() takes the policy->rwsem for write, whereas
> cpufreq_init_policy() doesn't take that lock at all, which is clearly
> wrong.
>
> My guess is that, if we fix that locking, everything will be fine and
> you won't hit the bug. Would you like to give that a shot?

I am not sure about that guess as well. I first want to know what
went bad exactly..
Saravana Kannan Feb. 24, 2014, 8:39 a.m. UTC | #3
Srivatsa S. Bhat wrote:
> On 02/24/2014 12:27 PM, Saravana Kannan wrote:
>> The existing code sets the per CPU policy to a non-NULL value before all
>> the steps performed during the hotplug online path is done.
>> Specifically,
>> this is done before the policy min/max, governors, etc are initialized
>> for
>> the policy.  This in turn means that calls to cpufreq_cpu_get() return a
>> non-NULL policy before the policy/CPU is ready to be used.
>>
>> To fix this, move the update of per CPU policy to a valid value after
>> all
>> the initialization steps for the policy are completed.
>>
>> Example kernel panic without this fix:
>> [  512.146185] Unable to handle kernel NULL pointer dereference at
>> virtual address 00000020
>> [  512.146195] pgd = c0003000
>> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
>> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>> <snip>
>> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
>> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
>> <snip>
>> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
>> [  512.149761] Kernel panic - not syncing: Fatal exception
>> [  513.152016] CPU0: stopping
>> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W
>> 3.10.0-gd727407-00074-g979ede8 #396
>> <snip>
>> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
>> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from
>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
>> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from
>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from
>> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
>> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from
>> [<c0afe180>] (notifier_call_chain+0x40/0x68)
>> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>> [<c02812dc>] (__cpu_notify+0x28/0x44)
>> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>]
>> (_cpu_up+0xf4/0x1dc)
>> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>]
>> (cpu_up+0x5c/0x78)
>> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>]
>> (store_online+0x44/0x74)
>> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>]
>> (sysfs_write_file+0x108/0x14c)
>> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from
>> [<c03517d4>] (vfs_write+0xd0/0x180)
>> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>]
>> (SyS_write+0x38/0x68)
>> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>]
>> (ret_fast_syscall+0x0/0x30)
>>
>> In this specific case, CPU0 set's CPU1's policy->governor in
>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor
>> in
>> __cpufreq_governor().
>>
>
> Whoa! That's horrible! Can you please explain how exactly you
> triggered this? I'm curious...

Just call cpufreq_update_policy(cpu) on a CPU and have another thread
online/offline it.

>
>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>> ---
>>  drivers/cpufreq/cpufreq.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index cb003a6..d5ceb43 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev,
>> struct subsys_interface *sif,
>>  		goto err_set_policy_cpu;
>>  	}
>>
>> -	write_lock_irqsave(&cpufreq_driver_lock, flags);
>> -	for_each_cpu(j, policy->cpus)
>> -		per_cpu(cpufreq_cpu_data, j) = policy;
>> -	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>> -
>>  	if (cpufreq_driver->get) {
>>  		policy->cur = cpufreq_driver->get(policy->cpu);
>
> If you move the per-cpu init further down, then what happens to the
> cpufreq_generic_get() that gets invoked here by some of the drivers?

While cpufreq_generic_get() was a good refactor, I think it's causing
unnecessary cyclic dependency. You need that function to not fail for a
policy to get added properly and you need a proper policy for the function
to work. I care more about fixing the panic than trying to keep
cpufreq_generic_get().

> It will almost always fail (because policy will be NULL) and hence
> CPU online will be unsuccessful (though you wont observe it because
> the error code is not bubbled up to the CPU hotplug core; perhaps we
> should).

Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq
code base. Since this new function isn't there at that point, I missed it.
Even if I did use the latest kernel, I wouldn't have hit this issue,
because the MSM cpufreq driver doesn't use this function.

>
> IMHO, we should fix the synchronization in cpufreq_init_policy().
> I notice cpufreq_update_policy() as well as __cpufreq_governor() in
> your stack trace, which means cpufreq_set_policy() was involved.
> cpufreq_update_policy() takes the policy->rwsem for write, whereas
> cpufreq_init_policy() doesn't take that lock at all, which is clearly
> wrong.
>
> My guess is that, if we fix that locking, everything will be fine and
> you won't hit the bug. Would you like to give that a shot?

While locking might need fixing, this is not about the locking though.
Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy
that they consider valid and also use it as a way to check and reject API
calls trying to manipulate an offline CPU.

Without this fix, the framework is advertising an incomplete policy object
as available. I think that breaks the CPUfreq framework APIs in a very
fundamental sense. This is a "no-no" in programming.

It's like trying to register a CPUfreq driver before getting the clocks to
control the CPU.

As for the ->get() issue, I think the framework should just call
clk_get_rate() instead of calling .get() if policy->clk is not NULL. No
point in doing framework -> driver -> framework.

Thanks,
Saravana
Saravana Kannan Feb. 24, 2014, 8:41 a.m. UTC | #4
Viresh Kumar wrote:
> On 24 February 2014 13:12, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 02/24/2014 12:27 PM, Saravana Kannan wrote:
>>> The existing code sets the per CPU policy to a non-NULL value before
>>> all
>>> the steps performed during the hotplug online path is done.
>>> Specifically,
>>> this is done before the policy min/max, governors, etc are initialized
>>> for
>>> the policy.  This in turn means that calls to cpufreq_cpu_get() return
>>> a
>>> non-NULL policy before the policy/CPU is ready to be used.
>>>
>>> To fix this, move the update of per CPU policy to a valid value after
>>> all
>>> the initialization steps for the policy are completed.
>>>
>>> Example kernel panic without this fix:
>>> [  512.146185] Unable to handle kernel NULL pointer dereference at
>>> virtual address 00000020
>>> [  512.146195] pgd = c0003000
>>> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
>>> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>>> <snip>
>>> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
>>> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
>>> <snip>
>>> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
>>> [  512.149761] Kernel panic - not syncing: Fatal exception
>>> [  513.152016] CPU0: stopping
>>> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W
>>> 3.10.0-gd727407-00074-g979ede8 #396
>>> <snip>
>>> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
>>> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from
>>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
>>> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from
>>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>>> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>>> from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
>>> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from
>>> [<c0afe180>] (notifier_call_chain+0x40/0x68)
>>> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>> [<c02812dc>] (__cpu_notify+0x28/0x44)
>>> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>]
>>> (_cpu_up+0xf4/0x1dc)
>>> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>]
>>> (cpu_up+0x5c/0x78)
>>> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>]
>>> (store_online+0x44/0x74)
>>> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>]
>>> (sysfs_write_file+0x108/0x14c)
>>> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from
>>> [<c03517d4>] (vfs_write+0xd0/0x180)
>>> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>]
>>> (SyS_write+0x38/0x68)
>>> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>]
>>> (ret_fast_syscall+0x0/0x30)
>>>
>>> In this specific case, CPU0 set's CPU1's policy->governor in
>>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor
>>> in
>>> __cpufreq_governor().
>>>
>>
>> Whoa! That's horrible! Can you please explain how exactly you
>> triggered this? I'm curious...
>
> :)
>
> Actually I didn't understood his problem very well until now. I couldn't
> get from his trace that which pointer was actually set to NULL here?
> And hence what caused this crash?
>
> Also, what Saravana just wrote here didn't looked like relevant at all.
> i.e.:  policy->governor being set to NULL.
>
> So what? It was set to NULL with a reason and it shouldn't cause
> any crashes I hope. And also its sort of wrong to say that CPU0
> has set governor for CPU1, as governor was set for a policy and
> both CPUs were part of the same policy.
>
>>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>>> ---
>>>  drivers/cpufreq/cpufreq.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>> index cb003a6..d5ceb43 100644
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev,
>>> struct subsys_interface *sif,
>>>               goto err_set_policy_cpu;
>>>       }
>>>
>>> -     write_lock_irqsave(&cpufreq_driver_lock, flags);
>>> -     for_each_cpu(j, policy->cpus)
>>> -             per_cpu(cpufreq_cpu_data, j) = policy;
>>> -     write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>>> -
>>>       if (cpufreq_driver->get) {
>>>               policy->cur = cpufreq_driver->get(policy->cpu);
>>
>> If you move the per-cpu init further down, then what happens to the
>> cpufreq_generic_get() that gets invoked here by some of the drivers?
>> It will almost always fail (because policy will be NULL) and hence
>> CPU online will be unsuccessful
>
> Thanks, I was about to point that as well.
>
>> (though you wont observe it because
>> the error code is not bubbled up to the CPU hotplug core; perhaps we
>> should).
>
> Don't know :)
>
>> IMHO, we should fix the synchronization in cpufreq_init_policy().
>> I notice cpufreq_update_policy() as well as __cpufreq_governor() in
>> your stack trace, which means cpufreq_set_policy() was involved.
>> cpufreq_update_policy() takes the policy->rwsem for write, whereas
>> cpufreq_init_policy() doesn't take that lock at all, which is clearly
>> wrong.
>>
>> My guess is that, if we fix that locking, everything will be fine and
>> you won't hit the bug. Would you like to give that a shot?
>
> I am not sure about that guess as well. I first want to know what
> went bad exactly..
>

I just replied to the other email. I think I answered both your questions
there. Sorry about mixing up CPU and policy. In my case, each CPU is
independently scalable -- so for now take CPU == policy. I'll fix it up
once we agree on the fix.

-Saravana
Viresh Kumar Feb. 24, 2014, 8:43 a.m. UTC | #5
On 24 February 2014 14:11,  <skannan@codeaurora.org> wrote:
> I just replied to the other email. I think I answered both your questions
> there. Sorry about mixing up CPU and policy. In my case, each CPU is
> independently scalable -- so for now take CPU == policy. I'll fix it up
> once we agree on the fix.

But why do you say this then?

>>> In this specific case, CPU0 set's CPU1's policy->governor in
>>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor
>>> in __cpufreq_governor().
Saravana Kannan Feb. 24, 2014, 8:47 a.m. UTC | #6
Viresh Kumar wrote:
> On 24 February 2014 14:11,  <skannan@codeaurora.org> wrote:
>> I just replied to the other email. I think I answered both your
>> questions
>> there. Sorry about mixing up CPU and policy. In my case, each CPU is
>> independently scalable -- so for now take CPU == policy. I'll fix it up
>> once we agree on the fix.
>
> But why do you say this then?

Sorry, not sure I understand what you mean.

I agree, wording in my commit text might be unclear. I'll fix it after we
agree on the code fix. In the MSM case, each CPU has it's own policy.

I'm assuming your original complaint was about my confusing wording. Maybe
that's not what you were pointing out?

-Saravana
Viresh Kumar Feb. 24, 2014, 8:50 a.m. UTC | #7
On 24 February 2014 14:17,  <skannan@codeaurora.org> wrote:
> Sorry, not sure I understand what you mean.
>
> I agree, wording in my commit text might be unclear. I'll fix it after we
> agree on the code fix. In the MSM case, each CPU has it's own policy.
>
> I'm assuming your original complaint was about my confusing wording. Maybe
> that's not what you were pointing out?

In your case each CPU has a separate policy structure as they have separately
controllable clocks. But you also said that CPU0 is setting CPU1's governor to
NULL. I don't see that happening. Each CPU sets its own governor to NULL on
init().
Saravana Kannan Feb. 24, 2014, 9 a.m. UTC | #8
Viresh Kumar wrote:
> On 24 February 2014 14:17,  <skannan@codeaurora.org> wrote:
>> Sorry, not sure I understand what you mean.
>>
>> I agree, wording in my commit text might be unclear. I'll fix it after
>> we
>> agree on the code fix. In the MSM case, each CPU has it's own policy.
>>
>> I'm assuming your original complaint was about my confusing wording.
>> Maybe
>> that's not what you were pointing out?
>
> In your case each CPU has a separate policy structure as they have
> separately
> controllable clocks. But you also said that CPU0 is setting CPU1's
> governor to
> NULL. I don't see that happening. Each CPU sets its own governor to NULL
> on
> init().

When I said "CPU0 is setting CPU1's governor to NULL", I meant
thread/context running in CPU0 is setting CPU1's governor as part of
CPU1's ONLINE notifier.

-Saravana
Viresh Kumar Feb. 24, 2014, 10:55 a.m. UTC | #9
On 24 February 2014 14:09,  <skannan@codeaurora.org> wrote:
>
> Srivatsa S. Bhat wrote:
>> On 02/24/2014 12:27 PM, Saravana Kannan wrote:
>>> The existing code sets the per CPU policy to a non-NULL value before all
>>> the steps performed during the hotplug online path is done.
>>> Specifically,
>>> this is done before the policy min/max, governors, etc are initialized
>>> for
>>> the policy.  This in turn means that calls to cpufreq_cpu_get() return a
>>> non-NULL policy before the policy/CPU is ready to be used.
>>>
>>> To fix this, move the update of per CPU policy to a valid value after
>>> all
>>> the initialization steps for the policy are completed.
>>>
>>> Example kernel panic without this fix:
>>> [  512.146185] Unable to handle kernel NULL pointer dereference at
>>> virtual address 00000020
>>> [  512.146195] pgd = c0003000
>>> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
>>> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>>> <snip>
>>> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
>>> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
>>> <snip>
>>> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
>>> [  512.149761] Kernel panic - not syncing: Fatal exception
>>> [  513.152016] CPU0: stopping
>>> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W
>>> 3.10.0-gd727407-00074-g979ede8 #396
>>> <snip>
>>> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
>>> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from
>>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
>>> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from
>>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>>> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from
>>> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
>>> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from
>>> [<c0afe180>] (notifier_call_chain+0x40/0x68)
>>> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>> [<c02812dc>] (__cpu_notify+0x28/0x44)
>>> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>]
>>> (_cpu_up+0xf4/0x1dc)
>>> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>]
>>> (cpu_up+0x5c/0x78)
>>> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>]
>>> (store_online+0x44/0x74)
>>> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>]
>>> (sysfs_write_file+0x108/0x14c)
>>> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from
>>> [<c03517d4>] (vfs_write+0xd0/0x180)
>>> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>]
>>> (SyS_write+0x38/0x68)
>>> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>]
>>> (ret_fast_syscall+0x0/0x30)
>>>
>>> In this specific case, CPU0 set's CPU1's policy->governor in
>>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor
>>> in
>>> __cpufreq_governor().
>>>
>>
>> Whoa! That's horrible! Can you please explain how exactly you
>> triggered this? I'm curious...
>
> Just call cpufreq_update_policy(cpu) on a CPU and have another thread
> online/offline it.

But would you do that? Means why would you call them this way?
cpufreq_update_policy() shouldn't be called that way I believe..

>>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>>> ---
>>>  drivers/cpufreq/cpufreq.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>> index cb003a6..d5ceb43 100644
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev,
>>> struct subsys_interface *sif,
>>>              goto err_set_policy_cpu;
>>>      }
>>>
>>> -    write_lock_irqsave(&cpufreq_driver_lock, flags);
>>> -    for_each_cpu(j, policy->cpus)
>>> -            per_cpu(cpufreq_cpu_data, j) = policy;
>>> -    write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>>> -
>>>      if (cpufreq_driver->get) {
>>>              policy->cur = cpufreq_driver->get(policy->cpu);
>>
>> If you move the per-cpu init further down, then what happens to the
>> cpufreq_generic_get() that gets invoked here by some of the drivers?
>
> While cpufreq_generic_get() was a good refactor, I think it's causing
> unnecessary cyclic dependency. You need that function to not fail for a
> policy to get added properly and you need a proper policy for the function
> to work. I care more about fixing the panic than trying to keep
> cpufreq_generic_get().

Surely, I will like a solution which would keep this as is :)..
cpufreq_generic_get() should pass..

>> It will almost always fail (because policy will be NULL) and hence
>> CPU online will be unsuccessful (though you wont observe it because
>> the error code is not bubbled up to the CPU hotplug core; perhaps we
>> should).
>
> Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq
> code base. Since this new function isn't there at that point, I missed it.
> Even if I did use the latest kernel, I wouldn't have hit this issue,
> because the MSM cpufreq driver doesn't use this function.
>
>>
>> IMHO, we should fix the synchronization in cpufreq_init_policy().
>> I notice cpufreq_update_policy() as well as __cpufreq_governor() in
>> your stack trace, which means cpufreq_set_policy() was involved.
>> cpufreq_update_policy() takes the policy->rwsem for write, whereas
>> cpufreq_init_policy() doesn't take that lock at all, which is clearly
>> wrong.
>>
>> My guess is that, if we fix that locking, everything will be fine and
>> you won't hit the bug. Would you like to give that a shot?
>
> While locking might need fixing, this is not about the locking though.
> Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy
> that they consider valid and also use it as a way to check and reject API
> calls trying to manipulate an offline CPU.
>
> Without this fix, the framework is advertising an incomplete policy object
> as available. I think that breaks the CPUfreq framework APIs in a very
> fundamental sense. This is a "no-no" in programming.
>
> It's like trying to register a CPUfreq driver before getting the clocks to
> control the CPU.

So the real question here is: What fields of 'policy' do we need to guarantee
to be initialized before policy is made available for others? And probably
that is what we need to fix.

Even your current solution would break things. For example, below will happen
before policy is set in per-cpu variable:
- CPUFREQ_CREATE_POLICY notifier will be fired and calls to do cpufreq_cpu_get()
there will fail. And hence cpufreq-stats sysfs will break.
- Governors also use cpufreq_cpu_get() and so those would also fail as they
are started from cpufreq_init_policy()..

..
Saravana Kannan Feb. 24, 2014, 8:23 p.m. UTC | #10
On 02/24/2014 02:55 AM, Viresh Kumar wrote:
> On 24 February 2014 14:09,  <skannan@codeaurora.org> wrote:
>>
>> Srivatsa S. Bhat wrote:
>>> On 02/24/2014 12:27 PM, Saravana Kannan wrote:
>>>> The existing code sets the per CPU policy to a non-NULL value before all
>>>> the steps performed during the hotplug online path is done.
>>>> Specifically,
>>>> this is done before the policy min/max, governors, etc are initialized
>>>> for
>>>> the policy.  This in turn means that calls to cpufreq_cpu_get() return a
>>>> non-NULL policy before the policy/CPU is ready to be used.
>>>>
>>>> To fix this, move the update of per CPU policy to a valid value after
>>>> all
>>>> the initialization steps for the policy are completed.
>>>>
>>>> Example kernel panic without this fix:
>>>> [  512.146185] Unable to handle kernel NULL pointer dereference at
>>>> virtual address 00000020
>>>> [  512.146195] pgd = c0003000
>>>> [  512.146213] [00000020] *pgd=80000000004003, *pmd=00000000
>>>> [  512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>>>> <snip>
>>>> [  512.146297] PC is at __cpufreq_governor+0x10/0x1ac
>>>> [  512.146312] LR is at cpufreq_update_policy+0x114/0x150
>>>> <snip>
>>>> [  512.149740] ---[ end trace f23a8defea6cd706 ]---
>>>> [  512.149761] Kernel panic - not syncing: Fatal exception
>>>> [  513.152016] CPU0: stopping
>>>> [  513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G      D W
>>>> 3.10.0-gd727407-00074-g979ede8 #396
>>>> <snip>
>>>> [  513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>>> [  513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58)
>>>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>>> [  513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c)
>>>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8)
>>>> [  513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from
>>>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98)
>>>> [  513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from
>>>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4)
>>>> [  513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from
>>>> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84)
>>>> [  513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from
>>>> [<c0afe180>] (notifier_call_chain+0x40/0x68)
>>>> [  513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from
>>>> [<c02812dc>] (__cpu_notify+0x28/0x44)
>>>> [  513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>]
>>>> (_cpu_up+0xf4/0x1dc)
>>>> [  513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>]
>>>> (cpu_up+0x5c/0x78)
>>>> [  513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>]
>>>> (store_online+0x44/0x74)
>>>> [  513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>]
>>>> (sysfs_write_file+0x108/0x14c)
>>>> [  513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from
>>>> [<c03517d4>] (vfs_write+0xd0/0x180)
>>>> [  513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>]
>>>> (SyS_write+0x38/0x68)
>>>> [  513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>]
>>>> (ret_fast_syscall+0x0/0x30)
>>>>
>>>> In this specific case, CPU0 set's CPU1's policy->governor in
>>>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor
>>>> in
>>>> __cpufreq_governor().
>>>>
>>>
>>> Whoa! That's horrible! Can you please explain how exactly you
>>> triggered this? I'm curious...
>>
>> Just call cpufreq_update_policy(cpu) on a CPU and have another thread
>> online/offline it.
>
> But would you do that? Means why would you call them this way?
> cpufreq_update_policy() shouldn't be called that way I believe..

I was simplifying the scenario that causes it. We change the min/max 
using ADJUST notifiers for multiple reasons -- thermal being one of them.

thermal/cpu_cooling is one example of it.

So, cpufreq_update_policy() can be called on any CPU. If that races with 
someone offlining a CPU and onlining it, you'll get this crash.

>>>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
>>>> ---
>>>>   drivers/cpufreq/cpufreq.c | 10 +++++-----
>>>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>>> index cb003a6..d5ceb43 100644
>>>> --- a/drivers/cpufreq/cpufreq.c
>>>> +++ b/drivers/cpufreq/cpufreq.c
>>>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev,
>>>> struct subsys_interface *sif,
>>>>               goto err_set_policy_cpu;
>>>>       }
>>>>
>>>> -    write_lock_irqsave(&cpufreq_driver_lock, flags);
>>>> -    for_each_cpu(j, policy->cpus)
>>>> -            per_cpu(cpufreq_cpu_data, j) = policy;
>>>> -    write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>>>> -
>>>>       if (cpufreq_driver->get) {
>>>>               policy->cur = cpufreq_driver->get(policy->cpu);
>>>
>>> If you move the per-cpu init further down, then what happens to the
>>> cpufreq_generic_get() that gets invoked here by some of the drivers?
>>
>> While cpufreq_generic_get() was a good refactor, I think it's causing
>> unnecessary cyclic dependency. You need that function to not fail for a
>> policy to get added properly and you need a proper policy for the function
>> to work. I care more about fixing the panic than trying to keep
>> cpufreq_generic_get().
>
> Surely, I will like a solution which would keep this as is :)..
> cpufreq_generic_get() should pass..

The idea would exist, but we can just call cpufreq_generic_get() and 
pass it policy->clk if it is not NULL. Does that work for you?

>>> It will almost always fail (because policy will be NULL) and hence
>>> CPU online will be unsuccessful (though you wont observe it because
>>> the error code is not bubbled up to the CPU hotplug core; perhaps we
>>> should).
>>
>> Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq
>> code base. Since this new function isn't there at that point, I missed it.
>> Even if I did use the latest kernel, I wouldn't have hit this issue,
>> because the MSM cpufreq driver doesn't use this function.
>>
>>>
>>> IMHO, we should fix the synchronization in cpufreq_init_policy().
>>> I notice cpufreq_update_policy() as well as __cpufreq_governor() in
>>> your stack trace, which means cpufreq_set_policy() was involved.
>>> cpufreq_update_policy() takes the policy->rwsem for write, whereas
>>> cpufreq_init_policy() doesn't take that lock at all, which is clearly
>>> wrong.
>>>
>>> My guess is that, if we fix that locking, everything will be fine and
>>> you won't hit the bug. Would you like to give that a shot?
>>
>> While locking might need fixing, this is not about the locking though.
>> Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy
>> that they consider valid and also use it as a way to check and reject API
>> calls trying to manipulate an offline CPU.
>>
>> Without this fix, the framework is advertising an incomplete policy object
>> as available. I think that breaks the CPUfreq framework APIs in a very
>> fundamental sense. This is a "no-no" in programming.
>>
>> It's like trying to register a CPUfreq driver before getting the clocks to
>> control the CPU.
>
> So the real question here is: What fields of 'policy' do we need to guarantee
> to be initialized before policy is made available for others? And probably
> that is what we need to fix.

That's going to be hard to define since future uses could be looking at 
different fields. What is the API semantics of cpufreq_cpu_get(). I 
can't ever imagine it being correct for any API to return a partially 
initialized data structure.

> Even your current solution would break things. For example, below will happen
> before policy is set in per-cpu variable:
> - CPUFREQ_CREATE_POLICY notifier will be fired and calls to do cpufreq_cpu_get()
> there will fail. And hence cpufreq-stats sysfs will break.
I did this on 3.12. I'll fix it up to handle this one.

> - Governors also use cpufreq_cpu_get() and so those would also fail as they
> are started from cpufreq_init_policy()..
The only use of this in governors is in update_sampling_rate() and the 
behavior after this fix would be correct in that case. If the policy is 
not fully initialized -- that update doesn't get to go through.

All other uses of this API fall under one of these categories:
* CPUs are already fully offline whenever it's called.
* Already get the real policy as part of the notifier so don't need to
   call cpufreq_cpu_get

If I find any that doesn't fit the above, I would be glad to fix that if 
it's pointed out.

Regards,
Saravana
Viresh Kumar Feb. 25, 2014, 8:50 a.m. UTC | #11
On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote:
> I was simplifying the scenario that causes it. We change the min/max using
> ADJUST notifiers for multiple reasons -- thermal being one of them.
>
> thermal/cpu_cooling is one example of it.

Just to understand the clear picture, you are actually hitting this bug? Or
is this only a theoretical bug?

> So, cpufreq_update_policy() can be called on any CPU. If that races with
> someone offlining a CPU and onlining it, you'll get this crash.

Then shouldn't that be fixed by locks? I think yes. That makes me agree with
Srivatsa more here.

Though I would say that your argument was also valid that 'policy' shouldn't be
up for sale unless it is prepared to. And for that reason only I
floated that question
earlier: What exactly we need to make sure is initialized in policy? Because
policy might keep changing in future as well and that needs locks to protect
that stuff. Like min/max/governor/ etc..

So, probably a solution here might be a mix of both. Initialize policy to this
minimum level and then make sure locking is used correctly..

> The idea would exist, but we can just call cpufreq_generic_get() and pass it
> policy->clk if it is not NULL. Does that work for you?

No. Not all drivers implement clk interface. And so clk doesn't look to be the
right parameter. I thought maybe 'policy' can be the right parameter and
then people can get use policy->cpu to get cpu id out of it.

But even that doesn't look to be a great idea. X86 drivers may share policy
structure for CPUs that don't actually share a clock line. And so they do need
right CPU number as parameter instead of policy. As they might be doing
some tricky stuff there. Also, we need to make sure that ->get() returns
the frequency at which CPU x is running.

>> So the real question here is: What fields of 'policy' do we need to
>> guarantee
>> to be initialized before policy is made available for others? And probably
>> that is what we need to fix.
>
> That's going to be hard to define since future uses could be looking at
> different fields. What is the API semantics of cpufreq_cpu_get(). I can't
> ever imagine it being correct for any API to return a partially initialized
> data structure.

I do agree that we can't broadcast 'policy' before it is usable. And that's
why I am asking about the right place where we are sure that it is ready..

>> Even your current solution would break things. For example, below will
>> happen
>> before policy is set in per-cpu variable:
>> - CPUFREQ_CREATE_POLICY notifier will be fired and calls to do
>> cpufreq_cpu_get()
>> there will fail. And hence cpufreq-stats sysfs will break.
>
> I did this on 3.12. I'll fix it up to handle this one.

This can be moved down without much issues I believe.

>> - Governors also use cpufreq_cpu_get() and so those would also fail as
>> they
>> are started from cpufreq_init_policy()..
>
> The only use of this in governors is in update_sampling_rate() and the
> behavior after this fix would be correct in that case. If the policy is not
> fully initialized -- that update doesn't get to go through.

hmm.. but even that looks odd. We have reached upto governors but
policy isn't available to be used? It should have been ready by now?

> All other uses of this API fall under one of these categories:
> * CPUs are already fully offline whenever it's called.
> * Already get the real policy as part of the notifier so don't need to
>   call cpufreq_cpu_get
>
> If I find any that doesn't fit the above, I would be glad to fix that if
> it's pointed out.

I have just sent out a patchset to you and others, would be great if
you can give it a review/test ..
Rafael J. Wysocki Feb. 25, 2014, 1:04 p.m. UTC | #12
On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote:
> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote:
> > I was simplifying the scenario that causes it. We change the min/max using
> > ADJUST notifiers for multiple reasons -- thermal being one of them.
> >
> > thermal/cpu_cooling is one example of it.
> 
> Just to understand the clear picture, you are actually hitting this bug? Or
> is this only a theoretical bug?
> 
> > So, cpufreq_update_policy() can be called on any CPU. If that races with
> > someone offlining a CPU and onlining it, you'll get this crash.
> 
> Then shouldn't that be fixed by locks? I think yes. That makes me agree with
> Srivatsa more here.
> 
> Though I would say that your argument was also valid that 'policy' shouldn't be
> up for sale unless it is prepared to. And for that reason only I
> floated that question
> earlier: What exactly we need to make sure is initialized in policy? Because
> policy might keep changing in future as well and that needs locks to protect
> that stuff. Like min/max/governor/ etc..

Well, that depends on what the current users expect it to look like initially.
It should be initialized to the point in which all of them can handle it
correctly.

> So, probably a solution here might be a mix of both. Initialize policy to this
> minimum level and then make sure locking is used correctly..

Yes.

> > The idea would exist, but we can just call cpufreq_generic_get() and pass it
> > policy->clk if it is not NULL. Does that work for you?
> 
> No. Not all drivers implement clk interface. And so clk doesn't look to be the
> right parameter. I thought maybe 'policy' can be the right parameter and
> then people can get use policy->cpu to get cpu id out of it.
> 
> But even that doesn't look to be a great idea. X86 drivers may share policy
> structure for CPUs that don't actually share a clock line. And so they do need
> right CPU number as parameter instead of policy. As they might be doing
> some tricky stuff there. Also, we need to make sure that ->get() returns
> the frequency at which CPU x is running.

That's not going to work in at least some cases anyway, because for some types
of HW we simply can't retrieve the current frequency in a non-racy way.

Thanks!
Viresh Kumar Feb. 25, 2014, 2:40 p.m. UTC | #13
On 25 February 2014 18:34, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> Well, that depends on what the current users expect it to look like initially.
> It should be initialized to the point in which all of them can handle it
> correctly.

Exactly.

> That's not going to work in at least some cases anyway, because for some types
> of HW we simply can't retrieve the current frequency in a non-racy way.

I agree. But this is all that we can guarantee here :)
Saravana Kannan Feb. 25, 2014, 9:11 p.m. UTC | #14
On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote:
> On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote:
>> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote:
>>> I was simplifying the scenario that causes it. We change the min/max using
>>> ADJUST notifiers for multiple reasons -- thermal being one of them.
>>>
>>> thermal/cpu_cooling is one example of it.
>>
>> Just to understand the clear picture, you are actually hitting this bug? Or
>> is this only a theoretical bug?
>>
This is a real bug. But the actual caller of cpufreq_update_policy() is 
a driver that's local to our tree. I'm just giving examples of upstream 
code that act in a similar way.

>>> So, cpufreq_update_policy() can be called on any CPU. If that races with
>>> someone offlining a CPU and onlining it, you'll get this crash.
>>
>> Then shouldn't that be fixed by locks? I think yes. That makes me agree with
>> Srivatsa more here.
>>
>> Though I would say that your argument was also valid that 'policy' shouldn't be
>> up for sale unless it is prepared to. And for that reason only I
>> floated that question
>> earlier: What exactly we need to make sure is initialized in policy? Because
>> policy might keep changing in future as well and that needs locks to protect
>> that stuff. Like min/max/governor/ etc..
>
> Well, that depends on what the current users expect it to look like initially.
> It should be initialized to the point in which all of them can handle it
> correctly.

Yes, so let's not make it available until all of it is initialized. I 
don't like the piece meal check. 6 months down the lane someone making 
changes might not remember this. The problem also applies for drivers 
that might not be upstreamed, etc.

>> So, probably a solution here might be a mix of both. Initialize policy to this
>> minimum level and then make sure locking is used correctly..
>
> Yes.

Rafael, It's not clear what you mean by the yes. Do you want to 
initialize it partly and make it available. I think that's always wrong.

>>> The idea would exist, but we can just call cpufreq_generic_get() and pass it
>>> policy->clk if it is not NULL. Does that work for you?
>>
>> No. Not all drivers implement clk interface. And so clk doesn't look to be the
>> right parameter. I thought maybe 'policy' can be the right parameter and
>> then people can get use policy->cpu to get cpu id out of it.
>>
>> But even that doesn't look to be a great idea. X86 drivers may share policy
>> structure for CPUs that don't actually share a clock line. And so they do need
>> right CPU number as parameter instead of policy. As they might be doing
>> some tricky stuff there. Also, we need to make sure that ->get() returns
>> the frequency at which CPU x is running.
>
> That's not going to work in at least some cases anyway, because for some types
> of HW we simply can't retrieve the current frequency in a non-racy way.

I think there's been a misunderstanding of what I'm proposing. The 
reference to policy->clk is only to get rid of the dependency that 
cpufreq_generic_get() has on the per cpu policy variable being filled. 
You can do that by just removing calls to get _IF_ clk is filled in.

Viresh,

I'll look at the patches later today and respond. Although, I would have 
been nice you had helped me fix any issues with my patch than coming up 
with new ones. Kinda removes to motivation for submitting patches upstream.

Regards,
Saravana
Rafael J. Wysocki Feb. 25, 2014, 10:41 p.m. UTC | #15
On Tuesday, February 25, 2014 01:11:27 PM Saravana Kannan wrote:
> On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote:
> > On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote:
> >> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote:
> >>> I was simplifying the scenario that causes it. We change the min/max using
> >>> ADJUST notifiers for multiple reasons -- thermal being one of them.
> >>>
> >>> thermal/cpu_cooling is one example of it.
> >>
> >> Just to understand the clear picture, you are actually hitting this bug? Or
> >> is this only a theoretical bug?
> >>
> This is a real bug. But the actual caller of cpufreq_update_policy() is 
> a driver that's local to our tree. I'm just giving examples of upstream 
> code that act in a similar way.
> 
> >>> So, cpufreq_update_policy() can be called on any CPU. If that races with
> >>> someone offlining a CPU and onlining it, you'll get this crash.
> >>
> >> Then shouldn't that be fixed by locks? I think yes. That makes me agree with
> >> Srivatsa more here.
> >>
> >> Though I would say that your argument was also valid that 'policy' shouldn't be
> >> up for sale unless it is prepared to. And for that reason only I
> >> floated that question
> >> earlier: What exactly we need to make sure is initialized in policy? Because
> >> policy might keep changing in future as well and that needs locks to protect
> >> that stuff. Like min/max/governor/ etc..
> >
> > Well, that depends on what the current users expect it to look like initially.
> > It should be initialized to the point in which all of them can handle it
> > correctly.
> 
> Yes, so let's not make it available until all of it is initialized.

And is "fully initialized" actually well defined?

> I don't like the piece meal check. 6 months down the lane someone making 
> changes might not remember this. The problem also applies for drivers 
> that might not be upstreamed, etc.

Please don't bring up out-of-the-tree drivers argument in mainline discussions,
it is meaningless here.

> >> So, probably a solution here might be a mix of both. Initialize policy to this
> >> minimum level and then make sure locking is used correctly..
> >
> > Yes.
> 
> Rafael, It's not clear what you mean by the yes. Do you want to 
> initialize it partly and make it available. I think that's always wrong.

So what actually is your porposal?

> >>> The idea would exist, but we can just call cpufreq_generic_get() and pass it
> >>> policy->clk if it is not NULL. Does that work for you?
> >>
> >> No. Not all drivers implement clk interface. And so clk doesn't look to be the
> >> right parameter. I thought maybe 'policy' can be the right parameter and
> >> then people can get use policy->cpu to get cpu id out of it.
> >>
> >> But even that doesn't look to be a great idea. X86 drivers may share policy
> >> structure for CPUs that don't actually share a clock line. And so they do need
> >> right CPU number as parameter instead of policy. As they might be doing
> >> some tricky stuff there. Also, we need to make sure that ->get() returns
> >> the frequency at which CPU x is running.
> >
> > That's not going to work in at least some cases anyway, because for some types
> > of HW we simply can't retrieve the current frequency in a non-racy way.
> 
> I think there's been a misunderstanding of what I'm proposing. The 
> reference to policy->clk is only to get rid of the dependency that 
> cpufreq_generic_get() has on the per cpu policy variable being filled. 
> You can do that by just removing calls to get _IF_ clk is filled in.

I still have a little problem understanding what you mean exactly.  At least
please explain the last sentence.

> Viresh,
> 
> I'll look at the patches later today and respond. Although, I would have 
> been nice you had helped me fix any issues with my patch than coming up 
> with new ones. Kinda removes to motivation for submitting patches upstream.

Everyone is free to post patches at any time during the discussion.  Viresh is
as well as you are.

Thanks!
Saravana Kannan Feb. 26, 2014, 1:48 a.m. UTC | #16
On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote:
> On Tuesday, February 25, 2014 01:11:27 PM Saravana Kannan wrote:
>> On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote:
>>> On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote:
>>>> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote:
>>>>> I was simplifying the scenario that causes it. We change the min/max using
>>>>> ADJUST notifiers for multiple reasons -- thermal being one of them.
>>>>>
>>>>> thermal/cpu_cooling is one example of it.
>>>>
>>>> Just to understand the clear picture, you are actually hitting this bug? Or
>>>> is this only a theoretical bug?
>>>>
>> This is a real bug. But the actual caller of cpufreq_update_policy() is
>> a driver that's local to our tree. I'm just giving examples of upstream
>> code that act in a similar way.
>>
>>>>> So, cpufreq_update_policy() can be called on any CPU. If that races with
>>>>> someone offlining a CPU and onlining it, you'll get this crash.
>>>>
>>>> Then shouldn't that be fixed by locks? I think yes. That makes me agree with
>>>> Srivatsa more here.
>>>>
>>>> Though I would say that your argument was also valid that 'policy' shouldn't be
>>>> up for sale unless it is prepared to. And for that reason only I
>>>> floated that question
>>>> earlier: What exactly we need to make sure is initialized in policy? Because
>>>> policy might keep changing in future as well and that needs locks to protect
>>>> that stuff. Like min/max/governor/ etc..
>>>
>>> Well, that depends on what the current users expect it to look like initially.
>>> It should be initialized to the point in which all of them can handle it
>>> correctly.
>>
>> Yes, so let's not make it available until all of it is initialized.
>
> And is "fully initialized" actually well defined?

The point in add dev/hot plug path after which we will no longer change 
policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU / 
CPUFRE_NOTIFY notifiers.

Pretty much the end of __cpufreq_add_dev() so that it's after:
- cpufreq_init_policy()
- And the update of userpolicy fields that after thie init call

>> I don't like the piece meal check. 6 months down the lane someone making
>> changes might not remember this. The problem also applies for drivers
>> that might not be upstreamed, etc.
>
> Please don't bring up out-of-the-tree drivers argument in mainline discussions,
> it is meaningless here.

Sure. This also applies to in-tree code. The 6 months down the lane, 
what fields are being looked at could still change and we might slip up 
on making sure that the policy is not made available before that field 
is initialized.

>>>> So, probably a solution here might be a mix of both. Initialize policy to this
>>>> minimum level and then make sure locking is used correctly..
>>>
>>> Yes.
>>
>> Rafael, It's not clear what you mean by the yes. Do you want to
>> initialize it partly and make it available. I think that's always wrong.
>
> So what actually is your porposal?
>

Same as reply to "fully initialized". We make the policy after it's 
fully initialized. After that point, any changes to the policy can be 
tracked by registering for CPUFREQ_UPDATE_POLICY_CPU / CPUFREQ_NOTIFY.

>>>>> The idea would exist, but we can just call cpufreq_generic_get() and pass it
>>>>> policy->clk if it is not NULL. Does that work for you?
>>>>
>>>> No. Not all drivers implement clk interface. And so clk doesn't look to be the
>>>> right parameter. I thought maybe 'policy' can be the right parameter and
>>>> then people can get use policy->cpu to get cpu id out of it.
>>>>
>>>> But even that doesn't look to be a great idea. X86 drivers may share policy
>>>> structure for CPUs that don't actually share a clock line. And so they do need
>>>> right CPU number as parameter instead of policy. As they might be doing
>>>> some tricky stuff there. Also, we need to make sure that ->get() returns
>>>> the frequency at which CPU x is running.
>>>
>>> That's not going to work in at least some cases anyway, because for some types
>>> of HW we simply can't retrieve the current frequency in a non-racy way.
>>
>> I think there's been a misunderstanding of what I'm proposing. The
>> reference to policy->clk is only to get rid of the dependency that
>> cpufreq_generic_get() has on the per cpu policy variable being filled.
>> You can do that by just removing calls to get _IF_ clk is filled in.
>
> I still have a little problem understanding what you mean exactly.  At least
> please explain the last sentence.

Ok, here's some pseudo code to explain it better:

Something like, replace all calls to cpufreq_driver->get with 
__cpufreq_driver_get() with the fn being something like:

unsigned int __cpufreq_driver_get(cpufreq_policy *policy)
{
   if (policy->clk)
      return clk_get_rate(policy->clk) / 1000;
   else
      return cpufreq_driver->get(policy->cpu);
}

That way, we don't have to depend on cpufreq_cpu_get() to return a 
policy in cpufreq_generic_get() for the existing add dev code path to 
complete without an error (after my patch to advertise the policy struct 
after it's fully initialized)

This also avoid unnecessary function calls to ->get() when we just as 
well know that it's going to call back into cpufreq_generic_get() and 
get the clock rate. We can just do that in the framework.

-Saravana
Viresh Kumar Feb. 26, 2014, 5:20 a.m. UTC | #17
On 26 February 2014 02:41, Saravana Kannan <skannan@codeaurora.org> wrote:
> On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote:
>> On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote:

> I think there's been a misunderstanding of what I'm proposing. The reference
> to policy->clk is only to get rid of the dependency that
> cpufreq_generic_get() has on the per cpu policy variable being filled. You
> can do that by just removing calls to get _IF_ clk is filled in.

cpufreq_cpu_get() can be called by other drivers as well, which may not have
clock interface implemented for them. We can't stop them from calling it.

> I'll look at the patches later today and respond. Although, I would have
> been nice you had helped me fix any issues with my patch than coming up with
> new ones. Kinda removes to motivation for submitting patches upstream.

Sorry if I demotivated you at all :)

I just wanted to get to a quick-fix and wasn't interested in getting
my patch count
up. Thought that isn't always bad :)

I sent my patches as they were very different then your approach. Obviously, I
wouldn't have done this otherwise :)
Viresh Kumar Feb. 26, 2014, 6:02 a.m. UTC | #18
On 26 February 2014 07:18, Saravana Kannan <skannan@codeaurora.org> wrote:
> On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote:

>> And is "fully initialized" actually well defined?
>
> The point in add dev/hot plug path after which we will no longer change
> policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU /
> CPUFRE_NOTIFY notifiers.

Okay..

> Pretty much the end of __cpufreq_add_dev() so that it's after:
> - cpufreq_init_policy()
> - And the update of userpolicy fields that after thie init call

No. In that case it can be considered initialized before cpufreq_init_policy().
As we do send CPUFREQ_NOTIFY after that from cpufreq_init_policy()->
cpufreq_set_policy().

There are two types of fields within policy, some are very basic: cpu/min/max/
affected_cpus/related_cpus

some are advanced: sysfs/governors/..

And as a rule you have to get policy->rwsem lock before accessing policy
members. We might not have followed it very well for small things like cpu.

And so if you are doing anything over that, please use a lock and that is
already present in cpufreq_update_policy().

With my latest patchset that I sent yesterday, locking is improved and now
a policy will be usable only after the rwsem is released. And that should be
fine. And so making it available in the per-cpu variable after all the necessary
fields are filled looks fine to me. And so I don't think we need to move it
after call to cpufreq_init_policy(maybe a better name to this function is
required)..

> Ok, here's some pseudo code to explain it better:
>
> Something like, replace all calls to cpufreq_driver->get with
> __cpufreq_driver_get() with the fn being something like:
>
> unsigned int __cpufreq_driver_get(cpufreq_policy *policy)
> {
>   if (policy->clk)
>      return clk_get_rate(policy->clk) / 1000;
>   else
>      return cpufreq_driver->get(policy->cpu);

This part may still use cpufreq_cpu_get().

> }

Drivers are free to have their implementation of ->get() even
if they have a valid policy->clk field..
Saravana Kannan Feb. 26, 2014, 8:20 p.m. UTC | #19
On 02/25/2014 10:02 PM, Viresh Kumar wrote:
> On 26 February 2014 07:18, Saravana Kannan <skannan@codeaurora.org> wrote:
>> On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote:
>
>>> And is "fully initialized" actually well defined?
>>
>> The point in add dev/hot plug path after which we will no longer change
>> policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU /
>> CPUFRE_NOTIFY notifiers.
>
> Okay..
>
>> Pretty much the end of __cpufreq_add_dev() so that it's after:
>> - cpufreq_init_policy()
>> - And the update of userpolicy fields that after thie init call
>
> No. In that case it can be considered initialized before cpufreq_init_policy().
> As we do send CPUFREQ_NOTIFY after that from cpufreq_init_policy()->
> cpufreq_set_policy().

Ok, valid hole in my definition of "fully initialized".
>
> There are two types of fields within policy, some are very basic: cpu/min/max/
> affected_cpus/related_cpus
>
> some are advanced: sysfs/governors/..
>
> And as a rule you have to get policy->rwsem lock before accessing policy
> members. We might not have followed it very well for small things like cpu.
>
> And so if you are doing anything over that, please use a lock and that is
> already present in cpufreq_update_policy().
>
> With my latest patchset that I sent yesterday, locking is improved and now
> a policy will be usable only after the rwsem is released. And that should be
> fine. And so making it available in the per-cpu variable after all the necessary
> fields are filled looks fine to me. And so I don't think we need to move it
> after call to cpufreq_init_policy(maybe a better name to this function is
> required)..

I'll take a closer look. Internal tree cpufreq code is in 3.12, so 
back-porting all the cpufreq changes and testing it can take a bit of 
time. Will get back on this.

-Saravana
diff mbox

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index cb003a6..d5ceb43 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1109,11 +1109,6 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 		goto err_set_policy_cpu;
 	}
 
-	write_lock_irqsave(&cpufreq_driver_lock, flags);
-	for_each_cpu(j, policy->cpus)
-		per_cpu(cpufreq_cpu_data, j) = policy;
-	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
-
 	if (cpufreq_driver->get) {
 		policy->cur = cpufreq_driver->get(policy->cpu);
 		if (!policy->cur) {
@@ -1207,6 +1202,11 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 		policy->user_policy.governor = policy->governor;
 	}
 
+	write_lock_irqsave(&cpufreq_driver_lock, flags);
+	for_each_cpu(j, policy->cpus)
+		per_cpu(cpufreq_cpu_data, j) = policy;
+	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+
 	kobject_uevent(&policy->kobj, KOBJ_ADD);
 	up_read(&cpufreq_rwsem);