Message ID | 1393225072-3997-1-git-send-email-skannan@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 02/24/2014 12:27 PM, Saravana Kannan wrote: > The existing code sets the per CPU policy to a non-NULL value before all > the steps performed during the hotplug online path is done. Specifically, > this is done before the policy min/max, governors, etc are initialized for > the policy. This in turn means that calls to cpufreq_cpu_get() return a > non-NULL policy before the policy/CPU is ready to be used. > > To fix this, move the update of per CPU policy to a valid value after all > the initialization steps for the policy are completed. > > Example kernel panic without this fix: > [ 512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020 > [ 512.146195] pgd = c0003000 > [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 > [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM > <snip> > [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac > [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 > <snip> > [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- > [ 512.149761] Kernel panic - not syncing: Fatal exception > [ 513.152016] CPU0: stopping > [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W 3.10.0-gd727407-00074-g979ede8 #396 > <snip> > [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) > [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) > [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) > [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) > [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) > [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) > [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68) > [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44) > [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc) > [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78) > [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74) > [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c) > [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180) > [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68) > [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30) > > In this specific case, CPU0 set's CPU1's policy->governor in > cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in > __cpufreq_governor(). > Whoa! That's horrible! Can you please explain how exactly you triggered this? I'm curious... > Signed-off-by: Saravana Kannan <skannan@codeaurora.org> > --- > drivers/cpufreq/cpufreq.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index cb003a6..d5ceb43 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif, > goto err_set_policy_cpu; > } > > - write_lock_irqsave(&cpufreq_driver_lock, flags); > - for_each_cpu(j, policy->cpus) > - per_cpu(cpufreq_cpu_data, j) = policy; > - write_unlock_irqrestore(&cpufreq_driver_lock, flags); > - > if (cpufreq_driver->get) { > policy->cur = cpufreq_driver->get(policy->cpu); If you move the per-cpu init further down, then what happens to the cpufreq_generic_get() that gets invoked here by some of the drivers? It will almost always fail (because policy will be NULL) and hence CPU online will be unsuccessful (though you wont observe it because the error code is not bubbled up to the CPU hotplug core; perhaps we should). IMHO, we should fix the synchronization in cpufreq_init_policy(). I notice cpufreq_update_policy() as well as __cpufreq_governor() in your stack trace, which means cpufreq_set_policy() was involved. cpufreq_update_policy() takes the policy->rwsem for write, whereas cpufreq_init_policy() doesn't take that lock at all, which is clearly wrong. My guess is that, if we fix that locking, everything will be fine and you won't hit the bug. Would you like to give that a shot? Regards, Srivatsa S. Bhat > if (!policy->cur) { > @@ -1207,6 +1202,11 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif, > policy->user_policy.governor = policy->governor; > } > > + write_lock_irqsave(&cpufreq_driver_lock, flags); > + for_each_cpu(j, policy->cpus) > + per_cpu(cpufreq_cpu_data, j) = policy; > + write_unlock_irqrestore(&cpufreq_driver_lock, flags); > + > kobject_uevent(&policy->kobj, KOBJ_ADD); > up_read(&cpufreq_rwsem); >
On 24 February 2014 13:12, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote: > On 02/24/2014 12:27 PM, Saravana Kannan wrote: >> The existing code sets the per CPU policy to a non-NULL value before all >> the steps performed during the hotplug online path is done. Specifically, >> this is done before the policy min/max, governors, etc are initialized for >> the policy. This in turn means that calls to cpufreq_cpu_get() return a >> non-NULL policy before the policy/CPU is ready to be used. >> >> To fix this, move the update of per CPU policy to a valid value after all >> the initialization steps for the policy are completed. >> >> Example kernel panic without this fix: >> [ 512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020 >> [ 512.146195] pgd = c0003000 >> [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 >> [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM >> <snip> >> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac >> [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 >> <snip> >> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- >> [ 512.149761] Kernel panic - not syncing: Fatal exception >> [ 513.152016] CPU0: stopping >> [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W 3.10.0-gd727407-00074-g979ede8 #396 >> <snip> >> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >> [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >> [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) >> [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) >> [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >> [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) >> [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68) >> [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44) >> [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc) >> [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78) >> [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74) >> [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c) >> [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180) >> [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68) >> [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30) >> >> In this specific case, CPU0 set's CPU1's policy->governor in >> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in >> __cpufreq_governor(). >> > > Whoa! That's horrible! Can you please explain how exactly you > triggered this? I'm curious... :) Actually I didn't understood his problem very well until now. I couldn't get from his trace that which pointer was actually set to NULL here? And hence what caused this crash? Also, what Saravana just wrote here didn't looked like relevant at all. i.e.: policy->governor being set to NULL. So what? It was set to NULL with a reason and it shouldn't cause any crashes I hope. And also its sort of wrong to say that CPU0 has set governor for CPU1, as governor was set for a policy and both CPUs were part of the same policy. >> Signed-off-by: Saravana Kannan <skannan@codeaurora.org> >> --- >> drivers/cpufreq/cpufreq.c | 10 +++++----- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >> index cb003a6..d5ceb43 100644 >> --- a/drivers/cpufreq/cpufreq.c >> +++ b/drivers/cpufreq/cpufreq.c >> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif, >> goto err_set_policy_cpu; >> } >> >> - write_lock_irqsave(&cpufreq_driver_lock, flags); >> - for_each_cpu(j, policy->cpus) >> - per_cpu(cpufreq_cpu_data, j) = policy; >> - write_unlock_irqrestore(&cpufreq_driver_lock, flags); >> - >> if (cpufreq_driver->get) { >> policy->cur = cpufreq_driver->get(policy->cpu); > > If you move the per-cpu init further down, then what happens to the > cpufreq_generic_get() that gets invoked here by some of the drivers? > It will almost always fail (because policy will be NULL) and hence > CPU online will be unsuccessful Thanks, I was about to point that as well. > (though you wont observe it because > the error code is not bubbled up to the CPU hotplug core; perhaps we > should). Don't know :) > IMHO, we should fix the synchronization in cpufreq_init_policy(). > I notice cpufreq_update_policy() as well as __cpufreq_governor() in > your stack trace, which means cpufreq_set_policy() was involved. > cpufreq_update_policy() takes the policy->rwsem for write, whereas > cpufreq_init_policy() doesn't take that lock at all, which is clearly > wrong. > > My guess is that, if we fix that locking, everything will be fine and > you won't hit the bug. Would you like to give that a shot? I am not sure about that guess as well. I first want to know what went bad exactly..
Srivatsa S. Bhat wrote: > On 02/24/2014 12:27 PM, Saravana Kannan wrote: >> The existing code sets the per CPU policy to a non-NULL value before all >> the steps performed during the hotplug online path is done. >> Specifically, >> this is done before the policy min/max, governors, etc are initialized >> for >> the policy. This in turn means that calls to cpufreq_cpu_get() return a >> non-NULL policy before the policy/CPU is ready to be used. >> >> To fix this, move the update of per CPU policy to a valid value after >> all >> the initialization steps for the policy are completed. >> >> Example kernel panic without this fix: >> [ 512.146185] Unable to handle kernel NULL pointer dereference at >> virtual address 00000020 >> [ 512.146195] pgd = c0003000 >> [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 >> [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM >> <snip> >> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac >> [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 >> <snip> >> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- >> [ 512.149761] Kernel panic - not syncing: Fatal exception >> [ 513.152016] CPU0: stopping >> [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W >> 3.10.0-gd727407-00074-g979ede8 #396 >> <snip> >> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >> [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >> [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) >> [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from >> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) >> [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from >> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >> [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from >> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) >> [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from >> [<c0afe180>] (notifier_call_chain+0x40/0x68) >> [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >> [<c02812dc>] (__cpu_notify+0x28/0x44) >> [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] >> (_cpu_up+0xf4/0x1dc) >> [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] >> (cpu_up+0x5c/0x78) >> [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] >> (store_online+0x44/0x74) >> [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] >> (sysfs_write_file+0x108/0x14c) >> [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from >> [<c03517d4>] (vfs_write+0xd0/0x180) >> [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] >> (SyS_write+0x38/0x68) >> [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] >> (ret_fast_syscall+0x0/0x30) >> >> In this specific case, CPU0 set's CPU1's policy->governor in >> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor >> in >> __cpufreq_governor(). >> > > Whoa! That's horrible! Can you please explain how exactly you > triggered this? I'm curious... Just call cpufreq_update_policy(cpu) on a CPU and have another thread online/offline it. > >> Signed-off-by: Saravana Kannan <skannan@codeaurora.org> >> --- >> drivers/cpufreq/cpufreq.c | 10 +++++----- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >> index cb003a6..d5ceb43 100644 >> --- a/drivers/cpufreq/cpufreq.c >> +++ b/drivers/cpufreq/cpufreq.c >> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, >> struct subsys_interface *sif, >> goto err_set_policy_cpu; >> } >> >> - write_lock_irqsave(&cpufreq_driver_lock, flags); >> - for_each_cpu(j, policy->cpus) >> - per_cpu(cpufreq_cpu_data, j) = policy; >> - write_unlock_irqrestore(&cpufreq_driver_lock, flags); >> - >> if (cpufreq_driver->get) { >> policy->cur = cpufreq_driver->get(policy->cpu); > > If you move the per-cpu init further down, then what happens to the > cpufreq_generic_get() that gets invoked here by some of the drivers? While cpufreq_generic_get() was a good refactor, I think it's causing unnecessary cyclic dependency. You need that function to not fail for a policy to get added properly and you need a proper policy for the function to work. I care more about fixing the panic than trying to keep cpufreq_generic_get(). > It will almost always fail (because policy will be NULL) and hence > CPU online will be unsuccessful (though you wont observe it because > the error code is not bubbled up to the CPU hotplug core; perhaps we > should). Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq code base. Since this new function isn't there at that point, I missed it. Even if I did use the latest kernel, I wouldn't have hit this issue, because the MSM cpufreq driver doesn't use this function. > > IMHO, we should fix the synchronization in cpufreq_init_policy(). > I notice cpufreq_update_policy() as well as __cpufreq_governor() in > your stack trace, which means cpufreq_set_policy() was involved. > cpufreq_update_policy() takes the policy->rwsem for write, whereas > cpufreq_init_policy() doesn't take that lock at all, which is clearly > wrong. > > My guess is that, if we fix that locking, everything will be fine and > you won't hit the bug. Would you like to give that a shot? While locking might need fixing, this is not about the locking though. Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy that they consider valid and also use it as a way to check and reject API calls trying to manipulate an offline CPU. Without this fix, the framework is advertising an incomplete policy object as available. I think that breaks the CPUfreq framework APIs in a very fundamental sense. This is a "no-no" in programming. It's like trying to register a CPUfreq driver before getting the clocks to control the CPU. As for the ->get() issue, I think the framework should just call clk_get_rate() instead of calling .get() if policy->clk is not NULL. No point in doing framework -> driver -> framework. Thanks, Saravana
Viresh Kumar wrote: > On 24 February 2014 13:12, Srivatsa S. Bhat > <srivatsa.bhat@linux.vnet.ibm.com> wrote: >> On 02/24/2014 12:27 PM, Saravana Kannan wrote: >>> The existing code sets the per CPU policy to a non-NULL value before >>> all >>> the steps performed during the hotplug online path is done. >>> Specifically, >>> this is done before the policy min/max, governors, etc are initialized >>> for >>> the policy. This in turn means that calls to cpufreq_cpu_get() return >>> a >>> non-NULL policy before the policy/CPU is ready to be used. >>> >>> To fix this, move the update of per CPU policy to a valid value after >>> all >>> the initialization steps for the policy are completed. >>> >>> Example kernel panic without this fix: >>> [ 512.146185] Unable to handle kernel NULL pointer dereference at >>> virtual address 00000020 >>> [ 512.146195] pgd = c0003000 >>> [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 >>> [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM >>> <snip> >>> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac >>> [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 >>> <snip> >>> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- >>> [ 512.149761] Kernel panic - not syncing: Fatal exception >>> [ 513.152016] CPU0: stopping >>> [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W >>> 3.10.0-gd727407-00074-g979ede8 #396 >>> <snip> >>> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>> [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>> [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) >>> [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from >>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) >>> [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from >>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >>> [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >>> from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) >>> [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from >>> [<c0afe180>] (notifier_call_chain+0x40/0x68) >>> [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>> [<c02812dc>] (__cpu_notify+0x28/0x44) >>> [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] >>> (_cpu_up+0xf4/0x1dc) >>> [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] >>> (cpu_up+0x5c/0x78) >>> [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] >>> (store_online+0x44/0x74) >>> [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] >>> (sysfs_write_file+0x108/0x14c) >>> [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from >>> [<c03517d4>] (vfs_write+0xd0/0x180) >>> [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] >>> (SyS_write+0x38/0x68) >>> [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] >>> (ret_fast_syscall+0x0/0x30) >>> >>> In this specific case, CPU0 set's CPU1's policy->governor in >>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor >>> in >>> __cpufreq_governor(). >>> >> >> Whoa! That's horrible! Can you please explain how exactly you >> triggered this? I'm curious... > > :) > > Actually I didn't understood his problem very well until now. I couldn't > get from his trace that which pointer was actually set to NULL here? > And hence what caused this crash? > > Also, what Saravana just wrote here didn't looked like relevant at all. > i.e.: policy->governor being set to NULL. > > So what? It was set to NULL with a reason and it shouldn't cause > any crashes I hope. And also its sort of wrong to say that CPU0 > has set governor for CPU1, as governor was set for a policy and > both CPUs were part of the same policy. > >>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org> >>> --- >>> drivers/cpufreq/cpufreq.c | 10 +++++----- >>> 1 file changed, 5 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >>> index cb003a6..d5ceb43 100644 >>> --- a/drivers/cpufreq/cpufreq.c >>> +++ b/drivers/cpufreq/cpufreq.c >>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, >>> struct subsys_interface *sif, >>> goto err_set_policy_cpu; >>> } >>> >>> - write_lock_irqsave(&cpufreq_driver_lock, flags); >>> - for_each_cpu(j, policy->cpus) >>> - per_cpu(cpufreq_cpu_data, j) = policy; >>> - write_unlock_irqrestore(&cpufreq_driver_lock, flags); >>> - >>> if (cpufreq_driver->get) { >>> policy->cur = cpufreq_driver->get(policy->cpu); >> >> If you move the per-cpu init further down, then what happens to the >> cpufreq_generic_get() that gets invoked here by some of the drivers? >> It will almost always fail (because policy will be NULL) and hence >> CPU online will be unsuccessful > > Thanks, I was about to point that as well. > >> (though you wont observe it because >> the error code is not bubbled up to the CPU hotplug core; perhaps we >> should). > > Don't know :) > >> IMHO, we should fix the synchronization in cpufreq_init_policy(). >> I notice cpufreq_update_policy() as well as __cpufreq_governor() in >> your stack trace, which means cpufreq_set_policy() was involved. >> cpufreq_update_policy() takes the policy->rwsem for write, whereas >> cpufreq_init_policy() doesn't take that lock at all, which is clearly >> wrong. >> >> My guess is that, if we fix that locking, everything will be fine and >> you won't hit the bug. Would you like to give that a shot? > > I am not sure about that guess as well. I first want to know what > went bad exactly.. > I just replied to the other email. I think I answered both your questions there. Sorry about mixing up CPU and policy. In my case, each CPU is independently scalable -- so for now take CPU == policy. I'll fix it up once we agree on the fix. -Saravana
On 24 February 2014 14:11, <skannan@codeaurora.org> wrote: > I just replied to the other email. I think I answered both your questions > there. Sorry about mixing up CPU and policy. In my case, each CPU is > independently scalable -- so for now take CPU == policy. I'll fix it up > once we agree on the fix. But why do you say this then? >>> In this specific case, CPU0 set's CPU1's policy->governor in >>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor >>> in __cpufreq_governor().
Viresh Kumar wrote: > On 24 February 2014 14:11, <skannan@codeaurora.org> wrote: >> I just replied to the other email. I think I answered both your >> questions >> there. Sorry about mixing up CPU and policy. In my case, each CPU is >> independently scalable -- so for now take CPU == policy. I'll fix it up >> once we agree on the fix. > > But why do you say this then? Sorry, not sure I understand what you mean. I agree, wording in my commit text might be unclear. I'll fix it after we agree on the code fix. In the MSM case, each CPU has it's own policy. I'm assuming your original complaint was about my confusing wording. Maybe that's not what you were pointing out? -Saravana
On 24 February 2014 14:17, <skannan@codeaurora.org> wrote: > Sorry, not sure I understand what you mean. > > I agree, wording in my commit text might be unclear. I'll fix it after we > agree on the code fix. In the MSM case, each CPU has it's own policy. > > I'm assuming your original complaint was about my confusing wording. Maybe > that's not what you were pointing out? In your case each CPU has a separate policy structure as they have separately controllable clocks. But you also said that CPU0 is setting CPU1's governor to NULL. I don't see that happening. Each CPU sets its own governor to NULL on init().
Viresh Kumar wrote: > On 24 February 2014 14:17, <skannan@codeaurora.org> wrote: >> Sorry, not sure I understand what you mean. >> >> I agree, wording in my commit text might be unclear. I'll fix it after >> we >> agree on the code fix. In the MSM case, each CPU has it's own policy. >> >> I'm assuming your original complaint was about my confusing wording. >> Maybe >> that's not what you were pointing out? > > In your case each CPU has a separate policy structure as they have > separately > controllable clocks. But you also said that CPU0 is setting CPU1's > governor to > NULL. I don't see that happening. Each CPU sets its own governor to NULL > on > init(). When I said "CPU0 is setting CPU1's governor to NULL", I meant thread/context running in CPU0 is setting CPU1's governor as part of CPU1's ONLINE notifier. -Saravana
On 24 February 2014 14:09, <skannan@codeaurora.org> wrote: > > Srivatsa S. Bhat wrote: >> On 02/24/2014 12:27 PM, Saravana Kannan wrote: >>> The existing code sets the per CPU policy to a non-NULL value before all >>> the steps performed during the hotplug online path is done. >>> Specifically, >>> this is done before the policy min/max, governors, etc are initialized >>> for >>> the policy. This in turn means that calls to cpufreq_cpu_get() return a >>> non-NULL policy before the policy/CPU is ready to be used. >>> >>> To fix this, move the update of per CPU policy to a valid value after >>> all >>> the initialization steps for the policy are completed. >>> >>> Example kernel panic without this fix: >>> [ 512.146185] Unable to handle kernel NULL pointer dereference at >>> virtual address 00000020 >>> [ 512.146195] pgd = c0003000 >>> [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 >>> [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM >>> <snip> >>> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac >>> [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 >>> <snip> >>> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- >>> [ 512.149761] Kernel panic - not syncing: Fatal exception >>> [ 513.152016] CPU0: stopping >>> [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W >>> 3.10.0-gd727407-00074-g979ede8 #396 >>> <snip> >>> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>> [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>> [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) >>> [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from >>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) >>> [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from >>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >>> [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from >>> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) >>> [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from >>> [<c0afe180>] (notifier_call_chain+0x40/0x68) >>> [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>> [<c02812dc>] (__cpu_notify+0x28/0x44) >>> [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] >>> (_cpu_up+0xf4/0x1dc) >>> [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] >>> (cpu_up+0x5c/0x78) >>> [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] >>> (store_online+0x44/0x74) >>> [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] >>> (sysfs_write_file+0x108/0x14c) >>> [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from >>> [<c03517d4>] (vfs_write+0xd0/0x180) >>> [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] >>> (SyS_write+0x38/0x68) >>> [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] >>> (ret_fast_syscall+0x0/0x30) >>> >>> In this specific case, CPU0 set's CPU1's policy->governor in >>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor >>> in >>> __cpufreq_governor(). >>> >> >> Whoa! That's horrible! Can you please explain how exactly you >> triggered this? I'm curious... > > Just call cpufreq_update_policy(cpu) on a CPU and have another thread > online/offline it. But would you do that? Means why would you call them this way? cpufreq_update_policy() shouldn't be called that way I believe.. >>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org> >>> --- >>> drivers/cpufreq/cpufreq.c | 10 +++++----- >>> 1 file changed, 5 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >>> index cb003a6..d5ceb43 100644 >>> --- a/drivers/cpufreq/cpufreq.c >>> +++ b/drivers/cpufreq/cpufreq.c >>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, >>> struct subsys_interface *sif, >>> goto err_set_policy_cpu; >>> } >>> >>> - write_lock_irqsave(&cpufreq_driver_lock, flags); >>> - for_each_cpu(j, policy->cpus) >>> - per_cpu(cpufreq_cpu_data, j) = policy; >>> - write_unlock_irqrestore(&cpufreq_driver_lock, flags); >>> - >>> if (cpufreq_driver->get) { >>> policy->cur = cpufreq_driver->get(policy->cpu); >> >> If you move the per-cpu init further down, then what happens to the >> cpufreq_generic_get() that gets invoked here by some of the drivers? > > While cpufreq_generic_get() was a good refactor, I think it's causing > unnecessary cyclic dependency. You need that function to not fail for a > policy to get added properly and you need a proper policy for the function > to work. I care more about fixing the panic than trying to keep > cpufreq_generic_get(). Surely, I will like a solution which would keep this as is :).. cpufreq_generic_get() should pass.. >> It will almost always fail (because policy will be NULL) and hence >> CPU online will be unsuccessful (though you wont observe it because >> the error code is not bubbled up to the CPU hotplug core; perhaps we >> should). > > Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq > code base. Since this new function isn't there at that point, I missed it. > Even if I did use the latest kernel, I wouldn't have hit this issue, > because the MSM cpufreq driver doesn't use this function. > >> >> IMHO, we should fix the synchronization in cpufreq_init_policy(). >> I notice cpufreq_update_policy() as well as __cpufreq_governor() in >> your stack trace, which means cpufreq_set_policy() was involved. >> cpufreq_update_policy() takes the policy->rwsem for write, whereas >> cpufreq_init_policy() doesn't take that lock at all, which is clearly >> wrong. >> >> My guess is that, if we fix that locking, everything will be fine and >> you won't hit the bug. Would you like to give that a shot? > > While locking might need fixing, this is not about the locking though. > Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy > that they consider valid and also use it as a way to check and reject API > calls trying to manipulate an offline CPU. > > Without this fix, the framework is advertising an incomplete policy object > as available. I think that breaks the CPUfreq framework APIs in a very > fundamental sense. This is a "no-no" in programming. > > It's like trying to register a CPUfreq driver before getting the clocks to > control the CPU. So the real question here is: What fields of 'policy' do we need to guarantee to be initialized before policy is made available for others? And probably that is what we need to fix. Even your current solution would break things. For example, below will happen before policy is set in per-cpu variable: - CPUFREQ_CREATE_POLICY notifier will be fired and calls to do cpufreq_cpu_get() there will fail. And hence cpufreq-stats sysfs will break. - Governors also use cpufreq_cpu_get() and so those would also fail as they are started from cpufreq_init_policy().. ..
On 02/24/2014 02:55 AM, Viresh Kumar wrote: > On 24 February 2014 14:09, <skannan@codeaurora.org> wrote: >> >> Srivatsa S. Bhat wrote: >>> On 02/24/2014 12:27 PM, Saravana Kannan wrote: >>>> The existing code sets the per CPU policy to a non-NULL value before all >>>> the steps performed during the hotplug online path is done. >>>> Specifically, >>>> this is done before the policy min/max, governors, etc are initialized >>>> for >>>> the policy. This in turn means that calls to cpufreq_cpu_get() return a >>>> non-NULL policy before the policy/CPU is ready to be used. >>>> >>>> To fix this, move the update of per CPU policy to a valid value after >>>> all >>>> the initialization steps for the policy are completed. >>>> >>>> Example kernel panic without this fix: >>>> [ 512.146185] Unable to handle kernel NULL pointer dereference at >>>> virtual address 00000020 >>>> [ 512.146195] pgd = c0003000 >>>> [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 >>>> [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM >>>> <snip> >>>> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac >>>> [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 >>>> <snip> >>>> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- >>>> [ 512.149761] Kernel panic - not syncing: Fatal exception >>>> [ 513.152016] CPU0: stopping >>>> [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W >>>> 3.10.0-gd727407-00074-g979ede8 #396 >>>> <snip> >>>> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>>> [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>>> [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) >>>> from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>>> [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) >>>> from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) >>>> [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from >>>> [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) >>>> [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from >>>> [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) >>>> [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from >>>> [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) >>>> [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from >>>> [<c0afe180>] (notifier_call_chain+0x40/0x68) >>>> [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from >>>> [<c02812dc>] (__cpu_notify+0x28/0x44) >>>> [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] >>>> (_cpu_up+0xf4/0x1dc) >>>> [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] >>>> (cpu_up+0x5c/0x78) >>>> [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] >>>> (store_online+0x44/0x74) >>>> [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] >>>> (sysfs_write_file+0x108/0x14c) >>>> [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from >>>> [<c03517d4>] (vfs_write+0xd0/0x180) >>>> [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] >>>> (SyS_write+0x38/0x68) >>>> [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] >>>> (ret_fast_syscall+0x0/0x30) >>>> >>>> In this specific case, CPU0 set's CPU1's policy->governor in >>>> cpufreq_init_policy() to NULL while CPU1 is using the policy->governor >>>> in >>>> __cpufreq_governor(). >>>> >>> >>> Whoa! That's horrible! Can you please explain how exactly you >>> triggered this? I'm curious... >> >> Just call cpufreq_update_policy(cpu) on a CPU and have another thread >> online/offline it. > > But would you do that? Means why would you call them this way? > cpufreq_update_policy() shouldn't be called that way I believe.. I was simplifying the scenario that causes it. We change the min/max using ADJUST notifiers for multiple reasons -- thermal being one of them. thermal/cpu_cooling is one example of it. So, cpufreq_update_policy() can be called on any CPU. If that races with someone offlining a CPU and onlining it, you'll get this crash. >>>> Signed-off-by: Saravana Kannan <skannan@codeaurora.org> >>>> --- >>>> drivers/cpufreq/cpufreq.c | 10 +++++----- >>>> 1 file changed, 5 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c >>>> index cb003a6..d5ceb43 100644 >>>> --- a/drivers/cpufreq/cpufreq.c >>>> +++ b/drivers/cpufreq/cpufreq.c >>>> @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, >>>> struct subsys_interface *sif, >>>> goto err_set_policy_cpu; >>>> } >>>> >>>> - write_lock_irqsave(&cpufreq_driver_lock, flags); >>>> - for_each_cpu(j, policy->cpus) >>>> - per_cpu(cpufreq_cpu_data, j) = policy; >>>> - write_unlock_irqrestore(&cpufreq_driver_lock, flags); >>>> - >>>> if (cpufreq_driver->get) { >>>> policy->cur = cpufreq_driver->get(policy->cpu); >>> >>> If you move the per-cpu init further down, then what happens to the >>> cpufreq_generic_get() that gets invoked here by some of the drivers? >> >> While cpufreq_generic_get() was a good refactor, I think it's causing >> unnecessary cyclic dependency. You need that function to not fail for a >> policy to get added properly and you need a proper policy for the function >> to work. I care more about fixing the panic than trying to keep >> cpufreq_generic_get(). > > Surely, I will like a solution which would keep this as is :).. > cpufreq_generic_get() should pass.. The idea would exist, but we can just call cpufreq_generic_get() and pass it policy->clk if it is not NULL. Does that work for you? >>> It will almost always fail (because policy will be NULL) and hence >>> CPU online will be unsuccessful (though you wont observe it because >>> the error code is not bubbled up to the CPU hotplug core; perhaps we >>> should). >> >> Good catch. I actually hit this issue, fixed and test it on a 3.12 cpufreq >> code base. Since this new function isn't there at that point, I missed it. >> Even if I did use the latest kernel, I wouldn't have hit this issue, >> because the MSM cpufreq driver doesn't use this function. >> >>> >>> IMHO, we should fix the synchronization in cpufreq_init_policy(). >>> I notice cpufreq_update_policy() as well as __cpufreq_governor() in >>> your stack trace, which means cpufreq_set_policy() was involved. >>> cpufreq_update_policy() takes the policy->rwsem for write, whereas >>> cpufreq_init_policy() doesn't take that lock at all, which is clearly >>> wrong. >>> >>> My guess is that, if we fix that locking, everything will be fine and >>> you won't hit the bug. Would you like to give that a shot? >> >> While locking might need fixing, this is not about the locking though. >> Plenty of drivers and the framework use cpufreq_cpu_get() to get a policy >> that they consider valid and also use it as a way to check and reject API >> calls trying to manipulate an offline CPU. >> >> Without this fix, the framework is advertising an incomplete policy object >> as available. I think that breaks the CPUfreq framework APIs in a very >> fundamental sense. This is a "no-no" in programming. >> >> It's like trying to register a CPUfreq driver before getting the clocks to >> control the CPU. > > So the real question here is: What fields of 'policy' do we need to guarantee > to be initialized before policy is made available for others? And probably > that is what we need to fix. That's going to be hard to define since future uses could be looking at different fields. What is the API semantics of cpufreq_cpu_get(). I can't ever imagine it being correct for any API to return a partially initialized data structure. > Even your current solution would break things. For example, below will happen > before policy is set in per-cpu variable: > - CPUFREQ_CREATE_POLICY notifier will be fired and calls to do cpufreq_cpu_get() > there will fail. And hence cpufreq-stats sysfs will break. I did this on 3.12. I'll fix it up to handle this one. > - Governors also use cpufreq_cpu_get() and so those would also fail as they > are started from cpufreq_init_policy().. The only use of this in governors is in update_sampling_rate() and the behavior after this fix would be correct in that case. If the policy is not fully initialized -- that update doesn't get to go through. All other uses of this API fall under one of these categories: * CPUs are already fully offline whenever it's called. * Already get the real policy as part of the notifier so don't need to call cpufreq_cpu_get If I find any that doesn't fit the above, I would be glad to fix that if it's pointed out. Regards, Saravana
On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote: > I was simplifying the scenario that causes it. We change the min/max using > ADJUST notifiers for multiple reasons -- thermal being one of them. > > thermal/cpu_cooling is one example of it. Just to understand the clear picture, you are actually hitting this bug? Or is this only a theoretical bug? > So, cpufreq_update_policy() can be called on any CPU. If that races with > someone offlining a CPU and onlining it, you'll get this crash. Then shouldn't that be fixed by locks? I think yes. That makes me agree with Srivatsa more here. Though I would say that your argument was also valid that 'policy' shouldn't be up for sale unless it is prepared to. And for that reason only I floated that question earlier: What exactly we need to make sure is initialized in policy? Because policy might keep changing in future as well and that needs locks to protect that stuff. Like min/max/governor/ etc.. So, probably a solution here might be a mix of both. Initialize policy to this minimum level and then make sure locking is used correctly.. > The idea would exist, but we can just call cpufreq_generic_get() and pass it > policy->clk if it is not NULL. Does that work for you? No. Not all drivers implement clk interface. And so clk doesn't look to be the right parameter. I thought maybe 'policy' can be the right parameter and then people can get use policy->cpu to get cpu id out of it. But even that doesn't look to be a great idea. X86 drivers may share policy structure for CPUs that don't actually share a clock line. And so they do need right CPU number as parameter instead of policy. As they might be doing some tricky stuff there. Also, we need to make sure that ->get() returns the frequency at which CPU x is running. >> So the real question here is: What fields of 'policy' do we need to >> guarantee >> to be initialized before policy is made available for others? And probably >> that is what we need to fix. > > That's going to be hard to define since future uses could be looking at > different fields. What is the API semantics of cpufreq_cpu_get(). I can't > ever imagine it being correct for any API to return a partially initialized > data structure. I do agree that we can't broadcast 'policy' before it is usable. And that's why I am asking about the right place where we are sure that it is ready.. >> Even your current solution would break things. For example, below will >> happen >> before policy is set in per-cpu variable: >> - CPUFREQ_CREATE_POLICY notifier will be fired and calls to do >> cpufreq_cpu_get() >> there will fail. And hence cpufreq-stats sysfs will break. > > I did this on 3.12. I'll fix it up to handle this one. This can be moved down without much issues I believe. >> - Governors also use cpufreq_cpu_get() and so those would also fail as >> they >> are started from cpufreq_init_policy().. > > The only use of this in governors is in update_sampling_rate() and the > behavior after this fix would be correct in that case. If the policy is not > fully initialized -- that update doesn't get to go through. hmm.. but even that looks odd. We have reached upto governors but policy isn't available to be used? It should have been ready by now? > All other uses of this API fall under one of these categories: > * CPUs are already fully offline whenever it's called. > * Already get the real policy as part of the notifier so don't need to > call cpufreq_cpu_get > > If I find any that doesn't fit the above, I would be glad to fix that if > it's pointed out. I have just sent out a patchset to you and others, would be great if you can give it a review/test ..
On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote: > On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote: > > I was simplifying the scenario that causes it. We change the min/max using > > ADJUST notifiers for multiple reasons -- thermal being one of them. > > > > thermal/cpu_cooling is one example of it. > > Just to understand the clear picture, you are actually hitting this bug? Or > is this only a theoretical bug? > > > So, cpufreq_update_policy() can be called on any CPU. If that races with > > someone offlining a CPU and onlining it, you'll get this crash. > > Then shouldn't that be fixed by locks? I think yes. That makes me agree with > Srivatsa more here. > > Though I would say that your argument was also valid that 'policy' shouldn't be > up for sale unless it is prepared to. And for that reason only I > floated that question > earlier: What exactly we need to make sure is initialized in policy? Because > policy might keep changing in future as well and that needs locks to protect > that stuff. Like min/max/governor/ etc.. Well, that depends on what the current users expect it to look like initially. It should be initialized to the point in which all of them can handle it correctly. > So, probably a solution here might be a mix of both. Initialize policy to this > minimum level and then make sure locking is used correctly.. Yes. > > The idea would exist, but we can just call cpufreq_generic_get() and pass it > > policy->clk if it is not NULL. Does that work for you? > > No. Not all drivers implement clk interface. And so clk doesn't look to be the > right parameter. I thought maybe 'policy' can be the right parameter and > then people can get use policy->cpu to get cpu id out of it. > > But even that doesn't look to be a great idea. X86 drivers may share policy > structure for CPUs that don't actually share a clock line. And so they do need > right CPU number as parameter instead of policy. As they might be doing > some tricky stuff there. Also, we need to make sure that ->get() returns > the frequency at which CPU x is running. That's not going to work in at least some cases anyway, because for some types of HW we simply can't retrieve the current frequency in a non-racy way. Thanks!
On 25 February 2014 18:34, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > Well, that depends on what the current users expect it to look like initially. > It should be initialized to the point in which all of them can handle it > correctly. Exactly. > That's not going to work in at least some cases anyway, because for some types > of HW we simply can't retrieve the current frequency in a non-racy way. I agree. But this is all that we can guarantee here :)
On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote: > On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote: >> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote: >>> I was simplifying the scenario that causes it. We change the min/max using >>> ADJUST notifiers for multiple reasons -- thermal being one of them. >>> >>> thermal/cpu_cooling is one example of it. >> >> Just to understand the clear picture, you are actually hitting this bug? Or >> is this only a theoretical bug? >> This is a real bug. But the actual caller of cpufreq_update_policy() is a driver that's local to our tree. I'm just giving examples of upstream code that act in a similar way. >>> So, cpufreq_update_policy() can be called on any CPU. If that races with >>> someone offlining a CPU and onlining it, you'll get this crash. >> >> Then shouldn't that be fixed by locks? I think yes. That makes me agree with >> Srivatsa more here. >> >> Though I would say that your argument was also valid that 'policy' shouldn't be >> up for sale unless it is prepared to. And for that reason only I >> floated that question >> earlier: What exactly we need to make sure is initialized in policy? Because >> policy might keep changing in future as well and that needs locks to protect >> that stuff. Like min/max/governor/ etc.. > > Well, that depends on what the current users expect it to look like initially. > It should be initialized to the point in which all of them can handle it > correctly. Yes, so let's not make it available until all of it is initialized. I don't like the piece meal check. 6 months down the lane someone making changes might not remember this. The problem also applies for drivers that might not be upstreamed, etc. >> So, probably a solution here might be a mix of both. Initialize policy to this >> minimum level and then make sure locking is used correctly.. > > Yes. Rafael, It's not clear what you mean by the yes. Do you want to initialize it partly and make it available. I think that's always wrong. >>> The idea would exist, but we can just call cpufreq_generic_get() and pass it >>> policy->clk if it is not NULL. Does that work for you? >> >> No. Not all drivers implement clk interface. And so clk doesn't look to be the >> right parameter. I thought maybe 'policy' can be the right parameter and >> then people can get use policy->cpu to get cpu id out of it. >> >> But even that doesn't look to be a great idea. X86 drivers may share policy >> structure for CPUs that don't actually share a clock line. And so they do need >> right CPU number as parameter instead of policy. As they might be doing >> some tricky stuff there. Also, we need to make sure that ->get() returns >> the frequency at which CPU x is running. > > That's not going to work in at least some cases anyway, because for some types > of HW we simply can't retrieve the current frequency in a non-racy way. I think there's been a misunderstanding of what I'm proposing. The reference to policy->clk is only to get rid of the dependency that cpufreq_generic_get() has on the per cpu policy variable being filled. You can do that by just removing calls to get _IF_ clk is filled in. Viresh, I'll look at the patches later today and respond. Although, I would have been nice you had helped me fix any issues with my patch than coming up with new ones. Kinda removes to motivation for submitting patches upstream. Regards, Saravana
On Tuesday, February 25, 2014 01:11:27 PM Saravana Kannan wrote: > On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote: > > On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote: > >> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote: > >>> I was simplifying the scenario that causes it. We change the min/max using > >>> ADJUST notifiers for multiple reasons -- thermal being one of them. > >>> > >>> thermal/cpu_cooling is one example of it. > >> > >> Just to understand the clear picture, you are actually hitting this bug? Or > >> is this only a theoretical bug? > >> > This is a real bug. But the actual caller of cpufreq_update_policy() is > a driver that's local to our tree. I'm just giving examples of upstream > code that act in a similar way. > > >>> So, cpufreq_update_policy() can be called on any CPU. If that races with > >>> someone offlining a CPU and onlining it, you'll get this crash. > >> > >> Then shouldn't that be fixed by locks? I think yes. That makes me agree with > >> Srivatsa more here. > >> > >> Though I would say that your argument was also valid that 'policy' shouldn't be > >> up for sale unless it is prepared to. And for that reason only I > >> floated that question > >> earlier: What exactly we need to make sure is initialized in policy? Because > >> policy might keep changing in future as well and that needs locks to protect > >> that stuff. Like min/max/governor/ etc.. > > > > Well, that depends on what the current users expect it to look like initially. > > It should be initialized to the point in which all of them can handle it > > correctly. > > Yes, so let's not make it available until all of it is initialized. And is "fully initialized" actually well defined? > I don't like the piece meal check. 6 months down the lane someone making > changes might not remember this. The problem also applies for drivers > that might not be upstreamed, etc. Please don't bring up out-of-the-tree drivers argument in mainline discussions, it is meaningless here. > >> So, probably a solution here might be a mix of both. Initialize policy to this > >> minimum level and then make sure locking is used correctly.. > > > > Yes. > > Rafael, It's not clear what you mean by the yes. Do you want to > initialize it partly and make it available. I think that's always wrong. So what actually is your porposal? > >>> The idea would exist, but we can just call cpufreq_generic_get() and pass it > >>> policy->clk if it is not NULL. Does that work for you? > >> > >> No. Not all drivers implement clk interface. And so clk doesn't look to be the > >> right parameter. I thought maybe 'policy' can be the right parameter and > >> then people can get use policy->cpu to get cpu id out of it. > >> > >> But even that doesn't look to be a great idea. X86 drivers may share policy > >> structure for CPUs that don't actually share a clock line. And so they do need > >> right CPU number as parameter instead of policy. As they might be doing > >> some tricky stuff there. Also, we need to make sure that ->get() returns > >> the frequency at which CPU x is running. > > > > That's not going to work in at least some cases anyway, because for some types > > of HW we simply can't retrieve the current frequency in a non-racy way. > > I think there's been a misunderstanding of what I'm proposing. The > reference to policy->clk is only to get rid of the dependency that > cpufreq_generic_get() has on the per cpu policy variable being filled. > You can do that by just removing calls to get _IF_ clk is filled in. I still have a little problem understanding what you mean exactly. At least please explain the last sentence. > Viresh, > > I'll look at the patches later today and respond. Although, I would have > been nice you had helped me fix any issues with my patch than coming up > with new ones. Kinda removes to motivation for submitting patches upstream. Everyone is free to post patches at any time during the discussion. Viresh is as well as you are. Thanks!
On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote: > On Tuesday, February 25, 2014 01:11:27 PM Saravana Kannan wrote: >> On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote: >>> On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote: >>>> On 25 February 2014 01:53, Saravana Kannan <skannan@codeaurora.org> wrote: >>>>> I was simplifying the scenario that causes it. We change the min/max using >>>>> ADJUST notifiers for multiple reasons -- thermal being one of them. >>>>> >>>>> thermal/cpu_cooling is one example of it. >>>> >>>> Just to understand the clear picture, you are actually hitting this bug? Or >>>> is this only a theoretical bug? >>>> >> This is a real bug. But the actual caller of cpufreq_update_policy() is >> a driver that's local to our tree. I'm just giving examples of upstream >> code that act in a similar way. >> >>>>> So, cpufreq_update_policy() can be called on any CPU. If that races with >>>>> someone offlining a CPU and onlining it, you'll get this crash. >>>> >>>> Then shouldn't that be fixed by locks? I think yes. That makes me agree with >>>> Srivatsa more here. >>>> >>>> Though I would say that your argument was also valid that 'policy' shouldn't be >>>> up for sale unless it is prepared to. And for that reason only I >>>> floated that question >>>> earlier: What exactly we need to make sure is initialized in policy? Because >>>> policy might keep changing in future as well and that needs locks to protect >>>> that stuff. Like min/max/governor/ etc.. >>> >>> Well, that depends on what the current users expect it to look like initially. >>> It should be initialized to the point in which all of them can handle it >>> correctly. >> >> Yes, so let's not make it available until all of it is initialized. > > And is "fully initialized" actually well defined? The point in add dev/hot plug path after which we will no longer change policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU / CPUFRE_NOTIFY notifiers. Pretty much the end of __cpufreq_add_dev() so that it's after: - cpufreq_init_policy() - And the update of userpolicy fields that after thie init call >> I don't like the piece meal check. 6 months down the lane someone making >> changes might not remember this. The problem also applies for drivers >> that might not be upstreamed, etc. > > Please don't bring up out-of-the-tree drivers argument in mainline discussions, > it is meaningless here. Sure. This also applies to in-tree code. The 6 months down the lane, what fields are being looked at could still change and we might slip up on making sure that the policy is not made available before that field is initialized. >>>> So, probably a solution here might be a mix of both. Initialize policy to this >>>> minimum level and then make sure locking is used correctly.. >>> >>> Yes. >> >> Rafael, It's not clear what you mean by the yes. Do you want to >> initialize it partly and make it available. I think that's always wrong. > > So what actually is your porposal? > Same as reply to "fully initialized". We make the policy after it's fully initialized. After that point, any changes to the policy can be tracked by registering for CPUFREQ_UPDATE_POLICY_CPU / CPUFREQ_NOTIFY. >>>>> The idea would exist, but we can just call cpufreq_generic_get() and pass it >>>>> policy->clk if it is not NULL. Does that work for you? >>>> >>>> No. Not all drivers implement clk interface. And so clk doesn't look to be the >>>> right parameter. I thought maybe 'policy' can be the right parameter and >>>> then people can get use policy->cpu to get cpu id out of it. >>>> >>>> But even that doesn't look to be a great idea. X86 drivers may share policy >>>> structure for CPUs that don't actually share a clock line. And so they do need >>>> right CPU number as parameter instead of policy. As they might be doing >>>> some tricky stuff there. Also, we need to make sure that ->get() returns >>>> the frequency at which CPU x is running. >>> >>> That's not going to work in at least some cases anyway, because for some types >>> of HW we simply can't retrieve the current frequency in a non-racy way. >> >> I think there's been a misunderstanding of what I'm proposing. The >> reference to policy->clk is only to get rid of the dependency that >> cpufreq_generic_get() has on the per cpu policy variable being filled. >> You can do that by just removing calls to get _IF_ clk is filled in. > > I still have a little problem understanding what you mean exactly. At least > please explain the last sentence. Ok, here's some pseudo code to explain it better: Something like, replace all calls to cpufreq_driver->get with __cpufreq_driver_get() with the fn being something like: unsigned int __cpufreq_driver_get(cpufreq_policy *policy) { if (policy->clk) return clk_get_rate(policy->clk) / 1000; else return cpufreq_driver->get(policy->cpu); } That way, we don't have to depend on cpufreq_cpu_get() to return a policy in cpufreq_generic_get() for the existing add dev code path to complete without an error (after my patch to advertise the policy struct after it's fully initialized) This also avoid unnecessary function calls to ->get() when we just as well know that it's going to call back into cpufreq_generic_get() and get the clock rate. We can just do that in the framework. -Saravana
On 26 February 2014 02:41, Saravana Kannan <skannan@codeaurora.org> wrote: > On 02/25/2014 05:04 AM, Rafael J. Wysocki wrote: >> On Tuesday, February 25, 2014 02:20:57 PM Viresh Kumar wrote: > I think there's been a misunderstanding of what I'm proposing. The reference > to policy->clk is only to get rid of the dependency that > cpufreq_generic_get() has on the per cpu policy variable being filled. You > can do that by just removing calls to get _IF_ clk is filled in. cpufreq_cpu_get() can be called by other drivers as well, which may not have clock interface implemented for them. We can't stop them from calling it. > I'll look at the patches later today and respond. Although, I would have > been nice you had helped me fix any issues with my patch than coming up with > new ones. Kinda removes to motivation for submitting patches upstream. Sorry if I demotivated you at all :) I just wanted to get to a quick-fix and wasn't interested in getting my patch count up. Thought that isn't always bad :) I sent my patches as they were very different then your approach. Obviously, I wouldn't have done this otherwise :)
On 26 February 2014 07:18, Saravana Kannan <skannan@codeaurora.org> wrote: > On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote: >> And is "fully initialized" actually well defined? > > The point in add dev/hot plug path after which we will no longer change > policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU / > CPUFRE_NOTIFY notifiers. Okay.. > Pretty much the end of __cpufreq_add_dev() so that it's after: > - cpufreq_init_policy() > - And the update of userpolicy fields that after thie init call No. In that case it can be considered initialized before cpufreq_init_policy(). As we do send CPUFREQ_NOTIFY after that from cpufreq_init_policy()-> cpufreq_set_policy(). There are two types of fields within policy, some are very basic: cpu/min/max/ affected_cpus/related_cpus some are advanced: sysfs/governors/.. And as a rule you have to get policy->rwsem lock before accessing policy members. We might not have followed it very well for small things like cpu. And so if you are doing anything over that, please use a lock and that is already present in cpufreq_update_policy(). With my latest patchset that I sent yesterday, locking is improved and now a policy will be usable only after the rwsem is released. And that should be fine. And so making it available in the per-cpu variable after all the necessary fields are filled looks fine to me. And so I don't think we need to move it after call to cpufreq_init_policy(maybe a better name to this function is required).. > Ok, here's some pseudo code to explain it better: > > Something like, replace all calls to cpufreq_driver->get with > __cpufreq_driver_get() with the fn being something like: > > unsigned int __cpufreq_driver_get(cpufreq_policy *policy) > { > if (policy->clk) > return clk_get_rate(policy->clk) / 1000; > else > return cpufreq_driver->get(policy->cpu); This part may still use cpufreq_cpu_get(). > } Drivers are free to have their implementation of ->get() even if they have a valid policy->clk field..
On 02/25/2014 10:02 PM, Viresh Kumar wrote: > On 26 February 2014 07:18, Saravana Kannan <skannan@codeaurora.org> wrote: >> On 02/25/2014 02:41 PM, Rafael J. Wysocki wrote: > >>> And is "fully initialized" actually well defined? >> >> The point in add dev/hot plug path after which we will no longer change >> policy fields without sending further CPUFREQ_UPDATE_POLICY_CPU / >> CPUFRE_NOTIFY notifiers. > > Okay.. > >> Pretty much the end of __cpufreq_add_dev() so that it's after: >> - cpufreq_init_policy() >> - And the update of userpolicy fields that after thie init call > > No. In that case it can be considered initialized before cpufreq_init_policy(). > As we do send CPUFREQ_NOTIFY after that from cpufreq_init_policy()-> > cpufreq_set_policy(). Ok, valid hole in my definition of "fully initialized". > > There are two types of fields within policy, some are very basic: cpu/min/max/ > affected_cpus/related_cpus > > some are advanced: sysfs/governors/.. > > And as a rule you have to get policy->rwsem lock before accessing policy > members. We might not have followed it very well for small things like cpu. > > And so if you are doing anything over that, please use a lock and that is > already present in cpufreq_update_policy(). > > With my latest patchset that I sent yesterday, locking is improved and now > a policy will be usable only after the rwsem is released. And that should be > fine. And so making it available in the per-cpu variable after all the necessary > fields are filled looks fine to me. And so I don't think we need to move it > after call to cpufreq_init_policy(maybe a better name to this function is > required).. I'll take a closer look. Internal tree cpufreq code is in 3.12, so back-porting all the cpufreq changes and testing it can take a bit of time. Will get back on this. -Saravana
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index cb003a6..d5ceb43 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1109,11 +1109,6 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif, goto err_set_policy_cpu; } - write_lock_irqsave(&cpufreq_driver_lock, flags); - for_each_cpu(j, policy->cpus) - per_cpu(cpufreq_cpu_data, j) = policy; - write_unlock_irqrestore(&cpufreq_driver_lock, flags); - if (cpufreq_driver->get) { policy->cur = cpufreq_driver->get(policy->cpu); if (!policy->cur) { @@ -1207,6 +1202,11 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif, policy->user_policy.governor = policy->governor; } + write_lock_irqsave(&cpufreq_driver_lock, flags); + for_each_cpu(j, policy->cpus) + per_cpu(cpufreq_cpu_data, j) = policy; + write_unlock_irqrestore(&cpufreq_driver_lock, flags); + kobject_uevent(&policy->kobj, KOBJ_ADD); up_read(&cpufreq_rwsem);
The existing code sets the per CPU policy to a non-NULL value before all the steps performed during the hotplug online path is done. Specifically, this is done before the policy min/max, governors, etc are initialized for the policy. This in turn means that calls to cpufreq_cpu_get() return a non-NULL policy before the policy/CPU is ready to be used. To fix this, move the update of per CPU policy to a valid value after all the initialization steps for the policy are completed. Example kernel panic without this fix: [ 512.146185] Unable to handle kernel NULL pointer dereference at virtual address 00000020 [ 512.146195] pgd = c0003000 [ 512.146213] [00000020] *pgd=80000000004003, *pmd=00000000 [ 512.146228] Internal error: Oops: 206 [#1] PREEMPT SMP ARM <snip> [ 512.146297] PC is at __cpufreq_governor+0x10/0x1ac [ 512.146312] LR is at cpufreq_update_policy+0x114/0x150 <snip> [ 512.149740] ---[ end trace f23a8defea6cd706 ]--- [ 512.149761] Kernel panic - not syncing: Fatal exception [ 513.152016] CPU0: stopping [ 513.154710] CPU: 0 PID: 7136 Comm: mpdecision Tainted: G D W 3.10.0-gd727407-00074-g979ede8 #396 <snip> [ 513.317224] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) [ 513.327809] [<c02a23ac>] (__blocking_notifier_call_chain+0x40/0x58) from [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) [ 513.339182] [<c02a23d8>] (blocking_notifier_call_chain+0x14/0x1c) from [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) [ 513.349594] [<c0803c68>] (cpufreq_set_policy+0xd4/0x2b8) from [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) [ 513.359231] [<c0803e7c>] (cpufreq_init_policy+0x30/0x98) from [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) [ 513.369560] [<c0805a18>] (__cpufreq_add_dev.isra.17+0x4dc/0x7a4) from [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) [ 513.379978] [<c0805d38>] (cpufreq_cpu_callback+0x58/0x84) from [<c0afe180>] (notifier_call_chain+0x40/0x68) [ 513.389704] [<c0afe180>] (notifier_call_chain+0x40/0x68) from [<c02812dc>] (__cpu_notify+0x28/0x44) [ 513.398728] [<c02812dc>] (__cpu_notify+0x28/0x44) from [<c0aeed90>] (_cpu_up+0xf4/0x1dc) [ 513.406797] [<c0aeed90>] (_cpu_up+0xf4/0x1dc) from [<c0aeeed4>] (cpu_up+0x5c/0x78) [ 513.414357] [<c0aeeed4>] (cpu_up+0x5c/0x78) from [<c0aec808>] (store_online+0x44/0x74) [ 513.422253] [<c0aec808>] (store_online+0x44/0x74) from [<c03a40f4>] (sysfs_write_file+0x108/0x14c) [ 513.431195] [<c03a40f4>] (sysfs_write_file+0x108/0x14c) from [<c03517d4>] (vfs_write+0xd0/0x180) [ 513.439958] [<c03517d4>] (vfs_write+0xd0/0x180) from [<c0351ca8>] (SyS_write+0x38/0x68) [ 513.447947] [<c0351ca8>] (SyS_write+0x38/0x68) from [<c0205de0>] (ret_fast_syscall+0x0/0x30) In this specific case, CPU0 set's CPU1's policy->governor in cpufreq_init_policy() to NULL while CPU1 is using the policy->governor in __cpufreq_governor(). Signed-off-by: Saravana Kannan <skannan@codeaurora.org> --- drivers/cpufreq/cpufreq.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)