Message ID | 52304439.3030301@linux.vnet.ibm.com (mailing list archive)
---|---
State | Superseded, archived
On 11 September 2013 15:51, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>> during suspend/resume".

Sorry Stephen, I was away on vacation and came back only yesterday..
And was badly stuck in some other cpufreq bugs until now :)

> Sure, Rafael. Thanks for CC'ing me.

Thanks for jumping in and helping us out, buddy!!

> Stephen, I went through the code and I think I found out what is going wrong.
> Can you please try the following patch?
>
> Regards,
> Srivatsa S. Bhat
>
> ----------------------------------------------------------------------------
>
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: [PATCH] cpufreq: Fix crash in cpufreq-stats during suspend/resume
>
> Stephen Warren reported that the cpufreq-stats code hits a NULL pointer
> dereference during the second attempt to suspend a system. He also
> pinpointed the problem to commit 5302c3f "cpufreq: Perform light-weight
> init/teardown during suspend/resume".
>
> That commit actually ensured that the cpufreq-stats table and the
> cpufreq-stats sysfs entries are *not* torn down (i.e., not freed) during
> suspend/resume, which makes it all the more surprising. However, it turns
> out that the root cause is not that we access already-freed memory, but
> that the reference to the allocated memory gets moved around and we lose
> track of it during resume, leading to the reported crash in a subsequent
> suspend attempt.
>
> In the suspend path, during CPU offline, the value of policy->cpu is
> updated by choosing one of the surviving CPUs in that policy, as long as
> there is at least one CPU in that policy. And
> cpufreq_stats_update_policy_cpu() is invoked to update the reference to
> the stats structure by assigning it to the new CPU.
>
> However, in the resume path, during CPU online, we end up assigning a
> fresh CPU as the policy->cpu, without letting cpufreq-stats know about
> this. Thus the reference to the stats structure remains (incorrectly)
> associated with the old CPU. So, in a subsequent suspend attempt, during
> CPU offline, we end up accessing an incorrect location to get the stats
> structure, which eventually leads to the NULL pointer dereference.
>
> Fix this by letting cpufreq-stats know about the update of the
> policy->cpu during CPU online in the resume path. (Also, move the
> update_policy_cpu() function higher up in the file, so that
> __cpufreq_add_dev() can invoke it.)

The observation looks good..

> Reported-by: Stephen Warren <swarren@nvidia.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  drivers/cpufreq/cpufreq.c |   37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 5a64f66..62bdb95 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -947,6 +947,18 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>  	kfree(policy);
>  }
>
> +static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> +{
> +	policy->last_cpu = policy->cpu;
> +	policy->cpu = cpu;
> +
> +#ifdef CONFIG_CPU_FREQ_TABLE
> +	cpufreq_frequency_table_update_policy_cpu(policy);
> +#endif
> +	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> +			CPUFREQ_UPDATE_POLICY_CPU, policy);
> +}
> +
>  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>  			     bool frozen)
>  {
> @@ -1000,7 +1012,18 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>  	if (!policy)
>  		goto nomem_out;
>
> -	policy->cpu = cpu;
> +
> +	/*
> +	 * In the resume path, since we restore a saved policy, the assignment
> +	 * to policy->cpu is like an update of the existing policy, rather than
> +	 * the creation of a brand new one. So we need to perform this update
> +	 * by invoking update_policy_cpu().
> +	 */
> +	if (frozen && cpu != policy->cpu)
> +		update_policy_cpu(policy, cpu);
> +	else
> +		policy->cpu = cpu;
> +
>  	policy->governor = CPUFREQ_DEFAULT_GOVERNOR;
>  	cpumask_copy(policy->cpus, cpumask_of(cpu));
>
> @@ -1092,18 +1115,6 @@ static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>  	return __cpufreq_add_dev(dev, sif, false);
>  }
>
> -static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> -{
> -	policy->last_cpu = policy->cpu;
> -	policy->cpu = cpu;
> -
> -#ifdef CONFIG_CPU_FREQ_TABLE
> -	cpufreq_frequency_table_update_policy_cpu(policy);
> -#endif
> -	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> -			CPUFREQ_UPDATE_POLICY_CPU, policy);
> -}
> -

But I would have solved it differently :)

We don't really need to call update_policy_cpu() again and again, as we
don't really need to update policy->cpu... Rather, it would be better to
just move the following inside cpufreq_policy_alloc():

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11 September 2013 16:14, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> But I would have solved it differently :)
>
> We don't really need to call update_policy_cpu() again and again,
> as we don't really need to update policy->cpu...
>
> Rather, it would be better to just move the following inside
> cpufreq_policy_alloc():

Half-written mail sent :(

What about moving the following to cpufreq_policy_alloc():

	policy->cpu = cpu;

??
On 09/11/2013 04:14 PM, Viresh Kumar wrote:
> On 11 September 2013 15:51, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>
>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>> during suspend/resume".
>
> Sorry Stephen, I was away on vacation and came back only yesterday..
> And was badly stuck in some other cpufreq bugs until now :)
>
>> Sure, Rafael. Thanks for CC'ing me.
>
> Thanks for jumping in and helping us out, buddy!!
>
No problem :-) Besides, I am the one who broke it ;-(

Regards,
Srivatsa S. Bhat
On 09/11/2013 04:15 PM, Viresh Kumar wrote:
> On 11 September 2013 16:14, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> But I would have solved it differently :)
>>
>> We don't really need to call update_policy_cpu() again and again,
>> as we don't really need to update policy->cpu...
>>
>> Rather, it would be better to just move the following inside
>> cpufreq_policy_alloc():
>
> Half-written mail sent :(
>
> What about moving the following to cpufreq_policy_alloc():
>
> 	policy->cpu = cpu;
>
> ??
>
Hmm? The problem is not about merely updating the policy->cpu field; the
main issue is that the existing code was not letting the cpufreq-stats
code know that we updated policy->cpu under the hood. It is important for
cpufreq-stats to know this because it maintains the reference to its
stats structure by associating it with policy->cpu. So if policy->cpu
changes under the hood, it loses track of its reference. That is why we
need to keep that code informed about changes to policy->cpu, and hence
call update_policy_cpu() in the CPU online path (during resume). I don't
see how we can skip that.

Regards,
Srivatsa S. Bhat
On 11 September 2013 16:40, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/11/2013 04:14 PM, Viresh Kumar wrote:
>> On 11 September 2013 15:51, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>
>>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>>> during suspend/resume".
>>
>> Sorry Stephen, I was away on vacation and came back only yesterday..
>> And was badly stuck in some other cpufreq bugs until now :)
>>
>>> Sure, Rafael. Thanks for CC'ing me.
>>
>> Thanks for jumping in and helping us out, buddy!!
>>
>
> No problem :-) Besides, I am the one who broke it ;-(

I believe you missed some part of my mail?
On 09/11/2013 04:45 PM, Viresh Kumar wrote:
> On 11 September 2013 16:40, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 09/11/2013 04:14 PM, Viresh Kumar wrote:
>>> On 11 September 2013 15:51, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>
>>>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>>>> during suspend/resume".
>>>
>>> Sorry Stephen, I was away on vacation and came back only yesterday..
>>> And was badly stuck in some other cpufreq bugs until now :)
>>>
>>>> Sure, Rafael. Thanks for CC'ing me.
>>>
>>> Thanks for jumping in and helping us out, buddy!!
>>>
>>
>> No problem :-) Besides, I am the one who broke it ;-(
>
> I believe you missed some part of my mail?
>
See my next reply :-) I was composing it :-)

Man, you are *fast*! ;-)

Regards,
Srivatsa S. Bhat
On 11 September 2013 16:47, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> See my next reply :-) I was composing it :-)
>
> Man, you are *fast*! ;-)

haha...
On 11 September 2013 16:44, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Hmm? The problem is not about merely updating the policy->cpu field; the
> main issue is that the existing code was not letting the cpufreq-stats
> code know that we updated policy->cpu under the hood. It is important for
> cpufreq-stats to know this because it maintains the reference to its
> stats structure by associating it with policy->cpu. So if policy->cpu
> changes under the hood, it loses track of its reference. That is why we
> need to keep that code informed about changes to policy->cpu, and hence
> call update_policy_cpu() in the CPU online path (during resume). I don't
> see how we can skip that.

Okay.. There are two different ways in which cpufreq_add_dev() works
currently..

Boot cluster (i.e. policy with the boot CPU)
--------------------------------------------

Here cpufreq_remove_dev() is never called for the boot CPU, but for all
others. And similarly, cpufreq_add_dev() is never called for the boot
CPU, but for all others.

Now policy->cpu contains a meaningful CPU at the beginning of resume and
we don't need to modify that at all.. For all the remaining CPUs we
should rather call cpufreq_add_policy_cpu()..

Non-boot cluster
----------------

All CPUs here are removed, and at the end policy->cpu contains the last
CPU removed.. So, for a cluster with CPUs 2 and 3, it will contain 3..

Now at resume we will add CPU 2 first and so need to update policy->cpu
to 2.. But for all other CPUs in this cluster we return early from
cpufreq_add_dev() and call cpufreq_add_policy_cpu(), as policy->cpus
was fixed by the call to ->init() for the first CPU of this cluster..
And so we never reach the line: policy->cpu = cpu;

For the first CPU of a non-boot cluster we need to call
update_policy_cpu(), but not for the others..

But for the boot cluster, if we could call ->init() somehow at resume
time, then things would be fairly similar in both cases..

I am running out of time now, as I need to leave the office... I hope I
made the problem more clear, or at least the way I see it :)

--
viresh
On 09/11/2013 05:29 PM, Viresh Kumar wrote:
> On 11 September 2013 16:44, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Hmm? The problem is not about merely updating the policy->cpu field; the
>> main issue is that the existing code was not letting the cpufreq-stats
>> code know that we updated policy->cpu under the hood. It is important for
>> cpufreq-stats to know this because it maintains the reference to its
>> stats structure by associating it with policy->cpu. So if policy->cpu
>> changes under the hood, it loses track of its reference. That is why we
>> need to keep that code informed about changes to policy->cpu, and hence
>> call update_policy_cpu() in the CPU online path (during resume). I don't
>> see how we can skip that.
>
> Okay.. There are two different ways in which cpufreq_add_dev() works
> currently..
>
> Boot cluster (i.e. policy with the boot CPU)
> --------------------------------------------
>
> Here cpufreq_remove_dev() is never called for the boot CPU, but for all
> others. And similarly, cpufreq_add_dev() is never called for the boot
> CPU, but for all others.
>
> Now policy->cpu contains a meaningful CPU at the beginning of resume and
> we don't need to modify that at all.. For all the remaining CPUs we
> should rather call cpufreq_add_policy_cpu()..
>
Yes, and the existing code already does that. And if you observe the
placement of the call to update_policy_cpu() in my patch, you'll find
that it won't be called in cases where we call cpufreq_add_policy_cpu().
Which is exactly what you want, IIUC.

> Non-boot cluster
> ----------------
>
> All CPUs here are removed, and at the end policy->cpu contains the last
> CPU removed.. So, for a cluster with CPUs 2 and 3, it will contain 3..
>
> Now at resume we will add CPU 2 first and so need to update policy->cpu
> to 2.. But for all other CPUs in this cluster we return early from
> cpufreq_add_dev() and call cpufreq_add_policy_cpu(), as policy->cpus
> was fixed by the call to ->init() for the first CPU of this cluster..
>
> And so we never reach the line: policy->cpu = cpu;
>
> For the first CPU of a non-boot cluster we need to call
> update_policy_cpu(), but not for the others..
>
Yes, and that's precisely why I added the call to update_policy_cpu()
only near the statement 'policy->cpu = cpu'. All the other cases don't
need to call that function, and in my patch, they indeed don't call it.

A simple way to look at this is: if policy->cpu is updated anywhere in
the resume (CPU online) path, we need to call update_policy_cpu();
otherwise, we don't. This covers both the scenarios you described above.

> But for the boot cluster, if we could call ->init() somehow at resume
> time, then things would be fairly similar in both cases..
>
> I am running out of time now, as I need to leave the office... I hope I
> made the problem more clear, or at least the way I see it :)
>
Well, your explanation matches my understanding of the scenarios, and I
had written the patch keeping those scenarios in mind; so I believe the
patch is correct as it is.

So what I didn't get is this: are you suggesting that something is wrong
in my patch, or are you just reiterating the different scenarios involved
so that I can recheck whether my patch is right? If it is the former,
please point me to the exact problem you found in my patch. If it is the
latter, I'm pretty sure my patch is right; I've already checked it :-)

Regards,
Srivatsa S. Bhat
On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>> Viresh,
>>>>>>>>>
>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
...
> Stephen, I went through the code and I think I found out what is going wrong.
> Can you please try the following patch?

Unfortunately, I still see the exact same failure/backtrace with this
patch applied.
On 09/11/2013 09:35 PM, Stephen Warren wrote:
> On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>>> Viresh,
>>>>>>>>>>
>>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
> ...
>> Stephen, I went through the code and I think I found out what is going wrong.
>> Can you please try the following patch?
>
> Unfortunately, I still see the exact same failure/backtrace with this
> patch applied.
>
Oh, is it? Can you please give me the map of the related CPUs on your
system? (i.e., cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus for
each CPU.)

I must be missing something...

Regards,
Srivatsa S. Bhat
On 09/11/2013 11:33 PM, Srivatsa S. Bhat wrote:
> On 09/11/2013 09:35 PM, Stephen Warren wrote:
>> On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>>>> Viresh,
>>>>>>>>>>>
>>>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
>> ...
>>> Stephen, I went through the code and I think I found out what is going wrong.
>>> Can you please try the following patch?
>>
>> Unfortunately, I still see the exact same failure/backtrace with this
>> patch applied.
>>
> Oh, is it? Can you please give me the map of the related CPUs on your
> system? (i.e., cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus for
> each CPU.)
>
> I must be missing something...
>
OK, I took a second look at the code, and I suspect that applying the
second patch might help. So can you try by applying both the patches
please[1][2]?

Basically, here is my hunch: say CPUs 2 and 3 are part of a policy and
3 is the policy->cpu. During suspend, CPU 2 will be taken offline first,
and we hit this code:

1199         if (cpu != policy->cpu && !frozen) {
1200                 sysfs_remove_link(&dev->kobj, "cpufreq");
1201         } else if (cpus > 1) {
1202
1203                 new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu, frozen);
1204                 if (new_cpu >= 0) {
1205                         WARN_ON(lock_policy_rwsem_write(cpu));
1206                         update_policy_cpu(policy, new_cpu);
1207                         unlock_policy_rwsem_write(cpu);
1208
1209                         if (!frozen) {
1210                                 pr_debug("%s: policy Kobject moved to cpu: %d "
1211                                          "from: %d\n", __func__, new_cpu, cpu);
1212                         }
1213                 }
1214         }

At this point, the first 'if' condition fails because frozen == true.
So it enters the else part. But policy->cpu is actually 3, not 2, and
hence we invoke cpufreq_nominate_new_policy_cpu() unnecessarily. That
function returns 3, since that's the only CPU remaining in the mask, and
so we call update_policy_cpu() with new_cpu = 3, while old_cpu was also
3! And that is the perfect recipe for disaster, with the current
implementation of update_policy_cpu().

And my second patch [2] tried to fix this exact problem, although I
didn't realize we actually had a case where we hit this in the current
code itself. So please try by applying both the patches and let me know
how it goes. Thanks a lot for your testing efforts!

[1]. http://marc.info/?l=linux-kernel&m=137889516210816&w=2
[2]. http://marc.info/?l=linux-kernel&m=137889800511940&w=2

Regards,
Srivatsa S. Bhat
On 09/11/2013 12:42 PM, Srivatsa S. Bhat wrote:
...
> OK, I took a second look at the code, and I suspect that applying the
> second patch might help. So can you try by applying both the patches
> please[1][2]?
...
> [1]. http://marc.info/?l=linux-kernel&m=137889516210816&w=2
> [2]. http://marc.info/?l=linux-kernel&m=137889800511940&w=2

Yes, with both of those patches applied, the problem is solved :-)

I was going to test the second patch originally, but it sounded like it
was more of a cleanup rather than a fix for my issue, so I didn't bother
when I found the problem wasn't solved by patch 1. Sorry!

For the record, I'm testing on a 2-CPU system, so I'm not sure whether
your explanation applies; it talks about CPUs 2 and 3 whereas I only
have CPUs 0 and 1, but perhaps your explanation applies equally to any
pair of CPUs?

For the record, here's the information you requested in the other email:

# cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus
0 1
0 1
Let me fix my mail first.. I was running out of time yesterday and so
couldn't frame things correctly :)

On 11 September 2013 17:29, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Okay.. There are two different ways in which cpufreq_add_dev() works
> currently..
>
> Boot cluster (i.e. policy with the boot CPU)
> --------------------------------------------
>
> Here cpufreq_remove_dev() is never called for the boot CPU, but for all
> others. And similarly, cpufreq_add_dev() is never called for the boot
> CPU, but for all others.
>
> Now policy->cpu contains a meaningful CPU at the beginning of resume and
> we don't need to modify that at all.. For all the remaining CPUs we
> should rather call cpufreq_add_policy_cpu()..

And this should be done without your patch. Or actually we will simply
return from this place, at least for systems with a single cluster, like
Tegra.

policy->related_cpus is still valid after resume and we haven't removed
the policy from cpufreq_policy_list (though there is a bug there which I
have fixed separately and sent to you).. So no change is required for a
single-cluster system..

> Non-boot cluster
> ----------------
>
> All CPUs here are removed, and at the end policy->cpu contains the last
> CPU removed.. So, for a cluster with CPUs 2 and 3, it will contain 3..
>
> Now at resume we will add CPU 2 first and so need to update policy->cpu
> to 2..

> But for all other CPUs in this cluster we return early from
> cpufreq_add_dev() and call cpufreq_add_policy_cpu(), as policy->cpus
> was fixed by the call to ->init() for the first CPU of this cluster..

This was wrong; we need a valid policy->related_cpus field, which is
always valid, and so we return early here too, but not for the first CPU
of the cluster.

> And so we never reach the line: policy->cpu = cpu;
>
> For the first CPU of a non-boot cluster we need to call
> update_policy_cpu(), but not for the others..

That's correct, though I have one more idea.. :)

> But for the boot cluster, if we could call ->init() somehow at resume
> time, then things would be fairly similar in both cases..

Not required.. it's all working already.. and so Stephen shouldn't need
your patch for Tegra, but rather my patches that fix other cpufreq bugs..

Now coming back to the idea I have: the same code would work if the
hotplug sequence were fixed a bit. Why aren't we doing the exact opposite
of suspend in resume?

We are removing CPUs (leaving the boot CPU) in ascending order and then
adding them back in the same order.. Why?

Why not remove CPUs in descending order and add them in ascending order?
Or remove in ascending order and add in descending order? That way
policy->cpu would be updated with the right CPU and your patch wouldn't
be required..

I am not saying that this can't be hacked/fixed in cpufreq, but
suspend/resume may also be fixed, and that looks logically more correct
to me..
On 12 September 2013 00:12, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> OK, I took a second look at the code, and I suspect that applying the
> second patch might help. So can you try by applying both the patches
> please[1][2]?
>
> Basically, here is my hunch: say CPUs 2 and 3 are part of a policy and
> 3 is the policy->cpu. During suspend, CPU 2 will be taken offline first,
> and we hit this code:
>
> 1199         if (cpu != policy->cpu && !frozen) {
> 1200                 sysfs_remove_link(&dev->kobj, "cpufreq");
> 1201         } else if (cpus > 1) {
> 1202
> 1203                 new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu, frozen);
> 1204                 if (new_cpu >= 0) {
> 1205                         WARN_ON(lock_policy_rwsem_write(cpu));
> 1206                         update_policy_cpu(policy, new_cpu);
> 1207                         unlock_policy_rwsem_write(cpu);
> 1208
> 1209                         if (!frozen) {
> 1210                                 pr_debug("%s: policy Kobject moved to cpu: %d "
> 1211                                          "from: %d\n", __func__, new_cpu, cpu);
> 1212                         }
> 1213                 }
> 1214         }
>
> At this point, the first 'if' condition fails because frozen == true.
> So it enters the else part. But policy->cpu is actually 3, not 2, and
> hence we invoke cpufreq_nominate_new_policy_cpu() unnecessarily. That
> function returns 3, since that's the only CPU remaining in the mask, and
> so we call update_policy_cpu() with new_cpu = 3, while old_cpu was also
> 3! And that is the perfect recipe for disaster, with the current
> implementation of update_policy_cpu().

The problem here is not a wrong implementation of update_policy_cpu(),
but this check:

1199         if (cpu != policy->cpu && !frozen) {

Though I had fixed it before looking into your replies :)
On 12 September 2013 00:33, Stephen Warren <swarren@wwwdotorg.org> wrote:
> For the record, I'm testing on a 2-CPU system, so I'm not sure whether
> your explanation applies; it talks about CPUs 2 and 3 whereas I only
> have CPUs 0 and 1, but perhaps your explanation applies equally to any
> pair of CPUs?

It does apply to your system as well.. We remove CPU 1 during suspend and
so try to nominate CPU 0 as policy->cpu, which it already is :)
On 09/12/2013 11:22 AM, Viresh Kumar wrote:
> Let me fix my mail first.. I was running out of time yesterday and so
> couldn't frame things correctly :)
>
> On 11 September 2013 17:29, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> Okay.. There are two different ways in which cpufreq_add_dev() works
>> currently..
>>
>> Boot cluster (i.e. policy with the boot CPU)
>> --------------------------------------------
>>
>> Here cpufreq_remove_dev() is never called for the boot CPU, but for all
>> others. And similarly, cpufreq_add_dev() is never called for the boot
>> CPU, but for all others.
>>
>> Now policy->cpu contains a meaningful CPU at the beginning of resume and
>> we don't need to modify that at all.. For all the remaining CPUs we
>> should rather call cpufreq_add_policy_cpu()..
>
> And this should be done without your patch. Or actually we will simply
> return from this place, at least for systems with a single cluster, like
> Tegra.
>
> policy->related_cpus is still valid after resume and we haven't removed
> the policy from cpufreq_policy_list (though there is a bug there which I
> have fixed separately and sent to you).. So no change is required for a
> single-cluster system..
>
>> Non-boot cluster
>> ----------------
>>
>> All CPUs here are removed, and at the end policy->cpu contains the last
>> CPU removed.. So, for a cluster with CPUs 2 and 3, it will contain 3..
>>
>> Now at resume we will add CPU 2 first and so need to update policy->cpu
>> to 2..
>
>> But for all other CPUs in this cluster we return early from
>> cpufreq_add_dev() and call cpufreq_add_policy_cpu(), as policy->cpus
>> was fixed by the call to ->init() for the first CPU of this cluster..
>
> This was wrong; we need a valid policy->related_cpus field, which is
> always valid, and so we return early here too, but not for the first CPU
> of the cluster.
>
>> And so we never reach the line: policy->cpu = cpu;
>>
>> For the first CPU of a non-boot cluster we need to call
>> update_policy_cpu(), but not for the others..
>
> That's correct, though I have one more idea.. :)
>
>> But for the boot cluster, if we could call ->init() somehow at resume
>> time, then things would be fairly similar in both cases..
>
> Not required.. it's all working already.. and so Stephen shouldn't need
> your patch for Tegra, but rather my patches that fix other cpufreq bugs..
>
> Now coming back to the idea I have: the same code would work if the
> hotplug sequence were fixed a bit. Why aren't we doing the exact opposite
> of suspend in resume?
>
> We are removing CPUs (leaving the boot CPU) in ascending order and then
> adding them back in the same order.. Why?
>
> Why not remove CPUs in descending order and add them in ascending order?
> Or remove in ascending order and add in descending order?
>
I had the same thought when solving this bug.. We have had similar issues
with CPU hotplug notifiers too: why are they invoked in the same order
during both CPU down and up, instead of reversing the order? I even had a
patchset to perform reverse-invocation of notifiers:
http://lwn.net/Articles/508072/ ... but people didn't find it very
compelling to have.

> That way policy->cpu would be updated with the right CPU and your patch
> wouldn't be required..
>
> I am not saying that this can't be hacked/fixed in cpufreq, but
> suspend/resume may also be fixed, and that looks logically more correct
> to me..
>
It does to me too, but I think the reason nobody really bothered is that
perhaps not many other subsystems care about the order in which CPUs are
torn down or brought up; they just need the total number to match..
cpufreq is one exception, as we saw with this bug.

Regards,
Srivatsa S. Bhat
On 12 September 2013 11:56, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> I had the same thought when solving this bug.. We have had similar issues with
> CPU hotplug notifiers too: why are they invoked in the same order during both
> CPU down and up, instead of reversing the order? I even had a patchset to
> perform reverse-invocation of notifiers: http://lwn.net/Articles/508072/
> ... but people didn't find that very compelling to have.
>
> It does to me too, but I think the reason nobody really bothered is that
> perhaps not many other subsystems care about the order in which CPUs are torn
> down or brought up; they just need the total number to match.. cpufreq is one
> exception, as we saw with this bug.

Probably it's time to re-spin that series and make cpufreq one of the users
of that patchset.. Resume should be just the opposite of suspend, and so that
patchset would make sense even if not many people care about it :)

On top of that, there is one more problem that I see; I don't know if it is
really a big issue..

After a suspend/resume, the value of policy->cpu may get changed... and so
the hierarchy of the sysfs cpufreq files too.. A folder that had links to
other CPUs' folders can now be an actual folder instead of a link, and vice
versa..

Don't know if this can break something??
On 09/12/2013 12:11 PM, Viresh Kumar wrote:
> On 12 September 2013 11:56, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> I had the same thought when solving this bug.. We have had similar issues with
>> CPU hotplug notifiers too: why are they invoked in the same order during both
>> CPU down and up, instead of reversing the order? I even had a patchset to
>> perform reverse-invocation of notifiers: http://lwn.net/Articles/508072/
>> ... but people didn't find that very compelling to have.
>>
>> It does to me too, but I think the reason nobody really bothered is that
>> perhaps not many other subsystems care about the order in which CPUs are torn
>> down or brought up; they just need the total number to match.. cpufreq is one
>> exception, as we saw with this bug.
>
> Probably it's time to re-spin that series and make cpufreq one of the users
> of that patchset.. Resume should be just the opposite of suspend, and so
> that patchset would make sense even if not many people care about it :)
>
> On top of that, there is one more problem that I see; I don't know if it is
> really a big issue..
>
> After a suspend/resume, the value of policy->cpu may get changed... and so
> the hierarchy of the sysfs cpufreq files too.. A folder that had links to
> other CPUs' folders can now be an actual folder instead of a link, and vice
> versa..
>
> Don't know if this can break something??
>

Interesting observation :-) But we just managed to retain sysfs file
permissions across suspend/resume with a lot of trouble and regressions.
That's probably good enough for some time to come ;-) We can retain
folders/links when somebody really finds a need to do that ;-)

Of course, if we change the suspend/resume sequence and that fixes this, that
would be like getting it for free; nobody would say no to it ;-)

Regards,
Srivatsa S.
Bhat
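The folder/symlink swap discussed above can be sketched as a toy model (my own illustration, not the actual sysfs code; the `sysfs_layout` helper is an assumption). The CPU named by policy->cpu owns the real cpufreq directory and its sibling CPUs get symlinks to it, so if policy->cpu changes across suspend/resume, the roles swap:

```python
# Hypothetical model: which CPU holds the real cpufreq sysfs folder
# (policy->cpu) versus a symlink pointing at the owner's folder.

def sysfs_layout(policy_cpu, cpus):
    """Map each CPU to 'dir' (real folder) or 'link->cpuN' (symlink)."""
    return {cpu: "dir" if cpu == policy_cpu else f"link->cpu{policy_cpu}"
            for cpu in cpus}

cpus = [0, 1, 2, 3]
before = sysfs_layout(2, cpus)  # policy->cpu was 2 before suspend
after = sysfs_layout(0, cpus)   # policy->cpu became 0 after resume

# CPU 2's entry was a real folder before, but is a symlink afterwards.
print(before[2], "->", after[2])
```

Nothing here claims userspace actually breaks on this; it only makes concrete what "folders and links swapping roles" means.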
On 12 September 2013 12:16, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Of course, if we change the suspend/resume sequence and that fixes this, that
> would be like getting it for free; nobody would say no to it ;-)

Not really :)

Take a policy with 4 CPUs, 0,1,2,3, with policy->cpu currently set to 1, 2
or 3...

We will remove CPUs in the order 3,2,1 and add them back in the order
1,2,3... or vice versa.

policy->cpu after resume is 0 :)
On 09/12/2013 12:22 PM, Viresh Kumar wrote:
> On 12 September 2013 12:16, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Of course, if we change the suspend/resume sequence and that fixes this, that
>> would be like getting it for free; nobody would say no to it ;-)
>
> Not really :)
>
> Take a policy with 4 CPUs, 0,1,2,3, with policy->cpu currently set to 1, 2
> or 3...
>
> We will remove CPUs in the order 3,2,1 and add them back in the order
> 1,2,3... or vice versa.
>
> policy->cpu after resume is 0 :)
>

Ah, great counter-example! :-)

Regards,
Srivatsa S. Bhat
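The counter-example can be checked with a toy model (my own sketch, not kernel code; the lowest-surviving-CPU nomination rule in `suspend_resume` is an assumed stand-in for cpufreq_nominate_new_policy_cpu()). Since suspend removes every non-boot CPU, the boot CPU always ends up inheriting policy->cpu, regardless of the removal order:

```python
# Toy model of policy->cpu nomination during suspend/resume hotplug.
# When the CPU being offlined is policy->cpu, one of the surviving CPUs
# in the policy is nominated (lowest-numbered here, as an assumption).

def suspend_resume(policy_cpu, remove_order, add_order):
    online = {0, 1, 2, 3}
    for cpu in remove_order:          # offline the non-boot CPUs
        online.remove(cpu)
        if cpu == policy_cpu:
            policy_cpu = min(online)  # nominate a surviving CPU
    for cpu in add_order:             # online them again; the saved
        online.add(cpu)               # policy keeps its policy->cpu
    return policy_cpu

# Start with policy->cpu == 2; try both removal orders from the example.
print(suspend_resume(2, [1, 2, 3], [1, 2, 3]))  # 0
print(suspend_resume(2, [3, 2, 1], [1, 2, 3]))  # 0
```

Both orders converge on CPU 0, which is why merely reversing the hotplug sequence cannot restore the pre-suspend policy->cpu.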
On 09/12/2013 12:26 AM, Srivatsa S. Bhat wrote:
> On 09/12/2013 11:22 AM, Viresh Kumar wrote:
...
>> Now coming back to the ideas I have...
>> The same code will work if the hotplug sequence is fixed a bit. Why aren't
>> we doing the exact opposite of suspend in resume?
>>
>> We are removing CPUs (leaving the boot CPU) in ascending order and then
>> adding them back in the same order.. Why?
>>
>> Why not remove CPUs in descending order and add in ascending order? Or
>> remove in ascending order and add in descending order?
>
> I had the same thought when solving this bug.. We have had similar issues with
> CPU hotplug notifiers too: why are they invoked in the same order during both
> CPU down and up, instead of reversing the order? I even had a patchset to
> perform reverse-invocation of notifiers: http://lwn.net/Articles/508072/
> ... but people didn't find that very compelling to have.

I'm not sure if you're talking about the order the CPUs get plugged back
in after resume, or the order of the (pair of?) notifiers that gets
called for each individual CPU.

Sorry if this is blindingly obvious, but with CPU hotplug, I can
manually unplug/re-plug CPUs in any order I feel like, and cpufreq had
better work if I do that. Given that, I don't think there's any
particular need for suspend/resume to unplug/re-plug CPUs in any
particular order for correctness at least, although perhaps it'd be nice
cosmetically for some reason?
On 09/12/2013 09:25 PM, Stephen Warren wrote:
> On 09/12/2013 12:26 AM, Srivatsa S. Bhat wrote:
>> On 09/12/2013 11:22 AM, Viresh Kumar wrote:
> ...
>>> Now coming back to the ideas I have...
>>> The same code will work if the hotplug sequence is fixed a bit. Why aren't
>>> we doing the exact opposite of suspend in resume?
>>>
>>> We are removing CPUs (leaving the boot CPU) in ascending order and then
>>> adding them back in the same order.. Why?
>>>
>>> Why not remove CPUs in descending order and add in ascending order? Or
>>> remove in ascending order and add in descending order?
>>
>> I had the same thought when solving this bug.. We have had similar issues with
>> CPU hotplug notifiers too: why are they invoked in the same order during both
>> CPU down and up, instead of reversing the order? I even had a patchset to
>> perform reverse-invocation of notifiers: http://lwn.net/Articles/508072/
>> ... but people didn't find that very compelling to have.
>
> I'm not sure if you're talking about the order the CPUs get plugged back
> in after resume, or the order of the (pair of?) notifiers that gets
> called for each individual CPU.

Well, initially we were discussing the order in which the CPUs get plugged
back in after resume; then I gave an example of a similar ordering issue in
the way the registered CPU hotplug notifiers are invoked: during *both* CPU
online and offline, the chain of notifiers is invoked in the same order,
rather than in the opposite order.

To work around this unnatural ordering, subsystems have resorted to splitting
up their callbacks into multiple callbacks, assigning appropriate priorities
to them, etc., to get the ordering right. We could have done away with all
that complexity/confusion if the core CPU hotplug code had set the ordering
right. And that's what my patchset tried to do.

Anyway, never mind; as of now, subsystems do work around this suitably, so
there is no known bug as such at present.
Just that we could probably have done it in a better way, that's all.

> Sorry if this is blindingly obvious, but with CPU hotplug, I can
> manually unplug/re-plug CPUs in any order I feel like, and cpufreq had
> better work if I do that. Given that, I don't think there's any
> particular need for suspend/resume to unplug/re-plug CPUs in any
> particular order for correctness at least, although perhaps it'd be nice
> cosmetically for some reason?
>

You're absolutely right! Regular CPU hotplug is more demanding than
suspend/resume in the context we are discussing, since any CPU can be
hotplugged at any time and put back in any order. So code like cpufreq should
be prepared to work with any ordering. So yes, it's not very compelling to
change the suspend/resume CPU hotplug ordering, since cpufreq has to deal
with the regular (and harsher) situation anyway.

That said, one subtle but key distinction between regular CPU hotplug and
suspend/resume is that the kernel guarantees that the state of the system
after resume will be exactly the same as it was before suspend (at least to
the extent that is practically possible). On the other hand, no such
guarantees are made for regular CPU hotplug, since it's almost as if the
user is mutating the system at runtime!

However, suspend/resume as a concept itself provides the above-mentioned
guarantee; in fact, the very fact that CPU hotplug is used under the hood
for suspend/resume is just an implementation detail, and should not affect
the guarantees. That's the reason we often pay special attention to CPU
hotplug handling in the suspend/resume path, such as preserving sysfs file
permissions etc.

On those lines, Viresh and I were just wondering if 'fixing' the
suspend/resume CPU hotplug sequence would yield any additional benefits to
better serve this guarantee. That was the context in which this discussion
happened.

Regards,
Srivatsa S.
Bhat
On 12 September 2013 22:56, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/12/2013 09:25 PM, Stephen Warren wrote:
> Anyway, never mind; as of now, subsystems do work around this suitably, so
> there is no known bug as such at present. Just that we could probably have
> done it in a better way, that's all.

Yeah, there is no bug as of now, due to the number of hacks adopted by the
different frameworks.. I believe we can still have a cleanup series to take
care of this stuff.. That would be some improvement and would be better for
the future.. Otherwise these kinds of problems would keep coming up again
and again..

> You're absolutely right! Regular CPU hotplug is more demanding than
> suspend/resume in the context we are discussing, since any CPU can be
> hotplugged at any time and put back in any order. So code like cpufreq should
> be prepared to work with any ordering.

And that part is well implemented and tested, as far as I know..
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 5a64f66..62bdb95 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -947,6 +947,18 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
 	kfree(policy);
 }
 
+static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
+{
+	policy->last_cpu = policy->cpu;
+	policy->cpu = cpu;
+
+#ifdef CONFIG_CPU_FREQ_TABLE
+	cpufreq_frequency_table_update_policy_cpu(policy);
+#endif
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+			CPUFREQ_UPDATE_POLICY_CPU, policy);
+}
+
 static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 			     bool frozen)
 {
@@ -1000,7 +1012,18 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 	if (!policy)
 		goto nomem_out;
 
-	policy->cpu = cpu;
+
+	/*
+	 * In the resume path, since we restore a saved policy, the assignment
+	 * to policy->cpu is like an update of the existing policy, rather than
+	 * the creation of a brand new one. So we need to perform this update
+	 * by invoking update_policy_cpu().
+	 */
+	if (frozen && cpu != policy->cpu)
+		update_policy_cpu(policy, cpu);
+	else
+		policy->cpu = cpu;
+
 	policy->governor = CPUFREQ_DEFAULT_GOVERNOR;
 	cpumask_copy(policy->cpus, cpumask_of(cpu));
@@ -1092,18 +1115,6 @@ static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	return __cpufreq_add_dev(dev, sif, false);
 }
 
-static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
-{
-	policy->last_cpu = policy->cpu;
-	policy->cpu = cpu;
-
-#ifdef CONFIG_CPU_FREQ_TABLE
-	cpufreq_frequency_table_update_policy_cpu(policy);
-#endif
-	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
-			CPUFREQ_UPDATE_POLICY_CPU, policy);
-}
-
 static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy,
 					   unsigned int old_cpu, bool frozen)
 {