diff mbox

cpufreq_stats NULL deref on second system suspend

Message ID 52304439.3030301@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Srivatsa S. Bhat Sept. 11, 2013, 10:21 a.m. UTC
On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>> Viresh,
>>>>>>>>
>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
>>>>>>>>
>>>>>>>> I can avoid this with the following patch, which adds a check which
>>>>>>>> already exists in all-but-one other places that the same lookup is made:
>>>>>>>
>>>>>>> Which kernel did you test?
>>>>>>
>>>>>> next-20130909.
>>>>>
>>>>> Is it reproducible with the current mainline?
>>>>
>>>> This does not affect v3.11, but does affect current HEAD; 300893b "Merge
>>>> tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs".
>>>
>>> What system does it break on?
>>
>> A dual-core ARM system (NVIDIA Tegra20 SoC, Harmony board).
>>
>>> Any chance to bisect cpufreq changes between 3.11 and the current HEAD?
>>
>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>> during suspend/resume".
> 
> Thanks!
> 
> Srivatsa, any chance to look into this?
>

Sure, Rafael. Thanks for CC'ing me.

Stephen, I went through the code and I think I found out what is going wrong.
Can you please try the following patch?

Regards,
Srivatsa S. Bhat

----------------------------------------------------------------------------


From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] cpufreq: Fix crash in cpufreq-stats during suspend/resume

Stephen Warren reported that the cpufreq-stats code hits a NULL pointer
dereference during the second attempt to suspend a system. He also
pin-pointed the problem to commit 5302c3f "cpufreq: Perform light-weight
init/teardown during suspend/resume".

That commit actually ensured that the cpufreq-stats table and the
cpufreq-stats sysfs entries are *not* torn down (ie., not freed) during
suspend/resume, which makes it all the more surprising. However, it turns
out that the root-cause is not that we access an already freed memory, but
that the reference to the allocated memory gets moved around and we lose
track of that during resume, leading to the reported crash in a subsequent
suspend attempt.

In the suspend path, during CPU offline, the value of policy->cpu is updated
by choosing one of the surviving CPUs in that policy, as long as there is
atleast one CPU in that policy. And cpufreq_stats_update_policy_cpu() is
invoked to update the reference to the stats structure by assigning it to
the new CPU. However, in the resume path, during CPU online, we end up
assigning a fresh CPU as the policy->cpu, without letting cpufreq-stats
know about this. Thus the reference to the stats structure remains
(incorrectly) associated with the old CPU. So, in a subsequent suspend attempt,
during CPU offline, we end up accessing an incorrect location to get the
stats structure, which eventually leads to the NULL pointer dereference.

Fix this by letting cpufreq-stats know about the update of the policy->cpu
during CPU online in the resume path. (Also, move the update_policy_cpu()
function higher up in the file, so that __cpufreq_add_dev() can invoke
it).

Reported-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 drivers/cpufreq/cpufreq.c |   37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Viresh Kumar Sept. 11, 2013, 10:44 a.m. UTC | #1
On 11 September 2013 15:51, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:

>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>> during suspend/resume".

Sorry Stephen, I was away on vacations and came back yesterday only..
And was badly stuck in something other CPUFreq bugs until now :)

> Sure, Rafael. Thanks for CC'ing me.

Thanks for jumping in and helping us out buddy!!!.

> Stephen, I went through the code and I think I found out what is going wrong.
> Can you please try the following patch?
>
> Regards,
> Srivatsa S. Bhat
>
> ----------------------------------------------------------------------------
>
>
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: [PATCH] cpufreq: Fix crash in cpufreq-stats during suspend/resume
>
> Stephen Warren reported that the cpufreq-stats code hits a NULL pointer
> dereference during the second attempt to suspend a system. He also
> pin-pointed the problem to commit 5302c3f "cpufreq: Perform light-weight
> init/teardown during suspend/resume".
>
> That commit actually ensured that the cpufreq-stats table and the
> cpufreq-stats sysfs entries are *not* torn down (ie., not freed) during
> suspend/resume, which makes it all the more surprising. However, it turns
> out that the root-cause is not that we access an already freed memory, but
> that the reference to the allocated memory gets moved around and we lose
> track of that during resume, leading to the reported crash in a subsequent
> suspend attempt.
>
> In the suspend path, during CPU offline, the value of policy->cpu is updated
> by choosing one of the surviving CPUs in that policy, as long as there is
> atleast one CPU in that policy. And cpufreq_stats_update_policy_cpu() is
> invoked to update the reference to the stats structure by assigning it to
> the new CPU. However, in the resume path, during CPU online, we end up
> assigning a fresh CPU as the policy->cpu, without letting cpufreq-stats
> know about this. Thus the reference to the stats structure remains
> (incorrectly) associated with the old CPU. So, in a subsequent suspend attempt,
> during CPU offline, we end up accessing an incorrect location to get the
> stats structure, which eventually leads to the NULL pointer dereference.
>
> Fix this by letting cpufreq-stats know about the update of the policy->cpu
> during CPU online in the resume path. (Also, move the update_policy_cpu()
> function higher up in the file, so that __cpufreq_add_dev() can invoke
> it).

Observation looks good..

> Reported-by: Stephen Warren <swarren@nvidia.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  drivers/cpufreq/cpufreq.c |   37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 5a64f66..62bdb95 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -947,6 +947,18 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
>         kfree(policy);
>  }
>
> +static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> +{
> +       policy->last_cpu = policy->cpu;
> +       policy->cpu = cpu;
> +
> +#ifdef CONFIG_CPU_FREQ_TABLE
> +       cpufreq_frequency_table_update_policy_cpu(policy);
> +#endif
> +       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> +                       CPUFREQ_UPDATE_POLICY_CPU, policy);
> +}
> +
>  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>                              bool frozen)
>  {
> @@ -1000,7 +1012,18 @@ static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
>         if (!policy)
>                 goto nomem_out;
>
> -       policy->cpu = cpu;
> +
> +       /*
> +        * In the resume path, since we restore a saved policy, the assignment
> +        * to policy->cpu is like an update of the existing policy, rather than
> +        * the creation of a brand new one. So we need to perform this update
> +        * by invoking update_policy_cpu().
> +        */
> +       if (frozen && cpu != policy->cpu)
> +               update_policy_cpu(policy, cpu);
> +       else
> +               policy->cpu = cpu;
> +
>         policy->governor = CPUFREQ_DEFAULT_GOVERNOR;
>         cpumask_copy(policy->cpus, cpumask_of(cpu));
>
> @@ -1092,18 +1115,6 @@ static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
>         return __cpufreq_add_dev(dev, sif, false);
>  }
>
> -static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
> -{
> -       policy->last_cpu = policy->cpu;
> -       policy->cpu = cpu;
> -
> -#ifdef CONFIG_CPU_FREQ_TABLE
> -       cpufreq_frequency_table_update_policy_cpu(policy);
> -#endif
> -       blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> -                       CPUFREQ_UPDATE_POLICY_CPU, policy);
> -}
> -

But I would have solved it differently :)

We don't really need to call update_policy_cpu() again and again
as we don't really need to update policy->cpu...

Rather it would be better to just move following inside
cpufreq_policy_alloc():
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 11, 2013, 10:45 a.m. UTC | #2
On 11 September 2013 16:14, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> But I would have solved it differently :)
>
> We don't really need to call update_policy_cpu() again and again
> as we don't really need to update policy->cpu...
>
> Rather it would be better to just move following inside
> cpufreq_policy_alloc():

Half written mail sent :(

What about moving following to cpufreq_policy_alloc():

policy->cpu = cpu;

??
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 11:10 a.m. UTC | #3
On 09/11/2013 04:14 PM, Viresh Kumar wrote:
> On 11 September 2013 15:51, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
> 
>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>> during suspend/resume".
> 
> Sorry Stephen, I was away on vacations and came back yesterday only..
> And was badly stuck in something other CPUFreq bugs until now :)
> 
>> Sure, Rafael. Thanks for CC'ing me.
> 
> Thanks for jumping in and helping us out buddy!!!.
>

No problem :-) Besides, I am the one who broke it ;-(
 
Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 11:14 a.m. UTC | #4
On 09/11/2013 04:15 PM, Viresh Kumar wrote:
> On 11 September 2013 16:14, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> But I would have solved it differently :)
>>
>> We don't really need to call update_policy_cpu() again and again
>> as we don't really need to update policy->cpu...
>>
>> Rather it would be better to just move following inside
>> cpufreq_policy_alloc():
> 
> Half written mail sent :(
> 
> What about moving following to cpufreq_policy_alloc():
> 
> policy->cpu = cpu;
> 
> ??
> 

Hmm? The problem is not about merely updating the policy->cpu field; the
main issue is that the existing code was not letting the cpufreq-stats
code know that we updated the policy->cpu under the hood. It is important
for cpufreq-stats to know this because it maintains the reference to its
stats structure by associating it with the policy->cpu. So if policy->cpu
changes under the hood, it loses track of its reference. So we need to
keep that code informed about changes to policy->cpu. Thus, we need to call
update_policy_cpu() in the CPU online path (during resume). I don't see
how we can skip that.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 11, 2013, 11:15 a.m. UTC | #5
On 11 September 2013 16:40, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/11/2013 04:14 PM, Viresh Kumar wrote:
>> On 11 September 2013 15:51, Srivatsa S. Bhat
>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>
>>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>>> during suspend/resume".
>>
>> Sorry Stephen, I was away on vacations and came back yesterday only..
>> And was badly stuck in something other CPUFreq bugs until now :)
>>
>>> Sure, Rafael. Thanks for CC'ing me.
>>
>> Thanks for jumping in and helping us out buddy!!!.
>>
>
> No problem :-) Besides, I am the one who broke it ;-(

I believe you missed some part of my mail ?
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 11:17 a.m. UTC | #6
On 09/11/2013 04:45 PM, Viresh Kumar wrote:
> On 11 September 2013 16:40, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 09/11/2013 04:14 PM, Viresh Kumar wrote:
>>> On 11 September 2013 15:51, Srivatsa S. Bhat
>>> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>
>>>>>> Sure, it's due to 5302c3f "cpufreq: Perform light-weight init/teardown
>>>>>> during suspend/resume".
>>>
>>> Sorry Stephen, I was away on vacations and came back yesterday only..
>>> And was badly stuck in something other CPUFreq bugs until now :)
>>>
>>>> Sure, Rafael. Thanks for CC'ing me.
>>>
>>> Thanks for jumping in and helping us out buddy!!!.
>>>
>>
>> No problem :-) Besides, I am the one who broke it ;-(
> 
> I believe you missed some part of my mail ?
> 

See my next reply :-) I was composing it :-)

Man, you are *fast*! ;-)
 
Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 11, 2013, 11:41 a.m. UTC | #7
On 11 September 2013 16:47, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> See my next reply :-) I was composing it :-)
>
> Man, you are *fast*! ;-)

haha...
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 11, 2013, 11:59 a.m. UTC | #8
On 11 September 2013 16:44, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Hmm? The problem is not about merely updating the policy->cpu field; the
> main issue is that the existing code was not letting the cpufreq-stats
> code know that we updated the policy->cpu under the hood. It is important
> for cpufreq-stats to know this because it maintains the reference to its
> stats structure by associating it with the policy->cpu. So if policy->cpu
> changes under the hood, it loses track of its reference. So we need to
> keep that code informed about changes to policy->cpu. Thus, we need to call
> update_policy_cpu() in the CPU online path (during resume). I don't see
> how we can skip that.

Okay.. There are two different ways in which cpufreq_add_dev() work
currently..

Boot cluster (i.e. policy with boot CPU)
---------------

Here cpufreq_remove_dev() is never called for boot cpu but all others.
And similarly cpufreq_add_dev() is never called for boot cpu but all others.

Now policy->cpu contains meaningful cpu at beginning of resume and
we don't need to modify that at all.. For all the remaining CPUs we
better call cpufreq_add_policy_cpu() rather..

Non-boot Cluster
---------------------

All CPUs here are removed and at the end policy->cpu contains the last
cpu removed.. So, for a cluster with cpu 2 and 3.... it will contain 3..

Not at resume we will add cpu2 first and so need to update policy->cpu
to 2.. But for all other CPUs in this cluster we return early from
cpufreq_add_dev() and call cpufreq_add_policy_cpu() as policy->cpus
was fixed by call to ->init() for the first cpu of this cluster..

And so we never reach the line: policy->cpu = cpu;

For the first cpu of non-boot cluster we need to call update_policy_cpu()
and not for others..


But for the boot cluster if we can call ->init() somehow at resume time,
then things would be fairly similar in both cases..

I am running of time now, as need to leave office now...
I hope I made the problem more clear or probably the way I see it :)

--
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 1:56 p.m. UTC | #9
On 09/11/2013 05:29 PM, Viresh Kumar wrote:
> On 11 September 2013 16:44, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Hmm? The problem is not about merely updating the policy->cpu field; the
>> main issue is that the existing code was not letting the cpufreq-stats
>> code know that we updated the policy->cpu under the hood. It is important
>> for cpufreq-stats to know this because it maintains the reference to its
>> stats structure by associating it with the policy->cpu. So if policy->cpu
>> changes under the hood, it loses track of its reference. So we need to
>> keep that code informed about changes to policy->cpu. Thus, we need to call
>> update_policy_cpu() in the CPU online path (during resume). I don't see
>> how we can skip that.
> 
> Okay.. There are two different ways in which cpufreq_add_dev() work
> currently..
> 
> Boot cluster (i.e. policy with boot CPU)
> ---------------
> 
> Here cpufreq_remove_dev() is never called for boot cpu but all others.
> And similarly cpufreq_add_dev() is never called for boot cpu but all others.
> 
> Now policy->cpu contains meaningful cpu at beginning of resume and
> we don't need to modify that at all.. For all the remaining CPUs we
> better call cpufreq_add_policy_cpu() rather..
>

Yes, and the existing code already does that. And if you observe the placement
of the call to update_policy_cpu() in my patch, you'll find that it won't
be called in cases where we call cpufreq_add_policy_cpu(). Which is exactly
what you want, IIUC.

> Non-boot Cluster
> ---------------------
> 
> All CPUs here are removed and at the end policy->cpu contains the last
> cpu removed.. So, for a cluster with cpu 2 and 3.... it will contain 3..
> 
> Not at resume we will add cpu2 first and so need to update policy->cpu
> to 2.. But for all other CPUs in this cluster we return early from
> cpufreq_add_dev() and call cpufreq_add_policy_cpu() as policy->cpus
> was fixed by call to ->init() for the first cpu of this cluster..
> 
> And so we never reach the line: policy->cpu = cpu;
> 
> For the first cpu of non-boot cluster we need to call update_policy_cpu()
> and not for others..
> 

Yes, and that's precisely why I added the call to update_policy_cpu() only
near the statement 'policy->cpu = cpu'. All other cases don't need to call
that function, and in my patch, they indeed don't call that function.

A simple way to look at this is:
If policy->cpu is updated anywhere in the resume (cpu online) path, we need
to call update_policy_cpu(). Otherwise, we don't need to call that function.
This will cover both the scenarios you described above.

> 
> But for the boot cluster if we can call ->init() somehow at resume time,
> then things would be fairly similar in both cases..
> 
> I am running of time now, as need to leave office now...
> I hope I made the problem more clear or probably the way I see it :)
> 

Well, your explanation matches my understanding of the scenarios, and
I had written the patch keeping those scenarios in mind; so I believe the
patch is correct as it is. So what I didn't get is this: are you suggesting
that something is wrong in my patch, or are you just reiterating the different
scenarios involved so that I can recheck whether my patch is right?

If it is the former, please point me to the exact problem you found in
my patch. If it is the latter, I'm pretty sure my patch is right, I've
already checked it :-)

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Warren Sept. 11, 2013, 4:05 p.m. UTC | #10
On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>> Viresh,
>>>>>>>>>
>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
...
> Stephen, I went through the code and I think I found out what is going wrong.
> Can you please try the following patch?

Unfortunately, I still see the exact same failure/backtrace with this
patch applied.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 6:03 p.m. UTC | #11
On 09/11/2013 09:35 PM, Stephen Warren wrote:
> On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>>> Viresh,
>>>>>>>>>>
>>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
> ...
>> Stephen, I went through the code and I think I found out what is going wrong.
>> Can you please try the following patch?
> 
> Unfortunately, I still see the exact same failure/backtrace with this
> patch applied.
> 

Oh, is it? Can you please give me the map of the related cpus on your
system? (ie., cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus for
each CPU.)

I must be missing something...
 
Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 11, 2013, 6:42 p.m. UTC | #12
On 09/11/2013 11:33 PM, Srivatsa S. Bhat wrote:
> On 09/11/2013 09:35 PM, Stephen Warren wrote:
>> On 09/11/2013 04:21 AM, Srivatsa S. Bhat wrote:
>>> On 09/11/2013 04:04 AM, Rafael J. Wysocki wrote:
>>>> On Tuesday, September 10, 2013 02:53:01 PM Stephen Warren wrote:
>>>>> On 09/09/2013 05:14 PM, Rafael J. Wysocki wrote:
>>>>>> On Monday, September 09, 2013 03:29:06 PM Stephen Warren wrote:
>>>>>>> On 09/09/2013 02:24 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Monday, September 09, 2013 02:01:32 PM Stephen Warren wrote:
>>>>>>>>> On 09/09/2013 02:01 PM, Rafael J. Wysocki wrote:
>>>>>>>>>> On Monday, September 09, 2013 01:22:23 PM Stephen Warren wrote:
>>>>>>>>>>> Viresh,
>>>>>>>>>>>
>>>>>>>>>>> I'm seeing the crash below when suspending my system for the second time.
>> ...
>>> Stephen, I went through the code and I think I found out what is going wrong.
>>> Can you please try the following patch?
>>
>> Unfortunately, I still see the exact same failure/backtrace with this
>> patch applied.
>>
> 
> Oh, is it? Can you please give me the map of the related cpus on your
> system? (ie., cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus for
> each CPU.)
> 
> I must be missing something...
>

OK, I took a second look at the code, and I suspect that applying the
second patch might help. So can you try by applying both the patches
please[1][2]?

Basically here is my hunch: say CPUs 2 and 3 are part of a policy and
3 is the policy->cpu. During suspend, CPU 2 will be taken offline first,
and we hit this code:

1199         if (cpu != policy->cpu && !frozen) {
1200                 sysfs_remove_link(&dev->kobj, "cpufreq");
1201         } else if (cpus > 1) {
1202 
1203                 new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu, frozen);
1204                 if (new_cpu >= 0) {
1205                         WARN_ON(lock_policy_rwsem_write(cpu));
1206                         update_policy_cpu(policy, new_cpu);
1207                         unlock_policy_rwsem_write(cpu);
1208 
1209                         if (!frozen) {
1210                                 pr_debug("%s: policy Kobject moved to cpu: %d "
1211                                          "from: %d\n",__func__, new_cpu, cpu);
1212                         }
1213                 }
1214         }

At this point, the first 'if' condition fails because frozen == true.
So it enters the else part. But, policy->cpu is actually 3, not 2,
and hence we invoke nominate_...() unnecessarily. That function returns
3 since that's the only CPU remaining in the mask, and so we call
update_policy_cpu() with new_cpu = 3, and old_cpu was also 3! And that
is the perfect recipe for disaster, with the current implementation of
update_policy_cpu().

And my second patch [2] tried to fix this exact problem, although I
didn't realize we actually had a case where we hit this in the current
code itself.

So please try by applying both the patches and let me know how it goes.
Thanks a lot for your testing efforts!

[1]. http://marc.info/?l=linux-kernel&m=137889516210816&w=2
[2]. http://marc.info/?l=linux-kernel&m=137889800511940&w=2
 
Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Warren Sept. 11, 2013, 7:03 p.m. UTC | #13
On 09/11/2013 12:42 PM, Srivatsa S. Bhat wrote:
...
> OK, I took a second look at the code, and I suspect that applying the
> second patch might help. So can you try by applying both the patches
> please[1][2]?
> 
...
> [1]. http://marc.info/?l=linux-kernel&m=137889516210816&w=2
> [2]. http://marc.info/?l=linux-kernel&m=137889800511940&w=2

Yes, with both of those patches applies, the problem is solved:-)

I was going to test the second patch originally, but it sounded like it
was more of a cleanup rather than a fix for my issue, so I didn't bother
when I found the problem wasn't solved by patch 1. Sorry!

For the record, I'm testing on a 2-CPU system, so I'm not sure whether
your explanation applies; it talks about CPUs 2 and 3 whereas I only
have CPUs 0 and 1, but perhaps your explanation applies equally to any
pair of CPUs?

For the record, here's the information you requested in the other email:

# cat /sys/devices/system/cpu/cpu*/cpufreq/related_cpus
0 1
0 1

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 12, 2013, 5:52 a.m. UTC | #14
Let me fix my mail first.. I was running out of time yesterday and so couldn't
frame things correctly :)

On 11 September 2013 17:29, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Okay.. There are two different ways in which cpufreq_add_dev() work
> currently..
>
> Boot cluster (i.e. policy with boot CPU)
> ---------------
>
> Here cpufreq_remove_dev() is never called for boot cpu but all others.
> And similarly cpufreq_add_dev() is never called for boot cpu but all others.
>
> Now policy->cpu contains meaningful cpu at beginning of resume and
> we don't need to modify that at all.. For all the remaining CPUs we
> better call cpufreq_add_policy_cpu() rather..

And this should be done without your patch. Or actually we will simply
return from this place. Atleast for systems with single cluster, like Tegra.

policy->related_cpus is still valid after resume and we haven't removed
policy from the cpufreq_policy_list (though there is a bug which I have
fixed separately and sent it to you..).. So no change required for a single
cluster system..

> Non-boot Cluster
> ---------------------
>
> All CPUs here are removed and at the end policy->cpu contains the last
> cpu removed.. So, for a cluster with cpu 2 and 3.... it will contain 3..
>
> Now at resume we will add cpu2 first and so need to update policy->cpu
> to 2..

> But for all other CPUs in this cluster we return early from
> cpufreq_add_dev() and call cpufreq_add_policy_cpu() as policy->cpus
> was fixed by call to ->init() for the first cpu of this cluster..

This was wrong, we need a valid policy->related_cpus field which is always
valid and so we return early here too, but not for the first cpu of cluster.

> And so we never reach the line: policy->cpu = cpu;
>
> For the first cpu of non-boot cluster we need to call update_policy_cpu()
> and not for others..

that's correct, thought I have one more idea.. :)

> But for the boot cluster if we can call ->init() somehow at resume time,
> then things would be fairly similar in both cases..

Not required.. its all working already.. and so Stephen shouldn't need your
patch for Tegra, but rather my patches that fix other cpufreq bugs..

Now coming back to the ideas I have...
Same code will work if hotplug sequence is fixed a bit. Why aren't we doing
exact opposite of suspend in resume?

We are removing CPUs (leaving the boot cpu) in ascending order and then
adding them back in same order.. Why?

Why not remove CPUs in descending order and add in ascending order? Or
remove in ascending order and add in descending order?

That way policy->cpu will be updated with the right cpu and your patch wouldn't
be required..

I am not saying that this can't be hacked/fixed in cpufreq but suspend/resume
may also be fixed and that looks logically more correct to me..
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 12, 2013, 5:58 a.m. UTC | #15
On 12 September 2013 00:12, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> OK, I took a second look at the code, and I suspect that applying the
> second patch might help. So can you try by applying both the patches
> please[1][2]?
>
> Basically here is my hunch: say CPUs 2 and 3 are part of a policy and
> 3 is the policy->cpu. During suspend, CPU 2 will be taken offline first,
> and we hit this code:
>
> 1199         if (cpu != policy->cpu && !frozen) {
> 1200                 sysfs_remove_link(&dev->kobj, "cpufreq");
> 1201         } else if (cpus > 1) {
> 1202
> 1203                 new_cpu = cpufreq_nominate_new_policy_cpu(policy, cpu, frozen);
> 1204                 if (new_cpu >= 0) {
> 1205                         WARN_ON(lock_policy_rwsem_write(cpu));
> 1206                         update_policy_cpu(policy, new_cpu);
> 1207                         unlock_policy_rwsem_write(cpu);
> 1208
> 1209                         if (!frozen) {
> 1210                                 pr_debug("%s: policy Kobject moved to cpu: %d "
> 1211                                          "from: %d\n",__func__, new_cpu, cpu);
> 1212                         }
> 1213                 }
> 1214         }
>
> At this point, the first 'if' condition fails because frozen == true.
> So it enters the else part. But, policy->cpu is actually 3, not 2,
> and hence we invoke nominate_...() unnecessarily. That function returns
> 3 since that's the only CPU remaining in the mask, and so we call
> update_policy_cpu() with new_cpu = 3, and old_cpu was also 3! And that
> is the perfect recipe for disaster, with the current implementation of
> update_policy_cpu().

The problem here is not the wrong implementation of update_policy_cpu()
but

1199         if (cpu != policy->cpu && !frozen) {

Though I have fixed it before looking into your replies :)
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 12, 2013, 6 a.m. UTC | #16
On 12 September 2013 00:33, Stephen Warren <swarren@wwwdotorg.org> wrote:
> For the record, I'm testing on a 2-CPU system, so I'm not sure whether
> your explanation applies; it talks about CPUs 2 and 3 whereas I only
> have CPUs 0 and 1, but perhaps your explanation applies equally to any
> pair of CPUs?

It does apply to your system as well.. We remove cpu 1 in suspend and so
try to nominate cpu 0 as policy->cpu, which it already is :)
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 12, 2013, 6:26 a.m. UTC | #17
On 09/12/2013 11:22 AM, Viresh Kumar wrote:
> Let me fix my mail first.. I was running out of time yesterday and so couldn't
> frame things correctly :)
> 
> On 11 September 2013 17:29, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> Okay.. There are two different ways in which cpufreq_add_dev() work
>> currently..
>>
>> Boot cluster (i.e. policy with boot CPU)
>> ---------------
>>
>> Here cpufreq_remove_dev() is never called for boot cpu but all others.
>> And similarly cpufreq_add_dev() is never called for boot cpu but all others.
>>
>> Now policy->cpu contains meaningful cpu at beginning of resume and
>> we don't need to modify that at all.. For all the remaining CPUs we
>> better call cpufreq_add_policy_cpu() rather..
> 
> And this should be done without your patch. Or actually we will simply
> return from this place. Atleast for systems with single cluster, like Tegra.
> 
> policy->related_cpus is still valid after resume and we haven't removed
> policy from the cpufreq_policy_list (though there is a bug which I have
> fixed separately and sent it to you..).. So no change required for a single
> cluster system..
> 
>> Non-boot Cluster
>> ---------------------
>>
>> All CPUs here are removed and at the end policy->cpu contains the last
>> cpu removed.. So, for a cluster with cpu 2 and 3.... it will contain 3..
>>
>> Now at resume we will add cpu2 first and so need to update policy->cpu
>> to 2..
> 
>> But for all other CPUs in this cluster we return early from
>> cpufreq_add_dev() and call cpufreq_add_policy_cpu() as policy->cpus
>> was fixed by call to ->init() for the first cpu of this cluster..
> 
> This was wrong, we need a valid policy->related_cpus field which is always
> valid and so we return early here too, but not for the first cpu of cluster.
> 
>> And so we never reach the line: policy->cpu = cpu;
>>
>> For the first cpu of non-boot cluster we need to call update_policy_cpu()
>> and not for others..
> 
> that's correct, thought I have one more idea.. :)
> 
>> But for the boot cluster if we can call ->init() somehow at resume time,
>> then things would be fairly similar in both cases..
> 
> Not required.. its all working already.. and so Stephen shouldn't need your
> patch for Tegra, but rather my patches that fix other cpufreq bugs..
> 
> Now coming back to the ideas I have...
> Same code will work if hotplug sequence is fixed a bit. Why aren't we doing
> exact opposite of suspend in resume?
> 
> We are removing CPUs (leaving the boot cpu) in ascending order and then
> adding them back in same order.. Why?
> 
> Why not remove CPUs in descending order and add in ascending order? Or
> remove in ascending order and add in descending order?
> 

I had the same thought when solving this bug.. We have had similar issues with
CPU hotplug notifiers too: why are they invoked in the same order during both
CPU down and up, instead of reversing the order? I even had a patchset to perform
reverse-invocation of notifiers.. http://lwn.net/Articles/508072/
... but people didn't find that very compelling to have.


> That way policy->cpu will be updated with the right cpu and your patch wouldn't
> be required..
> 
> I am not saying that this can't be hacked/fixed in cpufreq but suspend/resume
> may also be fixed and that looks logically more correct to me..
> 

It does to me too, but I think the reason nobody really bothered is because perhaps
not many other subsystems care about the order in which CPUs are torn down or
brought up; they just need the total number to match.. cpufreq is one exception
as we saw with this bug.


Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 12, 2013, 6:41 a.m. UTC | #18
On 12 September 2013 11:56, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> I had the same thought when solving this bug.. We have had similar issues with
> CPU hotplug notifiers too: why are they invoked in the same order during both
> CPU down and up, instead of reversing the order? I even had a patchset to perform
> reverse-invocation of notifiers.. http://lwn.net/Articles/508072/
> ... but people didn't find that very compelling to have.
>

> It does to me too, but I think the reason nobody really bothered is because perhaps
> not many other subsystems care about the order in which CPUs are torn down or
> brought up; they just need the total number to match.. cpufreq is one exception
> as we saw with this bug.

Probably its time to re-spin that series and make CPUFreq as one of the users
of that patchset.. Resume should be just opposite of suspend and so
that patchset
would make sense even if not many people care about it :)

Over that there is one more problem that I see, don't know if it is really a big
issue..

After a suspend/resume value of policy->cpu may get changed... And so the
hierarchy of sysfs cpufreq files too.. Folder that had links to other
CPUs folder
can now be actual folders instead of links and vice versa..

Don't know if this can break something ??
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 12, 2013, 6:46 a.m. UTC | #19
On 09/12/2013 12:11 PM, Viresh Kumar wrote:
> On 12 September 2013 11:56, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> I had the same thought when solving this bug.. We have had similar issues with
>> CPU hotplug notifiers too: why are they invoked in the same order during both
>> CPU down and up, instead of reversing the order? I even had a patchset to perform
>> reverse-invocation of notifiers.. http://lwn.net/Articles/508072/
>> ... but people didn't find that very compelling to have.
>>
> 
>> It does to me too, but I think the reason nobody really bothered is because perhaps
>> not many other subsystems care about the order in which CPUs are torn down or
>> brought up; they just need the total number to match.. cpufreq is one exception
>> as we saw with this bug.
> 
> Probably its time to re-spin that series and make CPUFreq as one of the users
> of that patchset.. Resume should be just opposite of suspend and so
> that patchset
> would make sense even if not many people care about it :)
> 
> Over that there is one more problem that I see, don't know if it is really a big
> issue..
> 
> After a suspend/resume value of policy->cpu may get changed... And so the
> hierarchy of sysfs cpufreq files too.. Folder that had links to other
> CPUs folder
> can now be actual folders instead of links and vice versa..
> 
> Don't know if this can break something ??
> 

Interesting observation :-) But we just managed to retain sysfs file permissions
across suspend/resume with a lot of trouble and regressions. That's probably
good enough for some time to come ;-) We can retain folder/links when somebody
really finds a need to do that ;-)

Of course, if we change the suspend/resume sequence and that fixes this, that
would be like getting it for free, nobody would say no to it ;-)

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 12, 2013, 6:52 a.m. UTC | #20
On 12 September 2013 12:16, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> Of course, if we change the suspend/resume sequence and that fixes this, that
> would be like getting it for free, nobody would say no to it ;-)

Not really :)

Policy with 4 CPUs, 0,1,2,3, policy->cpu currently set to 1, 2 or 3...

We will remove CPUs in order 3,2,1 and add back in 1,2,3... Or Vice Versa

policy->cpu after resume is 0 :)
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 12, 2013, 7:14 a.m. UTC | #21
On 09/12/2013 12:22 PM, Viresh Kumar wrote:
> On 12 September 2013 12:16, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> Of course, if we change the suspend/resume sequence and that fixes this, that
>> would be like getting it for free, nobody would say no to it ;-)
> 
> Not really :)
> 
> Policy with 4 CPUs, 0,1,2,3, policy->cpu currently set to 1, 2 or 3...
> 
> We will remove CPUs in order 3,2,1 and add back in 1,2,3... Or Vice Versa
> 
> policy->cpu after resume is 0 :)
> 

Ah, great counter-example! :-)

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Warren Sept. 12, 2013, 3:55 p.m. UTC | #22
On 09/12/2013 12:26 AM, Srivatsa S. Bhat wrote:
> On 09/12/2013 11:22 AM, Viresh Kumar wrote:
...
>> Now coming back to the ideas I have...
>> Same code will work if hotplug sequence is fixed a bit. Why aren't we doing
>> exact opposite of suspend in resume?
>>
>> We are removing CPUs (leaving the boot cpu) in ascending order and then
>> adding them back in same order.. Why?
>>
>> Why not remove CPUs in descending order and add in ascending order? Or
>> remove in ascending order and add in descending order?
> 
> I had the same thought when solving this bug.. We have had similar issues with
> CPU hotplug notifiers too: why are they invoked in the same order during both
> CPU down and up, instead of reversing the order? I even had a patchset to perform
> reverse-invocation of notifiers.. http://lwn.net/Articles/508072/
> ... but people didn't find that very compelling to have.

I'm not sure if you're talking about the order the CPUs get plugged back
in after resume, or the order of the (pair of?) notifiers that gets
called for each individual CPU.

Sorry if this is blindingly obvious, but with CPU hotplug, I can
manually unplug/re-plug CPUs in any order I feel like, and cpufreq had
better work if I do that. Given that, I don't think there's any
particular need for suspend/resume to unplug/re-plug CPUs in any
particular order for correctness at least, although perhaps it'd be nice
cosmetically for some reason?
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 12, 2013, 5:26 p.m. UTC | #23
On 09/12/2013 09:25 PM, Stephen Warren wrote:
> On 09/12/2013 12:26 AM, Srivatsa S. Bhat wrote:
>> On 09/12/2013 11:22 AM, Viresh Kumar wrote:
> ...
>>> Now coming back to the ideas I have...
>>> Same code will work if hotplug sequence is fixed a bit. Why aren't we doing
>>> exact opposite of suspend in resume?
>>>
>>> We are removing CPUs (leaving the boot cpu) in ascending order and then
>>> adding them back in same order.. Why?
>>>
>>> Why not remove CPUs in descending order and add in ascending order? Or
>>> remove in ascending order and add in descending order?
>>
>> I had the same thought when solving this bug.. We have had similar issues with
>> CPU hotplug notifiers too: why are they invoked in the same order during both
>> CPU down and up, instead of reversing the order? I even had a patchset to perform
>> reverse-invocation of notifiers.. http://lwn.net/Articles/508072/
>> ... but people didn't find that very compelling to have.
> 
> I'm not sure if you're talking about the order the CPUs get plugged back
> in after resume, or the order of the (pair of?) notifiers that gets
> called for each individual CPU.

Well, initially we were discussing about the order the CPUs get plugged back
in after resume, then I gave an example of a similar ordering issue in the
way the registered CPU hotplug notifiers are invoked: during *both* CPU online
and offline, the chain of notifiers are invoked in the same order, rather than
in the opposite order. To work around this unnatural ordering, subsystems have
resorted to splitting up their callbacks into multiple callbacks and then
assigning appropriate priorities etc., to them, to get the ordering right.
We could have done away with all that complexity/confusion if the core CPU
hotplug code had set the ordering right. And that's what my patchset tried to
do. Anyway, nevermind, as of now, subsystems do work around this suitably, so
there is no known bug as such at the present. Just that we could have probably
done it a better way, that's all.

> 
> Sorry if this is blindingly obvious, but with CPU hotplug, I can
> manually unplug/re-plug CPUs in any order I feel like, and cpufreq had
> better work if I do that. Given that, I don't think there's any
> particular need for suspend/resume to unplug/re-plug CPUs in any
> particular order for correctness at least, although perhaps it'd be nice
> cosmetically for some reason?
> 

You're absolutely right! Regular CPU hotplug is more demanding than
suspend/resume in the context we are discussing, since any CPU can be
hotplugged at any time and put back in any order. So code like cpufreq should
be prepared to work with any ordering. So yes, its not very compelling to
change the suspend/resume CPU hotplug ordering, since cpufreq has to deal
with the regular (and harsher) situation anyway.

That said, one subtle but key distinction between regular CPU hotplug and
suspend/resume is that, the kernel guarantees that the state of the system
after resume will be exactly the same as it was before suspend (atleast to
the extent that is practically possible). On the other hand, no such
guarantees are made for regular CPU hotplug, since its almost as if the user
is mutating the system at runtime!. However, suspend/resume as a concept
itself provides the above mentioned guarantee, and in fact, the very fact
that CPU hotplug is used under the hood for suspend/resume is just an
implementation detail, and should not affect the guarantees.

That's the reason we often pay special attention to CPU hotplug handling in
the suspend/resume path, such as preserving sysfs file permissions etc..
On those lines, Viresh and me were just wondering if 'fixing' the suspend/
resume CPU hotplug sequence would yield any additional benefits to better
serve this guarantee. That was the context in which this discussion happened.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Viresh Kumar Sept. 13, 2013, 4:26 a.m. UTC | #24
On 12 September 2013 22:56, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 09/12/2013 09:25 PM, Stephen Warren wrote:
> Anyway, nevermind, as of now, subsystems do work around this suitably, so
> there is no known bug as such at the present. Just that we could have probably
> done it a better way, that's all.

Yeah, there is no bug as of now due to the number of hacks adopted by different
framework.. I believe we can still have a cleanup series to take care
of this stuff..
That would be some improvement and would be better for future.. Otherwise
this kind of problems would keep coming again and again..

> You're absolutely right! Regular CPU hotplug is more demanding than
> suspend/resume in the context we are discussing, since any CPU can be
> hotplugged at any time and put back in any order. So code like cpufreq should
> be prepared to work with any ordering.

And that part is well implemented and tested as far as I know..
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 5a64f66..62bdb95 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -947,6 +947,18 @@  static void cpufreq_policy_free(struct cpufreq_policy *policy)
 	kfree(policy);
 }
 
+static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
+{
+	policy->last_cpu = policy->cpu;
+	policy->cpu = cpu;
+
+#ifdef CONFIG_CPU_FREQ_TABLE
+	cpufreq_frequency_table_update_policy_cpu(policy);
+#endif
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+			CPUFREQ_UPDATE_POLICY_CPU, policy);
+}
+
 static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 			     bool frozen)
 {
@@ -1000,7 +1012,18 @@  static int __cpufreq_add_dev(struct device *dev, struct subsys_interface *sif,
 	if (!policy)
 		goto nomem_out;
 
-	policy->cpu = cpu;
+
+	/*
+	 * In the resume path, since we restore a saved policy, the assignment
+	 * to policy->cpu is like an update of the existing policy, rather than
+	 * the creation of a brand new one. So we need to perform this update
+	 * by invoking update_policy_cpu().
+	 */
+	if (frozen && cpu != policy->cpu)
+		update_policy_cpu(policy, cpu);
+	else
+		policy->cpu = cpu;
+
 	policy->governor = CPUFREQ_DEFAULT_GOVERNOR;
 	cpumask_copy(policy->cpus, cpumask_of(cpu));
 
@@ -1092,18 +1115,6 @@  static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif)
 	return __cpufreq_add_dev(dev, sif, false);
 }
 
-static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu)
-{
-	policy->last_cpu = policy->cpu;
-	policy->cpu = cpu;
-
-#ifdef CONFIG_CPU_FREQ_TABLE
-	cpufreq_frequency_table_update_policy_cpu(policy);
-#endif
-	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
-			CPUFREQ_UPDATE_POLICY_CPU, policy);
-}
-
 static int cpufreq_nominate_new_policy_cpu(struct cpufreq_policy *policy,
 					   unsigned int old_cpu, bool frozen)
 {