[2/5] cpufreq, fix locking around CPUFREQ_GOV_POLICY_EXIT calls

Message ID 1415199239-19019-3-git-send-email-prarit@redhat.com (mailing list archive)
State Superseded, archived

Commit Message

Prarit Bhargava Nov. 5, 2014, 2:53 p.m. UTC
Commit 955ef4833574636819cd269cfbae12f79cbde63a ("cpufreq: Drop rwsem
lock around CPUFREQ_GOV_POLICY_EXIT") opens up a hole in the locking
scheme for cpufreq.

Simple tests, such as rapidly switching the governor between ondemand and
performance or attempting to read policy values while a governor switch occurs,
now fail with NULL pointer warnings, sysfs namespace collisions, and
system hangs.  In short, the locking that policy->rwsem is supposed to provide
is currently broken.

The identified commit attempts to resolve a lockdep warning by dropping
the lock around the section of code that shuts down the existing
policy.  The problem is that this code is part of the _critical_ section
that switches the governors and must be protected by the lock; without
the lock, readers may access NULL or stale data, and writers may collide
with each other.

With the previous patch, which returns -EBUSY during times of
contention, the deadlock reported in
955ef4833574636819cd269cfbae12f79cbde63a ("cpufreq: Drop rwsem lock
around CPUFREQ_GOV_POLICY_EXIT") cannot occur, so the locks can be added
back into this section of code.

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
---
 drivers/cpufreq/cpufreq.c |    4 ----
 include/linux/cpufreq.h   |    4 ----
 2 files changed, 8 deletions(-)

Comments

Viresh Kumar Nov. 10, 2014, 10:44 a.m. UTC | #1
On 5 November 2014 20:23, Prarit Bhargava <prarit@redhat.com> wrote:
> commit 955ef4833574636819cd269cfbae12f79cbde63a (" cpufreq: Drop rwsem
> lock around CPUFREQ_GOV_POLICY_EXIT") opens up a hole in the locking
> scheme for cpufreq.
>
> Simple tests such as rapidly switching the governor between ondemand and
> performance or attempting to read policy values while a governor switch occurs
> now fail with NULL pointer warnings, sysfs namespace collisions, and
> system hangs.  In short, the locking that policy->rwsem is supposed to provide
> is currently broken.
>
> The identified commit attempts to resolve a lockdep warning by removing
> a lock around a section of code which does a shutdown of the
> existing policy.  The problem is that this is part of the _critical_ section of
> code that switches the governors and must be protected by the lock; without
> locking readers may access now NULL or stale data, and writes may collide with
> each other.
>
> With the previous patch, which now returns -EBUSY during times of
> contention the deadlock reported in
> 955ef4833574636819cd269cfbae12f79cbde63a (" cpufreq: Drop rwsem lock
> around CPUFREQ_GOV_POLICY_EXIT") cannot occur, so adding the locks back
> into this section of code is possible.

I still fail to understand why. What will the _trylock() change?
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prarit Bhargava Nov. 10, 2014, 12:26 p.m. UTC | #2
On 11/10/2014 05:44 AM, Viresh Kumar wrote:
> On 5 November 2014 20:23, Prarit Bhargava <prarit@redhat.com> wrote:
>> commit 955ef4833574636819cd269cfbae12f79cbde63a (" cpufreq: Drop rwsem
>> lock around CPUFREQ_GOV_POLICY_EXIT") opens up a hole in the locking
>> scheme for cpufreq.
>>
>> Simple tests such as rapidly switching the governor between ondemand and
>> performance or attempting to read policy values while a governor switch occurs
>> now fail with NULL pointer warnings, sysfs namespace collisions, and
>> system hangs.  In short, the locking that policy->rwsem is supposed to provide
>> is currently broken.
>>
>> The identified commit attempts to resolve a lockdep warning by removing
>> a lock around a section of code which does a shutdown of the
>> existing policy.  The problem is that this is part of the _critical_ section of
>> code that switches the governors and must be protected by the lock; without
>> locking readers may access now NULL or stale data, and writes may collide with
>> each other.
>>
>> With the previous patch, which now returns -EBUSY during times of
>> contention the deadlock reported in
>> 955ef4833574636819cd269cfbae12f79cbde63a (" cpufreq: Drop rwsem lock
>> around CPUFREQ_GOV_POLICY_EXIT") cannot occur, so adding the locks back
>> into this section of code is possible.
> 
> I still fail to understand why ? What will the _trylock() change ?

Viresh, AFAICT read_trylock will return 0 when the lock is held for write:

static inline int queue_read_trylock(struct qrwlock *lock)
{
        u32 cnts;

        cnts = atomic_read(&lock->cnts);
        if (likely(!(cnts & _QW_WMASK))) {

so the deadlock will not occur.  The show() is opened, the write lock is taken,
and if the thread is rescheduled and tries to take the read lock, the trylock
will return 0 and the thread will return -EBUSY to userspace, avoiding the
deadlock.

P.
Viresh Kumar Nov. 11, 2014, 3:37 a.m. UTC | #3
On 10 November 2014 17:56, Prarit Bhargava <prarit@redhat.com> wrote:

>> I still fail to understand why ? What will the _trylock() change ?
>
> viresh, afaict read_trylock will return 0 when busy with write:

Yes..

> static inline int queue_read_trylock(struct qrwlock *lock)
> {
>         u32 cnts;
>
>         cnts = atomic_read(&lock->cnts);
>         if (likely(!(cnts & _QW_WMASK))) {
>
> so the deadlock will not occur.  the show() is opened, write lock is taken, and
> if the thread is rescheduled and takes read lock the trylock will return 0, and
> the thread will return -EBUSY to userspace avoiding the deadlock.

Which deadlock? And also your changelog talks about accessing invalid pointers
without the trylock change, how can that be possible? After the read lock is
taken, all the pointers should be valid.
Prarit Bhargava Nov. 11, 2014, 12:15 p.m. UTC | #4
On 11/10/2014 10:37 PM, Viresh Kumar wrote:
> On 10 November 2014 17:56, Prarit Bhargava <prarit@redhat.com> wrote:
> 
>>> I still fail to understand why ? What will the _trylock() change ?
>>
>> viresh, afaict read_trylock will return 0 when busy with write:
> 
> Yes..
> 
>> static inline int queue_read_trylock(struct qrwlock *lock)
>> {
>>         u32 cnts;
>>
>>         cnts = atomic_read(&lock->cnts);
>>         if (likely(!(cnts & _QW_WMASK))) {
>>
>> so the deadlock will not occur.  the show() is opened, write lock is taken, and
>> if the thread is rescheduled and takes read lock the trylock will return 0, and
>> the thread will return -EBUSY to userspace avoiding the deadlock.
> 
> Which deadlock? 

the deadlock in commit 955ef4833574636819cd269cfbae12f79cbde63a

[   75.471265]        CPU0                    CPU1
[   75.476327]        ----                    ----
[   75.481385]   lock(&policy->rwsem);
[   75.485307]                                lock(s_active#219);
[   75.491857]                                lock(&policy->rwsem);
[   75.498592]   lock(s_active#219);
[   75.502331]
[   75.502331]  *** DEADLOCK ***
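
To spell out the ordering in that report, here is a small userspace sketch. Two pthread mutexes stand in for policy->rwsem and the sysfs s_active reference, and the second acquisition on each side is probed with trylock so the demo itself never hangs. All demo_* names are hypothetical; this only models the lock order, not the real cpufreq or sysfs code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for policy->rwsem and the sysfs s_active reference. */
static pthread_mutex_t demo_rwsem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t demo_s_active = PTHREAD_MUTEX_INITIALIZER;

/*
 * Freeze the moment after each CPU has taken its first lock, then probe
 * the second acquisition with trylock instead of blocking.
 * Returns 1 if both second acquisitions would block (the AB-BA cycle).
 */
static int demo_abba_cycle(void)
{
	int rc, cpu0_blocked, cpu1_blocked;

	rc = pthread_mutex_lock(&demo_rwsem);		/* CPU0: lock(&policy->rwsem) */
	assert(rc == 0);
	rc = pthread_mutex_lock(&demo_s_active);	/* CPU1: lock(s_active) */
	assert(rc == 0);

	/* CPU1: lock(&policy->rwsem) would have to wait for CPU0 */
	cpu1_blocked = (pthread_mutex_trylock(&demo_rwsem) == EBUSY);
	/* CPU0: lock(s_active) would have to wait for CPU1 */
	cpu0_blocked = (pthread_mutex_trylock(&demo_s_active) == EBUSY);

	pthread_mutex_unlock(&demo_s_active);
	pthread_mutex_unlock(&demo_rwsem);
	return cpu0_blocked && cpu1_blocked;
}
```

demo_abba_cycle() reports that each side is waiting on the lock the other already holds; with blocking locks neither would ever make progress.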

> And also your changelog talks about accessing invalid pointers
> without the trylock change, how can that be possible? After the read
> lock is taken, all the pointers should be valid.

Consider the following very simple case:

The governor is ondemand.  cpu0 reads cpuinfo_cur_freq and expects to get the
current cpu freq for the ondemand governor.

Simultaneously, cpu1 changes the governor from ondemand to userspace.

The two threads will race for policy->rwsem.

Suppose cpu0 gets it first.  Then there is no problem: the userspace program
on cpu0 gets exactly the data it is expecting.

Now suppose cpu1 gets the lock and starts to write ... cpu0 is blocked.

cpu1 completes the governor change, and cpu0 gets the lock ... and returns
bogus data at this point.

P.
Viresh Kumar Nov. 11, 2014, 1:07 p.m. UTC | #5
On 11 November 2014 17:45, Prarit Bhargava <prarit@redhat.com> wrote:
> the deadlock in commit 955ef4833574636819cd269cfbae12f79cbde63a
>
> [   75.471265]        CPU0                    CPU1
> [   75.476327]        ----                    ----
> [   75.481385]   lock(&policy->rwsem);
> [   75.485307]                                lock(s_active#219);
> [   75.491857]                                lock(&policy->rwsem);
> [   75.498592]   lock(s_active#219);
> [   75.502331]
> [   75.502331]  *** DEADLOCK ***

I wanted to understand how this deadlock is prevented by a simple change
to trylock..

>> And also your changelog talks about accessing invalid pointers
>> without the trylock change, how can that be possible? After the read
>> lock is taken,
>> all the pointers should be valid.
>
> consider the following very simple case:
>
> the governor is ondemand.  cpu 0 reads cpuinfo_cur_freq. cpu0 expects to get the
> current cpu freq for the ondemand governor.

Name it A.

>
> simultaneously, cpu1 changes the governor from ondemand to userspace.

Name it B.

>
> the two threads will race for the policy->mutex
>
> suppose cpu0 gets it first.  then there is no problem.  the userspace program
> for cpu0 gets exactly the data it is expecting.
>
> Now suppose cpu1 gets the lock and starts to write ... cpu0 is blocked.
>
> cpu1 completes the governor change, and cpu0 gets the mutex ... and returns
> bogus data at this point.

What do you mean by bogus here? That userspace wouldn't be able to know which
governor the value is for?

If that's the case, then it can still happen. Issue both of the above commands
at almost the same time. You will never be able to tell whether the sequence is:

- A followed by B
- B followed by A
- A waited for B and so returned -EBUSY (only this will be clear)

And the value read can still be bogus. So we haven't solved the problem at all.
Saravana Kannan Nov. 13, 2014, 9:58 p.m. UTC | #6
On 11/11/2014 05:07 AM, Viresh Kumar wrote:
> On 11 November 2014 17:45, Prarit Bhargava <prarit@redhat.com> wrote:
>> the deadlock in commit 955ef4833574636819cd269cfbae12f79cbde63a
>>
>> [   75.471265]        CPU0                    CPU1
>> [   75.476327]        ----                    ----
>> [   75.481385]   lock(&policy->rwsem);
>> [   75.485307]                                lock(s_active#219);
>> [   75.491857]                                lock(&policy->rwsem);
>> [   75.498592]   lock(s_active#219);
>> [   75.502331]
>> [   75.502331]  *** DEADLOCK ***
>
> I wanted to understand how this deadlock is prevented by a simple change
> to trylock..
>
>>> And also your changelog talks about accessing invalid pointers
>>> without the trylock change, how can that be possible? After the read
>>> lock is taken,
>>> all the pointers should be valid.
>>
>> consider the following very simple case:
>>
>> the governor is ondemand.  cpu 0 reads cpuinfo_cur_freq. cpu0 expects to get the
>> current cpu freq for the ondemand governor.
>
> Name it A.
>
>>
>> simultaneously, cpu1 changes the governor from ondemand to userspace.
>
> Name it B.
>
>>
>> the two threads will race for the policy->mutex
>>
>> suppose cpu0 gets it first.  then there is no problem.  the userspace program
>> for cpu0 gets exactly the data it is expecting.
>>
>> Now suppose cpu1 gets the lock and starts to write ... cpu0 is blocked.
>>
>> cpu1 completes the governor change, and cpu0 gets the mutex ... and returns
>> bogus data at this point.
>
> What do you mean by bogus here? That userspace wouldn't be able to know if
> the value is for which governor?
>
> If that's the case than it can still happen. Issue both above commands at almost
> the same time. You will never be able to differentiate if the sequence is:
>
> - A followed by B
> - B followed by A
> - A waited for B and so returned -EBUSY (Only this will be clear)
>
> And the value read can still be bogus. So, we haven't solved the problem at all.

Ah, we are on this topic again I see. I didn't read the patch/thread 
fully, but I can guess where this is going by reading the partial set of 
patches.

Prarit,

You can't just trylock to avoid the deadlock. If you do, then the
userspace API becomes a mess. Writes to scaling_governor (or anything
else) will no longer be guaranteed to work. Userspace will have to read
back, check and retry. That would break a ton of existing userspace scripts.
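
As a rough illustration of what every userspace writer would then need, here is a sketch of a retry loop. demo_write_retry() is a hypothetical helper, and the -EBUSY return from the write is the behaviour the trylock approach would introduce, not something existing tools are written to handle.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `val` to a sysfs-style file, retrying while
 * the write fails with EBUSY. Returns 0 on success, -errno on failure.
 */
static int demo_write_retry(const char *path, const char *val, int max_tries)
{
	size_t len = strlen(val);
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		int fd = open(path, O_WRONLY);
		ssize_t n;
		int err;

		if (fd < 0)
			return -errno;
		n = write(fd, val, len);
		err = errno;		/* save before close() can clobber it */
		close(fd);
		if (n == (ssize_t)len)
			return 0;
		if (n >= 0)
			return -EIO;	/* short write: treat as an error */
		if (err != EBUSY)
			return -err;
		/* EBUSY: loop and try again (a real tool would back off here) */
	}
	return -EBUSY;
}
```

A plain "echo performance > scaling_governor" in a script has no such loop, which is the compatibility problem.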

Viresh,

The deadlock scenario is real. That's why the code is what it is today.

All,

IMO, the right way to fix this is to have the governor hand over the
list of attributes it wants to expose through sysfs to the cpufreq
framework. Then the framework can add/remove them in the right order
when the governors are changed. The framework can do this outside of the
policy lock being held when the governors are switched. This would
avoid the original deadlock between sysfs locks and the policy lock
without ever having to fail userspace writes to scaling_governor.
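
A very rough userspace sketch of that idea, with every name made up for illustration (none of these types or functions exist in cpufreq): the governor only declares its attributes, and the framework removes and adds them with the policy lock dropped. The ordering shown (remove old, switch under the lock, add new) is just one reading of "the right order".

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_ATTRS 8

/* Hypothetical governor: it only declares its attribute names. */
struct demo_governor {
	const char *name;
	const char *attrs[DEMO_MAX_ATTRS];	/* NULL-terminated */
};

struct demo_policy {
	pthread_mutex_t lock;			/* stands in for policy->rwsem */
	int lock_held;				/* tracked only for the assertions */
	const struct demo_governor *gov;
	const char *visible[DEMO_MAX_ATTRS];	/* stands in for sysfs entries */
	size_t nr_visible;
};

/* Stand-in for sysfs removal; must never run under the policy lock. */
static void demo_remove_attrs(struct demo_policy *p)
{
	assert(!p->lock_held);
	p->nr_visible = 0;
}

/* Stand-in for sysfs creation; must never run under the policy lock. */
static void demo_add_attrs(struct demo_policy *p, const struct demo_governor *g)
{
	size_t i;

	assert(!p->lock_held);
	for (i = 0; i < DEMO_MAX_ATTRS && g->attrs[i] &&
		    p->nr_visible < DEMO_MAX_ATTRS; i++)
		p->visible[p->nr_visible++] = g->attrs[i];
}

/* Framework-side switch: sysfs work outside the lock, pointer swap inside. */
static void demo_switch_governor(struct demo_policy *p,
				 const struct demo_governor *new_gov)
{
	demo_remove_attrs(p);		/* old governor's files go away first */

	pthread_mutex_lock(&p->lock);
	p->lock_held = 1;
	p->gov = new_gov;		/* stop old / start new would happen here */
	p->lock_held = 0;
	pthread_mutex_unlock(&p->lock);

	demo_add_attrs(p, new_gov);	/* new governor's files appear last */
}
```

Because the sysfs stand-ins never run with the lock held, the rwsem/s_active inversion cannot form in this model, and a write to scaling_governor never has to fail with -EBUSY.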

-Saravana
Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 3f09ca9..e33cb15 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2222,9 +2222,7 @@  static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	/* end old governor */
 	if (old_gov) {
 		__cpufreq_governor(policy, CPUFREQ_GOV_STOP);
-		up_write(&policy->rwsem);
 		__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
-		down_write(&policy->rwsem);
 	}
 
 	/* start new governor */
@@ -2233,9 +2231,7 @@  static int cpufreq_set_policy(struct cpufreq_policy *policy,
 		if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
 			goto out;
 
-		up_write(&policy->rwsem);
 		__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
-		down_write(&policy->rwsem);
 	}
 
 	/* new governor failed, so re-start old one */
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 503b085..43909ad 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -100,10 +100,6 @@  struct cpufreq_policy {
 	 * - Any routine that will write to the policy structure and/or may take away
 	 *   the policy altogether (eg. CPU hotplug), will hold this lock in write
 	 *   mode before doing so.
-	 *
-	 * Additional rules:
-	 * - Lock should not be held across
-	 *     __cpufreq_governor(data, CPUFREQ_GOV_POLICY_EXIT);
 	 */
 	struct rw_semaphore	rwsem;