
[V2,0/7] cpufreq: governors: Fix ABBA lockups

Message ID 20160204062439.GZ3469@vireshk (mailing list archive)
State Not Applicable, archived

Commit Message

Viresh Kumar Feb. 4, 2016, 6:24 a.m. UTC
On 03-02-16, 21:40, Viresh Kumar wrote:
> On 03-02-16, 15:54, Juri Lelli wrote:
> > Ouch, I've just got this executing -f basic on Juno. :(
> > It happens with the hotplug_1_by_1 test.
> > 
> > 
> > [ 1086.531252] IRQ1 no longer affine to CPU1
> > [ 1086.531495] CPU1: shutdown
> > [ 1086.538199] psci: CPU1 killed.
> > [ 1086.583396]
> > [ 1086.584881] ======================================================
> > [ 1086.590999] [ INFO: possible circular locking dependency detected ]
> > [ 1086.597205] 4.5.0-rc2+ #37 Not tainted
> > [ 1086.600914] -------------------------------------------------------
> > [ 1086.607118] runme.sh/1052 is trying to acquire lock:
> > [ 1086.612031]  (sb_writers#7){.+.+.+}, at: [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
> > [ 1086.620090]
> > [ 1086.620090] but task is already holding lock:
> > [ 1086.625865]  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
> > [ 1086.634081]
> > [ 1086.634081] which lock already depends on the new lock.
> > [ 1086.634081]
> > [ 1086.642180]
> > [ 1086.642180] the existing dependency chain (in reverse order) is:
> > [ 1086.649589]
> > -> #1 (&policy->rwsem){+++++.}:
> > [ 1086.653929]        [<ffffffc00011d9a4>] check_prev_add+0x670/0x754
> > [ 1086.660060]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
> > [ 1086.666876]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
> > [ 1086.673001]        [<ffffffc000120b58>] lock_release+0x244/0x570
> > [ 1086.678955]        [<ffffffc0007351d0>] __mutex_unlock_slowpath+0xa0/0x18c
> > [ 1086.685771]        [<ffffffc0007352dc>] mutex_unlock+0x20/0x2c
> > [ 1086.691553]        [<ffffffc0002ccd24>] kernfs_fop_write+0xb0/0x194
> > [ 1086.697768]        [<ffffffc00024478c>] __vfs_write+0x48/0x104
> > [ 1086.703550]        [<ffffffc0002457a4>] vfs_write+0x98/0x198
> > [ 1086.709161]        [<ffffffc0002465e4>] SyS_write+0x54/0xb0
> > [ 1086.714684]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
> > [ 1086.720555]
> > -> #0 (sb_writers#7){.+.+.+}:
> > [ 1086.724730]        [<ffffffc00011c574>] print_circular_bug+0x80/0x2e4
> > [ 1086.731116]        [<ffffffc00011d470>] check_prev_add+0x13c/0x754
> > [ 1086.737243]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
> > [ 1086.744059]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
> > [ 1086.750184]        [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
> > [ 1086.756052]        [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
> > [ 1086.762180]        [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
> > [ 1086.768306]        [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
> > [ 1086.774263]        [<ffffffc0002555f8>] do_last+0x660/0xcb8
> > [ 1086.779788]        [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
> > [ 1086.785570]        [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
> > [ 1086.791353]        [<ffffffc000244058>] do_sys_open+0x150/0x214
> > [ 1086.797222]        [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
> > [ 1086.802831]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
> > [ 1086.808700]
> > [ 1086.808700] other info that might help us debug this:
> > [ 1086.808700]
> > [ 1086.816627]  Possible unsafe locking scenario:
> > [ 1086.816627]
> > [ 1086.822488]        CPU0                    CPU1
> > [ 1086.826971]        ----                    ----
> > [ 1086.831453]   lock(&policy->rwsem);
> > [ 1086.834918]                                lock(sb_writers#7);
> > [ 1086.840713]                                lock(&policy->rwsem);
> > [ 1086.846671]   lock(sb_writers#7);
> > [ 1086.849972]
> > [ 1086.849972]  *** DEADLOCK ***
> > [ 1086.849972]
> > [ 1086.855836] 1 lock held by runme.sh/1052:
> > [ 1086.859802]  #0:  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
> > [ 1086.868453]
> > [ 1086.868453] stack backtrace:
> > [ 1086.872769] CPU: 5 PID: 1052 Comm: runme.sh Not tainted 4.5.0-rc2+ #37
> > [ 1086.879229] Hardware name: ARM Juno development board (r2) (DT)
> > [ 1086.885089] Call trace:
> > [ 1086.887511] [<ffffffc00008a788>] dump_backtrace+0x0/0x1f4
> > [ 1086.892858] [<ffffffc00008a99c>] show_stack+0x20/0x28
> > [ 1086.897861] [<ffffffc00041a380>] dump_stack+0x84/0xc0
> > [ 1086.902863] [<ffffffc00011c6c8>] print_circular_bug+0x1d4/0x2e4
> > [ 1086.908725] [<ffffffc00011d470>] check_prev_add+0x13c/0x754
> > [ 1086.914244] [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
> > [ 1086.920448] [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
> > [ 1086.925965] [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
> > [ 1086.931224] [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
> > [ 1086.936742] [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
> > [ 1086.942260] [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
> > [ 1086.947605] [<ffffffc0002555f8>] do_last+0x660/0xcb8
> > [ 1086.952520] [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
> > [ 1086.957693] [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
> > [ 1086.962865] [<ffffffc000244058>] do_sys_open+0x150/0x214
> > [ 1086.968123] [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
> > [ 1086.973124] [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
> > [ 1087.019315] Detected PIPT I-cache on CPU1
> > [ 1087.019373] CPU1: Booted secondary processor [410fd080]
> 
> Urg..

Urg square :(

> I failed to understand it for now though. Please test only the first 4
> patches and leave the bottom three. AFAICT, this is caused by the 6th
> patch.

I still haven't fully understood this from the code, but something just
caught my eye: the 6th patch needs a fixup (the diff appears in the
reply below and again in the Patch section at the end).

I tried the basic tests using './runme' and they aren't reporting the
same lockdep now. And yes, your lockdep occurred on my exynos board as
well :)

I have re-pushed my patches again to the same branch. All 7 look fine
to me now :)

Comments

Viresh Kumar Feb. 4, 2016, 12:17 p.m. UTC | #1
On 04-02-16, 11:54, Viresh Kumar wrote:
> I still haven't fully understood this from the code, but something
> just caught my eye: the 6th patch needs this fixup:
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 7bc8a5ed97e5..ac3348ecde7b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1351,7 +1351,7 @@ static void cpufreq_offline(unsigned int cpu)
>                                 pr_err("%s: Failed to start governor\n", __func__);
>                 }
>  
> -               return;
> +               goto unlock;
>         }
>  
>         if (cpufreq_driver->stop_cpu)
> @@ -1373,6 +1373,8 @@ static void cpufreq_offline(unsigned int cpu)
>                 cpufreq_driver->exit(policy);
>                 policy->freq_table = NULL;
>         }
> +
> +unlock:
>         up_write(&policy->rwsem);
>  }
> 
> I tried the basic tests using './runme' and they aren't reporting the
> same lockdep now. And yes, your lockdep occurred on my exynos board as
> well :)
> 
> I have re-pushed my patches again to the same branch. All 7 look fine
> to me now :)

FWIW, Juri has reported on IRC that the above diff fixed the lockdep
he reported yesterday, and that all 7 patches are working fine on his
Juno test machine.
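
To make the effect of that one-line change easier to see, here is a
minimal user-space sketch (illustrative only, not the kernel code; a
pthread rwlock stands in for policy->rwsem and both functions are
hypothetical simplifications of cpufreq_offline()): the early return
exits with the write lock still held, so any later writer blocks
forever, which is what routing every exit through a single unlock
label avoids.

/*
 * Illustrative user-space sketch of the fixup above; not kernel code.
 * A pthread rwlock stands in for policy->rwsem, and both functions are
 * hypothetical simplifications of cpufreq_offline().
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t policy_rwsem = PTHREAD_RWLOCK_INITIALIZER;

/* Buggy shape: the early return exits with the write lock still held. */
static void offline_buggy(bool other_cpus_still_online)
{
	pthread_rwlock_wrlock(&policy_rwsem);

	if (other_cpus_still_online) {
		/* ... restart the governor for the remaining CPUs ... */
		return;			/* BUG: policy_rwsem is never released */
	}

	/* ... stop the governor and exit the driver ... */
	pthread_rwlock_unlock(&policy_rwsem);
}

/* Fixed shape: every path funnels through one unlock label, as in the diff. */
static void offline_fixed(bool other_cpus_still_online)
{
	pthread_rwlock_wrlock(&policy_rwsem);

	if (other_cpus_still_online) {
		/* ... restart the governor for the remaining CPUs ... */
		goto unlock;		/* mirrors the "goto unlock" in the diff */
	}

	/* ... stop the governor and exit the driver ... */
unlock:
	pthread_rwlock_unlock(&policy_rwsem);
}

int main(void)
{
	offline_fixed(true);		/* lock released on the early path */
	offline_fixed(false);		/* and on the normal path */
	printf("rwsem balanced on both paths\n");

	offline_buggy(true);		/* leaves the lock held, so another  */
	/* offline_buggy(false); */	/* write-lock attempt would hang here */
	return 0;
}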
Shilpasri G Bhat Feb. 4, 2016, 8:50 p.m. UTC | #2
Hi,

> 
> FWIW, Juri has reported on IRC that the above diff fixed the lockdep
> he reported yesterday and all the 7 patches are working fine on his
> test machine, Juno.
> 

I could see the previous lockdep warnings on pm/next and with the series
applied up to patch [4]. With patches [5-7] applied, I see the lockdep
trace below when running './runme' on a Power8 box.

[  710.332841] ======================================================
[  710.332911] [ INFO: possible circular locking dependency detected ]
[  710.332969] 4.5.0-rc2-sgb+ #104 Not tainted
[  710.333004] -------------------------------------------------------
[  710.333060] runme.sh/2476 is trying to acquire lock:
[  710.333107]  (s_active#91){++++.+}, at: [<c000000000407db8>]
kernfs_remove+0x48/0x70
[  710.333215]
but task is already holding lock:
[  710.333272]  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7434>]
cpufreq_governor_dbs+0x64/0x7e0
[  710.333388]
which lock already depends on the new lock.

[  710.333456]
the existing dependency chain (in reverse order) is:
[  710.333523]
-> #2 (od_dbs_cdata.mutex){+.+.+.}:
[  710.333604]        [<c000000000d48450>] mutex_lock_nested+0x90/0x590
[  710.333673]        [<c000000000ad5698>] update_sampling_rate+0x88/0x1c0
[  710.333741]        [<c000000000ad5830>] store_sampling_rate+0x60/0xa0
[  710.333809]        [<c000000000ad6990>] governor_store+0x80/0xc0
[  710.333865]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[  710.333923]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[  710.333991]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[  710.334049]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[  710.334107]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[  710.334163]        [<c00000000000926c>] system_call+0x38/0xd0
[  710.334222]
-> #1 (&dbs_data->mutex){+.+...}:
[  710.334290]        [<c000000000d48450>] mutex_lock_nested+0x90/0x590
[  710.334358]        [<c000000000ad6960>] governor_store+0x50/0xc0
[  710.334415]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[  710.334471]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[  710.334539]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[  710.334596]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[  710.334653]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[  710.334710]        [<c00000000000926c>] system_call+0x38/0xd0
[  710.334767]
-> #0 (s_active#91){++++.+}:
[  710.334847]        [<c00000000015f318>] lock_acquire+0xd8/0x1a0
[  710.334905]        [<c0000000004065f4>] __kernfs_remove+0x344/0x410
[  710.334973]        [<c000000000407db8>] kernfs_remove+0x48/0x70
[  710.335030]        [<c00000000040b868>] sysfs_remove_dir+0x78/0xd0
[  710.335098]        [<c0000000005eccec>] kobject_del+0x2c/0x80
[  710.335156]        [<c0000000005ec9e8>] kobject_release+0xa8/0x250
[  710.335224]        [<c000000000ad7ac8>] cpufreq_governor_dbs+0x6f8/0x7e0
[  710.335292]        [<c000000000ad4a7c>] od_cpufreq_governor_dbs+0x3c/0x60
[  710.335361]        [<c000000000acf7c4>] __cpufreq_governor+0x164/0x300
[  710.335429]        [<c000000000ad0600>] cpufreq_set_policy+0x3b0/0x450
[  710.335497]        [<c000000000ad117c>] store_scaling_governor+0x8c/0xf0
[  710.335565]        [<c000000000aced34>] store+0xb4/0x110
[  710.335622]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[  710.335679]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[  710.335747]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[  710.335803]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[  710.335861]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[  710.335918]        [<c00000000000926c>] system_call+0x38/0xd0
[  710.335993]
other info that might help us debug this:

[  710.336130] Chain exists of:
  s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex

[  710.336376]  Possible unsafe locking scenario:

[  710.336488]        CPU0                    CPU1
[  710.336577]        ----                    ----
[  710.336666]   lock(od_dbs_cdata.mutex);
[  710.336778]                                lock(&dbs_data->mutex);
[  710.336911]                                lock(od_dbs_cdata.mutex);
[  710.337064]   lock(s_active#91);
[  710.337176]
 *** DEADLOCK ***

[  710.337289] 6 locks held by runme.sh/2476:
[  710.337355]  #0:  (sb_writers#6){.+.+.+}, at: [<c00000000034cf10>]
__sb_start_write+0x120/0x150
[  710.337600]  #1:  (&of->mutex){+.+.+.}, at: [<c00000000040939c>]
kernfs_fop_write+0x7c/0x1f0
[  710.337824]  #2:  (s_active#82){.+.+.+}, at: [<c0000000004093a8>]
kernfs_fop_write+0x88/0x1f0
[  710.338070]  #3:  (cpu_hotplug.lock){++++++}, at: [<c0000000000e06d8>]
get_online_cpus+0x48/0xc0
[  710.338276]  #4:  (&policy->rwsem){+++++.}, at: [<c000000000aced04>]
store+0x84/0x110
[  710.338476]  #5:  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7434>]
cpufreq_governor_dbs+0x64/0x7e0
[  710.338722]
stack backtrace:
[  710.338813] CPU: 0 PID: 2476 Comm: runme.sh Not tainted 4.5.0-rc2-sgb+ #104
[  710.338929] Call Trace:
[  710.338978] [c000005fd40eaec0] [c000000000d563d0] dump_stack+0x90/0xbc
(unreliable)
[  710.339138] [c000005fd40eaef0] [c00000000015884c] print_circular_bug+0x28c/0x3e0
[  710.339295] [c000005fd40eaf90] [c00000000015ed88] __lock_acquire+0x2278/0x22d0
[  710.339455] [c000005fd40eb120] [c00000000015f318] lock_acquire+0xd8/0x1a0
[  710.339589] [c000005fd40eb1e0] [c0000000004065f4] __kernfs_remove+0x344/0x410
[  710.339724] [c000005fd40eb2e0] [c000000000407db8] kernfs_remove+0x48/0x70
[  710.339859] [c000005fd40eb310] [c00000000040b868] sysfs_remove_dir+0x78/0xd0
[  710.339993] [c000005fd40eb350] [c0000000005eccec] kobject_del+0x2c/0x80
[  710.340128] [c000005fd40eb380] [c0000000005ec9e8] kobject_release+0xa8/0x250
[  710.340265] [c000005fd40eb410] [c000000000ad7ac8]
cpufreq_governor_dbs+0x6f8/0x7e0
[  710.340423] [c000005fd40eb4c0] [c000000000ad4a7c]
od_cpufreq_governor_dbs+0x3c/0x60
[  710.340561] [c000005fd40eb500] [c000000000acf7c4] __cpufreq_governor+0x164/0x300
[  710.340639] [c000005fd40eb580] [c000000000ad0600] cpufreq_set_policy+0x3b0/0x450
[  710.340719] [c000005fd40eb610] [c000000000ad117c]
store_scaling_governor+0x8c/0xf0
[  710.340797] [c000005fd40ebc10] [c000000000aced34] store+0xb4/0x110
[  710.340866] [c000005fd40ebc60] [c00000000040a8a4] sysfs_kf_write+0x94/0xc0
[  710.340934] [c000005fd40ebca0] [c0000000004094a8] kernfs_fop_write+0x188/0x1f0
[  710.341013] [c000005fd40ebcf0] [c000000000347b8c] __vfs_write+0x6c/0x180
[  710.341081] [c000005fd40ebd90] [c0000000003490a0] vfs_write+0xc0/0x200
[  710.341151] [c000005fd40ebde0] [c00000000034a3cc] SyS_write+0x6c/0x110
[  710.341219] [c000005fd40ebe30] [c00000000000926c] system_call+0x38/0xd0

Thanks and Regards,
Shilpa

Viresh Kumar Feb. 5, 2016, 2:49 a.m. UTC | #3
On 05-02-16, 02:20, Shilpasri G Bhat wrote:
> I could see the previous lockdep warnings on pm/next and with the series
> applied up to patch [4]. With patches [5-7] applied, I see the lockdep
> trace below when running './runme' on a Power8 box.

> [  710.336130] Chain exists of:
>   s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex
> 
> [  710.336376]  Possible unsafe locking scenario:
> 
> [  710.336488]        CPU0                    CPU1
> [  710.336577]        ----                    ----
> [  710.336666]   lock(od_dbs_cdata.mutex);
> [  710.336778]                                lock(&dbs_data->mutex);
> [  710.336911]                                lock(od_dbs_cdata.mutex);
> [  710.337064]   lock(s_active#91);
> [  710.337176]

This is the same lockdep splat, just with another mutex
(dbs_data->mutex) added to the chain; a simplified sketch of the
circular ordering follows below.

Have you checked whether all the lockdep reports go away with the 8th
patch that I provided yesterday?
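
For readers following the trace, below is a minimal user-space sketch
(illustrative only, not kernel code) of the ordering cycle being
reported: plain pthread mutexes stand in for the kernfs s_active
reference and the two governor mutexes, and the thread names mirror the
call chains lockdep printed. One path takes od_dbs_cdata.mutex before
touching the sysfs node, while a concurrent tunable write reaches
od_dbs_cdata.mutex with the sysfs node already pinned.

/*
 * Illustrative sketch of the circular lock ordering in the trace above;
 * not kernel code.  Plain mutexes stand in for the kernfs s_active
 * reference and the governor mutexes.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t s_active         = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dbs_data_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t od_dbs_cdata_mtx = PTHREAD_MUTEX_INITIALIZER;

/* store_scaling_governor -> cpufreq_governor_dbs -> kobject_del:
 * holds od_dbs_cdata.mutex, then needs the sysfs node (s_active). */
static void *governor_switch_path(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&od_dbs_cdata_mtx);
	pthread_mutex_lock(&s_active);		/* kernfs_remove() */
	pthread_mutex_unlock(&s_active);
	pthread_mutex_unlock(&od_dbs_cdata_mtx);
	return NULL;
}

/* Writing a governor tunable: kernfs pins s_active for the open file,
 * governor_store() takes dbs_data->mutex, and update_sampling_rate()
 * then takes od_dbs_cdata.mutex - the opposite order to the thread above. */
static void *tunable_store_path(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&s_active);		/* kernfs_fop_write() */
	pthread_mutex_lock(&dbs_data_mutex);	/* governor_store() */
	pthread_mutex_lock(&od_dbs_cdata_mtx);	/* update_sampling_rate() */
	pthread_mutex_unlock(&od_dbs_cdata_mtx);
	pthread_mutex_unlock(&dbs_data_mutex);
	pthread_mutex_unlock(&s_active);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, governor_switch_path, NULL);
	pthread_create(&b, NULL, tunable_store_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Whether a run actually hangs is timing dependent; lockdep flags
	 * the inverted ordering even when no hang occurs. */
	printf("finished (this run happened not to deadlock)\n");
	return 0;
}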

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7bc8a5ed97e5..ac3348ecde7b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1351,7 +1351,7 @@  static void cpufreq_offline(unsigned int cpu)
                                pr_err("%s: Failed to start governor\n", __func__);
                }
 
-               return;
+               goto unlock;
        }
 
        if (cpufreq_driver->stop_cpu)
@@ -1373,6 +1373,8 @@  static void cpufreq_offline(unsigned int cpu)
                cpufreq_driver->exit(policy);
                policy->freq_table = NULL;
        }
+
+unlock:
        up_write(&policy->rwsem);
 }