Message ID | 20231218162348.69101-1-bo.ye@mediatek.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | thermal: fix race condition in suspend/resume | expand |
On Mon, Dec 18, 2023 at 5:24 PM Bo Ye <bo.ye@mediatek.com> wrote: > > From: "yugang.wang" <yugang.wang@mediatek.com> > > Firstly, it needs to be clarified that this issue occurs in a real- > world environment. By analyzing the logs, we inferred that the > issue occurred just as the system was entering suspend mode, and the user > was switching the thermal policy (this action causes all thermal zones > to unregister/register). In addition, we conducted degradation tests > and also reproduced this issue. The specific method is to first switch > the thermal policy through a command, and then immediately put the > system into suspend state through another command. This method can also > reproduce the issue. > > Body: > This patch fixes a race condition during system resume. It occurs > if the system is exiting a suspend state and a user is trying to > register/unregister a thermal zone concurrently. The root cause is > that both actions access the `thermal_tz_list`. > > In detail: > > 1. At PM_POST_SUSPEND during the resume, the system reads all > thermal > zones in `thermal_tz_list`, then resets and updates their > temperatures. > 2. When registering/unregistering a thermal zone, the > `thermal_tz_list` gets manipulated. > > These two actions might occur concurrently, causing a race condition. > To solve this issue, we introduce a mutex lock to protect > `thermal_tz_list` from being modified while it's being read and > updated during the resume from suspend. > > Kernel oops excerpt related to this fix: > > [ 5201.869845] [T316822] pc: [0xffffffeb7d4876f0] > mutex_lock+0x34/0x170 > [ 5201.869856] [T316822] lr: [0xffffffeb7ca98a84] > thermal_pm_notify+0xd4/0x26c > [... cut for brevity ...] > [ 5201.871061] [T316822] suspend_prepare+0x150/0x470 > [ 5201.871067] [T316822] enter_state+0x84/0x6f4 > [ 5201.871076] [T316822] state_store+0x15c/0x1e8 > > 3.Enable thermal policy operation will unregister/register all thermal zones > 10-21 06:13:59.280 854 922 I libMtcLoader: enable thermal policy thermal_policy_09. > > 4.System suspend entry time is 2023-10-20 22:13:59.242 > [ 4106.364175][T609387] binder:534_2: [name:spm&][SPM] PM: suspend entry 2023-10-20 22:13:59.242898243 UTC > [ 4106.366185][T609387] binder:534_2: PM: [name:wakeup&]PM: Pending Wakeup Sources: NETLINK > > 5. It can be proven that the absence of a switch strategy will perform > unregister/register operations on thermal zones (android time is 2023-10-20 22:13:59.282), > Because the logs for other thermal zones switching are not enabled by > default, we cannot see the logs related to other thermal zones. > [ 4106.404167][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK > [ 4106.404215][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK > [ 4106.404225][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK > [ 4106.404504][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 0 > [ 4106.404545][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 2 > [ 4106.404566][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 1 > > 6. thermal_pm_notify trigger KE(android time: 2023-10-20 22:13:59.315894) > [ 4106.437171][T209387] binder:534_2: [name:mrdump&]Kernel Offset:0x289cc80000 from 0xffffffc008000000 > [ 4106.437182][T209387] binder:534_2: [name:mrdump&]PHYS_OFFSET:0x40000000 > [ 4106.437191][T209387] binder:534_2: [name:mrdump&]pstate: 80400005(Nzcv daif +PAN -UAO) > [ 4106.437204][T209387] binder:534_2: [name:mrdump&]pc :[0xffffffe8a6688200] mutex_lock+0x34/0x184 > [ 4106.437214][T209387] binder:534_2: [name:mrdump&]lr :[0xffffffe8a5ce66bc] thermal_pm_notify+0xd4/0x26c > [ 4106.437220][T209387] binder:534_2: [name:mrdump&]sp :ffffffc01bab3ae0 > [ 4106.437226][T209387] binder:534_2: [name:mrdump&]x29:ffffffc01bab3af0 x28: 0000000000000001 > > Signed-off-by: Yugang Wang <yugang.wang@mediatek.com> > Signed-off-by: Bo Ye <bo.ye@mediatek.com> > --- > drivers/thermal/thermal_core.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index 9c17d35ccbbd..73d6b820c8b5 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -1520,12 +1520,14 @@ static int thermal_pm_notify(struct notifier_block *nb, > case PM_POST_HIBERNATION: > case PM_POST_RESTORE: > case PM_POST_SUSPEND: > + mutex_lock(&thermal_list_lock); > atomic_set(&in_suspend, 0); > list_for_each_entry(tz, &thermal_tz_list, node) { > thermal_zone_device_init(tz); > thermal_zone_device_update(tz, > THERMAL_EVENT_UNSPECIFIED); > } > + mutex_unlock(&thermal_list_lock); This isn't sufficient. I have a patch fixing this more completely, will post it later today if time permits. > break; > default: > break; > -- Thanks!
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index 9c17d35ccbbd..73d6b820c8b5 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -1520,12 +1520,14 @@ static int thermal_pm_notify(struct notifier_block *nb, case PM_POST_HIBERNATION: case PM_POST_RESTORE: case PM_POST_SUSPEND: + mutex_lock(&thermal_list_lock); atomic_set(&in_suspend, 0); list_for_each_entry(tz, &thermal_tz_list, node) { thermal_zone_device_init(tz); thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED); } + mutex_unlock(&thermal_list_lock); break; default: break;