Message ID | 20221207052500.10855-2-howard-yh.hsu@mediatek.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Felix Fietkau |
Series | wifi: mt76: mt7915: rework thermal protection |
On 07/12/2022 06:24, Howard Hsu wrote:
> This patch includes 3 changes:
> 1. The maximum throttle state can be set to 100 to fix the problem that
>    thermal_protect_disable can never be triggered.

You are modifying the cooling_device part. The cooling_device is
explicitly configured to have a max state of MT7915_CDEV_THROTTLE_MAX
(=99), so the thermal subsystem will probably prevent
mt7915_thermal_set_cur_throttle_state from being called with a higher
value. It will also probably complain if get_cur_state starts returning
values above MT7915_CDEV_THROTTLE_MAX.

And, as the comment below indicates, the thermal subsystem expects a
higher state to provide more cooling. So if 99 means "maximum cooling",
100 cannot mean "disable cooling".

Also, last time I tried, thermal_protect_disable didn't work: it didn't
disable anything, and the previous thermal throttle kept being applied.
Maybe a new firmware fixed this, but the kernel cannot simply expect the
firmware to be up to date.

> 2. Throttle state do not need to be different from the previous state.
>    This will make it is impossible for users to just change the
>    trigger/restore temp but not the throttle state.

The throttle state is mostly set by the kernel's thermal governor, and
the user has very little control over it. The thermal governor runs
every X seconds and will change the state if it thinks it is too low or
too high.

The default step_wise governor will aggressively set it to zero if the
system isn't overheating, for example.

> 3. Add dev_err so that it is easier to see invalid setting while looking at dmesg.
>
> Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
> Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
> Signed-off-by: Howard Hsu <howard-yh.hsu@mediatek.com>
> ---
>  .../net/wireless/mediatek/mt76/mt7915/init.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> index c810c31fbd6e..abeecf15f1c8 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
>  	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
>  	int ret;
>
> -	if (state > MT7915_CDEV_THROTTLE_MAX)
> +	if (state > MT7915_THERMAL_THROTTLE_MAX) {
> +		dev_err(phy->dev->mt76.dev,
> +			"please specify a valid throttling state\n");
>  		return -EINVAL;
> +	}
>
> -	if (phy->throttle_temp[0] > phy->throttle_temp[1])
> -		return 0;
> -
> -	if (state == phy->cdev_state)
> -		return 0;
> +	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> +		dev_err(phy->dev->mt76.dev,
> +			"temp1_crit shall not be greater than temp1_max\n");
> +		return -EINVAL;
> +	}
>
>  	/*
>  	 * cooling_device convention: 0 = no cooling, more = more cooling
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^
On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
> On 07/12/2022 06:24, Howard Hsu wrote:
> > This patch includes 3 changes:
> > 1. The maximum throttle state can be set to 100 to fix the problem that
> >    thermal_protect_disable can never be triggered.
>
> You are modifying the cooling_device part. The cooling_device is
> explicitly configured to have a max state of MT7915_CDEV_THROTTLE_MAX
> (=99), so the thermal subsystem will probably prevent
> mt7915_thermal_set_cur_throttle_state from being called with a higher
> value. It will also probably complain if get_cur_state starts returning
> values above MT7915_CDEV_THROTTLE_MAX.
>
> And, as the comment below indicates, the thermal subsystem expect that a
> higher state provide more cooling. So if 99 means "maximum cooling",
> 100 cannot mean "disable cooling".
>
> Also, last time I tried, thermal_protect_disable didn't work; It didn't
> disable anything, the previous thermal throttle kept being applied.
> Maybe a new firmware fixed this, but the kernel cannot simply expect the
> firmware to be up to date.
>

Thanks for your comments. Let me give an example to confirm that I
understand them correctly.

1. The current cooling state of the cooling device is 50 (cur_state = 50).
2. The cooling state is set to 100 for "disable cooling".
3. The thermal subsystem decides to decrease the state because the rest
   of the system is cooler. It adjusts the state downward from cur_state,
   which is 100. For example, the thermal subsystem sets cur_state to 90.
   This obviously makes performance worse than at step 1, even though the
   system is cooler.

A design in which 100 means "disable cooling" would therefore confuse the
thermal governor.

Let me know if I have misunderstood anything. I will remove the first
change from this patch.

> > 2. Throttle state do not need to be different from the previous state.
> >    This will make it is impossible for users to just change the
> >    trigger/restore temp but not the throttle state.
>
> The throttle state is mostly set by the kernel's thermal governor and
> the user has only very little control over it. The thermal governor
> runs every X seconds and will change the state if it thinks it is too
> low or too high.
>
> The default step_wise governor will aggressively set it to zero if the
> system isn't overheating, for example.
>

I don't think your comment conflicts with the second change. If we keep
the check that the new cooling state must differ from the previous one,
it will bother users who only want to change temp1_crit but not
cur_state. It is unreasonable that users must set a different cooling
state before a new temp1_crit takes effect in the firmware.

> > 3. Add dev_err so that it is easier to see invalid setting while
> >    looking at dmesg.
> >
> > Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
> > Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
> > Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
> > Signed-off-by: Howard Hsu <howard-yh.hsu@mediatek.com>
> > ---
> >  .../net/wireless/mediatek/mt76/mt7915/init.c | 18 ++++++++++--------
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > index c810c31fbd6e..abeecf15f1c8 100644
> > --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
> >  	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
> >  	int ret;
> >
> > -	if (state > MT7915_CDEV_THROTTLE_MAX)
> > +	if (state > MT7915_THERMAL_THROTTLE_MAX) {
> > +		dev_err(phy->dev->mt76.dev,
> > +			"please specify a valid throttling state\n");
> >  		return -EINVAL;
> > +	}
> >
> > -	if (phy->throttle_temp[0] > phy->throttle_temp[1])
> > -		return 0;
> > -
> > -	if (state == phy->cdev_state)
> > -		return 0;
> > +	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> > +		dev_err(phy->dev->mt76.dev,
> > +			"temp1_crit shall not be greater than temp1_max\n");
> > +		return -EINVAL;
> > +	}
> >
> >  	/*
> >  	 * cooling_device convention: 0 = no cooling, more = more cooling
>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
>
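The 50 -> 100 -> 90 scenario above can be sketched as a small self-contained model. It is an illustration only, not driver code: the function names are hypothetical, the value 100 for the throttle maximum is inferred from the patch description, and reading the value `MAX - state` as a rough "allowed performance" percentage is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* 99 is the cooling device max state quoted in the thread.
 * 100 is an assumption inferred from the patch description. */
#define CDEV_THROTTLE_MAX	99
#define THERMAL_THROTTLE_MAX	100

/* Existing mapping from the patch context: throttling = MAX - state.
 * Modeled here (assumption) as the allowed performance in percent. */
static int perf_current(unsigned long state)
{
	return THERMAL_THROTTLE_MAX - (int)state;
}

/* Hypothetical model of the proposal: state 100 disables protection,
 * which means full performance again. */
static int perf_proposed(unsigned long state)
{
	if (state == THERMAL_THROTTLE_MAX)
		return THERMAL_THROTTLE_MAX;
	return THERMAL_THROTTLE_MAX - (int)state;
}

/* A cooling device should never cool less when its state goes up,
 * i.e. allowed performance must not increase with the state. */
static bool cooling_is_monotonic(int (*perf)(unsigned long),
				 unsigned long max_state)
{
	unsigned long s;

	for (s = 1; s <= max_state; s++)
		if (perf(s) > perf(s - 1))
			return false;
	return true;
}
```

In this model the existing mapping is monotonic over 0..99, while the proposed one is not over 0..100: a governor stepping down from 100 to 90 would move from full performance to 10%, which is lower than the 50% it had at state 50.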
On 08/12/2022 13:44, Howard-YH Hsu (許育豪) wrote:
> On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
>> On 07/12/2022 06:24, Howard Hsu wrote:
>> > This patch includes 3 changes:
>> > 1. The maximum throttle state can be set to 100 to fix the problem that
>> >    thermal_protect_disable can never be triggered.
>>
>> You are modifying the cooling_device part. The cooling_device is
>> explicitly configured to have a max state of MT7915_CDEV_THROTTLE_MAX
>> (=99), so the thermal subsystem will probably prevent
>> mt7915_thermal_set_cur_throttle_state from being called with a higher
>> value. It will also probably complain if get_cur_state starts returning
>> values above MT7915_CDEV_THROTTLE_MAX.
>>
>> And, as the comment below indicates, the thermal subsystem expect that a
>> higher state provide more cooling. So if 99 means "maximum cooling",
>> 100 cannot mean "disable cooling".
>>
>> Also, last time I tried, thermal_protect_disable didn't work; It didn't
>> disable anything, the previous thermal throttle kept being applied.
>> Maybe a new firmware fixed this, but the kernel cannot simply expect the
>> firmware to be up to date.
>>
>
> Thanks for your comments. Let me give you an example to confirm with
> you if I understand your comments correctly.
>
> 1. The current cooling state of the cooling device is 50 (cur_state = 50).
> 2. The cooling state is set to 100 for "disable cooling".
> 3. The thermal subsystem decides to decrease state because the rest of
> system is cooler. And then it will adjust it downward based on
> cur_state, which is 100. For example, thermal subsystem set cur_state to
> 90. But obviously this will make the performance worse than at step 1,
> even though the system is cooler. The design for 100 mean "disable
> cooling" will mess up the thermal governor.
>
> Let me know if there is any misunderstanding. And I will remove the
> first change of this patch.

This is pretty much my second point. The other case is when the system
is overheating a lot and the kernel bumps the state from, e.g., 80 to
100: the system should not disable throttling at that point.

>
>> > 2. Throttle state do not need to be different from the previous state.
>> >    This will make it is impossible for users to just change the
>> >    trigger/restore temp but not the throttle state.
>>
>> The throttle state is mostly set by the kernel's thermal governor and
>> the user has only very little control over it. The thermal governor
>> runs every X seconds and will change the state if it thinks it is too
>> low or too high.
>>
>> The default step_wise governor will aggressively set it to zero if the
>> system isn't overheating, for example.
>>
>
> I don't think there is any conflict between your comment and second
> change. If we keep the check that previous cooling state shall be
> different from the new cooling state, this will bother users who only
> wants to change the temp1_crit but not the cur_state. It is
> unreasonable for the user, if they wants the new temp1_crit to take
> effect in the firmware, they must set a different cooling state.

The point I wanted to make is that the kernel sets the throttle state a
lot, and I assumed that it could also do so even when the state does not
change, which would send unnecessary MCU commands. But that is apparently
not the case.

Also, if the user changes temp1_crit or temp1_max, the changes should
probably be applied immediately instead of expecting the user to change
the state afterward.
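One possible reading of the last two points is sketched below as a standalone model, not as a proposed patch. Only `throttle_temp` and `cdev_state` are names taken from the diff; `struct phy_model`, `push_to_firmware`, `set_cur_state` and `set_throttle_temp` are hypothetical, and the MCU command is replaced by a counter. The sketch assumes index 0 holds temp1_crit and index 1 holds temp1_max, as the error message in the patch suggests.

```c
#include <assert.h>
#include <errno.h>

#define CDEV_THROTTLE_MAX	99	/* max cooling state quoted in the thread */

/* Hypothetical stand-in for the per-phy thermal state. */
struct phy_model {
	int throttle_temp[2];		/* [0] = temp1_crit, [1] = temp1_max (assumed) */
	unsigned long cdev_state;
	int mcu_cmds;			/* counts simulated firmware commands */
};

/* Stand-in for sending the thermal settings to the firmware. */
static void push_to_firmware(struct phy_model *phy)
{
	phy->mcu_cmds++;
}

/* State changes: reject out-of-range states and skip the firmware
 * command when the state is unchanged. */
static int set_cur_state(struct phy_model *phy, unsigned long state)
{
	if (state > CDEV_THROTTLE_MAX)
		return -EINVAL;
	if (state == phy->cdev_state)
		return 0;

	phy->cdev_state = state;
	push_to_firmware(phy);
	return 0;
}

/* Temperature changes: validate crit <= max, then apply immediately
 * instead of waiting for the next state change. */
static int set_throttle_temp(struct phy_model *phy, int idx, int val)
{
	int crit, max;

	if (idx < 0 || idx > 1)
		return -EINVAL;

	crit = (idx == 0) ? val : phy->throttle_temp[0];
	max = (idx == 1) ? val : phy->throttle_temp[1];
	if (crit > max)
		return -EINVAL;

	phy->throttle_temp[idx] = val;
	push_to_firmware(phy);
	return 0;
}
```

With this split, the "state unchanged" shortcut can stay in the cooling device callback, because a new temp1_crit or temp1_max no longer depends on a later state change to reach the firmware.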
diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index c810c31fbd6e..abeecf15f1c8 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
 	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
 	int ret;
 
-	if (state > MT7915_CDEV_THROTTLE_MAX)
+	if (state > MT7915_THERMAL_THROTTLE_MAX) {
+		dev_err(phy->dev->mt76.dev,
+			"please specify a valid throttling state\n");
 		return -EINVAL;
+	}
 
-	if (phy->throttle_temp[0] > phy->throttle_temp[1])
-		return 0;
-
-	if (state == phy->cdev_state)
-		return 0;
+	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
+		dev_err(phy->dev->mt76.dev,
+			"temp1_crit shall not be greater than temp1_max\n");
+		return -EINVAL;
+	}
 
 	/*
 	 * cooling_device convention: 0 = no cooling, more = more cooling
@@ -164,7 +167,7 @@ static void mt7915_unregister_thermal(struct mt7915_phy *phy)
 	struct wiphy *wiphy = phy->mt76->hw->wiphy;
 
 	if (!phy->cdev)
-		return;
+		return;
 
 	sysfs_remove_link(&wiphy->dev.kobj, "cooling_device");
 	thermal_cooling_device_unregister(phy->cdev);
@@ -1101,7 +1104,6 @@ static void mt7915_stop_hardware(struct mt7915_dev *dev)
 		mt7986_wmac_disable(dev);
 }
 
-
 int mt7915_register_device(struct mt7915_dev *dev)
 {
 	struct ieee80211_hw *hw = mt76_hw(dev);