[v2,1/3] wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()

Message ID 20221207052500.10855-2-howard-yh.hsu@mediatek.com (mailing list archive)
State Changes Requested
Delegated to: Felix Fietkau
Series wifi: mt76: mt7915: rework thermal protection

Commit Message

Howard Hsu Dec. 7, 2022, 5:24 a.m. UTC
This patch makes three changes:
1. Allow the maximum throttle state to be set to 100, fixing the problem
that thermal_protect_disable can never be triggered.
2. Drop the requirement that the throttle state differ from the previous
state. That requirement makes it impossible for users to change only the
trigger/restore temp without also changing the throttle state.
3. Add dev_err so that invalid settings are easier to spot in dmesg.

Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Howard Hsu <howard-yh.hsu@mediatek.com>
---
 .../net/wireless/mediatek/mt76/mt7915/init.c   | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Comments

Nicolas Cavallari Dec. 7, 2022, 8:15 a.m. UTC | #1
On 07/12/2022 06:24, Howard Hsu wrote:
> This patch includes 3 changes:
> 1. The maximum throttle state can be set to 100 to fix the problem that
> thermal_protect_disable can never be triggered.

You are modifying the cooling_device part.  The cooling_device is 
explicitly configured to have a max state of MT7915_CDEV_THROTTLE_MAX 
(=99), so the thermal subsystem will probably prevent 
mt7915_thermal_set_cur_throttle_state from being called with a higher 
value.  It will also probably complain if get_cur_state starts returning 
values above MT7915_CDEV_THROTTLE_MAX.

And, as the comment below indicates, the thermal subsystem expects that a 
higher state provides more cooling.  So if 99 means "maximum cooling", 
100 cannot mean "disable cooling".
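
To make the arithmetic concrete, here is a minimal stand-alone sketch of the state mapping used in the patched function. It is not driver code: state_to_throttling() and check_monotonic() are hypothetical names, and the two constants are simply the values quoted in this thread (99 and 100).

```c
#include <assert.h>
#include <errno.h>

/* Values as quoted in this thread; treated as assumptions here. */
#define MT7915_CDEV_THROTTLE_MAX	99
#define MT7915_THERMAL_THROTTLE_MAX	100

/*
 * Hypothetical stand-alone helper (not a driver function): mirrors the
 * "throttling = MT7915_THERMAL_THROTTLE_MAX - state" line from the patch,
 * keeping the original bound on state.
 */
static int state_to_throttling(unsigned long state)
{
	if (state > MT7915_CDEV_THROTTLE_MAX)
		return -EINVAL;

	return MT7915_THERMAL_THROTTLE_MAX - (int)state;
}

/* Check that every step up in state yields a strictly smaller value. */
static void check_monotonic(void)
{
	unsigned long s;

	for (s = 1; s <= MT7915_CDEV_THROTTLE_MAX; s++)
		assert(state_to_throttling(s) < state_to_throttling(s - 1));
}
```

With the original bound, the value handed on falls strictly as the state rises, which matches "more = more cooling". Giving state 100 a separate "disable" meaning would put an exception at the top of that ordering.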

Also, last time I tried, thermal_protect_disable didn't work; it didn't 
disable anything, the previous thermal throttle kept being applied. 
Maybe a new firmware fixed this, but the kernel cannot simply expect the 
firmware to be up to date.

> 2. Throttle state do not need to be different from the previous state.
> This will make it is impossible for users to just change the
> trigger/restore temp but not the throttle state.

The throttle state is mostly set by the kernel's thermal governor and 
the user has only very little control over it.  The thermal governor 
runs every X seconds and will change the state if it thinks it is too 
low or too high.

The default step_wise governor will aggressively set it to zero if the 
system isn't overheating, for example.

> 3. Add dev_err so that it is easier to see invalid setting while looking at dmesg.
> 
> Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
> Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
> Signed-off-by: Howard Hsu <howard-yh.hsu@mediatek.com>
> ---
>   .../net/wireless/mediatek/mt76/mt7915/init.c   | 18 ++++++++++--------
>   1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> index c810c31fbd6e..abeecf15f1c8 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
>   	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
>   	int ret;
>   
> -	if (state > MT7915_CDEV_THROTTLE_MAX)
> +	if (state > MT7915_THERMAL_THROTTLE_MAX) {
> +		dev_err(phy->dev->mt76.dev,
> +			"please specify a valid throttling state\n");
>   		return -EINVAL;
> +	}
>   
> -	if (phy->throttle_temp[0] > phy->throttle_temp[1])
> -		return 0;
> -
> -	if (state == phy->cdev_state)
> -		return 0;
> +	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> +		dev_err(phy->dev->mt76.dev,
> +			"temp1_crit shall not be greater than temp1_max\n");
> +		return -EINVAL;
> +	}
>   
>   	/*
>   	 * cooling_device convention: 0 = no cooling, more = more cooling
            ^^^^^^^^^^^^^^^^^^^^^^^^^
Howard Hsu Dec. 8, 2022, 12:44 p.m. UTC | #2
On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
> On 07/12/2022 06:24, Howard Hsu wrote:
> > This patch includes 3 changes:
> > 1. The maximum throttle state can be set to 100 to fix the problem
> > that
> > thermal_protect_disable can never be triggered.
> 
> You are modifying the cooling_device part.  The cooling_device is 
> explicitly configured to have a max state of
> MT7915_CDEV_THROTTLE_MAX 
> (=99), so the thermal subsystem will probably prevent 
> mt7915_thermal_set_cur_throttle_state from being called with a
> higher 
> value.  It will also probably complain if get_cur_state starts
> returning 
> values above MT7915_CDEV_THROTTLE_MAX.
> 
> And, as the comment below indicates, the thermal subsystem expect
> that a 
> higher state provide more cooling.  So if 99 means "maximum
> cooling", 
> 100 cannot mean "disable cooling".
> 
> Also, last time I tried, thermal_protect_disable didn't work; It
> didn't 
> disable anything, the previous thermal throttle kept being applied. 
> Maybe a new firmware fixed this, but the kernel cannot simply expect
> the 
> firmware to be up to date.
> 

Thanks for your comments. Let me give an example to check that I
understand them correctly.

1. The current cooling state of the cooling device is 50 (cur_state =
50).
2. The cooling state is set to 100 for "disable cooling".
3. The thermal subsystem decides to decrease the state because the rest
of the system is cooler, and it adjusts the state downward from
cur_state, which is 100. For example, the thermal subsystem sets
cur_state to 90. But this obviously makes performance worse than at
step 1, even though the system is cooler. A design in which 100 means
"disable cooling" will confuse the thermal governor.

Let me know if I have misunderstood anything. I will remove the first
change from this patch.
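
The three steps above can be sketched as a small model. This is only an illustration under the assumption that state 100 would switch protection off entirely; effective_cooling() and step_down_is_safe() are hypothetical names, not driver or thermal-core functions.

```c
#include <assert.h>
#include <stdbool.h>

#define STATE_DISABLE	100	/* hypothetical: the proposed "disable" state */

/* Hypothetical model: how much cooling a given state would really apply. */
static unsigned int effective_cooling(unsigned long state)
{
	assert(state <= STATE_DISABLE);
	return state == STATE_DISABLE ? 0 : (unsigned int)state;
}

/* True if lowering the state from 'from' to 'to' never adds cooling. */
static bool step_down_is_safe(unsigned long from, unsigned long to)
{
	assert(to <= from);
	return effective_cooling(to) <= effective_cooling(from);
}
```

In this model step_down_is_safe(50, 40) holds, but step_down_is_safe(100, 90) does not: a governor that steps down from 100 to 90 ends up with more throttling than it had at state 50.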

> > 2. Throttle state do not need to be different from the previous
> > state.
> > This will make it is impossible for users to just change the
> > trigger/restore temp but not the throttle state.
> 
> The throttle state is mostly set by the kernel's thermal governor
> and 
> the user has only very little control over it.  The thermal governor 
> runs every X seconds and will change the state if it thinks it is
> too 
> low or too high.
> 
> The default step_wise governor will aggressively set it to zero if
> the 
> system isn't overheating, for example.
> 

I don't think your comment conflicts with the second change. If we keep
the check that the new cooling state must differ from the previous one,
it will bother users who only want to change temp1_crit but not
cur_state. It is unreasonable that, for a new temp1_crit to take effect
in the firmware, the user must also set a different cooling state.

 
> > 3. Add dev_err so that it is easier to see invalid setting while
> > looking at dmesg.
> > 
> > Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after
> > temporary overheat")
> > Co-developed-by: Ryder Lee <ryder.lee@mediatek.com>
> > Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
> > Signed-off-by: Howard Hsu <howard-yh.hsu@mediatek.com>
> > ---
> >   .../net/wireless/mediatek/mt76/mt7915/init.c   | 18 ++++++++++---
> > -----
> >   1 file changed, 10 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > index c810c31fbd6e..abeecf15f1c8 100644
> > --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct
> > thermal_cooling_device *cdev,
> >   	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
> >   	int ret;
> >   
> > -	if (state > MT7915_CDEV_THROTTLE_MAX)
> > +	if (state > MT7915_THERMAL_THROTTLE_MAX) {
> > +		dev_err(phy->dev->mt76.dev,
> > +			"please specify a valid throttling state\n");
> >   		return -EINVAL;
> > +	}
> >   
> > -	if (phy->throttle_temp[0] > phy->throttle_temp[1])
> > -		return 0;
> > -
> > -	if (state == phy->cdev_state)
> > -		return 0;
> > +	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> > +		dev_err(phy->dev->mt76.dev,
> > +			"temp1_crit shall not be greater than
> > temp1_max\n");
> > +		return -EINVAL;
> > +	}
> >   
> >   	/*
> >   	 * cooling_device convention: 0 = no cooling, more = more
> > cooling
> 
>             ^^^^^^^^^^^^^^^^^^^^^^^^^
>
Nicolas Cavallari Dec. 8, 2022, 4:30 p.m. UTC | #3
On 08/12/2022 13:44, Howard-YH Hsu (許育豪) wrote:
> On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
>> On 07/12/2022 06:24, Howard Hsu wrote:
>> > This patch includes 3 changes:
>> > 1. The maximum throttle state can be set to 100 to fix the problem
>> > that
>> > thermal_protect_disable can never be triggered.
>> 
>> You are modifying the cooling_device part.  The cooling_device is 
>> explicitly configured to have a max state of
>> MT7915_CDEV_THROTTLE_MAX 
>> (=99), so the thermal subsystem will probably prevent 
>> mt7915_thermal_set_cur_throttle_state from being called with a
>> higher 
>> value.  It will also probably complain if get_cur_state starts
>> returning 
>> values above MT7915_CDEV_THROTTLE_MAX.
>> 
>> And, as the comment below indicates, the thermal subsystem expect
>> that a 
>> higher state provide more cooling.  So if 99 means "maximum
>> cooling", 
>> 100 cannot mean "disable cooling".
>> 
>> Also, last time I tried, thermal_protect_disable didn't work; It
>> didn't 
>> disable anything, the previous thermal throttle kept being applied. 
>> Maybe a new firmware fixed this, but the kernel cannot simply expect
>> the 
>> firmware to be up to date.
>> 
> 
> Thanks for your comments. Let me give you an example to confirm with
> you if I understand your comments correctly.
> 
> 1. The current cooling state of the cooling device is 50 (cur_state =
> 50).
> 2. The cooling state is set to 100 for "disable cooling".
> 3. The thermal subsystem decides to decrease state because the rest of
> system is cooler. And then it will adjust it downward based on
> cur_state, which is 100. For example, thermal subsytem set cur_state to
> 90. But obviously this will make the performance worse than at step 1,
> even though the system is cooler. The design for 100 mean "disable
> cooling" will mess up the thermal governor.
> 
> Let me know if there is any misunderstanding. And I will remove the
> first change of this patch.

This is pretty much my second point.

The other case is if the system is overheating a lot and the kernel bumps the 
state from e.g. 80 to 100, then the system should not disable throttling.

> 
>> > 2. Throttle state do not need to be different from the previous
>> > state.
>> > This will make it is impossible for users to just change the
>> > trigger/restore temp but not the throttle state.
>> 
>> The throttle state is mostly set by the kernel's thermal governor
>> and 
>> the user has only very little control over it.  The thermal governor 
>> runs every X seconds and will change the state if it thinks it is
>> too 
>> low or too high.
>> 
>> The default step_wise governor will aggressively set it to zero if
>> the 
>> system isn't overheating, for example.
>> 
> 
> I don't think there is any conflict between your comment and second
> change. If we keep the check that previous cooling state shall be
> different from the new cooling state, this will bother users who only
> wants to change the temp1_crit but not the cur_state. It is
> unreasonable for the user, if they wants the new temp1_crit to take
> effect in the firmware, they must set a differnt cooling state.

The point I wanted to make is that the kernel sets the throttle state a lot, and 
I assumed that it could also do so even when the state does not change, which 
would send unnecessary MCU commands. But that is apparently not the case.

Also, if the user changes temp1_crit or temp1_max, the changes should probably 
be applied immediately instead of expecting the user to change the state afterward.
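
One way to picture that suggestion is the simplified model below. It is a sketch under stated assumptions, not the driver's implementation: struct phy_model, push_thermal_config(), set_throttle_temp() and set_cur_state() are all hypothetical, and the firmware update is replaced by a counter. The index convention ([0] = temp1_crit, [1] = temp1_max) follows the error message in the patch.

```c
#include <assert.h>
#include <errno.h>

#define CDEV_THROTTLE_MAX	99	/* as quoted earlier in the thread */

/* Hypothetical, simplified stand-in for the driver's per-phy state. */
struct phy_model {
	int throttle_temp[2];		/* [0] = temp1_crit, [1] = temp1_max */
	unsigned long cdev_state;
	unsigned int mcu_cmds;		/* counts simulated firmware commands */
};

/* Stub for the firmware update; a real driver would send an MCU command. */
static int push_thermal_config(struct phy_model *phy)
{
	phy->mcu_cmds++;
	return 0;
}

/* Store a new threshold and apply it right away. */
static int set_throttle_temp(struct phy_model *phy, int idx, int val)
{
	int crit, max;

	assert(idx == 0 || idx == 1);
	crit = idx == 0 ? val : phy->throttle_temp[0];
	max = idx == 1 ? val : phy->throttle_temp[1];
	if (crit > max)
		return -EINVAL;

	phy->throttle_temp[idx] = val;
	return push_thermal_config(phy);
}

/* State setter that can keep skipping unchanged states. */
static int set_cur_state(struct phy_model *phy, unsigned long state)
{
	if (state > CDEV_THROTTLE_MAX)
		return -EINVAL;
	if (state == phy->cdev_state)
		return 0;

	phy->cdev_state = state;
	return push_thermal_config(phy);
}
```

In this arrangement a threshold write reaches the firmware without touching the cooling state, so the "state == cdev_state" shortcut could stay in the state setter and still avoid redundant commands.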

Patch

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index c810c31fbd6e..abeecf15f1c8 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -131,14 +131,17 @@  mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
 	u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
 	int ret;
 
-	if (state > MT7915_CDEV_THROTTLE_MAX)
+	if (state > MT7915_THERMAL_THROTTLE_MAX) {
+		dev_err(phy->dev->mt76.dev,
+			"please specify a valid throttling state\n");
 		return -EINVAL;
+	}
 
-	if (phy->throttle_temp[0] > phy->throttle_temp[1])
-		return 0;
-
-	if (state == phy->cdev_state)
-		return 0;
+	if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
+		dev_err(phy->dev->mt76.dev,
+			"temp1_crit shall not be greater than temp1_max\n");
+		return -EINVAL;
+	}
 
 	/*
 	 * cooling_device convention: 0 = no cooling, more = more cooling
@@ -164,7 +167,7 @@  static void mt7915_unregister_thermal(struct mt7915_phy *phy)
 	struct wiphy *wiphy = phy->mt76->hw->wiphy;
 
 	if (!phy->cdev)
-	    return;
+		return;
 
 	sysfs_remove_link(&wiphy->dev.kobj, "cooling_device");
 	thermal_cooling_device_unregister(phy->cdev);
@@ -1101,7 +1104,6 @@  static void mt7915_stop_hardware(struct mt7915_dev *dev)
 		mt7986_wmac_disable(dev);
 }
 
-
 int mt7915_register_device(struct mt7915_dev *dev)
 {
 	struct ieee80211_hw *hw = mt76_hw(dev);