Message ID | 1602176947-17385-1-git-send-email-akhilpo@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] arm64: dts: qcom: sc7180: Add gpu cooling support | expand |
Hi, On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen <akhilpo@codeaurora.org> wrote: > > Add cooling-cells property and the cooling maps for the gpu tzones > to support GPU cooling. > > Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> > --- > arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 ++++++++++++++++++++++------- > 1 file changed, 22 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi b/arch/arm64/boot/dts/qcom/sc7180.dtsi > index d46b383..40d6a28 100644 > --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi > +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi > @@ -2,7 +2,7 @@ > /* > * SC7180 SoC device tree source > * > - * Copyright (c) 2019, The Linux Foundation. All rights reserved. > + * Copyright (c) 2019-20, The Linux Foundation. All rights reserved. > */ > > #include <dt-bindings/clock/qcom,dispcc-sc7180.h> > @@ -1885,6 +1885,7 @@ > iommus = <&adreno_smmu 0>; > operating-points-v2 = <&gpu_opp_table>; > qcom,gmu = <&gmu>; > + #cooling-cells = <2>; Presumably we should add this to the devicetree bindings, too? > interconnects = <&gem_noc MASTER_GFX3D &mc_virt SLAVE_EBI1>; > interconnect-names = "gfx-mem"; > @@ -3825,16 +3826,16 @@ > }; > > gpuss0-thermal { > - polling-delay-passive = <0>; > + polling-delay-passive = <100>; Why did you make this change? I'm pretty sure that we _don't_ want this since we're using interrupts for the thermal sensor. See commit 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in Thermal-zones node"). > polling-delay = <0>; > > thermal-sensors = <&tsens0 13>; > > trips { > gpuss0_alert0: trip-point0 { > - temperature = <90000>; > + temperature = <95000>; > hysteresis = <2000>; > - type = "hot"; > + type = "passive"; Matthias probably knows better, but I wonder if we should be making two passive trip levels like we do with CPU. IIRC this is important if someone wants to be able to use this with IPA.
On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote: > Hi, > > On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen <akhilpo@codeaurora.org> wrote: > > > > Add cooling-cells property and the cooling maps for the gpu tzones > > to support GPU cooling. > > > > Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> > > --- > > arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 ++++++++++++++++++++++------- > > 1 file changed, 22 insertions(+), 7 deletions(-) > > > > diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi b/arch/arm64/boot/dts/qcom/sc7180.dtsi > > index d46b383..40d6a28 100644 > > --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi > > +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi > > @@ -2,7 +2,7 @@ > > /* > > * SC7180 SoC device tree source > > * > > - * Copyright (c) 2019, The Linux Foundation. All rights reserved. > > + * Copyright (c) 2019-20, The Linux Foundation. All rights reserved. > > */ > > > > #include <dt-bindings/clock/qcom,dispcc-sc7180.h> > > @@ -1885,6 +1885,7 @@ > > iommus = <&adreno_smmu 0>; > > operating-points-v2 = <&gpu_opp_table>; > > qcom,gmu = <&gmu>; > > + #cooling-cells = <2>; > > Presumably we should add this to the devicetree bindings, too? > > > > interconnects = <&gem_noc MASTER_GFX3D &mc_virt SLAVE_EBI1>; > > interconnect-names = "gfx-mem"; > > @@ -3825,16 +3826,16 @@ > > }; > > > > gpuss0-thermal { > > - polling-delay-passive = <0>; > > + polling-delay-passive = <100>; > > Why did you make this change? I'm pretty sure that we _don't_ want > this since we're using interrupts for the thermal sensor. See commit > 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in > Thermal-zones node"). I was going to ask the same, this shouldn't be needed. > > polling-delay = <0>; > > > > thermal-sensors = <&tsens0 13>; > > > > trips { > > gpuss0_alert0: trip-point0 { > > - temperature = <90000>; > > + temperature = <95000>; > > hysteresis = <2000>; > > - type = "hot"; > > + type = "passive"; > > Matthias probably knows better, but I wonder if we should be making > two passive trip levels like we do with CPU. IIRC this is important > if someone wants to be able to use this with IPA. Yes, please introduce a second trip point and make both of them 'passive'.
On 10/9/2020 10:27 PM, Matthias Kaehlcke wrote: > On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote: >> Hi, >> >> On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen <akhilpo@codeaurora.org> wrote: >>> >>> Add cooling-cells property and the cooling maps for the gpu tzones >>> to support GPU cooling. >>> >>> Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> >>> --- >>> arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 ++++++++++++++++++++++------- >>> 1 file changed, 22 insertions(+), 7 deletions(-) >>> >>> diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>> index d46b383..40d6a28 100644 >>> --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi >>> +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>> @@ -2,7 +2,7 @@ >>> /* >>> * SC7180 SoC device tree source >>> * >>> - * Copyright (c) 2019, The Linux Foundation. All rights reserved. >>> + * Copyright (c) 2019-20, The Linux Foundation. All rights reserved. >>> */ >>> >>> #include <dt-bindings/clock/qcom,dispcc-sc7180.h> >>> @@ -1885,6 +1885,7 @@ >>> iommus = <&adreno_smmu 0>; >>> operating-points-v2 = <&gpu_opp_table>; >>> qcom,gmu = <&gmu>; >>> + #cooling-cells = <2>; >> >> Presumably we should add this to the devicetree bindings, too? Yes, thanks for catching this. Will update in the next patch. >> >> >>> interconnects = <&gem_noc MASTER_GFX3D &mc_virt SLAVE_EBI1>; >>> interconnect-names = "gfx-mem"; >>> @@ -3825,16 +3826,16 @@ >>> }; >>> >>> gpuss0-thermal { >>> - polling-delay-passive = <0>; >>> + polling-delay-passive = <100>; >> >> Why did you make this change? I'm pretty sure that we _don't_ want >> this since we're using interrupts for the thermal sensor. See commit >> 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in >> Thermal-zones node"). > > I was going to ask the same, this shouldn't be needed. > >>> polling-delay = <0>; >>> >>> thermal-sensors = <&tsens0 13>; >>> >>> trips { >>> gpuss0_alert0: trip-point0 { >>> - temperature = <90000>; >>> + temperature = <95000>; >>> hysteresis = <2000>; >>> - type = "hot"; >>> + type = "passive"; >> >> Matthias probably knows better, but I wonder if we should be making >> two passive trip levels like we do with CPU. IIRC this is important >> if someone wants to be able to use this with IPA. > > Yes, please introduce a second trip point and make both of them > 'passive'. > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel > Adding Manaf here. -Akhil.
On 2020-10-14 18:59, Akhil P Oommen wrote: > On 10/9/2020 10:27 PM, Matthias Kaehlcke wrote: >> On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote: >>> Hi, >>> >>> On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen >>> <akhilpo@codeaurora.org> wrote: >>>> >>>> Add cooling-cells property and the cooling maps for the gpu tzones >>>> to support GPU cooling. >>>> >>>> Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> >>>> --- >>>> arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 >>>> ++++++++++++++++++++++------- >>>> 1 file changed, 22 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>> b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>> index d46b383..40d6a28 100644 >>>> --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>> +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>> @@ -2,7 +2,7 @@ >>>> /* >>>> * SC7180 SoC device tree source >>>> * >>>> - * Copyright (c) 2019, The Linux Foundation. All rights reserved. >>>> + * Copyright (c) 2019-20, The Linux Foundation. All rights >>>> reserved. >>>> */ >>>> >>>> #include <dt-bindings/clock/qcom,dispcc-sc7180.h> >>>> @@ -1885,6 +1885,7 @@ >>>> iommus = <&adreno_smmu 0>; >>>> operating-points-v2 = <&gpu_opp_table>; >>>> qcom,gmu = <&gmu>; >>>> + #cooling-cells = <2>; >>> >>> Presumably we should add this to the devicetree bindings, too? > Yes, thanks for catching this. Will update in the next patch. > >>> >>> >>>> interconnects = <&gem_noc MASTER_GFX3D >>>> &mc_virt SLAVE_EBI1>; >>>> interconnect-names = "gfx-mem"; >>>> @@ -3825,16 +3826,16 @@ >>>> }; >>>> >>>> gpuss0-thermal { >>>> - polling-delay-passive = <0>; >>>> + polling-delay-passive = <100>; >>> >>> Why did you make this change? I'm pretty sure that we _don't_ want >>> this since we're using interrupts for the thermal sensor. See commit >>> 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in >>> Thermal-zones node"). >> >> I was going to ask the same, this shouldn't be needed. As per our understanding unlike "polling-delay", this delay property is intended to activate polling thread on post trip threshold violation and it is irrespective of sensor is capable for trip interrupt or not. This polling is more of governor related. Below are the few references from Documentation/code which tells polling-delay-passive is needed for IPA for better IPA performance. As per Power allocator documentations 1. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/thermal/power_allocator.rst?h=v5.4.71#n264 "The power allocator governor's PID controller works best if there is a periodic tick. If you have a driver that calls `thermal_zone_device_update()` (or anything that ends up calling the governor's `throttle()` function) repetitively, the governor response won't be very good. Note that this is not particular to this governor, step-wise will also misbehave if you call its throttle() faster than the normal thermal framework tick (due to interrupts for example) as it will overreact" 2. In Power allocator code, when switch_on/control trip temp violation, it is enabling passive counter to activate passive polling @ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n634 3. while calculating derivative term, it is using passive_delay @ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n243 4. Sensor interrupt will work if temperature is fluctuating between trip_temp and hysteresis. But say a case where we are not enabling polling-delay-passive. In this case if current temperature > control_temp trip(2nd passive trip) and temperature trend is still raising, then sensor high trip will be disabled (OR configured for critical trip threshold). No more trip interrupt from sensor until it reaches critical trip or falls below control_temp hysteresis. How the governor re-evaluate its next mitigation without passive polling thread here ? I think the same is required for CPU thermal zone as well. >> >>>> polling-delay = <0>; >>>> >>>> thermal-sensors = <&tsens0 13>; >>>> >>>> trips { >>>> gpuss0_alert0: trip-point0 { >>>> - temperature = <90000>; >>>> + temperature = <95000>; >>>> hysteresis = <2000>; >>>> - type = "hot"; >>>> + type = "passive"; >>> >>> Matthias probably knows better, but I wonder if we should be making >>> two passive trip levels like we do with CPU. IIRC this is important >>> if someone wants to be able to use this with IPA. >> >> Yes, please introduce a second trip point and make both of them >> 'passive'. >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/dri-devel >> > Adding Manaf here. > > -Akhil.
Hi, On Thu, Oct 15, 2020 at 12:07:01AM +0530, manafm@codeaurora.org wrote: > On 2020-10-14 18:59, Akhil P Oommen wrote: > > On 10/9/2020 10:27 PM, Matthias Kaehlcke wrote: > > > On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote: > > > > Hi, > > > > > > > > On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen > > > > <akhilpo@codeaurora.org> wrote: > > > > > > > > > > Add cooling-cells property and the cooling maps for the gpu tzones > > > > > to support GPU cooling. > > > > > > > > > > Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> > > > > > --- > > > > > arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 > > > > > ++++++++++++++++++++++------- > > > > > 1 file changed, 22 insertions(+), 7 deletions(-) > > > > > > > > > > diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi > > > > > b/arch/arm64/boot/dts/qcom/sc7180.dtsi > > > > > index d46b383..40d6a28 100644 > > > > > --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi > > > > > +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi > > > > > @@ -2,7 +2,7 @@ > > > > > /* > > > > > * SC7180 SoC device tree source > > > > > * > > > > > - * Copyright (c) 2019, The Linux Foundation. All rights reserved. > > > > > + * Copyright (c) 2019-20, The Linux Foundation. All rights > > > > > reserved. > > > > > */ > > > > > > > > > > #include <dt-bindings/clock/qcom,dispcc-sc7180.h> > > > > > @@ -1885,6 +1885,7 @@ > > > > > iommus = <&adreno_smmu 0>; > > > > > operating-points-v2 = <&gpu_opp_table>; > > > > > qcom,gmu = <&gmu>; > > > > > + #cooling-cells = <2>; > > > > > > > > Presumably we should add this to the devicetree bindings, too? > > Yes, thanks for catching this. Will update in the next patch. > > > > > > > > > > > > > > > interconnects = <&gem_noc > > > > > MASTER_GFX3D &mc_virt SLAVE_EBI1>; > > > > > interconnect-names = "gfx-mem"; > > > > > @@ -3825,16 +3826,16 @@ > > > > > }; > > > > > > > > > > gpuss0-thermal { > > > > > - polling-delay-passive = <0>; > > > > > + polling-delay-passive = <100>; > > > > > > > > Why did you make this change? I'm pretty sure that we _don't_ want > > > > this since we're using interrupts for the thermal sensor. See commit > > > > 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in > > > > Thermal-zones node"). > > > > > > I was going to ask the same, this shouldn't be needed. > As per our understanding unlike "polling-delay", this delay property is > intended to activate polling thread on post trip threshold violation and it > is irrespective of sensor is capable for trip interrupt or not. > This polling is more of governor related. Below are the few references from > Documentation/code which tells polling-delay-passive is needed for IPA for > better IPA performance. > > As per Power allocator documentations > > 1. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/thermal/power_allocator.rst?h=v5.4.71#n264 > > "The power allocator governor's PID controller works best if there is a > periodic tick. If you have a driver that calls > `thermal_zone_device_update()` (or anything that ends up calling the > governor's `throttle()` function) repetitively, the governor response > won't be very good. Note that this is not particular to this > governor, step-wise will also misbehave if you call its throttle() > faster than the normal thermal framework tick (due to interrupts for > example) as it will overreact" > > 2. In Power allocator code, when switch_on/control trip temp violation, it > is enabling passive counter to activate passive polling @ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n634 > > 3. while calculating derivative term, it is using passive_delay @ > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n243 > > 4. Sensor interrupt will work if temperature is fluctuating between > trip_temp and hysteresis. But say a case where we are not enabling > polling-delay-passive. In this case if current temperature > control_temp > trip(2nd passive trip) and > temperature trend is still raising, then sensor high trip will be disabled > (OR configured for critical trip threshold). No more trip interrupt from > sensor until it reaches critical trip or falls below control_temp > hysteresis. > How the governor re-evaluate its next mitigation without passive polling > thread here ? > > I think the same is required for CPU thermal zone as well. Thanks for the explication and pointers! I ran some tests to re-confirm. For that I lowered the trip point temperatures of CPU6 to 60/70, to make it easier to trigger throttling without necessarily affecting the other CPUs. Further I enabled tracing for the events 'thermal_temperature', 'thermal_zone_trip' and 'thermal_power_allocator'. With that I ran a CPU intensive task on CPU6. Without polling-delay the trace log looks like this: irq/40-c263000.-157 [000] .... 48.035986: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=57800 temp=60000 irq/40-c263000.-157 [000] .... 48.036029: thermal_power_allocator_pid: thermal_zone_id=6 err=10000 err_integral=0 p=2402 i=0 d=0 output=1776 irq/40-c263000.-157 [000] .... 48.036036: thermal_power_allocator: thermal_zone_id=6 req_power={{0x96}} total_req_power=150 granted_power={{0x6f0}} total_granted_power=1776 power_range=1776 max_allocatable_power=1776 current_temperature=60000 delta_temperature=10000 irq/40-c263000.-157 [000] .... 52.480888: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=60000 temp=70000 irq/40-c263000.-157 [000] .... 52.480925: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=0 p=0 i=0 d=0 output=1202 irq/40-c263000.-157 [000] .... 52.480931: thermal_power_allocator: thermal_zone_id=6 req_power={{0x45e}} total_req_power=1118 granted_power={{0x4b2}} total_granted_power=1202 power_range=1202 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 i.e. power_allocator only acts on the sensor interrupts at the trip points It looks different with a non-zero value for 'polling-delay-passive': irq/40-c263000.-156 [000] .... 104.501777: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 irq/40-c263000.-156 [000] .... 104.523073: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 irq/40-c263000.-156 [000] .... 104.523121: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897 irq/40-c263000.-156 [000] .... 104.523148: thermal_power_allocator: thermal_zone_id=6 req_power={{0x406}} total_req_power=1030 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 irq/40-c263000.-156 [000] .... 104.608566: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800 irq/40-c263000.-156 [000] .... 104.608612: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425 irq/40-c263000.-156 [000] .... 104.608642: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 irq/40-c263000.-156 [000] .... 104.630863: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 irq/40-c263000.-156 [000] .... 104.630907: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897 irq/40-c263000.-156 [000] .... 104.630932: thermal_power_allocator: thermal_zone_id=6 req_power={{0x3f4}} total_req_power=1012 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 irq/40-c263000.-156 [000] .... 104.687495: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800 irq/40-c263000.-156 [000] .... 104.687541: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425 irq/40-c263000.-156 [000] .... 104.687567: thermal_power_allocator: thermal_zone_id=6 req_power={{0x338}} total_req_power=824 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 irq/40-c263000.-156 [000] .... 104.711664: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 So it seems indeed the 'polling-delay-passive' is needed for better reactivity, and it should also be re-added to the other thermal zones. On a different note I first tried something similar on the GPU, the trip points triggered, however there was no reaction from power_allocator, the reason is that there is no power information for the GPU (num power actors = 0). It seems it doesn' make sense to use IPA as long as there is no energy model (even if it worked by using the lowest frequency as 'estimated power' throttling would likely be overly aggressive). Since the trip point configuration for IPA and 'step_wise' (and probably others) is somewhat incompatible (IPA aims for a temperature around the 2nd trip point, 'step_wise' interprets the first trip point as limit) I think it makes sense to continue with a single trip point for now. Thanks Matthias
On 10/16/2020 3:49 AM, Matthias Kaehlcke wrote: > Hi, > > On Thu, Oct 15, 2020 at 12:07:01AM +0530, manafm@codeaurora.org wrote: >> On 2020-10-14 18:59, Akhil P Oommen wrote: >>> On 10/9/2020 10:27 PM, Matthias Kaehlcke wrote: >>>> On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote: >>>>> Hi, >>>>> >>>>> On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen >>>>> <akhilpo@codeaurora.org> wrote: >>>>>> >>>>>> Add cooling-cells property and the cooling maps for the gpu tzones >>>>>> to support GPU cooling. >>>>>> >>>>>> Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> >>>>>> --- >>>>>> arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 >>>>>> ++++++++++++++++++++++------- >>>>>> 1 file changed, 22 insertions(+), 7 deletions(-) >>>>>> >>>>>> diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>>>> b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>>>> index d46b383..40d6a28 100644 >>>>>> --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>>>> +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi >>>>>> @@ -2,7 +2,7 @@ >>>>>> /* >>>>>> * SC7180 SoC device tree source >>>>>> * >>>>>> - * Copyright (c) 2019, The Linux Foundation. All rights reserved. >>>>>> + * Copyright (c) 2019-20, The Linux Foundation. All rights >>>>>> reserved. >>>>>> */ >>>>>> >>>>>> #include <dt-bindings/clock/qcom,dispcc-sc7180.h> >>>>>> @@ -1885,6 +1885,7 @@ >>>>>> iommus = <&adreno_smmu 0>; >>>>>> operating-points-v2 = <&gpu_opp_table>; >>>>>> qcom,gmu = <&gmu>; >>>>>> + #cooling-cells = <2>; >>>>> >>>>> Presumably we should add this to the devicetree bindings, too? >>> Yes, thanks for catching this. Will update in the next patch. >>> >>>>> >>>>> >>>>>> interconnects = <&gem_noc >>>>>> MASTER_GFX3D &mc_virt SLAVE_EBI1>; >>>>>> interconnect-names = "gfx-mem"; >>>>>> @@ -3825,16 +3826,16 @@ >>>>>> }; >>>>>> >>>>>> gpuss0-thermal { >>>>>> - polling-delay-passive = <0>; >>>>>> + polling-delay-passive = <100>; >>>>> >>>>> Why did you make this change? I'm pretty sure that we _don't_ want >>>>> this since we're using interrupts for the thermal sensor. See commit >>>>> 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in >>>>> Thermal-zones node"). >>>> >>>> I was going to ask the same, this shouldn't be needed. >> As per our understanding unlike "polling-delay", this delay property is >> intended to activate polling thread on post trip threshold violation and it >> is irrespective of sensor is capable for trip interrupt or not. >> This polling is more of governor related. Below are the few references from >> Documentation/code which tells polling-delay-passive is needed for IPA for >> better IPA performance. >> >> As per Power allocator documentations >> >> 1. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/thermal/power_allocator.rst?h=v5.4.71#n264 >> >> "The power allocator governor's PID controller works best if there is a >> periodic tick. If you have a driver that calls >> `thermal_zone_device_update()` (or anything that ends up calling the >> governor's `throttle()` function) repetitively, the governor response >> won't be very good. Note that this is not particular to this >> governor, step-wise will also misbehave if you call its throttle() >> faster than the normal thermal framework tick (due to interrupts for >> example) as it will overreact" >> >> 2. In Power allocator code, when switch_on/control trip temp violation, it >> is enabling passive counter to activate passive polling @ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n634 >> >> 3. while calculating derivative term, it is using passive_delay @ >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n243 >> >> 4. Sensor interrupt will work if temperature is fluctuating between >> trip_temp and hysteresis. But say a case where we are not enabling >> polling-delay-passive. In this case if current temperature > control_temp >> trip(2nd passive trip) and >> temperature trend is still raising, then sensor high trip will be disabled >> (OR configured for critical trip threshold). No more trip interrupt from >> sensor until it reaches critical trip or falls below control_temp >> hysteresis. >> How the governor re-evaluate its next mitigation without passive polling >> thread here ? >> >> I think the same is required for CPU thermal zone as well. > > Thanks for the explication and pointers! > > I ran some tests to re-confirm. For that I lowered the trip point temperatures > of CPU6 to 60/70, to make it easier to trigger throttling without necessarily > affecting the other CPUs. Further I enabled tracing for the events 'thermal_temperature', > 'thermal_zone_trip' and 'thermal_power_allocator'. With that I ran a CPU > intensive task on CPU6. > > Without polling-delay the trace log looks like this: > > irq/40-c263000.-157 [000] .... 48.035986: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=57800 temp=60000 > irq/40-c263000.-157 [000] .... 48.036029: thermal_power_allocator_pid: thermal_zone_id=6 err=10000 err_integral=0 p=2402 i=0 d=0 output=1776 > irq/40-c263000.-157 [000] .... 48.036036: thermal_power_allocator: thermal_zone_id=6 req_power={{0x96}} total_req_power=150 granted_power={{0x6f0}} total_granted_power=1776 power_range=1776 max_allocatable_power=1776 current_temperature=60000 delta_temperature=10000 > irq/40-c263000.-157 [000] .... 52.480888: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=60000 temp=70000 > irq/40-c263000.-157 [000] .... 52.480925: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=0 p=0 i=0 d=0 output=1202 > irq/40-c263000.-157 [000] .... 52.480931: thermal_power_allocator: thermal_zone_id=6 req_power={{0x45e}} total_req_power=1118 granted_power={{0x4b2}} total_granted_power=1202 power_range=1202 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 > > i.e. power_allocator only acts on the sensor interrupts at the trip points > > It looks different with a non-zero value for 'polling-delay-passive': > > irq/40-c263000.-156 [000] .... 104.501777: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 > irq/40-c263000.-156 [000] .... 104.523073: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 > irq/40-c263000.-156 [000] .... 104.523121: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897 > irq/40-c263000.-156 [000] .... 104.523148: thermal_power_allocator: thermal_zone_id=6 req_power={{0x406}} total_req_power=1030 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 > irq/40-c263000.-156 [000] .... 104.608566: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800 > irq/40-c263000.-156 [000] .... 104.608612: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425 > irq/40-c263000.-156 [000] .... 104.608642: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 > irq/40-c263000.-156 [000] .... 104.630863: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 > irq/40-c263000.-156 [000] .... 104.630907: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897 > irq/40-c263000.-156 [000] .... 104.630932: thermal_power_allocator: thermal_zone_id=6 req_power={{0x3f4}} total_req_power=1012 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0 > irq/40-c263000.-156 [000] .... 104.687495: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800 > irq/40-c263000.-156 [000] .... 104.687541: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425 > irq/40-c263000.-156 [000] .... 104.687567: thermal_power_allocator: thermal_zone_id=6 req_power={{0x338}} total_req_power=824 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200 > irq/40-c263000.-156 [000] .... 104.711664: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000 > > So it seems indeed the 'polling-delay-passive' is needed for better reactivity, > and it should also be re-added to the other thermal zones. > > On a different note I first tried something similar on the GPU, the trip points > triggered, however there was no reaction from power_allocator, the reason is > that there is no power information for the GPU (num power actors = 0). It seems > it doesn' make sense to use IPA as long as there is no energy model (even if it > worked by using the lowest frequency as 'estimated power' throttling would > likely be overly aggressive). Since the trip point configuration for IPA and > 'step_wise' (and probably others) is somewhat incompatible (IPA aims for a > temperature around the 2nd trip point, 'step_wise' interprets the first trip > point as limit) I think it makes sense to continue with a single trip point > for now. > > Thanks > > Matthias > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel > Looks like we have consensus here. I will spin another patchset. Manaf will share a separate patch to fix the CPU tzones. -Akhil.
diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi b/arch/arm64/boot/dts/qcom/sc7180.dtsi index d46b383..40d6a28 100644 --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi @@ -2,7 +2,7 @@ /* * SC7180 SoC device tree source * - * Copyright (c) 2019, The Linux Foundation. All rights reserved. + * Copyright (c) 2019-20, The Linux Foundation. All rights reserved. */ #include <dt-bindings/clock/qcom,dispcc-sc7180.h> @@ -1885,6 +1885,7 @@ iommus = <&adreno_smmu 0>; operating-points-v2 = <&gpu_opp_table>; qcom,gmu = <&gmu>; + #cooling-cells = <2>; interconnects = <&gem_noc MASTER_GFX3D &mc_virt SLAVE_EBI1>; interconnect-names = "gfx-mem"; @@ -3825,16 +3826,16 @@ }; gpuss0-thermal { - polling-delay-passive = <0>; + polling-delay-passive = <100>; polling-delay = <0>; thermal-sensors = <&tsens0 13>; trips { gpuss0_alert0: trip-point0 { - temperature = <90000>; + temperature = <95000>; hysteresis = <2000>; - type = "hot"; + type = "passive"; }; gpuss0_crit: gpuss0_crit { @@ -3843,19 +3844,26 @@ type = "critical"; }; }; + + cooling-maps { + map0 { + trip = <&gpuss0_alert0>; + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>; + }; + }; }; gpuss1-thermal { - polling-delay-passive = <0>; + polling-delay-passive = <100>; polling-delay = <0>; thermal-sensors = <&tsens0 14>; trips { gpuss1_alert0: trip-point0 { - temperature = <90000>; + temperature = <95000>; hysteresis = <2000>; - type = "hot"; + type = "passive"; }; gpuss1_crit: gpuss1_crit { @@ -3864,6 +3872,13 @@ type = "critical"; }; }; + + cooling-maps { + map0 { + trip = <&gpuss0_alert0>; + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>; + }; + }; }; aoss1-thermal {
Add cooling-cells property and the cooling maps for the gpu tzones to support GPU cooling. Signed-off-by: Akhil P Oommen <akhilpo@codeaurora.org> --- arch/arm64/boot/dts/qcom/sc7180.dtsi | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-)