diff mbox series

[v5,6/6] arm64: dts: qcom: Enable cpu cooling devices for QCS9075 platforms

Message ID 20241229152332.3068172-7-quic_wasimn@quicinc.com (mailing list archive)
State Changes Requested
Headers show
Series arm64: qcom: Add support for QCS9075 boards | expand

Commit Message

Wasim Nazir Dec. 29, 2024, 3:23 p.m. UTC
From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>

In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
does corrective action for each subsystem based on sensor violation
to comply safety standards. But as QCS9075 is non-safe SoC it
requires conventional thermal mitigation to control thermal for
different subsystems.

The cpu frequency throttling for different cpu tsens is enabled in
hardware as first defense for cpu thermal control. But QCS9075 SoC
has higher ambient specification. During high ambient condition, even
lowest frequency with multi cores can slowly build heat over the time
and it can lead to thermal run-away situations. This patch restrict
cpu cores during this scenario helps further thermal control and
avoids thermal critical violation.

Add cpu idle injection cooling bindings for cpu tsens thermal zones
as a mitigation for cpu subsystem prior to thermal shutdown.

Add cpu frequency cooling devices that will be used by userspace
thermal governor to mitigate skin thermal management.

Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
---
 arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
 arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
 arch/arm64/boot/dts/qcom/qcs9075-ride.dts     |   1 +
 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi | 287 ++++++++++++++++++
 4 files changed, 290 insertions(+)
 create mode 100644 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi

--
2.47.0

Comments

Aiqun(Maria) Yu Dec. 30, 2024, 6:02 a.m. UTC | #1
On 12/29/2024 11:23 PM, Wasim Nazir wrote:
> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> 
> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
[...]
> Add cpu frequency cooling devices that will be used by userspace
> thermal governor to mitigate skin thermal management.
> 
> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>

Also need to add SOB from the patch handler(Wasim).

Doc can reference [1].
snippets:
 - Signed-off-by: ``Patch handler <handler@mail>``

   SOBs after the author SOB are from people handling and transporting
   the patch, but were not involved in development. SOB chains should
   reflect the **real** route a patch took as it was propagated to us,
   with the first SOB entry signalling primary authorship of a single
   author.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/maintainer-tip.rst
[1]

> ---
>  arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
>  arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
[...]
> 
>  #include "sa8775p-ride.dtsi"
> +#include "qcs9075-thermal.dtsi"

Thermal nodes are usually added by soc.dtsi chips like sa8775p.dtsi.
From the description, it seems that having thermal information is a
common feature for SOC qcs9075.

Would it be better to have below dts structure instead?:

1) Add a qcs9075.dtsi that includes sa8775p.dtsi and qcs9075-thermal.dtsi.
2) Have a qcs9075-ride.dtsi that includes sa8776p.dtsi and
qcs9075-thermal.dtsi.
3) Ensure all qcs9075 board dts include qcs9075-ride.dtsi

> 
>  / {
>  	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> new file mode 100644
> index 000000000000..40544c8582c4
> --- /dev/null
> +++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> @@ -0,0 +1,287 @@
> +// SPDX-License-Identifier: BSD-3-Clause
> +/*
> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#include <dt-bindings/thermal/thermal.h>
> +
> +&cpu0 {
> +	#cooling-cells = <2>;

Why is cpu0 treated specially when it doesn't include
cpu0_idle/thermal-idle nodes? Could you provide the information to the
commit message?

By the way, if there is no cpu0_idle, does that mean the #cooling-cell
is also not needed?

> +};
> +
> +&cpu1 {
[...]
> +
> +/ {
> +	thermal-zones {

The first /thermal-zones is located in sa8775p.dtsi. Should it have an
alias instead of referencing the whole node with the path? Using an
alias can help the reviewer check the previous node's information and
imply that it is an override rather than a newly added node.

> +		cpu-0-1-0-thermal {
> +			trips {
> +				cpu_0_1_0_passive: trip-point1 {

It seems like a common attribute for cpu1-cpu7. Can it be a common trips
node that can be referenced by different cpu-*-*-*-thermal nodes?
Konrad Dybcio Dec. 30, 2024, 3:35 p.m. UTC | #2
On 29.12.2024 4:23 PM, Wasim Nazir wrote:
> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> 
> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> does corrective action for each subsystem based on sensor violation
> to comply safety standards. But as QCS9075 is non-safe SoC it
> requires conventional thermal mitigation to control thermal for
> different subsystems.
> 
> The cpu frequency throttling for different cpu tsens is enabled in
> hardware as first defense for cpu thermal control. But QCS9075 SoC
> has higher ambient specification. During high ambient condition, even
> lowest frequency with multi cores can slowly build heat over the time
> and it can lead to thermal run-away situations. This patch restrict
> cpu cores during this scenario helps further thermal control and
> avoids thermal critical violation.
> 
> Add cpu idle injection cooling bindings for cpu tsens thermal zones
> as a mitigation for cpu subsystem prior to thermal shutdown.
> 
> Add cpu frequency cooling devices that will be used by userspace
> thermal governor to mitigate skin thermal management.
> 
> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> ---

Does this bring measurable benefits over just making the CPU a cooling
device and pointing the thermal zones to it (and not the idle subnode)?

Konrad
Dmitry Baryshkov Dec. 30, 2024, 3:40 p.m. UTC | #3
On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> 
> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> does corrective action for each subsystem based on sensor violation
> to comply safety standards. But as QCS9075 is non-safe SoC it
> requires conventional thermal mitigation to control thermal for
> different subsystems.
> 
> The cpu frequency throttling for different cpu tsens is enabled in
> hardware as first defense for cpu thermal control. But QCS9075 SoC
> has higher ambient specification. During high ambient condition, even
> lowest frequency with multi cores can slowly build heat over the time
> and it can lead to thermal run-away situations. This patch restrict
> cpu cores during this scenario helps further thermal control and
> avoids thermal critical violation.
> 
> Add cpu idle injection cooling bindings for cpu tsens thermal zones
> as a mitigation for cpu subsystem prior to thermal shutdown.
> 
> Add cpu frequency cooling devices that will be used by userspace
> thermal governor to mitigate skin thermal management.

Does anything prevent us from having this config as a part of the basic
sa8775p.dtsi setup? If HW is present in the base version but it is not
accessible for whatever reason, please move it the base device config
and use status "disabled" or "reserved" to the respective board files.

> 
> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> ---
>  arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
>  arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
>  arch/arm64/boot/dts/qcom/qcs9075-ride.dts     |   1 +
>  arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi | 287 ++++++++++++++++++
>  4 files changed, 290 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> 
> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> index ecaa383b6508..3ab6deeaacf1 100644
> --- a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> +++ b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> @@ -9,6 +9,7 @@
> 
>  #include "sa8775p.dtsi"
>  #include "sa8775p-pmics.dtsi"
> +#include "qcs9075-thermal.dtsi"
> 
>  / {
>  	model = "Qualcomm Technologies, Inc. Robotics RB8";
> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> index d9a8956d3a76..5f2d9f416617 100644
> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> @@ -5,6 +5,7 @@
>  /dts-v1/;
> 
>  #include "sa8775p-ride.dtsi"
> +#include "qcs9075-thermal.dtsi"
> 
>  / {
>  	model = "Qualcomm Technologies, Inc. QCS9075 Ride Rev3";
> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> index 3b524359a72d..10ce48e7ba2f 100644
> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> @@ -5,6 +5,7 @@
>  /dts-v1/;
> 
>  #include "sa8775p-ride.dtsi"
> +#include "qcs9075-thermal.dtsi"
> 
>  / {
>  	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> new file mode 100644
> index 000000000000..40544c8582c4
> --- /dev/null
> +++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> @@ -0,0 +1,287 @@
> +// SPDX-License-Identifier: BSD-3-Clause
> +/*
> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#include <dt-bindings/thermal/thermal.h>
> +
> +&cpu0 {
> +	#cooling-cells = <2>;
> +};
> +
> +&cpu1 {
> +	#cooling-cells = <2>;
> +	cpu1_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu2 {
> +	#cooling-cells = <2>;
> +	cpu2_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu3 {
> +	#cooling-cells = <2>;
> +	cpu3_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu4 {
> +	#cooling-cells = <2>;
> +	cpu4_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu5 {
> +	#cooling-cells = <2>;
> +	cpu5_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu6 {
> +	#cooling-cells = <2>;
> +	cpu6_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +&cpu7 {
> +	#cooling-cells = <2>;
> +	cpu7_idle: thermal-idle {
> +		#cooling-cells = <2>;
> +		duration-us = <800000>;
> +		exit-latency-us = <10000>;
> +	};
> +};
> +
> +/ {
> +	thermal-zones {
> +		cpu-0-1-0-thermal {
> +			trips {
> +				cpu_0_1_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_1_0_passive>;
> +					cooling-device = <&cpu1_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-0-2-0-thermal {
> +			trips {
> +				cpu_0_2_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_2_0_passive>;
> +					cooling-device = <&cpu2_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-0-3-0-thermal {
> +			trips {
> +				cpu_0_3_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_3_0_passive>;
> +					cooling-device = <&cpu3_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-0-1-1-thermal {
> +			trips {
> +				cpu_0_1_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_1_1_passive>;
> +					cooling-device = <&cpu1_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-0-2-1-thermal {
> +			trips {
> +				cpu_0_2_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_2_1_passive>;
> +					cooling-device = <&cpu2_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-0-3-1-thermal {
> +			trips {
> +				cpu_0_3_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_0_3_1_passive>;
> +					cooling-device = <&cpu3_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-0-0-thermal {
> +			trips {
> +				cpu_1_0_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_0_0_passive>;
> +					cooling-device = <&cpu4_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-1-0-thermal {
> +			trips {
> +				cpu_1_1_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_1_0_passive>;
> +					cooling-device = <&cpu5_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-2-0-thermal {
> +			trips {
> +				cpu_1_2_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_2_0_passive>;
> +					cooling-device = <&cpu6_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-3-0-thermal {
> +			trips {
> +				cpu_1_3_0_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_3_0_passive>;
> +					cooling-device = <&cpu7_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-0-1-thermal {
> +			trips {
> +				cpu_1_0_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_0_1_passive>;
> +					cooling-device = <&cpu4_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-1-1-thermal {
> +			trips {
> +				cpu_1_1_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_1_1_passive>;
> +					cooling-device = <&cpu5_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-2-1-thermal {
> +			trips {
> +				cpu_1_2_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_2_1_passive>;
> +					cooling-device = <&cpu6_idle 100 100>;
> +				};
> +			};
> +		};
> +
> +		cpu-1-3-1-thermal {
> +			trips {
> +				cpu_1_3_1_passive: trip-point1 {
> +					temperature = <116000>;
> +				};
> +			};
> +
> +			cooling-maps {
> +				map0 {
> +					trip = <&cpu_1_3_1_passive>;
> +					cooling-device = <&cpu7_idle 100 100>;
> +				};
> +			};
> +		};
> +	};
> +};
> --
> 2.47.0
>
Manaf Meethalavalappu Pallikunhi Dec. 31, 2024, 11:05 a.m. UTC | #4
Hi Konrad,

On 12/30/2024 9:05 PM, Konrad Dybcio wrote:
> On 29.12.2024 4:23 PM, Wasim Nazir wrote:
>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>
>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>> does corrective action for each subsystem based on sensor violation
>> to comply safety standards. But as QCS9075 is non-safe SoC it
>> requires conventional thermal mitigation to control thermal for
>> different subsystems.
>>
>> The cpu frequency throttling for different cpu tsens is enabled in
>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>> has higher ambient specification. During high ambient condition, even
>> lowest frequency with multi cores can slowly build heat over the time
>> and it can lead to thermal run-away situations. This patch restrict
>> cpu cores during this scenario helps further thermal control and
>> avoids thermal critical violation.
>>
>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>
>> Add cpu frequency cooling devices that will be used by userspace
>> thermal governor to mitigate skin thermal management.
>>
>> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>> ---
> Does this bring measurable benefits over just making the CPU a cooling
> device and pointing the thermal zones to it (and not the idle subnode)?
>
> Konrad
As noted in the commit, CPU frequency mitigation is handled by hardware 
as a first level mitigation. The software/scheduler will be updated via 
arch_update_hw_pressure API [1] for this mitigation. Adding the same CPU 
mitigation in thermal zones is redundant. We are adding idle injection 
with a 100% duty cycle as an additional mitigation step  at higher trip 
to further reduce CPU power consumption. This helps device thermal 
stability further, especially in high ambient conditions.

[1]. 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/cpufreq/qcom-cpufreq-hw.c?h=next-20241220#n352

Best regards,

Manaf
Manaf Meethalavalappu Pallikunhi Dec. 31, 2024, 12:01 p.m. UTC | #5
Hi Dmitry,

On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>
>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>> does corrective action for each subsystem based on sensor violation
>> to comply safety standards. But as QCS9075 is non-safe SoC it
>> requires conventional thermal mitigation to control thermal for
>> different subsystems.
>>
>> The cpu frequency throttling for different cpu tsens is enabled in
>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>> has higher ambient specification. During high ambient condition, even
>> lowest frequency with multi cores can slowly build heat over the time
>> and it can lead to thermal run-away situations. This patch restrict
>> cpu cores during this scenario helps further thermal control and
>> avoids thermal critical violation.
>>
>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>
>> Add cpu frequency cooling devices that will be used by userspace
>> thermal governor to mitigate skin thermal management.
> Does anything prevent us from having this config as a part of the basic
> sa8775p.dtsi setup? If HW is present in the base version but it is not
> accessible for whatever reason, please move it the base device config
> and use status "disabled" or "reserved" to the respective board files.

Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and 
keep it disabled state. #cooling cells property for CPU, still wanted to 
keep it in board files as we don't want to enable any cooling device in 
base DT.

Best Regards,

Manaf

>
>> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>> ---
>>   arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
>>   arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
>>   arch/arm64/boot/dts/qcom/qcs9075-ride.dts     |   1 +
>>   arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi | 287 ++++++++++++++++++
>>   4 files changed, 290 insertions(+)
>>   create mode 100644 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>>
>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>> index ecaa383b6508..3ab6deeaacf1 100644
>> --- a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>> @@ -9,6 +9,7 @@
>>
>>   #include "sa8775p.dtsi"
>>   #include "sa8775p-pmics.dtsi"
>> +#include "qcs9075-thermal.dtsi"
>>
>>   / {
>>   	model = "Qualcomm Technologies, Inc. Robotics RB8";
>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>> index d9a8956d3a76..5f2d9f416617 100644
>> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>> @@ -5,6 +5,7 @@
>>   /dts-v1/;
>>
>>   #include "sa8775p-ride.dtsi"
>> +#include "qcs9075-thermal.dtsi"
>>
>>   / {
>>   	model = "Qualcomm Technologies, Inc. QCS9075 Ride Rev3";
>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>> index 3b524359a72d..10ce48e7ba2f 100644
>> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>> @@ -5,6 +5,7 @@
>>   /dts-v1/;
>>
>>   #include "sa8775p-ride.dtsi"
>> +#include "qcs9075-thermal.dtsi"
>>
>>   / {
>>   	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>> new file mode 100644
>> index 000000000000..40544c8582c4
>> --- /dev/null
>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>> @@ -0,0 +1,287 @@
>> +// SPDX-License-Identifier: BSD-3-Clause
>> +/*
>> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
>> + */
>> +
>> +#include <dt-bindings/thermal/thermal.h>
>> +
>> +&cpu0 {
>> +	#cooling-cells = <2>;
>> +};
>> +
>> +&cpu1 {
>> +	#cooling-cells = <2>;
>> +	cpu1_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu2 {
>> +	#cooling-cells = <2>;
>> +	cpu2_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu3 {
>> +	#cooling-cells = <2>;
>> +	cpu3_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu4 {
>> +	#cooling-cells = <2>;
>> +	cpu4_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu5 {
>> +	#cooling-cells = <2>;
>> +	cpu5_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu6 {
>> +	#cooling-cells = <2>;
>> +	cpu6_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +&cpu7 {
>> +	#cooling-cells = <2>;
>> +	cpu7_idle: thermal-idle {
>> +		#cooling-cells = <2>;
>> +		duration-us = <800000>;
>> +		exit-latency-us = <10000>;
>> +	};
>> +};
>> +
>> +/ {
>> +	thermal-zones {
>> +		cpu-0-1-0-thermal {
>> +			trips {
>> +				cpu_0_1_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_1_0_passive>;
>> +					cooling-device = <&cpu1_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-0-2-0-thermal {
>> +			trips {
>> +				cpu_0_2_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_2_0_passive>;
>> +					cooling-device = <&cpu2_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-0-3-0-thermal {
>> +			trips {
>> +				cpu_0_3_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_3_0_passive>;
>> +					cooling-device = <&cpu3_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-0-1-1-thermal {
>> +			trips {
>> +				cpu_0_1_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_1_1_passive>;
>> +					cooling-device = <&cpu1_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-0-2-1-thermal {
>> +			trips {
>> +				cpu_0_2_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_2_1_passive>;
>> +					cooling-device = <&cpu2_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-0-3-1-thermal {
>> +			trips {
>> +				cpu_0_3_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_0_3_1_passive>;
>> +					cooling-device = <&cpu3_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-0-0-thermal {
>> +			trips {
>> +				cpu_1_0_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_0_0_passive>;
>> +					cooling-device = <&cpu4_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-1-0-thermal {
>> +			trips {
>> +				cpu_1_1_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_1_0_passive>;
>> +					cooling-device = <&cpu5_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-2-0-thermal {
>> +			trips {
>> +				cpu_1_2_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_2_0_passive>;
>> +					cooling-device = <&cpu6_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-3-0-thermal {
>> +			trips {
>> +				cpu_1_3_0_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_3_0_passive>;
>> +					cooling-device = <&cpu7_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-0-1-thermal {
>> +			trips {
>> +				cpu_1_0_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_0_1_passive>;
>> +					cooling-device = <&cpu4_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-1-1-thermal {
>> +			trips {
>> +				cpu_1_1_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_1_1_passive>;
>> +					cooling-device = <&cpu5_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-2-1-thermal {
>> +			trips {
>> +				cpu_1_2_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_2_1_passive>;
>> +					cooling-device = <&cpu6_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +
>> +		cpu-1-3-1-thermal {
>> +			trips {
>> +				cpu_1_3_1_passive: trip-point1 {
>> +					temperature = <116000>;
>> +				};
>> +			};
>> +
>> +			cooling-maps {
>> +				map0 {
>> +					trip = <&cpu_1_3_1_passive>;
>> +					cooling-device = <&cpu7_idle 100 100>;
>> +				};
>> +			};
>> +		};
>> +	};
>> +};
>> --
>> 2.47.0
>>
Konrad Dybcio Dec. 31, 2024, 4:21 p.m. UTC | #6
On 31.12.2024 12:05 PM, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Konrad,
> 
> On 12/30/2024 9:05 PM, Konrad Dybcio wrote:
>> On 29.12.2024 4:23 PM, Wasim Nazir wrote:
>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>
>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>> does corrective action for each subsystem based on sensor violation
>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>> requires conventional thermal mitigation to control thermal for
>>> different subsystems.
>>>
>>> The cpu frequency throttling for different cpu tsens is enabled in
>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>> has higher ambient specification. During high ambient condition, even
>>> lowest frequency with multi cores can slowly build heat over the time
>>> and it can lead to thermal run-away situations. This patch restrict
>>> cpu cores during this scenario helps further thermal control and
>>> avoids thermal critical violation.
>>>
>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>
>>> Add cpu frequency cooling devices that will be used by userspace
>>> thermal governor to mitigate skin thermal management.
>>>
>>> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>> ---
>> Does this bring measurable benefits over just making the CPU a cooling
>> device and pointing the thermal zones to it (and not the idle subnode)?
>>
>> Konrad
> As noted in the commit, CPU frequency mitigation is handled by hardware as a first level mitigation. The software/scheduler will be updated via arch_update_hw_pressure API [1] for this mitigation. Adding the same CPU mitigation in thermal zones is redundant. We are adding idle injection with a 100% duty cycle as an additional mitigation step  at higher trip to further reduce CPU power consumption. This helps device thermal stability further, especially in high ambient conditions.

I understood this much from the commit message.

What I'm asking is, whether your solution actually works better than just
letting Linux software-throttle the CPUs, preferably backed by some
numbers.

I'm also unsure how this is supposed to reduce power consumption. If the
CPUs aren't busy, they should idle, and if they are not fully utilized, a
lower frequency would likely be scheduled.

Konrad


> 
> [1]. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/cpufreq/qcom-cpufreq-hw.c?h=next-20241220#n352
> 
> Best regards,
> 
> Manaf
>
Dmitry Baryshkov Jan. 3, 2025, 5:51 a.m. UTC | #7
On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Dmitry,
> 
> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> > On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> > > From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> > > 
> > > In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> > > does corrective action for each subsystem based on sensor violation
> > > to comply safety standards. But as QCS9075 is non-safe SoC it
> > > requires conventional thermal mitigation to control thermal for
> > > different subsystems.
> > > 
> > > The cpu frequency throttling for different cpu tsens is enabled in
> > > hardware as first defense for cpu thermal control. But QCS9075 SoC
> > > has higher ambient specification. During high ambient condition, even
> > > lowest frequency with multi cores can slowly build heat over the time
> > > and it can lead to thermal run-away situations. This patch restrict
> > > cpu cores during this scenario helps further thermal control and
> > > avoids thermal critical violation.
> > > 
> > > Add cpu idle injection cooling bindings for cpu tsens thermal zones
> > > as a mitigation for cpu subsystem prior to thermal shutdown.
> > > 
> > > Add cpu frequency cooling devices that will be used by userspace
> > > thermal governor to mitigate skin thermal management.
> > Does anything prevent us from having this config as a part of the basic
> > sa8775p.dtsi setup? If HW is present in the base version but it is not
> > accessible for whatever reason, please move it the base device config
> > and use status "disabled" or "reserved" to the respective board files.
> 
> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
> it disabled state. #cooling cells property for CPU, still wanted to keep it
> in board files as we don't want to enable any cooling device in base DT.

"we don't want" is not a proper justification. So, no.

> 
> Best Regards,
> 
> Manaf
> 
> > 
> > > Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> > > ---
> > >   arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
> > >   arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
> > >   arch/arm64/boot/dts/qcom/qcs9075-ride.dts     |   1 +
> > >   arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi | 287 ++++++++++++++++++
> > >   4 files changed, 290 insertions(+)
> > >   create mode 100644 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> > > 
> > > diff --git a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> > > index ecaa383b6508..3ab6deeaacf1 100644
> > > --- a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> > > +++ b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
> > > @@ -9,6 +9,7 @@
> > > 
> > >   #include "sa8775p.dtsi"
> > >   #include "sa8775p-pmics.dtsi"
> > > +#include "qcs9075-thermal.dtsi"
> > > 
> > >   / {
> > >   	model = "Qualcomm Technologies, Inc. Robotics RB8";
> > > diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> > > index d9a8956d3a76..5f2d9f416617 100644
> > > --- a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> > > +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
> > > @@ -5,6 +5,7 @@
> > >   /dts-v1/;
> > > 
> > >   #include "sa8775p-ride.dtsi"
> > > +#include "qcs9075-thermal.dtsi"
> > > 
> > >   / {
> > >   	model = "Qualcomm Technologies, Inc. QCS9075 Ride Rev3";
> > > diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> > > index 3b524359a72d..10ce48e7ba2f 100644
> > > --- a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> > > +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
> > > @@ -5,6 +5,7 @@
> > >   /dts-v1/;
> > > 
> > >   #include "sa8775p-ride.dtsi"
> > > +#include "qcs9075-thermal.dtsi"
> > > 
> > >   / {
> > >   	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
> > > diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> > > new file mode 100644
> > > index 000000000000..40544c8582c4
> > > --- /dev/null
> > > +++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
> > > @@ -0,0 +1,287 @@
> > > +// SPDX-License-Identifier: BSD-3-Clause
> > > +/*
> > > + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
> > > + */
> > > +
> > > +#include <dt-bindings/thermal/thermal.h>
> > > +
> > > +&cpu0 {
> > > +	#cooling-cells = <2>;
> > > +};
> > > +
> > > +&cpu1 {
> > > +	#cooling-cells = <2>;
> > > +	cpu1_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu2 {
> > > +	#cooling-cells = <2>;
> > > +	cpu2_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu3 {
> > > +	#cooling-cells = <2>;
> > > +	cpu3_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu4 {
> > > +	#cooling-cells = <2>;
> > > +	cpu4_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu5 {
> > > +	#cooling-cells = <2>;
> > > +	cpu5_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu6 {
> > > +	#cooling-cells = <2>;
> > > +	cpu6_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +&cpu7 {
> > > +	#cooling-cells = <2>;
> > > +	cpu7_idle: thermal-idle {
> > > +		#cooling-cells = <2>;
> > > +		duration-us = <800000>;
> > > +		exit-latency-us = <10000>;
> > > +	};
> > > +};
> > > +
> > > +/ {
> > > +	thermal-zones {
> > > +		cpu-0-1-0-thermal {
> > > +			trips {
> > > +				cpu_0_1_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_1_0_passive>;
> > > +					cooling-device = <&cpu1_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-0-2-0-thermal {
> > > +			trips {
> > > +				cpu_0_2_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_2_0_passive>;
> > > +					cooling-device = <&cpu2_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-0-3-0-thermal {
> > > +			trips {
> > > +				cpu_0_3_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_3_0_passive>;
> > > +					cooling-device = <&cpu3_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-0-1-1-thermal {
> > > +			trips {
> > > +				cpu_0_1_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_1_1_passive>;
> > > +					cooling-device = <&cpu1_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-0-2-1-thermal {
> > > +			trips {
> > > +				cpu_0_2_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_2_1_passive>;
> > > +					cooling-device = <&cpu2_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-0-3-1-thermal {
> > > +			trips {
> > > +				cpu_0_3_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_0_3_1_passive>;
> > > +					cooling-device = <&cpu3_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-0-0-thermal {
> > > +			trips {
> > > +				cpu_1_0_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_0_0_passive>;
> > > +					cooling-device = <&cpu4_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-1-0-thermal {
> > > +			trips {
> > > +				cpu_1_1_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_1_0_passive>;
> > > +					cooling-device = <&cpu5_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-2-0-thermal {
> > > +			trips {
> > > +				cpu_1_2_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_2_0_passive>;
> > > +					cooling-device = <&cpu6_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-3-0-thermal {
> > > +			trips {
> > > +				cpu_1_3_0_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_3_0_passive>;
> > > +					cooling-device = <&cpu7_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-0-1-thermal {
> > > +			trips {
> > > +				cpu_1_0_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_0_1_passive>;
> > > +					cooling-device = <&cpu4_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-1-1-thermal {
> > > +			trips {
> > > +				cpu_1_1_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_1_1_passive>;
> > > +					cooling-device = <&cpu5_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-2-1-thermal {
> > > +			trips {
> > > +				cpu_1_2_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_2_1_passive>;
> > > +					cooling-device = <&cpu6_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +
> > > +		cpu-1-3-1-thermal {
> > > +			trips {
> > > +				cpu_1_3_1_passive: trip-point1 {
> > > +					temperature = <116000>;
> > > +				};
> > > +			};
> > > +
> > > +			cooling-maps {
> > > +				map0 {
> > > +					trip = <&cpu_1_3_1_passive>;
> > > +					cooling-device = <&cpu7_idle 100 100>;
> > > +				};
> > > +			};
> > > +		};
> > > +	};
> > > +};
> > > --
> > > 2.47.0
> > >
Manaf Meethalavalappu Pallikunhi Jan. 8, 2025, 12:10 p.m. UTC | #8
Hi Konrad,

On 12/31/2024 9:51 PM, Konrad Dybcio wrote:
> On 31.12.2024 12:05 PM, Manaf Meethalavalappu Pallikunhi wrote:
>> Hi Konrad,
>>
>> On 12/30/2024 9:05 PM, Konrad Dybcio wrote:
>>> On 29.12.2024 4:23 PM, Wasim Nazir wrote:
>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>
>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>> does corrective action for each subsystem based on sensor violation
>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>> requires conventional thermal mitigation to control thermal for
>>>> different subsystems.
>>>>
>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>> has higher ambient specification. During high ambient condition, even
>>>> lowest frequency with multi cores can slowly build heat over the time
>>>> and it can lead to thermal run-away situations. This patch restrict
>>>> cpu cores during this scenario helps further thermal control and
>>>> avoids thermal critical violation.
>>>>
>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>
>>>> Add cpu frequency cooling devices that will be used by userspace
>>>> thermal governor to mitigate skin thermal management.
>>>>
>>>> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>> ---
>>> Does this bring measurable benefits over just making the CPU a cooling
>>> device and pointing the thermal zones to it (and not the idle subnode)?
>>>
>>> Konrad
>> As noted in the commit, CPU frequency mitigation is handled by hardware as a first level mitigation. The software/scheduler will be updated via arch_update_hw_pressure API [1] for this mitigation. Adding the same CPU mitigation in thermal zones is redundant. We are adding idle injection with a 100% duty cycle as an additional mitigation step  at higher trip to further reduce CPU power consumption. This helps device thermal stability further, especially in high ambient conditions.
> I understood this much from the commit message.
>
> What I'm asking is, whether your solution actually works better than just
> letting Linux software-throttle the CPUs, preferably backed by some
> numbers.
I hope by ‘your solution’ you mean HW CPU frequency throttling. Yes, we 
benefit from the hardware approach compared to Linux software-based CPU 
throttling, both in terms of tighter thermal control and improved 
performance.
For the Dhrystone use case from one of our boards, we observe only a 
0.3°C overshoot compared to 2.5°C with software CPU throttling using the 
stepwise governor for same trip threshold.
>
> I'm also unsure how this is supposed to reduce power consumption. If the
> CPUs aren't busy, they should idle, and if they are not fully utilized, a
> lower frequency would likely be scheduled.

By using CPU idle injection, we force the CPU to enter idle mode with 
the lowest LPM modes during high temperature. This approach is similar 
to hot-plugging a core and will further reduce static power for that 
CPU, helping to manage temperature further.

[1]. https://docs.kernel.org/driver-api/thermal/cpu-idle-cooling.html

Best regards,

Manaf

>
> Konrad
>
>
>> [1]. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/cpufreq/qcom-cpufreq-hw.c?h=next-20241220#n352
>>
>> Best regards,
>>
>> Manaf
>>
Manaf Meethalavalappu Pallikunhi Jan. 8, 2025, 12:27 p.m. UTC | #9
Hi Dmitry,


On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>> Hi Dmitry,
>>
>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>
>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>> does corrective action for each subsystem based on sensor violation
>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>> requires conventional thermal mitigation to control thermal for
>>>> different subsystems.
>>>>
>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>> has higher ambient specification. During high ambient condition, even
>>>> lowest frequency with multi cores can slowly build heat over the time
>>>> and it can lead to thermal run-away situations. This patch restrict
>>>> cpu cores during this scenario helps further thermal control and
>>>> avoids thermal critical violation.
>>>>
>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>
>>>> Add cpu frequency cooling devices that will be used by userspace
>>>> thermal governor to mitigate skin thermal management.
>>> Does anything prevent us from having this config as a part of the basic
>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
>>> accessible for whatever reason, please move it the base device config
>>> and use status "disabled" or "reserved" to the respective board files.
>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
>> it disabled state. #cooling cells property for CPU, still wanted to keep it
>> in board files as we don't want to enable any cooling device in base DT.
> "we don't want" is not a proper justification. So, no.

As noted in the commit, thermal cooling mitigation is only necessary for 
non-safe SoCs. Adding this cooling cell property to the CPU node in the 
base DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, 
would violate the requirements for safe SoCs. Therefore, we will include 
it only in non-safe SoC boards.

Best Regards,

Manaf

>> Best Regards,
>>
>> Manaf
>>
>>>> Signed-off-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>> ---
>>>>    arch/arm64/boot/dts/qcom/qcs9075-rb8.dts      |   1 +
>>>>    arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts  |   1 +
>>>>    arch/arm64/boot/dts/qcom/qcs9075-ride.dts     |   1 +
>>>>    arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi | 287 ++++++++++++++++++
>>>>    4 files changed, 290 insertions(+)
>>>>    create mode 100644 arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>>>>
>>>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>>>> index ecaa383b6508..3ab6deeaacf1 100644
>>>> --- a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>>>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
>>>> @@ -9,6 +9,7 @@
>>>>
>>>>    #include "sa8775p.dtsi"
>>>>    #include "sa8775p-pmics.dtsi"
>>>> +#include "qcs9075-thermal.dtsi"
>>>>
>>>>    / {
>>>>    	model = "Qualcomm Technologies, Inc. Robotics RB8";
>>>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>>>> index d9a8956d3a76..5f2d9f416617 100644
>>>> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>>>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
>>>> @@ -5,6 +5,7 @@
>>>>    /dts-v1/;
>>>>
>>>>    #include "sa8775p-ride.dtsi"
>>>> +#include "qcs9075-thermal.dtsi"
>>>>
>>>>    / {
>>>>    	model = "Qualcomm Technologies, Inc. QCS9075 Ride Rev3";
>>>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>>>> index 3b524359a72d..10ce48e7ba2f 100644
>>>> --- a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>>>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
>>>> @@ -5,6 +5,7 @@
>>>>    /dts-v1/;
>>>>
>>>>    #include "sa8775p-ride.dtsi"
>>>> +#include "qcs9075-thermal.dtsi"
>>>>
>>>>    / {
>>>>    	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
>>>> diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>>>> new file mode 100644
>>>> index 000000000000..40544c8582c4
>>>> --- /dev/null
>>>> +++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
>>>> @@ -0,0 +1,287 @@
>>>> +// SPDX-License-Identifier: BSD-3-Clause
>>>> +/*
>>>> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
>>>> + */
>>>> +
>>>> +#include <dt-bindings/thermal/thermal.h>
>>>> +
>>>> +&cpu0 {
>>>> +	#cooling-cells = <2>;
>>>> +};
>>>> +
>>>> +&cpu1 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu1_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu2 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu2_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu3 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu3_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu4 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu4_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu5 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu5_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu6 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu6_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +&cpu7 {
>>>> +	#cooling-cells = <2>;
>>>> +	cpu7_idle: thermal-idle {
>>>> +		#cooling-cells = <2>;
>>>> +		duration-us = <800000>;
>>>> +		exit-latency-us = <10000>;
>>>> +	};
>>>> +};
>>>> +
>>>> +/ {
>>>> +	thermal-zones {
>>>> +		cpu-0-1-0-thermal {
>>>> +			trips {
>>>> +				cpu_0_1_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_1_0_passive>;
>>>> +					cooling-device = <&cpu1_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-0-2-0-thermal {
>>>> +			trips {
>>>> +				cpu_0_2_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_2_0_passive>;
>>>> +					cooling-device = <&cpu2_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-0-3-0-thermal {
>>>> +			trips {
>>>> +				cpu_0_3_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_3_0_passive>;
>>>> +					cooling-device = <&cpu3_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-0-1-1-thermal {
>>>> +			trips {
>>>> +				cpu_0_1_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_1_1_passive>;
>>>> +					cooling-device = <&cpu1_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-0-2-1-thermal {
>>>> +			trips {
>>>> +				cpu_0_2_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_2_1_passive>;
>>>> +					cooling-device = <&cpu2_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-0-3-1-thermal {
>>>> +			trips {
>>>> +				cpu_0_3_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_0_3_1_passive>;
>>>> +					cooling-device = <&cpu3_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-0-0-thermal {
>>>> +			trips {
>>>> +				cpu_1_0_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_0_0_passive>;
>>>> +					cooling-device = <&cpu4_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-1-0-thermal {
>>>> +			trips {
>>>> +				cpu_1_1_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_1_0_passive>;
>>>> +					cooling-device = <&cpu5_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-2-0-thermal {
>>>> +			trips {
>>>> +				cpu_1_2_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_2_0_passive>;
>>>> +					cooling-device = <&cpu6_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-3-0-thermal {
>>>> +			trips {
>>>> +				cpu_1_3_0_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_3_0_passive>;
>>>> +					cooling-device = <&cpu7_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-0-1-thermal {
>>>> +			trips {
>>>> +				cpu_1_0_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_0_1_passive>;
>>>> +					cooling-device = <&cpu4_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-1-1-thermal {
>>>> +			trips {
>>>> +				cpu_1_1_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_1_1_passive>;
>>>> +					cooling-device = <&cpu5_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-2-1-thermal {
>>>> +			trips {
>>>> +				cpu_1_2_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_2_1_passive>;
>>>> +					cooling-device = <&cpu6_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +
>>>> +		cpu-1-3-1-thermal {
>>>> +			trips {
>>>> +				cpu_1_3_1_passive: trip-point1 {
>>>> +					temperature = <116000>;
>>>> +				};
>>>> +			};
>>>> +
>>>> +			cooling-maps {
>>>> +				map0 {
>>>> +					trip = <&cpu_1_3_1_passive>;
>>>> +					cooling-device = <&cpu7_idle 100 100>;
>>>> +				};
>>>> +			};
>>>> +		};
>>>> +	};
>>>> +};
>>>> --
>>>> 2.47.0
>>>>
Dmitry Baryshkov Jan. 8, 2025, 12:46 p.m. UTC | #10
On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Dmitry,
> 
> 
> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> > On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > Hi Dmitry,
> > > 
> > > On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> > > > On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> > > > > From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> > > > > 
> > > > > In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> > > > > does corrective action for each subsystem based on sensor violation
> > > > > to comply safety standards. But as QCS9075 is non-safe SoC it
> > > > > requires conventional thermal mitigation to control thermal for
> > > > > different subsystems.
> > > > > 
> > > > > The cpu frequency throttling for different cpu tsens is enabled in
> > > > > hardware as first defense for cpu thermal control. But QCS9075 SoC
> > > > > has higher ambient specification. During high ambient condition, even
> > > > > lowest frequency with multi cores can slowly build heat over the time
> > > > > and it can lead to thermal run-away situations. This patch restrict
> > > > > cpu cores during this scenario helps further thermal control and
> > > > > avoids thermal critical violation.
> > > > > 
> > > > > Add cpu idle injection cooling bindings for cpu tsens thermal zones
> > > > > as a mitigation for cpu subsystem prior to thermal shutdown.
> > > > > 
> > > > > Add cpu frequency cooling devices that will be used by userspace
> > > > > thermal governor to mitigate skin thermal management.
> > > > Does anything prevent us from having this config as a part of the basic
> > > > sa8775p.dtsi setup? If HW is present in the base version but it is not
> > > > accessible for whatever reason, please move it the base device config
> > > > and use status "disabled" or "reserved" to the respective board files.
> > > Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
> > > it disabled state. #cooling cells property for CPU, still wanted to keep it
> > > in board files as we don't want to enable any cooling device in base DT.
> > "we don't want" is not a proper justification. So, no.
> 
> As noted in the commit, thermal cooling mitigation is only necessary for
> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
> violate the requirements for safe SoCs. Therefore, we will include it only
> in non-safe SoC boards.

"is only necessary" is fine. It means that it is an optional part which
is going to be unused / ignored / duplicate functionality on the "safe"
SoCs. What kind of requirement is going to be violated in this way?
Manaf Meethalavalappu Pallikunhi Jan. 8, 2025, 4:08 p.m. UTC | #11
Hi Dmitry,


On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>> Hi Dmitry,
>>
>>
>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>> Hi Dmitry,
>>>>
>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>>>
>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>>>> does corrective action for each subsystem based on sensor violation
>>>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>>>> requires conventional thermal mitigation to control thermal for
>>>>>> different subsystems.
>>>>>>
>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>>>> has higher ambient specification. During high ambient condition, even
>>>>>> lowest frequency with multi cores can slowly build heat over the time
>>>>>> and it can lead to thermal run-away situations. This patch restrict
>>>>>> cpu cores during this scenario helps further thermal control and
>>>>>> avoids thermal critical violation.
>>>>>>
>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>>>
>>>>>> Add cpu frequency cooling devices that will be used by userspace
>>>>>> thermal governor to mitigate skin thermal management.
>>>>> Does anything prevent us from having this config as a part of the basic
>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
>>>>> accessible for whatever reason, please move it the base device config
>>>>> and use status "disabled" or "reserved" to the respective board files.
>>>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
>>>> in board files as we don't want to enable any cooling device in base DT.
>>> "we don't want" is not a proper justification. So, no.
>> As noted in the commit, thermal cooling mitigation is only necessary for
>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
>> violate the requirements for safe SoCs. Therefore, we will include it only
>> in non-safe SoC boards.
> "is only necessary" is fine. It means that it is an optional part which
> is going to be unused / ignored / duplicate functionality on the "safe"
> SoCs. What kind of requirement is going to be violated in this way?

 From the perspective of a safe SoC, any software mitigation that 
compromises the safety subsystem’s compliance should not be allowed. 
Enabling the cooling device also opens up the sysfs interface for 
userspace, which we may not fully control. Userspace apps or partner 
apps might inadvertently use it. Therefore, we believe it is better not 
to expose such an interface, as it is not required for that SoC and 
helps to avoid opening up an interface that could potentially lead to a 
safety failure.

Best Regards,

Manaf

>
Konrad Dybcio Jan. 9, 2025, 2:30 p.m. UTC | #12
On 8.01.2025 5:08 PM, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Dmitry,
> 
> 
> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
>> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>> Hi Dmitry,
>>>
>>>
>>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
>>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>>> Hi Dmitry,
>>>>>
>>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
>>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>>>>
>>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>>>>> does corrective action for each subsystem based on sensor violation
>>>>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>>>>> requires conventional thermal mitigation to control thermal for
>>>>>>> different subsystems.
>>>>>>>
>>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>>>>> has higher ambient specification. During high ambient condition, even
>>>>>>> lowest frequency with multi cores can slowly build heat over the time
>>>>>>> and it can lead to thermal run-away situations. This patch restrict
>>>>>>> cpu cores during this scenario helps further thermal control and
>>>>>>> avoids thermal critical violation.
>>>>>>>
>>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>>>>
>>>>>>> Add cpu frequency cooling devices that will be used by userspace
>>>>>>> thermal governor to mitigate skin thermal management.
>>>>>> Does anything prevent us from having this config as a part of the basic
>>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
>>>>>> accessible for whatever reason, please move it the base device config
>>>>>> and use status "disabled" or "reserved" to the respective board files.
>>>>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
>>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
>>>>> in board files as we don't want to enable any cooling device in base DT.
>>>> "we don't want" is not a proper justification. So, no.
>>> As noted in the commit, thermal cooling mitigation is only necessary for
>>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
>>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
>>> violate the requirements for safe SoCs. Therefore, we will include it only
>>> in non-safe SoC boards.
>> "is only necessary" is fine. It means that it is an optional part which
>> is going to be unused / ignored / duplicate functionality on the "safe"
>> SoCs. What kind of requirement is going to be violated in this way?
> 
> From the perspective of a safe SoC, any software mitigation that compromises the safety subsystem’s compliance should not be allowed. Enabling the cooling device also opens up the sysfs interface for userspace, which we may not fully control. Userspace apps or partner apps might inadvertently use it. Therefore, we believe it is better not to expose such an interface, as it is not required for that SoC and helps to avoid opening up an interface that could potentially lead to a safety failure.

So, to recalibrate, would this be accurate?:

* "safe" SoCs are the ones with a SAIL/Safety Island block which
  performs thermal throttling without OS intervention and does so
  on all SAIL-equipped platforms

* SoCs with a SAIL are intended to be used in e.g. cars and if we
  want to keep mainline viable on those, we must comply with some
  regulations, which prevent e.g. software thermal throttling

* idle injection provides measurable improvements over software-
  based frequency throttling on platforms with SAIL

Konrad
Dmitry Baryshkov Jan. 9, 2025, 11:54 p.m. UTC | #13
On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Dmitry,
> 
> 
> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
> > On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > Hi Dmitry,
> > > 
> > > 
> > > On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> > > > On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > > > Hi Dmitry,
> > > > > 
> > > > > On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> > > > > > On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> > > > > > > From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> > > > > > > 
> > > > > > > In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> > > > > > > does corrective action for each subsystem based on sensor violation
> > > > > > > to comply safety standards. But as QCS9075 is non-safe SoC it
> > > > > > > requires conventional thermal mitigation to control thermal for
> > > > > > > different subsystems.
> > > > > > > 
> > > > > > > The cpu frequency throttling for different cpu tsens is enabled in
> > > > > > > hardware as first defense for cpu thermal control. But QCS9075 SoC
> > > > > > > has higher ambient specification. During high ambient condition, even
> > > > > > > lowest frequency with multi cores can slowly build heat over the time
> > > > > > > and it can lead to thermal run-away situations. This patch restrict
> > > > > > > cpu cores during this scenario helps further thermal control and
> > > > > > > avoids thermal critical violation.
> > > > > > > 
> > > > > > > Add cpu idle injection cooling bindings for cpu tsens thermal zones
> > > > > > > as a mitigation for cpu subsystem prior to thermal shutdown.
> > > > > > > 
> > > > > > > Add cpu frequency cooling devices that will be used by userspace
> > > > > > > thermal governor to mitigate skin thermal management.
> > > > > > Does anything prevent us from having this config as a part of the basic
> > > > > > sa8775p.dtsi setup? If HW is present in the base version but it is not
> > > > > > accessible for whatever reason, please move it the base device config
> > > > > > and use status "disabled" or "reserved" to the respective board files.
> > > > > Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
> > > > > it disabled state. #cooling cells property for CPU, still wanted to keep it
> > > > > in board files as we don't want to enable any cooling device in base DT.
> > > > "we don't want" is not a proper justification. So, no.
> > > As noted in the commit, thermal cooling mitigation is only necessary for
> > > non-safe SoCs. Adding this cooling cell property to the CPU node in the base
> > > DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
> > > violate the requirements for safe SoCs. Therefore, we will include it only
> > > in non-safe SoC boards.
> > "is only necessary" is fine. It means that it is an optional part which
> > is going to be unused / ignored / duplicate functionality on the "safe"
> > SoCs. What kind of requirement is going to be violated in this way?
> 
> From the perspective of a safe SoC, any software mitigation that compromises
> the safety subsystem’s compliance should not be allowed. Enabling the
> cooling device also opens up the sysfs interface for userspace, which we may
> not fully control.

THere are a lot of interfaces exported to the userspace.

> Userspace apps or partner apps might inadvertently use
> it. Therefore, we believe it is better not to expose such an interface, as
> it is not required for that SoC and helps to avoid opening up an interface
> that could potentially lead to a safety failure.

How can thermal mitigation interface lead to safety failure? Userspace
can possibly lower trip points, but it can not override existing
firmware-based mitigation.
And if there is a known problem with the interface, it should be fixed
instead.

> 
> Best Regards,
> 
> Manaf
> 
> >
Konrad Dybcio Jan. 10, 2025, 12:31 p.m. UTC | #14
On 10.01.2025 12:54 AM, Dmitry Baryshkov wrote:
> On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>
>> Hi Dmitry,
>>
>>
>> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
>>> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>> Hi Dmitry,
>>>>
>>>>
>>>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
>>>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
>>>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>>>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>>>>>
>>>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>>>>>> does corrective action for each subsystem based on sensor violation
>>>>>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>>>>>> requires conventional thermal mitigation to control thermal for
>>>>>>>> different subsystems.
>>>>>>>>
>>>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>>>>>> has higher ambient specification. During high ambient condition, even
>>>>>>>> lowest frequency with multi cores can slowly build heat over the time
>>>>>>>> and it can lead to thermal run-away situations. This patch restrict
>>>>>>>> cpu cores during this scenario helps further thermal control and
>>>>>>>> avoids thermal critical violation.
>>>>>>>>
>>>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>>>>>
>>>>>>>> Add cpu frequency cooling devices that will be used by userspace
>>>>>>>> thermal governor to mitigate skin thermal management.
>>>>>>> Does anything prevent us from having this config as a part of the basic
>>>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
>>>>>>> accessible for whatever reason, please move it the base device config
>>>>>>> and use status "disabled" or "reserved" to the respective board files.
>>>>>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
>>>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
>>>>>> in board files as we don't want to enable any cooling device in base DT.
>>>>> "we don't want" is not a proper justification. So, no.
>>>> As noted in the commit, thermal cooling mitigation is only necessary for
>>>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
>>>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
>>>> violate the requirements for safe SoCs. Therefore, we will include it only
>>>> in non-safe SoC boards.
>>> "is only necessary" is fine. It means that it is an optional part which
>>> is going to be unused / ignored / duplicate functionality on the "safe"
>>> SoCs. What kind of requirement is going to be violated in this way?
>>
>> From the perspective of a safe SoC, any software mitigation that compromises
>> the safety subsystem’s compliance should not be allowed. Enabling the
>> cooling device also opens up the sysfs interface for userspace, which we may
>> not fully control.
> 
> THere are a lot of interfaces exported to the userspace.
> 
>> Userspace apps or partner apps might inadvertently use
>> it. Therefore, we believe it is better not to expose such an interface, as
>> it is not required for that SoC and helps to avoid opening up an interface
>> that could potentially lead to a safety failure.
> 
> How can thermal mitigation interface lead to safety failure? Userspace
> can possibly lower trip points, but it can not override existing
> firmware-based mitigation.
> And if there is a known problem with the interface, it should be fixed
> instead.

I think the intended case to avoid is where a malicious actor would set
the trips too low, resulting in throttling down the CPU to FMIN / Linux
throttling CPUs to try and escape what it believes to be possible thermal
runaway / a system reboot. Not something desired in a car.

Konrad
Dmitry Baryshkov Jan. 13, 2025, 8:43 a.m. UTC | #15
On Fri, Jan 10, 2025 at 01:31:00PM +0100, Konrad Dybcio wrote:
> On 10.01.2025 12:54 AM, Dmitry Baryshkov wrote:
> > On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>
> >> Hi Dmitry,
> >>
> >>
> >> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
> >>> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>>> Hi Dmitry,
> >>>>
> >>>>
> >>>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> >>>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>>>>> Hi Dmitry,
> >>>>>>
> >>>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> >>>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> >>>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> >>>>>>>>
> >>>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> >>>>>>>> does corrective action for each subsystem based on sensor violation
> >>>>>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
> >>>>>>>> requires conventional thermal mitigation to control thermal for
> >>>>>>>> different subsystems.
> >>>>>>>>
> >>>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
> >>>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
> >>>>>>>> has higher ambient specification. During high ambient condition, even
> >>>>>>>> lowest frequency with multi cores can slowly build heat over the time
> >>>>>>>> and it can lead to thermal run-away situations. This patch restrict
> >>>>>>>> cpu cores during this scenario helps further thermal control and
> >>>>>>>> avoids thermal critical violation.
> >>>>>>>>
> >>>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
> >>>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
> >>>>>>>>
> >>>>>>>> Add cpu frequency cooling devices that will be used by userspace
> >>>>>>>> thermal governor to mitigate skin thermal management.
> >>>>>>> Does anything prevent us from having this config as a part of the basic
> >>>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
> >>>>>>> accessible for whatever reason, please move it the base device config
> >>>>>>> and use status "disabled" or "reserved" to the respective board files.
> >>>>>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
> >>>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
> >>>>>> in board files as we don't want to enable any cooling device in base DT.
> >>>>> "we don't want" is not a proper justification. So, no.
> >>>> As noted in the commit, thermal cooling mitigation is only necessary for
> >>>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
> >>>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
> >>>> violate the requirements for safe SoCs. Therefore, we will include it only
> >>>> in non-safe SoC boards.
> >>> "is only necessary" is fine. It means that it is an optional part which
> >>> is going to be unused / ignored / duplicate functionality on the "safe"
> >>> SoCs. What kind of requirement is going to be violated in this way?
> >>
> >> From the perspective of a safe SoC, any software mitigation that compromises
> >> the safety subsystem’s compliance should not be allowed. Enabling the
> >> cooling device also opens up the sysfs interface for userspace, which we may
> >> not fully control.
> > 
> > THere are a lot of interfaces exported to the userspace.
> > 
> >> Userspace apps or partner apps might inadvertently use
> >> it. Therefore, we believe it is better not to expose such an interface, as
> >> it is not required for that SoC and helps to avoid opening up an interface
> >> that could potentially lead to a safety failure.
> > 
> > How can thermal mitigation interface lead to safety failure? Userspace
> > can possibly lower trip points, but it can not override existing
> > firmware-based mitigation.
> > And if there is a known problem with the interface, it should be fixed
> > instead.
> 
> I think the intended case to avoid is where a malicious actor would set
> the trips too low, resulting in throttling down the CPU to FMIN / Linux
> throttling CPUs to try and escape what it believes to be possible thermal
> runaway / a system reboot. Not something desired in a car.

Being able to set trip points via sysfs means that the system is already
compromised. At this point it can do whatever the actor wants - e.g.
display malicious HUD or just a gren bar or black screen, scream into
dynamic, etc. That doesn't sound like the temperature trip points being
the only or the major problem of a car.

Anyway, if that's really the only problem, please use the framework to
make the temperature and hysteresis of the trip point R/O for sa8775p /
qcs9100. Other attributes might need to be made R/O too. It well might
be that I'm missing one of the automotive peculiarties here. In such a
case the commit message should be more explicit that it's AGL or some
other requirement and provide a link.
Manaf Meethalavalappu Pallikunhi Jan. 14, 2025, 7:16 p.m. UTC | #16
Hi Dmitry,

Sorry, I was busy with some other priority tasks.

On 1/10/2025 5:24 AM, Dmitry Baryshkov wrote:
> On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>> Hi Dmitry,
>>
>>
>> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
>>> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>> Hi Dmitry,
>>>>
>>>>
>>>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
>>>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
>>>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
>>>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
>>>>>>>>
>>>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
>>>>>>>> does corrective action for each subsystem based on sensor violation
>>>>>>>> to comply safety standards. But as QCS9075 is non-safe SoC it
>>>>>>>> requires conventional thermal mitigation to control thermal for
>>>>>>>> different subsystems.
>>>>>>>>
>>>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
>>>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
>>>>>>>> has higher ambient specification. During high ambient condition, even
>>>>>>>> lowest frequency with multi cores can slowly build heat over the time
>>>>>>>> and it can lead to thermal run-away situations. This patch restrict
>>>>>>>> cpu cores during this scenario helps further thermal control and
>>>>>>>> avoids thermal critical violation.
>>>>>>>>
>>>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
>>>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
>>>>>>>>
>>>>>>>> Add cpu frequency cooling devices that will be used by userspace
>>>>>>>> thermal governor to mitigate skin thermal management.
>>>>>>> Does anything prevent us from having this config as a part of the basic
>>>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
>>>>>>> accessible for whatever reason, please move it the base device config
>>>>>>> and use status "disabled" or "reserved" to the respective board files.
>>>>>> Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
>>>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
>>>>>> in board files as we don't want to enable any cooling device in base DT.
>>>>> "we don't want" is not a proper justification. So, no.
>>>> As noted in the commit, thermal cooling mitigation is only necessary for
>>>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
>>>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
>>>> violate the requirements for safe SoCs. Therefore, we will include it only
>>>> in non-safe SoC boards.
>>> "is only necessary" is fine. It means that it is an optional part which
>>> is going to be unused / ignored / duplicate functionality on the "safe"
>>> SoCs. What kind of requirement is going to be violated in this way?
>>  From the perspective of a safe SoC, any software mitigation that compromises
>> the safety subsystem’s compliance should not be allowed. Enabling the
>> cooling device also opens up the sysfs interface for userspace, which we may
>> not fully control.
> THere are a lot of interfaces exported to the userspace.
>
>> Userspace apps or partner apps might inadvertently use
>> it. Therefore, we believe it is better not to expose such an interface, as
>> it is not required for that SoC and helps to avoid opening up an interface
>> that could potentially lead to a safety failure.
> How can thermal mitigation interface lead to safety failure? Userspace
> can possibly lower trip points, but it can not override existing
> firmware-based mitigation.
Both firmware and software passive mitigations (CPU/GPU) are not 
permitted for Safe SoC. As mentioned in the commit, it is the 
responsibility of the safety subsystem to take corrective action for 
high temperatures. One of the safety requirements (not a functional 
requirement) for Safe SoC is to avoid scaling of CPUs and bus DCVS, as 
this could impact safety-critical workloads. Therefore, Safe SoC will 
operate at maximum frequency with the performance governor. Enabling a 
cooling device for the CPU would expose the cooling device sysfs 
interface, allowing userspace to request different cooling states via 
the cooling device cur_state sysfs, which could potentially lower the 
frequency and violate the safety requirement.
> And if there is a known problem with the interface, it should be fixed
> instead.

There is no interface issue, as it is not required and can be disabled 
for Safe SoC. This approach has already been used for commercializing 
the SA8775P, and we do not want to disrupt it now. Therefore, we believe 
it is better not to add cpu cooling cell property (CPU cooling device) 
in sa8775p base dtsi.

Best Regards,

Manaf

>
>> Best Regards,
>>
>> Manaf
>>
Dmitry Baryshkov Jan. 15, 2025, 8:54 a.m. UTC | #17
On Wed, Jan 15, 2025 at 12:46:50AM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> 
> Hi Dmitry,
> 
> Sorry, I was busy with some other priority tasks.
> 
> On 1/10/2025 5:24 AM, Dmitry Baryshkov wrote:
> > On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > Hi Dmitry,
> > > 
> > > 
> > > On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
> > > > On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > > > Hi Dmitry,
> > > > > 
> > > > > 
> > > > > On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> > > > > > On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> > > > > > > Hi Dmitry,
> > > > > > > 
> > > > > > > On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> > > > > > > > On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> > > > > > > > > From: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
> > > > > > > > > 
> > > > > > > > > In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> > > > > > > > > does corrective action for each subsystem based on sensor violation
> > > > > > > > > to comply safety standards. But as QCS9075 is non-safe SoC it
> > > > > > > > > requires conventional thermal mitigation to control thermal for
> > > > > > > > > different subsystems.
> > > > > > > > > 
> > > > > > > > > The cpu frequency throttling for different cpu tsens is enabled in
> > > > > > > > > hardware as first defense for cpu thermal control. But QCS9075 SoC
> > > > > > > > > has higher ambient specification. During high ambient condition, even
> > > > > > > > > lowest frequency with multi cores can slowly build heat over the time
> > > > > > > > > and it can lead to thermal run-away situations. This patch restrict
> > > > > > > > > cpu cores during this scenario helps further thermal control and
> > > > > > > > > avoids thermal critical violation.
> > > > > > > > > 
> > > > > > > > > Add cpu idle injection cooling bindings for cpu tsens thermal zones
> > > > > > > > > as a mitigation for cpu subsystem prior to thermal shutdown.
> > > > > > > > > 
> > > > > > > > > Add cpu frequency cooling devices that will be used by userspace
> > > > > > > > > thermal governor to mitigate skin thermal management.
> > > > > > > > Does anything prevent us from having this config as a part of the basic
> > > > > > > > sa8775p.dtsi setup? If HW is present in the base version but it is not
> > > > > > > > accessible for whatever reason, please move it the base device config
> > > > > > > > and use status "disabled" or "reserved" to the respective board files.
> > > > > > > Sure,  I will move idle injection node for each cpu to sa8775p.dtsi and keep
> > > > > > > it disabled state. #cooling cells property for CPU, still wanted to keep it
> > > > > > > in board files as we don't want to enable any cooling device in base DT.
> > > > > > "we don't want" is not a proper justification. So, no.
> > > > > As noted in the commit, thermal cooling mitigation is only necessary for
> > > > > non-safe SoCs. Adding this cooling cell property to the CPU node in the base
> > > > > DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
> > > > > violate the requirements for safe SoCs. Therefore, we will include it only
> > > > > in non-safe SoC boards.
> > > > "is only necessary" is fine. It means that it is an optional part which
> > > > is going to be unused / ignored / duplicate functionality on the "safe"
> > > > SoCs. What kind of requirement is going to be violated in this way?
> > >  From the perspective of a safe SoC, any software mitigation that compromises
> > > the safety subsystem’s compliance should not be allowed. Enabling the
> > > cooling device also opens up the sysfs interface for userspace, which we may
> > > not fully control.
> > THere are a lot of interfaces exported to the userspace.
> > 
> > > Userspace apps or partner apps might inadvertently use
> > > it. Therefore, we believe it is better not to expose such an interface, as
> > > it is not required for that SoC and helps to avoid opening up an interface
> > > that could potentially lead to a safety failure.
> > How can thermal mitigation interface lead to safety failure? Userspace
> > can possibly lower trip points, but it can not override existing
> > firmware-based mitigation.
> Both firmware and software passive mitigations (CPU/GPU) are not permitted
> for Safe SoC.

Not permitted by whom?

> As mentioned in the commit, it is the responsibility of the
> safety subsystem to take corrective action for high temperatures. One of the
> safety requirements (not a functional requirement) for Safe SoC is to avoid
> scaling of CPUs and bus DCVS, as this could impact safety-critical
> workloads. Therefore, Safe SoC will operate at maximum frequency with the
> performance governor. Enabling a cooling device for the CPU would expose the
> cooling device sysfs interface, allowing userspace to request different
> cooling states via the cooling device cur_state sysfs, which could
> potentially lower the frequency and violate the safety requirement.

So, you disable thermal mitigation controls, but your description allows
userspace to change the CPUfreq governor through sysfs. Isn't that also
unsafe?

> > And if there is a known problem with the interface, it should be fixed
> > instead.
> 
> There is no interface issue, as it is not required and can be disabled for
> Safe SoC. This approach has already been used for commercializing the
> SA8775P, and we do not want to disrupt it now. Therefore, we believe it is
> better not to add cpu cooling cell property (CPU cooling device) in sa8775p
> base dtsi.

Okay, assuming  SA8775P may not have any thermal-related properties,
what is the issue with having them for the IoT-related QCS9100 device?

So, you have an alternative proposal: rename sa8775p.dtsi to
qcs9100.dtsi, defining a full set of thermal properties there and add
sa8775p.dtsi which includes qcs9100.dtsi and just removes thermal and
cpufreq nodes. Doing it in any other way, especially by including a
random SoC-related include defeats the logic of the DTSI files.

> 
> Best Regards,
> 
> Manaf
> 
> > 
> > > Best Regards,
> > > 
> > > Manaf
> > >
diff mbox series

Patch

diff --git a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
index ecaa383b6508..3ab6deeaacf1 100644
--- a/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
+++ b/arch/arm64/boot/dts/qcom/qcs9075-rb8.dts
@@ -9,6 +9,7 @@ 

 #include "sa8775p.dtsi"
 #include "sa8775p-pmics.dtsi"
+#include "qcs9075-thermal.dtsi"

 / {
 	model = "Qualcomm Technologies, Inc. Robotics RB8";
diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
index d9a8956d3a76..5f2d9f416617 100644
--- a/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
+++ b/arch/arm64/boot/dts/qcom/qcs9075-ride-r3.dts
@@ -5,6 +5,7 @@ 
 /dts-v1/;

 #include "sa8775p-ride.dtsi"
+#include "qcs9075-thermal.dtsi"

 / {
 	model = "Qualcomm Technologies, Inc. QCS9075 Ride Rev3";
diff --git a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
index 3b524359a72d..10ce48e7ba2f 100644
--- a/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
+++ b/arch/arm64/boot/dts/qcom/qcs9075-ride.dts
@@ -5,6 +5,7 @@ 
 /dts-v1/;

 #include "sa8775p-ride.dtsi"
+#include "qcs9075-thermal.dtsi"

 / {
 	model = "Qualcomm Technologies, Inc. QCS9075 Ride";
diff --git a/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
new file mode 100644
index 000000000000..40544c8582c4
--- /dev/null
+++ b/arch/arm64/boot/dts/qcom/qcs9075-thermal.dtsi
@@ -0,0 +1,287 @@ 
+// SPDX-License-Identifier: BSD-3-Clause
+/*
+ * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#include <dt-bindings/thermal/thermal.h>
+
+&cpu0 {
+	#cooling-cells = <2>;
+};
+
+&cpu1 {
+	#cooling-cells = <2>;
+	cpu1_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu2 {
+	#cooling-cells = <2>;
+	cpu2_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu3 {
+	#cooling-cells = <2>;
+	cpu3_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu4 {
+	#cooling-cells = <2>;
+	cpu4_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu5 {
+	#cooling-cells = <2>;
+	cpu5_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu6 {
+	#cooling-cells = <2>;
+	cpu6_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+&cpu7 {
+	#cooling-cells = <2>;
+	cpu7_idle: thermal-idle {
+		#cooling-cells = <2>;
+		duration-us = <800000>;
+		exit-latency-us = <10000>;
+	};
+};
+
+/ {
+	thermal-zones {
+		cpu-0-1-0-thermal {
+			trips {
+				cpu_0_1_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_1_0_passive>;
+					cooling-device = <&cpu1_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-0-2-0-thermal {
+			trips {
+				cpu_0_2_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_2_0_passive>;
+					cooling-device = <&cpu2_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-0-3-0-thermal {
+			trips {
+				cpu_0_3_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_3_0_passive>;
+					cooling-device = <&cpu3_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-0-1-1-thermal {
+			trips {
+				cpu_0_1_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_1_1_passive>;
+					cooling-device = <&cpu1_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-0-2-1-thermal {
+			trips {
+				cpu_0_2_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_2_1_passive>;
+					cooling-device = <&cpu2_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-0-3-1-thermal {
+			trips {
+				cpu_0_3_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_0_3_1_passive>;
+					cooling-device = <&cpu3_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-0-0-thermal {
+			trips {
+				cpu_1_0_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_0_0_passive>;
+					cooling-device = <&cpu4_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-1-0-thermal {
+			trips {
+				cpu_1_1_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_1_0_passive>;
+					cooling-device = <&cpu5_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-2-0-thermal {
+			trips {
+				cpu_1_2_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_2_0_passive>;
+					cooling-device = <&cpu6_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-3-0-thermal {
+			trips {
+				cpu_1_3_0_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_3_0_passive>;
+					cooling-device = <&cpu7_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-0-1-thermal {
+			trips {
+				cpu_1_0_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_0_1_passive>;
+					cooling-device = <&cpu4_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-1-1-thermal {
+			trips {
+				cpu_1_1_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_1_1_passive>;
+					cooling-device = <&cpu5_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-2-1-thermal {
+			trips {
+				cpu_1_2_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_2_1_passive>;
+					cooling-device = <&cpu6_idle 100 100>;
+				};
+			};
+		};
+
+		cpu-1-3-1-thermal {
+			trips {
+				cpu_1_3_1_passive: trip-point1 {
+					temperature = <116000>;
+				};
+			};
+
+			cooling-maps {
+				map0 {
+					trip = <&cpu_1_3_1_passive>;
+					cooling-device = <&cpu7_idle 100 100>;
+				};
+			};
+		};
+	};
+};