Message ID | 20231129110853.94344-24-lukasz.luba@arm.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Series | Introduce runtime modifiable Energy Model |

On 29/11/2023 12:08, Lukasz Luba wrote:
> Add a new section 'Design' which covers the information about Energy
> Model. It contains the design decisions, describes models and how they
> reflect the reality. Remove description of the default EM. Change the
> other section IDs. Add documentation bit for the new feature which
> allows to modify the EM in runtime.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
> 1 file changed, 196 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index 13225965c9a4..1f8cf36914b1 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
> domains can have different micro-architectures.
>
>
> -2. Core APIs
> +2. Design
> +-----------------
> +
> +2.1 Runtime modifiable EM
> +^^^^^^^^^^^^^^^^^^^^^^^^^

The issue I see here is that since now the EM is runtime modifiable and
there is only one EM people might be confused in looking for a
non-runtime modifiable EM. (which matches the design till v4).

So 'runtime modifiability' is now a feature of the EM itself.

There is also a figure in this document illustrating the use of
em_get_energy(), em_cpu_get() and em_dev_register_perf_domain().

I wonder if this should be extended to cover all the new interfaces
created for the 'runtime modifiability' feature?

[...]
On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Add a new section 'Design' which covers the information about Energy > Model. It contains the design decisions, describes models and how they > reflect the reality. Remove description of the default EM. Change the > other section IDs. Add documentation bit for the new feature which > allows to modify the EM in runtime. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++-- > 1 file changed, 196 insertions(+), 10 deletions(-) > > diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst > index 13225965c9a4..1f8cf36914b1 100644 > --- a/Documentation/power/energy-model.rst > +++ b/Documentation/power/energy-model.rst > @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance > domains can have different micro-architectures. > > > -2. Core APIs > +2. Design > +----------------- > + > +2.1 Runtime modifiable EM > +^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +To better reflect power variation due to static power (leakage) the EM > +supports runtime modifications of the power values. The mechanism relies on > +RCU to free the modifiable EM perf_state table memory. Its user, the task > +scheduler, also uses RCU to access this memory. The EM framework provides > +API for allocating/freeing the new memory for the modifiable EM table. > +The old memory is freed automatically using RCU callback mechanism when there > +are no owners anymore for the given EM runtime table instance. This is tracked > +using kref mechanism. The device driver which provided the new EM at runtime, > +should call EM API to free it safely when it's no longer needed. The EM > +framework will handle the clean-up when it's possible. > + > +The kernel code which want to modify the EM values is protected from concurrent > +access using a mutex. 
Therefore, the device driver code must run in sleeping > +context when it tries to modify the EM. > + > +With the runtime modifiable EM we switch from a 'single and during the entire > +runtime static EM' (system property) design to a 'single EM which can be > +changed during runtime according e.g. to the workload' (system and workload > +property) design. > + > +It is possible also to modify the CPU performance values for each EM's > +performance state. Thus, the full power and performance profile (which > +is an exponential curve) can be changed according e.g. to the workload > +or system property. > + > + > +3. Core APIs > ------------ > > -2.1 Config options > +3.1 Config options > ^^^^^^^^^^^^^^^^^^ > > CONFIG_ENERGY_MODEL must be enabled to use the EM framework. > > > -2.2 Registration of performance domains > +3.2 Registration of performance domains > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Registration of 'advanced' EM > @@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel > subsystems which use EM might rely on this flag to check if all EM devices use > the same scale. If there are different scales, these subsystems might decide > to return warning/error, stop working or panic. > -See Section 3. for an example of driver implementing this > -callback, or Section 2.4 for further documentation on this API > +See Section 4. for an example of driver implementing this > +callback, or Section 3.4 for further documentation on this API > > Registration of EM using DT > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > @@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the > physics of a real device, e.g. when static power (leakage) is important. 
> > > -2.3 Accessing performance domains > +3.3 Accessing performance domains > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > There are two API functions which provide the access to the energy model: > @@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is > not provided for other type of devices. > > More details about the above APIs can be found in ``<linux/energy_model.h>`` > -or in Section 2.4 > +or in Section 3.5 > + > + > +3.4 Runtime modifications > +^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Drivers willing to update the EM at runtime should use the following dedicated > +function to allocate a new instance of the modified EM. The API is listed > +below:: > + > + struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd); > + > +This allows to allocate a structure which contains the new EM table with > +also RCU and kref needed by the EM framework. The 'struct em_perf_table' > +contains array 'struct em_perf_state state[]' which is a list of performance > +states in ascending order. That list must be populated by the device driver > +which wants to update the EM. The list of frequencies can be taken from > +existing EM (created during boot). The content in the 'struct em_perf_state' > +must be populated by the driver as well. > + > +This is the API which does the EM update, using RCU pointers swap:: > + > + int em_dev_update_perf_domain(struct device *dev, > + struct em_perf_table __rcu *new_table); > + > +Drivers must provide a pointer to the allocated and initialized new EM > +'struct em_perf_table'. That new EM will be safely used inside the EM framework > +and will be visible to other sub-systems in the kernel (thermal, powercap). > +The main design goal for this API is to be fast and avoid extra calculations > +or memory allocations at runtime. When pre-computed EMs are available in the > +device driver, than it should be possible to simply re-use them with low > +performance overhead. 
> + > +In order to free the EM, provided earlier by the driver (e.g. when the module > +is unloaded), there is a need to call the API:: > + > + void em_free_table(struct em_perf_table __rcu *table); > + > +It will allow the EM framework to safely remove the memory, when there is > +no other sub-system using it, e.g. EAS. > + > +To use the power values in other sub-systems (like thermal, powercap) there is > +a need to call API which protects the reader and provide consistency of the EM > +table data:: > > + struct em_perf_state *em_get_table(struct em_perf_domain *pd); > > -2.4 Description details of this API > +It returns the 'struct em_perf_state' pointer which is an array of performance > +states in ascending order. > + > +When the EM table is not needed anymore there is a need to call dedicated API:: > + > + void em_put_table(void); > + > +In this way the EM safely uses the RCU read section and protects the users. > +It also allows the EM framework to manage the memory and free it. > + > +There is dedicated API for device drivers to calculate em_perf_state::cost > +values:: > + > + int em_dev_compute_costs(struct device *dev, struct em_perf_state *table, > + int nr_states); > + > +These 'cost' values from EM are used in EAS. The new EM table should be passed > +together with the number of entries and device pointer. When the computation > +of the cost values is done properly the return value from the function is 0. > +The function takes care for right setting of inefficiency for each performance > +state as well. It updates em_perf_state::flags accordingly. > +Then such prepared new EM can be passed to the em_dev_update_perf_domain() > +function, which will allow to use it. > + > +More details about the above APIs can be found in ``<linux/energy_model.h>`` > +or in Section 4.2 with an example code showing simple implementation of the > +updating mechanism in a device driver. 
> + > + > +3.5 Description details of this API > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > .. kernel-doc:: include/linux/energy_model.h > :internal: > @@ -187,8 +292,11 @@ or in Section 2.4 > :export: > > > -3. Example driver > ------------------ > +4. Examples > +----------- > + > +4.1 Example driver with EM registration > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The CPUFreq framework supports dedicated callback for registering > the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em(). > @@ -242,3 +350,81 @@ EM framework:: > 39 static struct cpufreq_driver foo_cpufreq_driver = { > 40 .register_em = foo_cpufreq_register_em, > 41 }; > + > + > +4.2 Example driver with EM modification > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +This section provides a simple example of a thermal driver modifying the EM. > +The driver implements a foo_thermal_em_update() function. The driver is woken > +up periodically to check the temperature and modify the EM data:: > + > + -> drivers/soc/example/example_em_mod.c > + > + 01 static void foo_get_new_em(struct device *dev) > + 02 { > + 03 struct em_perf_table __rcu *runtime_table; > + 04 struct em_perf_state *table, *new_table; > + 05 struct em_perf_domain *pd; > + 06 unsigned long freq; > + 07 int i, ret; > + 08 > + 09 pd = em_pd_get(dev); > + 10 if (!pd) > + 11 return; > + 12 > + 13 runtime_table = em_allocate_table(pd); > + 14 if (!runtime_table) > + 15 return; > + 16 > + 17 new_table = runtime_table->state; > + 18 > + 19 table = em_get_table(pd); > + 20 for (i = 0; i < pd->nr_perf_states; i++) { > + 21 freq = table[i].frequency; > + 22 foo_get_power_perf_values(dev, freq, &new_table[i]); > + 23 } > + 24 em_put_table(); > + 25 > + 26 /* Calculate 'cost' values for EAS */ > + 27 ret = em_dev_compute_costs(dev, table, pd->nr_perf_states); > + 28 if (ret) { > + 29 dev_warn(dev, "EM: compute costs failed %d\n", ret); > + 30 em_free_table(runtime_table); > + 31 return; > + 32 } > + 33 > + 34 ret = 
em_dev_update_perf_domain(dev, runtime_table); > + 35 if (ret) { > + 36 dev_warn(dev, "EM: update failed %d\n", ret); > + 37 em_free_table(runtime_table); > + 38 return; > + 39 } > + 40 > + 41 ctx->runtime_table = runtime_table; Because here is ctx, maybe the foo_get_new_em(struct device *dev) should be foo_get_new_em(struct foo_context *ctx)? BR --- xuewen > + 42 } > + 43 > + 44 /* > + 45 * Function called periodically to check the temperature and > + 46 * update the EM if needed > + 47 */ > + 48 static void foo_thermal_em_update(struct foo_context *ctx) > + 49 { > + 50 struct device *dev = ctx->dev; > + 51 int cpu; > + 52 > + 53 ctx->temperature = foo_get_temp(dev, ctx); > + 54 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD) > + 55 return; > + 56 > + 57 foo_get_new_em(dev); > + 58 } > + 59 > + 60 static void foo_exit(void) > + 61 { > + 62 struct foo_context *ctx = glob_ctx; > + 63 > + 64 em_free_table(ctx->runtime_table); > + 65 } > + 66 > + 67 module_exit(foo_exit); > -- > 2.25.1 >
Hi Lukasz, On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Add a new section 'Design' which covers the information about Energy > Model. It contains the design decisions, describes models and how they > reflect the reality. Remove description of the default EM. Change the > other section IDs. Add documentation bit for the new feature which > allows to modify the EM in runtime. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++-- > 1 file changed, 196 insertions(+), 10 deletions(-) > > diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst > index 13225965c9a4..1f8cf36914b1 100644 > --- a/Documentation/power/energy-model.rst > +++ b/Documentation/power/energy-model.rst > @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance > domains can have different micro-architectures. > > > -2. Core APIs > +2. Design > +----------------- > + > +2.1 Runtime modifiable EM > +^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +To better reflect power variation due to static power (leakage) the EM > +supports runtime modifications of the power values. The mechanism relies on > +RCU to free the modifiable EM perf_state table memory. Its user, the task > +scheduler, also uses RCU to access this memory. The EM framework provides > +API for allocating/freeing the new memory for the modifiable EM table. > +The old memory is freed automatically using RCU callback mechanism when there > +are no owners anymore for the given EM runtime table instance. This is tracked > +using kref mechanism. The device driver which provided the new EM at runtime, > +should call EM API to free it safely when it's no longer needed. The EM > +framework will handle the clean-up when it's possible. > + > +The kernel code which want to modify the EM values is protected from concurrent > +access using a mutex. 
Therefore, the device driver code must run in sleeping > +context when it tries to modify the EM. > + > +With the runtime modifiable EM we switch from a 'single and during the entire > +runtime static EM' (system property) design to a 'single EM which can be > +changed during runtime according e.g. to the workload' (system and workload > +property) design. > + > +It is possible also to modify the CPU performance values for each EM's > +performance state. Thus, the full power and performance profile (which > +is an exponential curve) can be changed according e.g. to the workload > +or system property. > + > + > +3. Core APIs > ------------ > > -2.1 Config options > +3.1 Config options > ^^^^^^^^^^^^^^^^^^ > > CONFIG_ENERGY_MODEL must be enabled to use the EM framework. > > > -2.2 Registration of performance domains > +3.2 Registration of performance domains > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Registration of 'advanced' EM > @@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel > subsystems which use EM might rely on this flag to check if all EM devices use > the same scale. If there are different scales, these subsystems might decide > to return warning/error, stop working or panic. > -See Section 3. for an example of driver implementing this > -callback, or Section 2.4 for further documentation on this API > +See Section 4. for an example of driver implementing this > +callback, or Section 3.4 for further documentation on this API > > Registration of EM using DT > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > @@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the > physics of a real device, e.g. when static power (leakage) is important. 
> > > -2.3 Accessing performance domains > +3.3 Accessing performance domains > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > There are two API functions which provide the access to the energy model: > @@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is > not provided for other type of devices. > > More details about the above APIs can be found in ``<linux/energy_model.h>`` > -or in Section 2.4 > +or in Section 3.5 > + > + > +3.4 Runtime modifications > +^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Drivers willing to update the EM at runtime should use the following dedicated > +function to allocate a new instance of the modified EM. The API is listed > +below:: > + > + struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd); > + > +This allows to allocate a structure which contains the new EM table with > +also RCU and kref needed by the EM framework. The 'struct em_perf_table' > +contains array 'struct em_perf_state state[]' which is a list of performance > +states in ascending order. That list must be populated by the device driver > +which wants to update the EM. The list of frequencies can be taken from > +existing EM (created during boot). The content in the 'struct em_perf_state' > +must be populated by the driver as well. > + > +This is the API which does the EM update, using RCU pointers swap:: > + > + int em_dev_update_perf_domain(struct device *dev, > + struct em_perf_table __rcu *new_table); > + > +Drivers must provide a pointer to the allocated and initialized new EM > +'struct em_perf_table'. That new EM will be safely used inside the EM framework > +and will be visible to other sub-systems in the kernel (thermal, powercap). > +The main design goal for this API is to be fast and avoid extra calculations > +or memory allocations at runtime. When pre-computed EMs are available in the > +device driver, than it should be possible to simply re-use them with low > +performance overhead. 
> + > +In order to free the EM, provided earlier by the driver (e.g. when the module > +is unloaded), there is a need to call the API:: > + > + void em_free_table(struct em_perf_table __rcu *table); > + > +It will allow the EM framework to safely remove the memory, when there is > +no other sub-system using it, e.g. EAS. > + > +To use the power values in other sub-systems (like thermal, powercap) there is > +a need to call API which protects the reader and provide consistency of the EM > +table data:: > > + struct em_perf_state *em_get_table(struct em_perf_domain *pd); > > -2.4 Description details of this API > +It returns the 'struct em_perf_state' pointer which is an array of performance > +states in ascending order. > + > +When the EM table is not needed anymore there is a need to call dedicated API:: > + > + void em_put_table(void); > + > +In this way the EM safely uses the RCU read section and protects the users. > +It also allows the EM framework to manage the memory and free it. > + > +There is dedicated API for device drivers to calculate em_perf_state::cost > +values:: > + > + int em_dev_compute_costs(struct device *dev, struct em_perf_state *table, > + int nr_states); > + > +These 'cost' values from EM are used in EAS. The new EM table should be passed > +together with the number of entries and device pointer. When the computation > +of the cost values is done properly the return value from the function is 0. > +The function takes care for right setting of inefficiency for each performance > +state as well. It updates em_perf_state::flags accordingly. > +Then such prepared new EM can be passed to the em_dev_update_perf_domain() > +function, which will allow to use it. > + > +More details about the above APIs can be found in ``<linux/energy_model.h>`` > +or in Section 4.2 with an example code showing simple implementation of the > +updating mechanism in a device driver. 
> + > + > +3.5 Description details of this API > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > .. kernel-doc:: include/linux/energy_model.h > :internal: > @@ -187,8 +292,11 @@ or in Section 2.4 > :export: > > > -3. Example driver > ------------------ > +4. Examples > +----------- > + > +4.1 Example driver with EM registration > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > The CPUFreq framework supports dedicated callback for registering > the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em(). > @@ -242,3 +350,81 @@ EM framework:: > 39 static struct cpufreq_driver foo_cpufreq_driver = { > 40 .register_em = foo_cpufreq_register_em, > 41 }; > + > + > +4.2 Example driver with EM modification > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +This section provides a simple example of a thermal driver modifying the EM. > +The driver implements a foo_thermal_em_update() function. The driver is woken > +up periodically to check the temperature and modify the EM data:: > + > + -> drivers/soc/example/example_em_mod.c > + > + 01 static void foo_get_new_em(struct device *dev) Because now some drivers use the dev_pm_opp_of_register_em() to register energy model, and maybe we can add a new function to update the energy model using "EM_SET_ACTIVE_POWER_CB(em_cb, cb)" instead of letting users set power again? Thanks! 
> + 02 { > + 03 struct em_perf_table __rcu *runtime_table; > + 04 struct em_perf_state *table, *new_table; > + 05 struct em_perf_domain *pd; > + 06 unsigned long freq; > + 07 int i, ret; > + 08 > + 09 pd = em_pd_get(dev); > + 10 if (!pd) > + 11 return; > + 12 > + 13 runtime_table = em_allocate_table(pd); > + 14 if (!runtime_table) > + 15 return; > + 16 > + 17 new_table = runtime_table->state; > + 18 > + 19 table = em_get_table(pd); > + 20 for (i = 0; i < pd->nr_perf_states; i++) { > + 21 freq = table[i].frequency; > + 22 foo_get_power_perf_values(dev, freq, &new_table[i]); > + 23 } > + 24 em_put_table(); > + 25 > + 26 /* Calculate 'cost' values for EAS */ > + 27 ret = em_dev_compute_costs(dev, table, pd->nr_perf_states); > + 28 if (ret) { > + 29 dev_warn(dev, "EM: compute costs failed %d\n", ret); > + 30 em_free_table(runtime_table); > + 31 return; > + 32 } > + 33 > + 34 ret = em_dev_update_perf_domain(dev, runtime_table); > + 35 if (ret) { > + 36 dev_warn(dev, "EM: update failed %d\n", ret); > + 37 em_free_table(runtime_table); > + 38 return; > + 39 } > + 40 > + 41 ctx->runtime_table = runtime_table; > + 42 } > + 43 > + 44 /* > + 45 * Function called periodically to check the temperature and > + 46 * update the EM if needed > + 47 */ > + 48 static void foo_thermal_em_update(struct foo_context *ctx) > + 49 { > + 50 struct device *dev = ctx->dev; > + 51 int cpu; > + 52 > + 53 ctx->temperature = foo_get_temp(dev, ctx); > + 54 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD) > + 55 return; > + 56 > + 57 foo_get_new_em(dev); > + 58 } > + 59 > + 60 static void foo_exit(void) > + 61 { > + 62 struct foo_context *ctx = glob_ctx; > + 63 > + 64 em_free_table(ctx->runtime_table); > + 65 } > + 66 > + 67 module_exit(foo_exit); > -- > 2.25.1 >
On 12/19/23 04:42, Xuewen Yan wrote:
> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

[snip]

>> +
>> +  -> drivers/soc/example/example_em_mod.c
>> +
>> +  01 static void foo_get_new_em(struct device *dev)
>> +  02 {
>> +  03     struct em_perf_table __rcu *runtime_table;
>> +  04     struct em_perf_state *table, *new_table;
>> +  05     struct em_perf_domain *pd;
>> +  06     unsigned long freq;
>> +  07     int i, ret;
>> +  08
>> +  09     pd = em_pd_get(dev);
>> +  10     if (!pd)
>> +  11         return;
>> +  12
>> +  13     runtime_table = em_allocate_table(pd);
>> +  14     if (!runtime_table)
>> +  15         return;
>> +  16
>> +  17     new_table = runtime_table->state;
>> +  18
>> +  19     table = em_get_table(pd);
>> +  20     for (i = 0; i < pd->nr_perf_states; i++) {
>> +  21         freq = table[i].frequency;
>> +  22         foo_get_power_perf_values(dev, freq, &new_table[i]);
>> +  23     }
>> +  24     em_put_table();
>> +  25
>> +  26     /* Calculate 'cost' values for EAS */
>> +  27     ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
>> +  28     if (ret) {
>> +  29         dev_warn(dev, "EM: compute costs failed %d\n", ret);
>> +  30         em_free_table(runtime_table);
>> +  31         return;
>> +  32     }
>> +  33
>> +  34     ret = em_dev_update_perf_domain(dev, runtime_table);
>> +  35     if (ret) {
>> +  36         dev_warn(dev, "EM: update failed %d\n", ret);
>> +  37         em_free_table(runtime_table);
>> +  38         return;
>> +  39     }
>> +  40
>> +  41     ctx->runtime_table = runtime_table;
>
> Because here is ctx, maybe the foo_get_new_em(struct device *dev)
> should be foo_get_new_em(struct foo_context *ctx)?

Makes sense, I will change that bit. Thanks!
On 12/19/23 06:22, Xuewen Yan wrote:
> Hi Lukasz,
>
> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote:

[snip]

>> +
>> +  -> drivers/soc/example/example_em_mod.c
>> +
>> +  01 static void foo_get_new_em(struct device *dev)
>
> Because now some drivers use the dev_pm_opp_of_register_em() to
> register energy model,
> and maybe we can add a new function to update the energy model using
> "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
> instead of letting users set power again?
>

There are different usages of this EM feature:
1. Adjust power values after boot has finished and e.g. ASV in Exynos
   has adjusted new voltage values in the OPP framework. It's
   due to chip binning. I have described that in conversation
   below patch 22/23. I'm going to send a patch for that
   platform and OPP fwk later as a follow up to this series.
2. Change the EM power values after long gaming, when the GPU
   heats up the SoC heavily and CPUs start to increase the leakage
3. Change the EM for long running heavy apps, e.g. video conference app,
   which is using camera w/ image AI and filters (so some heavy stuff)
4. any other optimization that vendor/OEM like to have for
On 12/12/23 18:51, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> Add a new section 'Design' which covers the information about Energy
>> Model. It contains the design decisions, describes models and how they
>> reflect the reality. Remove description of the default EM. Change the
>> other section IDs. Add documentation bit for the new feature which
>> allows to modify the EM in runtime.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
>> 1 file changed, 196 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
>> index 13225965c9a4..1f8cf36914b1 100644
>> --- a/Documentation/power/energy-model.rst
>> +++ b/Documentation/power/energy-model.rst
>> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
>> domains can have different micro-architectures.
>>
>>
>> -2. Core APIs
>> +2. Design
>> +-----------------
>> +
>> +2.1 Runtime modifiable EM
>> +^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The issue I see here is that since now the EM is runtime modifiable and
> there is only one EM people might be confused in looking for a
> non-runtime modifiable EM. (which matches the design till v4).
>
> So 'runtime modifiability' is now a feature of the EM itself.

True, I can skip this, since it's now default.

>
> There is also a figure in this document illustrating the use of
> em_get_energy(), em_cpu_get() and em_dev_register_perf_domain().
>
> I wonder if this should be extended to cover all the new interfaces
> created for the 'runtime modifiability' feature?

That ASCII picture would be totally messy, with that many interfaces.
We can think about some other picture later, when this basic code
and basic doc is merged.
On Tue, Dec 19, 2023 at 5:31 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 12/19/23 06:22, Xuewen Yan wrote:
> > Hi Lukasz,
> >
> > On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> [snip]
>
> >> +
> >> +  -> drivers/soc/example/example_em_mod.c
> >> +
> >> +  01 static void foo_get_new_em(struct device *dev)
> >
> > Because now some drivers use the dev_pm_opp_of_register_em() to
> > register energy model,
> > and maybe we can add a new function to update the energy model using
> > "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
> > instead of letting users set power again?
> >
>
> There are different usages of this EM feature:
> 1. Adjust power values after boot has finished and e.g. ASV in Exynos
>    has adjusted new voltage values in the OPP framework. It's
>    due to chip binning. I have described that in conversation
>    below patch 22/23. I'm going to send a patch for that
>    platform and OPP fwk later as a follow up to this series.

I understand what you mean. What I mean is that if we can provide an
interface for changing EM of opp fwk, it will be more friendly for
those users who use opp, because then they don't have to calculate the
new EM by themselves, but only need, after updating the voltage of opp,
to call this interface directly.

BR
---
xuewen

> 2. Change the EM power values after long gaming, when the GPU
>    heats up the SoC heavily and CPUs start to increase the leakage
> 3. Change the EM for long running heavy apps, e.g. video conference app,
>    which is using camera w/ image AI and filters (so some heavy stuff)
> 4. any other optimization that vendor/OEM like to have for
On 12/20/23 02:08, Xuewen Yan wrote:
> On Tue, Dec 19, 2023 at 5:31 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>
>>
>> On 12/19/23 06:22, Xuewen Yan wrote:
>>> Hi Lukasz,
>>>
>>> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> [snip]
>>
>>>> +
>>>> +  -> drivers/soc/example/example_em_mod.c
>>>> +
>>>> +  01 static void foo_get_new_em(struct device *dev)
>>>
>>> Because now some drivers use the dev_pm_opp_of_register_em() to
>>> register energy model,
>>> and maybe we can add a new function to update the energy model using
>>> "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
>>> instead of letting users set power again?
>>>
>>
>> There are different usages of this EM feature:
>> 1. Adjust power values after boot has finished and e.g. ASV in Exynos
>>    has adjusted new voltage values in the OPP framework. It's
>>    due to chip binning. I have described that in conversation
>>    below patch 22/23. I'm going to send a patch for that
>>    platform and OPP fwk later as a follow up to this series.
>
> I understand what you mean. What I mean is that if we can provide an
> interface for changing EM of opp fwk, it will be more friendly for
> those users who use opp, because then they don't have to calculate the
> new EM by themselves, but only need, after updating the voltage of opp,
> to call this interface directly.

It is the plan. Don't worry. I didn't want to push this in one big
patch set. Exynos driver + the OPP change would do exactly this.
The EM functions from drivers/opp/of.c will be re-used for this.
It is too big to be made in one step.

There is a pattern in those more complex changes, like in Arm SCMI fwk,
to make the improvements gradually. This falls into the same bucket.

Although, you are another person asking for a similar thing, so
I will send a follow-up change using this new EM API, instead of
waiting to finish this review.

Thanks,
Lukasz
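
To make the request above concrete, here is a rough sketch of what recomputing power values after a voltage change might involve. It is not taken from the follow-up patches, which are not part of this thread. It assumes the simplified dynamic power model P = C * V^2 * f with the coefficient expressed in uW/MHz/V^2, and the struct foo_opp type and both foo_*() helpers are hypothetical names introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-OPP record kept by a driver. */
struct foo_opp {
	uint64_t freq_hz;	/* frequency in Hz */
	uint64_t volt_mv;	/* voltage in mV, possibly adjusted after boot */
	uint64_t power_uw;	/* derived power in uW */
};

/*
 * Dynamic power in uW for one operating point, using P = C * V^2 * f.
 * With C in uW/MHz/V^2, V in mV and f in MHz, the product is 10^6 times
 * too large, hence the final division.
 */
static uint64_t foo_dyn_power_uw(uint32_t cap, uint64_t volt_mv, uint64_t freq_hz)
{
	uint64_t tmp = (uint64_t)cap * volt_mv * volt_mv * (freq_hz / 1000000);

	return tmp / 1000000;
}

/* Refresh every power value after the voltages have been updated. */
static void foo_refresh_powers(struct foo_opp *opps, size_t n, uint32_t cap)
{
	for (size_t i = 0; i < n; i++)
		opps[i].power_uw = foo_dyn_power_uw(cap, opps[i].volt_mv,
						    opps[i].freq_hz);
}
```

Under these assumptions, lowering the voltage of a 1 GHz operating point from 1000 mV to 900 mV with a coefficient of 100 reduces the computed power from 100000 uW to 81000 uW. The refreshed values would then be copied into a newly allocated EM table and installed with em_dev_update_perf_domain().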
diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index 13225965c9a4..1f8cf36914b1 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
 domains can have different micro-architectures.
 
 
-2. Core APIs
+2. Design
+-----------------
+
+2.1 Runtime modifiable EM
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To better reflect power variation due to static power (leakage) the EM
+supports runtime modifications of the power values. The mechanism relies on
+RCU to free the modifiable EM perf_state table memory. Its user, the task
+scheduler, also uses RCU to access this memory. The EM framework provides
+API for allocating/freeing the new memory for the modifiable EM table.
+The old memory is freed automatically using RCU callback mechanism when there
+are no owners anymore for the given EM runtime table instance. This is tracked
+using kref mechanism. The device driver which provided the new EM at runtime,
+should call EM API to free it safely when it's no longer needed. The EM
+framework will handle the clean-up when it's possible.
+
+The kernel code which want to modify the EM values is protected from concurrent
+access using a mutex. Therefore, the device driver code must run in sleeping
+context when it tries to modify the EM.
+
+With the runtime modifiable EM we switch from a 'single and during the entire
+runtime static EM' (system property) design to a 'single EM which can be
+changed during runtime according e.g. to the workload' (system and workload
+property) design.
+
+It is possible also to modify the CPU performance values for each EM's
+performance state. Thus, the full power and performance profile (which
+is an exponential curve) can be changed according e.g. to the workload
+or system property.
+
+
+3. Core APIs
 ------------
 
-2.1 Config options
+3.1 Config options
 ^^^^^^^^^^^^^^^^^^
 
 CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
 
 
-2.2 Registration of performance domains
+3.2 Registration of performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Registration of 'advanced' EM
@@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
 subsystems which use EM might rely on this flag to check if all EM devices use
 the same scale. If there are different scales, these subsystems might decide
 to return warning/error, stop working or panic.
-See Section 3. for an example of driver implementing this
-callback, or Section 2.4 for further documentation on this API
+See Section 4. for an example of driver implementing this
+callback, or Section 3.4 for further documentation on this API
 
 Registration of EM using DT
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the
 physics of a real device, e.g. when static power (leakage) is important.
 
 
-2.3 Accessing performance domains
+3.3 Accessing performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 There are two API functions which provide the access to the energy model:
@@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
 not provided for other type of devices.
 
 More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 2.4
+or in Section 3.5
+
+
+3.4 Runtime modifications
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers willing to update the EM at runtime should use the following dedicated
+function to allocate a new instance of the modified EM. The API is listed
+below::
+
+  struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
+
+This allows to allocate a structure which contains the new EM table with
+also RCU and kref needed by the EM framework. The 'struct em_perf_table'
+contains array 'struct em_perf_state state[]' which is a list of performance
+states in ascending order. That list must be populated by the device driver
+which wants to update the EM. The list of frequencies can be taken from
+existing EM (created during boot). The content in the 'struct em_perf_state'
+must be populated by the driver as well.
+
+This is the API which does the EM update, using RCU pointers swap::
+
+  int em_dev_update_perf_domain(struct device *dev,
+			struct em_perf_table __rcu *new_table);
+
+Drivers must provide a pointer to the allocated and initialized new EM
+'struct em_perf_table'. That new EM will be safely used inside the EM framework
+and will be visible to other sub-systems in the kernel (thermal, powercap).
+The main design goal for this API is to be fast and avoid extra calculations
+or memory allocations at runtime. When pre-computed EMs are available in the
+device driver, than it should be possible to simply re-use them with low
+performance overhead.
+
+In order to free the EM, provided earlier by the driver (e.g. when the module
+is unloaded), there is a need to call the API::
+
+  void em_free_table(struct em_perf_table __rcu *table);
+
+It will allow the EM framework to safely remove the memory, when there is
+no other sub-system using it, e.g. EAS.
+
+To use the power values in other sub-systems (like thermal, powercap) there is
+a need to call API which protects the reader and provide consistency of the EM
+table data::
 
+  struct em_perf_state *em_get_table(struct em_perf_domain *pd);
 
-2.4 Description details of this API
+It returns the 'struct em_perf_state' pointer which is an array of performance
+states in ascending order.
+
+When the EM table is not needed anymore there is a need to call dedicated API::
+
+  void em_put_table(void);
+
+In this way the EM safely uses the RCU read section and protects the users.
+It also allows the EM framework to manage the memory and free it.
+
+There is dedicated API for device drivers to calculate em_perf_state::cost
+values::
+
+  int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+                           int nr_states);
+
+These 'cost' values from EM are used in EAS. The new EM table should be passed
+together with the number of entries and device pointer. When the computation
+of the cost values is done properly the return value from the function is 0.
+The function takes care for right setting of inefficiency for each performance
+state as well. It updates em_perf_state::flags accordingly.
+Then such prepared new EM can be passed to the em_dev_update_perf_domain()
+function, which will allow to use it.
+
+More details about the above APIs can be found in ``<linux/energy_model.h>``
+or in Section 4.2 with an example code showing simple implementation of the
+updating mechanism in a device driver.
+
+
+3.5 Description details of this API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. kernel-doc:: include/linux/energy_model.h
    :internal:
@@ -187,8 +292,11 @@ or in Section 2.4
    :export:
 
 
-3. Example driver
------------------
+4. Examples
+-----------
+
+4.1 Example driver with EM registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The CPUFreq framework supports dedicated callback for registering the EM
 for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +350,81 @@ EM framework::
   39	static struct cpufreq_driver foo_cpufreq_driver = {
   40		.register_em = foo_cpufreq_register_em,
   41	};
+
+
+4.2 Example driver with EM modification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of a thermal driver modifying the EM.
+The driver implements a foo_thermal_em_update() function. The driver is woken
+up periodically to check the temperature and modify the EM data::
+
+  -> drivers/soc/example/example_em_mod.c
+
+  01	static void foo_get_new_em(struct device *dev)
+  02	{
+  03		struct em_perf_table __rcu *runtime_table;
+  04		struct em_perf_state *table, *new_table;
+  05		struct em_perf_domain *pd;
+  06		unsigned long freq;
+  07		int i, ret;
+  08
+  09		pd = em_pd_get(dev);
+  10		if (!pd)
+  11			return;
+  12
+  13		runtime_table = em_allocate_table(pd);
+  14		if (!runtime_table)
+  15			return;
+  16
+  17		new_table = runtime_table->state;
+  18
+  19		table = em_get_table(pd);
+  20		for (i = 0; i < pd->nr_perf_states; i++) {
+  21			freq = table[i].frequency;
+  22			foo_get_power_perf_values(dev, freq, &new_table[i]);
+  23		}
+  24		em_put_table();
+  25
+  26		/* Calculate 'cost' values for EAS */
+  27		ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
+  28		if (ret) {
+  29			dev_warn(dev, "EM: compute costs failed %d\n", ret);
+  30			em_free_table(runtime_table);
+  31			return;
+  32		}
+  33
+  34		ret = em_dev_update_perf_domain(dev, runtime_table);
+  35		if (ret) {
+  36			dev_warn(dev, "EM: update failed %d\n", ret);
+  37			em_free_table(runtime_table);
+  38			return;
+  39		}
+  40
+  41		ctx->runtime_table = runtime_table;
+  42	}
+  43
+  44	/*
+  45	 * Function called periodically to check the temperature and
+  46	 * update the EM if needed
+  47	 */
+  48	static void foo_thermal_em_update(struct foo_context *ctx)
+  49	{
+  50		struct device *dev = ctx->dev;
+  51		int cpu;
+  52
+  53		ctx->temperature = foo_get_temp(dev, ctx);
+  54		if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
+  55			return;
+  56
+  57		foo_get_new_em(dev);
+  58	}
+  59
+  60	static void foo_exit(void)
+  61	{
+  62		struct foo_context *ctx = glob_ctx;
+  63
+  64		em_free_table(ctx->runtime_table);
+  65	}
+  66
+  67	module_exit(foo_exit);
Add a new section 'Design' which covers the information about Energy
Model. It contains the design decisions, describes models and how they
reflect the reality. Remove description of the default EM. Change the
other section IDs. Add documentation bit for the new feature which
allows to modify the EM in runtime.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
 1 file changed, 196 insertions(+), 10 deletions(-)
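The ownership rules from the 'Design' section of the patch can be modelled in userspace: the driver owns a reference to the table it allocated, the framework takes its own reference while that table is the active one, and the memory goes away only when the last owner drops it. The sketch below is single-threaded and replaces kref with a plain counter; RCU grace periods and the update mutex are left out entirely. Every name in it (struct table, struct domain, table_alloc(), table_put(), domain_update(), freed_tables) is hypothetical and exists only for this illustration.

```c
#include <stdlib.h>

/* Stand-in for struct em_perf_table: a refcount plus the state array. */
struct table {
	int refcount;			/* models the kref */
	int nr_states;
	unsigned long power[];		/* models em_perf_state state[] */
};

/* Stand-in for the performance domain holding the active table pointer. */
struct domain {
	struct table *active;		/* models the RCU-protected pointer */
};

static int freed_tables;		/* bookkeeping for the example only */

/* Allocate a table; the caller (the driver) owns the initial reference. */
static struct table *table_alloc(int nr_states)
{
	struct table *t;

	t = calloc(1, sizeof(*t) + nr_states * sizeof(t->power[0]));
	if (!t)
		return NULL;

	t->refcount = 1;
	t->nr_states = nr_states;
	return t;
}

/* Drop one reference; free the table when no owner is left. */
static void table_put(struct table *t)
{
	if (t && --t->refcount == 0) {
		free(t);
		freed_tables++;
	}
}

/*
 * Publish a new table: the framework takes its own reference on the new
 * table, swaps the pointer, then drops its reference on the old table.
 */
static void domain_update(struct domain *d, struct table *new_table)
{
	struct table *old = d->active;

	new_table->refcount++;
	d->active = new_table;
	table_put(old);
}
```

The point of the two-owner scheme is that a driver can drop its table at any time (for example on module unload) without having to know whether the framework is still using it.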