Message ID | 20200122173538.1142069-1-douglas.raillard@arm.com (mailing list archive) |
---|---|
Series | sched/cpufreq: Make schedutil energy aware |
Hi Peter,

Since the v3 was posted a while ago, here is a short recap of the hanging
comments:

* The boost margin was relative, but we came to the conclusion it would make
  more sense to make it absolute (done in that v4).

* The main remaining blur point was why defining boost=(util - util_est) makes
  sense. The justification for that is that we use a PELT-shaped signal to drive
  the frequency, so using a PELT-shaped signal for the boost makes sense for the
  same reasons.

AFAIK there are no specific criteria to meet for frequency selection signal shape
for anything else than periodic tasks (if we don't add other constraints on
top), so (util - util_est)=(util - constant) seems as good as anything else.
Especially since util is deemed to be a good fit in practice for frequency
selection. Let me know if I missed anything on that front.

v3 thread:
https://lore.kernel.org/lkml/20191011134500.235736-1-douglas.raillard@arm.com/

Cheers,
Douglas

On 1/22/20 5:35 PM, Douglas RAILLARD wrote:
> Make schedutil cpufreq governor energy-aware.
>
> - patch 1 introduces a function to retrieve a frequency given a base
>   frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
>   sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
>   testing/debugging. Since it's not a trace event, it's not exposed to
>   userspace in a directly usable way, allowing for painless future
>   updates/removal.
>
> The benefits of using the EM in schedutil are twofold:
>
> 1) Selecting the highest possible frequency for a given cost. Some
>    platforms can have lower frequencies that are less efficient than
>    higher ones, in which case they should be skipped for most purposes.
>    They can still be useful to give more freedom to thermal throttling
>    mechanisms, but not under normal circumstances.
>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>    non-monotonically decreasing"
>
> 2) Driving the frequency selection with power in mind, in addition to
>    maximizing the utilization of the non-idle CPUs in the system.
>
> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.
> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
>   leading to an increase in utilization. That shortens the frequency
>   ramp up duration, which in turn allows the utilization signal to
>   reach stable values quicker. Since the allowed frequency boost is
>   bounded in energy, it will behave consistently across platforms,
>   regardless of the OPP cost range.
>
> * The boost is only transient, and should not significantly impact the
>   energy consumed by workloads with very stable utilization signals.
>
> This has been lightly tested with a rtapp task ramping from 10% to 75%
> utilisation on a big core.
>
> v1 -> v2:
>
> * Split the new sugov_cpu_ramp_boost() from the existing
>   sugov_cpu_is_busy() as they seem to seek a different goal.
>
> * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
>   util_est_enqueued signals, rather than using idle calls count.
>   This makes the ramp boost much more accurate in finding boost
>   opportunities, and gives a "continuous" output rather than a boolean.
>
> * Add EM_COST_MARGIN_SCALE=1024 to represent the
>   margin values of em_pd_get_higher_freq().
>
> v2 -> v3:
>
> * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
>   to avoid boosting when the utilization is decreasing.
>
> * Add a tracepoint for testing.
>
> v3 -> v4:
>
> * em_pd_get_higher_freq() now interprets the margin as absolute,
>   rather than relative to the cost of the base frequency.
>
> * Modify misleading comment in em_pd_get_higher_freq() since min_freq
>   can actually be higher than the max available frequency in normal
>   operations.
>
> Douglas RAILLARD (6):
>   PM: Introduce em_pd_get_higher_freq()
>   sched/cpufreq: Attach perf domain to sugov policy
>   sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
>   sched/cpufreq: Introduce sugov_cpu_ramp_boost
>   sched/cpufreq: Boost schedutil frequency ramp up
>   sched/cpufreq: Add schedutil_em_tp tracepoint
>
>  include/linux/energy_model.h     |  56 ++++++++++++++
>  include/trace/events/power.h     |   9 +++
>  kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
>  3 files changed, 182 insertions(+), 7 deletions(-)
>
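Point 1) of the cover letter (pick the highest frequency available for a given cost, skipping less efficient lower OPPs) can be illustrated with a small userspace sketch. This is not the code from the series: `struct opp`, `example_table` and `highest_freq_for_cost()` are hypothetical names, the cost values are made up, and the table is assumed to be sorted by ascending frequency.

```c
#include <stddef.h>

/* Hypothetical OPP entry for illustration; not the kernel's EM structures. */
struct opp {
	unsigned long freq_khz;
	unsigned long cost;	/* relative energy cost, lower is cheaper */
};

/* Made-up table, sorted by ascending frequency. The 500 MHz OPP is
 * deliberately less efficient than the 800 MHz one. */
static const struct opp example_table[] = {
	{  500000, 100 },
	{  800000,  90 },
	{ 1000000, 120 },
	{ 1500000, 200 },
};

/* Return the highest frequency that costs no more than the lowest OPP
 * able to satisfy min_freq_khz. */
static unsigned long highest_freq_for_cost(const struct opp *table, size_t n,
					   unsigned long min_freq_khz)
{
	size_t i, base;
	unsigned long best;

	if (n == 0)
		return min_freq_khz;

	/* Lowest OPP satisfying the request, clamped to the last entry
	 * when the request exceeds the highest available frequency. */
	base = n - 1;
	for (i = 0; i < n; i++) {
		if (table[i].freq_khz >= min_freq_khz) {
			base = i;
			break;
		}
	}

	/* Prefer any higher OPP that is not more expensive than the base. */
	best = table[base].freq_khz;
	for (i = base + 1; i < n; i++) {
		if (table[i].cost <= table[base].cost)
			best = table[i].freq_khz;
	}
	return best;
}
```

With this made-up table, a request for 400 MHz resolves to 800 MHz because the 800 MHz OPP is cheaper than the 500 MHz one, while a request for 900 MHz stays at 1000 MHz.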
On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD <douglas.raillard@arm.com> wrote: > > Make schedutil cpufreq governor energy-aware. I have to say that your terminology is confusing to me, like what exactly does "energy-aware" mean in the first place? > - patch 1 introduces a function to retrieve a frequency given a base > frequency and an energy cost margin. > - patch 2 links Energy Model perf_domain to sugov_policy. > - patch 3 updates get_next_freq() to make use of the Energy Model. > - patch 4 adds sugov_cpu_ramp_boost() function. > - patch 5 updates sugov_update_(single|shared)() to make use of > sugov_cpu_ramp_boost(). > - patch 6 introduces a tracepoint in get_next_freq() for > testing/debugging. Since it's not a trace event, it's not exposed to > userspace in a directly usable way, allowing for painless future > updates/removal. > > The benefits of using the EM in schedutil are twofold: I guess you mean using the EM directly in schedutil (note that it is used indirectly already, because of EAS), but that needs to be clearly stated. > 1) Selecting the highest possible frequency for a given cost. Some > platforms can have lower frequencies that are less efficient than > higher ones, in which case they should be skipped for most purposes. > They can still be useful to give more freedom to thermal throttling > mechanisms, but not under normal circumstances. > note: the EM framework will warn about such OPPs "hertz/watts ratio > non-monotonically decreasing" While all of that is fair enough for platforms using the EM, do you realize that the EM is not available on the majority of architectures (including some fairly significant ones) and so adding overhead related to it for all of them is quite less than welcome? > 2) Driving the frequency selection with power in mind, in addition to > maximizing the utilization of the non-idle CPUs in the system. Care to explain this? I'm totally unsure what you mean here. 
> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and > enabled in schedutil by > "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()". > > Point 2) is enabled in > "sched/cpufreq: Boost schedutil frequency ramp up". It allows using > higher frequencies when it is known that the true utilization of > currently running tasks is exceeding their previous stable point. Please explain "true utilization" and "stable point". > The benefits are: > > * Boosting the frequency when the behavior of a runnable task changes, > leading to an increase in utilization. That shortens the frequency > ramp up duration, which in turns allows the utilization signal to > reach stable values quicker. Since the allowed frequency boost is > bounded in energy, it will behave consistently across platforms, > regardless of the OPP cost range. Sounds good. Can you please describe the algorithm applied to achieve that? > * The boost is only transient, and should not impact a lot the energy > consumed of workloads with very stable utilization signals.
Hi Rafael, On 1/23/20 3:43 PM, Rafael J. Wysocki wrote: > On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD > <douglas.raillard@arm.com> wrote: >> >> Make schedutil cpufreq governor energy-aware. > > I have to say that your terminology is confusing to me, like what > exactly does "energy-aware" mean in the first place? Should be better rephrased as "Make schedutil cpufreq governor use the energy model" I guess. Schedutil is indeed already energy aware since it tries to use the lowest frequency possible for the job to be done (kind of). > >> - patch 1 introduces a function to retrieve a frequency given a base >> frequency and an energy cost margin. >> - patch 2 links Energy Model perf_domain to sugov_policy. >> - patch 3 updates get_next_freq() to make use of the Energy Model. >> - patch 4 adds sugov_cpu_ramp_boost() function. >> - patch 5 updates sugov_update_(single|shared)() to make use of >> sugov_cpu_ramp_boost(). >> - patch 6 introduces a tracepoint in get_next_freq() for >> testing/debugging. Since it's not a trace event, it's not exposed to >> userspace in a directly usable way, allowing for painless future >> updates/removal. >> >> The benefits of using the EM in schedutil are twofold: > > I guess you mean using the EM directly in schedutil (note that it is > used indirectly already, because of EAS), but that needs to be clearly > stated. In the current state (of the code and my knowledge), the EM "leaks" into schedutil only by the fact that tasks are moved around by EAS, so the CPU util seen by schedutil is impacted compared to the same workload on non-EAS setup. Other than that, the only energy-related information schedutil uses is the assumption that lower freq == better efficiency. Explicit use of the EM allows to refine this assumption. > >> 1) Selecting the highest possible frequency for a given cost. Some >> platforms can have lower frequencies that are less efficient than >> higher ones, in which case they should be skipped for most purposes. 
>> They can still be useful to give more freedom to thermal throttling >> mechanisms, but not under normal circumstances. >> note: the EM framework will warn about such OPPs "hertz/watts ratio >> non-monotonically decreasing" > > While all of that is fair enough for platforms using the EM, do you > realize that the EM is not available on the majority of architectures > (including some fairly significant ones) and so adding overhead > related to it for all of them is quite less than welcome? When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is defined to a static inline no-op function, so that feature won't incur overhead (patch 1+2+3). Patch 4 and 5 do add some new logic that could be used on any platform. Current code will use the boost as an energy margin, but it would be straightforward to make a util-based version (like iowait boost) on non-EM platforms. >> 2) Driving the frequency selection with power in mind, in addition to >> maximizing the utilization of the non-idle CPUs in the system. > > Care to explain this? I'm totally unsure what you mean here. Currently, schedutil is basically tailoring the CPU capacity to the util of the tasks on it. That's all good for periodic tasks, but there are situations where we can do better than assuming the task is periodic with a fixed duty cycle. The case improved by that series is when a task increases its duty cycle. In that specific case, it can be a good idea to increase the frequency until the util stabilizes again. We don't have a crystal ball so we can't adjust the freq right away. However, we do want to avoid the task to crave for speed until schedutil realizes it needs it. Using the EM here allows to boost within reasonable limits, without destroying the average energy consumption. > >> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and >> enabled in schedutil by >> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()". 
>>
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>
> Please explain "true utilization" and "stable point".

"true utilization" would be an instantaneous duty cycle. If a task
suddenly starts doing twice as much work, its "true utilization" will
double instantly.

"stable point" would be util est enqueued here. If a task is periodic,
util est enqueued will be constant once it reaches a steady state. As
soon as the duty cycle of the task changes, util est enqueued will change.

>
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>>   leading to an increase in utilization. That shortens the frequency
>>   ramp up duration, which in turn allows the utilization signal to
>>   reach stable values quicker. Since the allowed frequency boost is
>>   bounded in energy, it will behave consistently across platforms,
>>   regardless of the OPP cost range.
>
> Sounds good.
>
> Can you please describe the algorithm applied to achieve that?

The util est enqueued of a task is basically a snapshot of the util of
the task just before it's dequeued. This means that when the util has
stabilized, util est enqueued will be a constant signal. Specifically,
util est enqueued will be an upper bound of the swing of util avg.

When the task starts doing more work than at the previous activation,
its util avg will rise above the current util est enqueued. This means
we can no longer assume that util est enqueued represents an upper bound
of the duty cycle, so we can decide to boost until util avg "stabilizes"
again [note].

At the CPU level, we can track that in the rq aggregated signals:

- "stable rq's util est enqueued" is assumed to mean "same set of
  enqueued tasks as the last time we looked at that rq".

- task util est enqueued and util avg can be replaced by the rq signal.
  This will hide cases where a task's util increases while another one
  decreases by the same amount.

The limitations of both assumptions can be fixed by more invasive
changes (a rq cookie to know the set of enqueued tasks and an
OR-aggregated per-task flag to ask for boosting), but these heuristics
allow using the existing signals with changes limited to schedutil.

Once we have detected this situation, we can decide to boost. We don't
want black-and-white boosting, since a tiny increase in util should lead
to a tiny boost. Here, we use (util - util_est_enqueued). If the increase
is small, that boost will be small.

[note]: util avg of a periodic task never actually stabilizes, it just
enters an interval and never leaves it. When the duty cycle changes, it
will leave that interval to enter another one. The centre of that
interval is the task's duty cycle.

>> * The boost is only transient, and should not impact a lot the energy
>>   consumed of workloads with very stable utilization signals.

Thanks,
Douglas
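The detection heuristic described in the reply above (stable util est enqueued, util avg not decreasing, util avg above util est enqueued) can be sketched as a standalone function. This is only a userspace illustration under assumptions: `struct ramp_state` and `ramp_boost_update()` are hypothetical names standing in for the per-CPU state kept by schedutil, and the inputs are taken to be the rq-level signals sampled at each update.

```c
/* Hypothetical snapshot of the signals seen at the previous update. */
struct ramp_state {
	unsigned long util_avg;
	unsigned long util_est_enqueued;
};

/* Return a boost proportional to how far util_avg has risen above a
 * stable util_est_enqueued, then remember the new values. */
static unsigned long ramp_boost_update(struct ramp_state *prev,
				       unsigned long util_avg,
				       unsigned long util_est_enqueued)
{
	unsigned long boost = 0;

	/* Boost only if:
	 *  - util_est_enqueued is unchanged (assumed: same set of tasks),
	 *  - util_avg is not decreasing,
	 *  - util_avg has left the range bounded by util_est_enqueued. */
	if (util_est_enqueued == prev->util_est_enqueued &&
	    util_avg >= prev->util_avg &&
	    util_avg > util_est_enqueued)
		boost = util_avg - util_est_enqueued;

	prev->util_avg = util_avg;
	prev->util_est_enqueued = util_est_enqueued;

	return boost;
}
```

Because the output is the difference between the two signals rather than a boolean, a small overshoot gives a small boost, and the boost drops to zero as soon as util avg starts decreasing or util est enqueued changes.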
On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <douglas.raillard@arm.com> wrote: > > Make schedutil cpufreq governor energy-aware. > > - patch 1 introduces a function to retrieve a frequency given a base > frequency and an energy cost margin. > - patch 2 links Energy Model perf_domain to sugov_policy. > - patch 3 updates get_next_freq() to make use of the Energy Model. > - patch 4 adds sugov_cpu_ramp_boost() function. > - patch 5 updates sugov_update_(single|shared)() to make use of > sugov_cpu_ramp_boost(). > - patch 6 introduces a tracepoint in get_next_freq() for > testing/debugging. Since it's not a trace event, it's not exposed to > userspace in a directly usable way, allowing for painless future > updates/removal. > > The benefits of using the EM in schedutil are twofold: > > 1) Selecting the highest possible frequency for a given cost. Some > platforms can have lower frequencies that are less efficient than > higher ones, in which case they should be skipped for most purposes. This make sense. Why using a lower frequency when a higher one is more power efficient > They can still be useful to give more freedom to thermal throttling > mechanisms, but not under normal circumstances. > note: the EM framework will warn about such OPPs "hertz/watts ratio > non-monotonically decreasing" > > 2) Driving the frequency selection with power in mind, in addition to > maximizing the utilization of the non-idle CPUs in the system. > > Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and > enabled in schedutil by > "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()". > > Point 2) is enabled in > "sched/cpufreq: Boost schedutil frequency ramp up". It allows using > higher frequencies when it is known that the true utilization of > currently running tasks is exceeding their previous stable point. > The benefits are: > > * Boosting the frequency when the behavior of a runnable task changes, > leading to an increase in utilization. 
That shortens the frequency > ramp up duration, which in turns allows the utilization signal to > reach stable values quicker. Since the allowed frequency boost is > bounded in energy, it will behave consistently across platforms, > regardless of the OPP cost range. Could you explain this a bit more ? > > * The boost is only transient, and should not impact a lot the energy > consumed of workloads with very stable utilization signals. > > This has been lightly tested with a rtapp task ramping from 10% to 75% > utilisation on a big core. Which kind of UC are you targeting ? Do you have some benchmark showing the benefit and how you can bound the increase of energy ? The benefit of point2 is less obvious for me. We already have uclamp which helps to overwrite the "utilization" that is seen by schedutil to boost or cap the frequency when some tasks are running. I'm curious to see what would be the benefit of this on top. > > v1 -> v2: > > * Split the new sugov_cpu_ramp_boost() from the existing > sugov_cpu_is_busy() as they seem to seek a different goal. > > * Implement sugov_cpu_ramp_boost() based on CFS util_avg and > util_est_enqueued signals, rather than using idle calls count. > This makes the ramp boost much more accurate in finding boost > opportunities, and give a "continuous" output rather than a boolean. > > * Add EM_COST_MARGIN_SCALE=1024 to represent the > margin values of em_pd_get_higher_freq(). > > v2 -> v3: > > * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update() > to avoid boosting when the utilization is decreasing. > > * Add a tracepoint for testing. > > v3 -> v4: > > * em_pd_get_higher_freq() now interprets the margin as absolute, > rather than relative to the cost of the base frequency. > > * Modify misleading comment in em_pd_get_higher_freq() since min_freq > can actually be higher than the max available frequency in normal > operations. 
Hi Vincent, On 1/27/20 5:16 PM, Vincent Guittot wrote: > On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <douglas.raillard@arm.com> wrote: >> >> Make schedutil cpufreq governor energy-aware. >> >> - patch 1 introduces a function to retrieve a frequency given a base >> frequency and an energy cost margin. >> - patch 2 links Energy Model perf_domain to sugov_policy. >> - patch 3 updates get_next_freq() to make use of the Energy Model. >> - patch 4 adds sugov_cpu_ramp_boost() function. >> - patch 5 updates sugov_update_(single|shared)() to make use of >> sugov_cpu_ramp_boost(). >> - patch 6 introduces a tracepoint in get_next_freq() for >> testing/debugging. Since it's not a trace event, it's not exposed to >> userspace in a directly usable way, allowing for painless future >> updates/removal. >> >> The benefits of using the EM in schedutil are twofold: >> >> 1) Selecting the highest possible frequency for a given cost. Some >> platforms can have lower frequencies that are less efficient than >> higher ones, in which case they should be skipped for most purposes. > > This make sense. Why using a lower frequency when a higher one is more > power efficient Apparently in some cases it can be useful for thermal capping. AFAIU the alternate solution is to race to idle with a more efficient OPP (idle injection work of Linaro). >> They can still be useful to give more freedom to thermal throttling >> mechanisms, but not under normal circumstances. >> note: the EM framework will warn about such OPPs "hertz/watts ratio >> non-monotonically decreasing" >> >> 2) Driving the frequency selection with power in mind, in addition to >> maximizing the utilization of the non-idle CPUs in the system. >> >> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and >> enabled in schedutil by >> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()". >> >> Point 2) is enabled in >> "sched/cpufreq: Boost schedutil frequency ramp up". 
It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>>   leading to an increase in utilization. That shortens the frequency
>>   ramp up duration, which in turn allows the utilization signal to
>>   reach stable values quicker. Since the allowed frequency boost is
>>   bounded in energy, it will behave consistently across platforms,
>>   regardless of the OPP cost range.
>
> Could you explain this a bit more ?

The goal is to detect when the task starts asking for more CPU time than
it did during the previous period. At this stage, we don't know how much
more, so we increase the frequency faster to allow signals to settle more
quickly.

The PELT signal does increase independently from the chosen frequency,
but that's only up until idle time shows up. At this point, the util
will drop again, and the frequency with it.

>>
>> * The boost is only transient, and should not impact a lot the energy
>>   consumed of workloads with very stable utilization signals.
>>
>> This has been lightly tested with a rtapp task ramping from 10% to 75%
>> utilisation on a big core.
>
> Which kind of UC are you targeting ?

One case is tasks with "random" behavior, like threads in thread pools
that can end up doing very different things. There may be other cases as
well, but I'll need to do more extensive testing with actual applications.

>
> Do you have some benchmark showing the benefit and how you can bound
> the increase of energy ?

In the test setup described above, it increases the energy consumption
by ~2.5%. I also did some preliminary experiments to reduce the margin
taken in map_util_freq(), which becomes less necessary if the frequency
is boosted in the cases where the util increase is getting out of hand.
That can recover some amount of lost power.

The real cost in practice heavily depends on:
* the workloads (if its util jumps around, it will boost more frequently)
* the discrete frequencies available (if boosting does not bring us to
  the next freq, no boost is actually applied).

>
> The benefit of point2 is less obvious for me. We already have uclamp
> which helps to overwrite the "utilization" that is seen by schedutil
> to boost or cap the frequency when some tasks are running. I'm curious
> to see what would be the benefit of this on top.

uclamp is only useful when a target utilization is known beforehand by
the task itself or some kind of manager. In all the cases relying on
plain PELT, we can decrease the freq change reaction time.

Note that schedutil is already built around the duty cycle detection
with a bias for higher frequency when the task period increases (using
util est enqueued). What this series brings is a way to detect when util
est enqueued turns from a set-point into a lower bound.

>>
>> v1 -> v2:
>>
>> * Split the new sugov_cpu_ramp_boost() from the existing
>>   sugov_cpu_is_busy() as they seem to seek a different goal.
>>
>> * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
>>   util_est_enqueued signals, rather than using idle calls count.
>>   This makes the ramp boost much more accurate in finding boost
>>   opportunities, and give a "continuous" output rather than a boolean.
>>
>> * Add EM_COST_MARGIN_SCALE=1024 to represent the
>>   margin values of em_pd_get_higher_freq().
>>
>> v2 -> v3:
>>
>> * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
>>   to avoid boosting when the utilization is decreasing.
>>
>> * Add a tracepoint for testing.
>>
>> v3 -> v4:
>>
>> * em_pd_get_higher_freq() now interprets the margin as absolute,
>>   rather than relative to the cost of the base frequency.
>>
>> * Modify misleading comment in em_pd_get_higher_freq() since min_freq
>>   can actually be higher than the max available frequency in normal
>>   operations.
On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote:
> Hi Peter,
>
> Since the v3 was posted a while ago, here is a short recap of the hanging
> comments:
>
> * The boost margin was relative, but we came to the conclusion it would make
>   more sense to make it absolute (done in that v4).

As per (patch #1):

    +	max_cost = pd->table[pd->nr_cap_states - 1].cost;
    +	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

So we'll allow the boost to double energy consumption (or rather, since
you cannot go above the max OPP, we're allowed that).

> * The main remaining blur point was why defining boost=(util - util_est) makes
>   sense. The justification for that is that we use PELT-shaped signal to drive
>   the frequency, so using a PELT-shaped signal for the boost makes sense for the
>   same reasons.

As per (patch #4):

    +	unsigned long boost = 0;

    +	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
    +	    util_avg >= sg_cpu->util_avg &&
    +	    util_avg > util_est_enqueued)
    +		boost = util_avg - util_est_enqueued;

The result of that is not, strictly speaking, a PELT shaped signal.
Although when it is !0 the curves are similar, albeit offset.

> AFAIK there is no specific criteria to meet for frequency selection signal shape
> for anything else than periodic tasks (if we don't add other constraints on
> top), so (util - util_est)=(util - constant) seems as good as anything else.
> Especially since util is deemed to be a good fit in practice for frequency
> selection. Let me know if I missed anything on that front.

Given:

    sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg.

our next_f becomes:

    next_f = 1.25 * util_est * max_freq / max;

so our min_freq in em_pd_get_higher_freq() will already be compensated
for the offset.

So even when:

    boost = util_avg - util_est

is small, despite util_avg being huge (~1024), due to large util_est,
we'll still get an effective boost to max_cost ASSUMING cs[].cost and
cost_margin have the same curve.

They have not. Assuming cs[].cost ~ f^3, and given our cost_margin ~ f,
that leaves a factor f^2 on the table.

So the higher the min_freq, the less effective the boost.

Maybe it all works out in practise, but I'm missing a big picture
description of it all somewhere.
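The observation that the boost becomes less effective as min_freq rises can be checked with a toy model. The sketch below is not em_pd_get_higher_freq() from the series: it only reuses the two quoted lines that turn the margin into an absolute cost, and everything else (`struct cap_state` as laid out here, `cubic_table`, `higher_freq_with_margin()`, the selection loop) is a hypothetical reconstruction. The table follows the cost ~ f^3 assumption made above, with made-up frequencies.

```c
#include <stddef.h>

#define EM_COST_MARGIN_SCALE 1024UL	/* value stated in the cover letter */

/* Hypothetical capacity state for illustration. */
struct cap_state {
	unsigned long freq_mhz;
	unsigned long cost;
};

/* Made-up table with cost = (freq / 500 MHz)^3, ascending frequency. */
static const struct cap_state cubic_table[] = {
	{  500,  1 },
	{ 1000,  8 },
	{ 1500, 27 },
	{ 2000, 64 },
};

/* Highest frequency whose cost stays within an absolute margin above the
 * cost of the lowest OPP satisfying min_freq_mhz. cost_margin is a
 * fraction of the max cost, in units of 1/EM_COST_MARGIN_SCALE. */
static unsigned long higher_freq_with_margin(const struct cap_state *cs,
					     size_t n,
					     unsigned long min_freq_mhz,
					     unsigned long cost_margin)
{
	unsigned long max_cost, limit, best;
	size_t i, base;

	if (n == 0)
		return min_freq_mhz;

	/* Same conversion as the two lines quoted from patch #1. */
	max_cost = cs[n - 1].cost;
	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

	/* Lowest OPP satisfying the request, clamped to the last entry. */
	base = n - 1;
	for (i = 0; i < n; i++) {
		if (cs[i].freq_mhz >= min_freq_mhz) {
			base = i;
			break;
		}
	}

	limit = cs[base].cost + cost_margin;
	best = cs[base].freq_mhz;
	for (i = base + 1; i < n; i++) {
		if (cs[i].cost <= limit)
			best = cs[i].freq_mhz;
	}
	return best;
}
```

With a margin of 512 (half of the max cost), a request starting at 500 MHz is lifted to 1500 MHz, whereas a request starting at 1500 MHz is not lifted at all: the same absolute cost budget buys fewer megahertz as the base frequency rises.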
On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote: > Hi Rafael, > > On 1/23/20 3:43 PM, Rafael J. Wysocki wrote: > > On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD > > <douglas.raillard@arm.com> wrote: > >> > >> Make schedutil cpufreq governor energy-aware. > > > > I have to say that your terminology is confusing to me, like what > > exactly does "energy-aware" mean in the first place? > > Should be better rephrased as "Make schedutil cpufreq governor use the > energy model" I guess. Schedutil is indeed already energy aware since it > tries to use the lowest frequency possible for the job to be done (kind of). So ARM64 will soon get x86-like power management if I read these here patches right: https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com And I'm thinking a part of Rafael's concerns will also apply to those platforms. > Other than that, the only energy-related information schedutil uses is > the assumption that lower freq == better efficiency. Explicit use of the > EM allows to refine this assumption. I'm thinking that such platforms guarantee this on their own, if not, there just isn't anything we can do about it, so that assumption is fair. (I've always found it weird to have less efficient OPPs listed anyway) > >> 1) Selecting the highest possible frequency for a given cost. Some > >> platforms can have lower frequencies that are less efficient than > >> higher ones, in which case they should be skipped for most purposes. > >> They can still be useful to give more freedom to thermal throttling > >> mechanisms, but not under normal circumstances. 
> >> note: the EM framework will warn about such OPPs "hertz/watts ratio > >> non-monotonically decreasing" > > > > While all of that is fair enough for platforms using the EM, do you > > realize that the EM is not available on the majority of architectures > > (including some fairly significant ones) and so adding overhead > > related to it for all of them is quite less than welcome? > > When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is > defined to a static inline no-op function, so that feature won't incur > overhead (patch 1+2+3). > > Patch 4 and 5 do add some new logic that could be used on any platform. > Current code will use the boost as an energy margin, but it would be > straightforward to make a util-based version (like iowait boost) on > non-EM platforms. Right, so the condition 'util_avg > util_est' makes sense to trigger some sort of boost off of. What kind would make sense for these platforms? One possibility would be to instead of frobbing the energy margin, as you do here, to frob the C in get_next_freq(). (I have vague memories of this being proposed earlier; it also avoids that double OPP iteration thing complained about elsewhere in this thread if I'm not mistaken). That is; I'm thinking it is important (esp. now that we got frequency invariance sorted for x86), to have this patch also work for !EM architectures (as those ARM64-AMU things would be).
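For platforms without an EM, the idea of adjusting the C factor in get_next_freq() could look roughly like the sketch below. It is only a guess at the shape of such a change: `next_freq()`, `boosted_c()`, `C_SCALE` and `C_DEFAULT` are hypothetical names, the mapping from boost to C is an arbitrary choice made for illustration, and the clamp to max_freq is added here for the example rather than taken from schedutil.

```c
#include <stdint.h>

#define C_SCALE   1024u
#define C_DEFAULT 1280u	/* C = 1.25, expressed in 1/1024 units */

/* next_f = C * util * max_freq / max, with C given in 1/C_SCALE units.
 * The result is clamped to max_freq (a choice made for this sketch). */
static uint64_t next_freq(uint64_t util, uint64_t max_cap,
			  uint64_t max_freq, uint64_t c_scaled)
{
	uint64_t f = c_scaled * max_freq * util / (C_SCALE * max_cap);

	return f < max_freq ? f : max_freq;
}

/* Hypothetical mapping: widen C by the ramp boost, both being on a
 * 1024-based scale. Any other monotonic mapping would serve as well. */
static uint64_t boosted_c(uint64_t boost)
{
	return C_DEFAULT + boost;
}
```

For a CPU at half capacity (util 512 out of 1024) with a 2 GHz maximum, the default C gives 1.25 GHz, and a boost of 100 raises the request to about 1.35 GHz. Since the boost multiplies util, its effect in megahertz grows with util, so it acts as a relative boost rather than a fixed offset.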
On 2/10/20 1:30 PM, Peter Zijlstra wrote: > On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote: >> Hi Rafael, >> >> On 1/23/20 3:43 PM, Rafael J. Wysocki wrote: >>> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD >>> <douglas.raillard@arm.com> wrote: >>>> >>>> Make schedutil cpufreq governor energy-aware. >>> >>> I have to say that your terminology is confusing to me, like what >>> exactly does "energy-aware" mean in the first place? >> >> Should be better rephrased as "Make schedutil cpufreq governor use the >> energy model" I guess. Schedutil is indeed already energy aware since it >> tries to use the lowest frequency possible for the job to be done (kind of). > > So ARM64 will soon get x86-like power management if I read these here > patches right: > > https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com > > And I'm thinking a part of Rafael's concerns will also apply to those > platforms. AFAIU there is an important difference: ARM64 firmware should not end up increasing frequency on its own, it should only cap the frequency. That means that the situation stays the same for that boost: Let's say you let schedutil selecting a freq that is +2% more power hungry. That will probably not be enough to make it jump to the next OPP, so you end up not boosting. Now if there is a firmware that decides for some reasons to cap frequency, it will be a similar situation. > >> Other than that, the only energy-related information schedutil uses is >> the assumption that lower freq == better efficiency. Explicit use of the >> EM allows to refine this assumption. > > I'm thinking that such platforms guarantee this on their own, if not, > there just isn't anything we can do about it, so that assumption is > fair. > > (I've always found it weird to have less efficient OPPs listed anyway) Ultimately, (mostly) the piece of code involved in thermal capping needs to know about these inefficient OPPs (be it the firmware or some kernel subsystem). 
The rest of the world doesn't need to care. >>>> 1) Selecting the highest possible frequency for a given cost. Some >>>> platforms can have lower frequencies that are less efficient than >>>> higher ones, in which case they should be skipped for most purposes. >>>> They can still be useful to give more freedom to thermal throttling >>>> mechanisms, but not under normal circumstances. >>>> note: the EM framework will warn about such OPPs "hertz/watts ratio >>>> non-monotonically decreasing" >>> >>> While all of that is fair enough for platforms using the EM, do you >>> realize that the EM is not available on the majority of architectures >>> (including some fairly significant ones) and so adding overhead >>> related to it for all of them is quite less than welcome? >> >> When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is >> defined to a static inline no-op function, so that feature won't incur >> overhead (patch 1+2+3). >> >> Patch 4 and 5 do add some new logic that could be used on any platform. >> Current code will use the boost as an energy margin, but it would be >> straightforward to make a util-based version (like iowait boost) on >> non-EM platforms. > > Right, so the condition 'util_avg > util_est' makes sense to trigger > some sort of boost off of. > > What kind would make sense for these platforms? One possibility would be > to instead of frobbing the energy margin, as you do here, to frob the C > in get_next_freq(). If I'm correct, changing the C value would be somewhat similar to the relative boosting I had in a previous version. Maybe adding a fixed offset would give more predictable results as was discussed with Vincent Guittot. In any case, it would change the perceived util (like iowait boost). > (I have vague memories of this being proposed earlier; it also avoids > that double OPP iteration thing complained about elsewhere in this > thread if I'm not mistaken). 
It should be possible to get rid of the double iteration mentioned by
Quentin. Choosing to boost the util or the energy boils down to:

1) If you care more about predictable battery life (or energy bill) than
predictability of the boost feature, EM should be used.

2) If you don't have an EM or you care more about having a predictable
boost for a given workload, use util (or disable that boost).

The rationale is that with 1), you will get a different speed boost for a
given workload depending on the other things executing at the same time,
as the speedup is not linear in the task-related metric (util -
util_est). If you are already at a high frequency because of another
workload, the speedup will be small because the next +100 MHz will cost
much more than the same +100 MHz delta starting from a low OPP.

> That is; I'm thinking it is important (esp. now that we got frequency
> invariance sorted for x86), to have this patch also work for !EM
> architectures (as those ARM64-AMU things would be).

For sure, that feature is supposed to help in cases that would be
impossible to pinpoint with hardware, since it has to know what tasks
execute.
On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote:
> On 2/10/20 1:30 PM, Peter Zijlstra wrote:

> > So ARM64 will soon get x86-like power management if I read these here
> > patches right:
> >
> > https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com
> >
> > And I'm thinking a part of Rafael's concerns will also apply to those
> > platforms.
>
> AFAIU there is an important difference: ARM64 firmware should not end up
> increasing frequency on its own, it should only cap the frequency. That
> means that the situation stays the same for that boost:
>
> Let's say you let schedutil selecting a freq that is +2% more power
> hungry. That will probably not be enough to make it jump to the next
> OPP, so you end up not boosting. Now if there is a firmware that decides
> for some reasons to cap frequency, it will be a similar situation.

The moment you give out OPP selection to a 3rd party (be it firmware or
a micro-controller) things are uncertain at best anyway.

Still, in general, if you give it higher input, it tends to at least
consider going faster -- which might be all you can ask for...

So I'm not exactly seeing what your argument is here.

> > Right, so the condition 'util_avg > util_est' makes sense to trigger
> > some sort of boost off of.
> >
> > What kind would make sense for these platforms? One possibility would be
> > to instead of frobbing the energy margin, as you do here, to frob the C
> > in get_next_freq().
>
> If I'm correct, changing the C value would be somewhat similar to the
> relative boosting I had in a previous version. Maybe adding a fixed
> offset would give more predictable results as was discussed with Vincent
> Guittot. In any case, it would change the perceived util (like iowait
> boost).

It depends a bit on what you change C into. If we do something trivial
like:

	       1.25 ; !(util_avg > util_est)
	C := {
	       2    ;  (util_avg > util_est)

ie. a binary selection of constants, then yes, I suppose that is the
case.
But nothing stops us from making it more complicated; or having it depend on the presence of EM data. > > (I have vague memories of this being proposed earlier; it also avoids > > that double OPP iteration thing complained about elsewhere in this > > thread if I'm not mistaken). > > It should be possible to get rid of the double iteration mentioned by > Quentin. Choosing to boost the util or the energy boils down to: > > 1) If you care more about predictable battery life (or energy bill) than > predictability of the boost feature, EM should be used. > > 2) If you don't have an EM or you care more about having a predictable > boost for a given workload, use util (or disable that boost). > > The rational is that with 1), you will get a different speed boost for a > given workload depending on the other things executing at the same time, > as the speed up is not linear with the task-related metric (util - > util_est). If you are already at high freq because of another workload, > the speed up will be small because the next 100Mhz will cost much more > than the same +100Mhz delta starting from a low OPP. It's just that I'm not seeing how 1 actually works or provides that more predictable battery life I suppose. We have this other sub-thread to argue about that :-) > > That is; I'm thinking it is important (esp. now that we got frequency > > invariance sorted for x86), to have this patch also work for !EM > > architectures (as those ARM64-AMU things would be). > > For sure, that feature is supposed to help in cases that would be > impossible to pinpoint with hardware, since it has to know what tasks > execute. OK, so I'm thinking we're agreeing that it would be good to have this support !EM systems too.
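As an illustration of the "binary selection of constants" idea above, here is a minimal Python sketch. It is not kernel code: the function name mirrors get_next_freq() only for readability, the `max_cap=1024` default and the final clamp are assumptions, and taking `util = max(util_avg, util_est)` follows the cpu_util_cfs() remark made elsewhere in the thread.

```python
def get_next_freq(util_avg, util_est, max_freq, max_cap=1024):
    """Illustrative sketch: schedutil-style mapping with a binary factor C.

    next_f = C * util * max_freq / max_cap, with
    C = 2 while util_avg > util_est, and C = 1.25 otherwise.
    """
    util = max(util_avg, util_est)
    # Express C as a fraction so the computation stays in integers.
    c_num, c_den = (2, 1) if util_avg > util_est else (5, 4)
    freq = c_num * max_freq * util // (c_den * max_cap)
    # A real governor would resolve this to an available OPP;
    # this sketch only clamps to max_freq.
    return min(freq, max_freq)


if __name__ == "__main__":
    # Steady state (util_avg == util_est): the usual 1.25 headroom applies.
    print(get_next_freq(200, 200, 1100000))
    # Ramp-up (util_avg > util_est): headroom jumps to 2.
    print(get_next_freq(300, 200, 1100000))
```

Such a scheme needs no Energy Model data, which is why it is attractive for !EM platforms, but the size of the boost is then fixed by the constants rather than by an energy budget.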
On 2/10/20 1:21 PM, Peter Zijlstra wrote: > On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote: >> Hi Peter, >> >> Since the v3 was posted a while ago, here is a short recap of the hanging >> comments: >> >> * The boost margin was relative, but we came to the conclusion it would make >> more sense to make it absolute (done in that v4). > > As per (patch #1): > > + max_cost = pd->table[pd->nr_cap_states - 1].cost; > + cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE; > > So we'll allow the boost to double energy consumption (or rather, since > you cannot go above the max OPP, we're allowed that). Indeed. This might need some tweaking based on testing, maybe +50% is enough, or maybe +200% is even better. >> * The main remaining blur point was why defining boost=(util - util_est) makes >> sense. The justification for that is that we use PELT-shaped signal to drive >> the frequency, so using a PELT-shaped signal for the boost makes sense for the >> same reasons. > > As per (patch #4): > > + unsigned long boost = 0; > > + if (util_est_enqueued == sg_cpu->util_est_enqueued && > + util_avg >= sg_cpu->util_avg && > + util_avg > util_est_enqueued) > + boost = util_avg - util_est_enqueued; > > The result of that is not, strictly speaking, a PELT shaped signal. > Although when it is !0 the curves are similar, albeit offset. Yes, it has the same rate of increase as PELT. > >> AFAIK there is no specific criteria to meet for frequency selection signal shape >> for anything else than periodic tasks (if we don't add other constraints on >> top), so (util - util_est)=(util - constant) seems as good as anything else. >> Especially since util is deemed to be a good fit in practice for frequency >> selection. Let me know if I missed anything on that front. > > > Given: > > sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg. cpu_util_cfs uses max_t (maybe irrelevant for this discussion): UTIL_EST ? 
max(util_est.enqueued, util_avg) : util_avg

> our next_f becomes:
>
>   next_f = 1.25 * util_est * max_freq / max;

> so our min_freq in em_pd_get_higher_freq() will already be compensated
> for the offset.

Yes, the boost is added on top of the existing behavior.

> So even when:
>
>   boost = util_avg - util_est
>
> is small, despite util_avg being huge (~1024), due to large util_est,
> we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> cost_margin have the same curve.

I'm not sure I follow: cs[].cost can be plotted against cs[].freq, but
cost_margin is a time-based signal (the boost value), so it would be
plotted against time.

>
> They have not.
>
> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
> factor f^2 on the table.

I'm guessing that you arrived at `cost_margin ~ f` this way:

  cost_margin = util - util_est_enqueued
  cost_margin = util - constant

  # with constant small enough
  cost_margin ~ util

  # with util ~ 1/f
  cost_margin ~ 1/f

In the case you describe, `constant` is actually almost equal to `util`
so `cost_margin !~ util`, and that series assumes frequency invariant
util_avg so `util !~ 1/f` (I'll probably have to fix that).

> So the higher the min_freq, the less effective the boost.

Yes, since the boost is allowing a fixed amount of extra power. Higher
OPPs are less efficient than lower ones, so if min_freq is high, we
won't speed up as much as if min_freq was low.
> Maybe it all works out in practise, but I'm missing a big picture

Here is a big picture :)

https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg

The board is a Juno R0, with a periodic task pinned on a big CPU
(capa=1024):
* phase 1: 5% duty cycle (=51 PELT units)
* phase 2: 75% duty cycle (=768 PELT units)

Legend:
* blue square wave: when the task executes (like in kernelshark)
* base_cost = cost of frequency as selected by schedutil in normal
  operations
* allowed_cost = base_cost + cost_margin
* util = util_avg

note: the small gaps right after the duty cycle transition between
t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
and no util_est update.

> description of it all somewhere.

Now a textual version of it:

em_pd_get_higher_freq() does the following:

  # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
  # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
  # value of "max_cost", which is the highest OPP on that CPU.
  concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

  # Then it finds the lowest OPP satisfying min_freq:
  min_opp = OPP_AT_FREQ(min_freq)

  # It takes the cost associated, and finds the highest OPP that has a
  # cost lower than that:
  max_cost = COST_OF(min_opp) + concrete_margin

  final_freq = MAX(
      FREQ_OF(opp)
      for opp in available_opps
      if COST_OF(opp) <= max_cost
  )

So this means that:

  util - util_est_enqueued ~= 0
  => cost_margin ~= 0
  => concrete_cost_margin ~= 0
  => max_cost = COST_OF(min_opp) + 0
  => final_freq = FREQ_OF(min_opp)

The effective boost is ~0, so you will get the current behaviour of
schedutil.

If the task starts needing more cycles than during its previous period,
`util - util_est_enqueued` will grow like util since util_est_enqueued
is constant. The longer we wait, the higher the boost, until the task
goes to sleep again.
At next wakeup, util_est_enqueued has caught up and either: 1) util becomes stable, so no more boosting 2) util keeps increasing, so go for another round of boosting Thanks, Douglas
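The textual description of em_pd_get_higher_freq() given in this message can be turned into a small runnable Python sketch. This is only an illustration of the described steps, not the code from patch 1. The value `EM_COST_MARGIN_SCALE = 1024` is an assumption (the thread does not state it), the OPP table reuses the normalized Juno R0 costs quoted elsewhere in the thread, and the fallback to the highest OPP when min_freq exceeds every OPP is also an assumption.

```python
EM_COST_MARGIN_SCALE = 1024  # assumed scale; not stated in this thread

# (freq in kHz, normalized cost), sorted by increasing frequency.
# Values are the Juno R0 numbers quoted elsewhere in this thread.
OPPS = [
    (450000, 67.086),
    (625000, 72.1509),
    (800000, 80.8962),
    (950000, 90.1688),
    (1100000, 100.0),
]


def em_pd_get_higher_freq(opps, min_freq, cost_margin):
    """Highest frequency whose cost fits in cost(min_opp) + margin."""
    # Turn the abstract margin into a concrete cost value.
    max_cost = opps[-1][1]
    concrete_margin = cost_margin * max_cost / EM_COST_MARGIN_SCALE

    # Lowest OPP satisfying min_freq (assumption: fall back to the
    # highest OPP if min_freq is above all of them).
    min_opp = next((opp for opp in opps if opp[0] >= min_freq), opps[-1])

    # Highest OPP whose cost stays within the allowed budget.
    allowed_cost = min_opp[1] + concrete_margin
    return max(freq for freq, cost in opps if cost <= allowed_cost)


if __name__ == "__main__":
    for margin in (0, 103, 1024):
        print(margin, em_pd_get_higher_freq(OPPS, 450000, margin))
```

With a zero margin the function returns the frequency of min_opp, which is the "current behaviour of schedutil" case. Because the margin is an absolute cost, the same margin can buy more than one OPP step from a cheap OPP and only one (or none) from an expensive one.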
On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:

> > So even when:
> >
> >   boost = util_avg - util_est
> >
> > is small, despite util_avg being huge (~1024), due to large util_est,
> > we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> > cost_margin have the same curve.
>
> I'm not sure to follow, cs[].cost can be plotted against cs[].freq, but
> cost_margin is a time-based signal (the boost value), so it would be
> plotted against time.

Suppose we have the normalized energy vs frequency curve:

  x^3

( P ~ V^2 * f, due to lack of better: V ~ f -> P ~ f^3 )

  [ASCII plot of x**3: x axis from 0 to 1, y axis from 0 to 1]

where x is our normalized frequency and y is the normalized energy.

Further, remember that schedutil does (per construction; for lack of
better):

  f ~ u

So at u=0.6, we're at f=0.6 and P=0.2

+	boost = util_avg - util_est_enqueued;

So for util_est = 0.6, we're limited to: boost = 0.4.

+	max_cost = pd->table[pd->nr_cap_states - 1].cost;
+	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

Which then gives:

  cost_margin = boost = 0.4

And we find that:

  P' = P + cost_margin = 0.2 + 0.4 = 0.6 < 1

So even though we set out to allow a 100% boost in energy usage, we were
in fact incapable of achieving this, because our cost_margin is linear
in u while the energy (or cost) curve is cubic in u.

That was my argument; but I think that now that I've expanded on it, I
see a flaw, because when we do have boost = 0.4, this means util_avg =
1, and we would've selected f = 1, and boosting would've been pointless.
So let me try again: f = util_avg, P = f^3, boost = util_avg - util_est P' = util_avg ^ 3 + util_avg - util_est And I'm then failing to make further sense of that; it of course means that P'(u) is larger than P(2u) for some u, but I don't think we set that as a goal either. Let me ponder this a little more while I go read the rest of your email.
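The arithmetic in this message can be checked with a few lines of Python. This is only a sketch of the idealized model used above (normalized f ~ u and P = f^3, both of which are the message's own simplifying assumptions); the function names are made up for the illustration.

```python
def cost(freq):
    """Normalized power under the idealized P ~ f^3 model."""
    return freq ** 3


def allowed_cost_first(util_est, util_avg):
    """First attempt: base cost taken at f = util_est, plus the boost."""
    return cost(util_est) + (util_avg - util_est)


def allowed_cost_second(util_est, util_avg):
    """Second attempt: base cost taken at f = util_avg, plus the boost."""
    return cost(util_avg) + (util_avg - util_est)


def freq_for_cost(allowed):
    """Invert P = f^3, clamped to the maximum normalized frequency."""
    return min(allowed, 1.0) ** (1.0 / 3.0)


if __name__ == "__main__":
    # First attempt at util_est = 0.6 with the largest possible boost (0.4):
    # the allowed cost stays below 1 (0.6^3 is 0.216, rounded to 0.2 above).
    print(allowed_cost_first(0.6, 1.0))
    # Second attempt, part-way through the ramp (util_avg = 0.8).
    p = allowed_cost_second(0.6, 0.8)
    print(p, freq_for_cost(p))
```

In the second form the boost lifts the frequency from 0.8 to somewhere below 0.9 in this model, which illustrates the point that the same absolute margin buys less speedup as the base frequency rises.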
On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote: > On 2/10/20 1:21 PM, Peter Zijlstra wrote: > > assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a > > factor f^2 on the table. > > I'm guessing that you arrived to `cost_margin ~ f` this way: > > cost_margin = util - util_est_enqueued > cost_margin = util - constant > > # with constant small enough > cost_margin ~ util > > # with util ~ 1/f > cost_margin ~ 1/f > > In the case you describe, `constant` is actually almost equal to `util` > so `cost_margin ~! util`, and that series assumes frequency invariant > util_avg so `util !~ 1/f` (I'll probably have to fix that). Nah, perhaps already clear from the other email; but it goes like: boost = util_avg - util_est cost_margin = boost * C = C * util_avg - C * util_est And since u ~ f (per schedutil construction), cost_margin is a function linear in either u or f. > > So the higher the min_freq, the less effective the boost. > > Yes, since the boost is allowing a fixed amount of extra power. Higher > OPPs are less efficient than lower ones, so if min_freq is high, we > won't speed up as much as if min_freq was low. > > > Maybe it all works out in practise, but I'm missing a big picture > > Here is a big picture :) > > https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg > > The board is a Juno R0, with a periodic task pinned on a big CPU > (capa=1024): > * phase 1: 5% duty cycle (=51 PELT units) > * phase 2: 75% duty cycle (=768 PELT units) > > Legend: > * blue square wave: when the task executes (like in kernelshark) > * base_cost = cost of frequency as selected by schedutil in normal > operations > * allowed_cost = base_cost + cost_margin > * util = util_avg > > note: the small gaps right after the duty cycle transition between > t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue > and no util_est update. 
I'm confused by the giant drop in frequency (blue line) around 4.18

schedutil shouldn't select f < max(util_avg, util_est), which is
violated right about there.

I'm also confused by the base_cost line; how can that be flat until
somewhere around 4.16. Sadly there is no line for pure schedutil freq to
compare against.

Other than that, I can see the green line is consistent with
util_avg>util_est, and how it helps grow the frequency (blue).
On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote: > > description of it all somewhere. > > Now a textual version of it: > > em_pd_get_higher_freq() does the following: > > # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a > # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete > # value of "max_cost", which is the highest OPP on that CPU. > concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE; > > # Then it finds the lowest OPP satisfying min_freq: > min_opp = OPP_AT_FREQ(min_freq) > > # It takes the cost associated, and finds the highest OPP that has a > # cost lower than that: > max_cost = COST_OF(min_opp) + concrete_margin > > final_freq = MAX( > FREQ_OF(opp) > for opp in available_opps > if COST_OF(opp) <= max_cost > ) Right; I got that. > So this means that: > util - util_est_enqueued ~= 0 Only if you assume the task will get scheduled out reasonably frequent. > => cost_margin ~= 0 > => concrete_cost_margin ~= 0 > => max_cost = COST_OF(min_opp) + 0 > => final_freq = FREQ_OF(min_opp) > > The effective boost is ~0, so you will get the current behaviour of > schedutil. But the argument holds; because if things don't get scheduled out, we'll peg u = 1 and hit f = 1 and all is well anyway. Which is a useful property; it shows that in the steady state, this patch-set is a NOP, but the above argument only relies on 'util_avg > util_est' being used a trigger. > If the task starts needing more cycles than during its previous period, > `util - util_est_enqueued` will grow like util since util_est_enqueued > is constant. The longer we wait, the higher the boost, until the task > goes to sleep again. 
> > At next wakeup, util_est_enqueued has caught up and either: > 1) util becomes stable, so no more boosting > 2) util keeps increasing, so go for another round of boosting Agreed; however elsewhere you wrote: > 1) If you care more about predictable battery life (or energy bill) than > predictability of the boost feature, EM should be used. > > 2) If you don't have an EM or you care more about having a predictable > boost for a given workload, use util (or disable that boost). This is the part I'm still not sure about; how do the specifics of the cost_margin setup lead to 1), or how would some frobbing with frequency selection destroy that property.
Hi Peter, On 2/13/20 1:20 PM, Peter Zijlstra wrote: > On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote: >> On 2/10/20 1:30 PM, Peter Zijlstra wrote: > >>> So ARM64 will soon get x86-like power management if I read these here >>> patches right: >>> >>> https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com >>> >>> And I'm thinking a part of Rafael's concerns will also apply to those >>> platforms. >> >> AFAIU there is an important difference: ARM64 firmware should not end up >> increasing frequency on its own, it should only cap the frequency. That >> means that the situation stays the same for that boost: >> >> Let's say you let schedutil selecting a freq that is +2% more power >> hungry. That will probably not be enough to make it jump to the next >> OPP, so you end up not boosting. Now if there is a firmware that decides >> for some reasons to cap frequency, it will be a similar situation. > > The moment you give out OPP selection to a 3rd party (be it firmware or > a micro-controller) things are uncertain at best anyway. > > Still, in general, if you give it higher input, it tends to at least > consider going faster -- which might be all you can ask for... > > So I'm not exactly seeing what your argument is here. My point is that a +2% boost will give *up to* +2% increase in energy use. With or without a fancy firmware, having cost_margin > 0 does not mean you will always actually get a speedup. >>> Right, so the condition 'util_avg > util_est' makes sense to trigger >>> some sort of boost off of. >>> >>> What kind would make sense for these platforms? One possibility would be >>> to instead of frobbing the energy margin, as you do here, to frob the C >>> in get_next_freq(). >> >> If I'm correct, changing the C value would be somewhat similar to the >> relative boosting I had in a previous version. Maybe adding a fixed >> offset would give more predictable results as was discussed with Vincent >> Guittot. 
In any case, it would change the perceived util (like iowait
>> boost).
>
> It depends a bit on what you change C into. If we do something trivial
> like:
>	       1.25 ; !(util_avg > util_est)
>	C := {
>	       2    ;  (util_avg > util_est)
>
> ie. a binary selection of constants, then yes, I suppose that is the
> case.
>
> But nothing stops us from making it more complicated; or having it
> depend on the presence of EM data.

The series currently fiddles with energy cost directly, but it's
possible to have the exact same effect by fiddling with util if we have
the function `(base_util, cost_margin) -> boosted_util`. It's just that
it forces us to:
1. map util to freq
2. find a higher freq for the given cost_margin
3. map freq to util
4. re-inject the new util, which will eventually get remapped to a freq

While it's easier to just do it directly:
1. map util to freq
2. find higher_freq for the given cost_margin
3. use the increased freq

>
>>> (I have vague memories of this being proposed earlier; it also avoids
>>> that double OPP iteration thing complained about elsewhere in this
>>> thread if I'm not mistaken).
>>
>> It should be possible to get rid of the double iteration mentioned by
>> Quentin. Choosing to boost the util or the energy boils down to:
>>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>>
>> The rational is that with 1), you will get a different speed boost for a
>> given workload depending on the other things executing at the same time,
>> as the speed up is not linear with the task-related metric (util -
>> util_est). If you are already at high freq because of another workload,
>> the speed up will be small because the next 100Mhz will cost much more
>> than the same +100Mhz delta starting from a low OPP.
> > It's just that I'm not seeing how 1 actually works or provides that more > predictable battery life I suppose. We have this other sub-thread to > argue about that :-) Ok, I've posted the answer there, so this thread can focus on boost=(util - util_est_enqueued) logic, and the other one on how to make actual use of the boost value. >>> That is; I'm thinking it is important (esp. now that we got frequency >>> invariance sorted for x86), to have this patch also work for !EM >>> architectures (as those ARM64-AMU things would be). >> >> For sure, that feature is supposed to help in cases that would be >> impossible to pinpoint with hardware, since it has to know what tasks >> execute. > > OK, so I'm thinking we're agreeing that it would be good to have this > support !EM systems too. > Thanks, Douglas
Hi Peter,

The preemption stack unwound back to schedutil :)

On 2/14/20 12:52 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
>> On 2/10/20 1:21 PM, Peter Zijlstra wrote:
>
>>> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
>>> factor f^2 on the table.
>>
>> I'm guessing that you arrived to `cost_margin ~ f` this way:
>>
>> cost_margin = util - util_est_enqueued
>> cost_margin = util - constant
>>
>> # with constant small enough
>> cost_margin ~ util
>>
>> # with util ~ 1/f
>> cost_margin ~ 1/f
>>
>> In the case you describe, `constant` is actually almost equal to `util`
>> so `cost_margin ~! util`, and that series assumes frequency invariant
>> util_avg so `util !~ 1/f` (I'll probably have to fix that).
>
> Nah, perhaps already clear from the other email; but it goes like:
>
>   boost = util_avg - util_est
>   cost_margin = boost * C = C * util_avg - C * util_est
>
> And since u ~ f (per schedutil construction), cost_margin is a function
> linear in either u or f.

cost_margin(u) is not linear in `u` or `f`, as the following does not
hold:

  cost_margin(A*u) = (A*u - CONSTANT)
                  != A*u - A*CONSTANT
                   = A * (u - CONSTANT)
                   = A * cost_margin(u)

This would only approximately work if CONSTANT was much smaller than u,
which it isn't. The tricky part is that CONSTANT is actually
CONSTANT(u), but the relation between u and CONSTANT is time-dependent.

That said, the problem can be simplified if we split it in two phases:
* when boost != 0 for a given task, in which case CONSTANT(u) is a true
  constant by construction, i.e. cost_margin(u) !~ u.
* when boost = 0 for a given task: all that matters is that only one task is boosted at a time, so that we can easily relate task composition and boost composition as done in: https://lore.kernel.org/lkml/5d732dc1-d343-24d2-bda9-072021a510ed@arm.com/ note: as mentioned in that email, this reasoning relies on util_est_enqueued(wload) at rq level being linear in wload, which does not hold with preemption. That would be fixed by working on task signals rather than rq aggregated ones. > >>> So the higher the min_freq, the less effective the boost. Ultimately I can't remember exactly what the cost_margin(u) linearity was intended to be used for, but the relative impact of boosting does become smaller if u is already high. This is by design, as the dynamic of that boosting is purposely mostly tied to local PELT variations, by removing a fixed offset from it. I don't think big tasks specifically deserve more boosting, as it would imply the incertitude on their new period is higher. We would need a more real world trace evaluating how the duty cycle changes relate to the initial duty cycle. That should not be particularly difficult to do, but the trace parsing infrastructure in LISA currently does not handle very nicely large traces, so that will probably have to wait until I get around fixing that by using libtraceevent from trace-cmd project. This means boosting currently depends on two things: 1) the PELT parameters (half life and window size) 2) the former task period 1) is fixed by definition in the kernel. 2) will make boosting more aggressive for short periods (i.e. the boost will rise faster). 
The initial rate of increase of boosting once it triggers should be
related to the former task period in (approximately) that way:

|   period (s) |   boost rate of increase (PELT unit/ms) |
|-------------:|----------------------------------------:|
|        0     |                                 21.3839 |
|        0.016 |                                 15.3101 |
|        0.032 |                                 10.9615 |
|        0.048 |                                 7.84807 |
|        0.064 |                                 5.61895 |
|        0.08  |                                 4.02297 |
|        0.096 |                                 2.8803  |
|        0.112 |                                 2.0622  |
|        0.128 |                                 1.47646 |

This assumes that while the task is running:

  util_avg(t) = util_avg(t_enqueue) + 1024 * (1-e^(-(t - t_enqueue)/tau))

with tau:

  # tau as defined by:
  https://en.wikipedia.org/wiki/Low-pass_filter#Simple_infinite_impulse_response_filter

  # Alpha as defined in https://en.wikipedia.org/wiki/Moving_average
  decay = (1 / 2)^(1 / half_life)
  alpha = 1 - decay
  window = 1024 * 1024 * 1e-9
  tau = window * ((1 - alpha) / alpha)
  tau = 0.047886417385345964

>>
>> Yes, since the boost is allowing a fixed amount of extra power. Higher
>> OPPs are less efficient than lower ones, so if min_freq is high, we
>> won't speed up as much as if min_freq was low.
>>
>>> Maybe it all works out in practise, but I'm missing a big picture
>>
>> Here is a big picture :)
>>
>> https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg
>>
>> The board is a Juno R0, with a periodic task pinned on a big CPU
>> (capa=1024):
>> * phase 1: 5% duty cycle (=51 PELT units)
>> * phase 2: 75% duty cycle (=768 PELT units)
>>
>> Legend:
>> * blue square wave: when the task executes (like in kernelshark)
>> * base_cost = cost of frequency as selected by schedutil in normal
>> operations
>> * allowed_cost = base_cost + cost_margin
>> * util = util_avg
>>
>> note: the small gaps right after the duty cycle transition between
>> t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
>> and no util_est update.
>
> I'm confused by the giant drop in frequency (blue line) around 4.18
>
> schedutil shouldn't select f < max(util_avg, util_est), which is
> violated right about there.

AFAICT that's due to the boost being disabled, which means the allowed
cost becomes smaller. The boost is disabled because the task's
util_est_enqueued has been updated since the last schedutil invocation,
so the rq one changes, which in turn disables boosting.

Here are the normalized costs on the CPU the workload was run on:

|    freq |     cost |     capa |
|--------:|---------:|---------:|
|  450000 |  67.086  |  418.909 |
|  625000 |  72.1509 |  581.818 |
|  800000 |  80.8962 |  744.727 |
|  950000 |  90.1688 |  884.364 |
| 1100000 | 100      | 1024     |

In our case, it seems that schedutil is invoked when the task is
preempted (presumably by schedutil kthread) and we end up with:
util_avg ~= 370
util_est_enqueued ~= 427.

We therefore end up at freq=625000 (capa=581).

> I'm also confused by the base_cost line; how can that be flat until
> somewhere around 4.16.

The minimum frequency on the board being used provides a capacity of
418, so it stays flat until max(util_avg, util_est) goes above that.

> Sadly there is no line for pure schedutil freq to
> compare against.

I could add that one fairly easily. Would you be interested in the
"real" frequency selected by normal schedutil or the "ideal" frequency
schedutil would like to apply?

> Other than that, I can see the green line is consistent with
> util_avg>util_est, and how it help grow the frequency (blue).
>
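The tau definition and the period/boost-rate table given earlier in this message can be reproduced with a short Python sketch. Two things here are assumptions rather than statements from the thread: `half_life = 32` (in PELT windows), which happens to reproduce the quoted tau, and the reading of the table as the time derivative of the util_avg(t) expression, evaluated `period` seconds after enqueue, which happens to reproduce the quoted rates.

```python
import math

HALF_LIFE = 32                 # assumed PELT half-life, in windows
WINDOW = 1024 * 1024 * 1e-9    # window size in seconds, as in the message


def pelt_tau(half_life=HALF_LIFE, window=WINDOW):
    """Time constant of the equivalent low-pass filter, in seconds."""
    decay = 0.5 ** (1 / half_life)
    alpha = 1 - decay
    return window * ((1 - alpha) / alpha)


def boost_rate(period, tau=None):
    """Slope of util_avg(t) in PELT units per ms, `period` s after enqueue.

    Derivative of 1024 * (1 - exp(-t / tau)) with respect to t,
    converted from units per second to units per millisecond.
    """
    if tau is None:
        tau = pelt_tau()
    return 1024 / tau * math.exp(-period / tau) / 1000


if __name__ == "__main__":
    print("tau =", pelt_tau())
    for i in range(9):
        period = 0.016 * i
        print(f"{period:.3f} s -> {boost_rate(period):.4f} PELT unit/ms")
```

The exponential decay of the slope is what makes the boost ramp more aggressive for tasks with short former periods.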
On 2/14/20 1:37 PM, Peter Zijlstra wrote: > On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote: > >>> description of it all somewhere. >> >> Now a textual version of it: >> >> em_pd_get_higher_freq() does the following: >> >> # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a >> # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete >> # value of "max_cost", which is the highest OPP on that CPU. >> concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE; >> >> # Then it finds the lowest OPP satisfying min_freq: >> min_opp = OPP_AT_FREQ(min_freq) >> >> # It takes the cost associated, and finds the highest OPP that has a >> # cost lower than that: >> max_cost = COST_OF(min_opp) + concrete_margin >> >> final_freq = MAX( >> FREQ_OF(opp) >> for opp in available_opps >> if COST_OF(opp) <= max_cost >> ) > > Right; I got that. > >> So this means that: >> util - util_est_enqueued ~= 0 > > Only if you assume the task will get scheduled out reasonably frequent. > >> => cost_margin ~= 0 >> => concrete_cost_margin ~= 0 >> => max_cost = COST_OF(min_opp) + 0 >> => final_freq = FREQ_OF(min_opp) >> >> The effective boost is ~0, so you will get the current behaviour of >> schedutil. > > But the argument holds; because if things don't get scheduled out, we'll > peg u = 1 and hit f = 1 and all is well anyway. > > Which is a useful property; it shows that in the steady state, this > patch-set is a NOP, but the above argument only relies on 'util_avg > > util_est' being used a trigger. Yes, `util_avg > util_est` can only happen when the task's duty cycle is changing, which does not happen at steady state. Either it's periodic and the boost is legitimate, or it's not periodic and we assume it's a periodic task well represented by its last activation and sleep (for the purpose of boosting). Tasks with a high variability in their activation durations (i.e. 
not periodic at all) will likely get more boosting on average, which is probably good since we can't predict much about them, so in doubt we tilt the behaviour of schedutil toward racing to completion. >> If the task starts needing more cycles than during its previous period, >> `util - util_est_enqueued` will grow like util since util_est_enqueued >> is constant. The longer we wait, the higher the boost, until the task >> goes to sleep again. >> >> At next wakeup, util_est_enqueued has caught up and either: >> 1) util becomes stable, so no more boosting >> 2) util keeps increasing, so go for another round of boosting > > Agreed; however elsewhere you wrote: > >> 1) If you care more about predictable battery life (or energy bill) than >> predictability of the boost feature, EM should be used. >> >> 2) If you don't have an EM or you care more about having a predictable >> boost for a given workload, use util (or disable that boost). > > This is the part I'm still not sure about; how do the specifics of the > cost_margin setup lead to 1), or how would some frobbing with frequency > selection destroy that property. This should be answered by this other thread: https://lore.kernel.org/lkml/5d732dc1-d343-24d2-bda9-072021a510ed@arm.com/#t