[RFC,5/8] cpufreq/schedutil: Remove iowait boost

Message ID	20240905092645.2885200-6-christian.loehle@arm.com (mailing list archive)
State	New, archived
Headers	show Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 4C33E19882C; Thu, 5 Sep 2024 09:27:41 +0000 (UTC) From: Christian Loehle <christian.loehle@arm.com> To: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, rafael@kernel.org, peterz@infradead.org Cc: juri.lelli@redhat.com, mingo@redhat.com, dietmar.eggemann@arm.com, vschneid@redhat.com, vincent.guittot@linaro.org, Johannes.Thumshirn@wdc.com, adrian.hunter@intel.com, ulf.hansson@linaro.org, bvanassche@acm.org, andres@anarazel.de, asml.silence@gmail.com, linux-block@vger.kernel.org, io-uring@vger.kernel.org, qyousef@layalina.io, dsmythies@telus.net, axboe@kernel.dk, Christian Loehle <christian.loehle@arm.com> Subject: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost Date: Thu, 5 Sep 2024 10:26:42 +0100 Message-Id: <20240905092645.2885200-6-christian.loehle@arm.com> In-Reply-To: <20240905092645.2885200-1-christian.loehle@arm.com> References: <20240905092645.2885200-1-christian.loehle@arm.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	cpufreq: cpuidle: Remove iowait behaviour \| expand [RFT,RFC,0/8] cpufreq: cpuidle: Remove iowait behaviour [RFC,1/8] cpuidle: menu: Remove iowait influence [RFC,2/8] cpuidle: Prefer teo over menu governor [RFC,3/8] TEST: cpufreq/schedutil: Linear iowait boost step [RFC,4/8] TEST: cpufreq/schedutil: iowait boost cap sysfs [RFC,5/8] cpufreq/schedutil: Remove iowait boost [RFC,6/8] cpufreq: intel_pstate: Remove iowait boost [RFC,7/8] cpufreq: Remove SCHED_CPUFREQ_IOWAIT update [RFC,8/8] io_uring: Do not set iowait before sleeping

Christian Loehle Sept. 5, 2024, 9:26 a.m. UTC

iowait boost in schedutil was introduced by
commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
with it more or less following intel_pstate's approach to increase
frequency after an iowait wakeup.
Behaviour that is piggy-backed onto iowait boost is problematic
due to a lot of reasons, so remove it.

For schedutil specifically these are some of the reasons:
1. Boosting is applied even in scenarios where it doesn't improve
throughput.
2. The boost is not accounted for in EAS: a) feec() will only consider
 the actual task utilization for task placement, but another CPU might
 be more energy-efficient at that capacity than the boosted one.)
 b) When placing a non-IO task while a CPU is boosted compute_energy()
 assumes a lower OPP than what is actually applied. This leads to
 wrong EAS decisions.
3. Actual IO heavy workloads are hardly distinguished from infrequent
in_iowait wakeups.
4. The boost isn't accounted for in task placement.
5. The boost isn't associated with a task, it therefore lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task, it therefore needs to ramp
up again when migrated.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks lead to boosting.
8. Boosting is hard to control with UCLAMP_MAX (which is only active
when the task is on the rq, which for boosted tasks is usually not
the case for most of the time).

One benefit of schedutil specifically is the reliance on the
scheduler's utilization signals, which have evolved a lot since it's
original introduction. Some cases that benefitted from iowait boosting
in the past can now be covered by e.g. util_est.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 181 +------------------------------
 1 file changed, 3 insertions(+), 178 deletions(-)

Rafael J. Wysocki Sept. 30, 2024, 4:34 p.m. UTC | #1

On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> iowait boost in schedutil was introduced by
> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
> with it more or less following intel_pstate's approach to increase
> frequency after an iowait wakeup.
> Behaviour that is piggy-backed onto iowait boost is problematic
> due to a lot of reasons, so remove it.
>
> For schedutil specifically these are some of the reasons:
> 1. Boosting is applied even in scenarios where it doesn't improve
> throughput.

Well, I wouldn't argue this way because it is kind of like saying that
air conditioning is used even when it doesn't really help.  It is
sometimes hard to know in advance whether or not it will help though.

> 2. The boost is not accounted for in EAS: a) feec() will only consider
>  the actual task utilization for task placement, but another CPU might
>  be more energy-efficient at that capacity than the boosted one.)
>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>  assumes a lower OPP than what is actually applied. This leads to
>  wrong EAS decisions.

That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
in your list).

If the goal is to set the adequate performance for a given utilization
level (either actual or prescribed), boosting doesn't really play well
with this and it shouldn't be used at least in these cases.

> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.

Do infrequent in_iowait wakeups really cause the boosting to be
applied at full swing?

> 4. The boost isn't accounted for in task placement.

I'm not sure what exactly this means.  "Big" vs "little" or something else?

> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.

Fair enough, but this is rather a problem with the implementation of
boosting and not with the basic idea of it.

> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.

Well, that again is somewhat implementation-related IMV, and it need
not be problematic in principle.  Namely, if a task migrates and it is
not the only one in the "new" CPUs runqueue, and the other tasks in
there don't use in_iowait, maybe it's better to not boost it?

It also means that boosting is not very consistent, though, which is a
valid point.

> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks lead to boosting.

Well, that's by design: it boosts, when "there is enough IO pressure
in the runqueue", so to speak.

Basically, it is a departure from the "make performance follow
utilization" general idea and it is based on the observation that in
some cases performance can be improved by taking additional
information into account.

It is also about pure performance, not about energy efficiency.

> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
> when the task is on the rq, which for boosted tasks is usually not
> the case for most of the time).
>
> One benefit of schedutil specifically is the reliance on the
> scheduler's utilization signals, which have evolved a lot since it's
> original introduction. Some cases that benefitted from iowait boosting
> in the past can now be covered by e.g. util_est.

And it would be good to give some examples of this.

IMV you have a clean-cut argument in the EAS and UCLAMP_MAX cases, but
apart from that it is all a bit hand-wavy.

> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 181 +------------------------------
>  1 file changed, 3 insertions(+), 178 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 5324f07fc93a..55b8b8ba7238 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -6,12 +6,9 @@
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>   */
>
> -#define IOWAIT_BOOST_MIN       (SCHED_CAPACITY_SCALE / 8)
> -
>  struct sugov_tunables {
>         struct gov_attr_set     attr_set;
>         unsigned int            rate_limit_us;
> -       unsigned int            iowait_boost_cap;
>  };
>
>  struct sugov_policy {
> @@ -36,8 +33,6 @@ struct sugov_policy {
>
>         bool                    limits_changed;
>         bool                    need_freq_update;
> -
> -       unsigned int            iowait_boost_cap;
>  };
>
>  struct sugov_cpu {
> @@ -45,10 +40,6 @@ struct sugov_cpu {
>         struct sugov_policy     *sg_policy;
>         unsigned int            cpu;
>
> -       bool                    iowait_boost_pending;
> -       unsigned int            iowait_boost;
> -       u64                     last_update;
> -
>         unsigned long           util;
>         unsigned long           bw_min;
>
> @@ -198,137 +189,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>         return max(min, max);
>  }
>
> -static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>         unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
>
>         util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> -       util = max(util, boost);
>         sg_cpu->bw_min = min;
>         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
>  }
>
> -/**
> - * sugov_iowait_reset() - Reset the IO boost status of a CPU.
> - * @sg_cpu: the sugov data for the CPU to boost
> - * @time: the update time from the caller
> - * @set_iowait_boost: true if an IO boost has been requested
> - *
> - * The IO wait boost of a task is disabled after a tick since the last update
> - * of a CPU. If a new IO wait boost is requested after more then a tick, then
> - * we enable the boost starting from IOWAIT_BOOST_MIN, which improves energy
> - * efficiency by ignoring sporadic wakeups from IO.
> - */
> -static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
> -                              bool set_iowait_boost)
> -{
> -       s64 delta_ns = time - sg_cpu->last_update;
> -
> -       /* Reset boost only if a tick has elapsed since last request */
> -       if (delta_ns <= TICK_NSEC)
> -               return false;
> -
> -       sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
> -       sg_cpu->iowait_boost_pending = set_iowait_boost;
> -
> -       return true;
> -}
> -
> -/**
> - * sugov_iowait_boost() - Updates the IO boost status of a CPU.
> - * @sg_cpu: the sugov data for the CPU to boost
> - * @time: the update time from the caller
> - * @flags: SCHED_CPUFREQ_IOWAIT if the task is waking up after an IO wait
> - *
> - * Each time a task wakes up after an IO operation, the CPU utilization can be
> - * boosted to a certain utilization which doubles at each "frequent and
> - * successive" wakeup from IO, ranging from IOWAIT_BOOST_MIN to the utilization
> - * of the maximum OPP.
> - *
> - * To keep doubling, an IO boost has to be requested at least once per tick,
> - * otherwise we restart from the utilization of the minimum OPP.
> - */
> -static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned int flags)
> -{
> -       bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> -
> -       /* Reset boost if the CPU appears to have been idle enough */
> -       if (sg_cpu->iowait_boost &&
> -           sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
> -               return;
> -
> -       /* Boost only tasks waking up after IO */
> -       if (!set_iowait_boost)
> -               return;
> -
> -       /* Ensure boost doubles only one time at each request */
> -       if (sg_cpu->iowait_boost_pending)
> -               return;
> -       sg_cpu->iowait_boost_pending = true;
> -
> -       /* Double the boost at each request */
> -       if (sg_cpu->iowait_boost) {
> -               sg_cpu->iowait_boost =
> -                       min_t(unsigned int,
> -                             sg_cpu->iowait_boost + IOWAIT_BOOST_MIN, SCHED_CAPACITY_SCALE);
> -               return;
> -       }
> -
> -       /* First wakeup after IO: start with minimum boost */
> -       sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
> -}
> -
> -/**
> - * sugov_iowait_apply() - Apply the IO boost to a CPU.
> - * @sg_cpu: the sugov data for the cpu to boost
> - * @time: the update time from the caller
> - * @max_cap: the max CPU capacity
> - *
> - * A CPU running a task which woken up after an IO operation can have its
> - * utilization boosted to speed up the completion of those IO operations.
> - * The IO boost value is increased each time a task wakes up from IO, in
> - * sugov_iowait_apply(), and it's instead decreased by this function,
> - * each time an increase has not been requested (!iowait_boost_pending).
> - *
> - * A CPU which also appears to have been idle for at least one tick has also
> - * its IO boost utilization reset.
> - *
> - * This mechanism is designed to boost high frequently IO waiting tasks, while
> - * being more conservative on tasks which does sporadic IO operations.
> - */
> -static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned long max_cap)
> -{
> -       /* No boost currently required */
> -       if (!sg_cpu->iowait_boost)
> -               return 0;
> -
> -       /* Reset boost if the CPU appears to have been idle enough */
> -       if (sugov_iowait_reset(sg_cpu, time, false))
> -               return 0;
> -
> -       if (!sg_cpu->iowait_boost_pending) {
> -               /*
> -                * No boost pending; reduce the boost value.
> -                */
> -               sg_cpu->iowait_boost -= IOWAIT_BOOST_MIN;
> -               if (!sg_cpu->iowait_boost)
> -                       return 0;
> -       }
> -
> -       sg_cpu->iowait_boost_pending = false;
> -
> -       if (sg_cpu->iowait_boost > sg_cpu->sg_policy->iowait_boost_cap)
> -               sg_cpu->iowait_boost = sg_cpu->sg_policy->iowait_boost_cap;
> -
> -       /*
> -        * sg_cpu->util is already in capacity scale; convert iowait_boost
> -        * into the same scale so we can compare.
> -        */
> -       return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> -}
> -
>  #ifdef CONFIG_NO_HZ_COMMON
>  static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>  {
> @@ -356,18 +225,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>                                               u64 time, unsigned long max_cap,
>                                               unsigned int flags)
>  {
> -       unsigned long boost;
> -
> -       sugov_iowait_boost(sg_cpu, time, flags);
> -       sg_cpu->last_update = time;
> -
>         ignore_dl_rate_limit(sg_cpu);
>
>         if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>                 return false;
>
> -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> -       sugov_get_util(sg_cpu, boost);
> +       sugov_get_util(sg_cpu);
>
>         return true;
>  }
> @@ -468,11 +331,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>
>         for_each_cpu(j, policy->cpus) {
>                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> -               unsigned long boost;
> -
> -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> -               sugov_get_util(j_sg_cpu, boost);
>
> +               sugov_get_util(j_sg_cpu);
>                 util = max(j_sg_cpu->util, util);
>         }
>
> @@ -488,9 +348,6 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         raw_spin_lock(&sg_policy->update_lock);
>
> -       sugov_iowait_boost(sg_cpu, time, flags);
> -       sg_cpu->last_update = time;
> -
>         ignore_dl_rate_limit(sg_cpu);
>
>         if (sugov_should_update_freq(sg_policy, time)) {
> @@ -560,14 +417,6 @@ static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
>         return sprintf(buf, "%u\n", tunables->rate_limit_us);
>  }
>
> -
> -static ssize_t iowait_boost_cap_show(struct gov_attr_set *attr_set, char *buf)
> -{
> -       struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> -
> -       return sprintf(buf, "%u\n", tunables->iowait_boost_cap);
> -}
> -
>  static ssize_t
>  rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
>  {
> @@ -586,30 +435,10 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
>         return count;
>  }
>
> -static ssize_t
> -iowait_boost_cap_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
> -{
> -       struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> -       struct sugov_policy *sg_policy;
> -       unsigned int iowait_boost_cap;
> -
> -       if (kstrtouint(buf, 10, &iowait_boost_cap))
> -               return -EINVAL;
> -
> -       tunables->iowait_boost_cap = iowait_boost_cap;
> -
> -       list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> -               sg_policy->iowait_boost_cap = iowait_boost_cap;
> -
> -       return count;
> -}
> -
>  static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
> -static struct governor_attr iowait_boost_cap = __ATTR_RW(iowait_boost_cap);
>
>  static struct attribute *sugov_attrs[] = {
>         &rate_limit_us.attr,
> -       &iowait_boost_cap.attr,
>         NULL
>  };
>  ATTRIBUTE_GROUPS(sugov);
> @@ -799,8 +628,6 @@ static int sugov_init(struct cpufreq_policy *policy)
>
>         tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
>
> -       tunables->iowait_boost_cap = SCHED_CAPACITY_SCALE;
> -
>         policy->governor_data = sg_policy;
>         sg_policy->tunables = tunables;
>
> @@ -870,8 +697,6 @@ static int sugov_start(struct cpufreq_policy *policy)
>         sg_policy->limits_changed               = false;
>         sg_policy->cached_raw_freq              = 0;
>
> -       sg_policy->iowait_boost_cap             = SCHED_CAPACITY_SCALE;
> -
>         sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
>
>         if (policy_is_shared(policy))
> --
> 2.34.1
>

Christian Loehle Oct. 3, 2024, 9:10 a.m. UTC | #2

On 9/30/24 17:34, Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> iowait boost in schedutil was introduced by
>> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
>> with it more or less following intel_pstate's approach to increase
>> frequency after an iowait wakeup.
>> Behaviour that is piggy-backed onto iowait boost is problematic
>> due to a lot of reasons, so remove it.
>>
>> For schedutil specifically these are some of the reasons:
>> 1. Boosting is applied even in scenarios where it doesn't improve
>> throughput.
> 
> Well, I wouldn't argue this way because it is kind of like saying that
> air conditioning is used even when it doesn't really help.  It is
> sometimes hard to know in advance whether or not it will help though.

Right, it's a heuristic that's often wrong and costs energy when it
triggers is what I was trying to say.

> 
>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>  the actual task utilization for task placement, but another CPU might
>>  be more energy-efficient at that capacity than the boosted one.)
>>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>>  assumes a lower OPP than what is actually applied. This leads to
>>  wrong EAS decisions.
> 
> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
> in your list).
> 
> If the goal is to set the adequate performance for a given utilization
> level (either actual or prescribed), boosting doesn't really play well
> with this and it shouldn't be used at least in these cases.
> 
>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>> in_iowait wakeups.
> 
> Do infrequent in_iowait wakeups really cause the boosting to be
> applied at full swing?

Maybe not full swing, but the relatively high rate_limit_us and TICK_NSEC
found on Android deivces does indeed lead to occasional boosting periods
even for 'infrequent'/unrelated wakeups.

> 
>> 4. The boost isn't accounted for in task placement.
> 
> I'm not sure what exactly this means.  "Big" vs "little" or something else?

That should be "[...] in task placement for HMP", you're right.
Essentially if we were to consider a task to be 100% of capacity boost-worthy,
we need to consider that at task placement. Now we cap out at the local CPU,
which might be rather small. (~10% of the biggest CPU on mobile).
Logically this argument (a CAS argument essentially), should probably come
before the EAS one to make more sense.

>> 5. The boost isn't associated with a task, it therefore lingers on the
>> rq even after the responsible task has migrated / stopped.
> 
> Fair enough, but this is rather a problem with the implementation of
> boosting and not with the basic idea of it.

Unfortunately the lingering (or to use a term with less negative connotation:
holding) almost is a necessity, too, as described in the cover-letter.
If we only boost at enqueue (and immediately scale down on dequeue) we lose
out massively, as the interrupt isn't boosted and we have to run at the lower
frequency for the DVFS transition delay (even if on x86 that may be close to
negligible). IMO this is the main reason why the mechanism can't evolve (into
something like a per-task strategy).
Even a per-task strategy would need to a) set a timer in case the iowait
period is too long and b) remove boost from prev_cpu if enqueued somewhere
else.

> 
>> 6. The boost isn't associated with a task, it therefore needs to ramp
>> up again when migrated.
> 
> Well, that again is somewhat implementation-related IMV, and it need
> not be problematic in principle.  Namely, if a task migrates and it is
> not the only one in the "new" CPUs runqueue, and the other tasks in
> there don't use in_iowait, maybe it's better to not boost it?

Agreed, this can be argued about (and also isn't a huge problem in
practice).

> 
> It also means that boosting is not very consistent, though, which is a
> valid point.
> 
>> 7. Since schedutil doesn't know which task is getting woken up,
>> multiple unrelated in_iowait tasks lead to boosting.
> 
> Well, that's by design: it boosts, when "there is enough IO pressure
> in the runqueue", so to speak.> 
> Basically, it is a departure from the "make performance follow
> utilization" general idea and it is based on the observation that in
> some cases performance can be improved by taking additional
> information into account.
> 
> It is also about pure performance, not about energy efficiency.

And the lines between those become more and more blurry, see the GFX
regression. There's very few free lunches up for grabs these days, if
you're boosting performance on X, you're likely paying for it on Y.
That is fine as long as boosting X is deliberate which iowait boosting
very much is not.

> 
>> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
>> when the task is on the rq, which for boosted tasks is usually not
>> the case for most of the time).
>>
>> One benefit of schedutil specifically is the reliance on the
>> scheduler's utilization signals, which have evolved a lot since it's
>> original introduction. Some cases that benefitted from iowait boosting
>> in the past can now be covered by e.g. util_est.
> 
> And it would be good to give some examples of this.
> 
> IMV you have a clean-cut argument in the EAS and UCLAMP_MAX cases, but
> apart from that it is all a bit hand-wavy.

Thanks Rafael, you brought up some good points!

Quentin Perret Oct. 3, 2024, 9:47 a.m. UTC | #3

On Monday 30 Sep 2024 at 18:34:24 (+0200), Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > iowait boost in schedutil was introduced by
> > commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
> > with it more or less following intel_pstate's approach to increase
> > frequency after an iowait wakeup.
> > Behaviour that is piggy-backed onto iowait boost is problematic
> > due to a lot of reasons, so remove it.
> >
> > For schedutil specifically these are some of the reasons:
> > 1. Boosting is applied even in scenarios where it doesn't improve
> > throughput.
> 
> Well, I wouldn't argue this way because it is kind of like saying that
> air conditioning is used even when it doesn't really help.  It is
> sometimes hard to know in advance whether or not it will help though.
> 
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> >  the actual task utilization for task placement, but another CPU might
> >  be more energy-efficient at that capacity than the boosted one.)
> >  b) When placing a non-IO task while a CPU is boosted compute_energy()
> >  assumes a lower OPP than what is actually applied. This leads to
> >  wrong EAS decisions.
> 
> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
> in your list).

I would actually argue that this is also an implementation problem
rather than something fundamental about boosting. EAS could be taught
about iowait boosting and factor that into the decisions.

> If the goal is to set the adequate performance for a given utilization
> level (either actual or prescribed), boosting doesn't really play well
> with this and it shouldn't be used at least in these cases.

There's plenty of cases where EAS will correctly understand that
migrating a task away will not reduce the OPP (e.g. another task on the
rq has a uclamp_min request, or another CPU in the perf domain has a
higher request), so iowait boosting could probably be added.

In fact if the iowait boost was made a task property, EAS could easily
understand the effect of migrating that boost with the task (it's not
fundamentally different from migrating a task with a high uclamp_min
from the energy model perspective).

> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
> 
> Do infrequent in_iowait wakeups really cause the boosting to be
> applied at full swing?
> 
> > 4. The boost isn't accounted for in task placement.
> 
> I'm not sure what exactly this means.  "Big" vs "little" or something else?
> 
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
> 
> Fair enough, but this is rather a problem with the implementation of
> boosting and not with the basic idea of it.

+1

> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
> 
> Well, that again is somewhat implementation-related IMV, and it need
> not be problematic in principle.  Namely, if a task migrates and it is
> not the only one in the "new" CPUs runqueue, and the other tasks in
> there don't use in_iowait, maybe it's better to not boost it?
> 
> It also means that boosting is not very consistent, though, which is a
> valid point.
> 
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks lead to boosting.
> 
> Well, that's by design: it boosts, when "there is enough IO pressure
> in the runqueue", so to speak.
> 
> Basically, it is a departure from the "make performance follow
> utilization" general idea and it is based on the observation that in
> some cases performance can be improved by taking additional
> information into account.
> 
> It is also about pure performance, not about energy efficiency.
> 
> > 8. Boosting is hard to control with UCLAMP_MAX (which is only active
> > when the task is on the rq, which for boosted tasks is usually not
> > the case for most of the time).

Sounds like another reason to make iowait boosting per-task to me :-)

I've always thought that turning iowait boosting into some sort of
in-kernel uclamp_min request would be a good approach for most of the
issues mentioned above. Note that I'm not necessarily saying to use the
actual uclamp infrastructure (though it's valid option), I'm really just
talking about the concept. Is that something you've considered?

I presume we could even factor out the 'logic' part of the code that
decides out to request the boost into its own thing, and possibly have
different policies for different use-cases, but that might be overkill.

Thanks,
Quentin

Christian Loehle Oct. 3, 2024, 10:30 a.m. UTC | #4

On 10/3/24 10:47, Quentin Perret wrote:
> On Monday 30 Sep 2024 at 18:34:24 (+0200), Rafael J. Wysocki wrote:
>> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
>> <christian.loehle@arm.com> wrote:
>>>
>>> iowait boost in schedutil was introduced by
>>> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
>>> with it more or less following intel_pstate's approach to increase
>>> frequency after an iowait wakeup.
>>> Behaviour that is piggy-backed onto iowait boost is problematic
>>> due to a lot of reasons, so remove it.
>>>
>>> For schedutil specifically these are some of the reasons:
>>> 1. Boosting is applied even in scenarios where it doesn't improve
>>> throughput.
>>
>> Well, I wouldn't argue this way because it is kind of like saying that
>> air conditioning is used even when it doesn't really help.  It is
>> sometimes hard to know in advance whether or not it will help though.
>>
>>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>>  the actual task utilization for task placement, but another CPU might
>>>  be more energy-efficient at that capacity than the boosted one.)
>>>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>>>  assumes a lower OPP than what is actually applied. This leads to
>>>  wrong EAS decisions.
>>
>> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
>> in your list).
> 
> I would actually argue that this is also an implementation problem
> rather than something fundamental about boosting. EAS could be taught
> about iowait boosting and factor that into the decisions.

Definitely, and I did do exactly that.

> 
>> If the goal is to set the adequate performance for a given utilization
>> level (either actual or prescribed), boosting doesn't really play well
>> with this and it shouldn't be used at least in these cases.
> 
> There's plenty of cases where EAS will correctly understand that
> migrating a task away will not reduce the OPP (e.g. another task on the
> rq has a uclamp_min request, or another CPU in the perf domain has a
> higher request), so iowait boosting could probably be added.
> 
> In fact if the iowait boost was made a task property, EAS could easily
> understand the effect of migrating that boost with the task (it's not
> fundamentally different from migrating a task with a high uclamp_min
> from the energy model perspective).

True.
> 
>>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>>> in_iowait wakeups.
>>
>> Do infrequent in_iowait wakeups really cause the boosting to be
>> applied at full swing?
>>
>>> 4. The boost isn't accounted for in task placement.
>>
>> I'm not sure what exactly this means.  "Big" vs "little" or something else?
>>
>>> 5. The boost isn't associated with a task, it therefore lingers on the
>>> rq even after the responsible task has migrated / stopped.
>>
>> Fair enough, but this is rather a problem with the implementation of
>> boosting and not with the basic idea of it.
> 
> +1
> 
>>> 6. The boost isn't associated with a task, it therefore needs to ramp
>>> up again when migrated.
>>
>> Well, that again is somewhat implementation-related IMV, and it need
>> not be problematic in principle.  Namely, if a task migrates and it is
>> not the only one in the "new" CPUs runqueue, and the other tasks in
>> there don't use in_iowait, maybe it's better to not boost it?
>>
>> It also means that boosting is not very consistent, though, which is a
>> valid point.
>>
>>> 7. Since schedutil doesn't know which task is getting woken up,
>>> multiple unrelated in_iowait tasks lead to boosting.
>>
>> Well, that's by design: it boosts, when "there is enough IO pressure
>> in the runqueue", so to speak.
>>
>> Basically, it is a departure from the "make performance follow
>> utilization" general idea and it is based on the observation that in
>> some cases performance can be improved by taking additional
>> information into account.
>>
>> It is also about pure performance, not about energy efficiency.
>>
>>> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
>>> when the task is on the rq, which for boosted tasks is usually not
>>> the case for most of the time).
> 
> Sounds like another reason to make iowait boosting per-task to me :-)
> 
> I've always thought that turning iowait boosting into some sort of
> in-kernel uclamp_min request would be a good approach for most of the
> issues mentioned above. Note that I'm not necessarily saying to use the
> actual uclamp infrastructure (though it's valid option), I'm really just
> talking about the concept. Is that something you've considered?
> 
> I presume we could even factor out the 'logic' part of the code that
> decides out to request the boost into its own thing, and possibly have
> different policies for different use-cases, but that might be overkill.

See the cover-letter part on per-task iowait boosting, specifically:
[1]
v1 per-task io boost
https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/
v2 per-task io boost
https://lore.kernel.org/lkml/20240518113947.2127802-2-christian.loehle@arm.com/
[2]
OSPM24 discussion iowait boosting
https://www.youtube.com/watch?v=MSQGEsSziZ4

These are the main issues with transforming the existing mechanism into
a per-task attribute.
Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
actually improve throughput even (assuming for now that this throughput is
something we care about, I'm sure you know that isn't always the case, e.g.
background tasks). With MCQ devices and some reasonable IO workload that is
IO-bound our iowait boosting is often just boosting CPU frequency (which uses
power obviously) to queue in yet another request for a device which has essentially
endless pending requests. If pending request N+1 arrives x usecs earlier or
later at the device then makes no difference in IO throughput.
If boosting would improve e.g. IOPS (of that device) is something the block layer
(with a lot of added infrastructure, but at least in theory it would know what
device we're iowaiting on, unlike the scheduler) could tell us about. If that is
actually useful for user experience (i.e. worth the power) only userspace can decide
(and then we're back at uclamp_min anyway).
(The above all assumes that iowait even means "is waiting for block IO and
about to send another block IO" which is far from reality.)

Thanks Quentin for getting involved, your input is very much appreciated!

Regards,
Christian

Andres Freund Oct. 5, 2024, 12:39 a.m. UTC | #5

Hi,

A caveat: I'm a userspace developer that occasionally strays into kernel land
(see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
things wrong.

On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
> These are the main issues with transforming the existing mechanism into
> a per-task attribute.
> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
> actually improve throughput even (assuming for now that this throughput is
> something we care about, I'm sure you know that isn't always the case, e.g.
> background tasks). With MCQ devices and some reasonable IO workload that is
> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
> power obviously) to queue in yet another request for a device which has essentially
> endless pending requests. If pending request N+1 arrives x usecs earlier or
> later at the device then makes no difference in IO throughput.

That's sometimes true, but definitely not all the time? There are plenty
workloads with low-queue-depth style IO. Which often are also rather latency
sensitive.

E.g. the device a database journal resides on will typically have a low queue
depth. It's extremely common in OLTPish workloads to be bound by the latency
of journal flushes. If, after the journal flush completes, the CPU is clocked
low and takes a while to wake up, you'll see substantially worse performance.

> If boosting would improve e.g. IOPS (of that device) is something the block layer
> (with a lot of added infrastructure, but at least in theory it would know what
> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
> actually useful for user experience (i.e. worth the power) only userspace can decide
> (and then we're back at uclamp_min anyway).

I think there are many cases where userspace won't realistically be able to do
anything about that.

For one, just because, for some workload, a too deep idle state is bad during
IO, doesn't mean userspace won't ever want to clock down. And it's probably
going to be too expensive to change any attributes around idle states for
individual IOs.

Are there actually any non-privileged APIs around this that userspace *could*
even change? I'd not consider moving to busy-polling based APIs a realistic
alternative.

For many workloads cpuidle is way too aggressive dropping into lower states
*despite* iowait. But just disabling all lower idle states obviously has
undesirable energy usage implications. It surely is the answer for some
workloads, but I don't think it'd be good to promote it as the sole solution.

It's easy to under-estimate the real-world impact of a change like this. When
benchmarking we tend to see what kind of throughput we can get, by having N
clients hammering the server as fast as they can. But in the real world that's
pretty rare for anything latency sensitive to go full blast - rather there's a
rate of requests incoming and that the clients are sensitive to requests being
processed more slowly.

That's not to say that the current situation can't be improved - I've seen way
too many workloads where the only ways to get decent performance were one of:

- disable most idle states (via sysfs or /dev/cpu_dma_latency)
- just have busy loops when idling - doesn't work when doing synchronous
  syscalls that block though
- have some lower priority tasks scheduled that just burns CPU

I'm just worried that removing iowait will make this worse.

Greetings,

Andres Freund

Christian Loehle Oct. 9, 2024, 9:54 a.m. UTC | #6

On 10/5/24 01:39, Andres Freund wrote:
> Hi,
> 
> 
> A caveat: I'm a userspace developer that occasionally strays into kernel land
> (see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
> things wrong.

Thank you for your input!

> 
> On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
>> These are the main issues with transforming the existing mechanism into
>> a per-task attribute.
>> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
>> actually improve throughput even (assuming for now that this throughput is
>> something we care about, I'm sure you know that isn't always the case, e.g.
>> background tasks). With MCQ devices and some reasonable IO workload that is
>> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
>> power obviously) to queue in yet another request for a device which has essentially
>> endless pending requests. If pending request N+1 arrives x usecs earlier or
>> later at the device then makes no difference in IO throughput.
> 
> That's sometimes true, but definitely not all the time? There are plenty
> workloads with low-queue-depth style IO. Which often are also rather latency
> sensitive.
> 
> E.g. the device a database journal resides on will typically have a low queue
> depth. It's extremely common in OLTPish workloads to be bound by the latency
> of journal flushes. If, after the journal flush completes, the CPU is clocked
> low and takes a while to wake up, you'll see substantially worse performance.

Yeah absolutely and if we knew what a latency-sensitive journal flush is tuning
cpuidle and cpufreq to it would probably be reasonable.
I did test mmtests filebench-oltp that looked fine, do you have any other
benchmarks you would like to see?

>> If boosting would improve e.g. IOPS (of that device) is something the block layer
>> (with a lot of added infrastructure, but at least in theory it would know what
>> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
>> actually useful for user experience (i.e. worth the power) only userspace can decide
>> (and then we're back at uclamp_min anyway).
> 
> I think there are many cases where userspace won't realistically be able to do
> anything about that.
> 
> For one, just because, for some workload, a too deep idle state is bad during
> IO, doesn't mean userspace won't ever want to clock down. And it's probably
> going to be too expensive to change any attributes around idle states for
> individual IOs.

So the kernel currently applies these to all of them essentially.

> 
> Are there actually any non-privileged APIs around this that userspace *could*
> even change? I'd not consider moving to busy-polling based APIs a realistic
> alternative.

No and I'm not sure an actual non-privileged API would be a good idea, would
it? It is essentially changing hardware behavior.
So does busy-polling of course, but the kernel can at least curb that and
maintain fairness and so forth.

> 
> For many workloads cpuidle is way too aggressive dropping into lower states
> *despite* iowait. But just disabling all lower idle states obviously has
> undesirable energy usage implications. It surely is the answer for some
> workloads, but I don't think it'd be good to promote it as the sole solution.

Right, but we (cpuidle) don't know how to distinguish the two, we just do it
for all of them. Whether kernel or userspace applies the same (awful) heuristic
doesn't make that much of a difference in practice.

> 
> It's easy to under-estimate the real-world impact of a change like this. When
> benchmarking we tend to see what kind of throughput we can get, by having N
> clients hammering the server as fast as they can. But in the real world that's
> pretty rare for anything latency sensitive to go full blast - rather there's a
> rate of requests incoming and that the clients are sensitive to requests being
> processed more slowly.

Agreed, this series is posted as RFT and I'm happy to take a look at any
regressions for both the cpufreq and cpuidle parts of it.

> 
> 
> That's not to say that the current situation can't be improved - I've seen way
> too many workloads where the only ways to get decent performance were one of:
> 
> - disable most idle states (via sysfs or /dev/cpu_dma_latency)
> - just have busy loops when idling - doesn't work when doing synchronous
>   syscalls that block though
> - have some lower priority tasks scheduled that just burns CPU
> 
> I'm just worried that removing iowait will make this worse.

I just need to mention again that almost all of what you replied does refer to
cpuidle, not cpufreq (which this particular patch was about), not to create more
confusion.

Regards,
Christian

[RFC,5/8] cpufreq/schedutil: Remove iowait boost

Commit Message

Comments

Patch