[RFC,0/2] Introduce per-task io utilization boost

Message ID 20240304201625.100619-1-christian.loehle@arm.com (mailing list archive)

Message

Christian Loehle March 4, 2024, 8:16 p.m. UTC
There is a feature inside both schedutil and intel_pstate called
iowait boosting which tries to prevent selecting a low frequency
during IO workloads when it would impact throughput.
The feature is implemented by checking for task wakeups that have
the in_iowait flag set and boosting the CPU of the rq accordingly
(implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
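
For reference, the mainline mechanism roughly looks as follows
(paraphrased, not verbatim kernel code):

	/* kernel/sched/fair.c, enqueue_task_fair(): enqueueing a task
	 * that was in iowait explicitly pokes cpufreq with the IOWAIT
	 * flag: */
	if (p->in_iowait)
		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);

	/* kernel/sched/cpufreq_schedutil.c then doubles the rq's iowait
	 * boost on each such wakeup (from IOWAIT_BOOST_MIN up to
	 * SCHED_CAPACITY_SCALE), halves it on updates without the flag
	 * and resets it after TICK_NSEC without an iowait wakeup. */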

The feature is justified by the potentially low utilization of a task
that is frequently in_iowait (i.e. most of the time not enqueued on
any rq and therefore unable to build up utilization).

The RFC focuses on the schedutil implementation.
intel_pstate frequency selection isn't touched for now; suggestions are
very welcome.
Current schedutil iowait boosting has several issues:
1. Boosting happens even in scenarios where it doesn't improve
throughput. [1]
2. The boost is not accounted for in EAS: a) feec() will only consider
 the actual utilization for task placement, but another CPU might be
 more energy-efficient at that capacity than the boosted one.
 b) When placing a non-IO task while a CPU is boosted, compute_energy()
 will not consider the (potentially 'free') boosted capacity, but the
 one it would have without the boost (since the boost is only applied
 in sugov).
3. Genuinely IO-heavy workloads are hardly distinguished from
infrequent in_iowait wakeups.
4. The boost isn't associated with a task; it therefore isn't considered
for task placement, potentially missing out on higher capacity CPUs on
heterogeneous CPU topologies.
5. The boost isn't associated with a task; it therefore lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task; it therefore needs to ramp
up again after the task is migrated.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks might lead to boosting.

We attempt to mitigate all of the above by reworking the way the
iowait boosting (io boosting from here on) works in two major ways:
- Carry the boost in task_struct, so it is a per-task attribute and
behaves similarly to the task's utilization in some ways.
- Employ a counting-based tracking strategy that only boosts as long
as it sees benefits and dynamically returns to no boosting (a rough
sketch follows below).
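
A rough sketch of the shape of that tracking (paraphrasing patch 1/2;
the io_boost_threshold_up name is a placeholder here, see the patch
for the actual fields and policy):

	/* Evaluated at dequeue time, i.e. once per iowait sleep: */
	if (p->io_boost_curr_ios >= p->io_boost_threshold_up) {
		/* Boosting still increases the iowait wakeups seen
		 * per interval: boost more. */
		if (p->io_boost_level < IO_BOOST_LEVELS)
			io_boost_scale_interval(p, true);
	} else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
		/* No benefit observed anymore: back off, eventually
		 * down to no boost at all. */
		if (p->io_boost_level > 1)
			io_boost_scale_interval(p, false);
		else
			p->io_boost_level = 0;
	}
	/* io_boost_level then maps to a utilization boost carried in
	 * task_struct, so it follows the task across migrations. */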

Note that some of the issues (1, 3) can be solved by using a
counting-based strategy on a per-rq basis, i.e. entirely in sugov.
Experiments with Android in particular showed that such a strategy
(which necessarily needs longer intervals to be reasonably stable)
is too prone to migrations to be generally useful.
We therefore consider the additional complexity of the proposed
per-task approach to be worth it.

We require a minimum of 1000 iowait wakeups per second to start
boosting.
This isn't too far off from what sugov currently does, since it resets
the boost if it hasn't seen an iowait wakeup for TICK_NSEC, i.e. it
requires one wakeup per tick to sustain the boost.
For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
We justify this with the small improvement that boosting can deliver
in the first place when iowait wakeups are rare.

When IO even leads to a task being in iowait at all isn't
straightforward to explain.
Of course if the issued IO can be served by the page cache (e.g. on
reads because the pages are already cached, on writes because they can
be marked dirty and writeback takes care of them later), the actual
issuing task is usually not in iowait.
We consider this the good case, since whenever the scheduler and a
potential userspace/kernel switch are in the critical path for IO,
there is likely overhead impacting throughput.
We therefore focus on random read from here on, because (on synchronous
IO [3]) this will lead to the task being set in iowait for every IO.
This is where iowait boosting shows its biggest throughput improvement.
From here on IOPS (IO operations per second) and iowait wakeups may
therefore be used interchangeably.

Performance:
Throughput for random read aims to be on par with the sugov
implementation of iowait boosting for reasonably long-lived workloads.
See the following table for some results; values are in IOPS, the
tests are run for 30s with pauses in-between, and results are sorted.
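
One way to collect such a row (sketched here for a psync qd=1 run; the
ioengine/thread/iodepth parameters vary per test):

	for i in $(seq 0 4); do
		fio --minimal --time_based --name=test \
		    --filename=/dev/nvme0n1 --runtime=30 \
		    --rw=randread --bs=4k | cut -d \; -f 8
		sleep 30
	done | sort -n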

nvme on rk3399
[3588, 3590, 3597, 3632, 3745] sugov mainline
[3581, 3751, 3770, 3771, 3885] per-task tracking
[2592, 2639, 2701, 2717, 2784] sugov no iowait boost
[3218, 3451, 3598, 3848, 3921] performance governor

emmc with cqe on rk3399
[4146, 4155, 4159, 4161, 4193] sugov mainline
[2848, 3217, 4375, 4380, 4454] per-task tracking
[2510, 2665, 3093, 3101, 3105] sugov no iowait boost
[4690, 4803, 4860, 4976, 5069] performance governor

sd card on rk3399
[1777, 1780, 1806, 1827, 1850] sugov mainline
[1470, 1476, 1507, 1534, 1586] per-task tracking
[1356, 1372, 1373, 1377, 1416] sugov no iowait boost
[1861, 1890, 1901, 1905, 1908] performance governor

Pixel 6 ufs Android 14 (7 runs because the device showed some variance)
[6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
[7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
[2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
[7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor

Apple M1 apple-nvme
[27421, 28331, 28515, 28699, 29529] sugov mainline
[27274, 27344, 27345, 27384, 27930] per-task tracking
[14480, 14512, 14625, 14872, 14967] sugov no iowait boost
[31595, 32085, 32386, 32465, 32643] performance governor

Showcasing some different IO scenarios, again all random read,
median out of 5 runs, all on rk3399 with NVMe.
e.g. io_uring6x4 means 6 threads with an iodepth of 4 each; results
can be obtained using:
fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8

+---------------+----------------+-------------------+----------------+-------------+-----------+
|               | Sugov mainline | Per-task tracking | Sugov no boost | Performance | Powersave |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|       psyncx1 |           4073 |              3793 |           2979 |        4190 |      2788 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|       psyncx4 |          13921 |             13503 |          10635 |       13931 |     10225 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|       psyncx6 |          18473 |             17866 |          15902 |       19080 |     15789 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|       psyncx8 |          22498 |             21242 |          19867 |       22650 |     18837 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|      psyncx10 |          24801 |             23552 |          23658 |       25096 |     21474 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|      psyncx12 |          26743 |             25377 |          26372 |       26663 |     23613 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|     libaio1x1 |           4054 |              3542 |           2776 |        4055 |      2780 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|   libaio1x128 |           3959 |              3516 |           2758 |        3590 |      2560 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|   libaio4x128 |          13451 |             12517 |          10313 |       13403 |      9994 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|     libaio6x1 |          18394 |             17432 |          15340 |       18954 |     15251 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|     libaio6x4 |          18329 |             17100 |          15238 |       18623 |     15270 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|   libaio6x128 |          18066 |             16964 |          15139 |       18577 |     15192 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|   io_uring1x1 |           4043 |              3548 |           2810 |        4039 |      2689 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|  io_uring4x64 |          35790 |             32814 |          35983 |       34934 |     33254 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring1x128 |          32651 |             30427 |          32429 |       33232 |      9973 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring2x128 |          34928 |             32595 |          34922 |       33726 |     18790 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring4x128 |          34414 |             32173 |          34932 |       33332 |     33005 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
|   io_uring6x4 |          31578 |             29260 |          31714 |       31399 |     31784 |
+---------------+----------------+-------------------+----------------+-------------+-----------+
| io_uring6x128 |          34480 |             32634 |          34973 |       33390 |     36452 |
+---------------+----------------+-------------------+----------------+-------------+-----------+

Based on the above we can categorize these scenarios into the
following three:
a) boost is useful
b) boost irrelevant (util dominates)
c) boost is energy-inefficient (boost dominates)

The aim of patch 1/2 is to boost as much as necessary for a) while
boosting as little as possible for c) (thus saving energy).

Energy-savings:
Regarding sugov iowait boosting problem 1 mentioned earlier,
some improvement can be seen:
Tested on rk3399 (LLLL)(bb) with an NVMe, 30s runtime
CPU0 perf domain spans 0-3 with 400MHz to 1400MHz
CPU4 perf domain spans 4-5 with 400MHz to 1800MHz

io_uring6x128:
Sugov iowait boost:
Average frequency for CPU0 : 1.180 GHz
Average frequency for CPU4 : 1.504 GHz
Per-task tracking:
Average frequency for CPU0 : 1.070 GHz
Average frequency for CPU4 : 1.211 GHz

io_uring12x128:
Sugov iowait boost:
Average frequency for CPU0 : 1.324 GHz
Average frequency for CPU4 : 1.444 GHz
Per-task tracking:
Average frequency for CPU0 : 1.260 GHz
Average frequency for CPU4 : 1.062 GHz
(In both cases 400MHz on both perf domains would actually be optimal;
more fine-tuning could get us closer [2].)
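
The average frequencies can be computed from cpufreq statistics; a
minimal sketch, assuming CONFIG_CPU_FREQ_STAT and that
/sys/devices/system/cpu/cpuN/cpufreq/stats/time_in_state (frequency in
kHz per line) was sampled into before.txt and after.txt around the run:

	paste before.txt after.txt | awk \
		'{ dt = $4 - $2; f += $1 * dt; t += dt }
		 END { if (t) printf "%.3f GHz\n", f / t / 1e6 }'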

[1]
There are many scenarios when it doesn't, so let's start with
explaining when it does:
Boosting improves throughput if there is frequent IO to a device from
one or few origins, such that the device is likely idle while the task
is enqueued on the rq, and reducing this time cuts down on the storage
device's idle time.
This might not be true (and boosting doesn't help) if:
- The storage device uses the idle time to actually commit the IO to
persistent storage or do other management activity (this can be
observed with e.g. writes to flash-based storage, which will usually
write to cache and flush the cache when idle or necessary).
- The device is under thermal pressure and needs idle time to cool off
(not uncommon for e.g. nvme devices).
Furthermore, the assumption (the device being idle while the task is
enqueued) is false altogether if:
- Other tasks use the same storage device.
- The task uses asynchronous IO with iodepth > 1, like io_uring; the
in_iowait then only serves to refill the queue on the host.
- The task merely sets in_iowait to signal it is waiting on IO so as
not to appear as system idle; it might not issue any IO at all (cf.
the various occurrences of in_iowait, io_mutex_lock and io_schedule*).

[3]
Unfortunately iowait may be set even for asynchronous IO; in the case
of io_uring this is done specifically so that the iowait boost
triggers, see commit 8a796565cec3 ("io_uring: Use io_schedule* in
cqring wait").
This is why the energy-savings are so significant here, as the
io_uring load on the CPU is minimal.
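
For illustration, the sleep path that sets the flag (paraphrased from
kernel/sched/core.c):

	void __sched io_schedule(void)
	{
		int token;

		/* io_schedule_prepare() sets current->in_iowait = 1
		 * (and flushes the block plug), so any wakeup from
		 * this sleep counts as an iowait wakeup. */
		token = io_schedule_prepare();
		schedule();
		/* Restore the previous in_iowait value. */
		io_schedule_finish(token);
	}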

Problems encountered:
- A higher capacity is not always beneficial: we might place the task
away from the CPU where the interrupt handler is running, making it run
on an unboosted CPU, which may hurt throughput more than the capacity
difference of the CPU the task moved to helps. (Of course the boost
will then be reverted again, but a ping-pong every interval is
possible.)
- [2] Tracking and scaling can be improved (io_uring12x128 still shows
boosting): unfortunately tracking purely per-task shows some limits.
One task might show more iowaits per second when boosted while overall
throughput doesn't increase => some boost remains.
The per-task throughput improvement is somewhat limited though, so
fine-tuning the thresholds could provide mitigations.

Christian Loehle (2):
  sched/fair: Introduce per-task io util boost
  cpufreq/schedutil: Remove iowait boost

 include/linux/sched.h            |  15 +++
 kernel/sched/cpufreq_schedutil.c | 150 ++--------------------------
 kernel/sched/fair.c              | 165 +++++++++++++++++++++++++++++--
 kernel/sched/sched.h             |   4 +-
 4 files changed, 182 insertions(+), 152 deletions(-)

--
2.34.1

Comments

Bart Van Assche March 5, 2024, 12:20 a.m. UTC | #1
On 3/4/24 12:16, Christian Loehle wrote:
> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor

Variance of performance results for Pixel devices can be reduced greatly
by disabling devfreq scaling, e.g. as follows (this may cause thermal
issues if the system load is high enough):

     for d in $(adb shell echo /sys/class/devfreq/*); do
	adb shell "cat $d/available_frequencies |
		tr ' ' '\n' |
		sort -n |
		case $devfreq in
			min) head -n1;;
			max) tail -n1;;
		esac > $d/min_freq"
     done

> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8

So buffered I/O was used during this test? Shouldn't direct I/O be used
for this kind of test (--buffered=0)? Additionally, which I/O scheduler
was configured? I recommend --ioscheduler=none for this kind of test.

> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).

In the above I see "the interrupt handler". Does this mean that the NVMe
controller in the test setup only supports one completion interrupt for
all completion queues instead of one completion interrupt per completion
queue? There are already Android phones and developer boards available
that support the latter, namely the boards equipped with a UFSHCI 4.0 
controller.

Thanks,

Bart.
Christian Loehle March 5, 2024, 9:13 a.m. UTC | #2
Hi Bart,

On 05/03/2024 00:20, Bart Van Assche wrote:
> On 3/4/24 12:16, Christian Loehle wrote:
>> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
>> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
>> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
>> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
>> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
> 
> Variance of performance results for Pixel devices can be reduced greatly
> by disabling devfreq scaling, e.g. as follows (this may cause thermal
> issues if the system load is high enough):
> 
>      for d in $(adb shell echo /sys/class/devfreq/*); do
>     adb shell "cat $d/available_frequencies |
>         tr ' ' '\n' |
>         sort -n |
>         case $devfreq in
>             min) head -n1;;
>             max) tail -n1;;
>         esac > $d/min_freq"
>     done
> 

Thanks for the hint!

>> Showcasing some different IO scenarios, again all random read,
>> median out of 5 runs, all on rk3399 with NVMe.
>> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
>> obtained using:
>> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
> 
> So buffered I/O was used during this test? Shouldn't direct I/O be used
> for this kind of test (--buffered=0)? Additionally, which I/O scheduler
> was configured? I recommend --ioscheduler=none for this kind of test.

Yes, I opted for buffered I/O. I guess it's the eternal question of
whether you should benchmark the device/stack (O_DIRECT) or be more
realistic to actual use cases; I opted for the latter, but since it's
4K randread on significantly large devices the results don't differ
too much.


>> - Higher cap is not always beneficial, we might place the task away
>> from the CPU where the interrupt handler is running, making it run
>> on an unboosted CPU which may have a bigger impact than the difference
>> between the CPU's capacity the task moved to. (Of course the boost will
>> then be reverted again, but a ping-pong every interval is possible).
> 
> In the above I see "the interrupt handler". Does this mean that the NVMe
> controller in the test setup only supports one completion interrupt for
> all completion queues instead of one completion interrupt per completion
> queue? There are already Android phones and developer boards available
> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.

No, both NVMe test setups have one completion interrupt per completion
queue, so this caveat doesn't affect them; a higher capacity CPU is
strictly better there.
The UFS and both mmc setups (eMMC with CQE and sd card) only have one
completion interrupt (on CPU0 in my setup).
The difference between the CPU capacities on the Pixel 6 is able to
make up for this: the big CPU is still the best one to run these
single-threaded fio benchmarks on in terms of throughput.
FWIW you do gain an additional ~20% (in my specific setup) if you move
the ufshcd interrupt to a big CPU, too. Similarly for the mmc.
Unfortunately the infrastructure is far from being there for the
scheduler to move the interrupt to the same performance domain as the
task, which is often optimal both in terms of throughput and in terms
of power.
I'll go looking for a stable testing platform with UFS as you
mentioned; the benefits of this patch will of course be greatly
increased there.
Thanks!

Best Regards,
Christian
Bart Van Assche March 5, 2024, 6:36 p.m. UTC | #3
On 3/5/24 01:13, Christian Loehle wrote:
> On 05/03/2024 00:20, Bart Van Assche wrote:
>> On 3/4/24 12:16, Christian Loehle wrote:
>>> - Higher cap is not always beneficial, we might place the task away
>>> from the CPU where the interrupt handler is running, making it run
>>> on an unboosted CPU which may have a bigger impact than the difference
>>> between the CPU's capacity the task moved to. (Of course the boost will
>>> then be reverted again, but a ping-pong every interval is possible).
>>
>> In the above I see "the interrupt handler". Does this mean that the NVMe
>> controller in the test setup only supports one completion interrupt for
>> all completion queues instead of one completion interrupt per completion
>> queue? There are already Android phones and developer boards available
>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
> 
> No, both NVMe test setups have one completion interrupt per completion queue,
> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> interrupt (on CPU0 on my setup).

I think that measurements should be provided in the cover letter for the
two types of storage controllers: one series of measurements for a
storage controller with a single completion interrupt and a second
series of measurements for storage controllers with one completion
interrupt per CPU.

> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> interrupt to a big CPU, too. Similarly for the mmc.
> Unfortunately the infrastructure is far from being there for the scheduler to move the
> interrupt to the same performance domain as the task, which is often optimal both in
> terms of throughput and in terms of power.
> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> patch will of course be greatly increased.

I'm not sure whether making the completion interrupt follow the workload
is a good solution. I'm concerned that this would increase energy
consumption by keeping the big cores active longer than necessary. I
like this solution better (improves storage performance on at least
devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
Handle HMP systems when completing IO"
(https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).

Thanks,

Bart.
Christian Loehle March 6, 2024, 10:49 a.m. UTC | #4
Hi Bart,

On 05/03/2024 18:36, Bart Van Assche wrote:
> On 3/5/24 01:13, Christian Loehle wrote:
>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>> - Higher cap is not always beneficial, we might place the task away
>>>> from the CPU where the interrupt handler is running, making it run
>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>> then be reverted again, but a ping-pong every interval is possible).
>>>
>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>> controller in the test setup only supports one completion interrupt for
>>> all completion queues instead of one completion interrupt per completion
>>> queue? There are already Android phones and developer boards available
>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>
>> No, both NVMe test setups have one completion interrupt per completion queue,
>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>> interrupt (on CPU0 on my setup).
> 
> I think that measurements should be provided in the cover letter for the
> two types of storage controllers: one series of measurements for a
> storage controller with a single completion interrupt and a second
> series of measurements for storage controllers with one completion
> interrupt per CPU.

Of the same type of storage controller? Or what exactly is missing for
you in the cover letter (ufs/emmc: single completion interrupt, nvme:
one completion interrupt per CPU)?

> 
>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>> interrupt to a big CPU, too. Similarly for the mmc.
>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>> interrupt to the same performance domain as the task, which is often optimal both in
>> terms of throughput and in terms of power.
>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>> patch will of course be greatly increased.
> 
> I'm not sure whether making the completion interrupt follow the workload
> is a good solution. I'm concerned that this would increase energy
> consumption by keeping the big cores active longer than necessary. I
> like this solution better (improves storage performance on at least
> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> Handle HMP systems when completing IO"
> (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).

That patch is good, don't get me wrong, but you still lose out by
running everything up to blk_mq_complete_request() on (potentially) a
LITTLE (that might be running at a low OPP), while having a big CPU
available at a high OPP anyway ('for free').
It is only adjacent to the series, but I've done some measurements
(Pixel 6 again, same device as the cover letter; Base is an Android 6.6
mainline kernel (so without my series, but I somewhat forced the
effects by task pinning), Applied is with both patches of "sched: blk:
Handle HMP systems when completing IO"):

Pretty numbers (IOPS):
Base irq@CPU0 median: 6969
Base irq@CPU6 median: 8407 (+20.6%)
Applied irq@CPU0 median: 7144 (+2.5%)
Applied irq@CPU6 median: 8288 (+18.9%)

This is with psyncx1 4K random read again; of course anything with
queue depth takes advantage of batch completions to significantly
reduce irq pressure.

Not so pretty numbers and the full list of commands used:

w/o patch:
irq on CPU0 (default):
psyncx1: 7000 6969 7025 6954 6964
io_uring4x128: 28766 28280 28339 28310 28349
irq on CPU6:
psyncx1: 8342 8492 8355 8407 8532
io_uring4x128: 28641 28356 25908 25787 25853

with patch:
irq on CPU0:
psyncx1: 7672 7144 7301 6976 6889
io_uring4x128: 28266 26314 27648 24482 25301
irq on CPU6:
psyncx1: 8208 8401 8351 8221 8288
io_uring4x128: 25603 25438 25453 25514 25402


for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --minimal | awk -F ";" '{print $8}'; sleep 30; done

for i in $(seq 0 4); do taskset c0 /data/local/tmp/fio_aosp_build --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/block/sda --ioengine=io_uring --iodepth=128 --numjobs=4 --group_reporting --minimal | awk -F ";" '{print $8}'; sleep 30; done

echo 6 > /proc/irq/296/smp_affinity_list


Kind Regards,
Christian
Qais Yousef March 21, 2024, 12:39 p.m. UTC | #5
(Thanks for the CC Bart)

On 03/06/24 10:49, Christian Loehle wrote:
> Hi Bart,
> 
> On 05/03/2024 18:36, Bart Van Assche wrote:
> > On 3/5/24 01:13, Christian Loehle wrote:
> >> On 05/03/2024 00:20, Bart Van Assche wrote:
> >>> On 3/4/24 12:16, Christian Loehle wrote:
> >>>> - Higher cap is not always beneficial, we might place the task away
> >>>> from the CPU where the interrupt handler is running, making it run
> >>>> on an unboosted CPU which may have a bigger impact than the difference
> >>>> between the CPU's capacity the task moved to. (Of course the boost will
> >>>> then be reverted again, but a ping-pong every interval is possible).
> >>>
> >>> In the above I see "the interrupt handler". Does this mean that the NVMe
> >>> controller in the test setup only supports one completion interrupt for
> >>> all completion queues instead of one completion interrupt per completion
> >>> queue? There are already Android phones and developer boards available
> >>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
> >>
> >> No, both NVMe test setups have one completion interrupt per completion queue,
> >> so this caveat doesn't affect them, higher capacity CPU is strictly better.
> >> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
> >> interrupt (on CPU0 on my setup).
> > 
> > I think that measurements should be provided in the cover letter for the
> > two types of storage controllers: one series of measurements for a
> > storage controller with a single completion interrupt and a second
> > series of measurements for storage controllers with one completion
> > interrupt per CPU.
> 
> Of the same type of storage controller? Or what is missing for you in
> the cover letter exactly (ufs/emmc: single completion interrupt,
> nvme: one completion interrupt per CPU).
> 
> > 
> >> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
> >> interrupt to a big CPU, too. Similarly for the mmc.
> >> Unfortunately the infrastructure is far from being there for the scheduler to move the
> >> interrupt to the same performance domain as the task, which is often optimal both in
> >> terms of throughput and in terms of power.
> >> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
> >> patch will of course be greatly increased.
> > 
> > I'm not sure whether making the completion interrupt follow the workload
> > is a good solution. I'm concerned that this would increase energy
> > consumption by keeping the big cores active longer than necessary. I
> > like this solution better (improves storage performance on at least
> > devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
> > Handle HMP systems when completing IO"
> > (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).
> 
> That patch is good, don't get me wrong, but you still lose out by running everything
> up to blk_mq_complete_request() on (potentially) a LITTlE (that might be run on a low OPP),
> while having a big CPU available at a high OPP anyway ("for free").
> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):

So you want the hardirq to move to the big core? Unlike softirq, there will be
a single hardirq for the controller (to my limited knowledge), so if there are
multiple requests I'm not sure we can easily match which one relates to which
before it triggers. So we can end up waking up the wrong core.

Generally this should be a userspace policy. If there's a scenario where the
throughput is that important they can easily move the hardirq to the big core
unconditionally and move it back again once this high throughput scenario is no
longer important.
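
E.g. something along the lines of (the irq number being
setup-specific):

	echo 6 > /proc/irq/<storage-irq>/smp_affinity_list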

Or were you describing a different problem?

Glad to see your series by the way :-) I'll get a chance to review it over the
weekend hopefully.


Cheers

--
Qais Yousef

> [snip]
Christian Loehle March 21, 2024, 5:57 p.m. UTC | #6
On 21/03/2024 12:39, Qais Yousef wrote:
[snip]
>> On 05/03/2024 18:36, Bart Van Assche wrote:
>>> On 3/5/24 01:13, Christian Loehle wrote:
>>>> On 05/03/2024 00:20, Bart Van Assche wrote:
>>>>> On 3/4/24 12:16, Christian Loehle wrote:
>>>>>> - Higher cap is not always beneficial, we might place the task away
>>>>>> from the CPU where the interrupt handler is running, making it run
>>>>>> on an unboosted CPU which may have a bigger impact than the difference
>>>>>> between the CPU's capacity the task moved to. (Of course the boost will
>>>>>> then be reverted again, but a ping-pong every interval is possible).
>>>>>
>>>>> In the above I see "the interrupt handler". Does this mean that the NVMe
>>>>> controller in the test setup only supports one completion interrupt for
>>>>> all completion queues instead of one completion interrupt per completion
>>>>> queue? There are already Android phones and developer boards available
>>>>> that support the latter, namely the boards equipped with a UFSHCI 4.0 controller.
>>>>
>>>> No, both NVMe test setups have one completion interrupt per completion queue,
>>>> so this caveat doesn't affect them, higher capacity CPU is strictly better.
>>>> The UFS and both mmc setups (eMMC with CQE and sdcard) only have one completion
>>>> interrupt (on CPU0 on my setup).
>>>
>>> I think that measurements should be provided in the cover letter for the
>>> two types of storage controllers: one series of measurements for a
>>> storage controller with a single completion interrupt and a second
>>> series of measurements for storage controllers with one completion
>>> interrupt per CPU.
>>
>> Of the same type of storage controller? Or what is missing for you in
>> the cover letter exactly (ufs/emmc: single completion interrupt,
>> nvme: one completion interrupt per CPU).
>>
>>>
>>>> FWIW you do gain an additional ~20% (in my specific setup) if you move the ufshcd
>>>> interrupt to a big CPU, too. Similarly for the mmc.
>>>> Unfortunately the infrastructure is far from being there for the scheduler to move the
>>>> interrupt to the same performance domain as the task, which is often optimal both in
>>>> terms of throughput and in terms of power.
>>>> I'll go looking for a stable testing platform with UFS as you mentioned, benefits of this
>>>> patch will of course be greatly increased.
>>>
>>> I'm not sure whether making the completion interrupt follow the workload
>>> is a good solution. I'm concerned that this would increase energy
>>> consumption by keeping the big cores active longer than necessary. I
>>> like this solution better (improves storage performance on at least
>>> devices with a UFSHCI 3.0 controller): "[PATCH v2 0/2] sched: blk:
>>> Handle HMP systems when completing IO"
>>> (https://lore.kernel.org/linux-block/20240223155749.2958009-1-qyousef@layalina.io/).
>>
>> That patch is good, don't get me wrong, but you still lose out by running everything
>> up to blk_mq_complete_request() on (potentially) a LITTlE (that might be run on a low OPP),
>> while having a big CPU available at a high OPP anyway ("for free").
>> It is only adjacent to the series but I've done some measurements (Pixel6 again, same device
>> as cover letter, Base is Android 6.6 mainline kernel (so without my series, but I somewhat forced
>> the effects by task pinning), Applied is with both of sched: blk: Handle HMP systems when completing IO):
> 
> So you want the hardirq to move to the big core? Unlike softirq, there will be
> a single hardirq for the controller (to my limited knowledge), so if there are
> multiple requests I'm not sure we can easily match which one relates to which
> before it triggers. So we can end up waking up the wrong core.

It would be beneficial to move the hardirq to a big core if the IO task
is using it anyway.
I'm not sure I actually want to. There are quite a few pitfalls (like you
mentioned) that the scheduler really shouldn't be concerned about.
Moving the hardirq, if implemented in the kernel, would have to be
done by the host controller driver anyway, which would blow up the
scope of this series (host controller drivers are quite fragmented,
e.g. on mmc).

The fact that having a higher capacity CPU available ("running
faster") for an IO task doesn't (always) imply higher throughput,
because the hardirq stays on some LITTLE CPU, is bothersome (for this
series), though.

> 
> Generally this should be a userspace policy. If there's a scenario where the
> throughput is that important they can easily move the hardirq to the big core
> unconditionally and move it back again once this high throughput scenario is no
> longer important.

It also feels wrong to make this a userspace policy, as the hardirq
must be migrated to the perf domain of the task, which userspace isn't
aware of.
Unless you expect userspace to do
CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0,
but then you could just as well ask them to set the performance
governor for big_perf_domain_0 (or uclamp_min=1024) and need neither
this series nor any iowait boosting.

Furthermore you can't generally expect userspace to know if their IO will lead
to any interrupt at all, much less which one. They ideally don't even know if
the file IO they are doing is backed by any physical storage in the first place.
(Or even further, whether they are doing file IO at all; they might
just be e.g. page-faulting.)

> 
> Or were you describing a different problem?

That is the problem I mentioned in the series and that Bart and I were
discussing.
It's a problem of the series as in "the numbers aren't that impressive".
Current iowait boosting on embedded/mobile systems performs quite well
by chance, as the (low-util) task will often be on the same perf domain
the hardirq is run on. As can be seen in the cover letter, the benefits
of running the task on a (2x LITTLE capacity) big CPU are therefore
practically non-existent; for tri-gear systems where the big CPU is
more like 10x LITTLE capacity the benefit will be much greater.
I just wanted to point this out. We might just acknowledge the problem
and say "don't care" about the potential performance benefits of those
scenarios that would require hardirq moving.
In the long term it looks like the problem will disappear for UFS, as
we are expected to get one queue/hardirq per CPU (as Bart mentioned);
on NVMe that is already the case.

I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
is available (and it doesn't look like this is changing anytime soon).
I would also love to hear what Bart or other UFS folks think about it.
Furthermore, if I forgot any storage subsystem with the same behavior
in that regard, do tell me.

Lastly, you could consider this IO workload (an IO task being in
iowait very frequently [1], with just a single IO inflight [2], and
only very little time spent on the CPU in-between iowaits [3], so that
the interrupt handler is on the critical path for IO throughput to a
non-negligible degree) to be niche, but it's precisely the use-case
where iowait boosting shows its biggest benefit.

Sorry for the abomination of a sentence, see the footnotes for the
reasons.

[1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
won't apply any significant boost currently.
[2] If the storage devices has enough in-flight requests to serve, iowait
boosting is unnecessary/wasteful, see cover letter.
[3] If the task actually uses the CPU in-between iowaits, it will
build up utilization and the benefit of iowait boosting diminishes.

> 
> Glad to see your series by the way :-) I'll get a chance to review it over the
> weekend hopefully.

Thank you!
Apologies for not CCing you in the first place; I am curious about your
opinion on the concept!

FWIW I did mess up a last-minute change that was supposed to be
cosmetic and only received a quick smoke test, so 1/2 needs the
following fix:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aaf64023b03..2b6f521be658 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
        } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
                /* Reduce boost */
                if (p->io_boost_level > 1)
-                       io_boost_scale_interval(p, true);
+                       io_boost_scale_interval(p, false);
                else
                        p->io_boost_level = 0;
        } else if (p->io_boost_level == IO_BOOST_LEVELS) {


I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
changes are mostly cosmetic and addressing Bart's comments about the benchmark
numbers in the cover letter.

Kind Regards,
Christian
Bart Van Assche March 21, 2024, 7:52 p.m. UTC | #7
On 3/21/24 10:57, Christian Loehle wrote:
> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.

Why the focus on storage controllers with a single completion interrupt?
It probably won't take long (one year?) until all new high-end
smartphones support multiple completion interrupts.

Thanks,

Bart.
Vincent Guittot March 22, 2024, 6:08 p.m. UTC | #8
Hi Christian,

On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
>
> There is a feature inside of both schedutil and intel_pstate called
> iowait boosting which tries to prevent selecting a low frequency
> during IO workloads when it impacts throughput.
> The feature is implemented by checking for task wakeups that have
> the in_iowait flag set and boost the CPU of the rq accordingly
> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>
> The necessity of the feature is argued with the potentially low
> utilization of a task being frequently in_iowait (i.e. most of the
> time not enqueued on any rq and cannot build up utilization).
>
> The RFC focuses on the schedutil implementation.
> intel_pstate frequency selection isn't touched for now, suggestions are
> very welcome.
> Current schedutil iowait boosting has several issues:
> 1. Boosting happens even in scenarios where it doesn't improve
> throughput. [1]
> 2. The boost is not accounted for in EAS: a) feec() will only consider
>  the actual utilization for task placement, but another CPU might be
>  more energy-efficient at that capacity than the boosted one.)
>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>  will not consider the (potentially 'free') boosted capacity, but the
>  one it would have without the boost (since the boost is only applied
>  in sugov).
> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.
> 4. The boost isn't associated with a task, it therefore isn't considered
> for task placement, potentially missing out on higher capacity CPUs on
> heterogeneous CPU topologies.
> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.
> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.
> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks might lead to boosting.
>
> We attempt to mitigate all of the above by reworking the way the
> iowait boosting (io boosting from here on) works in two major ways:
> - Carry the boost in task_struct, so it is a per-task attribute and
> behaves similar to utilization of the task in some ways.
> - Employ a counting-based tracking strategy that only boosts as long
> as it sees benefits and returns to no boosting dynamically.

Thanks for working on improving IO boosting. I have started to read
your patchset and have a few comments about your proposal:

The main one is that the io boosting decision should remain a cpufreq
governor decision, and so the io boosting value should be applied by
the governor, like in sugov_effective_cpu_perf() for example, instead
of everywhere in the scheduler code.

Then, the algorithm to track the right interval bucket and the mapping
of intervals into utilization really look like a policy which has been
defined with heuristics, and as a result further seem to be a governor
decision.

Finally, adding atomic operations in the fast path is not really
desirable.

I will continue to review your patchset.

> [snip]
Qais Yousef March 25, 2024, 2:20 a.m. UTC | #9
(piggy backing on this reply)

On 03/22/24 19:08, Vincent Guittot wrote:
> Hi Christian,
> 
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
> >
> > There is a feature inside of both schedutil and intel_pstate called
> > iowait boosting which tries to prevent selecting a low frequency
> > during IO workloads when it impacts throughput.
> > The feature is implemented by checking for task wakeups that have
> > the in_iowait flag set and boost the CPU of the rq accordingly
> > (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
> >
> > The necessity of the feature is argued with the potentially low
> > utilization of a task being frequently in_iowait (i.e. most of the
> > time not enqueued on any rq and cannot build up utilization).
> >
> > The RFC focuses on the schedutil implementation.
> > intel_pstate frequency selection isn't touched for now, suggestions are
> > very welcome.
> > Current schedutil iowait boosting has several issues:
> > 1. Boosting happens even in scenarios where it doesn't improve
> > throughput. [1]
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> >  the actual utilization for task placement, but another CPU might be
> >  more energy-efficient at that capacity than the boosted one.)
> >  b) When placing a non-IO task while a CPU is boosted compute_energy()
> >  will not consider the (potentially 'free') boosted capacity, but the
> >  one it would have without the boost (since the boost is only applied
> >  in sugov).
> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
> > 4. The boost isn't associated with a task, it therefore isn't considered
> > for task placement, potentially missing out on higher capacity CPUs on
> > heterogeneous CPU topologies.
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks might lead to boosting.

You forgot an important problem, which was the main request from Android
when this first came up a few years back: iowait boost is a power-hungry
feature and not all tasks require iowait boost. By having it per task we want
to be able to prevent tasks from causing frequency spikes due to iowait boost
when it is not warranted.

> >
> > We attempt to mitigate all of the above by reworking the way the
> > iowait boosting (io boosting from here on) works in two major ways:
> > - Carry the boost in task_struct, so it is a per-task attribute and
> > behaves similar to utilization of the task in some ways.
> > - Employ a counting-based tracking strategy that only boosts as long
> > as it sees benefits and returns to no boosting dynamically.
> 
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have few comments about your proposal:
> 
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code.

I have similar thoughts.

I think we want the scheduler to treat iowait boost like uclamp_min, but
requested by block subsystem rather than by the user.

I think we should create new task_min/max_perf() helpers and replace all
current calls in the scheduler to uclamp_eff_value() with task_min/max_perf(),
where:

unsigned long task_min_perf(struct task_struct *p)
{
	return max(uclamp_eff_value(p, UCLAMP_MIN), p->iowait_boost);
}

unsigned long task_max_perf(struct task_struct *p)
{
	return uclamp_eff_value(p, UCLAMP_MAX);
}

then all users of uclamp_min in the scheduler will see the request for boost
from iowait and make the correct task placement decision, including under
thermal pressure, and ensure the boost doesn't accidentally escape uclamp_max,
which I am not sure your series caters for with the open coding. You're
missing the load balancer paths from what I see.
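For illustration, the conversion at an existing uclamp_eff_value() caller
would then be mechanical; a sketch only, with made-up variable names:

-	min = uclamp_eff_value(p, UCLAMP_MIN);
-	max = uclamp_eff_value(p, UCLAMP_MAX);
+	min = task_min_perf(p);
+	max = task_max_perf(p);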

It will also solve the problem I mention above. The tasks that should not use
iowait boost are likely restricted with uclamp_max already. If we treat iowait
boost as an additional source of min_perf request, then uclamp_max will prevent
it from going above a certain perf level and give us the desired impact without
any additional hint. I don't think it is important to disable it completely but
rather have a way to prevent tasks from consuming too much resources when not
needed, which we already have from uclamp_max.

I am not sure it makes sense to have a separate control where a task can run
fast due to util but can't have iowait boost or vice versa. I think existing
uclamp_max should be enough to restrict tasks from exceeding a performance
limit.
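As a sketch with the helpers above (task_effective_perf() is just an
illustrative name, not proposed API):

unsigned long task_effective_perf(struct task_struct *p)
{
	/* the iowait boost can never escape the task's uclamp_max */
	return min(task_min_perf(p), task_max_perf(p));
}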

> 
> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

Hmm do you think this should not be a per-task value then Vincent?

Or oh, I think I see what you mean. Make effective_cpu_util() set the min
parameter correctly. I think that would work too, yes. iowait boost is just
another min perf request and as long as it is treated as such, it is good for
me. We'll just need to add a new parameter for the task like I did in the
remove uclamp max aggregation series.
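Roughly like the sketch below; the extra @p parameter and the p->io_boost
field are hypothetical, just to show where the request would be folded in:

unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
				 unsigned long *min, unsigned long *max,
				 struct task_struct *p)
{
	/* existing aggregation of irq/DL/uclamp into util/min/max elided */

	if (min && p) {
		/* treat the per-task io boost as one more min perf request */
		*min = max(*min, p->io_boost);
	}

	return util_cfs;
}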

Generally I think it's better to split the patches so that the conversion of
the current iowait boost algorithm to being per-task is a separate patch, and
then look at improving the algorithm logic on top. These are two different
problems IMHO.

One major problem and big difference in per-task iowait that I see Christian
alluded to is that the CPU will no longer be boosted when the task is sleeping.
I think there will be cases out there where some users relied on that for the
BLOCK softirq to run faster too. We need an additional way to ensure that the
softirq runs at a similar performance level to the task that initiated the
request. So we need a way to hold the cpufreq policy's min perf until the
softirq is serviced. Or just keep the CPU boosted until the task is migrated.
I'm not sure what is better yet.

> 
> Finally adding some atomic operation in the fast path is not really desirable

Yes, I was wondering if we can apply the value when we set the p->in_iowait
flag instead?
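For reference, the flag gets set in io_schedule_prepare(); a sketch of the
idea, with io_boost_apply() being a made-up helper name:

int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;
	/*
	 * Hypothetical: apply the task's boost to its current rq here,
	 * rather than atomically at the wakeup site.
	 */
	io_boost_apply(current);
	blk_flush_plug(current->plug, true);

	return old_iowait;
}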
Qais Yousef March 25, 2024, 2:53 a.m. UTC | #10
On 03/21/24 17:57, Christian Loehle wrote:

> > So you want the hardirq to move to the big core? Unlike softirq, there will be
> > a single hardirq for the controller (to my limited knowledge), so if there are
> > multiple requests I'm not sure we can easily match which one relates to which
> > before it triggers. So we can end up waking up the wrong core.
> 
> It would be beneficial to move the hardirq to a big core if the IO task
> is using it anyway.
> I'm not sure I actually want to. There are quite a few pitfalls (like you

I'm actually against it. I think it's too much complexity for not necessarily
a big gain. FWIW, one of the design requests was to get per-task iowait boost
so that we can *disable* it. It wastes power when only a handful of tasks
actually care about perf.

Caring where the hardirq runs for perf is unlikely to be a problem in practice.
The softirq should already follow the requester when it matters.

> mentioned) that the scheduler really shouldn't be concerned about.
> Moving the hardirq, if implemented in the kernel, would have to be done by the
> host controller driver anyway, which would explode this series.
> (host controller drivers are quite fragmented e.g. on mmc)
> 
> The fact that having a higher capacity CPU available ("running faster") for an
> IO task doesn't (always) imply higher throughput because of the hardirq staying
> on some LITTLE CPU is bothersome (for this series), though.
> 
> > 
> > Generally this should be a userspace policy. If there's a scenario where the
> > throughput is that important they can easily move the hardirq to the big core
> > unconditionally and move it back again once this high throughput scenario is no
> > longer important.
> 
> It also feels wrong to let this be a userspace policy, as the hardirq must be
> migrated to the perf domain of the task, which userspace isn't aware of.
> Unless you expect userspace to do

irq balancing is a userspace policy. For the kernel to make an automatic
decision a lot of ifs must hold. Again, I don't see maximizing throughput being
a concern on such systems. And userspace can fix the problem simply
- they know after all when the throughput really matters to the point that
where the hardirq runs becomes a bottleneck. In practice, I don't think it is
a bottleneck. But this is my handwavy judgement. The experts know better. And
note, I mean use cases that are not benchmarks ;-)

> CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
> but then you could just as well ask them to set performance governor for
> big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
> any iowait boosting.
> 
> Furthermore you can't generally expect userspace to know if their IO will lead
> to any interrupt at all, much less which one. They ideally don't even know if
> the file IO they are doing is backed by any physical storage in the first place.
> (Or even further, that they are doing file IO at all, they might just be
> e.g. page-faulting.)

The way I see it, it's like gigabit networking. The hardirq will matter once
you reach such high throughput scenarios, which are corner cases and not the
norm?

> 
> > 
> > Or where you describing a different problem?
> 
> That is the problem I mentioned in the series and Bart and I were discussing.
> It's a problem of the series as in "the numbers aren't that impressive".
> Current iowait boosting on embedded/mobile systems will perform quite well by
> chance, as the (low util) task will often be on the same perf domain the hardirq
> will be run on. As can be seen in the cover letter the benefits of running the
> task on a (2xLITTLE capacity) big CPU are therefore practically non-existent;
> for tri-gear systems where the big CPU is more like 10xLITTLE capacity the
> benefit will be much greater.
> I just wanted to point this out. We might just acknowledge the problem and say
> "don't care" about the potential performance benefits of those scenarios that
> would require hardirq moving.

I thought the softirq does the bulk of the work. The hardirq being such
a bottleneck is (naively maybe) a red flag for me that it's doing more than
simple interrupt servicing.

You don't boost when the task is sleeping, right? I think this is likely
a cause of the problem where the softirq is not running as fast - whereas
before the series the CPU would be iowait boosted regardless of whether the
task is blocked or not.

> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.
> 
> I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
> 'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
> is available (and it doesn't look like this is changing anytime soon).
> I would also love to hear what Bart or other UFS folks think about it.
> Furthermore if I forgot any storage subsystem with the same behavior in that
> regards do tell me.
> 
> Lastly, you could consider the IO workload:
> IO task being in iowait very frequently [1] with just a single IO inflight [2]
> and only very little time being spent on the CPU in-between iowaits[3],
> therefore the interrupt handler being on the critical path for IO throughput
> to a non-negligible degree, to be niche, but it's precisely the use-case where
> iowait boosting shows it's biggest benefit.
> 
> Sorry for the abomination of a sentence, see footnotes for the reasons.
> 
> [1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
> won't apply any significant boost currently.

I CCed you on a patch where I fix this. I've been sleeping on it for too long.
Maybe I should have split this fix out of the consolidation patch.

> [2] If the storage device has enough in-flight requests to serve, iowait
> boosting is unnecessary/wasteful, see cover letter.
> [3] If the task actually uses the CPU in-between iowaits, it will build up
> utilization, iowait boosting benefit diminishes.

The current mechanism is very aggressive. It needs to evolve for sure.

> 
> > 
> > Glad to see your series by the way :-) I'll get a chance to review it over the
> > weekend hopefully.
> 
> Thank you!
> Apologies for not CCing you in the first place, I am curious about your opinion
> on the concept!

I actually had a patch that implements iowait boost per-task (on top of my
remove uclamp max aggregation series) where I did actually take the extra step
to remove iowait from intel_pstate. Can share the patches if you think you'll
find them useful.

Just want to note that this mechanism can end up wasting power, and this is an
important direction to consider. It's not about perf only (which matters too).

> 
> FWIW I did mess up a last-minute, what was supposed to be, cosmetic change that
> only received a quick smoke test, so 1/2 needs the following:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4aaf64023b03..2b6f521be658 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
>         } else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
>                 /* Reduce boost */
>                 if (p->io_boost_level > 1)
> -                       io_boost_scale_interval(p, true);
> +                       io_boost_scale_interval(p, false);
>                 else
>                         p->io_boost_level = 0;
>         } else if (p->io_boost_level == IO_BOOST_LEVELS) {
> 
> 
> I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
> changes are mostly cosmetic and addressing Bart's comments about the benchmark
> numbers in the cover letter.

I didn't spend a lot of time on the series, but I can see a number of problems.
Let us discuss them first and plan a future direction. No need for a v2 if it's
just for this fix IMO.
Christian Loehle March 25, 2024, 12:06 p.m. UTC | #11
On 21/03/2024 19:52, Bart Van Assche wrote:
> On 3/21/24 10:57, Christian Loehle wrote:
>> In the long-term it looks like for UFS the problem will disappear as we are
>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>> is already the case.
> 
> Why the focus on storage controllers with a single completion interrupt?
> It probably won't take long (one year?) until all new high-end
> smartphones may have support for multiple completion interrupts.
> 
> Thanks,
> 
> Bart.
> 

Apart from going from "This patch shows significant performance improvements on
hardware that runs mainline today" to "This patch will have significant
performance improvements on devices running mainline in a couple of years",
nothing in particular.
I'm fine with leaving it at having acknowledged the problem.
Maybe I would just gate the task placement on the task having recently been in
a UFS (with multiple completion interrupts) or NVMe submission, to
avoid regressions from current behavior in future versions. I did have that
already at some point, although it was a bit hacky.
Anyway, thank you for your input on that, it is what I wanted to hear!

Kind Regards,
Christian
Christian Loehle March 25, 2024, 12:24 p.m. UTC | #12
On 22/03/2024 18:08, Vincent Guittot wrote:
> Hi Christian,
Hi Vincent,
thanks for taking a look.

> 
> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
>> [snip]
> 
> Thanks for working on improving IO boosting. I have started to read
> your patchset and have few comments about your proposal:
> 
> The main one is that the io boosting decision should remain a cpufreq
> governor decision and so the io boosting value should be applied by
> the governor like in sugov_effective_cpu_perf() as an example instead
> of everywhere in the scheduler code
Having it move into the scheduler is to enable it for EAS (e.g. boosting
a LITTLE to its highest OPP often being much less energy-efficient than
having a higher cap CPU at a lower OPP) and to make higher capacities
reachable on other CPUs, too.
I guess for you the first one is the more interesting one.

> 
> Then, the algorithm to track the right interval bucket and the mapping
> of intervals into utilization really looks like a policy which has
> been defined with heuristics and as a result further seems to be a
> governor decision

I did have a comparable thing as a governor decision, but the entire
"Test if util boost increases iowaits seen per interval and only boost
accordingly" really only works if the interval is long enough, my proposed
starting length of 25ms really being the lower limit for the storage devices
we want to cover (IO latency not being constant and therefore iowaits per
interval being somewhat noisy).
Given that the IO tasks will be enqueued/dequeued very frequently it just
isn't credible to expect them to land on the same CPU for many intervals,
unless your system is very bare-bones and idle, but even on an idle Android
I consider any interval above 50ms unusable, providing no throughput
improvement.
The idea of tracking the iowaits I do find the best option in this vague and
noisy environment of "iowait wakeups" and definitely worth having, so that's
why I opted for it being in the scheduler code, but I'd love to hear your
thoughts/alternatives.
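For reference, a heavily condensed sketch of the dequeue-time check (field
names as in the series, but the threshold handling here is simplified and not
the actual patch logic):

static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
{
	/* once per interval: compare iowaits seen against the thresholds */
	if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
		/* boosting stopped paying off: reduce the boost */
		if (p->io_boost_level > 1)
			io_boost_scale_interval(p, false);
		else
			p->io_boost_level = 0;
	} else if (p->io_boost_level < IO_BOOST_LEVELS) {
		/* iowaits per interval kept up: boost one level further */
		p->io_boost_level++;
	}
}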
I'd also like an improvement on the definition of iowait or some more separate
flag for boostable IO, the entire "boost on any iowait wakeup" is groping in
the dark which I'm trying to combat, but it's somewhat out of scope here.

> 
> Finally adding some atomic operation in the fast path is not really desirable
Agreed, I'll look into it, for now I wanted as much feedback on the two major
changes:
- iowait boost now per-task
- boosting based upon iowaits seen per interval

> 
> I will continue to review your patchset

Thank you, looking forward to seeing your review.

>>[snip]
Kind Regards,
Christian
Christian Loehle March 25, 2024, 5:18 p.m. UTC | #13
On 25/03/2024 02:20, Qais Yousef wrote:
> (piggybacking on this reply)
> 
> On 03/22/24 19:08, Vincent Guittot wrote:
>> Hi Christian,
>>
>> On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
>>> [snip]
> 
> You forgot an important problem, which was the main request from Android
> when this first came up a few years back: iowait boost is a power-hungry
> feature and not all tasks require iowait boost. By having it per task we want
> to be able to prevent tasks from causing frequency spikes due to iowait boost
> when it is not warranted.

It is, and most of the time I see it triggering (in day-to-day workloads) it
doesn't help in any measurable way.
Being able to toggle this per-task is the logical next step, although I would
expect very little over-boosting overall compared to the current sugov
implementation. If you observe otherwise please do tell me for which workloads!

>>>
>>> We attempt to mitigate all of the above by reworking the way the
>>> iowait boosting (io boosting from here on) works in two major ways:
>>> - Carry the boost in task_struct, so it is a per-task attribute and
>>> behaves similar to utilization of the task in some ways.
>>> - Employ a counting-based tracking strategy that only boosts as long
>>> as it sees benefits and returns to no boosting dynamically.
>>
>> Thanks for working on improving IO boosting. I have started to read
>> your patchset and have few comments about your proposal:
>>
>> The main one is that the io boosting decision should remain a cpufreq
>> governor decision and so the io boosting value should be applied by
>> the governor like in sugov_effective_cpu_perf() as an example instead
>> of everywhere in the scheduler code.
> 
> I have similar thoughts.
> 
> I think we want the scheduler to treat iowait boost like uclamp_min, but
> requested by block subsystem rather than by the user.
> 
> I think we should create new task_min/max_perf() helpers and replace all
> current calls in the scheduler to uclamp_eff_value() with task_min/max_perf(),
> where:
> 
> unsigned long task_min_perf(struct task_struct *p)
> {
> 	return max(uclamp_eff_value(p, UCLAMP_MIN), p->iowait_boost);
> }
> 
> unsigned long task_max_perf(struct task_struct *p)
> {
> 	return uclamp_eff_value(p, UCLAMP_MAX);
> }
> 
> then all users of uclamp_min in the scheduler will see the request for boost
> from iowait and make the correct task placement decision, including under
> thermal pressure, and ensure the boost doesn't accidentally escape uclamp_max,
> which I am not sure your series caters for with the open coding. You're
> missing the load balancer paths from what I see.

io_boost doesn't have to be clamped at the load balancer path because it isn't
included there (unless I messed up).
Essentially io_boost should never trigger a load balance, we are talking about
tasks that get constantly enqueued and only spend very little time on the CPU
until sleeping again, so any load balancing should be overkill.
For the rest I'm open to anything, it's all a 'minor' implementation detail for
me :)

> 
> It will also solve the problem I mention above. The tasks that should not use
> iowait boost are likely restricted with uclamp_max already. If we treat iowait
> boost as an additional source of min_perf request, then uclamp_max will prevent
> it from going above a certain perf level and give us the desired impact without
> any additional hint. I don't think it is important to disable it completely but
> rather have a way to prevent tasks from consuming too much resources when not
> needed, which we already have from uclamp_max.
> 
> I am not sure it makes sense to have a separate control where a task can run
> fast due to util but can't have iowait boost or vice versa. I think existing
> uclamp_max should be enough to restrict tasks from exceeding a performance
> limit.
> 
>>
>> Then, the algorithm to track the right interval bucket and the mapping
>> of intervals into utilization really looks like a policy which has
>> been defined with heuristics and as a result further seems to be a
>> governor decision
> 
> Hmm do you think this should not be a per-task value then Vincent?

That's how I understood Vincent anyway.
See my other reply.

> 
> Or oh, I think I see what you mean. Make effective_cpu_util() set the min
> parameter correctly. I think that would work too, yes. iowait boost is just
> another min perf request and as long as it is treated as such, it is good for
> me. We'll just need to add a new parameter for the task like I did in the
> remove uclamp max aggregation series.

I did have that at some point, too, although before Vincent's rework.
Should be fine from what I can see now.

> 
> Generally I think it's better to split the patches so that the conversion of
> the current iowait boost algorithm to being per-task is a separate patch, and
> then look at improving the algorithm logic on top. These are two different
> problems IMHO.

That's possible, although the current iowait boosting is based on consecutiveness
of the iowait wakeups on the rq (oversimplifying away all that rate_limit_us
stuff), which doesn't really translate well into a per-task property, but
I can come up with something that works just well enough here.
As I said in my other reply this entire piggybacking on top of iowait wakeups
is such an unfortunate beast, see all the different occurrences of io_schedule*()
and mutex_lock_io(). The entire interval-based tracking strategy attempts to
mitigate that somewhat without having to go through the entire tree.

> One major problem and big difference in per-task iowait that I see Christian
> alluded to is that the CPU will no longer be boosted when the task is sleeping.
> I think there will be cases out there where some users relied on that for the
> BLOCK softirq to run faster too. We need an additional way to ensure that the
> softirq runs at a similar performance level to the task that initiated the
> request. So we need a way to hold the cpufreq policy's min perf until the
> softirq is serviced. Or just keep the CPU boosted until the task is migrated.
> I'm not sure what is better yet.

Yes, right now rate_limit_us (which is usually at least TICK_NSEC currently)
'protects' this. Almost all of the cpufreq updates will come from the iowait
task(s) enqueue anyway (in cases we apply some io boost).
Having the per-task boost 'linger' around at the runqueue more explicitly is
a bit awkward though, as you would have to remove it if the scheduler picks a
different CPU once the task is re-enqueued.
Not impossible to do but lots of awkwardness there.

>>
>> Finally adding some atomic operation in the fast path is not really desirable
> 
> Yes, I was wondering if we can apply the value when we set the p->in_iowait
> flag instead?

Yeah, thought about it too; again the awkwardness is that you don't know which
rq the task will be enqueued on after the wakeup.
(Boosting the current CPU and then removing the boost if we switched CPUs can
be done, but then we also need to arm a timer for tasks that go into iowait
for a long time (and thus don't deserve boosting anymore).)
Might be worse than the current atomic.
But I'll come up with something, should be the least critical part of this series ;)

Thanks for taking a look, I'll gather some additional numbers for the other replies
and get back to you.

Kind Regards,
Christian
Bart Van Assche March 25, 2024, 5:23 p.m. UTC | #14
On 3/25/24 05:06, Christian Loehle wrote:
> On 21/03/2024 19:52, Bart Van Assche wrote:
>> On 3/21/24 10:57, Christian Loehle wrote:
>>> In the long-term it looks like for UFS the problem will disappear as we are
>>> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
>>> is already the case.
>>
>> Why the focus on storage controllers with a single completion interrupt?
>> It probably won't take long (one year?) until all new high-end
>> smartphones may have support for multiple completion interrupts.
> 
> Apart from going to "This patch shows significant performance improvements on
> hardware that runs mainline today" to "This patch will have significant
> performance improvements on devices running mainline in a couple years"
> nothing in particular.

That doesn't make sense to me. Smartphones with UFSHCI 4.0 controllers
are available from multiple vendors. See also 
https://en.wikipedia.org/wiki/Universal_Flash_Storage. See also
https://www.gsmarena.com/samsung_galaxy_s24-12773.php.

Bart.
Vincent Guittot March 28, 2024, 10:09 a.m. UTC | #15
On Mon, 25 Mar 2024 at 13:24, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 22/03/2024 18:08, Vincent Guittot wrote:
> > Hi Christian,
> Hi Vincent,
> thanks for taking a look.
>
> >
> > On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@arm.com> wrote:
> >> [snip]
> >
> > Thanks for working on improving IO boosting. I have started to read
> > your patchset and have few comments about your proposal:
> >
> > The main one is that the io boosting decision should remain a cpufreq
> > governor decision and so the io boosting value should be applied by
> > the governor like in sugov_effective_cpu_perf() as an example instead
> > of everywhere in the scheduler code
> Having it move into the scheduler is to enable it for EAS (e.g. boosting
> a LITTLE to its highest OPP often being much less energy-efficient than
> having a higher cap CPU at a lower OPP) and to make higher capacities
> reachable on other CPUs, too.

sugov_effective_cpu_perf() is used by EAS when finding the final OPP
and computing the energy, so I don't see a problem with moving the policy
(converting some iowait boost information into some performance level)
into the cpufreq governor. EAS should be able to select the more
efficient CPU for the waking task.
Furthermore, you add it into the utilization, whereas iowait boost is
not capacity that will be used by the task but rather a minimum
bandwidth requirement to speed up its execution; this could be seen
like uclamp_min, especially if you also want to use the iowait boosting
to migrate tasks. But I don't think that this is exactly the same.
Uclamp_min helps when a task always has the same small amount of work
to do periodically. Whatever the frequency, its utilization remains
(almost) the same and is not really expected to impact its period. In
the case of iowait boost, when you increase the frequency, the task
will do more work and its utilization will increase (because the
overall periods will decrease). This increase of the utilization
should be the trigger for migrating the task to another CPU.
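(To illustrate with made-up numbers: a task doing 1ms worth of max-capacity
work per IO followed by a 4ms device wait has a 2ms + 4ms = 6ms period at half
capacity, i.e. ~17% invariant utilization; boosted to full capacity the period
shrinks to 1ms + 4ms = 5ms, so the task completes more IOs per second and its
utilization rises to 1/5 = 20%.)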

> I guess for you the first one is the more interesting one.
>
> >
> > Then, the algorithm to track the right interval bucket and the mapping
> > of intervals into utilization really looks like a policy which has
> > been defined with heuristics and as a result further seems to be a
> > governor decision
>
> I did have a comparable thing as a governor decision, but the entire
> "Test if util boost increases iowaits seen per interval and only boost
> accordingly" really only works if the interval is long enough, my proposed
> starting length of 25ms really being the lower limit for the storage devices
> we want to cover (IO latency not being constant and therefore iowaits per
> interval being somewhat noisy).

Your explanation above confirms that it's a policy for your storage devices.

> Given that the IO tasks will be enqueued/dequeued very frequently it just
> isn't credible to expect them to land on the same CPU for many intervals,
> unless your system is very bare-bones and idle, but even on an idle Android
> I consider any interval above 50ms unusable, providing no throughput
> improvement.
> The idea of tracking the iowaits I do find the best option in this vague and
> noisy environment of "iowait wakeups" and definitely worth having, so that's
> why I opted for it being in the scheduler code, but I'd love to hear your
> thoughts/alternatives.

There are 3 parts in your proposal:
1- tracking per-task iowait statistics
2- translating that into a capacity rather than a utilization
3- using this value in EAS

Having 1- in the scheduler seems ok but 2- and 3- should not be
injected directly into the scheduler.



> I'd also like an improvement on the definition of iowait or some more separate
> flag for boostable IO, the entire "boost on any iowait wakeup" is groping in
> the dark which I'm trying to combat, but it's somewhat out of scope here.
>
> >
> > Finally adding some atomic operation in the fast path is not really desirable
> Agreed, I'll look into it, for now I wanted as much feedback on the two major
> changes:
> - iowait boost now per-task
> - boosting based upon iowaits seen per interval
>
> >
> > I will continue to review your patchset
>
> Thank you, looking forward to seeing your review.
>
> >>[snip]
> Kind Regards,
> Christian