Blk-throttle: reduce tail io latency when iops limit is enforced

Message ID 1532132960-44970-1-git-send-email-bo.liu@linux.alibaba.com (mailing list archive)
State New, archived

Commit Message

Liu Bo July 21, 2018, 12:29 a.m. UTC
When an application's iops exceeds its cgroup's iops limit, it is
throttled and the kernel sets a timer for dispatching, so the IO
latency includes this delay.

However, the dispatch delay, which is currently calculated from the
limit and the elapsed jiffies, is suboptimal.  Since a bio is only
throttled once the group's io count exceeds what the current slice
allows, and the allowance is recomputed at the next slice boundary,
the bio never needs to wait longer than the remaining time of the
current slice.
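
To make the arithmetic concrete, here is a standalone sketch of the two
wait calculations (illustrative only: HZ=1000, the 20ms SSD default
throtl_slice, and the group state below are assumed values, not taken
from the test setup):
-----
/* build with: gcc -o wait_calc wait_calc.c */
#include <stdio.h>

int main(void)
{
	unsigned long hz = 1000;		/* assumed HZ */
	unsigned long throtl_slice = 20;	/* jiffies, i.e. 20ms */
	unsigned long iops_limit = 100;
	unsigned long io_disp = 2;		/* IOs dispatched this slice */
	unsigned long jiffy_elapsed = 5;	/* 5ms into the slice */
	/* elapsed time rounded up to a whole slice, as the kernel does */
	unsigned long jiffy_elapsed_rnd =
		(jiffy_elapsed + throtl_slice - 1) / throtl_slice * throtl_slice;

	/* old: wait until (io_disp + 1) IOs would be allowed at the limit */
	unsigned long old_wait =
		(io_disp + 1) * hz / iops_limit + 1 - jiffy_elapsed;

	/* new: wait only until the end of the current slice */
	unsigned long new_wait = jiffy_elapsed_rnd - jiffy_elapsed;

	/* prints: old: 26 jiffies, new: 15 jiffies */
	printf("old: %lu jiffies, new: %lu jiffies\n", old_wait, new_wait);
	return 0;
}
-----
At the next slice boundary the allowance grows to iops_limit * 2 *
throtl_slice / hz = 4 >= io_disp + 1 = 3, so the bio is dispatchable
then anyway; the extra 11 jiffies of the old formula were pure added
tail latency.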

The difference can be demonstrated with the following fio job and
cgroup iops setting:
-----
$ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
$ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
$ cat r2.job
[global]
name=fio-rand-read
filename=/dev/nullb1
rw=randread
bs=4k
direct=1
numjobs=1
time_based=1
runtime=60
group_reporting=1

[file1]
size=4G
ioengine=libaio
iodepth=1
rate_iops=50000
norandommap=1
thinktime=4ms
-----

With thinktime=4ms and iodepth=1 the job can offer at most ~250 iops,
well above the 100 iops cap, so throttling is exercised continuously.

w/o patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
    slat (usec): min=10, max=336, avg=27.71, stdev=17.82
    clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
     lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
     | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
     | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
     | 99.99th=[28967]

w/ patch:
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7-66-gedfc
Starting 1 process

   read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
    slat (usec): min=10, max=155, avg=23.24, stdev=16.79
    clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
     lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
     | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
     | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
     | 99.99th=[12387]

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 block/blk-throttle.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Comments

Liu Bo Aug. 9, 2018, 5:21 p.m. UTC | #1
ping?


On Fri, Jul 20, 2018 at 5:29 PM, Liu Bo <bo.liu@linux.alibaba.com> wrote:
> [patch quoted in full; snipped]
Jens Axboe Aug. 9, 2018, 5:22 p.m. UTC | #2
On 7/20/18 6:29 PM, Liu Bo wrote:
> When an application's iops exceeds its cgroup's iops limit, it is
> throttled and the kernel sets a timer for dispatching, so the IO
> latency includes this delay.
> 
> However, the dispatch delay, which is currently calculated from the
> limit and the elapsed jiffies, is suboptimal.  Since a bio is only
> throttled once the group's io count exceeds what the current slice
> allows, and the allowance is recomputed at the next slice boundary,
> the bio never needs to wait longer than the remaining time of the
> current slice.

Looks good to me - can you resend against my for-4.19/block branch, as
it doesn't apply.
Liu Bo Aug. 9, 2018, 5:25 p.m. UTC | #3
On Thu, Aug 9, 2018 at 10:22 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 7/20/18 6:29 PM, Liu Bo wrote:
>> [commit message snipped]
>
> Looks good to me - can you resend against my for-4.19/block branch, as
> it doesn't apply.
>

Sure, I'll rebase against it.

thanks,
liubo

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 26ddbe5286f8..3ca9a98748be 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -956,12 +956,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 		return false;
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;
-
-	if (jiffy_wait > jiffy_elapsed)
-		jiffy_wait = jiffy_wait - jiffy_elapsed;
-	else
-		jiffy_wait = 1;
+	jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed;
 
 	*wait = jiffy_wait;
 	return false;
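
For context, the hunk above sits in tg_with_in_iops_limit(). Below is a
condensed sketch of the function with the patch applied (simplified
from the kernel source of that era; the NULL check on *wait and the
UINT_MAX clamp on io_allowed in the original are omitted):
-----
static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
				  unsigned long *wait)
{
	bool rw = bio_data_dir(bio);
	unsigned int io_allowed;
	unsigned long jiffy_elapsed, jiffy_elapsed_rnd;

	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];

	/* slice has just started; consider one slice interval */
	if (!jiffy_elapsed)
		jiffy_elapsed_rnd = tg->td->throtl_slice;

	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

	/* IOs allowed by the end of the current (rounded-up) slice */
	io_allowed = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd / HZ;

	if (tg->io_disp[rw] + 1 <= io_allowed) {
		*wait = 0;
		return true;
	}

	/* over the allowance: wait only until the end of the current slice */
	*wait = jiffy_elapsed_rnd - jiffy_elapsed;
	return false;
}
-----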