
blk-throttle: fix repeat limit on bio with BIO_BPS_THROTTLED

Message ID 20240419120747.38031-1-zhoutaiyu@kuaishou.com (mailing list archive)
State New, archived

Commit Message

周泰宇 April 19, 2024, 12:07 p.m. UTC
To give a concrete example: a bio is throttled because it reaches the bps
limit. It is then dispatched to the request layer after a delay. In the
request layer it is split, and the split bio, flagged with
BIO_BPS_THROTTLED, re-enters blk-throttle.
A bio with BIO_BPS_THROTTLED should not be throttled for its bytes
again. However, when bps_limit and iops_limit are both set and
sq->queue is not empty, the bio will be throttled again even if the tg is
still within the iops limit.

Test script:
cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device
for ((i=0;i<50;i++));do
  fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
-time_based=1 -runtime=30  -name=testt_$i -filename=testf_$i > /dev/null &
  echo $! > $cgpath/tasks
done

The output of iostat:
Device:  ...  wMB/s  ...
sdb      ...  3.75  ...
sdb      ...  2.50  ...
sdb      ...  3.75  ...
sdb      ...  2.50  ...
sdb      ...  3.75  ...

To fix this problem, queue the bio early only when sq->queue is not
empty and the bio is not flagged with BIO_BPS_THROTTLED.

Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
---
 block/blk-throttle.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Yu Kuai April 20, 2024, 1:48 a.m. UTC | #1
Hi,

On 2024/04/19 20:07, zhoutaiyu wrote:
> To give a concrete example: a bio is throttled because it reaches the bps
> limit. It is then dispatched to the request layer after a delay. In the
> request layer it is split, and the split bio, flagged with
> BIO_BPS_THROTTLED, re-enters blk-throttle.
> A bio with BIO_BPS_THROTTLED should not be throttled for its bytes
> again. However, when bps_limit and iops_limit are both set and
> sq->queue is not empty, the bio will be throttled again even if the tg is
> still within the iops limit.

I don't understand this part; the split bio should be throttled by the iops
limit again, this is expected. If you mean that the throttle time calculated
from iops_limit is wrong, you need to provide more information.
> 
> Test script:
> cgpath=/sys/fs/cgroup/blkio/test0
> mkdir -p $cgpath
> echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
> echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device

What? 8:0 and 8:16?

> for ((i=0;i<50;i++));do
>    fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
> -time_based=1 -runtime=30  -name=testt_$i -filename=testf_$i > /dev/null &
>    echo $! > $cgpath/tasks
> done
> 
> The output of iostat:
> Device:  ...  wMB/s  ...
> sdb      ...  3.75  ...
> sdb      ...  2.50  ...
> sdb      ...  3.75  ...
> sdb      ...  2.50  ...
> sdb      ...  3.75  ...
> 
> To fix this problem, queue the bio early only when sq->queue is not
> empty and the bio is not flagged with BIO_BPS_THROTTLED.
> 
> Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
> ---
>   block/blk-throttle.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index f4850a6..499c006 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -913,7 +913,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>   	 * queued.
>   	 */
>   	BUG_ON(tg->service_queue.nr_queued[rw] &&
> -	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
> +	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
> +	       !bio_flagged(bio, BIO_BPS_THROTTLED));
>   
>   	/* If tg->bps = -1, then BW is unlimited */
>   	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
> @@ -2201,7 +2202,7 @@ bool __blk_throtl_bio(struct bio *bio)
>   		throtl_downgrade_check(tg);
>   		throtl_upgrade_check(tg);
>   		/* throtl is FIFO - if bios are already queued, should queue */
> -		if (sq->nr_queued[rw])
> +		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))

No, this change is wrong. Split IO will not be throttled by iops limit
anymore.

Thanks,
Kuai

>   			break;
>   
>   		/* if above limits, break to queue */
>
周泰宇 April 22, 2024, 3:33 a.m. UTC | #2
Hi,

>> Test script:
>> cgpath=/sys/fs/cgroup/blkio/test0
>> mkdir -p $cgpath
>> echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
>> echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device

> What? 8:0 and 8:16?


My bad, I made a typo here. It should be 8:0 in both lines (or 8:16 in both).
What I wanted to do here was to set an easily reachable value for BPS_LIMIT (10 MB/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).


Under this setting, iostat shows that the bps is sometimes far less than 10 MB/s and sometimes far larger than 10 MB/s.


Once I cancel the IOPS_LIMIT, i.e., echo 0 to write_iops_device, the bps stabilizes at around 10 MB/s.


The root cause is that the split bio is throttled again, even though the tg is under the IOPS_LIMIT, whenever the tg's sq->queue is not empty.


Let me explain it with the code.


If we only set BPS_LIMIT and the bio is flagged with BIO_BPS_THROTTLED, blk_should_throtl() will always return false and __blk_throtl_bio() will not be called. So a bio flagged with BIO_BPS_THROTTLED will never be throttled.


However, if we set both BPS_LIMIT and IOPS_LIMIT, blk_should_throtl() will always return true no matter what value IOPS_LIMIT has, because tg->has_rules_iops[rw] is always true.


After that, the bio is passed to __blk_throtl_bio().
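The gating described above can be modeled with a small sketch (hypothetical Python, not kernel code; the arguments stand in for tg->has_rules_bps[rw], tg->has_rules_iops[rw] and the BIO_BPS_THROTTLED flag):

```python
def should_enter_throtl(has_rules_bps, has_rules_iops, bps_throttled):
    """Toy model of the blk_should_throtl() gating described above."""
    if has_rules_iops:
        # An iops rule always forces the bio into __blk_throtl_bio(),
        # regardless of the BIO_BPS_THROTTLED flag.
        return True
    if has_rules_bps and not bps_throttled:
        # A bps rule only applies to bios not already bps-throttled.
        return True
    return False
```

With only a bps rule, a flagged bio skips throttling entirely; adding any iops rule, however large, sends it back into __blk_throtl_bio().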


If the tg's sq->queue is empty, blk-throttle calculates whether the tg is above its limits with this bio.
Since the bio is flagged with BIO_BPS_THROTTLED, blk-throttle only checks the iops limit for the tg.
If the tg is within the iops limit with this bio, the bio is not throttled.
Otherwise, the bio is throttled because of the iops limit.


However, if the tg's sq->queue is not empty, the bio is throttled directly, without any iops or bps limit calculations.


The related code snippet is:

bool __blk_throtl_bio(struct bio *bio)
{
        ...
        while (true) {
                ...
                /* throtl is FIFO - if bios are already queued, should queue */
                if (sq->nr_queued[rw])
                        break;  /* ----> the bio is throttled in any case */

                /* if above limits, break to queue */
                if (!tg_may_dispatch(tg, bio, NULL))  /* ----> do iops and bps calculations */
                        break;  /* ----> throttled if the tg is above the bps or iops limit */
                ...
                if (!tg) {
                        bio_set_flag(bio, BIO_BPS_THROTTLED);
                        goto out_unlock;  /* ----> pass, no throttle */
                }
                ...
        }
}


So even though the bio is flagged with BIO_BPS_THROTTLED and the tg is far below the IOPS_LIMIT, the bio is throttled whenever sq->queue is not empty.
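Putting the pieces together, the queueing decision described above can be sketched as follows (a hypothetical model of the flow, not kernel code):

```python
def bio_is_queued(nr_queued, bps_throttled, within_iops, within_bps):
    """Toy model of the __blk_throtl_bio() decision described above."""
    if nr_queued > 0:
        # FIFO check comes first: the bio is queued unconditionally,
        # even a BIO_BPS_THROTTLED bio well within the iops limit.
        return True
    if bps_throttled:
        # Only the iops limit is checked for an already bps-throttled bio.
        return not within_iops
    return not (within_iops and within_bps)
```

The first branch is the reported problem: a flagged split bio arriving while anything is queued is throttled without any limit calculation.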


This problem can be reproduced by running the following scripts and comparing the outputs of iostat.
1)
PNUM=50 # large enough to saturate the sq->queue

cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:0 100000" > $cgpath/blkio.throttle.write_iops_device  # large enough to be unreachable
for ((i=0;i<$PNUM;i++));do
  fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
-time_based=1 -runtime=30  -name=testt_$i -filename=testf_$i > /dev/null &
  echo $! > $cgpath/tasks
done

2)

PNUM=50 # large enough to saturate the sq->queue

cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:0 0" > $cgpath/blkio.throttle.write_iops_device  # do not set an iops limit
for ((i=0;i<$PNUM;i++));do
  fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
-time_based=1 -runtime=30  -name=testt_$i -filename=testf_$i > /dev/null &
  echo $! > $cgpath/tasks
done

>> Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
>> ---
>>   block/blk-throttle.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index f4850a6..499c006 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -913,7 +913,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>>        * queued.
>>        */
>>       BUG_ON(tg->service_queue.nr_queued[rw] &&
>> -            bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
>> +            bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
>> +            !bio_flagged(bio, BIO_BPS_THROTTLED));
>>
>>       /* If tg->bps = -1, then BW is unlimited */
>>       if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
>> @@ -2201,7 +2202,7 @@ bool __blk_throtl_bio(struct bio *bio)
>>               throtl_downgrade_check(tg);
>>               throtl_upgrade_check(tg);
>>               /* throtl is FIFO - if bios are already queued, should queue */
>> -             if (sq->nr_queued[rw])
>> +             if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))

> No, this change is wrong. Split IO will not be throttled by iops limit
anymore.

After this change, a split IO is still throttled by the iops limit if it pushes the tg over its iops limit; it is just no longer throttled unconditionally whenever sq->queue is not empty.
Yu Kuai April 22, 2024, 3:47 a.m. UTC | #3
Hi!

On 2024/04/22 11:33, 周泰宇 wrote:
> What I wanted to do here was to set an easily reachable value for BPS_LIMIT (10 MB/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
> 
> 
> Under this setting, iostat shows that the bps is sometimes far less than 10 MB/s and sometimes far larger than 10 MB/s.

Yes, I know this behaviour, and it is because blk-throttle works
before IO split, while io stats accounts bps for rq-based disks after
IO split; if you use Q2C for bps you'll see that the bps is stable at the
limit.
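For reference, Q2C-based throughput can be observed with the blktrace toolchain (a sketch only; assumes root, the blktrace package, and that sdb is the device under test, run while the fio workload is active):

```shell
# Capture 30 seconds of block-layer events for sdb
blktrace -d /dev/sdb -o sdb_trace -w 30

# Merge the per-CPU trace files into one binary stream
blkparse -i sdb_trace -d sdb_trace.bin > /dev/null

# btt summarizes Q2C (queue-to-complete) times, reflecting what
# actually completes at the device rather than what enters the queue
btt -i sdb_trace.bin
```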

Hi, Tejun!

Do you think this *phenomenon* needs to be fixed? If so, I don't see an
easy way other than throttling bios after *IO split*. Perhaps, other than
the bio merge case, this can be another motivation to move blk-throttle to
rq_qos_throttle().

Thanks,
Kuai
Yu Kuai April 22, 2024, 9:42 a.m. UTC | #4
Hi,

On 2024/04/22 11:33, 周泰宇 wrote:
>>>                 /* throtl is FIFO - if bios are already queued, should queue */
>>> -             if (sq->nr_queued[rw])
>>> +             if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))
>>   No, this change is wrong. Split IO will not be throttled by iops limit
> anymore.
> 
> After this change, the split IO will be throttled by iops limit again if it reaches a tg's iops limit and will not be throttled in any cases if the sq->queue is not empty.

Forgot to reply here,

The point here is that you break the FIFO rule: blk-throttle
only judges whether the bio at the head of the queue is within limits. The
current code judges whether the tg's iops reaches the limit only on the
condition that no bio is throttled, and the throttle time is always
calculated from the first throttled bio. But this patch ignores the
throttled-bio case, and that's why I said IO will not be throttled by the
iops limit anymore. You can test this with the bps limit disabled.

Thanks,
Kuai
周泰宇 April 22, 2024, 1:05 p.m. UTC | #5
Hi, Kuai

Thank you so much for your feedback.

> The point here is that you break the FIFO rule: blk-throttle
> only judges whether the bio at the head of the queue is within limits. The
> current code judges whether the tg's iops reaches the limit only on the
> condition that no bio is throttled, and the throttle time is always
> calculated from the first throttled bio. But this patch ignores the
> throttled-bio case, and that's why I said IO will not be throttled by the
> iops limit anymore. You can test this with the bps limit disabled.

Sorry, I don't get it. Yes, I know this patch breaks the FIFO rule,
but I didn't find any problems after letting split bios break this rule.
I think a split bio should be ignored here if it does not push the tg over
the iops limit, and should instead be judged in tg_within_iops_limit().

In the patch, a split bio within the iops limit is dispatched directly.
However, if the split bio pushes a tg over the IOPS limit, it is throttled
and added to the tg's sq->queue[rw] by throtl_add_bio_tg().

There are two cases after the split bio is judged to be throttled.

1) The tg->sq->queue is empty. In this case, the split bio becomes the first queued bio. The tg's
latest dispatch time is calculated from the split bio by tg_update_disptime(), and
the dispatch timer is woken up by throtl_schedule_next_dispatch().

2) The tg->sq->queue is not empty. In this case, the tg's dispatch time has already been updated
from the first queued bio and the dispatch timer has already been woken up.
So blk-throttle just adds the split bio to the tail of sq->queue[rw], which matches
the current code (the split bio is throttled because sq->nr_queued[rw] is not zero).


I have tested this patch by setting only the iops limit, as you advised, with the following script:

#!/bin/bash

CNUM=30
mkdir -p /sys/fs/cgroup/blkio/test0
#echo "8:16 10485760" > /sys/fs/cgroup/blkio/test0/blkio.throttle.write_bps_device
echo "8:16 50" > /sys/fs/cgroup/blkio/test0/blkio.throttle.write_iops_device

for ((i=0;i<$CNUM;i++));do
        fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 -time_based=1 -runtime=110  -name=testt_$i -filename=testf_$i  > /dev/null &
        echo $! > /sys/fs/cgroup/blkio/test0/tasks
done

I monitored the w/s in iostat and the change of /sys/fs/cgroup/blkio/test0/blkio.throttle.io_serviced every second.
It seems to work perfectly (around 50).

I can also see the split bios being throttled through the trace_pipe:

    kworker/19:2-1603    [019] d..1.   583.556012:   8,16   X  WS 977153216 / 977155776 [kworker/19:2]
    kworker/19:2-1603    [019] d..1.   583.556012:   8,16  1,0  m   N throtl [W] bio. bdisp=0 sz=1572864 bps=18446744073709551615 iodisp=5 iops=50 queued=0/28
    ......
    kworker/19:2-1603    [019] d..1.   583.556017:   8,16   X  WS 1870472 / 1873032 [kworker/19:2]
    kworker/19:2-1603    [019] d..1.   583.556018:   8,16  1,0  m   N throtl [W] bio. bdisp=0 sz=1572864 bps=18446744073709551615 iodisp=5 iops=50 queued=0/29

Would you mind giving me more hints to find the problems with my patch, or trying my patch?

Sincerely,
Taiyu Zhou
Tejun Heo April 22, 2024, 5:39 p.m. UTC | #6
Hello, Yu Kuai.

On Mon, Apr 22, 2024 at 11:47:41AM +0800, Yu Kuai wrote:
> Hi!
> 
> On 2024/04/22 11:33, 周泰宇 wrote:
> > What I wanted to do here was to set an easily reachable value for BPS_LIMIT (10 MB/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
> > 
> > 
> > Under this setting, iostat shows that the bps is sometimes far less than 10 MB/s and sometimes far larger than 10 MB/s.
> 
> Yes, I know this behaviour, and it is because blk-throttle works
> before IO split, while io stats accounts bps for rq-based disks after
> IO split; if you use Q2C for bps you'll see that the bps is stable at the
> limit.
>
> Hi, Tejun!
> 
> Do you think this *phenomenon* needs to be fixed? If so, I don't see an
> easy way other than throttling bios after *IO split*. Perhaps, other than
> the bio merge case, this can be another motivation to move blk-throttle to
> rq_qos_throttle().

Yeah, blk-throtl is sitting too early in the pipeline to easily track how
the bios actually get issued. However, given that it's been available for
bio-based drivers for a really long time, I don't think it'd be a good idea
to move it, so the iops limit is always going to be a bit unreliable w.r.t.
what actually gets issued to the device. So, IMHO, if the oddity is just about
how IOs are counted, I don't think it's a critical problem on its own.

Thanks.
Yu Kuai April 23, 2024, 1:50 a.m. UTC | #7
Hi, Tejun!

On 2024/04/23 1:39, tj@kernel.org wrote:
> Hello, Yu Kuai.
> 
> On Mon, Apr 22, 2024 at 11:47:41AM +0800, Yu Kuai wrote:
>> Hi!
>>
>> On 2024/04/22 11:33, 周泰宇 wrote:
>>> What I wanted to do here was to set an easily reachable value for BPS_LIMIT (10 MB/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
>>>
>>>
>>> Under this setting, iostat shows that the bps is sometimes far less than 10 MB/s and sometimes far larger than 10 MB/s.
>>
>> Yes, I know this behaviour, and it is because blk-throttle works
>> before IO split, while io stats accounts bps for rq-based disks after
>> IO split; if you use Q2C for bps you'll see that the bps is stable at the
>> limit.
>>
>> Hi, Tejun!
>>
>> Do you think this *phenomenon* needs to be fixed? If so, I don't see an
>> easy way other than throttling bios after *IO split*. Perhaps, other than
>> the bio merge case, this can be another motivation to move blk-throttle to
>> rq_qos_throttle().
> 
> Yeah, blk-throtl is sitting too early in the pipeline to easily track how
> the bios actually get issued. However, given that it's been available for
> bio-based drivers for a really long time, I don't think it'd be a good idea
> to move it, so the iops limit is always going to be a bit unreliable w.r.t.
> what actually gets issued to the device. So, IMHO, if the oddity is just about
> how IOs are counted, I don't think it's a critical problem on its own.

Got it, and agreed. Considering that the bps limit at the Q stage is stable,
although iostat can observe the bps being higher or lower at times, overall it
should be accurate.

Hi, Zhoutaiyu,

If you really want to fix this, you must come up with a solution that
respects the FIFO rules. Breaking them like this patch does is not something
we'll accept; it breaks fairness and has other flaws.

Thanks,
Kuai

> 
> Thanks.
>
周泰宇 April 23, 2024, 2:29 a.m. UTC | #8
Hi,

> If you really want to fix this, you must come up with a solution that
> respects the FIFO rules. Breaking them like this patch does is not something
> we'll accept; it breaks fairness and has other flaws.

Ok, I see. Thank you so much for the feedback.

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f4850a6..499c006 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -913,7 +913,8 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * queued.
 	 */
 	BUG_ON(tg->service_queue.nr_queued[rw] &&
-	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
+	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
+	       !bio_flagged(bio, BIO_BPS_THROTTLED));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
@@ -2201,7 +2202,7 @@  bool __blk_throtl_bio(struct bio *bio)
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
-		if (sq->nr_queued[rw])
+		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))
 			break;
 
 		/* if above limits, break to queue */