Message ID | 20240419120747.38031-1-zhoutaiyu@kuaishou.com (mailing list archive)
---|---
State | New, archived
Series | blk-throttle: fix repeat limit on bio with BIO_BPS_THROTTLED
Hi,

On 2024/04/19 20:07, zhoutaiyu wrote:
> To give a concrete example: a bio is throttled because it reaches the bps
> limit. It is then dispatched to the request layer after a delay. In the
> request layer it is split, and the split bio, flagged with
> BIO_BPS_THROTTLED, re-enters blk-throttle.
> A bio with BIO_BPS_THROTTLED should not be throttled for its bytes
> again. However, when bps_limit and iops_limit are both set and
> sq->queue is not empty, the bio will be throttled again even though the
> tg is still within the iops limit.

I don't understand here: a split bio should be throttled by the iops limit
again, this is expected. If you mean that the throtl time calculated from
iops_limit is wrong, you need to provide more information.

> Test scripts:
> cgpath=/sys/fs/cgroup/blkio/test0
> mkdir -p $cgpath
> echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
> echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device

What? 8:0 and 8:16?

> for ((i=0;i<50;i++));do
>     fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
>         -time_based=1 -runtime=30 -name=testt_$i -filename=testf_$i > /dev/null &
>     echo $! > $cgpath/tasks
> done
>
> The output of iostat:
> Device: ... wMB/s ...
> sdb     ... 3.75  ...
> sdb     ... 2.50  ...
> sdb     ... 3.75  ...
> sdb     ... 2.50  ...
> sdb     ... 3.75  ...
>
> In order to fix this problem, throttle the bio early only when
> sq->queue is not empty and the bio is not flagged with BIO_BPS_THROTTLED.
>
> Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
> ---
>  block/blk-throttle.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index f4850a6..499c006 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -913,7 +913,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>  	 * queued.
>  	 */
>  	BUG_ON(tg->service_queue.nr_queued[rw] &&
> -	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
> +	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
> +	       !bio_flagged(bio, BIO_BPS_THROTTLED));
>
>  	/* If tg->bps = -1, then BW is unlimited */
>  	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
> @@ -2201,7 +2202,7 @@ bool __blk_throtl_bio(struct bio *bio)
>  		throtl_downgrade_check(tg);
>  		throtl_upgrade_check(tg);
>  		/* throtl is FIFO - if bios are already queued, should queue */
> -		if (sq->nr_queued[rw])
> +		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))

No, this change is wrong. Split IO will not be throttled by the iops limit
anymore.

Thanks,
Kuai

>  			break;
>
>  		/* if above limits, break to queue */
Hi,

>> Test scripts:
>> cgpath=/sys/fs/cgroup/blkio/test0
>> mkdir -p $cgpath
>> echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
>> echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device

> What? 8:0 and 8:16?

My bad, I made a typo here. It should be all 8:0 or all 8:16.

What I wanted to do here was to set an easy-to-reach value for BPS_LIMIT
(10M/s in this example) and an unreachable value for IOPS_LIMIT (100000 in
this example).

Under this setting, iostat shows that the bps is sometimes far less than
10M/s and sometimes far larger than 10M/s. Once I cancel the IOPS_LIMIT,
i.e., echo 0 to write_iops_device, the bps stabilizes at around 10M/s.

The root cause is that a split bio will be throttled again, even though
the tg is under the IOPS_LIMIT, whenever the tg's sq->queue is not empty.
Let me explain it with the code.

If we only set BPS_LIMIT and the bio is flagged with BIO_BPS_THROTTLED,
blk_should_throtl() will always return false and __blk_throtl_bio() will
not be called. So a bio flagged with BIO_BPS_THROTTLED will never be
throttled in that case.

However, if we set both BPS_LIMIT and IOPS_LIMIT, blk_should_throtl()
will always return true no matter what value IOPS_LIMIT has, because
tg->has_rules_iops[rw] is always true. After that, the bio is passed to
__blk_throtl_bio().

If the tg's sq->queue is empty, blk-throttle calculates whether the tg is
above its limits with this bio. Since the bio is flagged with
BIO_BPS_THROTTLED, blk-throttle only checks the iops limit for the tg. If
the tg is under the iops limit with this bio, the bio is not throttled;
otherwise, the bio is throttled because of the iops limit.

However, if the tg's sq->queue is not empty, the bio is throttled
directly, without any iops or bps limit calculation. The related code
snippet is:

bool __blk_throtl_bio(struct bio *bio)
{
	.......
	while (true) {
		......
		/* throtl is FIFO - if bios are already queued, should queue */
		if (sq->nr_queued[rw])
			break;	/* -----> the bio will be throttled in any case */

		/* if above limits, break to queue */
		if (!tg_may_dispatch(tg, bio, NULL)) {	/* ----> do iops and bps calculations */
			break;	/* ---> the bio will be throttled if the tg is above bps or iops limits */
		......
		if (!tg) {
			bio_set_flag(bio, BIO_BPS_THROTTLED);
			goto out_unlock;	/* ----> pass, no throttle */
		}
		.......
	}
}

So even if the bio is flagged with BIO_BPS_THROTTLED and the tg is far
below the IOPS_LIMIT, the bio will be throttled if the sq->queue is not
empty.

This problem can be reproduced by running the following scripts and
comparing the outputs of iostat.

1)
PNUM=50		# large enough to saturate the sq->queue
cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:0 100000" > $cgpath/blkio.throttle.write_iops_device	# large enough to make it unreachable
for ((i=0;i<$PNUM;i++));do
	fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
		-time_based=1 -runtime=30 -name=testt_$i -filename=testf_$i > /dev/null &
	echo $! > $cgpath/tasks
done

2)
PNUM=50		# large enough to saturate the sq->queue
cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:0 0" > $cgpath/blkio.throttle.write_iops_device		# do not set iops limit
for ((i=0;i<$PNUM;i++));do
	fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
		-time_based=1 -runtime=30 -name=testt_$i -filename=testf_$i > /dev/null &
	echo $! > $cgpath/tasks
done

>> Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
>> ---
>>  block/blk-throttle.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index f4850a6..499c006 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -913,7 +913,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>>  	 * queued.
>>  	 */
>>  	BUG_ON(tg->service_queue.nr_queued[rw] &&
>> -	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
>> +	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
>> +	       !bio_flagged(bio, BIO_BPS_THROTTLED));
>>
>>  	/* If tg->bps = -1, then BW is unlimited */
>>  	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
>> @@ -2201,7 +2202,7 @@ bool __blk_throtl_bio(struct bio *bio)
>>  		throtl_downgrade_check(tg);
>>  		throtl_upgrade_check(tg);
>>  		/* throtl is FIFO - if bios are already queued, should queue */
>> -		if (sq->nr_queued[rw])
>> +		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))

> No, this change is wrong. Split IO will not be throttled by iops limit
> anymore.

After this change, the split IO will still be throttled by the iops limit
if it reaches a tg's iops limit; it just will no longer be throttled
unconditionally whenever the sq->queue is not empty.
Hi!

On 2024/04/22 11:33, 周泰宇 wrote:
> What I wanted to do here was to set an easy-to-reach value for BPS_LIMIT (10M/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
>
> Under this setting, iostat shows that the bps is sometimes far less than 10M/s and sometimes far larger than 10M/s.

Yes, I know this behaviour, and this is because blk-throttle works before
IO split, while io stats accounts bps for rq-based disks after IO split.
If you use Q2C for bps, you'll see that the bps is stable at the limit.

Hi, Tejun!

Do you think this *phenomenon* needs to be fixed? If so, I don't see an
easy way other than throttling the bio after *IO split*. Perhaps, other
than the bio merge case, this can be another motivation to move
blk-throttle to rq_qos_throttle().

Thanks,
Kuai
Hi,

On 2024/04/22 11:33, 周泰宇 wrote:
>>>  		/* throtl is FIFO - if bios are already queued, should queue */
>>> -		if (sq->nr_queued[rw])
>>> +		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))
>> No, this change is wrong. Split IO will not be throttled by iops limit
>> anymore.
>
> After this change, the split IO will be throttled by the iops limit again if it reaches a tg's iops limit and will not be throttled unconditionally when the sq->queue is not empty.

Forgot to reply here. The point is that you break the FIFO rule:
blk-throttle only judges whether the bio at the head of the queue is
within the limit. The current code checks whether the tg reaches its iops
limit only on the condition that no bio is throttled, and the throtl time
is always calculated from the first throttled bio. This patch ignores the
throttled-bio case, and that's why I said IO will not be throttled by the
iops limit anymore. You can test this with the bps limit disabled.

Thanks,
Kuai
Hi, Kuai

Thank you so much for your feedback.

> The point is that you break the FIFO rule: blk-throttle only judges
> whether the bio at the head of the queue is within the limit. The
> current code checks whether the tg reaches its iops limit only on the
> condition that no bio is throttled, and the throtl time is always
> calculated from the first throttled bio. This patch ignores the
> throttled-bio case, and that's why I said IO will not be throttled by
> the iops limit anymore. You can test this with the bps limit disabled.

Sorry, I don't get it. Yes, I know this patch breaks the FIFO rule, but I
didn't find any problems after letting the split bio break this rule.

I think the split bio should be ignored if it does not put the tg over
the iops limit, and it should be judged in tg_within_iops_limit(). With
this patch, a split bio within the iops limit is dispatched directly.
However, if the split bio puts a tg over the IOPS limit, the split bio is
throttled and added to the tg's sq->queue[rw] by calling
throtl_add_bio_tg().

There are two cases after the split bio is judged to be throttled.

1) The tg->sq->queue is empty. In this case, the split bio will be the
first queued bio. The tg's latest dispatch time is calculated according
to the split bio by calling tg_update_disptime(), and the dispatch timer
is woken up by calling throtl_schedule_next_dispatch().

2) The tg->sq->queue is not empty. In this case, the tg's dispatch time
has already been updated according to the first queued bio and the
dispatch timer has already been woken up. So blk-throttle just adds the
split bio to the tail of sq->queue[rw], which acts like the current code
(where the split bio is throttled because sq->nr_queued[rw] is not zero).

I have tested this patch by only setting the iops limit, as you advised,
with the following script:

#!/bin/bash
CNUM=30
mkdir -p /sys/fs/cgroup/blkio/test0
#echo "8:16 10485760" > /sys/fs/cgroup/blkio/test0/blkio.throttle.write_bps_device
echo "8:16 50" > /sys/fs/cgroup/blkio/test0/blkio.throttle.write_iops_device
for ((i=0;i<$CNUM;i++));do
	fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
		-time_based=1 -runtime=110 -name=testt_$i -filename=testf_$i > /dev/null &
	echo $! > /sys/fs/cgroup/blkio/test0/tasks
done

I monitored the w/s in iostat and the change of
/sys/fs/cgroup/blkio/test0/blkio.throttle.io_serviced every second. It
seems to work perfectly (around 50).

I also see the split bios being throttled through the trace_pipe:

kworker/19:2-1603 [019] d..1. 583.556012: 8,16 X WS 977153216 / 977155776 [kworker/19:2]
kworker/19:2-1603 [019] d..1. 583.556012: 8,16 1,0 m N throtl [W] bio. bdisp=0 sz=1572864 bps=18446744073709551615 iodisp=5 iops=50 queued=0/28
......
kworker/19:2-1603 [019] d..1. 583.556017: 8,16 X WS 1870472 / 1873032 [kworker/19:2]
kworker/19:2-1603 [019] d..1. 583.556018: 8,16 1,0 m N throtl [W] bio. bdisp=0 sz=1572864 bps=18446744073709551615 iodisp=5 iops=50 queued=0/29

Would you mind giving me more hints to find out the problems with my
patch, or trying my patch?

Sincerely,
Taiyu Zhou
Hello, Yu Kuai.

On Mon, Apr 22, 2024 at 11:47:41AM +0800, Yu Kuai wrote:
> Hi!
>
> On 2024/04/22 11:33, 周泰宇 wrote:
> > What I wanted to do here was to set an easy-to-reach value for BPS_LIMIT (10M/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
> >
> > Under this setting, iostat shows that the bps is sometimes far less than 10M/s and sometimes far larger than 10M/s.
>
> Yes, I know this behaviour, and this is because blk-throttle works
> before IO split, while io stats accounts bps for rq-based disks after
> IO split. If you use Q2C for bps, you'll see that the bps is stable at
> the limit.
>
> Hi, Tejun!
>
> Do you think this *phenomenon* needs to be fixed? If so, I don't see an
> easy way other than throttling the bio after *IO split*. Perhaps, other
> than the bio merge case, this can be another motivation to move
> blk-throttle to rq_qos_throttle().

Yeah, blk-throtl sits too early in the pipeline to easily track how the
bios actually get issued. However, given that it's been available for
bio-based drivers for a really long time, I don't think it'd be a good
idea to move it, so the iops limit is always going to be a bit unreliable
w.r.t. what actually gets issued to the device. So, IMHO, if the oddity
is just about how IOs are counted, I don't think it's a critical problem
on its own.

Thanks.
Hi, Tejun!

On 2024/04/23 1:39, tj@kernel.org wrote:
> Hello, Yu Kuai.
>
> On Mon, Apr 22, 2024 at 11:47:41AM +0800, Yu Kuai wrote:
>> Hi!
>>
>> On 2024/04/22 11:33, 周泰宇 wrote:
>>> What I wanted to do here was to set an easy-to-reach value for BPS_LIMIT (10M/s in this example) and an unreachable value for IOPS_LIMIT (100000 in this example).
>>>
>>> Under this setting, iostat shows that the bps is sometimes far less than 10M/s and sometimes far larger than 10M/s.
>>
>> Yes, I know this behaviour, and this is because blk-throttle works
>> before IO split, while io stats accounts bps for rq-based disks after
>> IO split. If you use Q2C for bps, you'll see that the bps is stable at
>> the limit.
>>
>> Hi, Tejun!
>>
>> Do you think this *phenomenon* needs to be fixed? If so, I don't see an
>> easy way other than throttling the bio after *IO split*. Perhaps, other
>> than the bio merge case, this can be another motivation to move
>> blk-throttle to rq_qos_throttle().
>
> Yeah, blk-throtl sits too early in the pipeline to easily track how the
> bios actually get issued. However, given that it's been available for
> bio-based drivers for a really long time, I don't think it'd be a good
> idea to move it, so the iops limit is always going to be a bit
> unreliable w.r.t. what actually gets issued to the device. So, IMHO, if
> the oddity is just about how IOs are counted, I don't think it's a
> critical problem on its own.

Got it, and agreed. Considering that the bps limit at the Q stage is
stable, although iostat can observe higher or lower bps at times, overall
it should be accurate.

Hi, Zhoutaiyu,

If you really want to fix this, you must come up with a solution that
respects the FIFO rules; breaking them as this patch does is not
something we'll accept, as it breaks fairness and has some other flaws.

Thanks,
Kuai

>
> Thanks.
>
Hi,

> If you really want to fix this, you must come up with a solution that
> respects the FIFO rules; breaking them as this patch does is not
> something we'll accept, as it breaks fairness and has some other flaws.

OK, I see. Thank you so much for the feedback.
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f4850a6..499c006 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -913,7 +913,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * queued.
 	 */
 	BUG_ON(tg->service_queue.nr_queued[rw] &&
-	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
+	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]) &&
+	       !bio_flagged(bio, BIO_BPS_THROTTLED));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
@@ -2201,7 +2202,7 @@ bool __blk_throtl_bio(struct bio *bio)
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
-		if (sq->nr_queued[rw])
+		if (sq->nr_queued[rw] && !bio_flagged(bio, BIO_BPS_THROTTLED))
 			break;
 
 		/* if above limits, break to queue */
To give a concrete example: a bio is throttled because it reaches the bps
limit. It is then dispatched to the request layer after a delay. In the
request layer it is split, and the split bio, flagged with
BIO_BPS_THROTTLED, re-enters blk-throttle.

A bio with BIO_BPS_THROTTLED should not be throttled for its bytes again.
However, when bps_limit and iops_limit are both set and sq->queue is not
empty, the bio will be throttled again even though the tg is still within
the iops limit.

Test scripts:
cgpath=/sys/fs/cgroup/blkio/test0
mkdir -p $cgpath
echo "8:0 10485760" > $cgpath/blkio.throttle.write_bps_device
echo "8:16 100000" > $cgpath/blkio.throttle.write_iops_device
for ((i=0;i<50;i++));do
	fio -rw=write -direct=1 -bs=4M -iodepth=8 -size=200M -numjobs=1 \
		-time_based=1 -runtime=30 -name=testt_$i -filename=testf_$i > /dev/null &
	echo $! > $cgpath/tasks
done

The output of iostat:
Device: ... wMB/s ...
sdb     ... 3.75  ...
sdb     ... 2.50  ...
sdb     ... 3.75  ...
sdb     ... 2.50  ...
sdb     ... 3.75  ...

In order to fix this problem, throttle the bio early only when sq->queue
is not empty and the bio is not flagged with BIO_BPS_THROTTLED.

Signed-off-by: zhoutaiyu <zhoutaiyu@kuaishou.com>
---
 block/blk-throttle.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)