Message ID | 20250222092823.210318-3-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | blk-throttle: fix off-by-one jiffies wait_time | expand |
Hi Yukuai, On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > wait_time is based on jiffies, and the calculation is: > > wait_time = extra_bytes * HZ / bps_limit; > > If wait time is not zero, the remainder is ignored, means wait time is > short by at most 1 jiffes; On the other hand, if wait time is zero, it > will be used as 1 jiffies, means it's excessive by at most 1 jiffes. > > This way causes blktests throtl/001 failure in case of CONFIG_HZ_100=y, > fix the problem by recording remainder as debt, and repay the debt > later. After further analysis, I figured out that this one extra jiffy isn't the only reason for throtl/001 failure. In blktests throtl/001, bps_limit is 1MB/sec, and BS is 4k, and COFIG_HZ=100, and default throttle slice is 2 jiffies(20ms): - 20ms can submit 5 bios: 1024K/50(5*4k=20KB) - the 6th bio is throttled, and the calculated wait is 1 jiffy from tg_within_bps_limit() - given all the 6 bios are handled in the time of jiffies A, so A + 1(wait) + 2(slice) is programmed to start pending timer for scheduling dispatch - when the pending timer is expired, the 6th bio is submitted, then the current slice is trim/reset since the throttled 6th bio is dispatched. Now in the whole throttle slice period, 6 bios(24KB) are submitted, and 3 jiffies are taken, so 256 bios will take (256/6) * 30ms = 1.3sec. But blktests throtl/001 still should pass since the allowed deviation is 0.5sec, and 1.3 < 1.5. Actually there is another reason of timer delay, looks one extra jiffy is delayed when the timer is triggered, which can be observed reliably by: bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}' Then 256 bios will take (256/6) * 40ms = 1.7sec, which does match with observation in throtl/001. Yeah, killing one jiffy may pass blktests throtl/001, but, ... > > Reported-and-tested-by: Ming Lei <ming.lei@redhat.com> > Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/ > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > block/blk-throttle.c | 24 +++++++++++++++++------- > block/blk-throttle.h | 12 ++++++++---- > 2 files changed, 25 insertions(+), 11 deletions(-) > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > index 0096e486b1e3..3828c6535605 100644 > --- a/block/blk-throttle.c > +++ b/block/blk-throttle.c > @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio > } > > static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, > - u64 bps_limit) > + u64 bps_limit, bool *has_debt) > { > bool rw = bio_data_dir(bio); > + long long carryover_bytes; > long long bytes_allowed; > u64 extra_bytes; > unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; > @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, > > /* Calc approx time to dispatch */ > extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed; > - jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit); > - > - if (!jiffy_wait) > - jiffy_wait = 1; > + jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes); > + if (carryover_bytes) { > + /* > + * If extra_bytes is not divisible, the remainder is recorded as > + * debt. Caller must ensure the current slice has at least 1 > + * more jiffies to repay the debt. > + */ > + *has_debt = true; > + tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ); > + } Thinking of further, it may not be good to use ->carryover_bytes[]: - if tg_within_bps_limit() returns 0 and the bio is dispatched immediately, throtl_trim_slice() is called and ->carryover_bytes[] is cleared. - if tg_within_bps_limit() returns >0, this bio will be throttled, and tg_within_bps_limit() may be called more than one time, so tg->carryover_bytes[] could be over-counted. Actually this patch changes tg_within_bps_limit() to one stateful function... > > /* > * This wait time is without taking into consideration the rounding > @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, > unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0; > u64 bps_limit = tg_bps_limit(tg, rw); > u32 iops_limit = tg_iops_limit(tg, rw); > + bool has_debt = false; > > /* > * Currently whole state machine of group depends on first bio > @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, > else > throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice); > > - bps_wait = tg_within_bps_limit(tg, bio, bps_limit); > + bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt); > iops_wait = tg_within_iops_limit(tg, bio, iops_limit); > if (bps_wait + iops_wait == 0) { > if (wait) > *wait = 0; > + if (has_debt) > + throtl_extend_slice(tg, rw, jiffies + 1); > return true; > } > > max_wait = max(bps_wait, iops_wait); > if (wait) > *wait = max_wait; > - throtl_extend_slice(tg, rw, jiffies + max_wait); > + throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt); > > return false; > } > diff --git a/block/blk-throttle.h b/block/blk-throttle.h > index 1a36d1278eea..56dcb5ce412f 100644 > --- a/block/blk-throttle.h > +++ b/block/blk-throttle.h > @@ -110,10 +110,14 @@ struct throtl_grp { > unsigned int last_io_disp[2]; > > /* > - * The following two fields are updated when new configuration is > - * submitted while some bios are still throttled, they record how many > - * bytes/ios are waited already in previous configuration, and they will > - * be used to calculate wait time under new configuration. > + * The following two fields are updated when: > + * 1) new configuration is submitted while some bios are still > + * throttled, they record how many bytes/ios are waited already in > + * previous configuration; > + * 2) IOs which may cause priority inversions are dispatched while tg is > + * over limit, these IOs will be dispatched directly; > + * 3) While calculating wait_time for IO, extra_bytes * HZ is not > + * divisible by bps_limit, the remainder will be recorded; > */ blk-throttle takes token bucket algorithm, and the implementation shouldn't be sensitive with the above two factors, because the bps rate is controlled over the whole bucket(slice). Meantime it is still tricky to maintain ->carryover_bytes during slice cycle, which can't cover timer delay too. Another way is to avoid to trim slice too soon in case of owning too much debt, something like the following does avoid this issue: diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 8d149aff9fd0..a778ebbb6887 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw) if (bytes_trim <= 0 && io_trim <= 0) return; + if (tg_bps_limit(tg, rw) != U64_MAX) { + long long disp = tg->bytes_disp[rw]; + + if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed < + 8 * tg->td->throtl_slice) + return; + } + tg->carryover_bytes[rw] = 0; if ((long long)tg->bytes_disp[rw] >= bytes_trim) tg->bytes_disp[rw] -= bytes_trim; Thanks, Ming
diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 0096e486b1e3..3828c6535605 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio } static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, - u64 bps_limit) + u64 bps_limit, bool *has_debt) { bool rw = bio_data_dir(bio); + long long carryover_bytes; long long bytes_allowed; u64 extra_bytes; unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd; @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio, /* Calc approx time to dispatch */ extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed; - jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit); - - if (!jiffy_wait) - jiffy_wait = 1; + jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes); + if (carryover_bytes) { + /* + * If extra_bytes is not divisible, the remainder is recorded as + * debt. Caller must ensure the current slice has at least 1 + * more jiffies to repay the debt. + */ + *has_debt = true; + tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ); + } /* * This wait time is without taking into consideration the rounding @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0; u64 bps_limit = tg_bps_limit(tg, rw); u32 iops_limit = tg_iops_limit(tg, rw); + bool has_debt = false; /* * Currently whole state machine of group depends on first bio @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, else throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice); - bps_wait = tg_within_bps_limit(tg, bio, bps_limit); + bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt); iops_wait = tg_within_iops_limit(tg, bio, iops_limit); if (bps_wait + iops_wait == 0) { if (wait) *wait = 0; + if (has_debt) + throtl_extend_slice(tg, rw, jiffies + 1); return true; } max_wait = max(bps_wait, iops_wait); if (wait) *wait = max_wait; - throtl_extend_slice(tg, rw, jiffies + max_wait); + throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt); return false; } diff --git a/block/blk-throttle.h b/block/blk-throttle.h index 1a36d1278eea..56dcb5ce412f 100644 --- a/block/blk-throttle.h +++ b/block/blk-throttle.h @@ -110,10 +110,14 @@ struct throtl_grp { unsigned int last_io_disp[2]; /* - * The following two fields are updated when new configuration is - * submitted while some bios are still throttled, they record how many - * bytes/ios are waited already in previous configuration, and they will - * be used to calculate wait time under new configuration. + * The following two fields are updated when: + * 1) new configuration is submitted while some bios are still + * throttled, they record how many bytes/ios are waited already in + * previous configuration; + * 2) IOs which may cause priority inversions are dispatched while tg is + * over limit, these IOs will be dispatched directly; + * 3) While calculating wait_time for IO, extra_bytes * HZ is not + * divisible by bps_limit, the remainder will be recorded; */ long long carryover_bytes[2]; int carryover_ios[2];