
[2/2] blk-throttle: fix off-by-one jiffies wait_time

Message ID 20250222092823.210318-3-yukuai1@huaweicloud.com (mailing list archive)
State New
Headers show
Series blk-throttle: fix off-by-one jiffies wait_time | expand

Commit Message

Yu Kuai Feb. 22, 2025, 9:28 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

wait_time is based on jiffies, and the calculation is:

wait_time = extra_bytes * HZ / bps_limit;

If the wait time is not zero, the remainder is ignored, meaning the wait
time is short by at most 1 jiffy; on the other hand, if the wait time is
zero, it is rounded up to 1 jiffy, meaning it's excessive by at most
1 jiffy.

This causes blktests throtl/001 to fail when CONFIG_HZ_100=y. Fix the
problem by recording the remainder as debt and repaying the debt later.

Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 24 +++++++++++++++++-------
 block/blk-throttle.h | 12 ++++++++----
 2 files changed, 25 insertions(+), 11 deletions(-)

Comments

Ming Lei Feb. 22, 2025, 12:16 p.m. UTC | #1
Hi Yukuai,

On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> wait_time is based on jiffies, and the calculation is:
> 
> wait_time = extra_bytes * HZ / bps_limit;
> 
> If the wait time is not zero, the remainder is ignored, meaning the wait
> time is short by at most 1 jiffy; on the other hand, if the wait time is
> zero, it is rounded up to 1 jiffy, meaning it's excessive by at most
> 1 jiffy.
> 
> This causes blktests throtl/001 to fail when CONFIG_HZ_100=y. Fix the
> problem by recording the remainder as debt and repaying the debt later.

After further analysis, I figured out that this one extra jiffy isn't the
only reason for the throtl/001 failure.

In blktests throtl/001, bps_limit is 1MB/sec, BS is 4k,
CONFIG_HZ=100, and the default throttle slice is 2 jiffies (20ms):

- in 20ms, 5 bios can be submitted: 1024KB / 50 = 20KB per slice = 5 * 4KB

- the 6th bio is throttled, and the wait calculated by
tg_within_bps_limit() is 1 jiffy

- given that all 6 bios are handled at jiffies A, the pending timer for
scheduling dispatch is programmed to expire at A + 1 (wait) + 2 (slice)

- when the pending timer expires, the 6th bio is submitted, then the
  current slice is trimmed/reset since the throttled 6th bio has been
  dispatched.

So over the whole throttle slice period, 6 bios (24KB) are submitted and
3 jiffies are taken, meaning 256 bios will take (256/6) * 30ms = 1.3sec.

But blktests throtl/001 should still pass, since the allowed deviation is
0.5sec and 1.3 < 1.5.

Actually there is another source of delay: one extra jiffy looks to be
added when the timer is triggered, which can be observed reliably by:

bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'

Then 256 bios will take (256/6) * 40ms = 1.7sec, which matches the
observation in throtl/001.

Yeah, killing one jiffy may make blktests throtl/001 pass, but, ...

> 
> Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
> Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-throttle.c | 24 +++++++++++++++++-------
>  block/blk-throttle.h | 12 ++++++++----
>  2 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0096e486b1e3..3828c6535605 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
>  }
>  
>  static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
> -				u64 bps_limit)
> +				u64 bps_limit, bool *has_debt)
>  {
>  	bool rw = bio_data_dir(bio);
> +	long long carryover_bytes;
>  	long long bytes_allowed;
>  	u64 extra_bytes;
>  	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  
>  	/* Calc approx time to dispatch */
>  	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
> -	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
> -
> -	if (!jiffy_wait)
> -		jiffy_wait = 1;
> +	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
> +	if (carryover_bytes) {
> +		/*
> +		 * If extra_bytes is not divisible, the remainder is recorded as
> +		 * debt. Caller must ensure the current slice has at least 1
> +		 * more jiffies to repay the debt.
> +		 */
> +		*has_debt = true;
> +		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
> +	}

Thinking about it further, it may not be good to use ->carryover_bytes[]:

- if tg_within_bps_limit() returns 0 and the bio is dispatched
  immediately, throtl_trim_slice() is called and ->carryover_bytes[]
  is cleared.

- if tg_within_bps_limit() returns >0, this bio will be throttled, and
  tg_within_bps_limit() may be called more than once, so
  tg->carryover_bytes[] could be over-counted.

Actually this patch turns tg_within_bps_limit() into a stateful function...

>  
>  	/*
>  	 * This wait time is without taking into consideration the rounding
> @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>  	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
>  	u64 bps_limit = tg_bps_limit(tg, rw);
>  	u32 iops_limit = tg_iops_limit(tg, rw);
> +	bool has_debt = false;
>  
>  	/*
>   	 * Currently whole state machine of group depends on first bio
> @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>  	else
>  		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
>  
> -	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
> +	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
>  	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
>  	if (bps_wait + iops_wait == 0) {
>  		if (wait)
>  			*wait = 0;
> +		if (has_debt)
> +			throtl_extend_slice(tg, rw, jiffies + 1);
>  		return true;
>  	}
>  
>  	max_wait = max(bps_wait, iops_wait);
>  	if (wait)
>  		*wait = max_wait;
> -	throtl_extend_slice(tg, rw, jiffies + max_wait);
> +	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
>  
>  	return false;
>  }
> diff --git a/block/blk-throttle.h b/block/blk-throttle.h
> index 1a36d1278eea..56dcb5ce412f 100644
> --- a/block/blk-throttle.h
> +++ b/block/blk-throttle.h
> @@ -110,10 +110,14 @@ struct throtl_grp {
>  	unsigned int last_io_disp[2];
>  
>  	/*
> -	 * The following two fields are updated when new configuration is
> -	 * submitted while some bios are still throttled, they record how many
> -	 * bytes/ios are waited already in previous configuration, and they will
> -	 * be used to calculate wait time under new configuration.
> +	 * The following two fields are updated when:
> +	 * 1) new configuration is submitted while some bios are still
> +	 * throttled, they record how many bytes/ios are waited already in
> +	 * previous configuration;
> +	 * 2) IOs which may cause priority inversions are dispatched while tg is
> +	 * over limit, these IOs will be dispatched directly;
> +	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
> +	 * divisible by bps_limit, the remainder will be recorded;
>  	 */

blk-throttle uses a token bucket algorithm, and the implementation
shouldn't be sensitive to the above two factors, because the bps
rate is controlled over the whole bucket (slice). Meanwhile it is still
tricky to maintain ->carryover_bytes during the slice cycle, and that
can't cover the timer delay either.

Another way is to avoid trimming the slice too soon when carrying too
much debt; something like the following avoids this issue:


diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..a778ebbb6887 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	if (bytes_trim <= 0 && io_trim <= 0)
 		return;
 
+	if (tg_bps_limit(tg, rw) != U64_MAX) {
+		long long disp = tg->bytes_disp[rw];
+
+		if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed <
+				8 * tg->td->throtl_slice)
+			return;
+	}
+
 	tg->carryover_bytes[rw] = 0;
 	if ((long long)tg->bytes_disp[rw] >= bytes_trim)
 		tg->bytes_disp[rw] -= bytes_trim;


Thanks,
Ming

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0096e486b1e3..3828c6535605 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -703,9 +703,10 @@  static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
 }
 
 static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
-				u64 bps_limit)
+				u64 bps_limit, bool *has_debt)
 {
 	bool rw = bio_data_dir(bio);
+	long long carryover_bytes;
 	long long bytes_allowed;
 	u64 extra_bytes;
 	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -730,10 +731,16 @@  static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
-
-	if (!jiffy_wait)
-		jiffy_wait = 1;
+	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
+	if (carryover_bytes) {
+		/*
+		 * If extra_bytes is not divisible, the remainder is recorded as
+		 * debt. Caller must ensure the current slice has at least 1
+		 * more jiffies to repay the debt.
+		 */
+		*has_debt = true;
+		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
+	}
 
 	/*
 	 * This wait time is without taking into consideration the rounding
@@ -754,6 +761,7 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
 	u64 bps_limit = tg_bps_limit(tg, rw);
 	u32 iops_limit = tg_iops_limit(tg, rw);
+	bool has_debt = false;
 
 	/*
  	 * Currently whole state machine of group depends on first bio
@@ -784,18 +792,20 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	else
 		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
 
-	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
+	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
 	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
 	if (bps_wait + iops_wait == 0) {
 		if (wait)
 			*wait = 0;
+		if (has_debt)
+			throtl_extend_slice(tg, rw, jiffies + 1);
 		return true;
 	}
 
 	max_wait = max(bps_wait, iops_wait);
 	if (wait)
 		*wait = max_wait;
-	throtl_extend_slice(tg, rw, jiffies + max_wait);
+	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
 
 	return false;
 }
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index 1a36d1278eea..56dcb5ce412f 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -110,10 +110,14 @@  struct throtl_grp {
 	unsigned int last_io_disp[2];
 
 	/*
-	 * The following two fields are updated when new configuration is
-	 * submitted while some bios are still throttled, they record how many
-	 * bytes/ios are waited already in previous configuration, and they will
-	 * be used to calculate wait time under new configuration.
+	 * The following two fields are updated when:
+	 * 1) new configuration is submitted while some bios are still
+	 * throttled, they record how many bytes/ios are waited already in
+	 * previous configuration;
+	 * 2) IOs which may cause priority inversions are dispatched while tg is
+	 * over limit, these IOs will be dispatched directly;
+	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
+	 * divisible by bps_limit, the remainder will be recorded;
 	 */
 	long long carryover_bytes[2];
 	int carryover_ios[2];