[v7,4/9] blk-throttle: fix io hung due to configuration updates

Message ID: 20220802140415.2960284-5-yukuai1@huaweicloud.com
State: New, archived
Series: bugfix and cleanup for blk-throttle

Commit Message

Yu Kuai Aug. 2, 2022, 2:04 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

If a new configuration is submitted while a bio is throttled, the
waiting time is recalculated from scratch, regardless of how long the
bio has already waited:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

An io hang can then be triggered by always submitting a new configuration
before the throttled bio is dispatched.

Fix the problem by respecting the time that the throttled bio has already
waited. To do that, add new fields that record how many bytes/ios have
already been waited for, and use them to calculate the wait time for the
throttled bio under the new configuration.

Some simple tests:
1)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 2048" > blkio.throttle.write_bps_device
{
        sleep 2
        echo "8:0 1024" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

2)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 1024" > blkio.throttle.write_bps_device
{
        sleep 4
        echo "8:0 2048" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

test results: io finish time
	before this patch	with this patch
1)	10s			6s
2)	8s			6s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
---
 block/blk-throttle.c | 63 +++++++++++++++++++++++++++++++++++++++-----
 block/blk-throttle.h | 11 ++++++++
 2 files changed, 68 insertions(+), 6 deletions(-)

Comments

Tejun Heo Aug. 16, 2022, 8:01 p.m. UTC | #1
On Tue, Aug 02, 2022 at 10:04:10PM +0800, Yu Kuai wrote:
...
> +static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
> +{
> +	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
> +	u64 bps_limit = tg_bps_limit(tg, rw);
> +	u32 iops_limit = tg_iops_limit(tg, rw);
> +
> +	/*
> +	 * If config is updated while bios are still throttled, calculate and
> +	 * accumulate how many bytes/io are waited across changes. And
> +	 * bytes/io_skipped will be used to calculate new wait time under new
> +	 * configuration.
> +	 *
> +	 * Following calculation won't overflow as long as bios that are
> +	 * dispatched later won't preempt already throttled bios. Even if such
> +	 * overflow do happen, there should be no problem because unsigned is
> +	 * used here, and bytes_skipped/io_skipped will be updated correctly.
> +	 */

Would it be easier if the fields were signed? It's fragile and odd to
explain "these are unsigned but if they underflow they behave just like
signed when added" when they can just be signed. Also, I have a hard time
understanding what "preempt" means above.

> +	if (bps_limit != U64_MAX)
> +		tg->bytes_skipped[rw] +=
> +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
> +			tg->bytes_disp[rw];
> +	if (iops_limit != UINT_MAX)
> +		tg->io_skipped[rw] +=
> +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
> +			tg->io_disp[rw];

So, this is calculating the budgets to carry over. Can we name them
accordingly? I don't know what "skipped" means.

> @@ -115,6 +115,17 @@ struct throtl_grp {
>  	uint64_t bytes_disp[2];
>  	/* Number of bio's dispatched in current slice */
>  	unsigned int io_disp[2];
> +	/*
> +	 * The following two fields are updated when new configuration is
> +	 * submitted while some bios are still throttled, they record how many
> +	 * bytes/io are waited already in previous configuration, and they will
> +	 * be used to calculate wait time under new configuration.
> +	 *
> +	 * Number of bytes will be skipped in current slice
> +	 */
> +	uint64_t bytes_skipped[2];
> +	/* Number of bio will be skipped in current slice */
> +	unsigned int io_skipped[2];

So, the code seems to make sense but the field names and comments don't
really, at least to me. I can't find an intuitive understanding of what's
being skipped. Can you please take another stab at making this more
understandable?

Thanks.
Yu Kuai Aug. 17, 2022, 1:30 a.m. UTC | #2
Hi, Tejun!

On 2022/08/17 4:01, Tejun Heo wrote:
> On Tue, Aug 02, 2022 at 10:04:10PM +0800, Yu Kuai wrote:
> ...
>> +static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
>> +{
>> +	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
>> +	u64 bps_limit = tg_bps_limit(tg, rw);
>> +	u32 iops_limit = tg_iops_limit(tg, rw);
>> +
>> +	/*
>> +	 * If config is updated while bios are still throttled, calculate and
>> +	 * accumulate how many bytes/io are waited across changes. And
>> +	 * bytes/io_skipped will be used to calculate new wait time under new
>> +	 * configuration.
>> +	 *
>> +	 * Following calculation won't overflow as long as bios that are
>> +	 * dispatched later won't preempt already throttled bios. Even if such
>> +	 * overflow do happen, there should be no problem because unsigned is
>> +	 * used here, and bytes_skipped/io_skipped will be updated correctly.
>> +	 */
> 
> Would it be easier if the fields were signed? It's fragile and odd to
> explain "these are unsigned but if they underflow they behave just like
> signed when added" when they can just be signed. Also, I have a hard time
> understanding what "preempt" means above.

I think preempt should never happen based on current FIFO
implementation, perhaps
> 
>> +	if (bps_limit != U64_MAX)
>> +		tg->bytes_skipped[rw] +=
>> +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
>> +			tg->bytes_disp[rw];
>> +	if (iops_limit != UINT_MAX)
>> +		tg->io_skipped[rw] +=
>> +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
>> +			tg->io_disp[rw];
> 
> So, this is calculating the budgets to carry over. Can we name them
> accordingly? I don't know what "skipped" means.

Yeah, thanks for your advice, the art of naming is a little hard for me...
What do you think about these names: extended_bytes/io_budget?
> 
>> @@ -115,6 +115,17 @@ struct throtl_grp {
>>   	uint64_t bytes_disp[2];
>>   	/* Number of bio's dispatched in current slice */
>>   	unsigned int io_disp[2];
>> +	/*
>> +	 * The following two fields are updated when new configuration is
>> +	 * submitted while some bios are still throttled, they record how many
>> +	 * bytes/io are waited already in previous configuration, and they will
>> +	 * be used to calculate wait time under new configuration.
>> +	 *
>> +	 * Number of bytes will be skipped in current slice
>> +	 */
>> +	uint64_t bytes_skipped[2];
>> +	/* Number of bio will be skipped in current slice */
>> +	unsigned int io_skipped[2];
> 
> So, the code seems to make sense but the field names and comments don't
> really, at least to me. I can't find an intuitive understanding of what's
> being skipped. Can you please take another stab at making this more
> understandable?
> 
> Thanks.
>
Tejun Heo Aug. 17, 2022, 5:52 p.m. UTC | #3
On Wed, Aug 17, 2022 at 09:30:30AM +0800, Yu Kuai wrote:
> > Would it be easier if the fields were signed? It's fragile and odd to
> > explain "these are unsigned but if they underflow they behave just like
> > signed when added" when they can just be signed. Also, I have a hard time
> > understanding what "preempt" means above.
> 
> I think preempt should never happen based on current FIFO
> implementation, perhaps

Can you elaborate what "preempt" is?

> > > +	if (bps_limit != U64_MAX)
> > > +		tg->bytes_skipped[rw] +=
> > > +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
> > > +			tg->bytes_disp[rw];
> > > +	if (iops_limit != UINT_MAX)
> > > +		tg->io_skipped[rw] +=
> > > +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
> > > +			tg->io_disp[rw];
> > 
> > So, this is calculating the budgets to carry over. Can we name them
> > accordingly? I don't know what "skipped" means.
> 
> > Yeah, thanks for your advice, the art of naming is a little hard for me...
> > What do you think about these names: extended_bytes/io_budget?

How about carryover_{ios|bytes}?

Thanks.
Yu Kuai Aug. 18, 2022, 1:16 a.m. UTC | #4
Hi, Tejun!

On 2022/08/18 1:52, Tejun Heo wrote:
> On Wed, Aug 17, 2022 at 09:30:30AM +0800, Yu Kuai wrote:
>>> Would it be easier if the fields were signed? It's fragile and odd to
>>> explain "these are unsigned but if they underflow they behave just like
>>> signed when added" when they can just be signed. Also, I have a hard time
>>> understanding what "preempt" means above.
>>
>> I think preempt should never happen based on current FIFO
>> implementation, perhaps
> 
> Can you elaborate what "preempt" is?

Here preempt means that a bio that is throttled later somehow gets
dispatched earlier. Michal thinks it's better to comment that the code
still works fine in this particular scenario.

> 
>>>> +	if (bps_limit != U64_MAX)
>>>> +		tg->bytes_skipped[rw] +=
>>>> +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
>>>> +			tg->bytes_disp[rw];
>>>> +	if (iops_limit != UINT_MAX)
>>>> +		tg->io_skipped[rw] +=
>>>> +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
>>>> +			tg->io_disp[rw];
>>>
>>> So, this is calculating the budgets to carry over. Can we name them
>>> accordingly? I don't know what "skipped" means.
>>
>> Yeah, thanks for your advice, the art of naming is a little hard for me...
>> What do you think about these names: extended_bytes/io_budget?
> 
> How about carryover_{ios|bytes}?

Yes, that sounds good.

By the way, should I use 'ios' here instead of 'io'? I was confused
because there are many places that are using 'io' currently.

Thanks,
Kuai
> 
> Thanks.
>
Tejun Heo Aug. 19, 2022, 5:33 p.m. UTC | #5
Hello,

On Thu, Aug 18, 2022 at 09:16:28AM +0800, Yu Kuai wrote:
> On 2022/08/18 1:52, Tejun Heo wrote:
> > On Wed, Aug 17, 2022 at 09:30:30AM +0800, Yu Kuai wrote:
> > > > Would it be easier if the fields were signed? It's fragile and odd to
> > > > explain "these are unsigned but if they underflow they behave just like
> > > > signed when added" when they can just be signed. Also, I have a hard time
> > > > understanding what "preempt" means above.
> > > 
> > > I think preempt should never happen based on current FIFO
> > > implementation, perhaps
> > 
> > Can you elaborate what "preempt" is?
> 
> Here preempt means that a bio that is throttled later somehow gets
> dispatched earlier. Michal thinks it's better to comment that the code
> still works fine in this particular scenario.

You'd have to spell it out. It's not clear "preempt" means the above.

> > How about carryover_{ios|bytes}?
> 
> Yes, that sounds good.
> 
> By the way, should I use 'ios' here instead of 'io'? I was confused
> because there are many places that are using 'io' currently.

Yeah, blk-throttle.c is kinda inconsistent about that. It uses bytes/ios in
some places and bytes/io in others. I'd prefer ios here.

Thanks.
Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0d9719c41fe2..621402cf2576 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -639,6 +639,8 @@  static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
+	tg->bytes_skipped[rw] = 0;
+	tg->io_skipped[rw] = 0;
 
 	/*
 	 * Previous slice has expired. We must have trimmed it after last
@@ -656,12 +658,17 @@  static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 		   tg->slice_end[rw], jiffies);
 }
 
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
+static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw,
+					  bool clear_skipped)
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
+	if (clear_skipped) {
+		tg->bytes_skipped[rw] = 0;
+		tg->io_skipped[rw] = 0;
+	}
 
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
@@ -783,6 +790,46 @@  static u64 calculate_bytes_allowed(u64 bps_limit, unsigned long jiffy_elapsed)
 	return mul_u64_u64_div_u64(bps_limit, (u64)jiffy_elapsed, (u64)HZ);
 }
 
+static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
+{
+	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
+	u64 bps_limit = tg_bps_limit(tg, rw);
+	u32 iops_limit = tg_iops_limit(tg, rw);
+
+	/*
+	 * If config is updated while bios are still throttled, calculate and
+	 * accumulate how many bytes/io are waited across changes. And
+	 * bytes/io_skipped will be used to calculate new wait time under new
+	 * configuration.
+	 *
+	 * Following calculation won't overflow as long as bios that are
+	 * dispatched later won't preempt already throttled bios. Even if such
+	 * overflow do happen, there should be no problem because unsigned is
+	 * used here, and bytes_skipped/io_skipped will be updated correctly.
+	 */
+	if (bps_limit != U64_MAX)
+		tg->bytes_skipped[rw] +=
+			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
+			tg->bytes_disp[rw];
+	if (iops_limit != UINT_MAX)
+		tg->io_skipped[rw] +=
+			calculate_io_allowed(iops_limit, jiffy_elapsed) -
+			tg->io_disp[rw];
+}
+
+static void tg_update_skipped(struct throtl_grp *tg)
+{
+	if (tg->service_queue.nr_queued[READ])
+		__tg_update_skipped(tg, READ);
+	if (tg->service_queue.nr_queued[WRITE])
+		__tg_update_skipped(tg, WRITE);
+
+	/* see comments in struct throtl_grp for meaning of these fields. */
+	throtl_log(&tg->service_queue, "%s: %llu %llu %u %u\n", __func__,
+		   tg->bytes_skipped[READ], tg->bytes_skipped[WRITE],
+		   tg->io_skipped[READ], tg->io_skipped[WRITE]);
+}
+
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  u32 iops_limit, unsigned long *wait)
 {
@@ -800,7 +847,8 @@  static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Round up to the next throttle slice, wait time must be nonzero */
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice);
-	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd);
+	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) +
+		     tg->io_skipped[rw];
 	if (tg->io_disp[rw] + 1 <= io_allowed) {
 		if (wait)
 			*wait = 0;
@@ -837,7 +885,8 @@  static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
-	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd);
+	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) +
+			tg->bytes_skipped[rw];
 	if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) {
 		if (wait)
 			*wait = 0;
@@ -898,7 +947,7 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * slice and it should be extended instead.
 	 */
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
-		throtl_start_new_slice(tg, rw);
+		throtl_start_new_slice(tg, rw, true);
 	else {
 		if (time_before(tg->slice_end[rw],
 		    jiffies + tg->td->throtl_slice))
@@ -1327,8 +1376,8 @@  static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	 * that a group's limit are dropped suddenly and we don't want to
 	 * account recently dispatched IO with new low rate.
 	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_start_new_slice(tg, READ, false);
+	throtl_start_new_slice(tg, WRITE, false);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1356,6 +1405,7 @@  static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_update_skipped(tg);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
@@ -1542,6 +1592,7 @@  static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_update_skipped(tg);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index c1b602996127..0163aa9104c3 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -115,6 +115,17 @@  struct throtl_grp {
 	uint64_t bytes_disp[2];
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
+	/*
+	 * The following two fields are updated when new configuration is
+	 * submitted while some bios are still throttled, they record how many
+	 * bytes/io are waited already in previous configuration, and they will
+	 * be used to calculate wait time under new configuration.
+	 *
+	 * Number of bytes will be skipped in current slice
+	 */
+	uint64_t bytes_skipped[2];
+	/* Number of bio will be skipped in current slice */
+	unsigned int io_skipped[2];
 
 	unsigned long last_low_overflow_time[2];