diff mbox series

[-next,v5,4/8] blk-throttle: fix io hung due to config updates

Message ID 20220528064330.3471000-5-yukuai3@huawei.com (mailing list archive)
State New, archived
Series bugfix and cleanup for blk-throttle

Commit Message

Yu Kuai May 28, 2022, 6:43 a.m. UTC
If a new configuration is submitted while a bio is throttled, the
waiting time is recalculated from scratch, regardless of how long the bio
has already waited:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

An I/O hang can then be triggered by repeatedly submitting a new
configuration before the throttled bio is dispatched.

Fix the problem by respecting the time that the throttled bio has already
waited. To do that, add new fields that record how many bytes/ios have
already been waited for, and use them to calculate the wait time for the
throttled bio under the new configuration.

Some simple test:
1)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 2048" > blkio.throttle.write_bps_device
{
        sleep 2
        echo "8:0 1024" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

2)
cd /sys/fs/cgroup/blkio/
echo $$ > cgroup.procs
echo "8:0 1024" > blkio.throttle.write_bps_device
{
        sleep 4
        echo "8:0 2048" > blkio.throttle.write_bps_device
} &
dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct

test results: io finish time
	before this patch	with this patch
1)	10s			6s
2)	8s			6s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 51 ++++++++++++++++++++++++++++++++++++++------
 block/blk-throttle.h |  9 ++++++++
 2 files changed, 54 insertions(+), 6 deletions(-)

Comments

Michal Koutný June 22, 2022, 5:26 p.m. UTC | #1
(Apologies for taking so long before answering.)

On Sat, May 28, 2022 at 02:43:26PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
> Some simple test:
> 1)
> cd /sys/fs/cgroup/blkio/
> echo $$ > cgroup.procs
> echo "8:0 2048" > blkio.throttle.write_bps_device
> {
>         sleep 2
>         echo "8:0 1024" > blkio.throttle.write_bps_device
> } &
> dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
> 
> 2)
> cd /sys/fs/cgroup/blkio/
> echo $$ > cgroup.procs
> echo "8:0 1024" > blkio.throttle.write_bps_device
> {
>         sleep 4
>         echo "8:0 2048" > blkio.throttle.write_bps_device
> } &
> dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
> 
> test results: io finish time
> 	before this patch	with this patch
> 1)	10s			6s
> 2)	8s			6s

I agree these are consistent and correct times.
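
The times can be checked with simple arithmetic. The sketch below is a
simplified model, not kernel code: it ignores throttle-slice rounding and
assumes a single bio queued at t = 0. The function names
finish_before_patch() and finish_with_patch() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (not kernel code): one bio of bio_size bytes is queued at
 * t = 0 under old_bps, and the limit changes to new_bps at t = change_s.
 * Slice rounding is ignored. All function names here are hypothetical.
 */

/* Before the patch: the update starts a fresh slice, so the wait restarts. */
static uint64_t finish_before_patch(uint64_t new_bps, uint64_t change_s,
				    uint64_t bio_size)
{
	assert(new_bps > 0);
	return change_s + bio_size / new_bps;
}

/* With the patch: bytes "earned" under the old limit are kept as credit. */
static uint64_t finish_with_patch(uint64_t old_bps, uint64_t new_bps,
				  uint64_t change_s, uint64_t bio_size)
{
	uint64_t skipped = old_bps * change_s;	/* models bytes_skipped */

	assert(new_bps > 0);
	if (skipped >= bio_size)
		return change_s;	/* credit covers the bio, no further wait */
	return change_s + (bio_size - skipped) / new_bps;
}
```

With bio_size = 8192, test 1 (2048 -> 1024 at 2s) gives 10s before and 6s
with the patch, and test 2 (1024 -> 2048 at 4s) gives 8s before and 6s
with the patch, matching the table.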

And the new implementation won't make it worse (in terms of delaying a
bio) than configuring minimal limits from the beginning, AFAICT.

> @@ -801,7 +836,8 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
>  
>  	/* Round up to the next throttle slice, wait time must be nonzero */
>  	jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice);
> -	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd);
> +	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) +
> +		     tg->io_skipped[rw];
>  	if (tg->io_disp[rw] + 1 <= io_allowed) {
>  		if (wait)
>  			*wait = 0;
> @@ -838,7 +874,8 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  		jiffy_elapsed_rnd = tg->td->throtl_slice;
>  
>  	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
> -	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd);
> +	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) +
> +			tg->bytes_skipped[rw];
>  	if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) {
>  		if (wait)
>  			*wait = 0;
>

Here we may allow to dispatch a bio above current slice's
calculate_bytes_allowed() if bytes_skipped is already >0.

bytes_disp + bio_size <= calculate_bytes_allowed() + bytes_skipped

Then on the next update

> [shuffle]
> +static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
> +{
> +	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
> +	u64 bps_limit = tg_bps_limit(tg, rw);
> +	u32 iops_limit = tg_iops_limit(tg, rw);
> +
> +	if (bps_limit != U64_MAX)
> +		tg->bytes_skipped[rw] +=
> +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
> +			tg->bytes_disp[rw];
> +	if (iops_limit != UINT_MAX)
> +		tg->io_skipped[rw] +=
> +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
> +			tg->io_disp[rw];
> +}

the difference(s) here could be negative. bytes_skipped should be
reduced to account for the additionally dispatched bio.
This is all unsigned, so negative numbers wrap around; however, we add them
back to an unsigned value, so thanks to modular arithmetic the result is
a correctly updated bytes_skipped.

Maybe add a comment about this (unsigned) intention?
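
A minimal sketch of that wrap-around behaviour follows, assuming 64-bit
unsigned counters as in the patch. The helper name update_skipped() is
hypothetical, and the assert states this model's assumption that the
existing credit covers the overdraw.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the bytes_skipped update in __tg_update_skipped():
 * skipped += allowed - dispatched, all in unsigned 64-bit arithmetic.
 */
static uint64_t update_skipped(uint64_t skipped, uint64_t allowed,
			       uint64_t dispatched)
{
	/* Model assumption: the credit is large enough to absorb the overdraw. */
	assert(skipped + allowed >= dispatched);

	/*
	 * If dispatched > allowed, (allowed - dispatched) wraps around to
	 * 2^64 - (dispatched - allowed). Adding that to skipped wraps again,
	 * so the net effect is skipped - (dispatched - allowed).
	 */
	return skipped + (allowed - dispatched);
}
```

For example, update_skipped(100, 50, 80) yields 70: the intermediate
difference wraps, but the sum is the intended 100 - 30.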

(But can this happen? The discussed bio would have to outrun another bio
(the one which defined the current slice_end) but since blk-throttle
uses queues (FIFO) everywhere this shouldn't really happen. But it's
good to know this works as intended.)

This patch can have
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Yu Kuai June 23, 2022, 12:27 p.m. UTC | #2
Hi,

On 2022/06/23 1:26, Michal Koutný wrote:
> (Apologies for taking so long before answering.)
> 
> On Sat, May 28, 2022 at 02:43:26PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>> Some simple test:
>> 1)
>> cd /sys/fs/cgroup/blkio/
>> echo $$ > cgroup.procs
>> echo "8:0 2048" > blkio.throttle.write_bps_device
>> {
>>          sleep 2
>>          echo "8:0 1024" > blkio.throttle.write_bps_device
>> } &
>> dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
>>
>> 2)
>> cd /sys/fs/cgroup/blkio/
>> echo $$ > cgroup.procs
>> echo "8:0 1024" > blkio.throttle.write_bps_device
>> {
>>          sleep 4
>>          echo "8:0 2048" > blkio.throttle.write_bps_device
>> } &
>> dd if=/dev/zero of=/dev/sda bs=8k count=1 oflag=direct
>>
>> test results: io finish time
>> 	before this patch	with this patch
>> 1)	10s			6s
>> 2)	8s			6s
> 
> I agree these are consistent and correct times.
> 
> And the new implementation won't make it worse (in terms of delaying a
> bio) than configuring minimal limits from the beginning, AFAICT.
> 
>> @@ -801,7 +836,8 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
>>   
>>   	/* Round up to the next throttle slice, wait time must be nonzero */
>>   	jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice);
>> -	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd);
>> +	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) +
>> +		     tg->io_skipped[rw];
>>   	if (tg->io_disp[rw] + 1 <= io_allowed) {
>>   		if (wait)
>>   			*wait = 0;
>> @@ -838,7 +874,8 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>>   		jiffy_elapsed_rnd = tg->td->throtl_slice;
>>   
>>   	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
>> -	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd);
>> +	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) +
>> +			tg->bytes_skipped[rw];
>>   	if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) {
>>   		if (wait)
>>   			*wait = 0;
>>
> 
> Here we may allow to dispatch a bio above current slice's
> calculate_bytes_allowed() if bytes_skipped is already >0.

Hi, I don't expect that to happen. For example, if a bio is still
throttled, then the old slice is kept with the proper 'bytes_skipped',
and the new wait time is calculated based on (bio_size - bytes_skipped).

After the bio is dispatched (I assume that other bios can't preempt),
if a new slice is started, then 'bytes_skipped' is cleared and there
should be no problem. If the old slice is extended, note that we only wait
for 'bio_size - bytes_skipped' bytes, while 'bio_size' bytes are added
to 'tg->bytes_disp'. I think this will make sure a new bio won't be
dispatched above the slice.

What do you think?
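
To illustrate the extended-slice case, here is a toy model of the patched
check (bytes_disp + bio_size <= allowed + bytes_skipped). It is only a
sketch: struct toy_tg and the toy_* helpers are hypothetical, time is in
whole seconds since the configuration update, and slice rounding is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy state for one direction; field names mirror the patch, the struct is hypothetical. */
struct toy_tg {
	uint64_t bps_limit;	/* bytes per second */
	uint64_t bytes_disp;	/* bytes dispatched in the current slice */
	uint64_t bytes_skipped;	/* credit carried over a config update */
};

/* Mirrors the patched check: bytes_disp + bio_size <= allowed + bytes_skipped. */
static bool toy_may_dispatch(const struct toy_tg *tg, uint64_t elapsed_s,
			     uint64_t bio_size)
{
	uint64_t allowed = tg->bps_limit * elapsed_s + tg->bytes_skipped;

	return tg->bytes_disp + bio_size <= allowed;
}

/* Charge the full bio size, even though part of it was covered by credit. */
static void toy_charge(struct toy_tg *tg, uint64_t bio_size)
{
	tg->bytes_disp += bio_size;
}

/* Numbers from test 1: 4096 bytes of credit, new limit 1024 B/s, 8k bio. */
static void toy_demo(void)
{
	struct toy_tg tg = { .bps_limit = 1024, .bytes_skipped = 4096 };

	assert(!toy_may_dispatch(&tg, 3, 8192));	/* 3072 + 4096 < 8192 */
	assert(toy_may_dispatch(&tg, 4, 8192));		/* 4096 + 4096 == 8192 */
	toy_charge(&tg, 8192);
	assert(!toy_may_dispatch(&tg, 4, 1024));	/* credit is used up */
	assert(toy_may_dispatch(&tg, 5, 1024));		/* 5120 + 4096 == 9216 */
}
```

Because the whole bio_size is charged to bytes_disp, a following 1k bio in
this model still has to wait one more second at 1024 B/s.
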
> 
> bytes_disp + bio_size <= calculate_bytes_allowed() + bytes_skipped
> 
> Then on the next update
> 
>> [shuffle]
>> +static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
>> +{
>> +	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
>> +	u64 bps_limit = tg_bps_limit(tg, rw);
>> +	u32 iops_limit = tg_iops_limit(tg, rw);
>> +
>> +	if (bps_limit != U64_MAX)
>> +		tg->bytes_skipped[rw] +=
>> +			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
>> +			tg->bytes_disp[rw];
>> +	if (iops_limit != UINT_MAX)
>> +		tg->io_skipped[rw] +=
>> +			calculate_io_allowed(iops_limit, jiffy_elapsed) -
>> +			tg->io_disp[rw];
>> +}
> 
> the difference(s) here could be negative. bytes_skipped should be
> reduced to account for the additionally dispatched bio.
> This is all unsigned, so negative numbers wrap around; however, we add them
> back to an unsigned value, so thanks to modular arithmetic the result is
> a correctly updated bytes_skipped.
> 
> Maybe add a comment about this (unsigned) intention?

Of course I can do that.
> 
> (But can this happen? The discussed bio would have to outrun another bio
> (the one which defined the current slice_end) but since blk-throttle
> uses queues (FIFO) everywhere this shouldn't really happen. But it's
> good to know this works as intended.)
I can also mention that in comment.
> 
> This patch can have
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
> 

Thanks for the review!
Kuai
Michal Koutný June 23, 2022, 4:26 p.m. UTC | #3
On Thu, Jun 23, 2022 at 08:27:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
> > Here we may allow to dispatch a bio above current slice's
> > calculate_bytes_allowed() if bytes_skipped is already >0.
> 
> Hi, I don't expect that to happen. For example, if a bio is still
> throttled, then the old slice is kept with the proper 'bytes_skipped',
> and the new wait time is calculated based on (bio_size - bytes_skipped).
>
> After the bio is dispatched (I assume that other bios can't preempt),

With this assumption it adds up as you write. I believe we're in
agreement.

It's the same assumption I made below (FIFO everywhere, i.e. no
reordering). So the discussed difference shouldn't really be negative
(and if the assumption didn't hold, the modular arithmetic would still
yield the correct bytes_skipped value).

Michal
Yu Kuai June 25, 2022, 8:36 a.m. UTC | #4
On 2022/06/24 0:26, Michal Koutný wrote:
> On Thu, Jun 23, 2022 at 08:27:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>>> Here we may allow to dispatch a bio above current slice's
>>> calculate_bytes_allowed() if bytes_skipped is already >0.
>>
>> Hi, I don't expect that to happen. For example, if a bio is still
>> throttled, then the old slice is kept with the proper 'bytes_skipped',
>> and the new wait time is calculated based on (bio_size - bytes_skipped).
>>
>> After the bio is dispatched (I assume that other bios can't preempt),
> 
> With this assumption it adds up as you write. I believe we're in
> agreement.
>
> It's the same assumption I made below (FIFO everywhere, i.e. no
> reordering). So the discussed difference shouldn't really be negative
> (and if the assumption didn't hold, the modular arithmetic would still
> yield the correct bytes_skipped value).
Yes, nice that we're in agreement.

I'll wait to see if Tejun has any suggestions.

Thanks,
Kuai
> 
> Michal
> .
>
Jens Axboe June 25, 2022, 4:41 p.m. UTC | #5
On 6/25/22 2:36 AM, Yu Kuai wrote:
> On 2022/06/24 0:26, Michal Koutný wrote:
>> On Thu, Jun 23, 2022 at 08:27:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>>>> Here we may allow to dispatch a bio above current slice's
>>>> calculate_bytes_allowed() if bytes_skipped is already >0.
>>>
>>> Hi, I don't expect that to happen. For example, if a bio is still
>>> throttled, then the old slice is kept with the proper 'bytes_skipped',
>>> and the new wait time is calculated based on (bio_size - bytes_skipped).
>>>
>>> After the bio is dispatched (I assume that other bios can't preempt),
>>
>> With this assumption it adds up as you write. I believe we're in
>> agreement.
>>
>> It's the same assumption I made below (FIFO everywhere, i.e. no
>> reordering). So the discussed difference shouldn't really be negative
>> (and if the assumption didn't hold, the modular arithmetic would still
>> yield the correct bytes_skipped value).
> Yes, nice that we're in agreement.
> 
> I'll wait to see if Tejun has any suggestions.

I flushed more emails from spam again. Please stop using the buggy
huawei address until this gets resolved, your patches are getting lost
left and right and I don't have time to go hunting for emails.
Yu Kuai June 26, 2022, 2:39 a.m. UTC | #6
On 2022/06/26 0:41, Jens Axboe wrote:
> On 6/25/22 2:36 AM, Yu Kuai wrote:
>> On 2022/06/24 0:26, Michal Koutný wrote:
>>> On Thu, Jun 23, 2022 at 08:27:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>>>>> Here we may allow to dispatch a bio above current slice's
>>>>> calculate_bytes_allowed() if bytes_skipped is already >0.
>>>>
>>>> Hi, I don't expect that to happen. For example, if a bio is still
>>>> throttled, then the old slice is kept with the proper 'bytes_skipped',
>>>> and the new wait time is calculated based on (bio_size - bytes_skipped).
>>>>
>>>> After the bio is dispatched (I assume that other bios can't preempt),
>>>
>>> With this assumption it adds up as you write. I believe we're in
>>> agreement.
>>>
>>> It's the same assumption I made below (FIFO everywhere, i.e. no
>>> reordering). So the discussed difference shouldn't really be negative
>>> (and if the assumption didn't hold, the modular arithmetic would still
>>> yield the correct bytes_skipped value).
>> Yes, nice that we're in agreement.
>>
>> I'll wait to see if Tejun has any suggestions.
> 
> I flushed more emails from spam again. Please stop using the buggy
> huawei address until this gets resolved, your patches are getting lost
> left and right and I don't have time to go hunting for emails.
> 

My apologies for that. I'm quite annoyed that our IT still can't solve
this. I'll stop sending new emails from this address for now.

Thanks,
Kuai
Yu Kuai July 5, 2022, 11:42 a.m. UTC | #7
On 2022/06/26 0:41, Jens Axboe wrote:
> On 6/25/22 2:36 AM, Yu Kuai wrote:
>> On 2022/06/24 0:26, Michal Koutný wrote:
>>> On Thu, Jun 23, 2022 at 08:27:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>>>>> Here we may allow to dispatch a bio above current slice's
>>>>> calculate_bytes_allowed() if bytes_skipped is already >0.
>>>>
>>>> Hi, I don't expect that to happen. For example, if a bio is still
>>>> throttled, then the old slice is kept with the proper 'bytes_skipped',
>>>> and the new wait time is calculated based on (bio_size - bytes_skipped).
>>>>
>>>> After the bio is dispatched (I assume that other bios can't preempt),
>>>
>>> With this assumption it adds up as you write. I believe we're in
>>> agreement.
>>>
>>> It's the same assumption I made below (FIFO everywhere, i.e. no
>>> reordering). So the discussed difference shouldn't really be negative
>>> (and if the assumption didn't hold, the modular arithmetic would still
>>> yield the correct bytes_skipped value).
>> Yes, nice that we're in agreement.
>>
>> I'll wait to see if Tejun has any suggestions.
> 
> I flushed more emails from spam again. Please stop using the buggy
> huawei address until this gets resolved, your patches are getting lost
> left and right and I don't have time to go hunting for emails.
> 

Hi, Jens

Can you please take a look if this patchset is ok?

https://lore.kernel.org/all/20220701093441.885741-1-yukuai1@huaweicloud.com/

This was sent from huaweicloud.com (its DMARC record is empty).

Thanks,
Kuai
Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d67b20ce4d63..94fd73e8b2d9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -639,6 +639,8 @@  static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
+	tg->bytes_skipped[rw] = 0;
+	tg->io_skipped[rw] = 0;
 
 	/*
 	 * Previous slice has expired. We must have trimmed it after last
@@ -656,12 +658,17 @@  static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 		   tg->slice_end[rw], jiffies);
 }
 
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
+static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw,
+					  bool clear_skipped)
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
+	if (clear_skipped) {
+		tg->bytes_skipped[rw] = 0;
+		tg->io_skipped[rw] = 0;
+	}
 
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
@@ -784,6 +791,34 @@  static u64 calculate_bytes_allowed(u64 bps_limit,
 	return mul_u64_u64_div_u64(bps_limit, (u64)jiffy_elapsed_rnd, (u64)HZ);
 }
 
+static void __tg_update_skipped(struct throtl_grp *tg, bool rw)
+{
+	unsigned long jiffy_elapsed = jiffies - tg->slice_start[rw];
+	u64 bps_limit = tg_bps_limit(tg, rw);
+	u32 iops_limit = tg_iops_limit(tg, rw);
+
+	if (bps_limit != U64_MAX)
+		tg->bytes_skipped[rw] +=
+			calculate_bytes_allowed(bps_limit, jiffy_elapsed) -
+			tg->bytes_disp[rw];
+	if (iops_limit != UINT_MAX)
+		tg->io_skipped[rw] +=
+			calculate_io_allowed(iops_limit, jiffy_elapsed) -
+			tg->io_disp[rw];
+}
+
+static void tg_update_skipped(struct throtl_grp *tg)
+{
+	if (tg->service_queue.nr_queued[READ])
+		__tg_update_skipped(tg, READ);
+	if (tg->service_queue.nr_queued[WRITE])
+		__tg_update_skipped(tg, WRITE);
+
+	throtl_log(&tg->service_queue, "%s: %llu %llu %u %u\n", __func__,
+		   tg->bytes_skipped[READ], tg->bytes_skipped[WRITE],
+		   tg->io_skipped[READ], tg->io_skipped[WRITE]);
+}
+
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  u32 iops_limit, unsigned long *wait)
 {
@@ -801,7 +836,8 @@  static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Round up to the next throttle slice, wait time must be nonzero */
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed + 1, tg->td->throtl_slice);
-	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd);
+	io_allowed = calculate_io_allowed(iops_limit, jiffy_elapsed_rnd) +
+		     tg->io_skipped[rw];
 	if (tg->io_disp[rw] + 1 <= io_allowed) {
 		if (wait)
 			*wait = 0;
@@ -838,7 +874,8 @@  static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
-	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd);
+	bytes_allowed = calculate_bytes_allowed(bps_limit, jiffy_elapsed_rnd) +
+			tg->bytes_skipped[rw];
 	if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) {
 		if (wait)
 			*wait = 0;
@@ -899,7 +936,7 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * slice and it should be extended instead.
 	 */
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
-		throtl_start_new_slice(tg, rw);
+		throtl_start_new_slice(tg, rw, true);
 	else {
 		if (time_before(tg->slice_end[rw],
 		    jiffies + tg->td->throtl_slice))
@@ -1328,8 +1365,8 @@  static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	 * that a group's limit are dropped suddenly and we don't want to
 	 * account recently dispatched IO with new low rate.
 	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_start_new_slice(tg, READ, false);
+	throtl_start_new_slice(tg, WRITE, false);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1357,6 +1394,7 @@  static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_update_skipped(tg);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
@@ -1543,6 +1581,7 @@  static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_update_skipped(tg);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index c1b602996127..b8178e6b4d30 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -115,6 +115,15 @@  struct throtl_grp {
 	uint64_t bytes_disp[2];
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
+	/*
+	 * The following two fields are used to calculate the new wait time
+	 * for a throttled bio when a new configuration is submitted.
+	 *
+	 * Number of bytes to be skipped in the current slice
+	 */
+	uint64_t bytes_skipped[2];
+	/* Number of bios to be skipped in the current slice */
+	unsigned int io_skipped[2];
 
 	unsigned long last_low_overflow_time[2];