
[-next,v3,2/2] blk-throttle: fix io hang due to configuration updates

Message ID 20220519085811.879097-3-yukuai3@huawei.com (mailing list archive)
State New, archived
Series bugfix for blk-throttle

Commit Message

Yu Kuai May 19, 2022, 8:58 a.m. UTC
If a new configuration is submitted while a bio is throttled, then the
new waiting time is recalculated regardless of how long the bio has
already waited:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

Then an io hang can be triggered by repeatedly submitting a new
configuration before the throttled bio is dispatched.

Fix the problem by respecting the time that the throttled bio has
already waited. In order to do that, instead of starting a new slice in
tg_conf_updated(), just update 'bytes_disp' and 'io_disp' based on the
new configuration.
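
For example, if the old bps limit was 2k/s and 1k was already dispatched
in the current slice, updating the limit to 1k/s scales the accounting as

  bytes_disp = bytes_disp * new_limit / old_limit = 1k * 1k / 2k = 0.5k

so the throttled bio keeps the credit for the time it has already waited.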

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 80 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 13 deletions(-)

Comments

Michal Koutný May 19, 2022, 9:58 a.m. UTC | #1
Hello Kuai.

On Thu, May 19, 2022 at 04:58:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
> If a new configuration is submitted while a bio is throttled, then the
> new waiting time is recalculated regardless of how long the bio has
> already waited:
> 
> tg_conf_updated
>  throtl_start_new_slice
>   tg_update_disptime
>   throtl_schedule_next_dispatch
> 
> Then an io hang can be triggered by repeatedly submitting a new
> configuration before the throttled bio is dispatched.

O.K.

> -	/*
> -	 * We're already holding queue_lock and know @tg is valid.  Let's
> -	 * apply the new config directly.
> -	 *
> -	 * Restart the slices for both READ and WRITES. It might happen
> -	 * that a group's limit are dropped suddenly and we don't want to
> -	 * account recently dispatched IO with new low rate.
> -	 */
> -	throtl_start_new_slice(tg, READ);
> -	throtl_start_new_slice(tg, WRITE);
> +	throtl_update_slice(tg, old_limits);

throtl_start_new_slice zeroes *_disp fields.
If, for instance, the new config allowed only 0.5x the throughput, the
*_disp fields would be scaled by 0.5.
How does that change help the previously throttled bio to be dispatched (sooner)?

(Is it because you omit updating slice_{start,end}?)

Thanks,
Michal
Yu Kuai May 19, 2022, 12:14 p.m. UTC | #2
On 2022/05/19 17:58, Michal Koutný wrote:
> Hello Kuai.
> 
> On Thu, May 19, 2022 at 04:58:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>> If a new configuration is submitted while a bio is throttled, then the
>> new waiting time is recalculated regardless of how long the bio has
>> already waited:
>>
>> tg_conf_updated
>>   throtl_start_new_slice
>>    tg_update_disptime
>>    throtl_schedule_next_dispatch
>>
>> Then an io hang can be triggered by repeatedly submitting a new
>> configuration before the throttled bio is dispatched.
> 
> O.K.
> 
>> -	/*
>> -	 * We're already holding queue_lock and know @tg is valid.  Let's
>> -	 * apply the new config directly.
>> -	 *
>> -	 * Restart the slices for both READ and WRITES. It might happen
>> -	 * that a group's limit are dropped suddenly and we don't want to
>> -	 * account recently dispatched IO with new low rate.
>> -	 */
>> -	throtl_start_new_slice(tg, READ);
>> -	throtl_start_new_slice(tg, WRITE);
>> +	throtl_update_slice(tg, old_limits);
> 
> throtl_start_new_slice zeroes *_disp fields.
Hi,

The problem is not just that the *_disp fields are zeroed; in fact, the
real problem is that 'slice_start' is reset to jiffies.

> If, for instance, the new config allowed only 0.5x the throughput, the
> *_disp fields would be scaled by 0.5.
> How does that change help the previously throttled bio to be dispatched (sooner)?
> 
tg_with_in_bps_limit() calculates 'wait' based on 'slice_start' and
'bytes_disp':

tg_with_in_bps_limit:
  jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
  tmp = bps_limit * jiffy_elapsed_rnd;
  do_div(tmp, HZ);
  bytes_allowed = tmp; -> how many bytes are allowed in this slice,
		         including dispatched.
  if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
   *wait = 0 -> no need to wait if this bio is within limit

  extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
  -> extra_bytes is based on 'bytes_disp'

For example:

1) bps_limit is 2k/s, we issue two ios (1k and 9k)
2) the first io (1k) will be dispatched, bytes_disp = 1k, slice_start = 0
    the second io (9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
3) after 3 s, we update bps_limit to 1k/s, then the new wait time is calculated:

without this patch: bytes_disp = 0, slice_start = 3:
bytes_allowed = 1k
extra_bytes = 9k - 1k = 8k
wait = 8s

with this patch: bytes_disp = 0.5k, slice_start = 0,
bytes_allowed = 1k * 3 + 1k = 4k
extra_bytes =  0.5k + 9k - 4k = 5.5k
wait = 5.5s
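
To double-check the numbers, here is a minimal userspace sketch of the
calculation above (the names are illustrative stand-ins, not the kernel
code; throtl_slice is assumed to be 1s and timestamps are whole seconds,
so the roundup simply becomes elapsed + 1):

#include <stdio.h>

/* simplified model of tg_with_in_bps_limit(): units are kbytes and seconds */
static double wait_secs(double bytes_disp, double bio_size,
                        double now, double slice_start, double bps_limit)
{
        /* roundup(elapsed + 1 jiffy, throtl_slice), with throtl_slice = 1s */
        double elapsed_rnd = (double)((long)(now - slice_start) + 1);
        double bytes_allowed = bps_limit * elapsed_rnd;

        if (bytes_disp + bio_size <= bytes_allowed)
                return 0.0;     /* within limit, no wait */
        return (bytes_disp + bio_size - bytes_allowed) / bps_limit;
}

int main(void)
{
        /* at t = 3s the limit drops from 2k/s to 1k/s, the 9k bio is queued */
        printf("without patch: %.1fs\n",        /* slice restarted at 3s */
               wait_secs(0.0, 9.0, 3.0, 3.0, 1.0));    /* -> 8.0s */
        printf("with patch:    %.1fs\n",        /* bytes_disp scaled 1k -> 0.5k */
               wait_secs(0.5, 9.0, 3.0, 0.0, 1.0));    /* -> 5.5s */
        return 0;
}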

I hope I can explain it clearly...

Thanks,
Kuai
> (Is it because you omit updating slice_{start,end}?)
> 
> Thanks,
> Michal
> 
> .
>
Michal Koutný May 19, 2022, 4:10 p.m. UTC | #3
On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> tg_with_in_bps_limit:
>  jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>  tmp = bps_limit * jiffy_elapsed_rnd;
>  do_div(tmp, HZ);
>  bytes_allowed = tmp; -> how many bytes are allowed in this slice,
> 		         including dispatched.
>  if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>   *wait = 0 -> no need to wait if this bio is within limit
> 
>  extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>  -> extra_bytes is based on 'bytes_disp'
> 
> For example:
> 
> 1) bps_limit is 2k/s, we issue two ios (1k and 9k)
> 2) the first io (1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>    the second io (9k) is waiting for (9 - (2 - 1)) / 2 = 4 s

The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
at 5s (i.e. 10k / 2kB/s = 5s).

> 3) after 3 s, we update bps_limit to 1k/s, then the new wait time is calculated:
> 
> without this patch: bytes_disp = 0, slice_start = 3:
> bytes_allowed = 1k	                            <--- why 1k and not 0?
> extra_bytes = 9k - 1k = 8k
> wait = 8s

This looks like it was calculated at time 4s (1s after the new config was
set).

> 
> with this patch: bytes_disp = 0.5k, slice_start = 0,
> bytes_allowed = 1k * 3 + 1k = 4k
> extra_bytes =  0.5k + 9k - 4k = 5.5k
> wait = 5.5s

This looks like it was calculated at 4s, so the IO would be waiting till
4s + 5.5s = 9.5s.

As I don't know why time 4s is used, I'll shift this calculation to the
time 3s (when the config changes):

bytes_disp = 0.5k, slice_start =  0,
bytes_allowed = 1k * 3  = 3k
extra_bytes =  0.5k + 9k - 3k = 7.5k
wait = 7.5s

In absolute time, the IO would wait till 3s+7.5s = 10.5s

OK, either your 9.5s or my 10.5s looks weird (although earlier than
original 4s+8s=12s).
However, the IO should ideally only wait till

    3s + (9k -   (6k    -    1k)     ) / 1k/s =
         bio - (allowed - dispatched)  / new_limit

   = 3s + 4k / 1k/s = 7s

   ('allowed' is based on old limit)

Or in another example, what if you change the config from 2k/s to ∞k/s
(unlimited, let's neglect the arithmetic overflow that you handle
explicitly, imagine a big number but not so big to be greater than
division result).

In such a case, the wait time should be zero, i.e. IO should be
dispatched right at the time of config change.
(With your patch it still calculates a >0 wait time, and the original
behavior gives a >0 wait too.)

> I hope I can explain it clearly...

Yes, thanks for pointing me to relevant parts.
I hope I grasped them correctly.

IOW, your patch and formula make the wait time shorter, but IO can still
be delayed indefinitely if you pass a sequence of new configs. (AFAIU)

Regards,
Michal
Yu Kuai May 20, 2022, 1:22 a.m. UTC | #4
On 2022/05/20 0:10, Michal Koutný wrote:
> On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
>> tg_with_in_bps_limit:
>>   jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>>   tmp = bps_limit * jiffy_elapsed_rnd;
>>   do_div(tmp, HZ);
>>   bytes_allowed = tmp; -> how many bytes are allowed in this slice,
>> 		         including dispatched.
>>   if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>>    *wait = 0 -> no need to wait if this bio is within limit
>>
>>   extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>>   -> extra_bytes is based on 'bytes_disp'
>>
>> For example:
>>
>> 1) bps_limit is 2k/s, we issue two ios (1k and 9k)
>> 2) the first io (1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>>     the second io (9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
> 
> The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
> at 5s (i.e. 10k / 2kB/s = 5s).
No, the example is that the second io arrived together with the first io.
> 
>> 3) after 3 s, we update bps_limit to 1k/s, then the new wait time is calculated:
>>
>> without this patch: bytes_disp = 0, slice_start = 3:
>> bytes_allowed = 1k	                            <--- why 1k and not 0?
Because slice_start == jiffies, bytes_allowed is equal to bps_limit
>> extra_bytes = 9k - 1k = 8k
>> wait = 8s
> 
> This looks like it was calculated at time 4s (1s after the new config was
> set).
No... it was calculated at time 3s:

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

jiffies should be greater than 3s here, thus jiffy_elapsed_rnd is
3s + throtl_slice (I'm using throtl_slice = 1s here; it should not
affect the result)
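
Concretely, with throtl_slice = 1s: with the patch, slice_start = 0, so
jiffy_elapsed_rnd = roundup(3s + 1 jiffy, 1s) = 4s and bytes_allowed =
1k/s * 4s = 4k; without the patch, slice_start = 3s, so jiffy_elapsed_rnd
= roundup(0 + 1 jiffy, 1s) = 1s and bytes_allowed = 1k.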
> 
>>
>> with this patch: bytes_disp = 0.5k, slice_start = 0,
>> bytes_allowed = 1k * 3 + 1k = 4k
>> extra_bytes =  0.5k + 9k - 4k = 5.5k
>> wait = 5.5s
> 
> This looks like it was calculated at 4s, so the IO would be waiting till
> 4s + 5.5s = 9.5s.
The wait time is based on extra_bytes; it is really 5.5s, adding 4s is
wrong here.

bytes_allowed = ((jiffies - slice_start) / Hz + 1) * bps_limit
extra_bytes = bio_size + bytes_disp - bytes_allowed
wait = extra_bytes / bps_limit
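
Plugging the example into these formulas: bytes_allowed = (3 + 1) * 1k = 4k,
extra_bytes = 9k + 0.5k - 4k = 5.5k, and wait = 5.5k / (1k/s) = 5.5s.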
> 
> As I don't know why time 4s is used, I'll shift this calculation to the
> time 3s (when the config changes):
> 
> bytes_disp = 0.5k, slice_start =  0,
> bytes_allowed = 1k * 3  = 3k
> extra_bytes =  0.5k + 9k - 3k = 7.5k
6.5k
> wait = 7.5s
> 
> In absolute time, the IO would wait till 3s+7.5s = 10.5s
As I said above, the wait time should not include (jiffies - slice_start)
> 
> OK, either your 9.5s or my 10.5s looks weird (although earlier than
> original 4s+8s=12s).
> However, the IO should ideally only wait till
> 
>      3s + (9k -   (6k    -    1k)     ) / 1k/s =
>           bio - (allowed - dispatched)  / new_limit
> 
>     = 3s + 4k / 1k/s = 7s
> 
>     ('allowed' is based on old limit)
> 
> Or in another example, what if you change the config from 2k/s to ∞k/s
> (unlimited, let's neglect the arithmetic overflow that you handle
> explicitly, imagine a big number but not so big to be greater than
> division result).
> 
> In such a case, the wait time should be zero, i.e. IO should be
> dispatched right at the time of config change.

I thought about it; however, IMO this is not a good idea. If the user
updates the config frequently, io throttling becomes ineffective.

Thanks,
Kuai
> (With your patch it still calculates a >0 wait time, and the original
> behavior gives a >0 wait too.)
> 
>> I hope I can explain it clearly...
> 
> Yes, thanks for pointing me to relevant parts.
> I hope I grasped them correctly.
> 
> IOW, your patch and formula make the wait time shorter, but IO can still
> be delayed indefinitely if you pass a sequence of new configs. (AFAIU)
> 
> Regards,
> Michal
> .
>
Yu Kuai May 20, 2022, 1:36 a.m. UTC | #5
On 2022/05/20 9:22, yukuai (C) wrote:
> On 2022/05/20 0:10, Michal Koutný wrote:
>> On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" 
>> <yukuai3@huawei.com> wrote:
>>> tg_with_in_bps_limit:
>>>   jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>>>   tmp = bps_limit * jiffy_elapsed_rnd;
>>>   do_div(tmp, HZ);
>>>   bytes_allowed = tmp; -> how many bytes are allowed in this slice,
>>>                  including dispatched.
>>>   if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>>>    *wait = 0 -> no need to wait if this bio is within limit
>>>
>>>   extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>>>   -> extra_bytes is based on 'bytes_disp'
>>>
>>> For example:
>>>
>>> 1) bps_limit is 2k/s, we issue two ios (1k and 9k)
>>> 2) the first io (1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>>>     the second io (9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
>>
>> The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
>> at 5s (i.e. 10k / 2kB/s = 5s).
> No, the example is that the second io arrived together with the first io.
>>
>>> 3) after 3 s, we update bps_limit to 1k/s, then the new wait time is calculated:
>>>
>>> without this patch: bytes_disp = 0, slice_start = 3:
>>> bytes_allowed = 1k                                <--- why 1k and not 0?
> Because slice_start == jiffies, bytes_allowed is equal to bps_limit
>>> extra_bytes = 9k - 1k = 8k
>>> wait = 8s
>>
>> This looks like it was calculated at time 4s (1s after the new config was
>> set).
> No... it was calculated at time 3s:
> 
> jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
> 
> jiffies should be greater than 3s here, thus jiffy_elapsed_rnd is
> 3s + throtl_slice (I'm using throtl_slice = 1s here; it should not
> affect the result)
Hi,

Just to simplify the explanation (assume that throtl_slice is greater than
0.5s):
Without this patch:
the wait time is calculated based on issuing 9k from now (3s) without any
bytes already dispatched.

With this patch:
the wait time is calculated based on issuing 9k from 0s with 0.5k bytes
already dispatched.
>>
>>>
>>> with this patch: bytes_disp = 0.5k, slice_start = 0,
>>> bytes_allowed = 1k * 3 + 1k = 4k
>>> extra_bytes =  0.5k + 9k - 4k = 5.5k
>>> wait = 5.5s
>>
>> This looks like it was calculated at 4s, so the IO would be waiting till
>> 4s + 5.5s = 9.5s.
> The wait time is based on extra_bytes; it is really 5.5s, adding 4s is
> wrong here.
> 
> bytes_allowed = ((jiffies - slice_start) / Hz + 1) * bps_limit
> extra_bytes = bio_size + bytes_disp - bytes_allowed
> wait = extra_bytes / bps_limit
>>
>> As I don't know why time 4s is used, I'll shift this calculation to the
>> time 3s (when the config changes):
>>
>> bytes_disp = 0.5k, slice_start =  0,
>> bytes_allowed = 1k * 3  = 3k
>> extra_bytes =  0.5k + 9k - 3k = 7.5k
> 6.5k
>> wait = 7.5s
>>
>> In absolute time, the IO would wait till 3s+7.5s = 10.5s
> As I said above, the wait time should not include (jiffies - slice_start)
>>
>> OK, either your 9.5s or my 10.5s looks weird (although earlier than
>> original 4s+8s=12s).
>> However, the IO should ideally only wait till
>>
>>      3s + (9k -   (6k    -    1k)     ) / 1k/s =
>>           bio - (allowed - dispatched)  / new_limit
>>
>>     = 3s + 4k / 1k/s = 7s
>>
>>     ('allowed' is based on old limit)
>>
>> Or in another example, what if you change the config from 2k/s to ∞k/s
>> (unlimited, let's neglect the arithmetic overflow that you handle
>> explicitly, imagine a big number but not so big to be greater than
>> division result).
>>
>> In such a case, the wait time should be zero, i.e. IO should be
>> dispatched right at the time of config change.
> 
> I thought about it; however, IMO this is not a good idea. If the user
> updates the config frequently, io throttling becomes ineffective.
> 
> Thanks,
> Kuai
>> (With your patch it still calculates a >0 wait time, and the original
>> behavior gives a >0 wait too.)
>>
>>> I hope I can explain it clearly...
>>
>> Yes, thanks for pointing me to relevant parts.
>> I hope I grasped them correctly.
>>
>> IOW, your patch and formula make the wait time shorter, but IO can still
>> be delayed indefinitely if you pass a sequence of new configs. (AFAIU)
>>
>> Regards,
>> Michal
>> .
>>
Michal Koutný May 20, 2022, 4:03 p.m. UTC | #6
On Fri, May 20, 2022 at 09:36:11AM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> Just to simplify the explanation (assume that throtl_slice is greater than
> 0.5s):
> Without this patch:
> the wait time is calculated based on issuing 9k from now (3s) without any
> bytes already dispatched.

I acknowledge that pre-patch state is incorrect because it erases
already passed wait-time from the previous slice.

> With this patch:
> the wait time is calculated based on issuing 9k from 0s with 0.5k bytes
> already dispatched.

Thanks for your further hint. Hopefully, I'm getting closer to a real
understanding. Now, I calculate the wait times as durations between the
current moment and the timepoint when a bio can be dispatched.

IIUC, after config change the ideal wait time of a bio is

    wait_ideal := (disp + bio - Δt*l_old) / l_new

where Δt is the elapsed time of the current slice.
You maintain the slice but scale disp, so you get

    wait_kuai := ((l_new/l_old)*disp + bio - Δt*l_new) / l_new
               = disp / l_old + bio / l_new - Δt

Please confirm we're on the same page here.

Then I look at

    error := wait_kuai - wait_ideal
          ...
	  = (Δt * l_old - disp) * (1/l_new - 1/l_old)
	  = (Δt * l_old - disp) * (1 - α) / (α * l_old)
where
    α = l_new / l_old

The leftmost term is the unconsumed IO of the slice. Say it's positive
while the bigger bio is throttled at the moment of a config change.
If the config change increases throttling (α < 1), the error grows very
high (i.e. over-throttling similar to the existing behavior).
If the config change relieves throttling (α > 1), the wait time's
slightly shorter (under-throttling) wrt the ideal.
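
To make the numbers concrete, here is a small self-contained sketch
(names are illustrative only) that evaluates both formulas for the
running example at the moment of the config change:

#include <stdio.h>

int main(void)
{
        /* running example: disp = 1k, bio = 9k, Δt = 3s,
         * l_old = 2k/s, l_new = 1k/s (units of k and s) */
        double disp = 1.0, bio = 9.0, dt = 3.0, l_old = 2.0, l_new = 1.0;

        double wait_ideal = (disp + bio - dt * l_old) / l_new;
        double wait_kuai  = disp / l_old + bio / l_new - dt;

        printf("wait_ideal = %.1fs\n", wait_ideal); /* 4.0s -> dispatch at 7s */
        printf("wait_kuai  = %.1fs\n", wait_kuai);  /* 6.5s -> dispatch at 9.5s */
        printf("error      = %.1fs\n", wait_kuai - wait_ideal); /* 2.5s */
        return 0;
}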

If I were to propose a correction, it'd be like the patch at the bottom,
derived from yours but not finished (the XXX part). It's for potential
further discussion.


I had got myself carried away with the formulas. If I go back to the
beginning:

> Then an io hang can be triggered by repeatedly submitting a new
> configuration before the throttled bio is dispatched.

How big a problem is this actually? Is it only shooting oneself in the
foot, or can there be a user who's privileged enough to modify the
throttling configuration yet not privileged enough to justify the hang's
consequences (like some global FS locks)?


Thanks,
Michal

--- 8< ---
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 469c483719be..3fd458d16f31 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1274,7 +1274,62 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
 	return 0;
 }
 
-static void tg_conf_updated(struct throtl_grp *tg, bool global)
+static u64 throtl_update_slice_scale(unsigned int slice_start, u64 new_limit,
+				     u64 old_limit)
+{
+	if (new_limit == old_limit)
+		return slice_start;
+
+	/* This shouldn't really matter but semantically we want to extend the
+	 * slice from the earliest possible point of time. */
+	if (WARN_ON(new_limit == 0))
+		return 0;
+
+	return jiffies - div64_u64((jiffies - slice_start) * old_limit, new_limit);
+}
+
+static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
+{
+	/*
+	 * How does this work? We're going to calculate new wait time in
+	 * tg_with_in_bps_limit(). Ideal wait time after config change is
+	 *
+	 *   wait_ideal := (disp + bio - Δt*l_old) / l_new
+	 *
+	 * where Δt = jiffies - tg->slice_start (elapsed time of slice).
+	 * In reality, the function has no idea about l_old so it calculates
+	 *
+	 *   wait_skewed := (disp + bio - Δt*l_new) / l_new
+	 *
+	 * So we modify slice_start to get correct number
+	 *
+	 *   wait_fixed := (disp + bio - Δt'*l_new) / l_new == wait_ideal
+	 *
+	 * from that
+	 *   Δt' = Δt * l_old / l_new
+	 * or
+	 *   jiffies - slice_start' = (jiffies - slice_start) * l_old / l_new
+	 * .
+	 */
+	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
+							   tg_bps_limit(tg, READ),
+							   old_limits[0]);
+	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
+							   tg_bps_limit(tg, WRITE),
+							   old_limits[1]);
+
+	// XXX This looks like OK since we should not change BPS and IOPS limit
+	// at the same time but it is not actually OK because scaling
+	// slice_start for one limit breaks the other anyway.
+	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
+							   tg_iops_limit(tg, READ),
+							   old_limits[2]);
+	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
+							   tg_iops_limit(tg, WRITE),
+							   old_limits[3]);
+}
+
+static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct cgroup_subsys_state *pos_css;
@@ -1313,16 +1368,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 				parent_tg->latency_target);
 	}
 
-	/*
-	 * We're already holding queue_lock and know @tg is valid.  Let's
-	 * apply the new config directly.
-	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
-	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_update_slice(tg, old_limits);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1330,6 +1376,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	}
 }
 
+static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
+{
+	limits[0] = tg_bps_limit(tg, READ);
+	limits[1] = tg_bps_limit(tg, WRITE);
+	limits[2] = tg_iops_limit(tg, READ);
+	limits[3] = tg_iops_limit(tg, WRITE);
+}
+
 static ssize_t tg_set_conf(struct kernfs_open_file *of,
 			   char *buf, size_t nbytes, loff_t off, bool is_u64)
 {
@@ -1338,6 +1392,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	int ret;
 	u64 v;
+	u64 old_limits[4];
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1350,13 +1405,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
 	else
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
 
-	tg_conf_updated(tg, false);
+	tg_conf_updated(tg, old_limits, false);
 	ret = 0;
 out_finish:
 	blkg_conf_finish(&ctx);
@@ -1526,6 +1582,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	u64 old_limits[4];
 	unsigned long idle_time;
 	unsigned long latency_time;
 	int ret;
@@ -1536,6 +1593,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
@@ -1627,7 +1685,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 	} else
 		tg->td->limit_index = LIMIT_MAX;
-	tg_conf_updated(tg, index == LIMIT_LOW &&
+	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
 		tg->td->limit_valid[LIMIT_LOW]);
 	ret = 0;
 out_finish:
Tejun Heo May 20, 2022, 4:20 p.m. UTC | #7
Hello,

On Fri, May 20, 2022 at 06:03:05PM +0200, Michal Koutný wrote:
> > Then an io hang can be triggered by repeatedly submitting a new
> > configuration before the throttled bio is dispatched.
> 
> How big a problem is this actually? Is it only shooting oneself in the
> foot, or can there be a user who's privileged enough to modify the
> throttling configuration yet not privileged enough to justify the hang's
> consequences (like some global FS locks)?

So, the problem in itself is of the self-inflicted type and I'd prefer to
ignore it. Unfortunately, the kernel doesn't have the kind of isolation
where stalling out some arbitrary tasks is generally safe, especially not
blk-throtl as it doesn't handle bio_issue_as_root() and thus can have
pretty severe priority inversions where IOs which can block system-wide
operations (e.g. memory reclaim) get trapped in a random cgroup.

Even ignoring that, the kernel in general assumes some forward progress from
everybody, and when a part stalls it's relatively easy for the stall to spread
to the rest of the system, sometimes gradually, sometimes suddenly - e.g. if
the stalled IO was being performed while holding the mmap_sem, which isn't
rare, then anything which tries to read its proc cmdline will hang behind it.

So, we wanna avoid a situation where a non-privileged user can cause
indefinite UNINTERRUPTIBLE sleeps, to prevent local DoS attacks. I mean,
preventing local attacks is almost never foolproof but we don't want to
make it too easy at least.

Thanks.
Yu Kuai May 21, 2022, 3:01 a.m. UTC | #8
On 2022/05/21 0:03, Michal Koutný wrote:
> On Fri, May 20, 2022 at 09:36:11AM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
>> Just to simplify the explanation (assume that throtl_slice is greater than
>> 0.5s):
>> Without this patch:
>> the wait time is calculated based on issuing 9k from now (3s) without any
>> bytes already dispatched.
> 
> I acknowledge that pre-patch state is incorrect because it erases
> already passed wait-time from the previous slice.
> 
>> With this patch:
>> the wait time is calculated based on issuing 9k from 0s with 0.5k bytes
>> already dispatched.
> 
> Thanks for your further hint. Hopefully, I'm getting closer to a real
> understanding. Now, I calculate the wait times as durations between the
> current moment and the timepoint when a bio can be dispatched.
> 
> IIUC, after config change the ideal wait time of a bio is
> 
>      wait_ideal := (disp + bio - Δt*l_old) / l_new
> 
> where Δt is the elapsed time of the current slice.
> You maintain the slice but scale disp, so you get
> 
>      wait_kuai := ((l_new/l_old)*disp + bio - Δt*l_new) / l_new
>                 = disp / l_old + bio / l_new - Δt
> 
> Please confirm we're on the same page here.
Hi, Michal

Yes we're on the same page here.
> 
> Then I look at
> 
>      error := wait_kuai - wait_ideal
>            ...
> 	  = (Δt * l_old - disp) * (1/l_new - 1/l_old)
> 	  = (Δt * l_old - disp) * (1 - α) / (α * l_old)
> where
>      α = l_new / l_old
> 
> The leftmost term is the unconsumed IO of the slice. Say it's positive
> while the bigger bio is throttled at the moment of a config change.
> If the config change increases throttling (α < 1), the error grows very
> high (i.e. over-throttling similar to the existing behavior).
> If the config change relieves throttling (α > 1), the wait time's
> slightly shorter (under-throttling) wrt the ideal.
Yes, you are right.
> 
> If I were to propose a correction, it'd be like the patch at the bottom,
> derived from yours but not finished (the XXX part). It's for potential
> further discussion.
I thought about it; however, I was thinking that for such a corner case,
fixing the io hang is probably enough. Now with the formula that you
sorted out, it's clear this is better.

Thanks,
Kuai
> 
> 
> I had got myself carried away with the formulas. If I go back to the
> beginning:
> 
>> Then an io hang can be triggered by repeatedly submitting a new
>> configuration before the throttled bio is dispatched.
> 
> How big a problem is this actually? Is it only shooting oneself in the
> foot, or can there be a user who's privileged enough to modify the
> throttling configuration yet not privileged enough to justify the hang's
> consequences (like some global FS locks)?
> 
> 
> Thanks,
> Michal
> 
> --- 8< ---
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 469c483719be..3fd458d16f31 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1274,7 +1274,62 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
>   	return 0;
>   }
>   
> -static void tg_conf_updated(struct throtl_grp *tg, bool global)
> +static u64 throtl_update_slice_scale(unsigned int slice_start, u64 new_limit,
> +				     u64 old_limit)
> +{
> +	if (new_limit == old_limit)
> +		return slice_start;
> +
> +	/* This shouldn't really matter but semantically we want to extend the
> +	 * slice from the earliest possible point of time. */
> +	if (WARN_ON(new_limit == 0))
> +		return 0;
> +
> +	return jiffies - div64_u64((jiffies - slice_start) * old_limit, new_limit);
> +}
> +
> +static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
> +{
> +	/*
> +	 * How does this work? We're going to calculate new wait time in
> +	 * tg_with_in_bps_limit(). Ideal wait time after config change is
> +	 *
> +	 *   wait_ideal := (disp + bio - Δt*l_old) / l_new
> +	 *
> +	 * where Δt = jiffies - tg->slice_start (elapsed time of slice).
> +	 * In reality, the function has no idea about l_old so it calculates
> +	 *
> +	 *   wait_skewed := (disp + bio - Δt*l_new) / l_new
> +	 *
> +	 * So we modify slice_start to get correct number
> +	 *
> +	 *   wait_fixed := (disp + bio - Δt'*l_new) / l_new == wait_ideal
> +	 *
> +	 * from that
> +	 *   Δt' = Δt * l_old / l_new
> +	 * or
> +	 *   jiffies - slice_start' = (jiffies - slice_start) * l_old / l_new
> +	 * .
> +	 */
> +	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
> +							   tg_bps_limit(tg, READ),
> +							   old_limits[0]);
> +	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
> +							   tg_bps_limit(tg, WRITE),
> +							   old_limits[1]);
> +
> +	// XXX This looks like OK since we should not change BPS and IOPS limit
> +	// at the same time but it is not actually OK because scaling
> +	// slice_start for one limit breaks the other anyway.
> +	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
> +							   tg_iops_limit(tg, READ),
> +							   old_limits[2]);
> +	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
> +							   tg_iops_limit(tg, WRITE),
> +							   old_limits[3]);
> +}
> +
> +static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
>   {
>   	struct throtl_service_queue *sq = &tg->service_queue;
>   	struct cgroup_subsys_state *pos_css;
> @@ -1313,16 +1368,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
>   				parent_tg->latency_target);
>   	}
>   
> -	/*
> -	 * We're already holding queue_lock and know @tg is valid.  Let's
> -	 * apply the new config directly.
> -	 *
> -	 * Restart the slices for both READ and WRITES. It might happen
> -	 * that a group's limit are dropped suddenly and we don't want to
> -	 * account recently dispatched IO with new low rate.
> -	 */
> -	throtl_start_new_slice(tg, READ);
> -	throtl_start_new_slice(tg, WRITE);
> +	throtl_update_slice(tg, old_limits);
>   
>   	if (tg->flags & THROTL_TG_PENDING) {
>   		tg_update_disptime(tg);
> @@ -1330,6 +1376,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
>   	}
>   }
>   
> +static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
> +{
> +	limits[0] = tg_bps_limit(tg, READ);
> +	limits[1] = tg_bps_limit(tg, WRITE);
> +	limits[2] = tg_iops_limit(tg, READ);
> +	limits[3] = tg_iops_limit(tg, WRITE);
> +}
> +
>   static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   			   char *buf, size_t nbytes, loff_t off, bool is_u64)
>   {
> @@ -1338,6 +1392,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   	struct throtl_grp *tg;
>   	int ret;
>   	u64 v;
> +	u64 old_limits[4];
>   
>   	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
>   	if (ret)
> @@ -1350,13 +1405,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   		v = U64_MAX;
>   
>   	tg = blkg_to_tg(ctx.blkg);
> +	tg_get_limits(tg, old_limits);
>   
>   	if (is_u64)
>   		*(u64 *)((void *)tg + of_cft(of)->private) = v;
>   	else
>   		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
>   
> -	tg_conf_updated(tg, false);
> +	tg_conf_updated(tg, old_limits, false);
>   	ret = 0;
>   out_finish:
>   	blkg_conf_finish(&ctx);
> @@ -1526,6 +1582,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   	struct blkg_conf_ctx ctx;
>   	struct throtl_grp *tg;
>   	u64 v[4];
> +	u64 old_limits[4];
>   	unsigned long idle_time;
>   	unsigned long latency_time;
>   	int ret;
> @@ -1536,6 +1593,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   		return ret;
>   
>   	tg = blkg_to_tg(ctx.blkg);
> +	tg_get_limits(tg, old_limits);
>   
>   	v[0] = tg->bps_conf[READ][index];
>   	v[1] = tg->bps_conf[WRITE][index];
> @@ -1627,7 +1685,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   			tg->td->limit_index = LIMIT_LOW;
>   	} else
>   		tg->td->limit_index = LIMIT_MAX;
> -	tg_conf_updated(tg, index == LIMIT_LOW &&
> +	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
>   		tg->td->limit_valid[LIMIT_LOW]);
>   	ret = 0;
>   out_finish:
> .
>
Yu Kuai May 21, 2022, 3:51 a.m. UTC | #9
On 2022/05/21 0:20, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 20, 2022 at 06:03:05PM +0200, Michal Koutný wrote:
>>> Then an io hang can be triggered by repeatedly submitting a new
>>> configuration before the throttled bio is dispatched.
>>
>> How big a problem is this actually? Is it only shooting oneself in the
>> foot, or can there be a user who's privileged enough to modify the
>> throttling configuration yet not privileged enough to justify the hang's
>> consequences (like some global FS locks)?
> 
> So, the problem in itself is of the self-inflicted type and I'd prefer to
> ignore it. Unfortunately, the kernel doesn't have the kind of isolation
> where stalling out some arbitrary tasks is generally safe, especially not
> blk-throtl as it doesn't handle bio_issue_as_root() and thus can have
> pretty severe priority inversions where IOs which can block system-wide
> operations (e.g. memory reclaim) get trapped in a random cgroup.
Hi, Tejun

It's right that the problem is self-inflicted. However, I do think that
with Michal's suggestion, the way throttled bios are handled while a new
config is submitted really makes sense from the functional point of view.

Do you think the solution is OK?

Thanks,
Kuai
> 
> Even ignoring that, the kernel in general assumes some forward progress from
> everybody, and when a part stalls it's relatively easy for the stall to spread
> to the rest of the system, sometimes gradually, sometimes suddenly - e.g. if
> the stalled IO was being performed while holding the mmap_sem, which isn't
> rare, then anything which tries to read its proc cmdline will hang behind it.
> 
> So, we wanna avoid a situation where a non-privileged user can cause
> indefinite UNINTERRUPTIBLE sleeps, to prevent local DoS attacks. I mean,
> preventing local attacks is almost never foolproof but we don't want to
> make it too easy at least.
> 
> Thanks.
>
Tejun Heo May 21, 2022, 5 a.m. UTC | #10
On Sat, May 21, 2022 at 11:51:11AM +0800, yukuai (C) wrote:
> It's right that the problem is self-inflicted. However, I do think that
> with Michal's suggestion, the way throttled bios are handled while a new
> config is submitted really makes sense from the functional point of view.
> 
> Do you think the solution is OK?

I haven't followed the details, but anything which isn't overly complex and
doesn't produce extra budget or eat into the existing one on config change is
fine by me.

Thanks.

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0c37be08ff28..aca63148bb83 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1271,7 +1271,58 @@  static int tg_print_conf_uint(struct seq_file *sf, void *v)
 	return 0;
 }
 
-static void tg_conf_updated(struct throtl_grp *tg, bool global)
+static u64 throtl_update_bytes_disp(u64 dispatched, u64 new_limit,
+				    u64 old_limit)
+{
+	if (new_limit == old_limit)
+		return dispatched;
+
+	if (!dispatched)
+		return 0;
+
+	/*
+	 * In the case that multiply will overflow, just return 0. It will only
+	 * let bios to be dispatched earlier.
+	 */
+	if (div64_u64(U64_MAX, dispatched) < new_limit)
+		return 0;
+
+	dispatched *= new_limit;
+	return div64_u64(dispatched, old_limit);
+}
+
+static u32 throtl_update_io_disp(u32 dispatched, u32 new_limit, u32 old_limit)
+{
+	if (new_limit == old_limit)
+		return dispatched;
+
+	if (!dispatched)
+		return 0;
+
+	/*
+	 * In the case that multiply will overflow, just return 0. It will only
+	 * let bios to be dispatched earlier.
+	 */
+	if (UINT_MAX / dispatched < new_limit)
+		return 0;
+
+	dispatched *= new_limit;
+	return dispatched / old_limit;
+}
+
+static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
+{
+	tg->bytes_disp[READ] = throtl_update_bytes_disp(tg->bytes_disp[READ],
+			tg_bps_limit(tg, READ), old_limits[0]);
+	tg->bytes_disp[WRITE] = throtl_update_bytes_disp(tg->bytes_disp[WRITE],
+			tg_bps_limit(tg, WRITE), old_limits[1]);
+	tg->io_disp[READ] = throtl_update_io_disp(tg->io_disp[READ],
+			tg_iops_limit(tg, READ), (u32)old_limits[2]);
+	tg->io_disp[WRITE] = throtl_update_io_disp(tg->io_disp[WRITE],
+			tg_iops_limit(tg, WRITE), (u32)old_limits[3]);
+}
+
+static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct cgroup_subsys_state *pos_css;
@@ -1310,16 +1361,7 @@  static void tg_conf_updated(struct throtl_grp *tg, bool global)
 				parent_tg->latency_target);
 	}
 
-	/*
-	 * We're already holding queue_lock and know @tg is valid.  Let's
-	 * apply the new config directly.
-	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
-	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_update_slice(tg, old_limits);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1327,6 +1369,14 @@  static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	}
 }
 
+static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
+{
+	limits[0] = tg_bps_limit(tg, READ);
+	limits[1] = tg_bps_limit(tg, WRITE);
+	limits[2] = tg_iops_limit(tg, READ);
+	limits[3] = tg_iops_limit(tg, WRITE);
+}
+
 static ssize_t tg_set_conf(struct kernfs_open_file *of,
 			   char *buf, size_t nbytes, loff_t off, bool is_u64)
 {
@@ -1335,6 +1385,7 @@  static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	int ret;
 	u64 v;
+	u64 old_limits[4];
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1347,13 +1398,14 @@  static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
 	else
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
 
-	tg_conf_updated(tg, false);
+	tg_conf_updated(tg, old_limits, false);
 	ret = 0;
 out_finish:
 	blkg_conf_finish(&ctx);
@@ -1523,6 +1575,7 @@  static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	u64 old_limits[4];
 	unsigned long idle_time;
 	unsigned long latency_time;
 	int ret;
@@ -1533,6 +1586,7 @@  static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
@@ -1624,7 +1678,7 @@  static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 	} else
 		tg->td->limit_index = LIMIT_MAX;
-	tg_conf_updated(tg, index == LIMIT_LOW &&
+	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
 		tg->td->limit_valid[LIMIT_LOW]);
 	ret = 0;
 out_finish: