blk-throttle: fix possible io stall when doing upgrade

Message ID 5b918e35-7072-ba9a-92cc-726d02777b4f@gmail.com (mailing list archive)
State New, archived

Commit Message

Joseph Qi Sept. 25, 2017, 10:46 a.m. UTC
From: Joseph Qi <qijiang.qj@alibaba-inc.com>

Currently it will try to dispatch bio in throtl_upgrade_state. This may
lead to io stall in the following case.
Say the hierarchy is like:
/-test1
  |-subtest1
and subtest1 has 32 queued bios now.

throtl_pending_timer_fn            throtl_upgrade_state
------------------------------------------------------------------------
                                   upgrade to max
                                   throtl_select_dispatch
                                   throtl_schedule_next_dispatch
throtl_select_dispatch
throtl_schedule_next_dispatch

Since throtl_select_dispatch in throtl_upgrade_state has already moved the
queued bios from subtest1 to test1, throtl_pending_timer_fn then finds
nothing to do. As a result, the queued bios won't be dispatched
any more if no proper timer is scheduled.
Fix this issue by just scheduling the dispatch now and letting the
dispatch itself be done only in the timer.

Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
---
 block/blk-throttle.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Shaohua Li Sept. 25, 2017, 5:22 p.m. UTC | #1
On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
> 
> Currently it will try to dispatch bio in throtl_upgrade_state. This may
> lead to io stall in the following case.
> Say the hierarchy is like:
> /-test1
>   |-subtest1
> and subtest1 has 32 queued bios now.
> 
> throtl_pending_timer_fn            throtl_upgrade_state
> ------------------------------------------------------------------------
>                                    upgrade to max
>                                    throtl_select_dispatch
>                                    throtl_schedule_next_dispatch
> throtl_select_dispatch
> throtl_schedule_next_dispatch
> 
> Since throtl_select_dispatch will move queued bios from subtest1 to
> test1 in throtl_upgrade_state, it will then just do nothing in
> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
> any more if no proper timer scheduled.

Sorry, I didn't get it. If throtl_pending_timer_fn does nothing (because
throtl_upgrade_state has already moved the bios to the parent), there is no
pending blkcg/bio, so not rearming the timer wouldn't lose anything. Am I
missing anything? Could you please describe the failure in detail?

Thanks,
Shaohua

> Fix this issue by just scheduling dispatch now and let the dispatch be
> done only in timer.
> 
> Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
> ---
>  block/blk-throttle.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0fea76a..29d282f 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1909,13 +1909,11 @@ static void throtl_upgrade_state(struct throtl_data *td)
>  		struct throtl_grp *tg = blkg_to_tg(blkg);
>  		struct throtl_service_queue *sq = &tg->service_queue;
>  
> -		tg->disptime = jiffies - 1;
> -		throtl_select_dispatch(sq);
> -		throtl_schedule_next_dispatch(sq, false);
> +		tg->disptime = jiffies;
> +		throtl_schedule_next_dispatch(sq, true);
>  	}
>  	rcu_read_unlock();
> -	throtl_select_dispatch(&td->service_queue);
> -	throtl_schedule_next_dispatch(&td->service_queue, false);
> +	throtl_schedule_next_dispatch(&td->service_queue, true);
>  	queue_work(kthrotld_workqueue, &td->dispatch_work);
>  }
>  
> -- 
> 1.9.4
Joseph Qi Sept. 26, 2017, 1:06 a.m. UTC | #2
Hi Shaohua,

On 17/9/26 01:22, Shaohua Li wrote:
> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
>>
>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>> lead to io stall in the following case.
>> Say the hierarchy is like:
>> /-test1
>>   |-subtest1
>> and subtest1 has 32 queued bios now.
>>
>> throtl_pending_timer_fn            throtl_upgrade_state
>> ------------------------------------------------------------------------
>>                                    upgrade to max
>>                                    throtl_select_dispatch
>>                                    throtl_schedule_next_dispatch
>> throtl_select_dispatch
>> throtl_schedule_next_dispatch
>>
>> Since throtl_select_dispatch will move queued bios from subtest1 to
>> test1 in throtl_upgrade_state, it will then just do nothing in
>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>> any more if no proper timer scheduled.
> 
> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
> throtl_upgrade_state already moves bios to parent), there is no pending
> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
> anything? could you please describe the failure in details?
> 
> Thanks,
> Shaohua
> 

In the normal case, throtl_pending_timer_fn tries to move bios from
subtest1 to test1, and finally does the real issuing work when it
reaches the top level.
But in the case above, throtl_select_dispatch in
throtl_pending_timer_fn returns 0, because the work has already been
done by throtl_upgrade_state. Then throtl_pending_timer_fn *thinks*
there is nothing to do, but the queued bios are still in the service
queue of test1.
Since neither throtl_pending_timer_fn nor throtl_upgrade_state handles
the queued bios, an io stall happens.

Thanks,
Joseph
Shaohua Li Sept. 26, 2017, 2:48 a.m. UTC | #3
On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
> Hi Shaohua,
> 
> On 17/9/26 01:22, Shaohua Li wrote:
> > On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
> >> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
> >>
> >> Currently it will try to dispatch bio in throtl_upgrade_state. This may
> >> lead to io stall in the following case.
> >> Say the hierarchy is like:
> >> /-test1
> >>   |-subtest1
> >> and subtest1 has 32 queued bios now.
> >>
> >> throtl_pending_timer_fn            throtl_upgrade_state
> >> ------------------------------------------------------------------------
> >>                                    upgrade to max
> >>                                    throtl_select_dispatch
> >>                                    throtl_schedule_next_dispatch
> >> throtl_select_dispatch
> >> throtl_schedule_next_dispatch
> >>
> >> Since throtl_select_dispatch will move queued bios from subtest1 to
> >> test1 in throtl_upgrade_state, it will then just do nothing in
> >> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
> >> any more if no proper timer scheduled.
> > 
> > Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
> > throtl_upgrade_state already moves bios to parent), there is no pending
> > blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
> > anything? could you please describe the failure in details?
> > 
> > Thanks,
> > Shaohua
> > 
> 
> In normal case, throtl_pending_timer_fn tries to move bios from
> subtest1 to test1, and finally do the real issueing work when reach
> the top-level.
> But int the case above, throtl_select_dispatch in
> throtl_pending_timer_fn returns 0, because the work is done by
> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
> nothing to do, but the queued bios are still in service queue of
> test1.

Still didn't get it, sorry. If there are pending bios in test1, why doesn't
throtl_schedule_next_dispatch in throtl_pending_timer_fn set up the
timer?
Joseph Qi Sept. 26, 2017, 3:16 a.m. UTC | #4
On 17/9/26 10:48, Shaohua Li wrote:
> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>> Hi Shaohua,
>>
>> On 17/9/26 01:22, Shaohua Li wrote:
>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
>>>>
>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>> lead to io stall in the following case.
>>>> Say the hierarchy is like:
>>>> /-test1
>>>>   |-subtest1
>>>> and subtest1 has 32 queued bios now.
>>>>
>>>> throtl_pending_timer_fn            throtl_upgrade_state
>>>> ------------------------------------------------------------------------
>>>>                                    upgrade to max
>>>>                                    throtl_select_dispatch
>>>>                                    throtl_schedule_next_dispatch
>>>> throtl_select_dispatch
>>>> throtl_schedule_next_dispatch
>>>>
>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>> any more if no proper timer scheduled.
>>>
>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>> anything? could you please describe the failure in details?
>>>
>>> Thanks,
>>> Shaohua
>>>
>>
>> In normal case, throtl_pending_timer_fn tries to move bios from
>> subtest1 to test1, and finally do the real issueing work when reach
>> the top-level.
>> But int the case above, throtl_select_dispatch in
>> throtl_pending_timer_fn returns 0, because the work is done by
>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>> nothing to do, but the queued bios are still in service queue of
>> test1.
> 
> Still didn't get, sorry. If there are pending bios in test1, why
> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
> timer?
> 

throtl_schedule_next_dispatch doesn't set up the timer because there are
no pending children left; all the queued bios have been moved to the
parent test1 by now. IMO, the timer is meant for the case where not all
queued bios can be dispatched in one round.
And if the select dispatch is done by the timer, it will then propagate
the dispatch to the parent until it reaches the top level.
But the case above breaks this logic.
Please point out if my understanding is wrong.

Thanks,
Joseph
Shaohua Li Sept. 27, 2017, 9:38 p.m. UTC | #5
On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
> 
> 
> On 17/9/26 10:48, Shaohua Li wrote:
> > On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
> >> Hi Shaohua,
> >>
> >> On 17/9/26 01:22, Shaohua Li wrote:
> >>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
> >>>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
> >>>>
> >>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
> >>>> lead to io stall in the following case.
> >>>> Say the hierarchy is like:
> >>>> /-test1
> >>>>   |-subtest1
> >>>> and subtest1 has 32 queued bios now.
> >>>>
> >>>> throtl_pending_timer_fn            throtl_upgrade_state
> >>>> ------------------------------------------------------------------------
> >>>>                                    upgrade to max
> >>>>                                    throtl_select_dispatch
> >>>>                                    throtl_schedule_next_dispatch
> >>>> throtl_select_dispatch
> >>>> throtl_schedule_next_dispatch
> >>>>
> >>>> Since throtl_select_dispatch will move queued bios from subtest1 to
> >>>> test1 in throtl_upgrade_state, it will then just do nothing in
> >>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
> >>>> any more if no proper timer scheduled.
> >>>
> >>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
> >>> throtl_upgrade_state already moves bios to parent), there is no pending
> >>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
> >>> anything? could you please describe the failure in details?
> >>>
> >>> Thanks,
> >>> Shaohua
> >>>
> >>
> >> In normal case, throtl_pending_timer_fn tries to move bios from
> >> subtest1 to test1, and finally do the real issueing work when reach
> >> the top-level.
> >> But int the case above, throtl_select_dispatch in
> >> throtl_pending_timer_fn returns 0, because the work is done by
> >> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
> >> nothing to do, but the queued bios are still in service queue of
> >> test1.
> > 
> > Still didn't get, sorry. If there are pending bios in test1, why
> > throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
> > timer?
> > 
> 
> throtl_schedule_next_dispatch doesn't setup timer because there is no
> pending children left, all the queued bios are moved to parent test1
> now. IMO, this is used in case that it cannot dispatch all queued bios
> in one round.
> And if the select dispatch is done by timer, it will then do propagate
> dispatch in parent till reach the top-level.
> But in the case above, it breaks this logic.
> Please point out if I am understanding wrong.

I read your reply again. So if the bios are moved to test1, why don't we
dispatch the bios of test1? throtl_upgrade_state does a post-order traversal,
so it handles subtest1 and then test1. Anything I missed? Please describe in
detail, thanks! Did you see a real stall or is this based on code analysis?

Thanks,
Shaohua
Joseph Qi Sept. 28, 2017, 3:48 a.m. UTC | #6
Hi Shaohua,

On 17/9/28 05:38, Shaohua Li wrote:
> On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
>>
>>
>> On 17/9/26 10:48, Shaohua Li wrote:
>>> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>>>> Hi Shaohua,
>>>>
>>>> On 17/9/26 01:22, Shaohua Li wrote:
>>>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>>>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
>>>>>>
>>>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>>>> lead to io stall in the following case.
>>>>>> Say the hierarchy is like:
>>>>>> /-test1
>>>>>>   |-subtest1
>>>>>> and subtest1 has 32 queued bios now.
>>>>>>
>>>>>> throtl_pending_timer_fn            throtl_upgrade_state
>>>>>> ------------------------------------------------------------------------
>>>>>>                                    upgrade to max
>>>>>>                                    throtl_select_dispatch
>>>>>>                                    throtl_schedule_next_dispatch
>>>>>> throtl_select_dispatch
>>>>>> throtl_schedule_next_dispatch
>>>>>>
>>>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>>>> any more if no proper timer scheduled.
>>>>>
>>>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>>>> anything? could you please describe the failure in details?
>>>>>
>>>>> Thanks,
>>>>> Shaohua
>>>>>
>>>>
>>>> In normal case, throtl_pending_timer_fn tries to move bios from
>>>> subtest1 to test1, and finally do the real issueing work when reach
>>>> the top-level.
>>>> But int the case above, throtl_select_dispatch in
>>>> throtl_pending_timer_fn returns 0, because the work is done by
>>>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>>>> nothing to do, but the queued bios are still in service queue of
>>>> test1.
>>>
>>> Still didn't get, sorry. If there are pending bios in test1, why
>>> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
>>> timer?
>>>
>>
>> throtl_schedule_next_dispatch doesn't setup timer because there is no
>> pending children left, all the queued bios are moved to parent test1
>> now. IMO, this is used in case that it cannot dispatch all queued bios
>> in one round.
>> And if the select dispatch is done by timer, it will then do propagate
>> dispatch in parent till reach the top-level.
>> But in the case above, it breaks this logic.
>> Please point out if I am understanding wrong.
> 
> I read your reply again. So if the bios are move to test1, why don't we
> dispatch bios of test1? throtl_upgrade_state does a post-order traversal, so it
> handles subtest1 and then test1. Anything I missed? Please describe in details,
> thanks! Did you see a real stall or is this based on code analysis?
> 
> Thanks,
> Shaohua
> 

Sorry for the unclear description and the misunderstanding it caused.
I backported your patches to my 3.10 kernel and tested them with libaio
and iodepth 32. Most of the time it worked well, but occasionally io
would stall, and blktrace showed the following:

252,0   26        0    19.884802028     0  m   N throtl upgrade to max
252,0   13        0    19.884820336     0  m   N throtl /test1 dispatch nr_queued=32 read=0 write=32

From my analysis, this was because the upgrade had moved the queued bios
from subtest1 to test1, but had not gone on to move them to the parent and
do the real issuing. Then the timer fn saw there were still 32 queued
bios, but since select dispatch returned 0, it didn't try any further. As
a result, the corresponding fio stalled.
I've looked at the code again and found that the behavior of
blkg_for_each_descendant_post changed between 3.10 and 4.12. In 3.10 it
doesn't include the root, while in 4.12 it does. That's why the above case
happens.
So upstream doesn't have this problem, sorry again for the noise.

Thanks,
Joseph
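
Editorial illustration: the traversal difference described in this message can be sketched with a small post-order walk. The `struct node` type and the `walk_descendants_post` function are hypothetical stand-ins, not the kernel's blkg_for_each_descendant_post, and the `include_root` flag only models the 3.10 versus 4.12 behaviour as reported above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup node; not the kernel's blkcg_gq. */
struct node {
	const char *name;
	struct node *children[4];
	size_t nr_children;
};

/*
 * Append the name of every node in the subtree rooted at n to out[],
 * children before parent. Returns the new number of entries.
 */
static size_t visit_post(const struct node *n, const char **out, size_t len)
{
	assert(n != NULL);
	for (size_t i = 0; i < n->nr_children; i++)
		len = visit_post(n->children[i], out, len);
	out[len++] = n->name;
	return len;
}

/*
 * Walk everything below root in post-order. With include_root == 0 the
 * root itself is skipped (the 3.10 behaviour as described in the thread);
 * otherwise it is visited last (the 4.12 behaviour).
 */
static size_t walk_descendants_post(const struct node *root, int include_root,
				    const char **out)
{
	size_t len = 0;

	assert(root != NULL);
	for (size_t i = 0; i < root->nr_children; i++)
		len = visit_post(root->children[i], out, len);
	if (include_root)
		out[len++] = root->name;
	return len;
}
```

For the hierarchy / -> test1 -> subtest1, the walk visits subtest1 and test1 when the root is excluded, and subtest1, test1 and / when it is included, so only the second variant ever reaches the root inside the loop.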
Joseph Qi Sept. 28, 2017, 11:19 a.m. UTC | #7
On 17/9/28 11:48, Joseph Qi wrote:
> Hi Shahua,
> 
> On 17/9/28 05:38, Shaohua Li wrote:
>> On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
>>>
>>>
>>> On 17/9/26 10:48, Shaohua Li wrote:
>>>> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
>>>>> Hi Shaohua,
>>>>>
>>>>> On 17/9/26 01:22, Shaohua Li wrote:
>>>>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
>>>>>>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
>>>>>>>
>>>>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
>>>>>>> lead to io stall in the following case.
>>>>>>> Say the hierarchy is like:
>>>>>>> /-test1
>>>>>>>   |-subtest1
>>>>>>> and subtest1 has 32 queued bios now.
>>>>>>>
>>>>>>> throtl_pending_timer_fn            throtl_upgrade_state
>>>>>>> ------------------------------------------------------------------------
>>>>>>>                                    upgrade to max
>>>>>>>                                    throtl_select_dispatch
>>>>>>>                                    throtl_schedule_next_dispatch
>>>>>>> throtl_select_dispatch
>>>>>>> throtl_schedule_next_dispatch
>>>>>>>
>>>>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
>>>>>>> test1 in throtl_upgrade_state, it will then just do nothing in
>>>>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
>>>>>>> any more if no proper timer scheduled.
>>>>>>
>>>>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
>>>>>> throtl_upgrade_state already moves bios to parent), there is no pending
>>>>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
>>>>>> anything? could you please describe the failure in details?
>>>>>>
>>>>>> Thanks,
>>>>>> Shaohua
>>>>>>
>>>>>
>>>>> In normal case, throtl_pending_timer_fn tries to move bios from
>>>>> subtest1 to test1, and finally do the real issueing work when reach
>>>>> the top-level.
>>>>> But int the case above, throtl_select_dispatch in
>>>>> throtl_pending_timer_fn returns 0, because the work is done by
>>>>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
>>>>> nothing to do, but the queued bios are still in service queue of
>>>>> test1.
>>>>
>>>> Still didn't get, sorry. If there are pending bios in test1, why
>>>> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
>>>> timer?
>>>>
>>>
>>> throtl_schedule_next_dispatch doesn't setup timer because there is no
>>> pending children left, all the queued bios are moved to parent test1
>>> now. IMO, this is used in case that it cannot dispatch all queued bios
>>> in one round.
>>> And if the select dispatch is done by timer, it will then do propagate
>>> dispatch in parent till reach the top-level.
>>> But in the case above, it breaks this logic.
>>> Please point out if I am understanding wrong.
>>
>> I read your reply again. So if the bios are move to test1, why don't we
>> dispatch bios of test1? throtl_upgrade_state does a post-order traversal, so it
>> handles subtest1 and then test1. Anything I missed? Please describe in details,
>> thanks! Did you see a real stall or is this based on code analysis?
>>
>> Thanks,
>> Shaohua
>>
> 
> Sorry for the unclear description and the misunderstanding brought in.
> I backported your patches to my kernel 3.10 and did the test. I tested
> with libaio and iodepth 32. Most time it worked well, but occasionally
> it would stall io, and the blktrace showed the following:
> 
> 252,0   26        0    19.884802028     0  m   N throtl upgrade to max
> 252,0   13        0    19.884820336     0  m   N throtl /test1 dispatch nr_queued=32 read=0 write=32
> 
> From my analysis, it was because upgrade had moved the queued bios from
> subtest1 to test1, but not continued to move them to parent and did the
> real issuing. Then timer fn saw there were still 32 queued bios, but
> since select dispatch returned 0, it wouldn't try more. As a result,
> the corresponding fio stalled.
> I've looked at the code again and found that the behavior of
> blkg_for_each_descendant_post changes between 3.10 and 4.12. In 3.10 it
> doesn't include root while in 4.12 it does. That's why the above case
> happens.
> So upstream don't have this problem, sorry again for the noise.
> 
> Thanks,
> Joseph
> 

Sorry, there is still a chance of an io stall. The case is as
follows:
/-test1
  |-subtest1
/-test2
  |-subtest2
And subtest1 and subtest2 each have 32 queued bios.

Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
bios as follows:
1) tg=subtest1, do nothing;
2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
left, no need to schedule next dispatch;
3) tg=subtest2, do nothing;
4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
left, no need to schedule next dispatch;
5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
test2 to /, 8 queued bios from test1 to /, 8 queued bios from test2 to
/; note that test1 and test2 each has 16 queued bios left;
6) tg=/, try to schedule next dispatch, but since disptime is now
(updated in tg_update_disptime with wait=0), the pending timer is not
actually scheduled;
7) In total, throtl_upgrade_state dispatches 32 queued bios and leaves
32; test1 and test2 each have 16 queued bios;
8) throtl_pending_timer_fn sees the leftover bios but can do nothing,
because throtl_select_dispatch returns 0 and test1/test2 have no
pending tg.

The blktrace shows the following:
8,32   0        0     2.539007641     0  m   N throtl upgrade to max
8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16

Thanks,
Joseph
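
Editorial illustration: the arithmetic in steps 5 and 7 can be checked with a small userspace model of one dispatch pass. This is only a sketch. The per-group limit of 8 and the per-pass limit of 32 are taken from the step list in this message and treated as assumptions, and `dispatch_round` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Limits as used in the step list above; assumptions, not kernel source. */
#define GRP_QUANTUM   8u   /* max bios taken from one child per turn */
#define TOTAL_QUANTUM 32u  /* max bios moved to the parent in one pass */

/*
 * Model one dispatch pass of a parent over its children: take up to
 * GRP_QUANTUM bios from each child in turn until TOTAL_QUANTUM bios have
 * been moved or every child is empty. queued[] is updated in place and
 * the number of bios moved to the parent is returned.
 */
static unsigned int dispatch_round(unsigned int *queued,
				   unsigned int nr_children)
{
	unsigned int moved = 0;
	int progress = 1;

	assert(queued != 0);
	while (moved < TOTAL_QUANTUM && progress) {
		progress = 0;
		for (unsigned int i = 0;
		     i < nr_children && moved < TOTAL_QUANTUM; i++) {
			unsigned int n = queued[i] < GRP_QUANTUM ?
					 queued[i] : GRP_QUANTUM;

			/* Never exceed what is left of the pass budget. */
			if (n > TOTAL_QUANTUM - moved)
				n = TOTAL_QUANTUM - moved;
			queued[i] -= n;
			moved += n;
			if (n)
				progress = 1;
		}
	}
	return moved;
}
```

Starting from two children with 32 queued bios each, one pass moves 32 bios and leaves 16 in each child, which matches steps 5 and 7. A single child with 32 queued bios is drained completely by the same pass, which is consistent with the one-branch hierarchy not stalling.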
Shaohua Li Sept. 28, 2017, 9:18 p.m. UTC | #8
On Thu, Sep 28, 2017 at 07:19:45PM +0800, Joseph Qi wrote:
> 
> 
> On 17/9/28 11:48, Joseph Qi wrote:
> > Hi Shahua,
> > 
> > On 17/9/28 05:38, Shaohua Li wrote:
> >> On Tue, Sep 26, 2017 at 11:16:05AM +0800, Joseph Qi wrote:
> >>>
> >>>
> >>> On 17/9/26 10:48, Shaohua Li wrote:
> >>>> On Tue, Sep 26, 2017 at 09:06:57AM +0800, Joseph Qi wrote:
> >>>>> Hi Shaohua,
> >>>>>
> >>>>> On 17/9/26 01:22, Shaohua Li wrote:
> >>>>>> On Mon, Sep 25, 2017 at 06:46:42PM +0800, Joseph Qi wrote:
> >>>>>>> From: Joseph Qi <qijiang.qj@alibaba-inc.com>
> >>>>>>>
> >>>>>>> Currently it will try to dispatch bio in throtl_upgrade_state. This may
> >>>>>>> lead to io stall in the following case.
> >>>>>>> Say the hierarchy is like:
> >>>>>>> /-test1
> >>>>>>>   |-subtest1
> >>>>>>> and subtest1 has 32 queued bios now.
> >>>>>>>
> >>>>>>> throtl_pending_timer_fn            throtl_upgrade_state
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>>                                    upgrade to max
> >>>>>>>                                    throtl_select_dispatch
> >>>>>>>                                    throtl_schedule_next_dispatch
> >>>>>>> throtl_select_dispatch
> >>>>>>> throtl_schedule_next_dispatch
> >>>>>>>
> >>>>>>> Since throtl_select_dispatch will move queued bios from subtest1 to
> >>>>>>> test1 in throtl_upgrade_state, it will then just do nothing in
> >>>>>>> throtl_pending_timer_fn. As a result, queued bios won't be dispatched
> >>>>>>> any more if no proper timer scheduled.
> >>>>>>
> >>>>>> Sorry, didn't get it. If throtl_pending_timer_fn does nothing (because
> >>>>>> throtl_upgrade_state already moves bios to parent), there is no pending
> >>>>>> blkcg/bio, not rearming the timer wouldn't lose anything. Am I missing
> >>>>>> anything? could you please describe the failure in details?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Shaohua
> >>>>>>
> >>>>>
> >>>>> In normal case, throtl_pending_timer_fn tries to move bios from
> >>>>> subtest1 to test1, and finally do the real issueing work when reach
> >>>>> the top-level.
> >>>>> But int the case above, throtl_select_dispatch in
> >>>>> throtl_pending_timer_fn returns 0, because the work is done by
> >>>>> throtl_upgrade_state. Then throtl_pending_timer_fn *thinks* there is
> >>>>> nothing to do, but the queued bios are still in service queue of
> >>>>> test1.
> >>>>
> >>>> Still didn't get, sorry. If there are pending bios in test1, why
> >>>> throtl_schedule_next_dispatch in throtl_pending_timer_fn doesn't setup the
> >>>> timer?
> >>>>
> >>>
> >>> throtl_schedule_next_dispatch doesn't setup timer because there is no
> >>> pending children left, all the queued bios are moved to parent test1
> >>> now. IMO, this is used in case that it cannot dispatch all queued bios
> >>> in one round.
> >>> And if the select dispatch is done by timer, it will then do propagate
> >>> dispatch in parent till reach the top-level.
> >>> But in the case above, it breaks this logic.
> >>> Please point out if I am understanding wrong.
> >>
> >> I read your reply again. So if the bios are move to test1, why don't we
> >> dispatch bios of test1? throtl_upgrade_state does a post-order traversal, so it
> >> handles subtest1 and then test1. Anything I missed? Please describe in details,
> >> thanks! Did you see a real stall or is this based on code analysis?
> >>
> >> Thanks,
> >> Shaohua
> >>
> > 
> > Sorry for the unclear description and the misunderstanding brought in.
> > I backported your patches to my kernel 3.10 and did the test. I tested
> > with libaio and iodepth 32. Most time it worked well, but occasionally
> > it would stall io, and the blktrace showed the following:
> > 
> > 252,0   26        0    19.884802028     0  m   N throtl upgrade to max
> > 252,0   13        0    19.884820336     0  m   N throtl /test1 dispatch nr_queued=32 read=0 write=32
> > 
> > From my analysis, it was because upgrade had moved the queued bios from
> > subtest1 to test1, but not continued to move them to parent and did the
> > real issuing. Then timer fn saw there were still 32 queued bios, but
> > since select dispatch returned 0, it wouldn't try more. As a result,
> > the corresponding fio stalled.
> > I've looked at the code again and found that the behavior of
> > blkg_for_each_descendant_post changes between 3.10 and 4.12. In 3.10 it
> > doesn't include root while in 4.12 it does. That's why the above case
> > happens.
> > So upstream don't have this problem, sorry again for the noise.
> > 
> > Thanks,
> > Joseph
> > 
> 
> Sorry, still has chance to lead to io stall. The case is described as
> follows:
> /-test1
>   |-subtest1
> /-test2
>   |-subtest2
> And subtest1 and subtest2 each has 32 queued bios.
> 
> Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
> bios as follows:
> 1) tg=subtest1, do nothing;
> 2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
> left, no need to schedule next dispatch;
> 3) tg=subtest2, do nothing;
> 4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
> left, no need to schedule next dispatch;
> 5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
> test2 to /, 8 queued bios from test1 to /, 8 queued bios from test2 to
> /; note that test1 and test2 each has 16 queued bios left;
> 6) tg=/, try to schedule next dispatch, but since disptime is now
> (update in tg_update_disptime, wait=0), pending timer is not scheduled
> in fact;
> 7) In throtl_upgrade_state it totally dispatches 32 queued bios and with
> 32 left. test1 and test2 each has 16 queued bios;
> 8) throtl_pending_timer_fn sees the left over bios, but could do
> nothing, because throtl_select_dispatch returns 0, and test1/test2 has
> no pending tg.
> 
> The blktrace shows the following:
> 8,32   0        0     2.539007641     0  m   N throtl upgrade to max
> 8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
> 8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16

Ok, I got it now. As long as we have a hierarchy of 3+ levels and the top-level
cgroup has more than 32 requests pending, we will run into this problem, right?
Wouldn't changing throtl_schedule_next_dispatch's parameter to true in
throtl_upgrade_state() be an easier solution? Please update the changelog and
resend the patch.

Thanks,
Shaohua
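
Editorial illustration: the effect of passing true as the second argument can be sketched with a simplified userspace model of the scheduling decision. The `model_sq` struct and both functions are hypothetical and only mirror the behaviour described in this thread: no timer when nothing is pending, and no timer when the dispatch time has already arrived unless the caller forces one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a throttle service queue. */
struct model_sq {
	unsigned int nr_pending;      /* children that still hold queued bios */
	unsigned long first_disptime; /* earliest dispatch time among them */
	bool timer_armed;             /* set when the model "arms" the timer */
};

/* True if time a is later than time b, tolerant of counter wraparound. */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Decide whether to arm the pending timer. Returns true when the caller
 * has nothing more to do, and false when the dispatch time has already
 * arrived and the caller is expected to keep dispatching by itself.
 */
static bool model_schedule_next_dispatch(struct model_sq *sq,
					 unsigned long now, bool force)
{
	assert(sq != 0);

	/* No pending children: nothing to schedule. */
	if (sq->nr_pending == 0)
		return true;

	/* Arm the timer if forced, or if the dispatch time is in the future. */
	if (force || model_time_after(sq->first_disptime, now)) {
		sq->timer_armed = true;
		return true;
	}

	return false;
}
```

In this model, a queue with pending children whose dispatch time equals the current time arms no timer when force is false and does arm it when force is true, which is the situation of step 6 in the previous message.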

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0fea76a..29d282f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1909,13 +1909,11 @@  static void throtl_upgrade_state(struct throtl_data *td)
 		struct throtl_grp *tg = blkg_to_tg(blkg);
 		struct throtl_service_queue *sq = &tg->service_queue;
 
-		tg->disptime = jiffies - 1;
-		throtl_select_dispatch(sq);
-		throtl_schedule_next_dispatch(sq, false);
+		tg->disptime = jiffies;
+		throtl_schedule_next_dispatch(sq, true);
 	}
 	rcu_read_unlock();
-	throtl_select_dispatch(&td->service_queue);
-	throtl_schedule_next_dispatch(&td->service_queue, false);
+	throtl_schedule_next_dispatch(&td->service_queue, true);
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }