
queue stall with blk-mq-sched

Message ID 6035003f-029c-6cff-c35f-4e90496cab50@suse.de (mailing list archive)
State New, archived

Commit Message

Hannes Reinecke Jan. 25, 2017, 11:10 a.m. UTC
On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>>> Try this patch. We only want to bump it for the driver tags, not the
>>>> scheduler side.
>>>
>>> More complete version, this one actually tested. I think this should fix
>>> your issue, let me know.
>>>
>> Nearly there.
>> The initial stall is gone, but the test got hung at the 'stonewall'
>> sequence again:
>>
>> [global]
>> bs=4k
>> ioengine=libaio
>> iodepth=256
>> size=4g
>> direct=1
>> runtime=60
>> # directory=/mnt
>> numjobs=32
>> group_reporting
>> cpus_allowed_policy=split
>> filename=/dev/md127
>>
>> [seq-read]
>> rw=read
>> -> stonewall
>>
>> [rand-read]
>> rw=randread
>> stonewall
>>
>> Restarting all queues made the fio job continue.
>> There were 4 queues with state 'restart', and one queue with state 'active'.
>> So we're missing a queue run somewhere.
>>
> I've found the queue stalls are gone with this patch:
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 6b465bc..de5db6c 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>  }
> 
>  static inline void
> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> +{
> +       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> +               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +               blk_mq_run_hw_queue(hctx, true);
> +       }
> +}
> +
> +static inline void
>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>         BUG_ON(rq->internal_tag == -1);
> 
>         blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
> -
> -       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> -               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> -               blk_mq_run_hw_queue(hctx, true);
> -       }
>  }
> 
>  static inline void blk_mq_sched_started_request(struct request *rq)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..63799ad 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>                 blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>         if (sched_tag != -1)
>                 blk_mq_sched_completed_request(hctx, rq);
> +       blk_mq_sched_restart(hctx);
>         blk_queue_exit(q);
>  }
> 
Bah.

Not quite. I'm still seeing some queues with state 'restart'.

I've found that I need another patch on top of that:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..edcbb44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)

                queue_for_each_hw_ctx(q, hctx, i) {
                        /* the hctx may be unmapped, so check it here */
-                       if (blk_mq_hw_queue_mapped(hctx))
+                       if (blk_mq_hw_queue_mapped(hctx)) {
                                blk_mq_tag_idle(hctx);
+                               blk_mq_sched_restart(hctx);
+                       }
                }
        }
        blk_queue_exit(q);

The reasoning is that in blk_mq_get_tag() we might end up scheduling the
request on another hctx, but the original hctx might still have the
SCHED_RESTART bit set. That bit will never be cleared, as we complete the
request on a different hctx, so anything we do on the end_request side
won't do us any good.
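
For illustration, here is a paraphrased sketch of the shape of the tag-wait
loop (not a verbatim quote of blk_mq_get_tag(); details differ), showing
where the hctx can change under the sleeper:

    /* Paraphrased fragment, not verbatim kernel code. */
    do {
            tag = __blk_mq_get_tag(data, bt);   /* try for a free tag */
            if (tag != -1)
                    break;

            /* No tag free: sleep until a completion releases one. */
            io_schedule();

            /*
             * The task may have been migrated while asleep, so ctx and
             * hctx are re-resolved from the current CPU.  If the old
             * hctx was marked BLK_MQ_S_SCHED_RESTART in the meantime,
             * the mark is stranded: this request will dispatch and
             * complete on the new hctx.
             */
            data->ctx = blk_mq_get_ctx(data->q);
            data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
    } while (1);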

Cheers,

Hannes

Comments

Jens Axboe Jan. 25, 2017, 3:52 p.m. UTC | #1
On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
> On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
>> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>>>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>>>> Try this patch. We only want to bump it for the driver tags, not the
>>>>> scheduler side.
>>>>
>>>> More complete version, this one actually tested. I think this should fix
>>>> your issue, let me know.
>>>>
>>> Nearly there.
>>> The initial stall is gone, but the test got hung at the 'stonewall'
>>> sequence again:
>>>
>>> [global]
>>> bs=4k
>>> ioengine=libaio
>>> iodepth=256
>>> size=4g
>>> direct=1
>>> runtime=60
>>> # directory=/mnt
>>> numjobs=32
>>> group_reporting
>>> cpus_allowed_policy=split
>>> filename=/dev/md127
>>>
>>> [seq-read]
>>> rw=read
>>> -> stonewall
>>>
>>> [rand-read]
>>> rw=randread
>>> stonewall
>>>
>>> Restarting all queues made the fio job continue.
>>> There were 4 queues with state 'restart', and one queue with state 'active'.
>>> So we're missing a queue run somewhere.
>>>
>> I've found the queue stalls are gone with this patch:
>>
>> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
>> index 6b465bc..de5db6c 100644
>> --- a/block/blk-mq-sched.h
>> +++ b/block/blk-mq-sched.h
>> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>>  }
>>
>>  static inline void
>> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>> +{
>> +       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> +               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> +               blk_mq_run_hw_queue(hctx, true);
>> +       }
>> +}
>> +
>> +static inline void
>>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
>>  {
>>         struct elevator_queue *e = hctx->queue->elevator;
>> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
>>         BUG_ON(rq->internal_tag == -1);
>>
>>         blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
>> -
>> -       if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> -               clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> -               blk_mq_run_hw_queue(hctx, true);
>> -       }
>>  }
>>
>>  static inline void blk_mq_sched_started_request(struct request *rq)
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e872555..63799ad 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>>                 blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>>         if (sched_tag != -1)
>>                 blk_mq_sched_completed_request(hctx, rq);
>> +       blk_mq_sched_restart(hctx);
>>         blk_queue_exit(q);
>>  }
>>
> Bah.
> 
> Not quite. I'm still seeing some queues with state 'restart'.
> 
> I've found that I need another patch on top of that:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..edcbb44 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> 
>                 queue_for_each_hw_ctx(q, hctx, i) {
>                         /* the hctx may be unmapped, so check it here */
> -                       if (blk_mq_hw_queue_mapped(hctx))
> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>                                 blk_mq_tag_idle(hctx);
> +                               blk_mq_sched_restart(hctx);
> +                       }
>                 }
>         }
>         blk_queue_exit(q);
> 
> 
> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
> request on another hctx, but the original hctx might still have the
> SCHED_RESTART bit set. That bit will never be cleared, as we complete the
> request on a different hctx, so anything we do on the end_request side
> won't do us any good.

I think you are right, it'll potentially trigger with shared tags and
multiple hardware queues. I'll debug this today and come up with a
decent fix.

I committed the previous patch, fwiw.

Hannes Reinecke Jan. 25, 2017, 4:57 p.m. UTC | #2
On 01/25/2017 04:52 PM, Jens Axboe wrote:
> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
[ .. ]
>> Bah.
>>
>> Not quite. I'm still seeing some queues with state 'restart'.
>>
>> I've found that I need another patch on top of that:
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e872555..edcbb44 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>
>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>                         /* the hctx may be unmapped, so check it here */
>> -                       if (blk_mq_hw_queue_mapped(hctx))
>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>                                 blk_mq_tag_idle(hctx);
>> +                               blk_mq_sched_restart(hctx);
>> +                       }
>>                 }
>>         }
>>         blk_queue_exit(q);
>>
>>
>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>> request on another hctx, but the original hctx might still have the
>> SCHED_RESTART bit set. That bit will never be cleared, as we complete the
>> request on a different hctx, so anything we do on the end_request side
>> won't do us any good.
>
> I think you are right, it'll potentially trigger with shared tags and
> multiple hardware queues. I'll debug this today and come up with a
> decent fix.
>
> I committed the previous patch, fwiw.
>
THX.

The above patch _does_ help in the sense that my test case now completes
without stalls. And I even get decent performance with the mq-sched
fixes: 82k IOPS sequential read with mq-deadline, compared to 44k IOPS
when running without I/O scheduling.
Still some way off from the 132k IOPS I'm getting with CFQ, but we're
getting there.

However, I do get a noticeable stall during the stonewall sequence
before the timeout handler kicks in, so there must be a better way of
handling this.
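
One possible improvement (a sketch only; blk_mq_sched_restart_all is a
hypothetical helper, not a patch from this thread) would be to restart all
marked hardware queues from the completion path rather than waiting for
blk_mq_timeout_work() to notice:

    /*
     * Hypothetical helper: rerun every hctx of a queue that is marked
     * for restart, so a completion on one hctx can unwedge another.
     */
    static void blk_mq_sched_restart_all(struct request_queue *q)
    {
            struct blk_mq_hw_ctx *hctx;
            int i;

            queue_for_each_hw_ctx(q, hctx, i) {
                    if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART,
                                           &hctx->state))
                            blk_mq_run_hw_queue(hctx, true);
            }
    }

The cost is that every completion would then scan all hardware queues;
tracking the stalled state on the shared tags, as discussed below, avoids
paying that on every completion.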

But nevertheless, thanks for all your work here.
Very much appreciated.

Cheers,

Hannes

Jens Axboe Jan. 25, 2017, 5:03 p.m. UTC | #3
On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
> [ .. ]
>>> Bah.
>>>
>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>
>>> I've found that I need another patch on top of that:
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index e872555..edcbb44 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>
>>>                 queue_for_each_hw_ctx(q, hctx, i) {
>>>                         /* the hctx may be unmapped, so check it here */
>>> -                       if (blk_mq_hw_queue_mapped(hctx))
>>> +                       if (blk_mq_hw_queue_mapped(hctx)) {
>>>                                 blk_mq_tag_idle(hctx);
>>> +                               blk_mq_sched_restart(hctx);
>>> +                       }
>>>                 }
>>>         }
>>>         blk_queue_exit(q);
>>>
>>>
>>> The reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>> request on another hctx, but the original hctx might still have the
>>> SCHED_RESTART bit set. That bit will never be cleared, as we complete the
>>> request on a different hctx, so anything we do on the end_request side
>>> won't do us any good.
>>
>> I think you are right, it'll potentially trigger with shared tags and
>> multiple hardware queues. I'll debug this today and come up with a
>> decent fix.
>>
>> I committed the previous patch, fwiw.
>>
> THX.
> 
> The above patch _does_ help in the sense that my test case now completes
> without stalls. And I even get decent performance with the mq-sched
> fixes: 82k IOPS sequential read with mq-deadline, compared to 44k IOPS
> when running without I/O scheduling.
> Still some way off from the 132k IOPS I'm getting with CFQ, but we're
> getting there.
> 
> However, I do get a noticeable stall during the stonewall sequence
> before the timeout handler kicks in, so there must be a better way of
> handling this.
> 
> But nevertheless, thanks for all your work here.
> Very much appreciated.

Yeah, the fix isn't really a fix, unless you are willing to tolerate
potentially tens of seconds of extra latency until we idle it out :-)

So we can't use the un-idling for this, but we can track it on the
shared state, which is the tags. The problem isn't that we are
switching to a new hardware queue, it's if we mark the hardware queue
as restart AND it has nothing pending. In that case, we'll never
get it restarted, since IO completion is what restarts it.

I need to handle that case separately. I'm currently testing a patch and
should have something for you to test later today.
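
A rough sketch of what tracking the restart state on the shared tags might
look like (the pending_restarts counter and the helper names below are
assumptions for illustration, not necessarily the actual fix):

    /*
     * Sketch only: 'pending_restarts' on struct blk_mq_tags is an
     * assumed addition, not an existing field.
     */
    static void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
    {
            if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state) &&
                (hctx->flags & BLK_MQ_F_TAG_SHARED))
                    atomic_inc(&hctx->tags->pending_restarts);
    }

    static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
    {
            if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
                    if (hctx->flags & BLK_MQ_F_TAG_SHARED)
                            atomic_dec(&hctx->tags->pending_restarts);
                    blk_mq_run_hw_queue(hctx, true);
            }
    }

    /*
     * On completion: restart this hctx and, if the tags are shared and
     * some queue is stalled, every marked sibling.  (Truly shared tag
     * sets span request queues; the walk is limited to one queue here
     * for brevity.)
     */
    static void blk_mq_sched_restart_shared(struct blk_mq_hw_ctx *hctx)
    {
            struct blk_mq_hw_ctx *tmp;
            int i;

            blk_mq_sched_restart_hctx(hctx);

            if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
                    return;
            if (!atomic_read(&hctx->tags->pending_restarts))
                    return;

            queue_for_each_hw_ctx(hctx->queue, tmp, i)
                    blk_mq_sched_restart_hctx(tmp);
    }

This keeps the common case cheap: completions only pay for the full walk
when the tags are shared and at least one hctx is actually stalled.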

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..edcbb44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)

                queue_for_each_hw_ctx(q, hctx, i) {
                        /* the hctx may be unmapped, so check it here */
-                       if (blk_mq_hw_queue_mapped(hctx))
+                       if (blk_mq_hw_queue_mapped(hctx)) {
                                blk_mq_tag_idle(hctx);
+                               blk_mq_sched_restart(hctx);
+                       }
                }
        }