
[2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch

Message ID 20240811101921.4031-3-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series Fix some starvation problems

Commit Message

Muchun Song Aug. 11, 2024, 10:19 a.m. UTC
Supposing the following scenario with a virtio_blk driver.

CPU0                                                                CPU1

blk_mq_try_issue_directly()
    __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
            virtio_queue_rq()
                blk_mq_stop_hw_queue()
                                                                    virtblk_done()
    blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
        /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
                                                                                clear_bit(BLK_MQ_S_STOPPED)                 3) store
    blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
        if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
            return                                                                      return
        blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
            if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
                return                                                                      return
            __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()

A full memory barrier should be inserted between 1) and 2), as well as between
3) and 4), to make sure that either CPU0 sees that BLK_MQ_S_STOPPED has been
cleared or CPU1 sees the dispatch list (or the set bit in the software queue
bitmap). Otherwise, neither CPU will re-run the hardware queue, causing
starvation.
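
A condensed sketch of the pairing this patch establishes (illustrative only,
not the literal code; the real hunks are in the diff at the end of this
message):

  CPU0 (issue path)                       CPU1 (completion path)

  /* 1) store: add rq to dispatch */      /* 3) store */
  list_add_tail(&rq->queuelist,           clear_bit(BLK_MQ_S_STOPPED,
                &hctx->dispatch);                   &hctx->state);
  smp_mb();               /* added */     smp_mb__after_atomic(); /* added */
  /* 2) load */                           /* 4) load */
  if (blk_mq_hctx_stopped(hctx))          if (!blk_mq_hctx_has_pending(hctx))
          return;                                 return;

With both barriers in place, at least one CPU is guaranteed to observe the
other's store and go on to run the hardware queue, so the request cannot be
left stranded on the dispatch list.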

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Ming Lei Aug. 19, 2024, 2:27 a.m. UTC | #1
Hi Muchun,

On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
> Supposing the following scenario with a virtio_blk driver.
> 
> CPU0                                                                CPU1
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                                                     virtblk_done()
>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>             return                                                                      return
>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>                 return                                                                      return
>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> 
> The full memory barrier should be inserted between 1) and 2), as well as between
> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
> will not re-run the hardware queue causing starvation.

Yeah, it is one kind of race that is triggered when adding a request to the
->dispatch list after returning STS_RESOURCE. We have been troubled by many
races of this kind.

Stopping the queue is used by very few drivers, and its only purpose should be
to throttle the hw queue when the low-level queue is busy. There seem to be
more users of blk_mq_stop_hw_queues(), but most of them should be replaced
with blk_mq_quiesce_queue().

IMO, fixing this kind of issue via memory barriers is too tricky to maintain
because the WRITE/READ dependency is very hard to follow. I'd suggest keeping
the memory barrier solution as a last resort; we can try to figure out an
easier & more reliable way first.

One idea I thought of is to call blk_mq_request_bypass_insert() (or rename &
export it) before calling blk_mq_stop_hw_queue() in the driver, then return a
new status code STS_STOP_DISPATCH to notify blk-mq to simply stop
dispatching.
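
A rough sketch of that idea in a driver's ->queue_rq() might look like the
following (BLK_STS_STOP_DISPATCH and the foo_* helpers are hypothetical and
only illustrate the proposal; blk_mq_request_bypass_insert() would first have
to be exported, as noted above):

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
        struct foo_queue *fq = hctx->driver_data;

        if (!foo_hw_has_room(fq)) {
                /*
                 * Park the request on hctx->dispatch before stopping the
                 * queue, so a concurrent completion that restarts the queue
                 * is guaranteed to find it.
                 */
                blk_mq_request_bypass_insert(bd->rq, 0);
                blk_mq_stop_hw_queue(hctx);
                return BLK_STS_STOP_DISPATCH;   /* tell blk-mq to stop dispatching */
        }

        foo_hw_submit(fq, bd->rq);
        return BLK_STS_OK;
}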


thanks,
Ming
Muchun Song Aug. 19, 2024, 3:49 a.m. UTC | #2
On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Muchun,
>
> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
> > Supposing the following scenario with a virtio_blk driver.
> >
> > CPU0                                                                CPU1
> >
> > blk_mq_try_issue_directly()
> >     __blk_mq_issue_directly()
> >         q->mq_ops->queue_rq()
> >             virtio_queue_rq()
> >                 blk_mq_stop_hw_queue()
> >                                                                     virtblk_done()
> >     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
> >         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
> >                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
> >     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
> >         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
> >             return                                                                      return
> >         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
> >             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
> >                 return                                                                      return
> >             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> >
> > The full memory barrier should be inserted between 1) and 2), as well as between
> > 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
> > sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
> > will not re-run the hardware queue causing starvation.
>
> Yeah, it is one kind of race which is triggered when adding request into
> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
> such kind of race.

Yes. I saw a similar fix for BLK_MQ_S_SCHED_RESTART.

>
> stopping queue is used in very less drivers, and its only purpose should
> be for throttling hw queue in case that low level queue is busy. There seems
> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
> with blk_mq_quiesce_queue().
>
> IMO, fixing this kind of issue via memory barrier is too tricky to
> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
> make memory barrier solution as the last resort, and we can try to figure
> out other easier & more reliable way first.

I do agree that the dependencies will be hard to maintain in the future, and
we should propose an easily maintainable solution. But I think this is a
long-standing issue affecting many stable Linux distros. Adding an mb is the
easy way to fix the problem (the code footprint is really small), so it will
be very easy for others to backport these bug fixes to different stable Linux
distros. Therefore, the mb could be an interim solution, and then we could
improve it based on the solution you've proposed below. What do you think?

Thanks,
Muchun.

>
> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
> & export it) before calling blk_mq_stop_hw_queue() in driver, then
> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
> dispatch simply.
>
>
> thanks,
> Ming
>
Yu Kuai Aug. 22, 2024, 3:54 a.m. UTC | #3
Hi,

On 2024/08/19 11:49, Muchun Song wrote:
> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> Hi Muchun,
>>
>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>> Supposing the following scenario with a virtio_blk driver.
>>>
>>> CPU0                                                                CPU1
>>>
>>> blk_mq_try_issue_directly()
>>>      __blk_mq_issue_directly()
>>>          q->mq_ops->queue_rq()
>>>              virtio_queue_rq()
>>>                  blk_mq_stop_hw_queue()
>>>                                                                      virtblk_done()
>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>              return                                                                      return
>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>                  return                                                                      return
>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>
>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>> will not re-run the hardware queue causing starvation.
>>
>> Yeah, it is one kind of race which is triggered when adding request into
>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>> such kind of race.
> 
> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
> 
>>
>> stopping queue is used in very less drivers, and its only purpose should
>> be for throttling hw queue in case that low level queue is busy. There seems
>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>> with blk_mq_quiesce_queue().
>>
>> IMO, fixing this kind of issue via memory barrier is too tricky to
>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>> make memory barrier solution as the last resort, and we can try to figure
>> out other easier & more reliable way first.
> 
> I do agree it is hard to maintain the dependencies in the future. We should
> propose an easy-maintainable solution. But I thought it is a long-term issue
> throughout different stable linux distros. Adding a mb is the easy way to fix
> the problem (the code footprint is really small), so it will be very
> easy for others
> to backport those bug fixes to different stable linux distros. Therefore, mb
> should be an interim solution. Then, we could improve it based on the solution
> you've proposed below. What do you think?

I agree with Ming; let's figure out a better fix first. Ease of backporting to
stable kernels is not the first consideration.
> 
> Thanks,
> Muchun.
> 
>>
>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>> dispatch simply.

The new status code looks good to me; however, I wonder whether we can just
remove the problematic blk_mq_stop_hw_queue() and replace it by handling the
new status from the block layer:

- Pass the new status to blk_mq_run_dispatch_ops and quiesce with the new
status; if no request is in flight, unquiesce immediately;
- unquiesce if any IO is done afterwards;

Thanks,
Kuai

>>
>>
>> thanks,
>> Ming
>>
> 
> .
>
Muchun Song Aug. 26, 2024, 8:35 a.m. UTC | #4
> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> On 2024/08/19 11:49, Muchun Song wrote:
>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>> 
>>> Hi Muchun,
>>> 
>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>> Supposing the following scenario with a virtio_blk driver.
>>>> 
>>>> CPU0                                                                CPU1
>>>> 
>>>> blk_mq_try_issue_directly()
>>>>     __blk_mq_issue_directly()
>>>>         q->mq_ops->queue_rq()
>>>>             virtio_queue_rq()
>>>>                 blk_mq_stop_hw_queue()
>>>>                                                                     virtblk_done()
>>>>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>             return                                                                      return
>>>>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>                 return                                                                      return
>>>>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>> 
>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>> will not re-run the hardware queue causing starvation.
>>> 
>>> Yeah, it is one kind of race which is triggered when adding request into
>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>> such kind of race.
>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>> 
>>> stopping queue is used in very less drivers, and its only purpose should
>>> be for throttling hw queue in case that low level queue is busy. There seems
>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>> with blk_mq_quiesce_queue().
>>> 
>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>> make memory barrier solution as the last resort, and we can try to figure
>>> out other easier & more reliable way first.
>> I do agree it is hard to maintain the dependencies in the future. We should
>> propose an easy-maintainable solution. But I thought it is a long-term issue
>> throughout different stable linux distros. Adding a mb is the easy way to fix
>> the problem (the code footprint is really small), so it will be very
>> easy for others
>> to backport those bug fixes to different stable linux distros. Therefore, mb
>> should be an interim solution. Then, we could improve it based on the solution
>> you've proposed below. What do you think?
> 
> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.

Hi Kuai,

All right. I usually focus on MM; it seems there is a gap between MM and
block. Anyway, let's figure out whether there is a good solution.

>> Thanks,
>> Muchun.
>>> 
>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>> dispatch simply.
> 
> New status code look good to me, however, I wonder can we just remove
> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
> new status from block layer?
> 
> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with

I didn't fully understand your suggestion, so let me ask some questions.
blk_mq_stop_hw_queue() is usually called in the blk_mq_ops->queue_rq path, and
in that case it is easy to pass the new status to blk_mq_run_dispatch_ops.
Should we remove blk_mq_stop_hw_queues() as well? How would we pass the new
status to blk_mq_run_dispatch_ops in that case?

> the new status, if no request is inflight, unquiesce immediately;

Actually, I didn't understand how this avoids the above race. Could you
elaborate on the scenario?

Muchun,
Thanks.

> - unquiesce is any IO is done afterwards;
Yu Kuai Aug. 26, 2024, 8:53 a.m. UTC | #5
Hi,

On 2024/08/26 16:35, Muchun Song wrote:
> 
> 
>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2024/08/19 11:49, Muchun Song wrote:
>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>
>>>> Hi Muchun,
>>>>
>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>
>>>>> CPU0                                                                CPU1
>>>>>
>>>>> blk_mq_try_issue_directly()
>>>>>      __blk_mq_issue_directly()
>>>>>          q->mq_ops->queue_rq()
>>>>>              virtio_queue_rq()
>>>>>                  blk_mq_stop_hw_queue()
>>>>>                                                                      virtblk_done()
>>>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>              return                                                                      return
>>>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>                  return                                                                      return
>>>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>
>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>> will not re-run the hardware queue causing starvation.
>>>>
>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>> such kind of race.
>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>
>>>> stopping queue is used in very less drivers, and its only purpose should
>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>> with blk_mq_quiesce_queue().
>>>>
>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>> make memory barrier solution as the last resort, and we can try to figure
>>>> out other easier & more reliable way first.
>>> I do agree it is hard to maintain the dependencies in the future. We should
>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>> the problem (the code footprint is really small), so it will be very
>>> easy for others
>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>> should be an interim solution. Then, we could improve it based on the solution
>>> you've proposed below. What do you think?
>>
>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
> 
> Hi Kuai,
> 
> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
> Anyway, let's figure out if there is any good solution.
> 
>>> Thanks,
>>> Muchun.
>>>>
>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>> dispatch simply.
>>
>> New status code look good to me, however, I wonder can we just remove
>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>> new status from block layer?
>>
>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
> 
> I didn't fully understand your suggestion. Let me ask some questions.
> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
> status to blk_mq_run_dispatch_ops in this case?

For queue_rq from the dispatch path, it can be removed. However, it is called
from the remove path as well; I haven't checked yet whether it can be removed
there, that's another story.

And just add a return value for dispatch_ops to pass the status back.
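
A sketch of what a status-returning dispatch_ops wrapper might look like
(hypothetical shape only; the mainline __blk_mq_run_dispatch_ops() macro is a
do/while statement that returns nothing):

#define __blk_mq_run_dispatch_ops_status(q, check_sleep, dispatch_ops)   \
({                                                                        \
        blk_status_t __ret;                                               \
                                                                          \
        if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {                    \
                int srcu_idx;                                             \
                                                                          \
                might_sleep_if(check_sleep);                              \
                srcu_idx = srcu_read_lock((q)->tag_set->srcu);            \
                __ret = (dispatch_ops);                                   \
                srcu_read_unlock((q)->tag_set->srcu, srcu_idx);           \
        } else {                                                          \
                rcu_read_lock();                                          \
                __ret = (dispatch_ops);                                   \
                rcu_read_unlock();                                        \
        }                                                                 \
        __ret;                                                            \
})

The dispatch path could then check the returned status and fall into the
quiesce handling instead of stopping the hw queue.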

Thanks,
Kuai

> 
>> the new status, if no request is inflight, unquiesce immediately;
> 
> Actually, I didn't understand how to avoid the above race. May you elaborate
> the scenario?
> 
> Muhcun,
> Thanks.
> 
>> - unquiesce is any IO is done afterwards;
> 
> 
> 
> 
> .
>
Muchun Song Aug. 27, 2024, 7:31 a.m. UTC | #6
> On Aug 26, 2024, at 16:53, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> On 2024/08/26 16:35, Muchun Song wrote:
>>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>> 
>>> Hi,
>>> 
>>> On 2024/08/19 11:49, Muchun Song wrote:
>>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>> 
>>>>> Hi Muchun,
>>>>> 
>>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>> 
>>>>>> CPU0                                                                CPU1
>>>>>> 
>>>>>> blk_mq_try_issue_directly()
>>>>>>     __blk_mq_issue_directly()
>>>>>>         q->mq_ops->queue_rq()
>>>>>>             virtio_queue_rq()
>>>>>>                 blk_mq_stop_hw_queue()
>>>>>>                                                                     virtblk_done()
>>>>>>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>>             return                                                                      return
>>>>>>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>>                 return                                                                      return
>>>>>>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>> 
>>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>>> will not re-run the hardware queue causing starvation.
>>>>> 
>>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>>> such kind of race.
>>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>> 
>>>>> stopping queue is used in very less drivers, and its only purpose should
>>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>>> with blk_mq_quiesce_queue().
>>>>> 
>>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>>> make memory barrier solution as the last resort, and we can try to figure
>>>>> out other easier & more reliable way first.
>>>> I do agree it is hard to maintain the dependencies in the future. We should
>>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>>> the problem (the code footprint is really small), so it will be very
>>>> easy for others
>>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>>> should be an interim solution. Then, we could improve it based on the solution
>>>> you've proposed below. What do you think?
>>> 
>>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
>> Hi Kuai,
>> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
>> Anyway, let's figure out if there is any good solution.
>>>> Thanks,
>>>> Muchun.
>>>>> 
>>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>>> dispatch simply.
>>> 
>>> New status code look good to me, however, I wonder can we just remove
>>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>>> new status from block layer?
>>> 
>>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
>> I didn't fully understand your suggestion. Let me ask some questions.
>> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
>> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
>> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
>> status to blk_mq_run_dispatch_ops in this case?
> 
> For queue_rq from dispatch path, it can be removed. However, it is
> called from remove path as well, I don't check yet if it can be removed
> there, that's another story.

The reason I asked this question is that blk_mq_stop_hw_queues() also needs
to be fixed. See patch 3 of this series.

> 
> And just add a return value for dispatch_ops to pass status.
> 
> Thanks,
> Kuai
> 
>>> the new status, if no request is inflight, unquiesce immediately;
>> Actually, I didn't understand how to avoid the above race. May you elaborate
>> the scenario?

Sorry for repeating myself, but I didn't get your point here. Could you
elaborate on your suggestion? Thanks very much.

>> Muhcun,
>> Thanks.
>>> - unquiesce is any IO is done afterwards;
>> .
Yu Kuai Aug. 29, 2024, 7:57 a.m. UTC | #7
Hi,

On 2024/08/27 15:31, Muchun Song wrote:
> 
> 
>> On Aug 26, 2024, at 16:53, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2024/08/26 16:35, Muchun Song wrote:
>>>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2024/08/19 11:49, Muchun Song wrote:
>>>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>
>>>>>> Hi Muchun,
>>>>>>
>>>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>>>
>>>>>>> CPU0                                                                CPU1
>>>>>>>
>>>>>>> blk_mq_try_issue_directly()
>>>>>>>      __blk_mq_issue_directly()
>>>>>>>          q->mq_ops->queue_rq()
>>>>>>>              virtio_queue_rq()
>>>>>>>                  blk_mq_stop_hw_queue()
>>>>>>>                                                                      virtblk_done()
>>>>>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>>>              return                                                                      return
>>>>>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>>>                  return                                                                      return
>>>>>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>>>
>>>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>>>> will not re-run the hardware queue causing starvation.
>>>>>>
>>>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>>>> such kind of race.
>>>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>>>
>>>>>> stopping queue is used in very less drivers, and its only purpose should
>>>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>>>> with blk_mq_quiesce_queue().
>>>>>>
>>>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>>>> make memory barrier solution as the last resort, and we can try to figure
>>>>>> out other easier & more reliable way first.
>>>>> I do agree it is hard to maintain the dependencies in the future. We should
>>>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>>>> the problem (the code footprint is really small), so it will be very
>>>>> easy for others
>>>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>>>> should be an interim solution. Then, we could improve it based on the solution
>>>>> you've proposed below. What do you think?
>>>>
>>>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
>>> Hi Kuai,
>>> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
>>> Anyway, let's figure out if there is any good solution.
>>>>> Thanks,
>>>>> Muchun.
>>>>>>
>>>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>>>> dispatch simply.
>>>>
>>>> New status code look good to me, however, I wonder can we just remove
>>>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>>>> new status from block layer?
>>>>
>>>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
>>> I didn't fully understand your suggestion. Let me ask some questions.
>>> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
>>> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
>>> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
>>> status to blk_mq_run_dispatch_ops in this case?
>>
>> For queue_rq from dispatch path, it can be removed. However, it is
>> called from remove path as well, I don't check yet if it can be removed
>> there, that's another story.
> 
> The reason why I asked this question is that blk_mq_stop_hw_queues() also needs
> to be fixed. See my patch 3.

I just reviewed that patch; please check the following.
> 
>>
>> And just add a return value for dispatch_ops to pass status.
>>
>> Thanks,
>> Kuai
>>
>>>> the new status, if no request is inflight, unquiesce immediately;
>>> Actually, I didn't understand how to avoid the above race. May you elaborate
>>> the scenario?
> 
> Sorry for repeating, I didn't get your point here. May you elaborate
> your suggestion? Thanks very much.

// dispatch path, replace stop queue with quiesce queue;
if (__blk_mq_run_dispatch_ops() != STS_STOP_DISPATCH)
        return;

blk_mq_quiesce_queue();
/* another context already stopped dispatch */
if (test_and_set_bit(QUEUE_FLAG_STOP_DISPATCH, &q->queue_flags)) {
        blk_mq_unquiesce_queue();
        return;
}

/*
 * All IO may already be done before dispatch is stopped, in which case no
 * completion will come along to unquiesce the queue, so unquiesce it here.
 */
if (!blk_mq_inflight() &&
    test_and_clear_bit(QUEUE_FLAG_STOP_DISPATCH, &q->queue_flags))
        blk_mq_unquiesce_queue();

// complete path
if (test_and_clear_bit(QUEUE_FLAG_STOP_DISPATCH, &q->queue_flags))
        blk_mq_unquiesce_queue();

Thanks,
Kuai

> 
>>> Muhcun,
>>> Thanks.
>>>> - unquiesce is any IO is done afterwards;
>>> .
> 
> 
> .
>

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2d0f22de0c7f..6f18993b8f454 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2075,6 +2075,13 @@  bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		 * in blk_mq_sched_restart(). Avoid restart code path to
 		 * miss the new added requests to hctx->dispatch, meantime
 		 * SCHED_RESTART is observed here.
+		 *
+		 * This barrier is also used to order adding of dispatch list
+		 * above and the test of BLK_MQ_S_STOPPED in the following
+		 * routine (in blk_mq_delay_run_hw_queue()). Pairs with the
+		 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code
+		 * could either see BLK_MQ_S_STOPPED is cleared or dispatch list
+		 * to avoid missing dispatching requests.
 		 */
 		smp_mb();
 
@@ -2237,6 +2244,17 @@  void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (!need_run)
 		return;
 
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 */
+	smp_mb();
+
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);
 		return;
@@ -2392,6 +2410,13 @@  void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		return;
 
 	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	/*
+	 * Pairs with the smp_mb() in blk_mq_run_hw_queue() or
+	 * blk_mq_dispatch_rq_list() to order the clearing of
+	 * BLK_MQ_S_STOPPED and the test of dispatch list or
+	 * bitmap of any software queue.
+	 */
+	smp_mb__after_atomic();
 	blk_mq_run_hw_queue(hctx, async);
 }
 EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);