Message ID | 20240811101921.4031-5-songmuchun@bytedance.com (mailing list archive)
---|---
State | New, archived
Series | Fix some starvation problems
On Sun, Aug 11, 2024 at 06:19:21PM +0800, Muchun Song wrote:
> Supposing the following scenario.
>
> CPU0                                                     CPU1
>
> blk_mq_request_issue_directly()                          blk_mq_unquiesce_queue()
>   if (blk_queue_quiesced())                                blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)  3) store
>     blk_mq_insert_request()                                blk_mq_run_hw_queues()
>     /*                                                       blk_mq_run_hw_queue()
>      * Add request to dispatch list or set bitmap of            if (!blk_mq_hctx_has_pending())       4) load
>      * software queue.                  1) store                  return
>      */
>     blk_mq_run_hw_queue()
>       if (blk_queue_quiesced())         2) load
>         return
>       blk_mq_sched_dispatch_requests()
>
> A full memory barrier should be inserted between 1) and 2), as well as
> between 3) and 4), to make sure that either CPU0 sees that QUEUE_FLAG_QUIESCED
> has been cleared or CPU1 sees the dispatch list or the setting of the
> software queue's bitmap. Otherwise, either CPU will fail to re-run the
> hardware queue, causing starvation.

A memory barrier shouldn't serve as a bug fix for two slow code paths.

One simple fix is to add a blk_queue_quiesced_lock() helper and call the
following check on CPU0:

	if (blk_queue_quiesced_lock())
		blk_mq_run_hw_queue();

thanks,
Ming
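[Editor's note] blk_queue_quiesced_lock() does not exist in the tree; the name
comes from the suggestion above, and the body below is only a sketch of what
such a helper could look like, assuming its purpose is to re-read
QUEUE_FLAG_QUIESCED under q->queue_lock so that the check is serialized against
blk_mq_unquiesce_queue(), which manipulates the flag under the same lock. The
call-site wiring would be as in the snippet above; everything else here is an
illustration, not existing kernel code.

	/*
	 * Hypothetical helper sketched from the suggestion above; not code
	 * that exists in block/blk-mq.c.  Reading QUEUE_FLAG_QUIESCED while
	 * holding q->queue_lock orders the check against
	 * blk_mq_quiesce_queue() / blk_mq_unquiesce_queue(), which set and
	 * clear the flag under the same lock.
	 */
	static bool blk_queue_quiesced_lock(struct request_queue *q)
	{
		unsigned long flags;
		bool quiesced;

		spin_lock_irqsave(&q->queue_lock, flags);
		quiesced = blk_queue_quiesced(q);
		spin_unlock_irqrestore(&q->queue_lock, flags);

		return quiesced;
	}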
On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Aug 11, 2024 at 06:19:21PM +0800, Muchun Song wrote:
> > Supposing the following scenario.
> >
> > [... two-CPU diagram and barrier explanation from the patch description, quoted in full ...]
>
> A memory barrier shouldn't serve as a bug fix for two slow code paths.
>
> One simple fix is to add a blk_queue_quiesced_lock() helper and call the
> following check on CPU0:
>
> 	if (blk_queue_quiesced_lock())
> 		blk_mq_run_hw_queue();

This only fixes blk_mq_request_issue_directly(). I think anywhere that
matches this pattern (inserting a request into the dispatch list and then
running the hardware queue) should be fixed, and there are many places
which match this pattern (e.g. blk_mq_submit_bio()). The above graph
should be adjusted to the following.

CPU0                                        CPU1

blk_mq_insert_request() 1) store            blk_mq_unquiesce_queue()
blk_mq_run_hw_queue()
blk_queue_flag_clear(QUEUE_FLAG_QUIESCED) 3) store
  if (blk_queue_quiesced()) 2) load         blk_mq_run_hw_queues()
    return                                  blk_mq_run_hw_queue()
  blk_mq_sched_dispatch_requests()          if
(!blk_mq_hctx_has_pending()) 4) load
                                            return

So I think fixing blk_mq_run_hw_queue() could cover all of these
situations. Maybe I am thinking about this wrongly; please correct me.

Muchun,
Thanks.
> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> [... original scenario and the blk_queue_quiesced_lock() suggestion, quoted in full ...]
>
> This only fixes blk_mq_request_issue_directly(). I think anywhere that
> matches this pattern (inserting a request into the dispatch list and then
> running the hardware queue) should be fixed, and there are many places
> which match this pattern (e.g. blk_mq_submit_bio()). The above graph
> should be adjusted to the following.
>
> [... graph mangled by the mail client ...]

Sorry. There is something wrong with my email client. Resend the graph.

CPU0                                       CPU1

blk_mq_insert_request()     1) store       blk_mq_unquiesce_queue()
blk_mq_run_hw_queue()                        blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)  3) store
  if (blk_queue_quiesced()) 2) load          blk_mq_run_hw_queues()
    return                                     blk_mq_run_hw_queue()
  blk_mq_sched_dispatch_requests()               if (!blk_mq_hctx_has_pending())        4) load
                                                   return

> So I think fixing blk_mq_run_hw_queue() could cover all of these
> situations. Maybe I am thinking about this wrongly; please correct me.
>
> Muchun,
> Thanks.
On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
>
>> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
>>
>> [... earlier discussion quoted in full ...]
>
> Sorry. There is something wrong with my email client. Resend the graph.
>
> CPU0                                       CPU1
>
> blk_mq_insert_request()     1) store       blk_mq_unquiesce_queue()
> blk_mq_run_hw_queue()                        blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)  3) store
>   if (blk_queue_quiesced()) 2) load          blk_mq_run_hw_queues()
>     return                                     blk_mq_run_hw_queue()
>   blk_mq_sched_dispatch_requests()               if (!blk_mq_hctx_has_pending())        4) load
>                                                    return

OK.

The issue shouldn't exist if blk_queue_quiesced() returns false in
blk_mq_run_hw_queue(), so it is still one race between two slow paths?

I guess a barrier-less approach should work too, such as:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b55..632261982a77 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
+static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
+{
+	return !blk_queue_quiesced(hctx->queue) &&
+		blk_mq_hctx_has_pending(hctx);
+}
+
 /**
  * blk_mq_run_hw_queue - Start to run a hardware queue.
  * @hctx: Pointer to the hardware queue to run.
@@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * quiesced.
 	 */
 	__blk_mq_run_dispatch_ops(hctx->queue, false,
-		need_run = !blk_queue_quiesced(hctx->queue) &&
-		blk_mq_hctx_has_pending(hctx));
+		need_run = blk_mq_hw_queue_need_run(hctx));
 
-	if (!need_run)
-		return;
+	if (!need_run) {
+		unsigned long flags;
+
+		/* sync with unquiesce */
+		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		need_run = blk_mq_hw_queue_need_run(hctx);
+		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+
+		if (!need_run)
+			return;
+	}
 
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);

thanks,
Ming
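[Editor's note] For readers not following the source, the reason a plain
spinlocked re-check can close the race is that blk_mq_unquiesce_queue() clears
QUEUE_FLAG_QUIESCED while holding the same q->queue_lock, so the two critical
sections are totally ordered. The excerpt below is paraphrased from memory and
from the unquiesce path shown in the patch at the end of this thread; exact
context lines vary by kernel version, so treat it as a sketch rather than a
verbatim copy.

	/* Paraphrased from blk_mq_unquiesce_queue() in block/blk-mq.c */
	void blk_mq_unquiesce_queue(struct request_queue *q)
	{
		unsigned long flags;
		bool run_queue = false;

		spin_lock_irqsave(&q->queue_lock, flags);
		if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
			;
		} else if (!--q->quiesce_depth) {
			/* The flag is cleared while holding q->queue_lock ... */
			blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
			run_queue = true;
		}
		spin_unlock_irqrestore(&q->queue_lock, flags);

		/*
		 * ... so a reader that re-checks the flag under q->queue_lock
		 * either runs after the clear (sees the queue unquiesced and
		 * runs the hardware queue itself), or runs before it, in which
		 * case the blk_mq_run_hw_queues() below will find the request
		 * that reader had already inserted.
		 */
		if (run_queue)
			blk_mq_run_hw_queues(q, true);
	}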
> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
>>
>> [... earlier discussion and the resent graph, quoted in full ...]
>
> OK.
>
> The issue shouldn't exist if blk_queue_quiesced() returns false in
> blk_mq_run_hw_queue(), so it is still one race between two slow paths?
>
> I guess a barrier-less approach should work too, such as:

If we prefer a barrier-less approach, I think the following solution will
work as well; I'll use it in v2. Thanks.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..632261982a77 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> [... hunk adding blk_mq_hw_queue_need_run() trimmed ...]
> @@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  	 * quiesced.
>  	 */
>  	__blk_mq_run_dispatch_ops(hctx->queue, false,
> -		need_run = !blk_queue_quiesced(hctx->queue) &&
> -		blk_mq_hctx_has_pending(hctx));
> +		need_run = blk_mq_hw_queue_need_run(hctx));
>
> -	if (!need_run)
> -		return;
> +	if (!need_run) {
> +		unsigned long flags;
> +
> +		/* sync with unquiesce */
> +		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> +		need_run = blk_mq_hw_queue_need_run(hctx);

One question here: should we use __blk_mq_run_dispatch_ops()? I saw a
comment above; it seems it is safe to call blk_mq_hw_queue_need_run()
under the [s]rcu lock.

Thanks.

> +		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> +
> +		if (!need_run)
> +			return;
> +	}
>
>  	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
>  		blk_mq_delay_run_hw_queue(hctx, 0);
>
> thanks,
> Ming
> On Aug 27, 2024, at 15:24, Muchun Song <muchun.song@linux.dev> wrote:
>
>> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
>>
>> [... earlier discussion quoted in full ...]
>>
>> I guess a barrier-less approach should work too, such as:
>
> If we prefer a barrier-less approach, I think the following solution
> will work as well; I'll use it in v2. Thanks.
>
>> [... first hunk of the proposed diff trimmed ...]
>> -	if (!need_run)
>> -		return;
>> +	if (!need_run) {
>> +		unsigned long flags;
>> +
>> +		/* sync with unquiesce */
>> +		spin_lock_irqsave(&hctx->queue->queue_lock, flags);

After some more thought, I think we need a big comment here to explain why
we need to synchronize, because there are other callers of
blk_queue_quiesced() which do not need to hold ->queue_lock to synchronize.
So I am wondering whether ->queue_lock is really easier to maintain than a
memory barrier: developers still need to reason about this either way,
right? I don't see any obvious benefit, and the mb approach seems more
efficient than the spinlock. Something like:

	if (!need_run) {
		/* Add a comment here to explain what's going on here. */
		smp_mb();
		need_run = blk_mq_hw_queue_need_run(hctx);
		if (!need_run)
			return;
	}

I am not objecting to your approach; I want to know whether you insist on
the barrier-less approach here. If yes, I'm fine with it and can use it
in v2.

Muchun,
Thanks.

>> +		need_run = blk_mq_hw_queue_need_run(hctx);
>
> One question here: should we use __blk_mq_run_dispatch_ops()? I saw a
> comment above; it seems it is safe to call blk_mq_hw_queue_need_run()
> under the [s]rcu lock.
>
> Thanks.
>
>> [... remainder of the proposed diff and signature trimmed ...]
On Tue, Aug 27, 2024 at 4:17 PM Muchun Song <muchun.song@linux.dev> wrote:
>
>> On Aug 27, 2024, at 15:24, Muchun Song <muchun.song@linux.dev> wrote:
>>
>> [... earlier discussion quoted in full ...]
>
> After some more thought, I think we need a big comment here to explain why
> we need to synchronize, because there are other callers of
> blk_queue_quiesced() which do not need to hold ->queue_lock to synchronize.
> So I am wondering whether ->queue_lock is really easier to maintain than a
> memory barrier: developers still need to reason about this either way,
> right? I don't see any obvious benefit, and the mb approach seems more
> efficient than the spinlock. Something like:
>
> 	if (!need_run) {
> 		/* Add a comment here to explain what's going on here. */
> 		smp_mb();
> 		need_run = blk_mq_hw_queue_need_run(hctx);
> 		if (!need_run)
> 			return;
> 	}
>
> I am not objecting to your approach; I want to know whether you insist on
> the barrier-less approach here. If yes, I'm fine with it and can use it
> in v2.

Yes. As I mentioned, the race only exists between two slow code paths, and
we seldom use barriers in slow paths, where a traditional lock provides a
simpler and more readable solution. The READ/WRITE dependency implied by
any barrier is always hard to follow.

Thanks,
> On Aug 29, 2024, at 10:51, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Aug 27, 2024 at 4:17 PM Muchun Song <muchun.song@linux.dev> wrote:
>>
>> [... earlier discussion quoted in full ...]
>>
>> I am not objecting to your approach; I want to know whether you insist on
>> the barrier-less approach here. If yes, I'm fine with it and can use it
>> in v2.
>
> Yes. As I mentioned, the race only exists between two slow code paths, and
> we seldom use barriers in slow paths, where a traditional lock provides a
> simpler and more readable solution. The READ/WRITE dependency implied by
> any barrier is always hard to follow.

Got it. Thanks for your reply.

> Thanks,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 385a74e566874..66b21407a9a6c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -264,6 +264,13 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 		;
 	} else if (!--q->quiesce_depth) {
 		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+		/**
+		 * The need of memory barrier is in blk_mq_run_hw_queues() to
+		 * make sure clearing of QUEUE_FLAG_QUIESCED is before the
+		 * checking of dispatch list or bitmap of any software queue.
+		 *
+		 * smp_mb__after_atomic();
+		 */
 		run_queue = true;
 	}
 	spin_unlock_irqrestore(&q->queue_lock, flags);
@@ -2222,6 +2229,21 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	bool need_run;
 
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 *
+	 * This barrier is also used to order adding of dispatch list or
+	 * setting of bitmap of any software queue outside of this function
+	 * and test of QUEUE_FLAG_QUIESCED below.
+	 */
+	smp_mb();
+
 	/*
 	 * We can't run the queue inline with interrupts disabled.
 	 */
@@ -2244,17 +2266,6 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (!need_run)
 		return;
 
-	/*
-	 * This barrier is used to order adding of dispatch list or setting
-	 * of bitmap of any software queue outside of this function and the
-	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
-	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
-	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
-	 * of bitmap of any software queue to avoid missing dispatching
-	 * requests.
-	 */
-	smp_mb();
-
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);
 		return;
@@ -2308,6 +2319,11 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
 	 * of bitmap of any software queue to avoid missing dispatching
 	 * requests.
+	 *
+	 * This barrier is also used to order clearing of QUEUE_FLAG_QUIESCED
+	 * outside of this function in blk_mq_unquiesce_queue() and checking
+	 * of dispatch list or bitmap of any software queue in
+	 * blk_mq_run_hw_queue().
 	 */
 	smp_mb();
Supposing the following scenario.

CPU0                                                     CPU1

blk_mq_request_issue_directly()                          blk_mq_unquiesce_queue()
  if (blk_queue_quiesced())                                blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)  3) store
    blk_mq_insert_request()                                blk_mq_run_hw_queues()
    /*                                                       blk_mq_run_hw_queue()
     * Add request to dispatch list or set bitmap of            if (!blk_mq_hctx_has_pending())       4) load
     * software queue.                  1) store                  return
     */
    blk_mq_run_hw_queue()
      if (blk_queue_quiesced())         2) load
        return
      blk_mq_sched_dispatch_requests()

A full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4), to make sure that either CPU0 sees that QUEUE_FLAG_QUIESCED
has been cleared or CPU1 sees the dispatch list or the setting of the
software queue's bitmap. Otherwise, either CPU will fail to re-run the
hardware queue, causing starvation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)
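[Editor's note] To make the ordering requirement above concrete, here is a
small user-space store-buffering model of the diagram. It assumes nothing
about the kernel code itself: the two atomics stand in for "request added to
the dispatch list / software-queue bitmap" and for "QUEUE_FLAG_QUIESCED
cleared", and the seq_cst fences stand in for the full barriers required
between 1)/2) and 3)/4). With both fences in place, at least one thread must
observe the other's store, so at least one side re-runs the hardware queue;
drop either fence and both loads may return 0, which is exactly the
starvation outcome described above.

	/* Illustrative only; build with: gcc -O2 -pthread sb_model.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int queued;      /* stands in for dispatch list / ctx bitmap */
	static atomic_int unquiesced;  /* stands in for !QUEUE_FLAG_QUIESCED       */

	static int cpu0_saw_unquiesced;
	static int cpu1_saw_queued;

	static void *cpu0(void *arg)   /* models insert + blk_mq_run_hw_queue()    */
	{
		(void)arg;
		atomic_store_explicit(&queued, 1, memory_order_relaxed);     /* 1) store */
		atomic_thread_fence(memory_order_seq_cst);                   /* smp_mb() */
		cpu0_saw_unquiesced =
			atomic_load_explicit(&unquiesced, memory_order_relaxed); /* 2) load */
		return NULL;
	}

	static void *cpu1(void *arg)   /* models blk_mq_unquiesce_queue()          */
	{
		(void)arg;
		atomic_store_explicit(&unquiesced, 1, memory_order_relaxed); /* 3) store */
		atomic_thread_fence(memory_order_seq_cst);                   /* smp_mb() */
		cpu1_saw_queued =
			atomic_load_explicit(&queued, memory_order_relaxed);     /* 4) load */
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t0, NULL, cpu0, NULL);
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* With both fences, the (0, 0) outcome is forbidden; without them it is allowed. */
		printf("cpu0 saw unquiesced=%d, cpu1 saw queued=%d\n",
		       cpu0_saw_unquiesced, cpu1_saw_queued);
		return 0;
	}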