Message ID | 20200402085050.v2.2.I28278ef8ea27afc0ec7e597752a6d4e58c16176f@changeid (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | blk-mq: Fix two causes of IO stalls found in reboot testing | expand |
On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote: > It is possible for two threads to be running > blk_mq_do_dispatch_sched() at the same time with the same "hctx". > This is because there can be more than one caller to > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't > prevent more than one thread from entering. > > If more than one thread is running blk_mq_do_dispatch_sched() at the > same time with the same "hctx", they may have contention acquiring > budget. The blk_mq_get_dispatch_budget() can eventually translate > into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not > uncommon) then only one of the two threads will be the one to > increment "device_busy" to 1 and get the budget. > > The losing thread will break out of blk_mq_do_dispatch_sched() and > will stop dispatching requests. The assumption is that when more > budget is available later (when existing transactions finish) the > queue will be kicked again, perhaps in scsi_end_request(). > > The winning thread now has budget and can go on to call > dispatch_request(). If dispatch_request() returns NULL here then we > have a potential problem. Specifically we'll now call As I mentioned before, it is a BFQ specific issue, it tells blk-mq that it has work to do, and now the budget is assigned to the only winning thread, however, dispatch_request() still returns NULL. > blk_mq_put_dispatch_budget() which translates into > scsi_mq_put_budget(). That will mark the device as no longer busy but > doesn't do anything to kick the queue. This violates the assumption > that the queue would be kicked when more budget was available. The queue is still kicked in by BFQ in its idle timer, however that timer doesn't make forward progress. Without this idle timer, it is simply a BFQ issue. > > Pictorially: > > Thread A Thread B > ================================= ================================== > blk_mq_get_dispatch_budget() => 1 > dispatch_request() => NULL > blk_mq_get_dispatch_budget() => 0 > // because Thread A marked > // "device_busy" in scsi_device > blk_mq_put_dispatch_budget() What if there is only thread A? You need to mention that thread B is from BFQ. > > The above case was observed in reboot tests and caused a task to hang > forever waiting for IO to complete. Traces showed that in fact two > tasks were running blk_mq_do_dispatch_sched() at the same time with > the same "hctx". The task that got the budget did in fact see > dispatch_request() return NULL. Both tasks returned and the system > went on for several minutes (until the hung task delay kicked in) > without the given "hctx" showing up again in traces. > > Let's attempt to fix this problem by detecting if there was contention > for the budget in the case where we put the budget without dispatching > anything. If we saw contention we kick all hctx's associated with the > queue where there was contention. We do this without any locking by > adding a double-check for budget and accepting a small amount of faux > contention if the 2nd check gives us budget but then we don't dispatch > anything (we'll look like we contended with ourselves). > > A few extra notes: > > - This whole thing is only a problem due to the inexact nature of > has_work(). Specifically if has_work() always guaranteed that a > "true" return meant that dispatch_request() would return non-NULL > then we wouldn't have this problem. That's because we only grab the > budget if has_work() returned true. If we had the non-NULL > guarantee then at least one of the threads would actually dispatch > work and when the work was done then queues would be kicked > normally. > > - One specific I/O scheduler that trips this problem quite a bit is > BFQ which definitely returns "true" for has_work() in cases where it > wouldn't dispatch. Making BFQ's has_work() more exact requires that > has_work() becomes a much heavier function, including figuring out > how to acquire spinlocks in has_work() without tripping circular > lock dependencies. This is prototyped but it's unclear if it's > really the way to go when the problem can be solved with a > relatively lightweight contention detection mechanism. > > - Because this problem only trips with inexact has_work() it's > believed that blk_mq_do_dispatch_ctx() does not need a similar > change. Right, I prefer to fix it in BFQ, given it isn't a generic issue, not worth of generic solution. > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > Changes in v2: > - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...") > > block/blk-mq-sched.c | 26 ++++++++++++++++++++++++-- > block/blk-mq.c | 3 +++ > include/linux/blkdev.h | 2 ++ > 3 files changed, 29 insertions(+), 2 deletions(-) > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 74cedea56034..0195d75f5f96 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) > break; > > - if (!blk_mq_get_dispatch_budget(hctx)) > - break; > + > + if (!blk_mq_get_dispatch_budget(hctx)) { > + /* > + * We didn't get budget so set contention. If > + * someone else had the budget but didn't dispatch > + * they'll kick everything. NOTE: we check one > + * extra time _after_ setting contention to fully > + * close the race. If we don't actually dispatch > + * we'll detext faux contention (with ourselves) > + * but that should be rare. > + */ > + atomic_set(&q->budget_contention, 1); > + > + if (!blk_mq_get_dispatch_budget(hctx)) scsi_mq_get_budget() implies a smp_mb(), so the barrier can order between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1). > + break; > + } > > rq = e->type->ops.dispatch_request(hctx); > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + > + /* > + * We've released the budget but us holding it might > + * have prevented someone else from dispatching. > + * Detect that case and run all queues again. > + */ > + if (atomic_read(&q->budget_contention)) scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic() is needed between the above two op. > + blk_mq_run_hw_queues(q, true); > break; > } > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 2cd8d2b49ff4..6163c43ceca5 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) > struct blk_mq_hw_ctx *hctx; > int i; > > + /* We're running the queues, so clear the contention detector */ > + atomic_set(&q->budget_contention, 0); > + You add extra cost in fast path. > queue_for_each_hw_ctx(q, hctx, i) { > if (blk_mq_hctx_stopped(hctx)) > continue; > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index f629d40c645c..07f21e45d993 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -583,6 +583,8 @@ struct request_queue { > > #define BLK_MAX_WRITE_HINTS 5 > u64 write_hints[BLK_MAX_WRITE_HINTS]; > + > + atomic_t budget_contention; It needn't to be a atomic variable, and simple 'unsigned' int should be fine, what matters is that the order between R/W this flag and get/put budget. thanks, Ming
Hi, On Thu, Apr 2, 2020 at 6:34 PM Ming Lei <ming.lei@redhat.com> wrote: > > On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote: > > It is possible for two threads to be running > > blk_mq_do_dispatch_sched() at the same time with the same "hctx". > > This is because there can be more than one caller to > > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't > > prevent more than one thread from entering. > > > > If more than one thread is running blk_mq_do_dispatch_sched() at the > > same time with the same "hctx", they may have contention acquiring > > budget. The blk_mq_get_dispatch_budget() can eventually translate > > into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not > > uncommon) then only one of the two threads will be the one to > > increment "device_busy" to 1 and get the budget. > > > > The losing thread will break out of blk_mq_do_dispatch_sched() and > > will stop dispatching requests. The assumption is that when more > > budget is available later (when existing transactions finish) the > > queue will be kicked again, perhaps in scsi_end_request(). > > > > The winning thread now has budget and can go on to call > > dispatch_request(). If dispatch_request() returns NULL here then we > > have a potential problem. Specifically we'll now call > > As I mentioned before, it is a BFQ specific issue, it tells blk-mq > that it has work to do, and now the budget is assigned to the only > winning thread, however, dispatch_request() still returns NULL. Correct that it only happens with BFQ, but whether it's a BFQ bug or not just depends on how you define the has_work() API. If has_work() is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug depending on how you cut it. If has_work() must be exact then it is certainly a BFQ bug. If has_work() doesn't need to be exact then it's not a BFQ bug. I believe that a sane API could be defined either way. Either has_work() can be defined as a lightweight hint to trigger heavier code or it can be defined as something exact. It's really up to blk-mq to say how they define it. From his response on the SCSI patch [1], it sounded like Jens was OK with has_work() being a lightweight hint as long as BFQ ensures that the queues run later. ...but, as my investigation found, I believe that BFQ _does_ try to ensure that the queue is run at a later time by calling blk_mq_run_hw_queues(). The irony is that due to the race we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to be reliable if has_work() is inexact. :( One way to address this is to make blk_mq_run_hw_queues() reliable even if has_work() is inexact. ...so Jens: care to clarify how you'd like has_work() to be defined? > > - Because this problem only trips with inexact has_work() it's > > believed that blk_mq_do_dispatch_ctx() does not need a similar > > change. > > Right, I prefer to fix it in BFQ, given it isn't a generic issue, > not worth of generic solution. Just to confirm: it sounds like you are saying that BFQ is not a first class citizen here because not everyone uses BFQ. Is that really the best policy? Couldn't I say that SCSI shouldn't be a first class citizen because not everyone uses SCSI? In such a case I could argue that we should speed up the blk-mq layer by removing all the budget code since SCSI is the only user. I'm not actually suggesting it, only pointing out that sometimes we need to make allowances in the code. > > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > > if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) > > break; > > > > - if (!blk_mq_get_dispatch_budget(hctx)) > > - break; > > + > > + if (!blk_mq_get_dispatch_budget(hctx)) { > > + /* > > + * We didn't get budget so set contention. If > > + * someone else had the budget but didn't dispatch > > + * they'll kick everything. NOTE: we check one > > + * extra time _after_ setting contention to fully > > + * close the race. If we don't actually dispatch > > + * we'll detext faux contention (with ourselves) > > + * but that should be rare. > > + */ > > + atomic_set(&q->budget_contention, 1); > > + > > + if (!blk_mq_get_dispatch_budget(hctx)) > > scsi_mq_get_budget() implies a smp_mb(), so the barrier can order > between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1). I always struggle when working with memory barriers. I think you are saying that this section of code is OK, though, presumably because the SCSI code does "atomic_inc_return" which implies this barrier. Do you happen to know if it's documented that anyone who implements get_budget() for "struct blk_mq_ops" will have a memory barrier here? I know SCSI is the only existing user, but we'd want to keep it generic. > > + break; > > + } > > > > rq = e->type->ops.dispatch_request(hctx); > > if (!rq) { > > blk_mq_put_dispatch_budget(hctx); > > + > > + /* > > + * We've released the budget but us holding it might > > + * have prevented someone else from dispatching. > > + * Detect that case and run all queues again. > > + */ > > + if (atomic_read(&q->budget_contention)) > > scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic() > is needed between the above two op. OK, thanks. I will add that we decide to move forward with this patch. Again I'd wonder if this type of thing should be documented. > > + blk_mq_run_hw_queues(q, true); > > break; > > } > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 2cd8d2b49ff4..6163c43ceca5 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) > > struct blk_mq_hw_ctx *hctx; > > int i; > > > > + /* We're running the queues, so clear the contention detector */ > > + atomic_set(&q->budget_contention, 0); > > + > > You add extra cost in fast path. Yes, but it is a fairly minor cost added. It's called once in a place where we're unlikely to be looping and where we're about to do a lot heavier operations (perhaps in a loop). This doesn't feel like a dealbreaker. > > queue_for_each_hw_ctx(q, hctx, i) { > > if (blk_mq_hctx_stopped(hctx)) > > continue; > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index f629d40c645c..07f21e45d993 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -583,6 +583,8 @@ struct request_queue { > > > > #define BLK_MAX_WRITE_HINTS 5 > > u64 write_hints[BLK_MAX_WRITE_HINTS]; > > + > > + atomic_t budget_contention; > > It needn't to be a atomic variable, and simple 'unsigned' > int should be fine, what matters is that the order between > R/W this flag and get/put budget. Apparently I'm the only one who feels atomic_t is more documenting here. I will cede to the will of the majority here and change to 'unsigned int' if we decide to move forward with this patch. -Doug [1] https://lore.kernel.org/r/d6af2344-11f7-5862-daed-e21cbd496d92@kernel.dk
Hi, On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote: > > Correct that it only happens with BFQ, but whether it's a BFQ bug or > not just depends on how you define the has_work() API. If has_work() > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug > depending on how you cut it. If has_work() must be exact then it is > certainly a BFQ bug. If has_work() doesn't need to be exact then it's > not a BFQ bug. I believe that a sane API could be defined either way. > Either has_work() can be defined as a lightweight hint to trigger > heavier code or it can be defined as something exact. It's really up > to blk-mq to say how they define it. > > From his response on the SCSI patch [1], it sounded like Jens was OK > with has_work() being a lightweight hint as long as BFQ ensures that > the queues run later. ...but, as my investigation found, I believe > that BFQ _does_ try to ensure that the queue is run at a later time by > calling blk_mq_run_hw_queues(). The irony is that due to the race > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to > be reliable if has_work() is inexact. :( One way to address this is > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact. > > ...so Jens: care to clarify how you'd like has_work() to be defined? Sorry to reply so quickly after my prior reply, but I might have found an extreme corner case where we can still run into the same race even if has_work() is exact. This is all theoretical from code analysis. Maybe you can poke a hole in my scenario or tell me it's so implausible that we don't care, but it seems like it's theoretically possible. For this example I'll assume a budget of 1 (AKA only one thread can get budget for a given queue): * Threads A and B both run has_work() at the same time with the same "hctx". has_work() is exact but there's no lock, so it's OK if Thread A and B both get back true. * Thread B gets interrupted for a long time right after it decides that there is work. Maybe its CPU gets an interrupt and the interrupt handler is slow. * Thread A runs, get budget, dispatches work. * Thread A's work finishes and budget is released. * Thread B finally runs again and gets budget. * Since Thread A already took care of the work and no new work has come in, Thread B will get NULL from dispatch_request(). I believe this is specifically why dispatch_request() is allowed to return NULL in the first place if has_work() must be exact. * Thread B will now be holding the budget and is about to call put_budget(), but hasn't called it yet. * Thread B gets interrupted for a long time (again). Dang interrupts. * Now Thread C (with a different "hctx" but the same queue) comes along and runs blk_mq_do_dispatch_sched(). * Thread C won't do anything because it can't get budget. * Finally Thread B will run again and put the budget without kicking any queues. Now we have a potential I/O stall because nobody will ever kick the queues. I think the above example could happen even on non-BFQ systems and I think it would also be fixed by an approach like the one in my patch. -Doug
On Fri, Apr 03, 2020 at 08:49:54AM -0700, Doug Anderson wrote: > Hi, > > On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote: > > > > Correct that it only happens with BFQ, but whether it's a BFQ bug or > > not just depends on how you define the has_work() API. If has_work() > > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug > > depending on how you cut it. If has_work() must be exact then it is > > certainly a BFQ bug. If has_work() doesn't need to be exact then it's > > not a BFQ bug. I believe that a sane API could be defined either way. > > Either has_work() can be defined as a lightweight hint to trigger > > heavier code or it can be defined as something exact. It's really up > > to blk-mq to say how they define it. > > > > From his response on the SCSI patch [1], it sounded like Jens was OK > > with has_work() being a lightweight hint as long as BFQ ensures that > > the queues run later. ...but, as my investigation found, I believe > > that BFQ _does_ try to ensure that the queue is run at a later time by > > calling blk_mq_run_hw_queues(). The irony is that due to the race > > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to > > be reliable if has_work() is inexact. :( One way to address this is > > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact. > > > > ...so Jens: care to clarify how you'd like has_work() to be defined? > > Sorry to reply so quickly after my prior reply, but I might have found > an extreme corner case where we can still run into the same race even > if has_work() is exact. This is all theoretical from code analysis. > Maybe you can poke a hole in my scenario or tell me it's so > implausible that we don't care, but it seems like it's theoretically > possible. For this example I'll assume a budget of 1 (AKA only one > thread can get budget for a given queue): > > * Threads A and B both run has_work() at the same time with the same > "hctx". has_work() is exact but there's no lock, so it's OK if > Thread A and B both get back true. > > * Thread B gets interrupted for a long time right after it decides > that there is work. Maybe its CPU gets an interrupt and the > interrupt handler is slow. > > * Thread A runs, get budget, dispatches work. > > * Thread A's work finishes and budget is released. > > * Thread B finally runs again and gets budget. > > * Since Thread A already took care of the work and no new work has > come in, Thread B will get NULL from dispatch_request(). I believe > this is specifically why dispatch_request() is allowed to return > NULL in the first place if has_work() must be exact. > > * Thread B will now be holding the budget and is about to call > put_budget(), but hasn't called it yet. > > * Thread B gets interrupted for a long time (again). Dang interrupts. > > * Now Thread C (with a different "hctx" but the same queue) comes > along and runs blk_mq_do_dispatch_sched(). > > * Thread C won't do anything because it can't get budget. > > * Finally Thread B will run again and put the budget without kicking > any queues. > > Now we have a potential I/O stall because nobody will ever kick the > queues. > > > I think the above example could happen even on non-BFQ systems and I > think it would also be fixed by an approach like the one in my patch. OK, looks it isn't specific on BFQ any more. Follows another candidate approach for this issue, given it is so hard to trigger, we can make it more reliable by rerun queue when has_work() returns true after ops->dispath_request() returns NULL. diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 74cedea56034..4408e5d4fcd8 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) blk_mq_run_hw_queue(hctx, true); } +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ /* * Only SCSI implements .get_budget and .put_budget, and SCSI restarts * its queue by itself in its completion handler, so we don't need to @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) rq = e->type->ops.dispatch_request(hctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); + + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); break; } Thanks, Ming
Hi, On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > OK, looks it isn't specific on BFQ any more. > > Follows another candidate approach for this issue, given it is so hard > to trigger, we can make it more reliable by rerun queue when has_work() > returns true after ops->dispath_request() returns NULL. > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 74cedea56034..4408e5d4fcd8 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) > blk_mq_run_hw_queue(hctx, true); > } > > +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ > /* > * Only SCSI implements .get_budget and .put_budget, and SCSI restarts > * its queue by itself in its completion handler, so we don't need to > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > rq = e->type->ops.dispatch_request(hctx); > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); I agree that your patch should solve the race. With the current BFQ's has_work() it's a bit of a disaster though. It will essentially put blk-mq into a busy-wait loop (with a 3 ms delay between each poll) while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. ...so I guess the question that still needs to be answered: does has_work() need to be exact? If so then we need the patch you propose plus one to BFQ. If not, we should continue along the lines of my patch. -Doug
> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: > > Hi, > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: >> >> OK, looks it isn't specific on BFQ any more. >> >> Follows another candidate approach for this issue, given it is so hard >> to trigger, we can make it more reliable by rerun queue when has_work() >> returns true after ops->dispath_request() returns NULL. >> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >> index 74cedea56034..4408e5d4fcd8 100644 >> --- a/block/blk-mq-sched.c >> +++ b/block/blk-mq-sched.c >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) >> blk_mq_run_hw_queue(hctx, true); >> } >> >> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ >> /* >> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts >> * its queue by itself in its completion handler, so we don't need to >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) >> rq = e->type->ops.dispatch_request(hctx); >> if (!rq) { >> blk_mq_put_dispatch_budget(hctx); >> + >> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) >> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > I agree that your patch should solve the race. With the current BFQ's > has_work() it's a bit of a disaster though. It will essentially put > blk-mq into a busy-wait loop (with a 3 ms delay between each poll) > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. > > ...so I guess the question that still needs to be answered: does > has_work() need to be exact? If so then we need the patch you propose > plus one to BFQ. If not, we should continue along the lines of my > patch. > Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, with this last Ming's patch, BFQ may happen to be polled every 3ms, for at most three times. On the opposite end, making bfq_has_work plugging aware costs more complexity, and possibly one more lock. While avoiding the above occasional polling, this may imply a lot of overhead or CPU stalls on every dispatch. Paolo > -Doug
Hi, On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote: > > > Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: > > > > Hi, > > > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > >> > >> OK, looks it isn't specific on BFQ any more. > >> > >> Follows another candidate approach for this issue, given it is so hard > >> to trigger, we can make it more reliable by rerun queue when has_work() > >> returns true after ops->dispath_request() returns NULL. > >> > >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >> index 74cedea56034..4408e5d4fcd8 100644 > >> --- a/block/blk-mq-sched.c > >> +++ b/block/blk-mq-sched.c > >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) > >> blk_mq_run_hw_queue(hctx, true); > >> } > >> > >> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ > >> /* > >> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts > >> * its queue by itself in its completion handler, so we don't need to > >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > >> rq = e->type->ops.dispatch_request(hctx); > >> if (!rq) { > >> blk_mq_put_dispatch_budget(hctx); > >> + > >> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > >> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > > > I agree that your patch should solve the race. With the current BFQ's > > has_work() it's a bit of a disaster though. It will essentially put > > blk-mq into a busy-wait loop (with a 3 ms delay between each poll) > > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. > > > > ...so I guess the question that still needs to be answered: does > > has_work() need to be exact? If so then we need the patch you propose > > plus one to BFQ. If not, we should continue along the lines of my > > patch. > > > > Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, > with this last Ming's patch, BFQ may happen to be polled every 3ms, > for at most three times. Ah! I did not know this. OK, then Ming's patch seems like it should work. If nothing else it should fix the problem. If this ends up making BFQ chew up too much CPU time then presumably someone will notice and BFQ's has_work() can be improved. Ming: how do you want to proceed? Do you want to formally post the patch? Do you want me to post a v3 of my series where I place patch #2 with your patch? Do you want authorship (which implies adding your Signed-off-by)? > On the opposite end, making bfq_has_work plugging aware costs more > complexity, and possibly one more lock. While avoiding the above > occasional polling, this may imply a lot of overhead or CPU stalls on > every dispatch. I still think it would be interesting to run performance tests with my proof-of-concept solution for has_work(). Even if it's not ideal, knowing whether performance increased, decreased, or stayed the same would give information about how much more effort should be put into this. -Doug
Hi, On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > rq = e->type->ops.dispatch_request(hctx); > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); To really close the race, don't we need to run all the queues associated with the hctx? I haven't traced it through, but I've been assuming that the multiple "hctx"s associated with the same queue will have the same budget associated with them and thus they can block each other out. -Doug
On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote: > Hi, > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > > rq = e->type->ops.dispatch_request(hctx); > > if (!rq) { > > blk_mq_put_dispatch_budget(hctx); > > + > > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > To really close the race, don't we need to run all the queues > associated with the hctx? I haven't traced it through, but I've been > assuming that the multiple "hctx"s associated with the same queue will > have the same budget associated with them and thus they can block each > other out. Yeah, we should run all hctxs which share the same budget space. Also, in theory, we don't have to add the delay, however BFQ may plug the dispatch for 9 ms, so looks delay run queue is required. thanks, Ming
> Il giorno 5 apr 2020, alle ore 18:16, Doug Anderson <dianders@chromium.org> ha scritto: > > Hi, > > On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote: >> >>> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto: >>> >>> Hi, >>> >>> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: >>>> >>>> OK, looks it isn't specific on BFQ any more. >>>> >>>> Follows another candidate approach for this issue, given it is so hard >>>> to trigger, we can make it more reliable by rerun queue when has_work() >>>> returns true after ops->dispath_request() returns NULL. >>>> >>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >>>> index 74cedea56034..4408e5d4fcd8 100644 >>>> --- a/block/blk-mq-sched.c >>>> +++ b/block/blk-mq-sched.c >>>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) >>>> blk_mq_run_hw_queue(hctx, true); >>>> } >>>> >>>> +#define BLK_MQ_BUDGET_DELAY 3 /* ms units */ >>>> /* >>>> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts >>>> * its queue by itself in its completion handler, so we don't need to >>>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) >>>> rq = e->type->ops.dispatch_request(hctx); >>>> if (!rq) { >>>> blk_mq_put_dispatch_budget(hctx); >>>> + >>>> + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) >>>> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); >>> >>> I agree that your patch should solve the race. With the current BFQ's >>> has_work() it's a bit of a disaster though. It will essentially put >>> blk-mq into a busy-wait loop (with a 3 ms delay between each poll) >>> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything. >>> >>> ...so I guess the question that still needs to be answered: does >>> has_work() need to be exact? If so then we need the patch you propose >>> plus one to BFQ. If not, we should continue along the lines of my >>> patch. >>> >> >> Some more comments. BFQ's I/O plugging lasts 9 ms by default. So, >> with this last Ming's patch, BFQ may happen to be polled every 3ms, >> for at most three times. > > Ah! I did not know this. OK, then Ming's patch seems like it should > work. If nothing else it should fix the problem. If this ends up > making BFQ chew up too much CPU time then presumably someone will > notice and BFQ's has_work() can be improved. > > Ming: how do you want to proceed? Do you want to formally post the > patch? Do you want me to post a v3 of my series where I place patch > #2 with your patch? Do you want authorship (which implies adding your > Signed-off-by)? > > >> On the opposite end, making bfq_has_work plugging aware costs more >> complexity, and possibly one more lock. While avoiding the above >> occasional polling, this may imply a lot of overhead or CPU stalls on >> every dispatch. > > I still think it would be interesting to run performance tests with my > proof-of-concept solution for has_work(). Even if it's not ideal, > knowing whether performance increased, decreased, or stayed the same > would give information about how much more effort should be put into > this. > Why not? It is however hard to hope that we add only negligible overhead and CPU stalls if we move from one lock-protected section per I/O-request dispatch, to two or more lock-protected sections per request (has_work may be invoked several times per request). At any rate, if useful, one of the scripts in my S benchmark suite can also measure max IOPS (when limited only by I/O processing) [1]. The script is for Linux distros; I don't know whether it works in your environments of interest, Doug. Paolo [1] https://github.com/Algodev-github/S/blob/master/throughput-sync/throughput-sync.sh > -Doug
Hi, On Mon, Apr 6, 2020 at 7:14 PM Ming Lei <ming.lei@redhat.com> wrote: > > On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote: > > Hi, > > > > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote: > > > > > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) > > > rq = e->type->ops.dispatch_request(hctx); > > > if (!rq) { > > > blk_mq_put_dispatch_budget(hctx); > > > + > > > + if (e->type->ops.has_work && e->type->ops.has_work(hctx)) > > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY); > > > > To really close the race, don't we need to run all the queues > > associated with the hctx? I haven't traced it through, but I've been > > assuming that the multiple "hctx"s associated with the same queue will > > have the same budget associated with them and thus they can block each > > other out. > > Yeah, we should run all hctxs which share the same budget space. > > Also, in theory, we don't have to add the delay, however BFQ may plug the > dispatch for 9 ms, so looks delay run queue is required. I have posted up v3. A few notes: * Since we should run all "hctxs" I took out the check for has_work() before kicking the queue. * As far as I can tell the theoretical race happens for _anyone_ who puts budget, so I added it to blk_mq_put_dispatch_budget(). Feel free to shout at me in response to v3 if you believe I'm wrong about that. Thanks for all your reviews and suggestions so far! -Doug
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 74cedea56034..0195d75f5f96 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) break; - if (!blk_mq_get_dispatch_budget(hctx)) - break; + + if (!blk_mq_get_dispatch_budget(hctx)) { + /* + * We didn't get budget so set contention. If + * someone else had the budget but didn't dispatch + * they'll kick everything. NOTE: we check one + * extra time _after_ setting contention to fully + * close the race. If we don't actually dispatch + * we'll detext faux contention (with ourselves) + * but that should be rare. + */ + atomic_set(&q->budget_contention, 1); + + if (!blk_mq_get_dispatch_budget(hctx)) + break; + } rq = e->type->ops.dispatch_request(hctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); + + /* + * We've released the budget but us holding it might + * have prevented someone else from dispatching. + * Detect that case and run all queues again. + */ + if (atomic_read(&q->budget_contention)) + blk_mq_run_hw_queues(q, true); break; } diff --git a/block/blk-mq.c b/block/blk-mq.c index 2cd8d2b49ff4..6163c43ceca5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) struct blk_mq_hw_ctx *hctx; int i; + /* We're running the queues, so clear the contention detector */ + atomic_set(&q->budget_contention, 0); + queue_for_each_hw_ctx(q, hctx, i) { if (blk_mq_hctx_stopped(hctx)) continue; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f629d40c645c..07f21e45d993 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -583,6 +583,8 @@ struct request_queue { #define BLK_MAX_WRITE_HINTS 5 u64 write_hints[BLK_MAX_WRITE_HINTS]; + + atomic_t budget_contention; }; #define QUEUE_FLAG_STOPPED 0 /* queue is stopped */
It is possible for two threads to be running blk_mq_do_dispatch_sched() at the same time with the same "hctx". This is because there can be more than one caller to __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't prevent more than one thread from entering. If more than one thread is running blk_mq_do_dispatch_sched() at the same time with the same "hctx", they may have contention acquiring budget. The blk_mq_get_dispatch_budget() can eventually translate into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not uncommon) then only one of the two threads will be the one to increment "device_busy" to 1 and get the budget. The losing thread will break out of blk_mq_do_dispatch_sched() and will stop dispatching requests. The assumption is that when more budget is available later (when existing transactions finish) the queue will be kicked again, perhaps in scsi_end_request(). The winning thread now has budget and can go on to call dispatch_request(). If dispatch_request() returns NULL here then we have a potential problem. Specifically we'll now call blk_mq_put_dispatch_budget() which translates into scsi_mq_put_budget(). That will mark the device as no longer busy but doesn't do anything to kick the queue. This violates the assumption that the queue would be kicked when more budget was available. Pictorially: Thread A Thread B ================================= ================================== blk_mq_get_dispatch_budget() => 1 dispatch_request() => NULL blk_mq_get_dispatch_budget() => 0 // because Thread A marked // "device_busy" in scsi_device blk_mq_put_dispatch_budget() The above case was observed in reboot tests and caused a task to hang forever waiting for IO to complete. Traces showed that in fact two tasks were running blk_mq_do_dispatch_sched() at the same time with the same "hctx". The task that got the budget did in fact see dispatch_request() return NULL. Both tasks returned and the system went on for several minutes (until the hung task delay kicked in) without the given "hctx" showing up again in traces. Let's attempt to fix this problem by detecting if there was contention for the budget in the case where we put the budget without dispatching anything. If we saw contention we kick all hctx's associated with the queue where there was contention. We do this without any locking by adding a double-check for budget and accepting a small amount of faux contention if the 2nd check gives us budget but then we don't dispatch anything (we'll look like we contended with ourselves). A few extra notes: - This whole thing is only a problem due to the inexact nature of has_work(). Specifically if has_work() always guaranteed that a "true" return meant that dispatch_request() would return non-NULL then we wouldn't have this problem. That's because we only grab the budget if has_work() returned true. If we had the non-NULL guarantee then at least one of the threads would actually dispatch work and when the work was done then queues would be kicked normally. - One specific I/O scheduler that trips this problem quite a bit is BFQ which definitely returns "true" for has_work() in cases where it wouldn't dispatch. Making BFQ's has_work() more exact requires that has_work() becomes a much heavier function, including figuring out how to acquire spinlocks in has_work() without tripping circular lock dependencies. This is prototyped but it's unclear if it's really the way to go when the problem can be solved with a relatively lightweight contention detection mechanism. - Because this problem only trips with inexact has_work() it's believed that blk_mq_do_dispatch_ctx() does not need a similar change. Signed-off-by: Douglas Anderson <dianders@chromium.org> --- Changes in v2: - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...") block/blk-mq-sched.c | 26 ++++++++++++++++++++++++-- block/blk-mq.c | 3 +++ include/linux/blkdev.h | 2 ++ 3 files changed, 29 insertions(+), 2 deletions(-)