[v2,2/2] blk-mq: Rerun dispatching in the case of budget contention
diff mbox series

Message ID 20200402085050.v2.2.I28278ef8ea27afc0ec7e597752a6d4e58c16176f@changeid
State New
Headers show
Series
  • blk-mq: Fix two causes of IO stalls found in reboot testing
Related show

Commit Message

Doug Anderson April 2, 2020, 3:51 p.m. UTC
It is possible for two threads to be running
blk_mq_do_dispatch_sched() at the same time with the same "hctx".
This is because there can be more than one caller to
__blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
prevent more than one thread from entering.

If more than one thread is running blk_mq_do_dispatch_sched() at the
same time with the same "hctx", they may have contention acquiring
budget.  The blk_mq_get_dispatch_budget() can eventually translate
into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
uncommon) then only one of the two threads will be the one to
increment "device_busy" to 1 and get the budget.

The losing thread will break out of blk_mq_do_dispatch_sched() and
will stop dispatching requests.  The assumption is that when more
budget is available later (when existing transactions finish) the
queue will be kicked again, perhaps in scsi_end_request().

The winning thread now has budget and can go on to call
dispatch_request().  If dispatch_request() returns NULL here then we
have a potential problem.  Specifically we'll now call
blk_mq_put_dispatch_budget() which translates into
scsi_mq_put_budget().  That will mark the device as no longer busy but
doesn't do anything to kick the queue.  This violates the assumption
that the queue would be kicked when more budget was available.

Pictorially:

Thread A                          Thread B
================================= ==================================
blk_mq_get_dispatch_budget() => 1
dispatch_request() => NULL
                                  blk_mq_get_dispatch_budget() => 0
                                  // because Thread A marked
                                  // "device_busy" in scsi_device
blk_mq_put_dispatch_budget()

The above case was observed in reboot tests and caused a task to hang
forever waiting for IO to complete.  Traces showed that in fact two
tasks were running blk_mq_do_dispatch_sched() at the same time with
the same "hctx".  The task that got the budget did in fact see
dispatch_request() return NULL.  Both tasks returned and the system
went on for several minutes (until the hung task delay kicked in)
without the given "hctx" showing up again in traces.

Let's attempt to fix this problem by detecting if there was contention
for the budget in the case where we put the budget without dispatching
anything.  If we saw contention we kick all hctx's associated with the
queue where there was contention.  We do this without any locking by
adding a double-check for budget and accepting a small amount of faux
contention if the 2nd check gives us budget but then we don't dispatch
anything (we'll look like we contended with ourselves).

A few extra notes:

- This whole thing is only a problem due to the inexact nature of
  has_work().  Specifically if has_work() always guaranteed that a
  "true" return meant that dispatch_request() would return non-NULL
  then we wouldn't have this problem.  That's because we only grab the
  budget if has_work() returned true.  If we had the non-NULL
  guarantee then at least one of the threads would actually dispatch
  work and when the work was done then queues would be kicked
  normally.

- One specific I/O scheduler that trips this problem quite a bit is
  BFQ which definitely returns "true" for has_work() in cases where it
  wouldn't dispatch.  Making BFQ's has_work() more exact requires that
  has_work() becomes a much heavier function, including figuring out
  how to acquire spinlocks in has_work() without tripping circular
  lock dependencies.  This is prototyped but it's unclear if it's
  really the way to go when the problem can be solved with a
  relatively lightweight contention detection mechanism.

- Because this problem only trips with inexact has_work() it's
  believed that blk_mq_do_dispatch_ctx() does not need a similar
  change.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")

 block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
 block/blk-mq.c         |  3 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

Comments

Ming Lei April 3, 2020, 1:33 a.m. UTC | #1
On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> It is possible for two threads to be running
> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> This is because there can be more than one caller to
> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> prevent more than one thread from entering.
> 
> If more than one thread is running blk_mq_do_dispatch_sched() at the
> same time with the same "hctx", they may have contention acquiring
> budget.  The blk_mq_get_dispatch_budget() can eventually translate
> into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> uncommon) then only one of the two threads will be the one to
> increment "device_busy" to 1 and get the budget.
> 
> The losing thread will break out of blk_mq_do_dispatch_sched() and
> will stop dispatching requests.  The assumption is that when more
> budget is available later (when existing transactions finish) the
> queue will be kicked again, perhaps in scsi_end_request().
> 
> The winning thread now has budget and can go on to call
> dispatch_request().  If dispatch_request() returns NULL here then we
> have a potential problem.  Specifically we'll now call

As I mentioned before, it is a BFQ specific issue, it tells blk-mq
that it has work to do, and now the budget is assigned to the only
winning thread, however, dispatch_request() still returns NULL.

> blk_mq_put_dispatch_budget() which translates into
> scsi_mq_put_budget().  That will mark the device as no longer busy but
> doesn't do anything to kick the queue.  This violates the assumption
> that the queue would be kicked when more budget was available.

The queue is still kicked in by BFQ in its idle timer, however that
timer doesn't make forward progress.

Without this idle timer, it is simply a BFQ issue.

> 
> Pictorially:
> 
> Thread A                          Thread B
> ================================= ==================================
> blk_mq_get_dispatch_budget() => 1
> dispatch_request() => NULL
>                                   blk_mq_get_dispatch_budget() => 0
>                                   // because Thread A marked
>                                   // "device_busy" in scsi_device
> blk_mq_put_dispatch_budget()

What if there is only thread A? You need to mention that thread B
is from BFQ.

> 
> The above case was observed in reboot tests and caused a task to hang
> forever waiting for IO to complete.  Traces showed that in fact two
> tasks were running blk_mq_do_dispatch_sched() at the same time with
> the same "hctx".  The task that got the budget did in fact see
> dispatch_request() return NULL.  Both tasks returned and the system
> went on for several minutes (until the hung task delay kicked in)
> without the given "hctx" showing up again in traces.
> 
> Let's attempt to fix this problem by detecting if there was contention
> for the budget in the case where we put the budget without dispatching
> anything.  If we saw contention we kick all hctx's associated with the
> queue where there was contention.  We do this without any locking by
> adding a double-check for budget and accepting a small amount of faux
> contention if the 2nd check gives us budget but then we don't dispatch
> anything (we'll look like we contended with ourselves).
> 
> A few extra notes:
> 
> - This whole thing is only a problem due to the inexact nature of
>   has_work().  Specifically if has_work() always guaranteed that a
>   "true" return meant that dispatch_request() would return non-NULL
>   then we wouldn't have this problem.  That's because we only grab the
>   budget if has_work() returned true.  If we had the non-NULL
>   guarantee then at least one of the threads would actually dispatch
>   work and when the work was done then queues would be kicked
>   normally.
> 
> - One specific I/O scheduler that trips this problem quite a bit is
>   BFQ which definitely returns "true" for has_work() in cases where it
>   wouldn't dispatch.  Making BFQ's has_work() more exact requires that
>   has_work() becomes a much heavier function, including figuring out
>   how to acquire spinlocks in has_work() without tripping circular
>   lock dependencies.  This is prototyped but it's unclear if it's
>   really the way to go when the problem can be solved with a
>   relatively lightweight contention detection mechanism.
> 
> - Because this problem only trips with inexact has_work() it's
>   believed that blk_mq_do_dispatch_ctx() does not need a similar
>   change.

Right, I prefer to fix it in BFQ, given it isn't a generic issue,
not worth of generic solution.

> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
> Changes in v2:
> - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")
> 
>  block/blk-mq-sched.c   | 26 ++++++++++++++++++++++++--
>  block/blk-mq.c         |  3 +++
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..0195d75f5f96 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
>  			break;
>  
> -		if (!blk_mq_get_dispatch_budget(hctx))
> -			break;
> +
> +		if (!blk_mq_get_dispatch_budget(hctx)) {
> +			/*
> +			 * We didn't get budget so set contention.  If
> +			 * someone else had the budget but didn't dispatch
> +			 * they'll kick everything.  NOTE: we check one
> +			 * extra time _after_ setting contention to fully
> +			 * close the race.  If we don't actually dispatch
> +			 * we'll detext faux contention (with ourselves)
> +			 * but that should be rare.
> +			 */
> +			atomic_set(&q->budget_contention, 1);
> +
> +			if (!blk_mq_get_dispatch_budget(hctx))

scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

> +				break;
> +		}
>  
>  		rq = e->type->ops.dispatch_request(hctx);
>  		if (!rq) {
>  			blk_mq_put_dispatch_budget(hctx);
> +
> +			/*
> +			 * We've released the budget but us holding it might
> +			 * have prevented someone else from dispatching.
> +			 * Detect that case and run all queues again.
> +			 */
> +			if (atomic_read(&q->budget_contention))

scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
is needed between the above two op.

> +				blk_mq_run_hw_queues(q, true);
>  			break;
>  		}
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2cd8d2b49ff4..6163c43ceca5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
>  
> +	/* We're running the queues, so clear the contention detector */
> +	atomic_set(&q->budget_contention, 0);
> +

You add extra cost in fast path.

>  	queue_for_each_hw_ctx(q, hctx, i) {
>  		if (blk_mq_hctx_stopped(hctx))
>  			continue;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f629d40c645c..07f21e45d993 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -583,6 +583,8 @@ struct request_queue {
>  
>  #define BLK_MAX_WRITE_HINTS	5
>  	u64			write_hints[BLK_MAX_WRITE_HINTS];
> +
> +	atomic_t		budget_contention;

It needn't to be a atomic variable, and simple 'unsigned'
int should be fine, what matters is that the order between
R/W this flag and get/put budget.


thanks,
Ming
Doug Anderson April 3, 2020, 3:10 p.m. UTC | #2
Hi,

On Thu, Apr 2, 2020 at 6:34 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> > It is possible for two threads to be running
> > blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> > This is because there can be more than one caller to
> > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> > prevent more than one thread from entering.
> >
> > If more than one thread is running blk_mq_do_dispatch_sched() at the
> > same time with the same "hctx", they may have contention acquiring
> > budget.  The blk_mq_get_dispatch_budget() can eventually translate
> > into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> > uncommon) then only one of the two threads will be the one to
> > increment "device_busy" to 1 and get the budget.
> >
> > The losing thread will break out of blk_mq_do_dispatch_sched() and
> > will stop dispatching requests.  The assumption is that when more
> > budget is available later (when existing transactions finish) the
> > queue will be kicked again, perhaps in scsi_end_request().
> >
> > The winning thread now has budget and can go on to call
> > dispatch_request().  If dispatch_request() returns NULL here then we
> > have a potential problem.  Specifically we'll now call
>
> As I mentioned before, it is a BFQ specific issue, it tells blk-mq
> that it has work to do, and now the budget is assigned to the only
> winning thread, however, dispatch_request() still returns NULL.

Correct that it only happens with BFQ, but whether it's a BFQ bug or
not just depends on how you define the has_work() API.  If has_work()
is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
depending on how you cut it.  If has_work() must be exact then it is
certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
not a BFQ bug.  I believe that a sane API could be defined either way.
Either has_work() can be defined as a lightweight hint to trigger
heavier code or it can be defined as something exact.  It's really up
to blk-mq to say how they define it.

From his response on the SCSI patch [1], it sounded like Jens was OK
with has_work() being a lightweight hint as long as BFQ ensures that
the queues run later.  ...but, as my investigation found, I believe
that BFQ _does_ try to ensure that the queue is run at a later time by
calling blk_mq_run_hw_queues().  The irony is that due to the race
we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
be reliable if has_work() is inexact.  :(  One way to address this is
to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.

...so Jens: care to clarify how you'd like has_work() to be defined?


> > - Because this problem only trips with inexact has_work() it's
> >   believed that blk_mq_do_dispatch_ctx() does not need a similar
> >   change.
>
> Right, I prefer to fix it in BFQ, given it isn't a generic issue,
> not worth of generic solution.

Just to confirm: it sounds like you are saying that BFQ is not a first
class citizen here because not everyone uses BFQ.  Is that really the
best policy?  Couldn't I say that SCSI shouldn't be a first class
citizen because not everyone uses SCSI?  In such a case I could argue
that we should speed up the blk-mq layer by removing all the budget
code since SCSI is the only user.  I'm not actually suggesting it,
only pointing out that sometimes we need to make allowances in the
code.


> > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >               if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> >                       break;
> >
> > -             if (!blk_mq_get_dispatch_budget(hctx))
> > -                     break;
> > +
> > +             if (!blk_mq_get_dispatch_budget(hctx)) {
> > +                     /*
> > +                      * We didn't get budget so set contention.  If
> > +                      * someone else had the budget but didn't dispatch
> > +                      * they'll kick everything.  NOTE: we check one
> > +                      * extra time _after_ setting contention to fully
> > +                      * close the race.  If we don't actually dispatch
> > +                      * we'll detext faux contention (with ourselves)
> > +                      * but that should be rare.
> > +                      */
> > +                     atomic_set(&q->budget_contention, 1);
> > +
> > +                     if (!blk_mq_get_dispatch_budget(hctx))
>
> scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
> between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

I always struggle when working with memory barriers.  I think you are
saying that this section of code is OK, though, presumably because the
SCSI code does "atomic_inc_return" which implies this barrier.

Do you happen to know if it's documented that anyone who implements
get_budget() for "struct blk_mq_ops" will have a memory barrier here?
I know SCSI is the only existing user, but we'd want to keep it
generic.


> > +                             break;
> > +             }
> >
> >               rq = e->type->ops.dispatch_request(hctx);
> >               if (!rq) {
> >                       blk_mq_put_dispatch_budget(hctx);
> > +
> > +                     /*
> > +                      * We've released the budget but us holding it might
> > +                      * have prevented someone else from dispatching.
> > +                      * Detect that case and run all queues again.
> > +                      */
> > +                     if (atomic_read(&q->budget_contention))
>
> scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
> is needed between the above two op.

OK, thanks.  I will add that we decide to move forward with this
patch.  Again I'd wonder if this type of thing should be documented.


> > +                             blk_mq_run_hw_queues(q, true);
> >                       break;
> >               }
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 2cd8d2b49ff4..6163c43ceca5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >       struct blk_mq_hw_ctx *hctx;
> >       int i;
> >
> > +     /* We're running the queues, so clear the contention detector */
> > +     atomic_set(&q->budget_contention, 0);
> > +
>
> You add extra cost in fast path.

Yes, but it is a fairly minor cost added.  It's called once in a place
where we're unlikely to be looping and where we're about to do a lot
heavier operations (perhaps in a loop).  This doesn't feel like a
dealbreaker.


> >       queue_for_each_hw_ctx(q, hctx, i) {
> >               if (blk_mq_hctx_stopped(hctx))
> >                       continue;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index f629d40c645c..07f21e45d993 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -583,6 +583,8 @@ struct request_queue {
> >
> >  #define BLK_MAX_WRITE_HINTS  5
> >       u64                     write_hints[BLK_MAX_WRITE_HINTS];
> > +
> > +     atomic_t                budget_contention;
>
> It needn't to be a atomic variable, and simple 'unsigned'
> int should be fine, what matters is that the order between
> R/W this flag and get/put budget.

Apparently I'm the only one who feels atomic_t is more documenting
here.  I will cede to the will of the majority here and change to
'unsigned int' if we decide to move forward with this patch.

-Doug

[1] https://lore.kernel.org/r/d6af2344-11f7-5862-daed-e21cbd496d92@kernel.dk
Doug Anderson April 3, 2020, 3:49 p.m. UTC | #3
Hi,

On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Correct that it only happens with BFQ, but whether it's a BFQ bug or
> not just depends on how you define the has_work() API.  If has_work()
> is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> depending on how you cut it.  If has_work() must be exact then it is
> certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> not a BFQ bug.  I believe that a sane API could be defined either way.
> Either has_work() can be defined as a lightweight hint to trigger
> heavier code or it can be defined as something exact.  It's really up
> to blk-mq to say how they define it.
>
> From his response on the SCSI patch [1], it sounded like Jens was OK
> with has_work() being a lightweight hint as long as BFQ ensures that
> the queues run later.  ...but, as my investigation found, I believe
> that BFQ _does_ try to ensure that the queue is run at a later time by
> calling blk_mq_run_hw_queues().  The irony is that due to the race
> we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> be reliable if has_work() is inexact.  :(  One way to address this is
> to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
>
> ...so Jens: care to clarify how you'd like has_work() to be defined?

Sorry to reply so quickly after my prior reply, but I might have found
an extreme corner case where we can still run into the same race even
if has_work() is exact.  This is all theoretical from code analysis.
Maybe you can poke a hole in my scenario or tell me it's so
implausible that we don't care, but it seems like it's theoretically
possible.  For this example I'll assume a budget of 1 (AKA only one
thread can get budget for a given queue):

* Threads A and B both run has_work() at the same time with the same
  "hctx".  has_work() is exact but there's no lock, so it's OK if
  Thread A and B both get back true.

* Thread B gets interrupted for a long time right after it decides
  that there is work.  Maybe its CPU gets an interrupt and the
  interrupt handler is slow.

* Thread A runs, get budget, dispatches work.

* Thread A's work finishes and budget is released.

* Thread B finally runs again and gets budget.

* Since Thread A already took care of the work and no new work has
  come in, Thread B will get NULL from dispatch_request().  I believe
  this is specifically why dispatch_request() is allowed to return
  NULL in the first place if has_work() must be exact.

* Thread B will now be holding the budget and is about to call
  put_budget(), but hasn't called it yet.

* Thread B gets interrupted for a long time (again).  Dang interrupts.

* Now Thread C (with a different "hctx" but the same queue) comes
  along and runs blk_mq_do_dispatch_sched().

* Thread C won't do anything because it can't get budget.

* Finally Thread B will run again and put the budget without kicking
  any queues.

Now we have a potential I/O stall because nobody will ever kick the
queues.


I think the above example could happen even on non-BFQ systems and I
think it would also be fixed by an approach like the one in my patch.

-Doug
Ming Lei April 5, 2020, 9:14 a.m. UTC | #4
On Fri, Apr 03, 2020 at 08:49:54AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Correct that it only happens with BFQ, but whether it's a BFQ bug or
> > not just depends on how you define the has_work() API.  If has_work()
> > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> > depending on how you cut it.  If has_work() must be exact then it is
> > certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> > not a BFQ bug.  I believe that a sane API could be defined either way.
> > Either has_work() can be defined as a lightweight hint to trigger
> > heavier code or it can be defined as something exact.  It's really up
> > to blk-mq to say how they define it.
> >
> > From his response on the SCSI patch [1], it sounded like Jens was OK
> > with has_work() being a lightweight hint as long as BFQ ensures that
> > the queues run later.  ...but, as my investigation found, I believe
> > that BFQ _does_ try to ensure that the queue is run at a later time by
> > calling blk_mq_run_hw_queues().  The irony is that due to the race
> > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> > be reliable if has_work() is inexact.  :(  One way to address this is
> > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
> >
> > ...so Jens: care to clarify how you'd like has_work() to be defined?
> 
> Sorry to reply so quickly after my prior reply, but I might have found
> an extreme corner case where we can still run into the same race even
> if has_work() is exact.  This is all theoretical from code analysis.
> Maybe you can poke a hole in my scenario or tell me it's so
> implausible that we don't care, but it seems like it's theoretically
> possible.  For this example I'll assume a budget of 1 (AKA only one
> thread can get budget for a given queue):
> 
> * Threads A and B both run has_work() at the same time with the same
>   "hctx".  has_work() is exact but there's no lock, so it's OK if
>   Thread A and B both get back true.
> 
> * Thread B gets interrupted for a long time right after it decides
>   that there is work.  Maybe its CPU gets an interrupt and the
>   interrupt handler is slow.
> 
> * Thread A runs, get budget, dispatches work.
> 
> * Thread A's work finishes and budget is released.
> 
> * Thread B finally runs again and gets budget.
> 
> * Since Thread A already took care of the work and no new work has
>   come in, Thread B will get NULL from dispatch_request().  I believe
>   this is specifically why dispatch_request() is allowed to return
>   NULL in the first place if has_work() must be exact.
> 
> * Thread B will now be holding the budget and is about to call
>   put_budget(), but hasn't called it yet.
> 
> * Thread B gets interrupted for a long time (again).  Dang interrupts.
> 
> * Now Thread C (with a different "hctx" but the same queue) comes
>   along and runs blk_mq_do_dispatch_sched().
> 
> * Thread C won't do anything because it can't get budget.
> 
> * Finally Thread B will run again and put the budget without kicking
>   any queues.
> 
> Now we have a potential I/O stall because nobody will ever kick the
> queues.
> 
> 
> I think the above example could happen even on non-BFQ systems and I
> think it would also be fixed by an approach like the one in my patch.

OK, looks it isn't specific on BFQ any more.

Follows another candidate approach for this issue, given it is so hard
to trigger, we can make it more reliable by rerun queue when has_work()
returns true after ops->dispath_request() returns NULL.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..4408e5d4fcd8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
        blk_mq_run_hw_queue(hctx, true);
 }

+#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
 /*
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
@@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(hctx);
+
+                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
+                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
                        break;
                }


Thanks, 
Ming
Doug Anderson April 5, 2020, 2 p.m. UTC | #5
Hi,

On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> OK, looks it isn't specific on BFQ any more.
>
> Follows another candidate approach for this issue, given it is so hard
> to trigger, we can make it more reliable by rerun queue when has_work()
> returns true after ops->dispath_request() returns NULL.
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..4408e5d4fcd8 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>         blk_mq_run_hw_queue(hctx, true);
>  }
>
> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>  /*
>   * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>   * its queue by itself in its completion handler, so we don't need to
> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>                 rq = e->type->ops.dispatch_request(hctx);
>                 if (!rq) {
>                         blk_mq_put_dispatch_budget(hctx);
> +
> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);

I agree that your patch should solve the race.  With the current BFQ's
has_work() it's a bit of a disaster though. It will essentially put
blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.

...so I guess the question that still needs to be answered: does
has_work() need to be exact?  If so then we need the patch you propose
plus one to BFQ.  If not, we should continue along the lines of my
patch.

-Doug
Paolo Valente April 5, 2020, 2:57 p.m. UTC | #6
> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
> 
> Hi,
> 
> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> OK, looks it isn't specific on BFQ any more.
>> 
>> Follows another candidate approach for this issue, given it is so hard
>> to trigger, we can make it more reliable by rerun queue when has_work()
>> returns true after ops->dispath_request() returns NULL.
>> 
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index 74cedea56034..4408e5d4fcd8 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>>        blk_mq_run_hw_queue(hctx, true);
>> }
>> 
>> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>> /*
>>  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>>  * its queue by itself in its completion handler, so we don't need to
>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>>                rq = e->type->ops.dispatch_request(hctx);
>>                if (!rq) {
>>                        blk_mq_put_dispatch_budget(hctx);
>> +
>> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
>> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> 
> I agree that your patch should solve the race.  With the current BFQ's
> has_work() it's a bit of a disaster though. It will essentially put
> blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
> 
> ...so I guess the question that still needs to be answered: does
> has_work() need to be exact?  If so then we need the patch you propose
> plus one to BFQ.  If not, we should continue along the lines of my
> patch.
> 

Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
with this last Ming's patch, BFQ may happen to be polled every 3ms,
for at most three times.

On the opposite end, making bfq_has_work plugging aware costs more
complexity, and possibly one more lock.  While avoiding the above
occasional polling, this may imply a lot of overhead or CPU stalls on
every dispatch.

Paolo
 
> -Doug
Doug Anderson April 5, 2020, 4:16 p.m. UTC | #7
Hi,

On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>
> > Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
> >
> > Hi,
> >
> > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> OK, looks it isn't specific on BFQ any more.
> >>
> >> Follows another candidate approach for this issue, given it is so hard
> >> to trigger, we can make it more reliable by rerun queue when has_work()
> >> returns true after ops->dispath_request() returns NULL.
> >>
> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >> index 74cedea56034..4408e5d4fcd8 100644
> >> --- a/block/blk-mq-sched.c
> >> +++ b/block/blk-mq-sched.c
> >> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> >>        blk_mq_run_hw_queue(hctx, true);
> >> }
> >>
> >> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
> >> /*
> >>  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> >>  * its queue by itself in its completion handler, so we don't need to
> >> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >>                rq = e->type->ops.dispatch_request(hctx);
> >>                if (!rq) {
> >>                        blk_mq_put_dispatch_budget(hctx);
> >> +
> >> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> >> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> >
> > I agree that your patch should solve the race.  With the current BFQ's
> > has_work() it's a bit of a disaster though. It will essentially put
> > blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
> > while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
> >
> > ...so I guess the question that still needs to be answered: does
> > has_work() need to be exact?  If so then we need the patch you propose
> > plus one to BFQ.  If not, we should continue along the lines of my
> > patch.
> >
>
> Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
> with this last Ming's patch, BFQ may happen to be polled every 3ms,
> for at most three times.

Ah!  I did not know this.  OK, then Ming's patch seems like it should
work.  If nothing else it should fix the problem.  If this ends up
making BFQ chew up too much CPU time then presumably someone will
notice and BFQ's has_work() can be improved.

Ming: how do you want to proceed?  Do you want to formally post the
patch?  Do you want me to post a v3 of my series where I place patch
#2 with your patch?  Do you want authorship (which implies adding your
Signed-off-by)?


> On the opposite end, making bfq_has_work plugging aware costs more
> complexity, and possibly one more lock.  While avoiding the above
> occasional polling, this may imply a lot of overhead or CPU stalls on
> every dispatch.

I still think it would be interesting to run performance tests with my
proof-of-concept solution for has_work().  Even if it's not ideal,
knowing whether performance increased, decreased, or stayed the same
would give information about how much more effort should be put into
this.

-Doug
Doug Anderson April 5, 2020, 4:26 p.m. UTC | #8
Hi,

On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>                 rq = e->type->ops.dispatch_request(hctx);
>                 if (!rq) {
>                         blk_mq_put_dispatch_budget(hctx);
> +
> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);

To really close the race, don't we need to run all the queues
associated with the hctx?  I haven't traced it through, but I've been
assuming that the multiple "hctx"s associated with the same queue will
have the same budget associated with them and thus they can block each
other out.

-Doug
Ming Lei April 7, 2020, 2:14 a.m. UTC | #9
On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >                 rq = e->type->ops.dispatch_request(hctx);
> >                 if (!rq) {
> >                         blk_mq_put_dispatch_budget(hctx);
> > +
> > +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> > +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> 
> To really close the race, don't we need to run all the queues
> associated with the hctx?  I haven't traced it through, but I've been
> assuming that the multiple "hctx"s associated with the same queue will
> have the same budget associated with them and thus they can block each
> other out.

Yeah, we should run all hctxs which share the same budget space.

Also, in theory, we don't have to add the delay, however BFQ may plug the
dispatch for 9 ms, so looks delay run queue is required.

thanks,
Ming
Paolo Valente April 7, 2020, 10:43 a.m. UTC | #10
> Il giorno 5 apr 2020, alle ore 18:16, Doug Anderson <dianders@chromium.org> ha scritto:
> 
> Hi,
> 
> On Sun, Apr 5, 2020 at 7:55 AM Paolo Valente <paolo.valente@linaro.org> wrote:
>> 
>>> Il giorno 5 apr 2020, alle ore 16:00, Doug Anderson <dianders@chromium.org> ha scritto:
>>> 
>>> Hi,
>>> 
>>> On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>> 
>>>> OK, looks it isn't specific on BFQ any more.
>>>> 
>>>> Follows another candidate approach for this issue, given it is so hard
>>>> to trigger, we can make it more reliable by rerun queue when has_work()
>>>> returns true after ops->dispath_request() returns NULL.
>>>> 
>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>> index 74cedea56034..4408e5d4fcd8 100644
>>>> --- a/block/blk-mq-sched.c
>>>> +++ b/block/blk-mq-sched.c
>>>> @@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>>>>       blk_mq_run_hw_queue(hctx, true);
>>>> }
>>>> 
>>>> +#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
>>>> /*
>>>> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
>>>> * its queue by itself in its completion handler, so we don't need to
>>>> @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>>>>               rq = e->type->ops.dispatch_request(hctx);
>>>>               if (!rq) {
>>>>                       blk_mq_put_dispatch_budget(hctx);
>>>> +
>>>> +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
>>>> +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
>>> 
>>> I agree that your patch should solve the race.  With the current BFQ's
>>> has_work() it's a bit of a disaster though. It will essentially put
>>> blk-mq into a busy-wait loop (with a 3 ms delay between each poll)
>>> while BFQ's has_work() says "true" but BFQ doesn't dispatch anything.
>>> 
>>> ...so I guess the question that still needs to be answered: does
>>> has_work() need to be exact?  If so then we need the patch you propose
>>> plus one to BFQ.  If not, we should continue along the lines of my
>>> patch.
>>> 
>> 
>> Some more comments.  BFQ's I/O plugging lasts 9 ms by default.  So,
>> with this last Ming's patch, BFQ may happen to be polled every 3ms,
>> for at most three times.
> 
> Ah!  I did not know this.  OK, then Ming's patch seems like it should
> work.  If nothing else it should fix the problem.  If this ends up
> making BFQ chew up too much CPU time then presumably someone will
> notice and BFQ's has_work() can be improved.
> 
> Ming: how do you want to proceed?  Do you want to formally post the
> patch?  Do you want me to post a v3 of my series where I place patch
> #2 with your patch?  Do you want authorship (which implies adding your
> Signed-off-by)?
> 
> 
>> On the opposite end, making bfq_has_work plugging aware costs more
>> complexity, and possibly one more lock.  While avoiding the above
>> occasional polling, this may imply a lot of overhead or CPU stalls on
>> every dispatch.
> 
> I still think it would be interesting to run performance tests with my
> proof-of-concept solution for has_work().  Even if it's not ideal,
> knowing whether performance increased, decreased, or stayed the same
> would give information about how much more effort should be put into
> this.
> 

Why not?  It is however hard to hope that we add only negligible
overhead and CPU stalls if we move from one lock-protected section per
I/O-request dispatch, to two or more lock-protected sections per
request (has_work may be invoked several times per request).

At any rate, if useful, one of the scripts in my S benchmark suite can
also measure max IOPS (when limited only by I/O processing) [1].  The
script is for Linux distros; I don't know whether it works in your
environments of interest, Doug.

Paolo

[1] https://github.com/Algodev-github/S/blob/master/throughput-sync/throughput-sync.sh

> -Doug
Doug Anderson April 7, 2020, 10:04 p.m. UTC | #11
Hi,

On Mon, Apr 6, 2020 at 7:14 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Apr 05, 2020 at 09:26:39AM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Sun, Apr 5, 2020 at 2:15 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > @@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >                 rq = e->type->ops.dispatch_request(hctx);
> > >                 if (!rq) {
> > >                         blk_mq_put_dispatch_budget(hctx);
> > > +
> > > +                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
> > > +                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> >
> > To really close the race, don't we need to run all the queues
> > associated with the hctx?  I haven't traced it through, but I've been
> > assuming that the multiple "hctx"s associated with the same queue will
> > have the same budget associated with them and thus they can block each
> > other out.
>
> Yeah, we should run all hctxs which share the same budget space.
>
> Also, in theory, we don't have to add the delay, however BFQ may plug the
> dispatch for 9 ms, so looks delay run queue is required.

I have posted up v3.  A few notes:

* Since we should run all "hctxs" I took out the check for has_work()
before kicking the queue.

* As far as I can tell the theoretical race happens for _anyone_ who
puts budget, so I added it to blk_mq_put_dispatch_budget().  Feel free
to shout at me in response to v3 if you believe I'm wrong about that.

Thanks for all your reviews and suggestions so far!

-Doug

Patch
diff mbox series

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..0195d75f5f96 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,12 +97,34 @@  static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
 			break;
 
-		if (!blk_mq_get_dispatch_budget(hctx))
-			break;
+
+		if (!blk_mq_get_dispatch_budget(hctx)) {
+			/*
+			 * We didn't get budget so set contention.  If
+			 * someone else had the budget but didn't dispatch
+			 * they'll kick everything.  NOTE: we check one
+			 * extra time _after_ setting contention to fully
+			 * close the race.  If we don't actually dispatch
+			 * we'll detext faux contention (with ourselves)
+			 * but that should be rare.
+			 */
+			atomic_set(&q->budget_contention, 1);
+
+			if (!blk_mq_get_dispatch_budget(hctx))
+				break;
+		}
 
 		rq = e->type->ops.dispatch_request(hctx);
 		if (!rq) {
 			blk_mq_put_dispatch_budget(hctx);
+
+			/*
+			 * We've released the budget but us holding it might
+			 * have prevented someone else from dispatching.
+			 * Detect that case and run all queues again.
+			 */
+			if (atomic_read(&q->budget_contention))
+				blk_mq_run_hw_queues(q, true);
 			break;
 		}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2cd8d2b49ff4..6163c43ceca5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1528,6 +1528,9 @@  void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/* We're running the queues, so clear the contention detector */
+	atomic_set(&q->budget_contention, 0);
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f629d40c645c..07f21e45d993 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -583,6 +583,8 @@  struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
+
+	atomic_t		budget_contention;
 };
 
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */