
[V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

Message ID 20180123161613.14214-1-ming.lei@redhat.com (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Ming Lei Jan. 23, 2018, 4:16 p.m. UTC
This status is returned from a driver to the block layer if a device-related
resource has run out, but the driver can guarantee that IO dispatch will be
triggered again in the future once the resource becomes available.

This patch converts some drivers to use this return value. Meanwhile, if a
driver returns BLK_STS_RESOURCE and the BLK_MQ_S_SCHED_RESTART bit is set,
the queue is run after 10ms to avoid an IO hang.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V3:
	- fix a typo and improve the documentation
	- add tested-by tag
V2:
	- add comments on the newly introduced status
	- patch style fix
	- both suggested by Christoph

 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  4 ++--
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    | 18 ++++++++++++++++++
 8 files changed, 42 insertions(+), 12 deletions(-)
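For reference, the rerun decision this patch adds to blk_mq_dispatch_rq_list()
can be modeled as a small standalone sketch. This is not kernel code: the
names, the enum, and the boolean parameters are simplifications of the real
hctx state, kept only to make the branch logic explicit.

```c
#include <stdbool.h>

/* Simplified model of the rerun decision added to blk_mq_dispatch_rq_list().
 * All names and types here are illustrative, not the kernel's. */
enum rerun_action { RERUN_NOW, RERUN_DELAYED, RERUN_NONE };

enum rerun_action rerun_decision(bool needs_restart, bool no_tag,
				 bool dispatch_wait_empty,
				 bool ret_is_blk_sts_resource)
{
	/* No RESTART pending, or we ran out of tags while this hctx is not
	 * already waiting for one: rerun the hardware queue immediately. */
	if (!needs_restart || (no_tag && dispatch_wait_empty))
		return RERUN_NOW;

	/* RESTART is armed but the driver returned BLK_STS_RESOURCE rather
	 * than BLK_STS_DEV_RESOURCE: the restart may never fire if the queue
	 * goes idle, so schedule a run after 10ms as a safety net. */
	if (ret_is_blk_sts_resource)
		return RERUN_DELAYED;

	/* BLK_STS_DEV_RESOURCE: the driver guarantees a future rerun. */
	return RERUN_NONE;
}
```

The sketch mirrors the hunk quoted below: the first branch is the existing
blk_mq_run_hw_queue() path, the second is the new blk_mq_delay_run_hw_queue()
call, and the third is the case the new status code makes possible.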

Comments

Mike Snitzer Jan. 23, 2018, 4:20 p.m. UTC | #1
On Tue, Jan 23 2018 at 11:16am -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> This status is returned from driver to block layer if device related
> resource is run out of, but driver can guarantee that IO dispatch will
> be triggered in future when the resource is available.
> 
> This patch converts some drivers to use this return value. Meantime
> if driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, run
> queue after 10ms for avoiding IO hang.
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Tested-by: Laurence Oberman <loberman@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Bart Van Assche Jan. 23, 2018, 4:24 p.m. UTC | #2
On Wed, 2018-01-24 at 00:16 +0800, Ming Lei wrote:
> @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		 * - Some but not all block drivers stop a queue before
>  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
>  		 *   and dm-rq.
> +		 *
> +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> +		 * bit is set, run queue after 10ms for avoiding IO hang
> +		 * because the queue may be idle and the RESTART mechanism
> +		 * can't work any more.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, 10);
>  	}

My opinion about this patch is as follows:
* Changing a blk_mq_delay_run_hw_queue() call followed by return
  BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
  a guaranteed queue rerun into a queue rerun that may or may not happen
  depending on whether or not multiple queue runs happen simultaneously.
* This change makes block drivers less readable because anyone who encounters
  BLK_STS_DEV_RESOURCE will have to look up its definition to figure out what
  its meaning is.
* We don't need the new status code BLK_STS_DEV_RESOURCE because a delayed
  queue run can be implemented easily with the existing block layer API.

Bart.

Ming Lei Jan. 23, 2018, 4:37 p.m. UTC | #3
On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:16 +0800, Ming Lei wrote:
> > @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >  		 * - Some but not all block drivers stop a queue before
> >  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> >  		 *   and dm-rq.
> > +		 *
> > +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> > +		 * bit is set, run queue after 10ms for avoiding IO hang
> > +		 * because the queue may be idle and the RESTART mechanism
> > +		 * can't work any more.
> >  		 */
> > -		if (!blk_mq_sched_needs_restart(hctx) ||
> > +		needs_restart = blk_mq_sched_needs_restart(hctx);
> > +		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +			blk_mq_delay_run_hw_queue(hctx, 10);
> >  	}
> 
> My opinion about this patch is as follows:
> * Changing a blk_mq_delay_run_hw_queue() call followed by return
>   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
>   a guaranteed queue rerun into a queue rerun that may or may not happen
>   depending on whether or not multiple queue runs happen simultaneously.

You may be conflating the two cases:

1) it is always safe to return BLK_STS_RESOURCE: blk-mq makes sure to
avoid an IO hang via blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
and which of the two is used depends on SCHED_RESTART.

2) if the driver can make sure the queue will be rerun once some resource
is available, either by itself or by blk-mq, it returns BLK_STS_DEV_RESOURCE.

So what is wrong with this way?
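[To make the two cases concrete, here is a hypothetical driver-side sketch.
It is not taken from any real driver, and the status values are illustrative
placeholders rather than the kernel's blk_status_t values; only the decision
between the two return codes is modeled.]

```c
#include <stdbool.h>

/* Illustrative status values only; the real ones live in blk_types.h. */
enum { STS_OK = 0, STS_RESOURCE = 1, STS_DEV_RESOURCE = 2 };

/* Hypothetical .queue_rq() outcome under this patch's convention. */
int sketch_queue_rq(bool have_resource, bool rerun_guaranteed)
{
	if (have_resource)
		return STS_OK;

	/* Case 2: out of a device-private resource whose release is
	 * guaranteed to rerun the queue (by the driver itself or by
	 * blk-mq), so blk-mq does not need its 10ms safety-net rerun. */
	if (rerun_guaranteed)
		return STS_DEV_RESOURCE;

	/* Case 1: no such guarantee; BLK_STS_RESOURCE is always safe
	 * because blk-mq will rerun the queue, immediately or after 10ms
	 * depending on SCHED_RESTART. */
	return STS_RESOURCE;
}
```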

> * This change makes block drivers less readable because anyone who encounters
>   BLK_STS_DEV_RESOURCE will have to look up its definition to figure out what
>   it's meaning is.

It has been well documented, and BLK_STS_DEV_RESOURCE will be used only
rarely, so it shouldn't be an issue.

> * We don't need the new status code BLK_STS_DEV_RESOURCE because a delayed
>   queue run can be implemented easily with the existing block layer API.

You mean converting every BLK_STS_RESOURCE site to call that API? That would
need lots of changes, and it is racy in theory, since when the delayed queue
run is kicked off in the driver, the request hasn't been added to the
dispatch list yet.
Bart Van Assche Jan. 23, 2018, 4:57 p.m. UTC | #4
On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> > My opinion about this patch is as follows:
> > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
> >   a guaranteed queue rerun into a queue rerun that may or may not happen
> >   depending on whether or not multiple queue runs happen simultaneously.
> 
> You may not understand the two:
> 
> 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> and using which one depends on SCHED_RESTART.
> 
> 2) if driver can make sure the queue will be rerun after some resource
> is available, either by itself or by blk-mq, it will return BLK_STS_DEV_RESOURCE
> 
> So what is wrong with this way?

Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
race condition in code where there was no race condition.

Bart.

Ming Lei Jan. 24, 2018, 3:31 a.m. UTC | #5
On Tue, Jan 23, 2018 at 04:57:34PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> > > My opinion about this patch is as follows:
> > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
> > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > >   depending on whether or not multiple queue runs happen simultaneously.
> > 
> > You may not understand the two:
> > 
> > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > and using which one depends on SCHED_RESTART.
> > 
> > 2) if driver can make sure the queue will be rerun after some resource
> > is available, either by itself or by blk-mq, it will return BLK_STS_DEV_RESOURCE
> > 
> > So what is wrong with this way?
> 
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
> by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> race condition in code where there was no race condition.

OK, so there is no such race as you worried about in this patch.

Jens, could you take a look at this patch?

Thanks,
Ming

Mike Snitzer Jan. 27, 2018, 7:09 p.m. UTC | #6
On Tue, Jan 23 2018 at 10:31pm -0500,
Ming Lei <ming.lei@redhat.com> wrote:

> On Tue, Jan 23, 2018 at 04:57:34PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > > On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> > > > My opinion about this patch is as follows:
> > > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
> > > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > > >   depending on whether or not multiple queue runs happen simultaneously.
> > > 
> > > You may not understand the two:
> > > 
> > > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > > and using which one depends on SCHED_RESTART.
> > > 
> > > 2) if driver can make sure the queue will be rerun after some resource
> > > is available, either by itself or by blk-mq, it will return BLK_STS_DEV_RESOURCE
> > > 
> > > So what is wrong with this way?
> > 
> > Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> > reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
> > by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> > race condition in code where there was no race condition.
> 
> OK, then no such race you worried about in this patch.
> 
> Jens, could you take a look at this patch?

Hi Jens,

Ming let me know that he successfully tested this V3 patch using both
your test (fio to both mpath and underlying path) and Bart's (02-mq with
can_queue in guest).

Would be great if you'd review and verify this fix works for you too.

Ideally we'd get a fix for this regression staged for 4.16 inclusion.
This V3 patch seems like the best option we have at this point.

Mike

Bart Van Assche Jan. 27, 2018, 10:12 p.m. UTC | #7
On Sat, 2018-01-27 at 14:09 -0500, Mike Snitzer wrote:
> Ming let me know that he successfully tested this V3 patch using both
> your test (fio to both mpath and underlying path) and Bart's (02-mq with
> can_queue in guest).
> 
> Would be great if you'd review and verify this fix works for you too.
> 
> Ideally we'd get a fix for this regression staged for 4.16 inclusion.
> This V3 patch seems like the best option we have at this point.

Hello Mike,

There are several issues with the patch at the start of this thread:
- It is an unnecessary change of the block layer API. Queue stalls can
  already be addressed with the current block layer API, namely by inserting
  a blk_mq_delay_run_hw_queue() call before returning BLK_STS_RESOURCE.
- The patch at the start of this thread further complicates code that is
  already too complicated, namely the blk-mq core.
- The patch at the start of this thread introduces a regression in the
  SCSI core, namely a queue stall if a request completion occurs concurrently
  with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
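[The existing-API alternative referred to in the first point would look
roughly like this in a driver. This is a sketch, not real driver code: the
kernel call is stubbed out so the snippet is standalone, and the 100ms value
matches the delay from commit 6077c2d70 discussed elsewhere in this thread.]

```c
/* Sketch of the existing-API pattern: the driver itself schedules a
 * delayed queue run before returning BLK_STS_RESOURCE.  The kernel call
 * is stubbed out; status values are illustrative. */
enum { STS_RESOURCE = 1 };

static int last_delay_ms = -1;

/* Stand-in for blk_mq_delay_run_hw_queue(hctx, msecs). */
static void stub_delay_run_hw_queue(int msecs)
{
	last_delay_ms = msecs;
}

int sketch_queue_rq_no_resource(void)
{
	/* Guarantee a future rerun ourselves... */
	stub_delay_run_hw_queue(100);
	/* ...then report the shortage with the existing status code. */
	return STS_RESOURCE;
}
```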

As a kernel maintainer one of your responsibilities is to help keep the
quality of the kernel code high. So I think that you, as a kernel maintainer,
should tell Ming to discard this patch instead of asking to merge it
upstream, given all the disadvantages of this patch.

Bart.

Ming Lei Jan. 27, 2018, 11:41 p.m. UTC | #8
On Sat, Jan 27, 2018 at 10:12:43PM +0000, Bart Van Assche wrote:
> On Sat, 2018-01-27 at 14:09 -0500, Mike Snitzer wrote:
> > Ming let me know that he successfully tested this V3 patch using both
> > your test (fio to both mpath and underlying path) and Bart's (02-mq with
> > can_queue in guest).
> > 
> > Would be great if you'd review and verify this fix works for you too.
> > 
> > Ideally we'd get a fix for this regression staged for 4.16 inclusion.
> > This V3 patch seems like the best option we have at this point.
> 
> Hello Mike,
> 
> There are several issues with the patch at the start of this thread:
> - It is an unnecessary change of the block layer API. Queue stalls can
>   already be addressed with the current block layer API, namely by inserting
>   a blk_mq_delay_run_hw_queue() call before returning BLK_STS_RESOURCE.

Again, both Jens and I concluded that it is a generic issue which needs a
generic solution.

	https://marc.info/?l=linux-kernel&m=151638176727612&w=2

Otherwise, the handling of every BLK_STS_RESOURCE in drivers would need to
change; do we really want to do that?

Not to mention that the request isn't added to the dispatch list yet in
.queue_rq(); strictly speaking, it is not correct to call
blk_mq_delay_run_hw_queue() from .queue_rq(), so the current block layer
API can't handle this well enough.
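[The ordering hazard described above can be shown with a toy model. This is
not kernel code: it only captures the rule that a queue run dispatches
requests already on the dispatch list, so a run kicked from .queue_rq()
before the requeue can miss the request.]

```c
#include <stdbool.h>

/* Toy model: a queue run dispatches only what is already on the list. */
struct toy_queue { bool on_dispatch_list; bool dispatched; };

static void toy_run_queue(struct toy_queue *q)
{
	if (q->on_dispatch_list)
		q->dispatched = true;
}

/* Driver-side kick from .queue_rq(): the (delayed) run can fire before
 * blk-mq puts the request back on the dispatch list, and misses it. */
bool kick_before_requeue(struct toy_queue *q)
{
	toy_run_queue(q);		/* run sees an empty list       */
	q->on_dispatch_list = true;	/* request added afterwards     */
	return q->dispatched;		/* false: request stranded      */
}

/* Doing it in the blk-mq core lets the requeue happen first. */
bool requeue_before_kick(struct toy_queue *q)
{
	q->on_dispatch_list = true;
	toy_run_queue(q);
	return q->dispatched;		/* true: request dispatched     */
}
```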

> - The patch at the start of this thread complicates code further that is
>   already too complicated, namely the blk-mq core.

That is just your opinion; I don't agree.

> - The patch at the start of this thread introduces a regression in the
>   SCSI core, namely a queue stall if a request completion occurs concurrently
>   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

This patch only moves the blk_mq_delay_run_hw_queue() call from
scsi_queue_rq() to blk-mq. Again, please explain in detail how this V3
patch introduces a regression in SCSI.

Actually this patch should fix a race in scsi-mq: when scsi_queue_rq()
calls blk_mq_delay_run_hw_queue(), the request isn't on the dispatch list
yet, so in theory the request may not be visible when
__blk_mq_run_hw_queue() runs. Don't expect the 3ms delay to cover that; it
is absolutely fragile to depend on timing to deal with a race.

Maybe it can be one LSF/MM topic proposal...

thanks,
Ming

Mike Snitzer Jan. 28, 2018, 12:23 a.m. UTC | #9
On Sat, Jan 27 2018 at  5:12pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Sat, 2018-01-27 at 14:09 -0500, Mike Snitzer wrote:
> > Ming let me know that he successfully tested this V3 patch using both
> > your test (fio to both mpath and underlying path) and Bart's (02-mq with
> > can_queue in guest).
> > 
> > Would be great if you'd review and verify this fix works for you too.
> > 
> > Ideally we'd get a fix for this regression staged for 4.16 inclusion.
> > This V3 patch seems like the best option we have at this point.
> 
> Hello Mike,
> 
> There are several issues with the patch at the start of this thread:
> - It is an unnecessary change of the block layer API. Queue stalls can
>   already be addressed with the current block layer API, namely by inserting
>   a blk_mq_delay_run_hw_queue() call before returning BLK_STS_RESOURCE.
> - The patch at the start of this thread complicates code further that is
>   already too complicated, namely the blk-mq core.

The above says _nothing_ of substance.  You talk so loudly against
Ming's work that it has gotten to the point where nothing you say
against it can be taken seriously.

> - The patch at the start of this thread introduces a regression in the
>   SCSI core, namely a queue stall if a request completion occurs concurrently
>   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

And why is this supposed race unique to SCSI core?  

Fact is Ming dutifully implemented what Jens suggested.  And he verified
it to work.  What have you done other than play the antagonist?

> As a kernel maintainer one of your responsibilities is to help keeping the
> quality of the kernel code high. So I think that you, as a kernel maintainer,
> should tell Ming to discard this patch instead of
> asking to merge it upstream
> given all the disadvantages of this patch.

Your contributions do _not_ make up for your inability to work well with
others.  Tiresome doesn't begin to describe these interactions.

Life is too short to continue enduring your bullshit.

But do let us know when you have something of substance to contribute
(hint: code talks).

Bart Van Assche Jan. 28, 2018, 12:54 a.m. UTC | #10
On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> Your contributions do _not_ make up for your inability to work well with
> others.  Tiresome doesn't begin to describe these interactions.
> 
> Life is too short to continue enduring your bullshit.
> 
> But do let us know when you have something of substance to contribute
> (hint: code talks).

Sorry Mike but what you wrote above is not only very gross but it is also
wrong. What I did in my e-mail is to identify technical problems in Ming's
work. Additionally, it seems like you forgot that recently I helped Ming?
My patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
unintended delays" has been queued by Jens for kernel v4.16 and solves a
problem that Ming has been working on for two months but that he was
unable to come up with a proper fix for. Additionally, there is something
that you have to explain: the patch "dm mpath: don't call
blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE" that you have
queued in your tree is wrong and introduces a regression
(https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92).
I'm surprised that you have not yet reverted that patch. BTW, in case you
would not yet have noticed my patch "blk-mq: Avoid that
blk_mq_delay_run_hw_queue() introduces unintended delays" eliminates the
delay you referred to in the description of that patch.

In case the above would not yet have addressed the technical issue(s) you
are facing, I would really appreciate it if you would stop using insults and
if you could explain what technical problems you are facing. Isn't that what
the Linux kernel is about, namely about collaboration between technical
people from different organizations? Isn't that what made the Linux kernel
great?

Bart.

Mike Snitzer Jan. 28, 2018, 2:03 a.m. UTC | #11
On Sat, Jan 27 2018 at  7:54pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> > Your contributions do _not_ make up for your inability to work well with
> > others.  Tiresome doesn't begin to describe these interactions.
> > 
> > Life is too short to continue enduring your bullshit.
> > 
> > But do let us know when you have something of substance to contribute
> > (hint: code talks).
> 
> Sorry Mike but what you wrote above is not only very gross but it is also
> wrong. What I did in my e-mail is to identify technical problems in Ming's
> work. Additionally, it seems like you forgot that recently I helped Ming?
> My patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
> unintended delays" has been queued by Jens for kernel v4.16 and solves a
> problem that Ming has been working on for two months but that he was
> unable to come up with a proper fix for. Additionally, there is something
> that you have to explain: the patch "dm mpath: don't call
> blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE" that you have
> queued in your tree is wrong and introduces a regression
> (https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92).
> I'm surprised that you have not yet reverted that patch? BTW, in case you
> would not yet have noticed my patch "blk-mq: Avoid that
> blk_mq_delay_run_hw_queue() introduces unintended delays" eliminates the
> delay you referred to in the description of that patch.

You cannot even be forthcoming about the technical merit of a change you
authored (commit 6077c2d70) that I'm left to clean up in the face of
performance bottlenecks it unwittingly introduced?  If you were being
honest: you'd grant that the random delay of 100ms is utterly baseless
(not to mention that kicking the queue like you did is a complete
hack).  So that 100ms delay is what my dm-4.16 commit is talking about.
Not the fact that blk-mq wasn't using kblockd_mod_delayed_work_on().

Commit 6077c2d70 was wrong and never should've been introduced!  And you
even said that reintroducing commit 6077c2d70 didn't fix the regression
Ming's V3 fully corrects.  So why would I revert eliminating it exactly?

You aren't doing yourself any justice.  We're all smart enough to see
through your misdirection and labored attempt to cover your tracks.

In the past you've been very helpful with highly reasoned and technical
contributions.  But you get unhinged more often than not when it comes
to Ming's work.  That has made you one of the most duplicitous engineers
I've witnessed and had to deal with directly.  It is like Dr Jeykll and
Mr Hyde with you.

So I'm done treating you with kid gloves.  You are incapable of
responding favorably to subtle social cues or even outright pleas for
more productive behavior:
https://marc.info/?l=linux-scsi&m=151011037229460&w=2

Otherwise I wouldn't be having to respond to you right now!

> In case the above would not yet have addressed the technical issue(s) you
> are facing, I would really appreciate it if you would stop using insults and
> if you could explain what technical problems you are facing. Isn't that what
> the Linux kernel is about, namely about collaboration between technical
> people from different organizations? Isn't that what made the Linux kernel
> great?

Don't project onto me Bart.  This isn't the first time you've been
completely unprofessional and sadly it likely won't be the last.

Treat others how you want to be treated.  It really is that simple.

Bart Van Assche Jan. 28, 2018, 3 a.m. UTC | #12
On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> You cannot even be forthcoming about the technical merit of a change you
> authored (commit 6077c2d70) that I'm left to clean up in the face of
> performance bottlenecks it unwittingly introduced?  If you were being
> honest: you'd grant that the random delay of 100ms is utterly baseless
> (not to mention that kicking the queue like you did is a complete
> hack).  So that 100ms delay is what my dm-4.16 commit is talking about.

There are multiple errors in the above:
1. I have already explained in detail why commit 6077c2d70 is (a) correct
   and (b) essential. See e.g. https://www.redhat.com/archives/dm-devel/2018-January/msg00168.html.
2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
   unintended delays" applied, there is nothing to clean up anymore since
   that patch eliminates the queue delays that were triggered by
   blk_mq_delay_run_hw_queue().
3. You know that I'm honest. Suggesting that I'm not is wrong.
4. I never claimed that 100ms is the optimal value for the queue
   rerunning delay. I have already explained to you that I copied that
   value from older dm-rq code.

> Don't project onto me Bart.  This isn't the first time you've been
> completely unprofessional and sadly it likely won't be the last.

The only person who is behaving unprofessionally in this e-mail thread
is you.

Bart.
Mike Snitzer Jan. 28, 2018, 4:58 a.m. UTC | #13
On Sat, Jan 27 2018 at 10:00pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> > You cannot even be forthcoming about the technical merit of a change you
> > authored (commit 6077c2d70) that I'm left to clean up in the face of
> > performance bottlenecks it unwittingly introduced?  If you were being
> > honest: you'd grant that the random delay of 100ms is utterly baseless
> > (not to mention that kicking the queue like you did is a complete
> > hack).  So that 100ms delay is what my dm-4.16 commit is talking about.
> 
> There are multiple errors in the above:
> 1. I have already explained in detail why commit 6077c2d70 is (a) correct
>    and (b) essential. See e.g. https://www.redhat.com/archives/dm-devel/2018-January/msg00168.html.

And you'd be wrong.  Again.  We've already established that commit
6077c2d70 is _not_ essential.  Ming's V3, in conjunction with all the
changes that already went into block and DM for 4.16, prove that.

Yet here you go repeating yourself.  You are clinging to a patch that
absolutely had no business going in when it did.  And even when
presented with the reality that it was not the proper way to fix the
issue you observed you keep going back to it like you cured cancer with
a single line of code.

> 2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
>    unintended delays" applied, there is nothing to clean up anymore since
>    that patch eliminates the queue delays that were triggered by
>    blk_mq_delay_run_hw_queue().

The issue Ming fixed is that your random queue kicks break RESTART on
dm-mq mpath.

> 3. You know that I'm honest. Suggesting that I'm not is wrong.

No, I truly do _not_ know you're always honest.  You've conflated so
many details on this topic that it makes me have serious doubts.  You're
lashing out so much, in defense of your dm_mq_queue_rq delayed queue
kick, that you're missing that there is a genuine generic blk-mq problem
that is getting fixed in Ming's V3.

> 4. I never claimed that 100ms is the optimal value for the queue
>    rerunning delay. I have already explained to you that I copied that
>    value from older dm-rq code.

And I pointed out to you that you most certainly did _not_ copy the
value from elsewhere:
https://www.redhat.com/archives/dm-devel/2018-January/msg00202.html

The specific delay used isn't the issue; the need to kick the queue like
this is.  Ming's V3 avoids unnecessary kicking.

> > Don't project onto me Bart.  This isn't the first time you've been
> > completely unprofessional and sadly it likely won't be the last.
> 
> The only person who is behaving unprofessionally in this e-mail thread
> is you.

Bart, the number of emails exchanged on this topic has exhausted
_everyone_.  Emotions get tense when things don't evolve despite every
opportunity for progress to be realized, and when people cling to
solutions that are no longer relevant.  There is a very real need for
progress here.  So either get out of the way or suggest a specific series
of change(s) that you feel are better than Ming's V3.

With that, I'll also stop responding to your baiting now and forevermore
(e.g. "maintainers should this.. and maintainers should not that" or
"your employer is not investing adequately".  Next time you find
yourself typing drivel like that: spare us all and hit delete).

Ming Lei Jan. 28, 2018, 11:39 a.m. UTC | #14
On Sun, Jan 28, 2018 at 12:54:38AM +0000, Bart Van Assche wrote:
> On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> > Your contributions do _not_ make up for your inability to work well with
> > others.  Tiresome doesn't begin to describe these interactions.
> > 
> > Life is too short to continue enduring your bullshit.
> > 
> > But do let us know when you have something of substance to contribute
> > (hint: code talks).
> 
> Sorry Mike but what you wrote above is not only very gross but it is also
> wrong. What I did in my e-mail is to identify technical problems in Ming's

Bart,

You said my patch V3 introduces a race in SCSI, but you never explained
it. I have asked you at least twice and you never explained it, so could
you please address this in a technical way?


Thanks,
Ming

Bart Van Assche Jan. 28, 2018, 4:57 p.m. UTC | #15
On Sat, 2018-01-27 at 23:58 -0500, Mike Snitzer wrote:
> On Sat, Jan 27 2018 at 10:00pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> > > You cannot even be forthcoming about the technical merit of a change you
> > > authored (commit 6077c2d70) that I'm left to clean up in the face of
> > > performance bottlenecks it unwittingly introduced?  If you were being
> > > honest: you'd grant that the random delay of 100ms is utterly baseless
> > > (not to mention that kicking the queue like you did is a complete
> > > hack).  So that 100ms delay is what my dm-4.16 commit is talking about.
> > 
> > There are multiple errors in the above:
> > 1. I have already explained in detail why commit 6077c2d70 is (a) correct
> >    and (b) essential. See e.g. https://www.redhat.com/archives/dm-devel/2018-January/msg00168.html.
> 
> And you'd be wrong.  Again.  We've already established that commit
> 6077c2d70 is _not_ essential.  Ming's V3, in conjunction with all the
> changes that already went into block and DM for 4.16, prove that.

What I wrote was right when commit 6077c2d70 got introduced. My explanation
shows that at that time it was both correct and essential. Ming's v3 is in
my opinion not yet relevant to this discussion because it introduces new
bugs and so far no attempt has been made to address these bugs.

> > 2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
> >    unintended delays" applied, there is nothing to clean up anymore since
> >    that patch eliminates the queue delays that were triggered by
> >    blk_mq_delay_run_hw_queue().
> 
> The issue Ming fixed is that your random queue kicks break RESTART on
> dm-mq mpath.

That comment is too short to be comprehensible for anyone. Can you elaborate
this further?

> > 3. You know that I'm honest. Suggesting that I'm not is wrong.
> 
> No, I trully do _not_ know you're always honest.  You've conflated so
> many details on this topic that it makes me have serious doubts.  You're
> lashing out so much, in defense of your dm_mq_queue_rq delayed queue
> kick, that you're missing there is a genuine generic blk-mq problem that
> is getting fixed in Ming's V3.
> 
> [ ... ]
>
> Bart, the number of emails exchanged on this topic has exhausted
> _everyone_.  Emotions get tense when things don't evolve despite every
> opportunity for progress to be realized. When people cling to solutions
> that are no longer relevant.  There is a very real need for progress
> here.  So either get out of the way or suggest a specific series of
> change(s) that you feel are better than Ming's V3.

If you and Ming really care about the approach in the patch at the start
of this e-mail thread, what are you waiting for to address the technical
shortcomings that I pointed out?

> With that, I'll also stop responding to your baiting now and forevermore
> (e.g. "maintainers should this.. and maintainers should not that" or
> "your employer is not investing adequately".  Next time you find
> yourself typing drivel like that: spare us all and hit delete).

My opinion about what you wrote in this and the other e-mails you sent to
me during the past months is as follows:
1. That I have always done my best to provide all the relevant technical
   information.
2. That my focus was on the technical aspects of the discussion.
3. That you did not try to reach a consensus but rather to dominate the
   discussion.
4. That the tone of your e-mails was very aggressive.
5. That you insulted me several times personally.

I believe that your behavior is a violation of the kernel code of conflict
and sufficient to file a complaint with the TAB. From
Documentation/process/code-of-conflict.rst:

"If however, anyone feels personally abused, threatened, or otherwise
uncomfortable due to this process, that is not acceptable.  If so,
please contact the Linux Foundation's Technical Advisory Board [ ... ]"

Additionally, since you are a U.S. citizen, you should know that what you
wrote in previous e-mails amounts to libel. A quote from a definition of
libel: "to publish in print (including pictures), writing or broadcast
through radio, television or film, an untruth about another which will do
harm to that person or his/her reputation, by tending to bring the target
into ridicule, hatred, scorn or contempt of others." You should be aware
that, since I live in the U.S., this makes it possible for me to sue you
for defamation and to ask for compensation.

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Bart Van Assche Jan. 28, 2018, 5:03 p.m. UTC | #16
On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> On Sat, Jan 27 2018 at  5:12pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > - The patch at the start of this thread introduces a regression in the
> >   SCSI core, namely a queue stall if a request completion occurs concurrently
> >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
> 
> And why is this supposed race unique to SCSI core?  

That race is not unique to the SCSI core. It is possible that the patch at
the start of this thread introduces the same race in other block drivers
but I have not yet tried to verify that.

> Fact is Ming dutifully implemented what Jens suggested.

That's a misleading statement. Jens proposed to introduce a new status code
(which he called "NO_DEV_RESOURCE") but did not specify any of the other
aspects of Ming's patch. Jens definitely did not ask to introduce new race
conditions. See also https://www.spinics.net/lists/kernel/msg2703154.html

Bart.

Laurence Oberman Jan. 28, 2018, 5:26 p.m. UTC | #17
On Sun, 2018-01-28 at 16:57 +0000, Bart Van Assche wrote:
> On Sat, 2018-01-27 at 23:58 -0500, Mike Snitzer wrote:
> > On Sat, Jan 27 2018 at 10:00pm -0500,
> > Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > 
> > > On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> > > > You cannot even be forthcoming about the technical merit of a
> > > > change you
> > > > authored (commit 6077c2d70) that I'm left to clean up in the
> > > > face of
> > > > performance bottlenecks it unwittingly introduced?  If you were
> > > > being
> > > > honest: you'd grant that the random delay of 100ms is utterly
> > > > baseless
> > > > (not to mention that kicking the queue like you did is a
> > > > complete
> > > > hack).  So that 100ms delay is what my dm-4.16 commit is
> > > > talking about.
> > > 
> > > There are multiple errors in the above:
> > > 1. I have already explained in detail why commit 6077c2d70 is (a)
> > > correct
> > >    and (b) essential. See e.g. https://www.redhat.com/archives/dm
> > > -devel/2018-January/msg00168.html.
> > 
> > And you'd be wrong.  Again.  We've already established that commit
> > 6077c2d70 is _not_ essential.  Ming's V3, in conjunction with all
> > the
> > changes that already went into block and DM for 4.16, prove that.
> 
> What I wrote was right when commit 6077c2d70 got introduced. My
> explanation
> shows that at that time it was both correct and essential. Ming's v3 is
> in my
> opinion not relevant yet to this discussion because it introduces new
> bugs
> and so far no attempt has been made to address these bugs.
> 
> > > 2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue()
> > > introduces
> > >    unintended delays" applied, there is nothing to clean up
> > > anymore since
> > >    that patch eliminates the queue delays that were triggered by
> > >    blk_mq_delay_run_hw_queue().
> > 
> > The issue Ming fixed is that your random queue kicks break RESTART
> > on
> > dm-mq mpath.
> 
> That comment is too short to be comprehensible for anyone. Can you
> elaborate
> this further?
> 
> > > 3. You know that I'm honest. Suggesting that I'm not is wrong.
> > 
> > No, I truly do _not_ know you're always honest.  You've conflated
> > so
> > many details on this topic that it makes me have serious
> > doubts.  You're
> > lashing out so much, in defense of your dm_mq_queue_rq delayed
> > queue
> > kick, that you're missing there is a genuine generic blk-mq problem
> > that
> > is getting fixed in Ming's V3.
> > 
> > [ ... ]
> > 
> > Bart, the number of emails exchanged on this topic has exhausted
> > _everyone_.  Emotions get tense when things don't evolve despite
> > every
> > opportunity for progress to be realized. When people cling to
> > solutions
> > that are no longer relevant.  There is a very real need for
> > progress
> > here.  So either get out of the way or suggest a specific series of
> > change(s) that you feel are better than Ming's V3.
> 
> If you and Ming really care about the approach in the patch at the
> start
> of this e-mail thread, what are you waiting for to address the
> technical
> shortcomings that I pointed out?
> 
> > With that, I'll also stop responding to your baiting now and
> > forevermore
> > (e.g. "maintainers should this.. and maintainers should not that"
> > or
> > "your employer is not investing adequately".  Next time you find
> > yourself typing drivel like that: spare us all and hit delete).
> 
> My opinion about what you wrote in this and the other e-mails you
> sent to
> me during the past months is as follows:
> 1. That I have always done my best to provide all the relevant
> technical
>    information.
> 2. That my focus was on the technical aspects of the discussion.
> 3. That you did not try to reach a consensus but rather to dominate
> the
>    discussion.
> 4. That the tone of your e-mails was very aggressive.
> 5. That you insulted me several times personally.
> 
> I believe that your behavior is a violation of the kernel code of
> conflict
> and sufficient to file a complaint with the TAB. From
> Documentation/process/code-of-conflict.rst:
> 
> "If however, anyone feels personally abused, threatened, or otherwise
> uncomfortable due to this process, that is not acceptable.  If so,
> please contact the Linux Foundation's Technical Advisory Board [ ...
> ]"
> 
> Additionally, since you are a U.S. Citizen, you should know that what
> you
> wrote in previous e-mails amounts to libel. A quote from a definition
> of
> libel: "to publish in print (including pictures), writing or
> broadcast
> through radio, television or film, an untruth about another which
> will do
> harm to that person or his/her reputation, by tending to bring the
> target
> into ridicule, hatred, scorn or contempt of others." You should be
> aware
> since I live in the U.S. that this makes it possible for me to sue
> you for
> defamation and to ask for a compensation.
> 
> Bart.

Folks

I think we all need to take some time, take a breath, and see how
we can figure this out amicably.

Everybody on this list contributes a huge amount of professional and
personal time to improving the kernel. I for one have benefited a huge
amount from assistance given to me in the past by Bart, Ming and Mike
(and others), and I am not about to take sides here.

While we cannot always be expected to agree we need to find a way to
get over these types of disagreements.

While Ming, Bart and Mike all have strong opinions, this has to move
forward, so what is the best tie breaker here? Do the folks that
know this code really well take a vote? Majority wins for now and we
have progress.
Should this turn out to be an issue in the future, then it gets
addressed by reverts/changes.

The changes have been tested, but there is clearly some exposure or Bart
would not be concerned the way he is about this. The amount of exposure
should govern whether this gets accepted or not.

Being a very, very junior member of this very talented list, I for one
can only contribute as far as testing is concerned, and generic testing
of this has been very well behaved, so perhaps, while we have exposure,
it's an acceptable risk.

I hope I am not speaking out of turn.

Sincerely
Laurence

Ming Lei Jan. 29, 2018, 2:14 a.m. UTC | #18
On Sun, Jan 28, 2018 at 05:03:49PM +0000, Bart Van Assche wrote:
> On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> > On Sat, Jan 27 2018 at  5:12pm -0500,
> > Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > > - The patch at the start of this thread introduces a regression in the
> > >   SCSI core, namely a queue stall if a request completion occurs concurrently
> > >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
> > 
> > And why is this supposed race unique to SCSI core?  
> 
> That race is not unique to the SCSI core. It is possible that the patch at
> the start of this thread introduces the same race in other block drivers
> but I have not yet tried to verify that.
> 
> > Fact is Ming dutifully implemented what Jens suggested.
> 
> That's a misleading statement. Jens proposed to introduce a new status code
> (which he called "NO_DEV_RESOURCE") but did not specify any of the other
> aspects of Ming's patch. Jens definitely did not ask to introduce new race
> conditions. See also https://www.spinics.net/lists/kernel/msg2703154.html

I don't see the difference between my patch and Jens's suggestion. My
patch just introduces the new status code "BLK_STS_DEV_RESOURCE", which
is returned to blk-mq if the driver can make sure that the queue will be
run after the resource becomes available, and keeps the current code of
'BLK_STS_RESOURCE'. This way is simpler, since most in-tree users of
'BLK_STS_RESOURCE' do need to run the queue via
blk_mq_delay_run_hw_queue(hctx, 10). Right?

You just mentioned that my patch introduces a race, but never explained what
the race is. How is it triggered? When does it occur?

Now it is the 3rd time that I am asking you to explain the race introduced
by my patch V3. Is this your technical way to review/comment on a patch?

Thanks,
Ming

Bart Van Assche Jan. 29, 2018, 4:48 p.m. UTC | #19
On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> Not to mention, the request isn't added to the dispatch list yet in
> .queue_rq(); strictly speaking, it is not correct to call
> blk_mq_delay_run_hw_queue() in .queue_rq(), so the current block layer
> API can't handle it well enough.

I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
inside .queue_rq(). Additionally, I have already explained to you in
previous e-mails why it's fine to call that function from inside .queue_rq():
- Nobody has ever observed the race you described because the time after
  which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
  the time during which that race exists.
- It is easy to fix this race inside the block layer, namely by using
  call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
  postpone the queue rerunning until after the request has been added back to
  the dispatch list.

> > - The patch at the start of this thread introduces a regression in the
> >   SCSI core, namely a queue stall if a request completion occurs concurrently
> >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
> 
> This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> to blk-mq, again, please explain it in detail how this patch V3 introduces this
> regression on SCSI.

Please reread the following message: https://marc.info/?l=dm-devel&m=151672480107560.

Thanks,

Bart.

Ming Lei Jan. 30, 2018, 1:07 a.m. UTC | #20
On Mon, Jan 29, 2018 at 04:48:31PM +0000, Bart Van Assche wrote:
> On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> > Not to mention, the request isn't added to the dispatch list yet in
> > .queue_rq(); strictly speaking, it is not correct to call
> > blk_mq_delay_run_hw_queue() in .queue_rq(), so the current block
> > layer API can't handle it well enough.
> 
> I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
> inside .queue_rq(). Additionally, I have already explained to you in
> previous e-mails why it's fine to call that function from inside .queue_rq():
> - Nobody has ever observed the race you described because the time after

'Nobody observed it' does not mean the race doesn't exist, since I can
explain it clearly.

>   which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
>   the time during which that race exists.

Again, it is fragile to depend on the delay (10ms, 3ms, or whatever) to make
everything correct; why can't we make it robust, like the approach I took?

You cannot guarantee that the request handled in .queue_rq() has been added
to hctx->dispatch when you call blk_mq_delay_run_hw_queue(), because the
request is added to hctx->dispatch only after .queue_rq() returns to
blk_mq_dispatch_rq_list(), which happens after the call to
blk_mq_delay_run_hw_queue(). Right?

Especially now that kblockd_mod_delayed_work_on() has been changed to use
mod_delayed_work_on(), which may run the queue from another context before
the timer expires, an IO hang can still be triggered, since the queue run
may miss the request that has yet to be added to hctx->dispatch.

> - It is easy to fix this race inside the block layer, namely by using
>   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
>   postpone the queue rerunning until after the request has been added back to
>   the dispatch list.

That is easy to say; can you cook a patch and fix all drivers first?

Then let's compare which patch is simpler and better.

> 
> > > - The patch at the start of this thread introduces a regression in the
> > >   SCSI core, namely a queue stall if a request completion occurs concurrently
> > >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

> > 
> > This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> > to blk-mq, again, please explain it in detail how this patch V3 introduces this
> > regression on SCSI.
> 
> Please reread the following message: https://marc.info/?l=dm-devel&m=151672480107560.

OK, the following message is copied from the above link:

> My opinion about this patch is as follows:
> * Changing a blk_mq_delay_run_hw_queue() call followed by return
>   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
>   a guaranteed queue rerun into a queue rerun that may or may not happen
>   depending on whether or not multiple queue runs happen simultaneously.
> * This change makes block drivers less readable because anyone who encounters
>   BLK_STS_DEV_RESOURCE will have to look up its definition to figure out what
>   it's meaning is.
> * We don't need the new status code BLK_STS_DEV_RESOURCE because a delayed
>   queue run can be implemented easily with the existing block layer API.

Later, you admitted you understood the patch wrong, so here is your reply
again, from https://marc.info/?l=dm-devel&m=151672694508389&w=2

> On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> > > My opinion about this patch is as follows:
> > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it changes
> > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > >   depending on whether or not multiple queue runs happen simultaneously.
> > 
> > You may not understand the two:
> > 
> > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > and using which one depends on SCHED_RESTART.
> > 
> > 2) if driver can make sure the queue will be rerun after some resource
> > is available, either by itself or by blk-mq, it will return BLK_STS_DEV_RESOURCE
> > 
> > So what is wrong with this way?
> 
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
> by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> race condition in code where there was no race condition.

You still haven't explained the race condition here, right?

I can explain it again, but I am losing patience with you if you continue
to refuse to answer my question, just as you refused to provide the debugfs
log when you reported the issue before.

When turning BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, it is the driver's
responsibility to make sure that the queue will be run after the resource
becomes available. That point is documented clearly.

Especially on scsi_queue_rq():

-       if (atomic_read(&sdev->device_busy) == 0 &&
-           !scsi_device_blocked(sdev))
-           blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+       if (atomic_read(&sdev->device_busy) ||
+           scsi_device_blocked(sdev))
+           ret = BLK_STS_DEV_RESOURCE;

When either of the two conditions is true, SCSI knows that the queue will
be restarted in the future, and this patch just moves the
blk_mq_delay_run_hw_queue() call out of .queue_rq() into
blk_mq_dispatch_rq_list() to avoid the race mentioned above.

I know there is a race in scsi_queue_rq(): for example, all in-flight
requests may complete after atomic_read(&sdev->device_busy) in the following
code is checked. But this race exists both before and after my patch V3, and
the patch changes nothing about it, so I don't know how/why you concluded
that a race condition is introduced by my patch V3.

-       if (atomic_read(&sdev->device_busy) == 0 &&
-           !scsi_device_blocked(sdev))
-           blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
Bart Van Assche Jan. 30, 2018, 1:11 a.m. UTC | #21
On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> On Mon, Jan 29, 2018 at 04:48:31PM +0000, Bart Van Assche wrote:
> > - It is easy to fix this race inside the block layer, namely by using
> >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> >   postpone the queue rerunning until after the request has been added back to
> >   the dispatch list.
> 
> It is just easy to say, can you cook a patch and fix all drivers first?

Please reread what I wrote. I proposed to change the blk_mq_delay_run_hw_queue()
IMPLEMENTATION such that the callers do not have to be modified.

> [ ... ] Later, you admitted you understood the patch wrong. [ ... ]

That's nonsense. I never wrote that.

Bart.

Ming Lei Jan. 30, 2018, 3:31 a.m. UTC | #22
On Tue, Jan 30, 2018 at 01:11:22AM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2018 at 04:48:31PM +0000, Bart Van Assche wrote:
> > > - It is easy to fix this race inside the block layer, namely by using
> > >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> > >   postpone the queue rerunning until after the request has been added back to
> > >   the dispatch list.
> > 
> > It is just easy to say, can you cook a patch and fix all drivers first?
> 
> Please reread what I wrote. I proposed to change the blk_mq_delay_run_hw_queue()
> IMPLEMENTATION such that the callers do not have to be modified.

Please take a look at the drivers: when BLK_STS_RESOURCE is returned, who
will call blk_mq_delay_run_hw_queue() for them?

> 
> > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
> 
> That's nonsense. I never wrote that.

Believe it or not, here is the link and your reply:

	https://marc.info/?l=dm-devel&m=151672694508389&w=2

> So what is wrong with this way?

>Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
>reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
>by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
>race condition in code where there was no race condition.
>
>Bart.
Bart Van Assche Jan. 30, 2018, 3:37 a.m. UTC | #23
On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> Please take a look at the drivers: when BLK_STS_RESOURCE is returned, who
> will call blk_mq_delay_run_hw_queue() for them?

As you know, the SCSI and dm drivers in kernel v4.15 already call that
function whenever necessary.

> > 
> > > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
> > 
> > That's nonsense. I never wrote that.
> 
> Believe it or not, follows the link and your reply:
> 
> 	https://marc.info/?l=dm-devel&m=151672694508389&w=2
> 
> > So what is wrong with this way?
> > Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> > reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
> > by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> > race condition in code where there was no race condition.

That meant that I made a mistake (typo) while typing the reply and nothing else.

Bart.

Ming Lei Jan. 30, 2018, 3:42 a.m. UTC | #24
On Tue, Jan 30, 2018 at 03:37:38AM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> > Please take a look at the drivers: when BLK_STS_RESOURCE is returned, who
> > will call blk_mq_delay_run_hw_queue() for them?
> 
> As you know the SCSI and dm drivers in kernel v4.15 already call that function
> whenever necessary.

We have concluded that this is a generic issue which needs a generic solution [1].

[1] https://marc.info/?l=linux-kernel&m=151638176727612&w=2

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE ./drivers/ | wc -l
43

Almost every driver needs this kind of change if BLK_STS_RESOURCE is returned
from the IO path, and the failure can be caused by kmalloc(GFP_ATOMIC), DMA
mapping, or whatever else.

Thanks,
Ming

diff mbox

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@  static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..55c52b9a0f30 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1257,8 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1282,18 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If a driver returns BLK_STS_RESOURCE and the SCHED_RESTART
+		 * bit is set, run the queue after 10ms to avoid an IO hang,
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1774,7 @@  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1837,7 @@  static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@  static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 				return BLK_STS_OK;
 			} else
 				/* requeue request */
-				return BLK_STS_RESOURCE;
+				return BLK_STS_DEV_RESOURCE;
 		}
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@  static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..bf0b840645cc 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@  static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 	return r;
@@ -500,7 +500,7 @@  static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..730a8574d2b7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,24 @@  typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * BLK_STS_DEV_RESOURCE: block layer and block driver specific status,
+ * usually returned from a driver to the block layer in the IO path.
+ *
+ * BLK_STS_DEV_RESOURCE is returned from a driver to the block layer if
+ * a device-related resource has run out, but the driver can guarantee
+ * that the queue will be rerun in the future to dispatch the current
+ * request once this resource becomes available.
+ *
+ * Difference from BLK_STS_RESOURCE:
+ * If the driver isn't sure that the queue will be run again after this
+ * kind of resource becomes available, return BLK_STS_RESOURCE instead,
+ * for example when a memory allocation, DMA mapping or other system
+ * resource allocation fails and the IO can't be submitted to the device.
+ *
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with