
[V3,2/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

Message ID 20180111060159.12991-3-ming.lei@redhat.com (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer

Commit Message

Ming Lei Jan. 11, 2018, 6:01 a.m. UTC
blk-mq will rerun the queue via RESTART or a dispatch wakeup after one
request is completed, so there is no need to wait an arbitrary amount of
time before requeuing; we should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, which improves I/O merging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
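
For context, dm-rq is what turns these DM_MAPIO_* return values into blk-mq
status codes. A simplified sketch of that path (close to, but not verbatim
from, drivers/md/dm-rq.c):

static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
				   const struct blk_mq_queue_data *bd)
{
	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(bd->rq);

	/*
	 * map_request() calls the target's .clone_and_map_rq(), i.e.
	 * multipath_clone_and_map() here. DM_MAPIO_DELAY_REQUEUE is
	 * handled inside map_request() by requeuing the original
	 * request after a delay.
	 */
	if (map_request(tio) == DM_MAPIO_REQUEUE) {
		/*
		 * Out of resources: BLK_STS_RESOURCE tells blk-mq to
		 * stop dequeuing from the I/O scheduler and to rerun
		 * the queue via SCHED_RESTART or a dispatch wakeup.
		 */
		return BLK_STS_RESOURCE;
	}
	return BLK_STS_OK;
}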

Comments

Bart Van Assche Jan. 12, 2018, 7:04 p.m. UTC | #1
On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86bf502a8e51..fcddf5a62581 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +			return DM_MAPIO_DELAY_REQUEUE;
>  		}
> -		return DM_MAPIO_DELAY_REQUEUE;
> +
> +		/*
> +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> +		 * we needn't deal with it via DELAY_REQUEUE. More
> +		 * importantly, we have to return DM_MAPIO_REQUEUE
> +		 * so that blk-mq gets the queue-busy feedback;
> +		 * otherwise I/O merging can be hurt.
> +		 */
> +		if (q->mq_ops)
> +			return DM_MAPIO_REQUEUE;
> +		else
> +			return DM_MAPIO_DELAY_REQUEUE;
>  	}

Sorry but the approach of this patch looks wrong to me. I'm afraid that this
approach will cause 100% CPU consumption if the underlying .queue_rq()
function returns BLK_STS_RESOURCE for another reason than what can be detected
by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
$ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
204

Isn't this a severe regression?
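
For reference, the path those busy codes take: scsi_queue_rq() maps an LLD's
SCSI_MLQUEUE_*_BUSY return into BLK_STS_RESOURCE, a condition .get_budget()
cannot detect up front. Roughly (simplified, not verbatim from
drivers/scsi/scsi_lib.c):

static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(bd->rq);
	int reason;

	/* The device budget was already taken via .get_budget(). */
	reason = scsi_dispatch_cmd(cmd);  /* ends up in the LLD's .queuecommand() */
	switch (reason) {
	case 0:
		return BLK_STS_OK;
	case SCSI_MLQUEUE_HOST_BUSY:
	case SCSI_MLQUEUE_DEVICE_BUSY:
	case SCSI_MLQUEUE_TARGET_BUSY:
		/*
		 * Busy for a reason .get_budget() could not see; blk-mq
		 * only learns about it from this return value.
		 */
		return BLK_STS_RESOURCE;
	default:
		return BLK_STS_IOERR;
	}
}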

Bart.

Ming Lei Jan. 13, 2018, 1:29 a.m. UTC | #2
On Fri, Jan 12, 2018 at 07:04:28PM +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86bf502a8e51..fcddf5a62581 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> >  		if (queue_dying) {
> >  			atomic_inc(&m->pg_init_in_progress);
> >  			activate_or_offline_path(pgpath);
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  		}
> > -		return DM_MAPIO_DELAY_REQUEUE;
> > +
> > +		/*
> > +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> > +		 * we needn't deal with it via DELAY_REQUEUE. More
> > +		 * importantly, we have to return DM_MAPIO_REQUEUE
> > +		 * so that blk-mq gets the queue-busy feedback;
> > +		 * otherwise I/O merging can be hurt.
> > +		 */
> > +		if (q->mq_ops)
> > +			return DM_MAPIO_REQUEUE;
> > +		else
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  	}
> 
> Sorry but the approach of this patch looks wrong to me. I'm afraid that this
> approach will cause 100% CPU consumption if the underlying .queue_rq()
> function returns BLK_STS_RESOURCE for another reason than what can be detected
> by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
> implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
> $ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
> 204
> 
> Isn't this a severe regression?

No, it is not.

This DM_MAPIO_REQUEUE will be converted to BLK_STS_RESOURCE in dm-rq and
then returned to blk-mq. That means that when we run out of resources for
dispatching the underlying request, we tell blk-mq the truth. This approach
is already used by other blk-mq drivers, such as NVMe; more importantly, we
have to provide this real feedback to blk-mq to fix the performance issue.

The blk_get_request(GFP_ATOMIC) call is just like a kmalloc(GFP_ATOMIC),
and that memory is shared system-wide too. There are lots of
kmalloc(GFP_ATOMIC) calls in the NVMe I/O path; when one fails,
BLK_STS_RESOURCE is returned to blk-mq.
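
Concretely, that allocation sits just above the patched hunk in
multipath_clone_and_map(); with the patch applied, the failure path looks
roughly like this (simplified, not verbatim):

	clone = blk_get_request(q, rq->cmd_flags | REQ_NOWAIT, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/* No free request in the underlying queue right now. */
		if (queue_dying) {
			atomic_inc(&m->pg_init_in_progress);
			activate_or_offline_path(pgpath);
			return DM_MAPIO_DELAY_REQUEUE;
		}
		/* blk-mq: report the truth and let SCHED_RESTART rerun us. */
		if (q->mq_ops)
			return DM_MAPIO_REQUEUE;
		else
			return DM_MAPIO_DELAY_REQUEUE;
	}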

blk-mq has a built-in SCHED_RESTART mechanism to handle out-of-resource
conditions when BLK_STS_RESOURCE is observed.
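
In outline (a simplified sketch of block/blk-mq.c and block/blk-mq-sched.c,
not the verbatim source):

	/* Dispatch side, in blk_mq_dispatch_rq_list(): on BLK_STS_RESOURCE
	 * the request is parked on hctx->dispatch and the hctx is marked. */
	if (ret == BLK_STS_RESOURCE) {
		blk_mq_sched_mark_restart_hctx(hctx);	/* sets BLK_MQ_S_SCHED_RESTART */
		list_add(&rq->queuelist, &hctx->dispatch);
		break;
	}

	/* Completion side, via blk_mq_sched_restart(): a completing request
	 * clears the mark and reruns the hardware queue, so no artificial
	 * delay is needed. */
	if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		blk_mq_run_hw_queue(hctx, true);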

We implement .get_budget() only in SCSI; other blk-mq drivers (NVMe, ...)
often return BLK_STS_RESOURCE to blk-mq too, so with regard to this usage
there isn't any difference between dm-mpath and the other cases, right?


Thanks,
Ming


Patch

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86bf502a8e51..fcddf5a62581 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 		if (queue_dying) {
 			atomic_inc(&m->pg_init_in_progress);
 			activate_or_offline_path(pgpath);
+			return DM_MAPIO_DELAY_REQUEUE;
 		}
-		return DM_MAPIO_DELAY_REQUEUE;
+
+		/*
+		 * blk-mq's SCHED_RESTART can cover this requeue, so
+		 * we needn't deal with it via DELAY_REQUEUE. More
+		 * importantly, we have to return DM_MAPIO_REQUEUE
+		 * so that blk-mq gets the queue-busy feedback;
+		 * otherwise I/O merging can be hurt.
+		 */
+		if (q->mq_ops)
+			return DM_MAPIO_REQUEUE;
+		else
+			return DM_MAPIO_DELAY_REQUEUE;
 	}
 	clone->bio = clone->biotail = NULL;
 	clone->rq_disk = bdev->bd_disk;