Message ID | 20180111060159.12991-3-ming.lei@redhat.com (mailing list archive) |
---|---|
State | Accepted, archived |
Delegated to: | Mike Snitzer |
On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 86bf502a8e51..fcddf5a62581 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  		if (queue_dying) {
>  			atomic_inc(&m->pg_init_in_progress);
>  			activate_or_offline_path(pgpath);
> +			return DM_MAPIO_DELAY_REQUEUE;
>  		}
> -		return DM_MAPIO_DELAY_REQUEUE;
> +
> +		/*
> +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> +		 * we needn't to deal with it by DELAY_REQUEUE. More
> +		 * importantly, we have to return DM_MAPIO_REQUEUE
> +		 * so that blk-mq can get the queue busy feedback,
> +		 * otherwise I/O merge can be hurt.
> +		 */
> +		if (q->mq_ops)
> +			return DM_MAPIO_REQUEUE;
> +		else
> +			return DM_MAPIO_DELAY_REQUEUE;
>  	}

Sorry but the approach of this patch looks wrong to me. I'm afraid that this
approach will cause 100% CPU consumption if the underlying .queue_rq()
function returns BLK_STS_RESOURCE for another reason than what can be detected
by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:

$ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
204

Isn't this a severe regression?

Bart.
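To make the scenario Bart describes concrete, here is a hedged sketch of a SCSI low-level driver's .queuecommand() hook. The driver structure and helpers (example_hba, example_hba_ring_full) are invented for illustration; only the .queuecommand() prototype, shost_priv() and the SCSI_MLQUEUE_HOST_BUSY return value are real SCSI midlayer interfaces. When the LLD returns this value, the midlayer reports a busy (BLK_STS_RESOURCE) status to blk-mq even though the earlier budget check succeeded.

```c
#include <linux/atomic.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Hypothetical per-HBA state; real drivers keep something like this in shost_priv(). */
struct example_hba {
	atomic_t free_ring_slots;
};

static bool example_hba_ring_full(struct example_hba *hba)
{
	return atomic_read(&hba->free_ring_slots) == 0;
}

static int example_lld_queuecommand(struct Scsi_Host *shost,
				    struct scsi_cmnd *cmd)
{
	struct example_hba *hba = shost_priv(shost);

	/*
	 * The HBA's command ring is momentarily full even though the host
	 * still had budget/tags available, so .get_budget() could not have
	 * predicted this. Asking the midlayer to retry makes it report a
	 * busy condition (BLK_STS_RESOURCE) back to blk-mq.
	 */
	if (example_hba_ring_full(hba))
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... build and post the command to the hardware here ... */
	atomic_dec(&hba->free_ring_slots);
	return 0;
}
```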
On Fri, Jan 12, 2018 at 07:04:28PM +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 14:01 +0800, Ming Lei wrote:
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86bf502a8e51..fcddf5a62581 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> >  		if (queue_dying) {
> >  			atomic_inc(&m->pg_init_in_progress);
> >  			activate_or_offline_path(pgpath);
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  		}
> > -		return DM_MAPIO_DELAY_REQUEUE;
> > +
> > +		/*
> > +		 * blk-mq's SCHED_RESTART can cover this requeue, so
> > +		 * we needn't to deal with it by DELAY_REQUEUE. More
> > +		 * importantly, we have to return DM_MAPIO_REQUEUE
> > +		 * so that blk-mq can get the queue busy feedback,
> > +		 * otherwise I/O merge can be hurt.
> > +		 */
> > +		if (q->mq_ops)
> > +			return DM_MAPIO_REQUEUE;
> > +		else
> > +			return DM_MAPIO_DELAY_REQUEUE;
> >  	}
>
> Sorry but the approach of this patch looks wrong to me. I'm afraid that this
> approach will cause 100% CPU consumption if the underlying .queue_rq()
> function returns BLK_STS_RESOURCE for another reason than what can be detected
> by the .get_budget() call. This can happen if e.g. a SCSI LLD .queuecommand()
> implementation returns SCSI_MLQUEUE_HOST_BUSY. Many SCSI LLDs can do this:
>
> $ git grep 'SCSI_MLQUEUE_[^_]*_BUSY' | wc -l
> 204
>
> Isn't this a severe regression?

No, it is not. This DM_MAPIO_REQUEUE is converted to BLK_STS_RESOURCE in
dm-rq, which is then returned to blk-mq. That means that when we run out of
resources for dispatching the underlying request, we tell blk-mq the truth.
This approach is already used by other blk-mq drivers such as NVMe; more
importantly, we have to provide this real feedback to blk-mq to fix the
performance issue.

blk_get_request(GFP_ATOMIC) is just like kmalloc(GFP_ATOMIC), and that memory
is shared system-wide too. There are lots of kmalloc(GFP_ATOMIC) allocations
in the NVMe I/O path, and when one of them fails, BLK_STS_RESOURCE is returned
to blk-mq. blk-mq has a built-in SCHED_RESTART mechanism to handle the
out-of-resource case when BLK_STS_RESOURCE is observed.

.get_budget() is only implemented for SCSI, and other blk-mq drivers (NVMe,
...) often return BLK_STS_RESOURCE to blk-mq too, so with respect to this
usage there isn't any difference between dm-mpath and the other cases, right?

Thanks,
Ming
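The return-value plumbing Ming refers to can be sketched as follows. This is a simplified, illustrative version of a dm-rq style ->queue_rq() handler, not the actual code in drivers/md/dm-rq.c; map_request_to_target() is a hypothetical stand-in for the call chain that ends in multipath_clone_and_map().

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/device-mapper.h>

/* Hypothetical stand-in for dm-rq's path down to the target's clone-and-map hook. */
int map_request_to_target(struct request *rq);

static blk_status_t dm_mq_queue_rq_sketch(struct blk_mq_hw_ctx *hctx,
					  const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);

	switch (map_request_to_target(rq)) {
	case DM_MAPIO_SUBMITTED:
	case DM_MAPIO_REMAPPED:
		return BLK_STS_OK;
	case DM_MAPIO_REQUEUE:
		/*
		 * Honest busy feedback: blk-mq keeps the request, stops
		 * dequeuing from the I/O scheduler (so merging still works)
		 * and reruns the hw queue via SCHED_RESTART when an
		 * in-flight request completes.
		 */
		return BLK_STS_RESOURCE;
	case DM_MAPIO_DELAY_REQUEUE:
		/* Old-style behaviour: retry after an arbitrary delay. */
		blk_mq_delay_run_hw_queue(hctx, 100 /* ms */);
		return BLK_STS_RESOURCE;
	default:
		return BLK_STS_IOERR;
	}
}
```

The contrast between the two requeue cases is the point of the patch: plain BLK_STS_RESOURCE without an artificial delay lets blk-mq's own restart logic decide when to retry, which is what dm-mpath switches to when q->mq_ops is set.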
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86bf502a8e51..fcddf5a62581 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -533,8 +533,20 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 		if (queue_dying) {
 			atomic_inc(&m->pg_init_in_progress);
 			activate_or_offline_path(pgpath);
+			return DM_MAPIO_DELAY_REQUEUE;
 		}
-		return DM_MAPIO_DELAY_REQUEUE;
+
+		/*
+		 * blk-mq's SCHED_RESTART can cover this requeue, so
+		 * we needn't to deal with it by DELAY_REQUEUE. More
+		 * importantly, we have to return DM_MAPIO_REQUEUE
+		 * so that blk-mq can get the queue busy feedback,
+		 * otherwise I/O merge can be hurt.
+		 */
+		if (q->mq_ops)
+			return DM_MAPIO_REQUEUE;
+		else
+			return DM_MAPIO_DELAY_REQUEUE;
 	}
 	clone->bio = clone->biotail = NULL;
 	clone->rq_disk = bdev->bd_disk;
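The comment in the patch leans on blk-mq's SCHED_RESTART behaviour. The following is a heavily condensed and renamed sketch of that mechanism, not the upstream blk_mq_dispatch_rq_list(): on busy feedback the dispatch loop puts the request back, stops dispatching, and arranges for a restart when an in-flight request completes, rather than spinning or sleeping for a fixed delay.

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/list.h>

static bool dispatch_sketch(struct blk_mq_hw_ctx *hctx, struct list_head *list)
{
	struct request *rq;
	blk_status_t ret;

	while (!list_empty(list)) {
		struct blk_mq_queue_data bd;

		rq = list_first_entry(list, struct request, queuelist);
		list_del_init(&rq->queuelist);
		bd.rq = rq;
		bd.last = list_empty(list);

		ret = hctx->queue->mq_ops->queue_rq(hctx, &bd);
		if (ret == BLK_STS_RESOURCE) {
			/* Driver is busy: keep the request and stop dispatching. */
			list_add(&rq->queuelist, list);
			/*
			 * Upstream sets this flag via blk_mq_sched_mark_restart_hctx();
			 * completion of an already in-flight request then reruns the
			 * hw queue (SCHED_RESTART), so no busy looping and no
			 * DELAY_REQUEUE-style timer is needed.
			 */
			set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
			return false;
		}
	}
	return true;
}
```

Under this model, returning DM_MAPIO_REQUEUE (and hence BLK_STS_RESOURCE) hands the retry decision to blk-mq, while the legacy request path in the patch keeps DM_MAPIO_DELAY_REQUEUE.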
blk-mq will rerun the queue via RESTART or a dispatch wakeup after one request
is completed, so it is not necessary to wait a random amount of time before
requeueing; we should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, which improves I/O merging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)