
[V4] scsi: core: only re-run queue in scsi_end_request() if device queue is busy

Message ID 20200817100840.2496976-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Series [V4] scsi: core: only re-run queue in scsi_end_request() if device queue is busy

Commit Message

Ming Lei Aug. 17, 2020, 10:08 a.m. UTC
Currently the request queue is run unconditionally in scsi_end_request()
if both the target queue and the host queue are ready. Instead, the
request queue should be re-run only after this device queue becomes
busy, so that only this LUN is restarted.

Recently Long Li reported that the cost of running the queue may be very
heavy in case of high queue depth. So improve this situation by only
running the request queue when this LUN is busy.

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Long Li <longli@microsoft.com>
Cc: John Garry <john.garry@huawei.com>
Cc: linux-block@vger.kernel.org
Reported-by: Long Li <longli@microsoft.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V4:
	- fix one race reported by Kashyap, and simplify the implementation
	a bit; also passes both of Kashyap's functional and performance tests
V3:
	- add one smp_mb() in scsi_mq_get_budget() and comment

V2:
	- commit log changes only, no code changes
	- add reported-by tag

 drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
 include/scsi/scsi_device.h |  1 +
 2 files changed, 49 insertions(+), 3 deletions(-)

Comments

John Garry Aug. 20, 2020, 2:22 p.m. UTC | #1
On 17/08/2020 11:08, Ming Lei wrote:
> Currently the request queue is run unconditionally in scsi_end_request()
> if both the target queue and the host queue are ready. Instead, the
> request queue should be re-run only after this device queue becomes
> busy, so that only this LUN is restarted.
> 
> Recently Long Li reported that the cost of running the queue may be very
> heavy in case of high queue depth. So improve this situation by only
> running the request queue when this LUN is busy.
> 
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Long Li <longli@microsoft.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: linux-block@vger.kernel.org
> Reported-by: Long Li <longli@microsoft.com>
> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

FWIW, this looks ok to me, and testing on scsi_debug showed less time in 
blk_mq_run_hw_queues() [for reduced queue depth]:

Reviewed-by: John Garry <john.garry@huawei.com>

thanks
Long Li Sept. 1, 2020, 1:18 a.m. UTC | #2
>Subject: Re: [PATCH V4] scsi: core: only re-run queue in scsi_end_request() if
>device queue is busy
>
>On 17/08/2020 11:08, Ming Lei wrote:
>> Currently the request queue is run unconditionally in
>> scsi_end_request() if both the target queue and the host queue are
>> ready. Instead, the request queue should be re-run only after this
>> device queue becomes busy, so that only this LUN is restarted.
>>
>> Recently Long Li reported that the cost of running the queue may be
>> very heavy in case of high queue depth. So improve this situation by
>> only running the request queue when this LUN is busy.
>>
>> Cc: Ewan D. Milne <emilne@redhat.com>
>> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
>> Cc: Hannes Reinecke <hare@suse.de>
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: Long Li <longli@microsoft.com>
>> Cc: John Garry <john.garry@huawei.com>
>> Cc: linux-block@vger.kernel.org
>> Reported-by: Long Li <longli@microsoft.com>
>> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
>FWIW, this looks ok to me, and testing on scsi_debug showed less time in
>blk_mq_run_hw_queues() [for reduced queue depth]:
>
>Reviewed-by: John Garry <john.garry@huawei.com>
>
>thanks

On Azure Standard D64s v3 (64 vcpus, 256 GiB memory), we see better CPU utilization with the patch.

Tested-by: Long Li <longli@microsoft.com>
Martin K. Petersen Sept. 1, 2020, 2:38 a.m. UTC | #3
> On Azure Standard D64s v3 (64 vcpus, 256 GiB memory), we see better
> CPU utilization with the patch.
>
> Tested-by: Long Li <longli@microsoft.com>

Since this is a core change I'd like to see another review tag.

Thanks!
Hannes Reinecke Sept. 1, 2020, 6:54 a.m. UTC | #4
On 8/17/20 12:08 PM, Ming Lei wrote:
> Currently the request queue is run unconditionally in scsi_end_request()
> if both the target queue and the host queue are ready. Instead, the
> request queue should be re-run only after this device queue becomes
> busy, so that only this LUN is restarted.
> 
> Recently Long Li reported that the cost of running the queue may be very
> heavy in case of high queue depth. So improve this situation by only
> running the request queue when this LUN is busy.
> 
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Long Li <longli@microsoft.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: linux-block@vger.kernel.org
> Reported-by: Long Li <longli@microsoft.com>
> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V4:
> 	- fix one race reported by Kashyap, and simplify the implementation
> 	a bit; also passes both of Kashyap's functional and performance tests
> V3:
> 	- add one smp_mb() in scsi_mq_get_budget() and comment
> 
> V2:
> 	- commit log changes only, no code changes
> 	- add reported-by tag
> 
>   drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
>   include/scsi/scsi_device.h |  1 +
>   2 files changed, 49 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Bart Van Assche Sept. 2, 2020, 2:40 a.m. UTC | #5
On 2020-08-17 03:08, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 7c6dd6f75190..a62c29058d26 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -551,8 +551,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {

Has this patch been verified with checkpatch? Checkpatch should have warned
about the unbalanced braces.

> +		/*
> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */

Hmm ... I don't see what orders the atomic_dec(&sdev->device_busy) from
scsi_device_unbusy() and the atomic_read() below? I don't think that the block
layer guarantees ordering of these two memory accesses since both accesses
happen in the request completion path.

> +		int old = atomic_read(&sdev->restarts);
> +
> +		if (old) {
> +			/*
> +			 * ->restarts has to be kept as non-zero if there is
> +			 *  new budget contention comes.

There are two verbs in the above sentence ("is" and "comes"). Please remove
"comes" such that the sentence becomes grammatically correct.

> +			 *
> +			 *  No need to run queue when either another re-run
> +			 *  queue wins in updating ->restarts or one new budget
> +			 *  contention comes.
> +			 */
> +			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
> +				blk_mq_run_hw_queues(sdev->request_queue, true);
> +		}
> +	}

Please combine the two if-statements into a single if-statement using "&&"
to keep the indentation level low.

> @@ -1611,8 +1630,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
>  static bool scsi_mq_get_budget(struct request_queue *q)
>  {
>  	struct scsi_device *sdev = q->queuedata;
> +	int ret = scsi_dev_queue_ready(q, sdev);
> +
> +	if (ret)
> +		return true;
> +
> +	atomic_inc(&sdev->restarts);
>  
> -	return scsi_dev_queue_ready(q, sdev);
> +	/*
> +	 * Order writing .restarts and reading .device_busy, and make sure
> +	 * .restarts is visible to scsi_end_request(). Its pair is implied by
> +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> +	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
> +	 *
> +	 */
> +	smp_mb__after_atomic();

Barriers do not guarantee "is visible to". Barriers enforce ordering of memory
accesses performed by a certain CPU core. Did you perhaps mean that
sdev->restarts must be incremented before the code below reads sdev->device busy?

> +	/*
> +	 * If all in-flight requests originated from this LUN are completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
> +	 * soon. Otherwise, completion of one of these request will observe
> +	 * the .restarts flag, and the request queue will be run for handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
> +	return false;
>  }

Thanks,

Bart.
Ming Lei Sept. 2, 2020, 7:01 a.m. UTC | #6
On Tue, Sep 01, 2020 at 07:40:54PM -0700, Bart Van Assche wrote:
> On 2020-08-17 03:08, Ming Lei wrote:
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 7c6dd6f75190..a62c29058d26 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -551,8 +551,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
> >  	if (scsi_target(sdev)->single_lun ||
> >  	    !list_empty(&sdev->host->starved_list))
> >  		kblockd_schedule_work(&sdev->requeue_work);
> > -	else
> > -		blk_mq_run_hw_queues(sdev->request_queue, true);
> > +	else {
> 
> Has this patch been verified with checkpatch? Checkpatch should have warned
> about the unbalanced braces.

[linux]$ ./scripts/checkpatch.pl -g HEAD
total: 0 errors, 0 warnings, 71 lines checked

Commit 0cbe51645b54 ("scsi: core: only re-run queue in scsi_end_request() if device queue is busy") has no obvious style problems and is ready for submission.

> 
> > +		/*
> > +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> > +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> > +		 * and reading sdev->restarts.
> > +		 */
> 
> Hmm ... I don't see what orders the atomic_dec(&sdev->device_busy) from
> scsi_device_unbusy() and the atomic_read() below? I don't think that the block
> layer guarantees ordering of these two memory accesses since both accesses
> happen in the request completion path.

__blk_mq_end_request() is called between scsi_device_unbusy() and
scsi_run_queue_async(). When __blk_mq_end_request() is called, the
request really has completed, because SCMD_STATE_COMPLETE covers the
race between timeout and normal completion, so:

1) either __blk_mq_free_request() is called, and smp_mb__after_atomic()
is implied in sbitmap_queue_clear(), called from blk_mq_put_tag();

2) or rq->end_io() is called. We don't have too many ->end_io()
implementations. Either wake_up_process() or blk_mq_free_request() is
called in ->end_io(), so a memory barrier is implied.

> 
> > +		int old = atomic_read(&sdev->restarts);
> > +
> > +		if (old) {
> > +			/*
> > +			 * ->restarts has to be kept as non-zero if there is
> > +			 *  new budget contention comes.
> 
> There are two verbs in the above sentence ("is" and "comes"). Please remove
> "comes" such that the sentence becomes grammatically correct.
> 
> > +			 *
> > +			 *  No need to run queue when either another re-run
> > +			 *  queue wins in updating ->restarts or one new budget
> > +			 *  contention comes.
> > +			 */
> > +			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
> > +				blk_mq_run_hw_queues(sdev->request_queue, true);
> > +		}
> > +	}
> 
> Please combine the two if-statements into a single if-statement using "&&"
> to keep the indentation level low.
> 
> > @@ -1611,8 +1630,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
> >  static bool scsi_mq_get_budget(struct request_queue *q)
> >  {
> >  	struct scsi_device *sdev = q->queuedata;
> > +	int ret = scsi_dev_queue_ready(q, sdev);
> > +
> > +	if (ret)
> > +		return true;
> > +
> > +	atomic_inc(&sdev->restarts);
> >  
> > -	return scsi_dev_queue_ready(q, sdev);
> > +	/*
> > +	 * Order writing .restarts and reading .device_busy, and make sure
> > +	 * .restarts is visible to scsi_end_request(). Its pair is implied by
> > +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> > +	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
> > +	 *
> > +	 */
> > +	smp_mb__after_atomic();
> 
> Barriers do not guarantee "is visible to". Barriers enforce ordering of memory
> accesses performed by a certain CPU core. Did you perhaps mean that
> sdev->restarts must be incremented before the code below reads sdev->device busy?

Right, ->restarts has to be incremented before reading sdev->device_busy.


Thanks, 
Ming
Ewan Milne Sept. 8, 2020, 7:48 p.m. UTC | #7
So we ran a whole bunch of tests with this patch applied, to see if
it would cause any problems, and it ran OK.  The V5 patch just posted
yesterday does not appear to change the logic functionally.

See below for comments.

Reviewed-by: Ewan D. Milne <emilne@redhat.com>

On Mon, 2020-08-17 at 18:08 +0800, Ming Lei wrote:
> Currently the request queue is run unconditionally in scsi_end_request()
> if both the target queue and the host queue are ready. Instead, the
> request queue should be re-run only after this device queue becomes
> busy, so that only this LUN is restarted.
> 
> Recently Long Li reported that the cost of running the queue may be very
> heavy in case of high queue depth. So improve this situation by only
> running the request queue when this LUN is busy.
> 
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Long Li <longli@microsoft.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: linux-block@vger.kernel.org
> Reported-by: Long Li <longli@microsoft.com>
> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V4:
> 	- fix one race reported by Kashyap, and simplify the implementation
> 	a bit; also passes both of Kashyap's functional and performance tests
> V3:
> 	- add one smp_mb() in scsi_mq_get_budget() and comment
> 
> V2:
> 	- commit log changes only, no code changes
> 	- add reported-by tag
> 
>  drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
>  include/scsi/scsi_device.h |  1 +
>  2 files changed, 49 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 7c6dd6f75190..a62c29058d26 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -551,8 +551,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {
> +		/*
> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */
> +		int old = atomic_read(&sdev->restarts);
> +
> +		if (old) {
> +			/*
> +			 * ->restarts has to be kept as non-zero if there is
> +			 *  new budget contention comes.
> +			 *
> +			 *  No need to run queue when either another re-run
> +			 *  queue wins in updating ->restarts or one new budget
> +			 *  contention comes.
> +			 */
> +			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
> +				blk_mq_run_hw_queues(sdev->request_queue, true);
> +		}
> +	}
>  }
>  
>  /* Returns false when no more bytes to process, true if there are more */
> @@ -1611,8 +1630,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
>  static bool scsi_mq_get_budget(struct request_queue *q)
>  {
>  	struct scsi_device *sdev = q->queuedata;
> +	int ret = scsi_dev_queue_ready(q, sdev);
> +
> +	if (ret)
> +		return true;

I think this should just be:

	if (scsi_dev_queue_ready(q, sdev))
		return true;

There's no particular reason to call the function in a local variable
initializer; it just makes the code less clear to me, and "ret" isn't
needed for any other reason.

> +
> +	atomic_inc(&sdev->restarts);
>  
> -	return scsi_dev_queue_ready(q, sdev);
> +	/*
> +	 * Order writing .restarts and reading .device_busy, and make sure
> +	 * .restarts is visible to scsi_end_request(). Its pair is implied by
> +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> +	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
> +	 *
> +	 */
> +	smp_mb__after_atomic();

I'm not sure how this helps, if you're trying to ensure consistency
between sdev->device_busy and sdev->restarts.  Another thread could be
in scsi_dev_queue_ready(), right?  And scsi_run_queue_async() in the
scsi_end_request() path is doing an atomic_read() of ->restarts anyway.

> +
> +	/*
> +	 * If all in-flight requests originated from this LUN are completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
> +	 * soon. Otherwise, completion of one of these request will observe
> +	 * the .restarts flag, and the request queue will be run for handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
> +	return false;
>  }
>  
>  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index bc5909033d13..1a5c9a3df6d6 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -109,6 +109,7 @@ struct scsi_device {
>  	atomic_t device_busy;		/* commands actually active on LLDD */
>  	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
>  
> +	atomic_t restarts;
>  	spinlock_t list_lock;
>  	struct list_head starved_entry;
>  	unsigned short queue_depth;	/* How deep of a queue we want */

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7c6dd6f75190..a62c29058d26 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -551,8 +551,27 @@  static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list))
 		kblockd_schedule_work(&sdev->requeue_work);
-	else
-		blk_mq_run_hw_queues(sdev->request_queue, true);
+	else {
+		/*
+		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
+		 * is for ordering writing .device_busy in scsi_device_unbusy()
+		 * and reading sdev->restarts.
+		 */
+		int old = atomic_read(&sdev->restarts);
+
+		if (old) {
+			/*
+			 * ->restarts has to be kept as non-zero if there is
+			 *  new budget contention comes.
+			 *
+			 *  No need to run queue when either another re-run
+			 *  queue wins in updating ->restarts or one new budget
+			 *  contention comes.
+			 */
+			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
+				blk_mq_run_hw_queues(sdev->request_queue, true);
+		}
+	}
 }
 
 /* Returns false when no more bytes to process, true if there are more */
@@ -1611,8 +1630,34 @@  static void scsi_mq_put_budget(struct request_queue *q)
 static bool scsi_mq_get_budget(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
+	int ret = scsi_dev_queue_ready(q, sdev);
+
+	if (ret)
+		return true;
+
+	atomic_inc(&sdev->restarts);
 
-	return scsi_dev_queue_ready(q, sdev);
+	/*
+	 * Order writing .restarts and reading .device_busy, and make sure
+	 * .restarts is visible to scsi_end_request(). Its pair is implied by
+	 * __blk_mq_end_request() in scsi_end_request() for ordering
+	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
+	 *
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * If all in-flight requests originated from this LUN are completed
+	 * before setting .restarts, sdev->device_busy will be observed as
+	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
+	 * soon. Otherwise, completion of one of these request will observe
+	 * the .restarts flag, and the request queue will be run for handling
+	 * this request, see scsi_end_request().
+	 */
+	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
+				!scsi_device_blocked(sdev)))
+		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
+	return false;
 }
 
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index bc5909033d13..1a5c9a3df6d6 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -109,6 +109,7 @@  struct scsi_device {
 	atomic_t device_busy;		/* commands actually active on LLDD */
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
+	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */