Message ID | 20191118103117.978-5-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | scsi: don't apply per-LUN queue depth for SSD | expand |
On 11/18/19 11:31 AM, Ming Lei wrote:
> SCSI core uses the atomic variable sdev->device_busy to track in-flight
> IO requests dispatched to a scsi device. IO requests may be submitted
> from any CPU, so the cost of maintaining this shared atomic counter can
> be very high on a big NUMA machine with lots of CPU cores.
>
> sdev->queue_depth is usually used for two purposes: 1) improve IO
> merging; 2) spread IO requests fairly among all LUNs.
>
> blk-mq already provides fair request allocation among all active shared
> request queues (LUNs), see hctx_may_queue().
>
> NVMe doesn't have such a per-request-queue (namespace) queue depth, so
> it is reasonable to ignore the limit for SCSI SSDs too. Also, IO merging
> doesn't play a big role in reaching top SSD performance.
>
> With this patch, the big cost of tracking in-flight per-LUN requests via
> an atomic variable can be saved.
>
> Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the
> queue before changing this flag.
>
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-sysfs.c       | 14 +++++++++++++-
>  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
>  drivers/scsi/sd.c       |  4 ++++
>  3 files changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fca9b158f4a0..9cc0dd5f88a1 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
>  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
>  #undef QUEUE_SYSFS_BIT_FNS
>
> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> +		const char *page, size_t count)
> +{
> +	ssize_t ret;
> +
> +	blk_mq_freeze_queue(q);
> +	ret = queue_store_nonrot(q, page, count);
> +	blk_mq_unfreeze_queue(q);
> +
> +	return ret;
> +}
> +
>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
>  {
>  	switch (blk_queue_zoned_model(q)) {
> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
>  static struct queue_sysfs_entry queue_nonrot_entry = {
>  	.attr = {.name = "rotational", .mode = 0644 },
>  	.show = queue_show_nonrot,
> -	.store = queue_store_nonrot,
> +	.store = queue_store_nonrot_freeze,
>  };
>
>  static struct queue_sysfs_entry queue_zoned_entry = {
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 62a86a82c38d..72655a049efd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
>  	if (starget->can_queue > 0)
>  		atomic_dec(&starget->target_busy);
>
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>
>  static void scsi_kick_queue(struct request_queue *q)
> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>
>  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
>  {
> -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> +	if (!blk_queue_nonrot(sdev->request_queue) &&
> +	    atomic_read(&sdev->device_busy) >= sdev->queue_depth)
>  		return true;
>  	if (atomic_read(&sdev->device_blocked) > 0)
>  		return true;
> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				  struct scsi_device *sdev)
>  {
>  	unsigned int busy;
> +	bool bypass = blk_queue_nonrot(sdev->request_queue);
>
> -	busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	if (!bypass)
> +		busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	else
> +		busy = 0;
>  	if (atomic_read(&sdev->device_blocked)) {
>  		if (busy)
>  			goto out_dec;
> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				   "unblocking device at zero depth\n"));
>  	}
>
> +	if (bypass)
> +		return 1;
> +
>  	if (busy >= sdev->queue_depth)
>  		goto out_dec;
>
>  	return 1;
> out_dec:
> -	atomic_dec(&sdev->device_busy);
> +	if (!bypass)
> +		atomic_dec(&sdev->device_busy);
>  	return 0;
>  }
>
> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct scsi_device *sdev = q->queuedata;
>
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>
>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) ||
> +		if ((!blk_queue_nonrot(sdev->request_queue) &&
> +		     atomic_read(&sdev->device_busy)) ||
>  		    scsi_device_blocked(sdev))
>  			ret = BLK_STS_DEV_RESOURCE;
>  		break;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 0744c34468e1..c3d47117700d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
>  	rot = get_unaligned_be16(&buffer[4]);
>
>  	if (rot == 1) {
> +		blk_mq_freeze_queue(q);
>  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> +		blk_mq_unfreeze_queue(q);
>  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
>  	}
>
> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 * cause this to be updated correctly and any device which
>  	 * doesn't support it should be treated as rotational.
>  	 */
> +	blk_mq_freeze_queue(q);
>  	blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> +	blk_mq_unfreeze_queue(q);
>  	blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
>
>  	if (scsi_device_supports_vpd(sdp)) {

Hmm. I must admit I patently don't like this explicit dependency on
blk_nonrot(). Having a conditional counter is just an open invitation to
getting things wrong...

I would far prefer it if we could delegate any queueing decision to the
elevators, and completely drop the device_busy counter for all devices.
But I do see that this might not be easy to do, nor necessarily
something which others agree on.

So alternatively, can't we make this a per-host flag instead of a
per-device characteristic? I'm just thinking of a storage array with a
massive cache; it's anyone's guess whether 'nonrot' is set there, but at
the same time these things can be fast, so they might profit from
dropping the device_busy counter there, too.

Cheers,

Hannes
On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>
> Hmm. I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation
> to getting things wrong...
>

This concerns me as well; it seems like the SCSI ML should have its own
per-device attribute if we actually need to control this per-device
instead of on a per-host or per-driver basis. And it seems like this is
something that is specific to high-performance drivers, so changing the
way this works for all drivers seems a bit much.

Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
but I dislike wrapping the examination of that and the queue flag in a
macro that makes it non-obvious how the behavior is affected. (Plus
Hannes just submitted the patches to remove .use_cmd_list, which was
another piece of ML functionality used by only a few drivers.)

Ming's patch does freeze the queue if NONROT is changed via sysfs, but
the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
clears it and then calls sd_read_block_characteristics(), which may set
it again. So it's not clear to me how reliable this is.

-Ewan
On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation
>> to getting things wrong...
>
> [...]
>
> Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics(), which may
> set it again. So it's not clear to me how reliable this is.

How about changing the default behavior to ignoring sdev->queue_depth,
and only honoring sdev->queue_depth if a "quirk" flag is set or if
overridden by setting a sysfs attribute? My understanding is that the
goal of the queue ramp-up/ramp-down mechanism is to reduce the number of
times a SCSI device responds "BUSY". An alternative to queue
ramp-up/ramp-down is a delayed queue re-run: I think if scsi_queue_rq()
returns BLK_STS_RESOURCE, the queue is only rerun after a delay. From
blk_mq_dispatch_rq_list():

	[ ... ]
	else if (needs_restart && (ret == BLK_STS_RESOURCE))
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

Bart.
On Wed, 2019-11-20 at 12:56 -0800, Bart Van Assche wrote:
> On 11/20/19 9:00 AM, Ewan D. Milne wrote:
>> [...]
>
> How about changing the default behavior to ignoring sdev->queue_depth,
> and only honoring sdev->queue_depth if a "quirk" flag is set or if
> overridden by setting a sysfs attribute? My understanding is that the
> goal of the queue ramp-up/ramp-down mechanism is to reduce the number
> of times a SCSI device responds "BUSY". An alternative to queue
> ramp-up/ramp-down is a delayed queue re-run: I think if
> scsi_queue_rq() returns BLK_STS_RESOURCE, the queue is only rerun
> after a delay. From blk_mq_dispatch_rq_list():
>
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>
> Bart.

In general it is better not to put in changes that affect older drivers
that are not regularly tested, I think. I would prefer an opt-in for
drivers desiring higher performance.

Delaying the queue re-run vs. a ramp-down might negatively affect
performance. I'm not sure how accurate the ramp is at discovering the
optimum, though.

-Ewan
On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> On 11/18/19 11:31 AM, Ming Lei wrote:
> > SCSI core uses the atomic variable sdev->device_busy to track
> > in-flight IO requests dispatched to a scsi device. IO requests may
> > be submitted from any CPU, so the cost of maintaining this shared
> > atomic counter can be very high on a big NUMA machine with lots of
> > CPU cores.
> >
> > [rest of patch snipped]
>
> Hmm. I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation
> to getting things wrong...

That won't be an issue, given this patch freezes the queue when the
NONROT queue flag is changed.

> I would far prefer it if we could delegate any queueing decision to
> the elevators, and completely drop the device_busy counter for all
> devices.

If you drop it, you may cause a big sequential IO performance drop on
HDDs; that is why this patch only bypasses sdev->queue_depth on SSDs.
NVMe bypasses it because no one uses HDDs via NVMe.

> But I do see that this might not be easy to do, nor necessarily
> something which others agree on.
> So alternatively, can't we make this a per-host flag instead of a
> per-device characteristic?

Are you sure each LUN shares the same NONROT flag? I think each LUN can
have a different NONROT flag.

> I'm just thinking of a storage array with a massive cache; it's
> anyone's guess whether 'nonrot' is set there, but at the same time
> these things can be fast, so they might profit from dropping the
> device_busy counter there, too.

If you are worrying about a slow queue freeze, that won't be an issue:
we don't switch to percpu mode until the flag is updated during sd
probe, and blk_mq_freeze_queue_wait() is pretty quick in ATOMIC mode.

Thanks,
Ming
On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> [...]
>
> [...]
>
> Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics(), which may
> set it again. So it's not clear to me how reliable this is.

The queue freeze is applied in sd_revalidate_disk() too, isn't it?

Thanks,
Ming
On Wed, Nov 20, 2019 at 12:56:21PM -0800, Bart Van Assche wrote:
> [...]
>
> How about changing the default behavior to ignoring sdev->queue_depth,
> and only honoring sdev->queue_depth if a "quirk" flag is set or if
> overridden by setting a sysfs attribute?

Using a 'quirk' was my first idea when we started to discuss the issue,
but the problem is that it isn't flexible; for example, one HBA may
connect HDDs in one setting and SSDs in another.

> My understanding is that the goal of the queue ramp-up/ramp-down
> mechanism is to reduce the number of times a SCSI device responds
> "BUSY".

I don't understand the motivation for ramp-up/ramp-down; maybe it is
just for fairness among LUNs. So far, blk-mq provides fair IO submission
among LUNs.

One problem with ignoring it is that sequential IO performance may drop
a lot compared with before.

> An alternative to queue ramp-up/ramp-down is a delayed queue re-run. I
> think if scsi_queue_rq() returns BLK_STS_RESOURCE, the queue is only
> rerun after a delay. From blk_mq_dispatch_rq_list():
>
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

The delayed re-run can't work, given we call blk_mq_get_dispatch_budget()
before dequeuing a request from the scheduler/sw queue to improve IO
merging. At that point, no queue run is involved.

Thanks,
Ming
On 11/21/19 1:53 AM, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
>> On 11/18/19 11:31 AM, Ming Lei wrote:
>>> [full patch snipped]
>>
>> Hmm. I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation
>> to getting things wrong...
>
> That won't be an issue, given this patch freezes the queue when the
> NONROT queue flag is changed.

That's not what I'm saying.

My issue is that we're turning the device_busy counter into a
conditional counter, which must only be accessed when that condition is
true. This not only puts a burden on the developer (who has to remember
to always check the condition), but also has possible issues when
switching between modes. The latter you've addressed by ensuring that
the queue must be frozen when modifying that flag, but the former is a
conceptual problem, and that is what I had been referring to.

>> I would far prefer it if we could delegate any queueing decision to
>> the elevators, and completely drop the device_busy counter for all
>> devices.
>
> If you drop it, you may cause a big sequential IO performance drop on
> HDDs; that is why this patch only bypasses sdev->queue_depth on SSDs.
> NVMe bypasses it because no one uses HDDs via NVMe.

I still wonder how much of a performance drop we'd actually see; what
seems to happen is that device_busy just arbitrarily pushes back to the
block layer, giving it more time to do merging. I do think we can do
better than that...

Cheers,

Hannes
On Thu, 2019-11-21 at 08:54 +0800, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > 
> > > Hmm.
> > > 
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > > 
> > 
> > This concerns me as well; it seems like the SCSI ML should have its
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis. And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again. So it's not clear to me how reliable this is.
> 
> The queue freeze is applied in sd_revalidate_disk() too, isn't it?
> 

Yes, sorry, you are right, your patch does add this. But if anything
else changes the NONROT attribute for a queue associated with a SCSI
device in the future it would have to freeze the queue, because the
device_busy counter mechanism would rely on it to work correctly.
This isn't an obvious connection, it seems to me.

-Ewan
Hannes,

> I must admit I patently don't like this explicit dependency on
> blk_nonrot().

The whole idea behind using rotational/non-rotational as a trigger for
this is a red herring. Fast devices also have internal queuing
limitations. And more importantly: fast devices also experience
temporary shortages which can lead to congestion.
Ewan,

> Delaying the queue re-run vs. a ramp down might negatively affect
> performance. I'm not sure how accurate the ramp is at discovering the
> optimum though.

The optimum - as well as the actual limit - might change over time in a
multi-initiator setup as other hosts grab resources on the device.

I do think that the only way forward here, if we want to avoid counting
outstanding commands for performance reasons, is to ensure that the
BUSY/QUEUE_FULL/TASK_SET_FULL handling is a relatively fast path and
not something deep in the bowels of error handling. Which it actually
isn't. But I do think we'll need to take a closer look.
Ming,

> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> for fairness among LUNs.

Congestion control. Devices have actual, physical limitations that are
different from the tag context limitations on the HBA. You don't have
that problem on NVMe because (at least for PCIe) the storage device and
the controller are one and the same.

If you submit 100000 concurrent requests to a SCSI drive that does 100
IOPS, some requests will time out before they get serviced. Consequently
we have the ability to raise and lower the queue depth to constrain the
amount of requests in flight to a given device at any point in time.

Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
off. We frequently see issues where the host can submit burst I/O much
faster than the device can de-stage from cache. In that scenario the
device reports BUSY/QF/TSF and we will back off so the device gets a
chance to recover. If we just let the application submit new I/O without
bounds, the system would never actually recover.

Note that the actual, physical limitations for how many commands a
target can handle are typically much, much lower than the number of tags
the HBA can manage. SATA devices can only express 32 concurrent
commands. SAS devices typically 128 concurrent commands per port. Arrays
differ.

If we ignore the RAID controller use case where the controller
internally queues and arbitrates commands between many devices, how is
submitting 1000 concurrent requests to a device which only has 128
command slots going to work?

Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
blindly stop restricting the number of I/Os in flight in the ML, we may
exceed either the capabilities of what the transport protocol can
express or internal device resources.
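The ramp-down/ramp-up back-off described above can be sketched roughly as follows. This is an illustrative model only, not the kernel's actual scsi_handle_queue_full()/scsi_handle_queue_ramp_up() code: the struct, the function names, and the 64-completion ramp-up period are all invented for the example.

```c
#include <assert.h>

/* Illustrative model of QUEUE_FULL/TASK_SET_FULL congestion control:
 * on a queue-full response, clamp the depth to what the device proved
 * it can take; after enough clean completions, creep back up. */
struct qd_state {
	unsigned int depth;       /* currently allowed queue depth */
	unsigned int max_depth;   /* configured ceiling */
	unsigned int good_cmds;   /* clean completions since last QF */
};

/* device rejected a command while `outstanding` were in flight */
static void on_queue_full(struct qd_state *s, unsigned int outstanding)
{
	s->good_cmds = 0;
	if (outstanding > 1 && outstanding - 1 < s->depth)
		s->depth = outstanding - 1;   /* ramp down */
}

static void on_good_completion(struct qd_state *s)
{
	if (++s->good_cmds >= 64 && s->depth < s->max_depth) {
		s->depth++;                   /* slow ramp back up */
		s->good_cmds = 0;
	}
}

/* example: device chokes with 20 in flight, then one clean
 * ramp-up period of 64 completions */
static unsigned int demo(void)
{
	struct qd_state s = { .depth = 32, .max_depth = 64, .good_cmds = 0 };
	int i;

	on_queue_full(&s, 20);            /* depth drops to 19 */
	for (i = 0; i < 64; i++)
		on_good_completion(&s);   /* depth creeps back to 20 */
	return s.depth;
}
```

The point of the asymmetry - instant ramp down, slow ramp up - is that a congested target recovers only if the host backs off faster than it pushes.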
Hi Martin,

On Thu, Nov 21, 2019 at 09:59:53PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> > for fairness among LUNs.
> 
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.

blk-mq already puts a limit on each LUN, namely
host_queue_depth / nr_active_LUNs, see hctx_may_queue(). It looks like
this way works for NVMe, which is why I try to bypass .device_busy for
SSD, since it is too expensive on fast storage. Even Hannes wants to
kill it completely.

> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.

I understand SATA's host queue depth is set to 32. But a SAS HBA's queue
depth is often big, so do we rely on .device_busy for throttling
requests to SAS?
> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?

For SSD, I guess it might be fine, given NVMe usually sets the
per-hw-queue depth to 1023. That means the concurrent requests can be
as many as 1023 * nr_hw_queues in case of a single namespace.

> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.

OK, one conservative approach may be to bypass .device_busy only for
SSDs behind some high-end HBAs. Or maybe we can wire up
sdev->queue_depth with the block layer's scheduler queue depth? One
issue is that sdev->queue_depth may be updated from time to time.

Thanks,
Ming
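The hctx_may_queue() fair-share limit mentioned above (roughly host_queue_depth / nr_active_LUNs) can be sketched as follows. This is a rough illustrative model with invented names, not the kernel's actual code, which works on sbitmap depths and per-hctx active-queue counts.

```c
#include <assert.h>

/* Rough model of blk-mq's hctx_may_queue() fairness: when several
 * request queues (LUNs) share one tag set, each *active* queue gets
 * roughly total_tags / nr_active_queues tags. */
static unsigned int fair_tag_limit(unsigned int total_tags,
                                   unsigned int nr_active_queues)
{
	if (nr_active_queues <= 1)
		return total_tags;
	/* round up so every active queue keeps forward progress */
	return (total_tags + nr_active_queues - 1) / nr_active_queues;
}

/* may this queue take another tag with `in_flight` already issued? */
static int may_queue(unsigned int in_flight, unsigned int total_tags,
                     unsigned int nr_active_queues)
{
	return in_flight < fair_tag_limit(total_tags, nr_active_queues);
}
```

With a host queue depth of 1024 and 4 active LUNs, each LUN is capped at 256 in-flight requests - the per-LUN bound that, per the argument above, would remain even if .device_busy were bypassed for SSD.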
On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote: > On 11/21/19 1:53 AM, Ming Lei wrote: > > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote: > >> On 11/18/19 11:31 AM, Ming Lei wrote: > >>> SCSI core uses the atomic variable of sdev->device_busy to track > >>> in-flight IO requests dispatched to this scsi device. IO request may be > >>> submitted from any CPU, so the cost for maintaining the shared atomic > >>> counter can be very big on big NUMA machine with lots of CPU cores. > >>> > >>> sdev->queue_depth is usually used for two purposes: 1) improve IO merge; > >>> 2) fair IO request scattered among all LUNs. > >>> > >>> blk-mq already provides fair request allocation among all active shared > >>> request queues(LUNs), see hctx_may_queue(). > >>> > >>> NVMe doesn't have such per-request-queue(namespace) queue depth, so it > >>> is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't > >>> play big role for reaching top SSD performance. > >>> > >>> With this patch, big cost for tracking in-flight per-LUN requests via > >>> atomic variable can be saved. > >>> > >>> Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue > >>> before changing this flag. > >>> > >>> Cc: Sathya Prakash <sathya.prakash@broadcom.com> > >>> Cc: Chaitra P B <chaitra.basappa@broadcom.com> > >>> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com> > >>> Cc: Kashyap Desai <kashyap.desai@broadcom.com> > >>> Cc: Sumit Saxena <sumit.saxena@broadcom.com> > >>> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > >>> Cc: Ewan D. 
Milne <emilne@redhat.com> > >>> Cc: Christoph Hellwig <hch@lst.de>, > >>> Cc: Hannes Reinecke <hare@suse.de> > >>> Cc: Bart Van Assche <bart.vanassche@wdc.com> > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com> > >>> --- > >>> block/blk-sysfs.c | 14 +++++++++++++- > >>> drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------ > >>> drivers/scsi/sd.c | 4 ++++ > >>> 3 files changed, 35 insertions(+), 7 deletions(-) > >>> > >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > >>> index fca9b158f4a0..9cc0dd5f88a1 100644 > >>> --- a/block/blk-sysfs.c > >>> +++ b/block/blk-sysfs.c > >>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0); > >>> QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0); > >>> #undef QUEUE_SYSFS_BIT_FNS > >>> > >>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q, > >>> + const char *page, size_t count) > >>> +{ > >>> + size_t ret; > >>> + > >>> + blk_mq_freeze_queue(q); > >>> + ret = queue_store_nonrot(q, page, count); > >>> + blk_mq_unfreeze_queue(q); > >>> + > >>> + return ret; > >>> +} > >>> + > >>> static ssize_t queue_zoned_show(struct request_queue *q, char *page) > >>> { > >>> switch (blk_queue_zoned_model(q)) { > >>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = { > >>> static struct queue_sysfs_entry queue_nonrot_entry = { > >>> .attr = {.name = "rotational", .mode = 0644 }, > >>> .show = queue_show_nonrot, > >>> - .store = queue_store_nonrot, > >>> + .store = queue_store_nonrot_freeze, > >>> }; > >>> > >>> static struct queue_sysfs_entry queue_zoned_entry = { > >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > >>> index 62a86a82c38d..72655a049efd 100644 > >>> --- a/drivers/scsi/scsi_lib.c > >>> +++ b/drivers/scsi/scsi_lib.c > >>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd) > >>> if (starget->can_queue > 0) > >>> atomic_dec(&starget->target_busy); > >>> > >>> - atomic_dec(&sdev->device_busy); > >>> + if 
(!blk_queue_nonrot(sdev->request_queue)) > >>> + atomic_dec(&sdev->device_busy); > >>> } > >>> > >>> static void scsi_kick_queue(struct request_queue *q) > >>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev) > >>> > >>> static inline bool scsi_device_is_busy(struct scsi_device *sdev) > >>> { > >>> - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth) > >>> + if (!blk_queue_nonrot(sdev->request_queue) && > >>> + atomic_read(&sdev->device_busy) >= sdev->queue_depth) > >>> return true; > >>> if (atomic_read(&sdev->device_blocked) > 0) > >>> return true; > >>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, > >>> struct scsi_device *sdev) > >>> { > >>> unsigned int busy; > >>> + bool bypass = blk_queue_nonrot(sdev->request_queue); > >>> > >>> - busy = atomic_inc_return(&sdev->device_busy) - 1; > >>> + if (!bypass) > >>> + busy = atomic_inc_return(&sdev->device_busy) - 1; > >>> + else > >>> + busy = 0; > >>> if (atomic_read(&sdev->device_blocked)) { > >>> if (busy) > >>> goto out_dec; > >>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, > >>> "unblocking device at zero depth\n")); > >>> } > >>> > >>> + if (bypass) > >>> + return 1; > >>> + > >>> if (busy >= sdev->queue_depth) > >>> goto out_dec; > >>> > >>> return 1; > >>> out_dec: > >>> - atomic_dec(&sdev->device_busy); > >>> + if (!bypass) > >>> + atomic_dec(&sdev->device_busy); > >>> return 0; > >>> } > >>> > >>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) > >>> struct request_queue *q = hctx->queue; > >>> struct scsi_device *sdev = q->queuedata; > >>> > >>> - atomic_dec(&sdev->device_busy); > >>> + if (!blk_queue_nonrot(sdev->request_queue)) > >>> + atomic_dec(&sdev->device_busy); > >>> } > >>> > >>> static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) > >>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, > >>> case 
BLK_STS_OK: > >>> break; > >>> case BLK_STS_RESOURCE: > >>> - if (atomic_read(&sdev->device_busy) || > >>> + if ((!blk_queue_nonrot(sdev->request_queue) && > >>> + atomic_read(&sdev->device_busy)) || > >>> scsi_device_blocked(sdev)) > >>> ret = BLK_STS_DEV_RESOURCE; > >>> break; > >>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > >>> index 0744c34468e1..c3d47117700d 100644 > >>> --- a/drivers/scsi/sd.c > >>> +++ b/drivers/scsi/sd.c > >>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp) > >>> rot = get_unaligned_be16(&buffer[4]); > >>> > >>> if (rot == 1) { > >>> + blk_mq_freeze_queue(q); > >>> blk_queue_flag_set(QUEUE_FLAG_NONROT, q); > >>> + blk_mq_unfreeze_queue(q); > >>> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q); > >>> } > >>> > >>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk) > >>> * cause this to be updated correctly and any device which > >>> * doesn't support it should be treated as rotational. > >>> */ > >>> + blk_mq_freeze_queue(q); > >>> blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); > >>> + blk_mq_unfreeze_queue(q); > >>> blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q); > >>> > >>> if (scsi_device_supports_vpd(sdp)) { > >>> > >> Hmm. > >> > >> I must admit I patently don't like this explicit dependency on > >> blk_nonrot(). Having a conditional counter is just an open invitation to > >> getting things wrong... > > > > That won't be an issue given this patch freezes queue when the > > NONROT queue flag is changed. > > > That's not what I'm saying. > > My issue is that we're turning the device_busy counter into a > conditional counter, which only must be accessed when that condition is > true. > This not only puts a burden on the developer (as he has to remember to > always check the condition), but also has possible issues when switching > between modes. The conditional check can be avoided only if we can kill device_busy completely. 
However, as Martin commented, that isn't realistic so far.

> The latter you've addressed with ensuring that the queue must be frozen
> when modifying that flag, but the former is a conceptual problem.

It can't be perfect, but the queue freeze makes it good enough. Someone
said 'Perfect is the enemy of the good', :-)

> 
> And that was what I had been referring to.
> 
> >>
> >> I would far prefer if we could delegate any queueing decision to the
> >> elevators, and completely drop the device_busy flag for all devices.
> >
> > If you drop it, you may create a big sequential IO performance drop
> > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > SSD. NVMe bypasses it because no one uses HDD via NVMe.
> >
> I still wonder how much performance drop we actually see; what seems to
> happen is that device_busy just arbitrarily pushes back to the block
> layer, giving it more time to do merging.
> I do think we can do better than that...

For example, running the following script[1] on a 4-core VM:

 ---------------------------------------------
 |                     | QD: 255 | QD: 32   |
 ---------------------------------------------
 | fio read throughput | 825MB/s | 1432MB/s |
 ---------------------------------------------

note: QD is the scsi_device's queue depth, which can be passed to
test.sh. scsi_debug's device queue depth is 255 by default, which is
>= .can_queue

[1] test.sh

#!/bin/bash
QD=$1
modprobe scsi_debug ndelay=200000
sleep 2
DEVN=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEVN
echo $QD > /sys/block/$DEVN/device/queue_depth
SQD=`cat /sys/block/$DEVN/device/queue_depth`
echo "scsi device queue depth: $SQD"
fio --readwrite=read --filename=$DEV --runtime=20s --time_based --group_reporting=1 \
    --ioengine=libaio --direct=1 --iodepth=64 --bs=4k --numjobs=4 --name=seq_perf

Thanks,
Ming
>> If we ignore the RAID controller use case where the controller
>> internally queues and arbitrates commands between many devices

These controllers should not be "ignored", but rather enabled. Many of
them virtualize both HDD and NVMe devices behind them, and are thus
forced to expose themselves as SCSI controllers. However, they have
their own queue management and IO merging capabilities. Many have the
capability of holding IOs in a queue and pulling them as needed (just
like NVMe), and thus do not care how many IOs are sent to a device or
controller, or whether there is congestion. In case of congestion, the
IO will simply wait in the queue, backed by advanced timeout handling
capabilities. Besides, as Ming pointed out, the block layer (function
hctx_may_queue) already limits IO on a per-controller and per-LUN
basis.

Overall, if the proposal does not work for all cases, then at least it
should be made optional for high-end controllers, so that they are not
disadvantaged vis-a-vis NVMe just because they expose themselves as
SCSI in order to support a wide range of devices behind them.

thanks,
Sumanesh

On 11/21/2019 7:59 PM, Martin K. Petersen wrote:
> Ming,
> 
>> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
>> for fairness among LUNs.
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.
> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.
> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?
> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.
>
On 11/22/19 12:09 AM, Ming Lei wrote: > On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote: >> On 11/21/19 1:53 AM, Ming Lei wrote: >>> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote: >>>> I would far prefer if we could delegate any queueing decision to the >>>> elevators, and completely drop the device_busy flag for all devices. >>> >>> If you drop it, you may create big sequential IO performance drop >>> on HDD., that is why this patch only bypasses sdev->queue_depth on >>> SSD. NVMe bypasses it because no one uses HDD. via NVMe. >>> >> I still wonder how much performance drop we actually see; what seems to >> happen is that device_busy just arbitrary pushes back to the block >> layer, giving it more time to do merging. >> I do think we can do better then that... > > For example, running the following script[1] on 4-core VM: > > ------------------------------------------ > | QD:255 | QD: 32 | > ------------------------------------------ > fio read throughput | 825MB/s | 1432MB/s| > ------------------------------------------ > > [ ... ] Hi Ming, Thanks for having shared these numbers. I think this is very useful information. Do these results show the performance drop that happens if /sys/block/.../device/queue_depth exceeds .can_queue? What I am wondering about is how important these results are in the context of this discussion. Are there any modern SCSI devices for which a SCSI LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the device responds with BUSY? What surprised me is that only three SCSI LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that mean that BUSY responses from a SCSI device or HBA are rare? Thanks, Bart.
On 11/22/2019 10:14 AM, Bart Van Assche wrote:
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens
> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am
> wondering about is how important these results are in the context of
> this discussion. Are there any modern SCSI devices for which a SCSI
> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
> device responds with BUSY? What surprised me is that only three SCSI
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
> mean that BUSY responses from a SCSI device or HBA are rare?

That's because most of the drivers which had queue-full ramp up/ramp
down in them, and would have called scsi_track_queue_full(), converted
over to the queue-full handling that moved into the mid-layer,
indicated by sht->track_queue_depth = 1.

Yes - it is still hit a lot!

-- james
On 11/22/19 10:26 AM, James Smart wrote: > On 11/22/2019 10:14 AM, Bart Van Assche wrote: >> Thanks for having shared these numbers. I think this is very useful >> information. Do these results show the performance drop that happens >> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am >> wondering about is how important these results are in the context of >> this discussion. Are there any modern SCSI devices for which a SCSI >> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the >> device responds with BUSY? What surprised me is that only three SCSI >> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that >> mean that BUSY responses from a SCSI device or HBA are rare? > > That's because most of the drivers, which had queue full ramp up/ramp > down in them and would have called scsi_track_queue_full() converted > over to the moved-queue-full-handling-in-the-mid-layer, indicated by > sht->track_queue_depth = 1. > > Yes - it is still hit a lot! Hi James, In the systems that I have been working on myself I made sure that the BUSY condition is rarely or never encountered. Anyway, since there are setups in which this condition is hit frequently we need to make sure that these setups keep performing well. I'm wondering now whether we should try to come up with an algorithm for maintaining sdev->device_busy only if it improves performance and for not maintaining sdev->device_busy for devices/HBAs that don't need it. Bart.
On Fri, Nov 22, 2019 at 10:14:51AM -0800, Bart Van Assche wrote:
> On 11/22/19 12:09 AM, Ming Lei wrote:
> > On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
> > > On 11/21/19 1:53 AM, Ming Lei wrote:
> > > > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> > > > > I would far prefer if we could delegate any queueing decision to the
> > > > > elevators, and completely drop the device_busy flag for all devices.
> > > > 
> > > > If you drop it, you may create big sequential IO performance drop
> > > > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > > > SSD. NVMe bypasses it because no one uses HDD via NVMe.
> > > > 
> > > I still wonder how much performance drop we actually see; what seems to
> > > happen is that device_busy just arbitrarily pushes back to the block
> > > layer, giving it more time to do merging.
> > > I do think we can do better than that...
> > 
> > For example, running the following script[1] on a 4-core VM:
> > 
> > ------------------------------------------
> >                      | QD: 255 | QD: 32  |
> > ------------------------------------------
> >  fio read throughput | 825MB/s | 1432MB/s|
> > ------------------------------------------
> > 
> > [ ... ]
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens if
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am wondering

The above test just shows that IO merge plays an important role here,
and one important point for triggering IO merge is that .get_budget
returns false. If sdev->queue_depth is too big, .get_budget may never
return false. That is why this patch just bypasses .device_busy for
SSD.

> about is how important these results are in the context of this discussion.
> Are there any modern SCSI devices for which a SCSI LLD sets
> scsi_host->can_queue and scsi_host->cmd_per_lun such that the device
> responds with BUSY?
> What surprised me is that only three SCSI LLDs call

There are many such HBAs for which sdev->queue_depth is smaller than
.can_queue, especially in case of a small number of LUNs.

> scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that mean that BUSY
> responses from a SCSI device or HBA are rare?

It is only true for some HBAs.

thanks,
Ming
On Fri, Nov 22, 2019 at 12:46:48PM -0800, Bart Van Assche wrote: > On 11/22/19 10:26 AM, James Smart wrote: > > On 11/22/2019 10:14 AM, Bart Van Assche wrote: > > > Thanks for having shared these numbers. I think this is very useful > > > information. Do these results show the performance drop that happens > > > if /sys/block/.../device/queue_depth exceeds .can_queue? What I am > > > wondering about is how important these results are in the context of > > > this discussion. Are there any modern SCSI devices for which a SCSI > > > LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that > > > the device responds with BUSY? What surprised me is that only three > > > SCSI LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does > > > that mean that BUSY responses from a SCSI device or HBA are rare? > > > > That's because most of the drivers, which had queue full ramp up/ramp > > down in them and would have called scsi_track_queue_full() converted > > over to the moved-queue-full-handling-in-the-mid-layer, indicated by > > sht->track_queue_depth = 1. > > > > Yes - it is still hit a lot! > > Hi James, > > In the systems that I have been working on myself I made sure that the BUSY > condition is rarely or never encountered. Anyway, since there are setups in > which this condition is hit frequently we need to make sure that these > setups keep performing well. I'm wondering now whether we should try to come > up with an algorithm for maintaining sdev->device_busy only if it improves > performance and for not maintaining sdev->device_busy for devices/HBAs that > don't need it. The simplest policy could be to only maintain sdev->device_busy for HDD. Thanks, Ming
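The "only maintain sdev->device_busy for HDD" policy can be sketched as below. This is an illustration of the idea, not the patch itself: `struct fake_sdev` and its `nonrot` field stand in for the real scsi_device and blk_queue_nonrot(), and the helper names are invented.

```c
#include <stdatomic.h>

/* Conditional per-LUN counter: only rotational devices pay for the
 * shared atomic in-flight counter; SSDs skip it entirely. */
struct fake_sdev {
	atomic_uint device_busy;     /* in-flight commands (HDD only) */
	unsigned int queue_depth;    /* per-LUN depth limit */
	int nonrot;                  /* 1 = SSD, 0 = HDD */
};

/* admission check, modeled on scsi_dev_queue_ready() */
static int dev_queue_ready(struct fake_sdev *sdev)
{
	unsigned int busy;

	if (sdev->nonrot)
		return 1;	/* SSD: no per-LUN throttling at all */

	busy = atomic_fetch_add(&sdev->device_busy, 1);
	if (busy >= sdev->queue_depth) {
		atomic_fetch_sub(&sdev->device_busy, 1);
		return 0;	/* HDD at queue_depth: push back */
	}
	return 1;
}

/* completion path, modeled on scsi_device_unbusy() */
static void dev_unbusy(struct fake_sdev *sdev)
{
	if (!sdev->nonrot)
		atomic_fetch_sub(&sdev->device_busy, 1);
}

static int demo(void)
{
	struct fake_sdev hdd = { .queue_depth = 2, .nonrot = 0 };
	struct fake_sdev ssd = { .queue_depth = 2, .nonrot = 1 };
	int i, ok = 1;

	ok &= dev_queue_ready(&hdd);       /* 1 in flight */
	ok &= dev_queue_ready(&hdd);       /* 2 in flight */
	ok &= !dev_queue_ready(&hdd);      /* throttled at queue_depth */
	dev_unbusy(&hdd);
	ok &= dev_queue_ready(&hdd);       /* slot freed, admitted again */
	for (i = 0; i < 100; i++)
		ok &= dev_queue_ready(&ssd);  /* SSD is never throttled */
	return ok;
}
```

The sketch also shows the hazard Hannes objects to: every reader and writer of device_busy must remember to test nonrot, and the flag must not flip while commands are in flight - which is exactly what the queue freeze in the patch is for.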
On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens if
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am
> wondering about is how important these results are in the context of
> this discussion. Are there any modern SCSI devices for which a SCSI LLD
> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
> device responds with BUSY? What surprised me is that only three SCSI
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
> mean that BUSY responses from a SCSI device or HBA are rare?
> 

Some FC HBAs end up returning busy from ->queuecommand(), but I think
this is more commonly due to there being an issue with the rport rather
than the device.

-Ewan
On 11/25/2019 10:28 AM, Ewan D. Milne wrote:
> On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
>> Hi Ming,
>>
>> Thanks for having shared these numbers. I think this is very useful
>> information. Do these results show the performance drop that happens if
>> /sys/block/.../device/queue_depth exceeds .can_queue? What I am
>> wondering about is how important these results are in the context of
>> this discussion. Are there any modern SCSI devices for which a SCSI LLD
>> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
>> device responds with BUSY? What surprised me is that only three SCSI
>> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
>> mean that BUSY responses from a SCSI device or HBA are rare?
>>
> Some FC HBAs end up returning busy from ->queuecommand(), but I think
> this is more commonly due to there being an issue with the rport rather
> than the device.
> 
> -Ewan
> 

True - but I would assume busy from queuecommand() is different from
BUSY/QUEUE_FULL via a SCSI response. Adapter queuecommand busies can be
due to out-of-resource limits in the driver - such as I_T IO count
limits enforced by the driver being reached, or some other adapter
resource limit being reached as well. can_queue covers most of those -
but we sometimes overcommit the adapter with a can_queue on the
physical port as well as per NPIV port, or SCSI and NVMe on the same
port.

Going back to Bart's question - with SANs and multiple initiators
sharing a target and lots of LUNs on that target, it's very common to
hit bursty conditions where the target may reply with QUEUE_FULL. Many
arrays provide tuning guides on how to set up values on multiple hosts,
but it's mainly to help the target avoid being completely overrun, as
some didn't do so well. In the end, it's very hard to predict
multi-initiator load, and in a lot of cases things are usually left a
bit overcommitted, as the performance downside otherwise is
significant.

-- james
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..9cc0dd5f88a1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
+		const char *page, size_t count)
+{
+	size_t ret;
+
+	blk_mq_freeze_queue(q);
+	ret = queue_store_nonrot(q, page, count);
+	blk_mq_unfreeze_queue(q);
+
+	return ret;
+}
+
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
 	switch (blk_queue_zoned_model(q)) {
@@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
-	.store = queue_store_nonrot,
+	.store = queue_store_nonrot_freeze,
 };
 
 static struct queue_sysfs_entry queue_zoned_entry = {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 62a86a82c38d..72655a049efd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!blk_queue_nonrot(sdev->request_queue) &&
+			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = blk_queue_nonrot(sdev->request_queue);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!blk_queue_nonrot(sdev->request_queue) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0744c34468e1..c3d47117700d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 	rot = get_unaligned_be16(&buffer[4]);
 
 	if (rot == 1) {
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
 	}
 
@@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 * cause this to be updated correctly and any device which
 	 * doesn't support it should be treated as rotational.
 	 */
+	blk_mq_freeze_queue(q);
 	blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+	blk_mq_unfreeze_queue(q);
 	blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
 
 	if (scsi_device_supports_vpd(sdp)) {
The SCSI core uses the atomic variable sdev->device_busy to track
in-flight IO requests dispatched to a SCSI device. IO requests may be
submitted from any CPU, so the cost of maintaining this shared atomic
counter can be very high on a big NUMA machine with many CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO
merging; 2) fairly scattering IO requests among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs), see hctx_may_queue().

NVMe doesn't have such a per-request-queue (namespace) queue depth, so
it is reasonable to ignore the limit for SCSI SSDs too. Also, IO
merging won't play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests
via an atomic variable can be saved.

Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the
queue before changing this flag.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c       | 14 +++++++++++++-
 drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
 drivers/scsi/sd.c       |  4 ++++
 3 files changed, 35 insertions(+), 7 deletions(-)