Message ID | 20191118103117.978-5-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | scsi: don't apply per-LUN queue depth for SSD | expand |
On 11/18/19 11:31 AM, Ming Lei wrote:
> SCSI core uses the atomic variable sdev->device_busy to track in-flight
> IO requests dispatched to a scsi device. IO requests may be submitted
> from any CPU, so the cost of maintaining this shared atomic counter can
> be very high on a big NUMA machine with lots of CPU cores.
>
> sdev->queue_depth is usually used for two purposes: 1) improve IO
> merging; 2) spread IO requests fairly among all LUNs.
>
> blk-mq already provides fair request allocation among all active shared
> request queues (LUNs), see hctx_may_queue().
>
> NVMe doesn't have such a per-request-queue (namespace) queue depth, so
> it is reasonable to ignore the limit for SCSI SSDs too. Also, IO merging
> doesn't play a big role in reaching top SSD performance.
>
> With this patch, the big cost of tracking in-flight per-LUN requests via
> an atomic variable can be saved.
>
> Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the
> queue before changing this flag.
>
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-sysfs.c       | 14 +++++++++++++-
>  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
>  drivers/scsi/sd.c       |  4 ++++
>  3 files changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fca9b158f4a0..9cc0dd5f88a1 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
>  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
>  #undef QUEUE_SYSFS_BIT_FNS
>
> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> +		const char *page, size_t count)
> +{
> +	ssize_t ret;
> +
> +	blk_mq_freeze_queue(q);
> +	ret = queue_store_nonrot(q, page, count);
> +	blk_mq_unfreeze_queue(q);
> +
> +	return ret;
> +}
> +
>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
>  {
>  	switch (blk_queue_zoned_model(q)) {
> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
>  static struct queue_sysfs_entry queue_nonrot_entry = {
>  	.attr = {.name = "rotational", .mode = 0644 },
>  	.show = queue_show_nonrot,
> -	.store = queue_store_nonrot,
> +	.store = queue_store_nonrot_freeze,
>  };
>
>  static struct queue_sysfs_entry queue_zoned_entry = {
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 62a86a82c38d..72655a049efd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
>  	if (starget->can_queue > 0)
>  		atomic_dec(&starget->target_busy);
>
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>
>  static void scsi_kick_queue(struct request_queue *q)
> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>
>  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
>  {
> -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> +	if (!blk_queue_nonrot(sdev->request_queue) &&
> +	    atomic_read(&sdev->device_busy) >= sdev->queue_depth)
>  		return true;
>  	if (atomic_read(&sdev->device_blocked) > 0)
>  		return true;
> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				  struct scsi_device *sdev)
>  {
>  	unsigned int busy;
> +	bool bypass = blk_queue_nonrot(sdev->request_queue);
>
> -	busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	if (!bypass)
> +		busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	else
> +		busy = 0;
>  	if (atomic_read(&sdev->device_blocked)) {
>  		if (busy)
>  			goto out_dec;
> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				   "unblocking device at zero depth\n"));
>  	}
>
> +	if (bypass)
> +		return 1;
> +
>  	if (busy >= sdev->queue_depth)
>  		goto out_dec;
>
>  	return 1;
> out_dec:
> -	atomic_dec(&sdev->device_busy);
> +	if (!bypass)
> +		atomic_dec(&sdev->device_busy);
>  	return 0;
>  }
>
> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct scsi_device *sdev = q->queuedata;
>
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>
>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) ||
> +		if ((!blk_queue_nonrot(sdev->request_queue) &&
> +		     atomic_read(&sdev->device_busy)) ||
>  		    scsi_device_blocked(sdev))
>  			ret = BLK_STS_DEV_RESOURCE;
>  		break;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 0744c34468e1..c3d47117700d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
>  	rot = get_unaligned_be16(&buffer[4]);
>
>  	if (rot == 1) {
> +		blk_mq_freeze_queue(q);
>  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> +		blk_mq_unfreeze_queue(q);
>  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
>  	}
>
> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 * cause this to be updated correctly and any device which
>  	 * doesn't support it should be treated as rotational.
>  	 */
> +	blk_mq_freeze_queue(q);
>  	blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> +	blk_mq_unfreeze_queue(q);
>  	blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
>
>  	if (scsi_device_supports_vpd(sdp)) {

Hmm. I must admit I patently don't like this explicit dependency on
blk_nonrot(). Having a conditional counter is just an open invitation to
getting things wrong...

I would far prefer it if we could delegate any queueing decision to the
elevators, and completely drop the device_busy counter for all devices.
But I do see that this might not be easy to do, nor necessarily
something which others agree on.

So alternatively, can't we make this a per-host flag instead of a
per-device characteristic? I'm just thinking of a storage array with a
massive cache; it's anyone's guess whether 'nonrot' is set there, but at
the same time these things can be fast, so they might profit from
dropping the device_busy counter there, too.

Cheers,

Hannes
On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>
> Hmm. I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation
> to getting things wrong...
>

This concerns me as well; it seems like the SCSI ML should have its own
per-device attribute if we actually need to control this per-device
instead of on a per-host or per-driver basis. And it seems like this is
something that is specific to high-performance drivers, so changing the
way this works for all drivers seems a bit much.

Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
but I dislike wrapping the examination of that and the queue flag in a
macro that makes it non-obvious how the behavior is affected. (Plus
Hannes just submitted the patches to remove .use_cmd_list, which was
another piece of ML functionality used by only a few drivers.)

Ming's patch does freeze the queue if NONROT is changed via sysfs, but
the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
clears it and then calls sd_read_block_characteristics(), which may set
it again. So it's not clear to me how reliable this is.

-Ewan
On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation
>> to getting things wrong...
>
> [...]
>
> Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics(), which may
> set it again. So it's not clear to me how reliable this is.

How about changing the default behavior to ignoring sdev->queue_depth,
and only honoring sdev->queue_depth if a "quirk" flag is set or if
overridden by setting a sysfs attribute? My understanding is that the
goal of the queue ramp-up/ramp-down mechanism is to reduce the number of
times a SCSI device responds "BUSY". An alternative to queue
ramp-up/ramp-down is a delayed queue re-run: I think if scsi_queue_rq()
returns BLK_STS_RESOURCE, the queue is only rerun after a delay. From
blk_mq_dispatch_rq_list():

	[ ... ]
	else if (needs_restart && (ret == BLK_STS_RESOURCE))
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

Bart.
On Wed, 2019-11-20 at 12:56 -0800, Bart Van Assche wrote:
> On 11/20/19 9:00 AM, Ewan D. Milne wrote:
>> [...]
>
> How about changing the default behavior to ignoring sdev->queue_depth,
> and only honoring sdev->queue_depth if a "quirk" flag is set or if
> overridden by setting a sysfs attribute? My understanding is that the
> goal of the queue ramp-up/ramp-down mechanism is to reduce the number
> of times a SCSI device responds "BUSY". An alternative to queue
> ramp-up/ramp-down is a delayed queue re-run: I think if
> scsi_queue_rq() returns BLK_STS_RESOURCE, the queue is only rerun
> after a delay. From blk_mq_dispatch_rq_list():
>
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>
> Bart.

In general it is better not to put in changes that affect older drivers
that are not regularly tested, I think. I would prefer an opt-in for
drivers desiring higher performance.

Delaying the queue re-run vs. a ramp-down might negatively affect
performance. I'm not sure how accurate the ramp is at discovering the
optimum, though.

-Ewan
On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> On 11/18/19 11:31 AM, Ming Lei wrote:
> > SCSI core uses the atomic variable sdev->device_busy to track
> > in-flight IO requests dispatched to a scsi device. IO requests may
> > be submitted from any CPU, so the cost of maintaining this shared
> > atomic counter can be very high on a big NUMA machine with lots of
> > CPU cores.
> >
> > [rest of patch snipped]
>
> Hmm. I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation
> to getting things wrong...

That won't be an issue, given this patch freezes the queue when the
NONROT queue flag is changed.

> I would far prefer it if we could delegate any queueing decision to
> the elevators, and completely drop the device_busy counter for all
> devices.

If you drop it, you may cause a big sequential IO performance drop on
HDDs; that is why this patch only bypasses sdev->queue_depth on SSDs.
NVMe bypasses it because no one uses HDDs via NVMe.

> But I do see that this might not be easy to do, nor necessarily
> something which others agree on.
> So alternatively, can't we make this a per-host flag instead of a
> per-device characteristic?

Are you sure each LUN shares the same NONROT flag? I think each LUN can
have a different NONROT flag.

> I'm just thinking of a storage array with a massive cache; it's
> anyone's guess whether 'nonrot' is set there, but at the same time
> these things can be fast, so they might profit from dropping the
> device_busy counter there, too.

If you are worrying about a slow queue freeze, that won't be an issue:
we don't switch to percpu mode until the flag is updated during sd
probe, and blk_mq_freeze_queue_wait() is pretty quick in ATOMIC mode.

Thanks,
Ming
On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> [...]
>
> [...]
>
> Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> the flag can be changed by other kernel code; e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics(), which may
> set it again. So it's not clear to me how reliable this is.

The queue freeze is applied in sd_revalidate_disk() too, isn't it?

Thanks,
Ming
On Wed, Nov 20, 2019 at 12:56:21PM -0800, Bart Van Assche wrote:
> [...]
>
> How about changing the default behavior to ignoring sdev->queue_depth,
> and only honoring sdev->queue_depth if a "quirk" flag is set or if
> overridden by setting a sysfs attribute?

Using a 'quirk' was my first idea when we started to discuss the issue,
but the problem is that it isn't flexible; for example, one HBA may
connect HDDs in one setting and SSDs in another.

> My understanding is that the goal of the queue ramp-up/ramp-down
> mechanism is to reduce the number of times a SCSI device responds
> "BUSY".

I don't understand the motivation for ramp-up/ramp-down; maybe it is
just for fairness among LUNs. So far, blk-mq provides fair IO submission
among LUNs.

One problem with ignoring it is that sequential IO performance may drop
a lot compared with before.

> An alternative to queue ramp-up/ramp-down is a delayed queue re-run. I
> think if scsi_queue_rq() returns BLK_STS_RESOURCE, the queue is only
> rerun after a delay. From blk_mq_dispatch_rq_list():
>
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

The delayed re-run can't work, given we call blk_mq_get_dispatch_budget()
before dequeuing a request from the scheduler/sw queue to improve IO
merging. At that point, no queue run is involved.

Thanks,
Ming
On 11/21/19 1:53 AM, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
>> On 11/18/19 11:31 AM, Ming Lei wrote:
>>> [full patch snipped]
>>
>> Hmm. I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation
>> to getting things wrong...
>
> That won't be an issue, given this patch freezes the queue when the
> NONROT queue flag is changed.

That's not what I'm saying.

My issue is that we're turning the device_busy counter into a
conditional counter, which must only be accessed when that condition is
true. This not only puts a burden on the developer (who has to remember
to always check the condition), but also has possible issues when
switching between modes. The latter you've addressed by ensuring that
the queue must be frozen when modifying that flag, but the former is a
conceptual problem, and that is what I had been referring to.

>> I would far prefer it if we could delegate any queueing decision to
>> the elevators, and completely drop the device_busy counter for all
>> devices.
>
> If you drop it, you may cause a big sequential IO performance drop on
> HDDs; that is why this patch only bypasses sdev->queue_depth on SSDs.
> NVMe bypasses it because no one uses HDDs via NVMe.

I still wonder how much of a performance drop we'd actually see; what
seems to happen is that device_busy just arbitrarily pushes back to the
block layer, giving it more time to do merging. I do think we can do
better than that...

Cheers,

Hannes
On Thu, 2019-11-21 at 08:54 +0800, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > 
> > > Hmm.
> > > 
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > > 
> > 
> > This concerns me as well; it seems like the SCSI ML should have its
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis. And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed via sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again. So it's not clear to me how reliable this is.
> 
> The queue freeze is applied in sd_revalidate_disk() too, isn't it?
> 

Yes, sorry, you are right, your patch does add this. But if anything
else changes the NONROT attribute for a queue associated with a SCSI
device in the future it would have to freeze the queue, because the
device_busy counter mechanism would rely on it to work correctly.
This isn't an obvious connection, it seems to me.

-Ewan
Hannes,

> I must admit I patently don't like this explicit dependency on
> blk_nonrot().

The whole idea behind using rotational/non-rotational as a trigger for
this is a red herring. Fast devices also have internal queuing
limitations. And more importantly: fast devices also experience
temporary shortages which can lead to congestion.
Ewan,

> Delaying the queue re-run vs. a ramp down might negatively affect
> performance. I'm not sure how accurate the ramp is at discovering the
> optimum though.

The optimum - as well as the actual limit - might change over time in a
multi-initiator setup as other hosts grab resources on the device.

I do think that the only way forward here, if we want to avoid counting
outstanding commands for performance reasons, is to ensure that the
BUSY/QUEUE_FULL/TASK_SET_FULL handling is a relatively fast path and
not something deep in the bowels of error handling. Which it actually
isn't. But I do think we'll need to take a closer look.
Ming,

> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> for fairness among LUNs.

Congestion control. Devices have actual, physical limitations that are
different from the tag context limitations on the HBA. You don't have
that problem on NVMe because (at least for PCIe) the storage device and
the controller are one and the same.

If you submit 100000 concurrent requests to a SCSI drive that does 100
IOPS, some requests will time out before they get serviced. Consequently
we have the ability to raise and lower the queue depth to constrain the
amount of requests in flight to a given device at any point in time.

Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
off. We frequently see issues where the host can submit burst I/O much
faster than the device can de-stage from cache. In that scenario the
device reports BUSY/QF/TSF and we will back off so the device gets a
chance to recover. If we just let the application submit new I/O without
bounds, the system would never actually recover.

Note that the actual, physical limitations for how many commands a
target can handle are typically much, much lower than the number of tags
the HBA can manage. SATA devices can only express 32 concurrent
commands. SAS devices typically 128 concurrent commands per port. Arrays
differ.

If we ignore the RAID controller use case where the controller
internally queues and arbitrates commands between many devices, how is
submitting 1000 concurrent requests to a device which only has 128
command slots going to work?

Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
blindly stop restricting the number of I/Os in flight in the ML, we may
exceed either the capabilities of what the transport protocol can
express or internal device resources.
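The ramp-down/ramp-up back-off described above can be sketched roughly as follows. This is an illustrative model only, not the kernel's actual scsi_handle_queue_full()/scsi_handle_queue_ramp_up() code: the struct, the function names, and the 64-completion ramp-up period are all invented for the example.

```c
#include <assert.h>

/* Illustrative model of QUEUE_FULL/TASK_SET_FULL congestion control:
 * on a queue-full response, clamp the depth to what the device proved
 * it can take; after enough clean completions, creep back up. */
struct qd_state {
	unsigned int depth;       /* currently allowed queue depth */
	unsigned int max_depth;   /* configured ceiling */
	unsigned int good_cmds;   /* clean completions since last QF */
};

/* device rejected a command while `outstanding` were in flight */
static void on_queue_full(struct qd_state *s, unsigned int outstanding)
{
	s->good_cmds = 0;
	if (outstanding > 1 && outstanding - 1 < s->depth)
		s->depth = outstanding - 1;   /* ramp down */
}

static void on_good_completion(struct qd_state *s)
{
	if (++s->good_cmds >= 64 && s->depth < s->max_depth) {
		s->depth++;                   /* slow ramp back up */
		s->good_cmds = 0;
	}
}

/* example: device chokes with 20 in flight, then one clean
 * ramp-up period of 64 completions */
static unsigned int demo(void)
{
	struct qd_state s = { .depth = 32, .max_depth = 64, .good_cmds = 0 };
	int i;

	on_queue_full(&s, 20);            /* depth drops to 19 */
	for (i = 0; i < 64; i++)
		on_good_completion(&s);   /* depth creeps back to 20 */
	return s.depth;
}
```

The point of the asymmetry - instant ramp down, slow ramp up - is that a congested target recovers only if the host backs off faster than it pushes.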
Hi Martin,

On Thu, Nov 21, 2019 at 09:59:53PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> > for fairness among LUNs.
> 
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.

blk-mq already puts a limit on each LUN, namely
host_queue_depth / nr_active_LUNs, see hctx_may_queue(). It looks like
this way works for NVMe, which is why I try to bypass .device_busy for
SSD, since it is too expensive on fast storage. Even Hannes wants to
kill it completely.

> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.

I understand SATA's host queue depth is set to 32. But a SAS HBA's queue
depth is often big, so do we rely on .device_busy for throttling
requests to SAS?
> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?

For SSD, I guess it might be fine, given NVMe usually sets the
per-hw-queue depth to 1023. That means the concurrent requests can be
as many as 1023 * nr_hw_queues in case of a single namespace.

> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.

OK, one conservative approach may be to bypass .device_busy only for
SSDs behind some high-end HBAs. Or maybe we can wire up
sdev->queue_depth with the block layer's scheduler queue depth? One
issue is that sdev->queue_depth may be updated from time to time.

Thanks,
Ming
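The hctx_may_queue() fair-share limit mentioned above (roughly host_queue_depth / nr_active_LUNs) can be sketched as follows. This is a rough illustrative model with invented names, not the kernel's actual code, which works on sbitmap depths and per-hctx active-queue counts.

```c
#include <assert.h>

/* Rough model of blk-mq's hctx_may_queue() fairness: when several
 * request queues (LUNs) share one tag set, each *active* queue gets
 * roughly total_tags / nr_active_queues tags. */
static unsigned int fair_tag_limit(unsigned int total_tags,
                                   unsigned int nr_active_queues)
{
	if (nr_active_queues <= 1)
		return total_tags;
	/* round up so every active queue keeps forward progress */
	return (total_tags + nr_active_queues - 1) / nr_active_queues;
}

/* may this queue take another tag with `in_flight` already issued? */
static int may_queue(unsigned int in_flight, unsigned int total_tags,
                     unsigned int nr_active_queues)
{
	return in_flight < fair_tag_limit(total_tags, nr_active_queues);
}
```

With a host queue depth of 1024 and 4 active LUNs, each LUN is capped at 256 in-flight requests - the per-LUN bound that, per the argument above, would remain even if .device_busy were bypassed for SSD.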
On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote: > On 11/21/19 1:53 AM, Ming Lei wrote: > > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote: > >> On 11/18/19 11:31 AM, Ming Lei wrote: > >>> SCSI core uses the atomic variable of sdev->device_busy to track > >>> in-flight IO requests dispatched to this scsi device. IO request may be > >>> submitted from any CPU, so the cost for maintaining the shared atomic > >>> counter can be very big on big NUMA machine with lots of CPU cores. > >>> > >>> sdev->queue_depth is usually used for two purposes: 1) improve IO merge; > >>> 2) fair IO request scattered among all LUNs. > >>> > >>> blk-mq already provides fair request allocation among all active shared > >>> request queues(LUNs), see hctx_may_queue(). > >>> > >>> NVMe doesn't have such per-request-queue(namespace) queue depth, so it > >>> is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't > >>> play big role for reaching top SSD performance. > >>> > >>> With this patch, big cost for tracking in-flight per-LUN requests via > >>> atomic variable can be saved. > >>> > >>> Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue > >>> before changing this flag. > >>> > >>> Cc: Sathya Prakash <sathya.prakash@broadcom.com> > >>> Cc: Chaitra P B <chaitra.basappa@broadcom.com> > >>> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com> > >>> Cc: Kashyap Desai <kashyap.desai@broadcom.com> > >>> Cc: Sumit Saxena <sumit.saxena@broadcom.com> > >>> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > >>> Cc: Ewan D. 
Milne <emilne@redhat.com> > >>> Cc: Christoph Hellwig <hch@lst.de>, > >>> Cc: Hannes Reinecke <hare@suse.de> > >>> Cc: Bart Van Assche <bart.vanassche@wdc.com> > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com> > >>> --- > >>> block/blk-sysfs.c | 14 +++++++++++++- > >>> drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------ > >>> drivers/scsi/sd.c | 4 ++++ > >>> 3 files changed, 35 insertions(+), 7 deletions(-) > >>> > >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > >>> index fca9b158f4a0..9cc0dd5f88a1 100644 > >>> --- a/block/blk-sysfs.c > >>> +++ b/block/blk-sysfs.c > >>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0); > >>> QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0); > >>> #undef QUEUE_SYSFS_BIT_FNS > >>> > >>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q, > >>> + const char *page, size_t count) > >>> +{ > >>> + size_t ret; > >>> + > >>> + blk_mq_freeze_queue(q); > >>> + ret = queue_store_nonrot(q, page, count); > >>> + blk_mq_unfreeze_queue(q); > >>> + > >>> + return ret; > >>> +} > >>> + > >>> static ssize_t queue_zoned_show(struct request_queue *q, char *page) > >>> { > >>> switch (blk_queue_zoned_model(q)) { > >>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = { > >>> static struct queue_sysfs_entry queue_nonrot_entry = { > >>> .attr = {.name = "rotational", .mode = 0644 }, > >>> .show = queue_show_nonrot, > >>> - .store = queue_store_nonrot, > >>> + .store = queue_store_nonrot_freeze, > >>> }; > >>> > >>> static struct queue_sysfs_entry queue_zoned_entry = { > >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > >>> index 62a86a82c38d..72655a049efd 100644 > >>> --- a/drivers/scsi/scsi_lib.c > >>> +++ b/drivers/scsi/scsi_lib.c > >>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd) > >>> if (starget->can_queue > 0) > >>> atomic_dec(&starget->target_busy); > >>> > >>> - atomic_dec(&sdev->device_busy); > >>> + if 
(!blk_queue_nonrot(sdev->request_queue)) > >>> + atomic_dec(&sdev->device_busy); > >>> } > >>> > >>> static void scsi_kick_queue(struct request_queue *q) > >>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev) > >>> > >>> static inline bool scsi_device_is_busy(struct scsi_device *sdev) > >>> { > >>> - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth) > >>> + if (!blk_queue_nonrot(sdev->request_queue) && > >>> + atomic_read(&sdev->device_busy) >= sdev->queue_depth) > >>> return true; > >>> if (atomic_read(&sdev->device_blocked) > 0) > >>> return true; > >>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, > >>> struct scsi_device *sdev) > >>> { > >>> unsigned int busy; > >>> + bool bypass = blk_queue_nonrot(sdev->request_queue); > >>> > >>> - busy = atomic_inc_return(&sdev->device_busy) - 1; > >>> + if (!bypass) > >>> + busy = atomic_inc_return(&sdev->device_busy) - 1; > >>> + else > >>> + busy = 0; > >>> if (atomic_read(&sdev->device_blocked)) { > >>> if (busy) > >>> goto out_dec; > >>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, > >>> "unblocking device at zero depth\n")); > >>> } > >>> > >>> + if (bypass) > >>> + return 1; > >>> + > >>> if (busy >= sdev->queue_depth) > >>> goto out_dec; > >>> > >>> return 1; > >>> out_dec: > >>> - atomic_dec(&sdev->device_busy); > >>> + if (!bypass) > >>> + atomic_dec(&sdev->device_busy); > >>> return 0; > >>> } > >>> > >>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) > >>> struct request_queue *q = hctx->queue; > >>> struct scsi_device *sdev = q->queuedata; > >>> > >>> - atomic_dec(&sdev->device_busy); > >>> + if (!blk_queue_nonrot(sdev->request_queue)) > >>> + atomic_dec(&sdev->device_busy); > >>> } > >>> > >>> static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) > >>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, > >>> case 
BLK_STS_OK: > >>> break; > >>> case BLK_STS_RESOURCE: > >>> - if (atomic_read(&sdev->device_busy) || > >>> + if ((!blk_queue_nonrot(sdev->request_queue) && > >>> + atomic_read(&sdev->device_busy)) || > >>> scsi_device_blocked(sdev)) > >>> ret = BLK_STS_DEV_RESOURCE; > >>> break; > >>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c > >>> index 0744c34468e1..c3d47117700d 100644 > >>> --- a/drivers/scsi/sd.c > >>> +++ b/drivers/scsi/sd.c > >>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp) > >>> rot = get_unaligned_be16(&buffer[4]); > >>> > >>> if (rot == 1) { > >>> + blk_mq_freeze_queue(q); > >>> blk_queue_flag_set(QUEUE_FLAG_NONROT, q); > >>> + blk_mq_unfreeze_queue(q); > >>> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q); > >>> } > >>> > >>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk) > >>> * cause this to be updated correctly and any device which > >>> * doesn't support it should be treated as rotational. > >>> */ > >>> + blk_mq_freeze_queue(q); > >>> blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); > >>> + blk_mq_unfreeze_queue(q); > >>> blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q); > >>> > >>> if (scsi_device_supports_vpd(sdp)) { > >>> > >> Hmm. > >> > >> I must admit I patently don't like this explicit dependency on > >> blk_nonrot(). Having a conditional counter is just an open invitation to > >> getting things wrong... > > > > That won't be an issue given this patch freezes queue when the > > NONROT queue flag is changed. > > > That's not what I'm saying. > > My issue is that we're turning the device_busy counter into a > conditional counter, which only must be accessed when that condition is > true. > This not only puts a burden on the developer (as he has to remember to > always check the condition), but also has possible issues when switching > between modes. The conditional check can be avoided only if we can kill device_busy completely. 
However, as Martin commented, that isn't realistic so far.

> The latter you've addressed with ensuring that the queue must be frozen
> when modifying that flag, but the former is a conceptual problem.

It can't be perfect, but the queue freeze makes it good enough. Someone
said 'Perfect is the enemy of the good', :-)

> 
> And that was what I had been referring to.
> 
> >>
> >> I would far prefer if we could delegate any queueing decision to the
> >> elevators, and completely drop the device_busy flag for all devices.
> >
> > If you drop it, you may create a big sequential IO performance drop
> > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > SSD. NVMe bypasses it because no one uses HDD via NVMe.
> >
> I still wonder how much performance drop we actually see; what seems to
> happen is that device_busy just arbitrarily pushes back to the block
> layer, giving it more time to do merging.
> I do think we can do better than that...

For example, running the following script[1] on a 4-core VM:

 ---------------------------------------------
 |                     | QD: 255 | QD: 32   |
 ---------------------------------------------
 | fio read throughput | 825MB/s | 1432MB/s |
 ---------------------------------------------

note: QD is the scsi_device's queue depth, which can be passed to
test.sh. scsi_debug's device queue depth is 255 by default, which is
>= .can_queue

[1] test.sh

#!/bin/bash
QD=$1
modprobe scsi_debug ndelay=200000
sleep 2
DEVN=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEVN
echo $QD > /sys/block/$DEVN/device/queue_depth
SQD=`cat /sys/block/$DEVN/device/queue_depth`
echo "scsi device queue depth: $SQD"
fio --readwrite=read --filename=$DEV --runtime=20s --time_based --group_reporting=1 \
    --ioengine=libaio --direct=1 --iodepth=64 --bs=4k --numjobs=4 --name=seq_perf

Thanks,
Ming
>> If we ignore the RAID controller use case where the controller
>> internally queues and arbitrates commands between many devices

These controllers should not be "ignored", but rather enabled. Many of
them virtualize both HDD and NVMe devices behind them, and are thus
forced to expose themselves as SCSI controllers. However, they have
their own queue management and IO merging capabilities. Many have the
capability of holding IOs in a queue and pulling them as needed (just
like NVMe), and thus do not care how many IOs are sent to a device or
controller, or whether there is congestion. In case of congestion, the
IO will simply wait in the queue, backed by advanced timeout handling
capabilities. Besides, as Ming pointed out, the block layer (function
hctx_may_queue) already limits IO on a per-controller and per-LUN
basis.

Overall, if the proposal does not work for all cases, then at least it
should be made optional for high-end controllers, so that they are not
disadvantaged vis-a-vis NVMe just because they expose themselves as
SCSI in order to support a wide range of devices behind them.

thanks,
Sumanesh

On 11/21/2019 7:59 PM, Martin K. Petersen wrote:
> Ming,
> 
>> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
>> for fairness among LUNs.
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.
> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.
> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?
> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.
>
On 11/22/19 12:09 AM, Ming Lei wrote: > On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote: >> On 11/21/19 1:53 AM, Ming Lei wrote: >>> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote: >>>> I would far prefer if we could delegate any queueing decision to the >>>> elevators, and completely drop the device_busy flag for all devices. >>> >>> If you drop it, you may create big sequential IO performance drop >>> on HDD., that is why this patch only bypasses sdev->queue_depth on >>> SSD. NVMe bypasses it because no one uses HDD. via NVMe. >>> >> I still wonder how much performance drop we actually see; what seems to >> happen is that device_busy just arbitrary pushes back to the block >> layer, giving it more time to do merging. >> I do think we can do better then that... > > For example, running the following script[1] on 4-core VM: > > ------------------------------------------ > | QD:255 | QD: 32 | > ------------------------------------------ > fio read throughput | 825MB/s | 1432MB/s| > ------------------------------------------ > > [ ... ] Hi Ming, Thanks for having shared these numbers. I think this is very useful information. Do these results show the performance drop that happens if /sys/block/.../device/queue_depth exceeds .can_queue? What I am wondering about is how important these results are in the context of this discussion. Are there any modern SCSI devices for which a SCSI LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the device responds with BUSY? What surprised me is that only three SCSI LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that mean that BUSY responses from a SCSI device or HBA are rare? Thanks, Bart.
On 11/22/2019 10:14 AM, Bart Van Assche wrote:
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens
> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am
> wondering about is how important these results are in the context of
> this discussion. Are there any modern SCSI devices for which a SCSI
> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
> device responds with BUSY? What surprised me is that only three SCSI
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
> mean that BUSY responses from a SCSI device or HBA are rare?

That's because most of the drivers which had queue-full ramp up/ramp
down in them, and would have called scsi_track_queue_full(), converted
over to the queue-full handling that moved into the mid-layer,
indicated by sht->track_queue_depth = 1.

Yes - it is still hit a lot!

-- james
On 11/22/19 10:26 AM, James Smart wrote: > On 11/22/2019 10:14 AM, Bart Van Assche wrote: >> Thanks for having shared these numbers. I think this is very useful >> information. Do these results show the performance drop that happens >> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am >> wondering about is how important these results are in the context of >> this discussion. Are there any modern SCSI devices for which a SCSI >> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the >> device responds with BUSY? What surprised me is that only three SCSI >> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that >> mean that BUSY responses from a SCSI device or HBA are rare? > > That's because most of the drivers, which had queue full ramp up/ramp > down in them and would have called scsi_track_queue_full() converted > over to the moved-queue-full-handling-in-the-mid-layer, indicated by > sht->track_queue_depth = 1. > > Yes - it is still hit a lot! Hi James, In the systems that I have been working on myself I made sure that the BUSY condition is rarely or never encountered. Anyway, since there are setups in which this condition is hit frequently we need to make sure that these setups keep performing well. I'm wondering now whether we should try to come up with an algorithm for maintaining sdev->device_busy only if it improves performance and for not maintaining sdev->device_busy for devices/HBAs that don't need it. Bart.
On Fri, Nov 22, 2019 at 10:14:51AM -0800, Bart Van Assche wrote:
> On 11/22/19 12:09 AM, Ming Lei wrote:
> > On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
> > > On 11/21/19 1:53 AM, Ming Lei wrote:
> > > > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> > > > > I would far prefer if we could delegate any queueing decision to the
> > > > > elevators, and completely drop the device_busy flag for all devices.
> > > > 
> > > > If you drop it, you may create big sequential IO performance drop
> > > > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > > > SSD. NVMe bypasses it because no one uses HDD via NVMe.
> > > > 
> > > I still wonder how much performance drop we actually see; what seems to
> > > happen is that device_busy just arbitrarily pushes back to the block
> > > layer, giving it more time to do merging.
> > > I do think we can do better than that...
> > 
> > For example, running the following script[1] on a 4-core VM:
> > 
> > ------------------------------------------
> >                      | QD: 255 | QD: 32  |
> > ------------------------------------------
> >  fio read throughput | 825MB/s | 1432MB/s|
> > ------------------------------------------
> > 
> > [ ... ]
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens if
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am wondering

The above test just shows that IO merge plays an important role here,
and one important point for triggering IO merge is that .get_budget
returns false. If sdev->queue_depth is too big, .get_budget may never
return false. That is why this patch just bypasses .device_busy for
SSD.

> about is how important these results are in the context of this discussion.
> Are there any modern SCSI devices for which a SCSI LLD sets
> scsi_host->can_queue and scsi_host->cmd_per_lun such that the device
> responds with BUSY?
> What surprised me is that only three SCSI LLDs call

There are many such HBAs for which sdev->queue_depth is smaller than
.can_queue, especially in case of a small number of LUNs.

> scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that mean that BUSY
> responses from a SCSI device or HBA are rare?

It is only true for some HBAs.

thanks,
Ming
On Fri, Nov 22, 2019 at 12:46:48PM -0800, Bart Van Assche wrote: > On 11/22/19 10:26 AM, James Smart wrote: > > On 11/22/2019 10:14 AM, Bart Van Assche wrote: > > > Thanks for having shared these numbers. I think this is very useful > > > information. Do these results show the performance drop that happens > > > if /sys/block/.../device/queue_depth exceeds .can_queue? What I am > > > wondering about is how important these results are in the context of > > > this discussion. Are there any modern SCSI devices for which a SCSI > > > LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that > > > the device responds with BUSY? What surprised me is that only three > > > SCSI LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does > > > that mean that BUSY responses from a SCSI device or HBA are rare? > > > > That's because most of the drivers, which had queue full ramp up/ramp > > down in them and would have called scsi_track_queue_full() converted > > over to the moved-queue-full-handling-in-the-mid-layer, indicated by > > sht->track_queue_depth = 1. > > > > Yes - it is still hit a lot! > > Hi James, > > In the systems that I have been working on myself I made sure that the BUSY > condition is rarely or never encountered. Anyway, since there are setups in > which this condition is hit frequently we need to make sure that these > setups keep performing well. I'm wondering now whether we should try to come > up with an algorithm for maintaining sdev->device_busy only if it improves > performance and for not maintaining sdev->device_busy for devices/HBAs that > don't need it. The simplest policy could be to only maintain sdev->device_busy for HDD. Thanks, Ming
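The "only maintain sdev->device_busy for HDD" policy can be sketched as below. This is an illustration of the idea, not the patch itself: `struct fake_sdev` and its `nonrot` field stand in for the real scsi_device and blk_queue_nonrot(), and the helper names are invented.

```c
#include <stdatomic.h>

/* Conditional per-LUN counter: only rotational devices pay for the
 * shared atomic in-flight counter; SSDs skip it entirely. */
struct fake_sdev {
	atomic_uint device_busy;     /* in-flight commands (HDD only) */
	unsigned int queue_depth;    /* per-LUN depth limit */
	int nonrot;                  /* 1 = SSD, 0 = HDD */
};

/* admission check, modeled on scsi_dev_queue_ready() */
static int dev_queue_ready(struct fake_sdev *sdev)
{
	unsigned int busy;

	if (sdev->nonrot)
		return 1;	/* SSD: no per-LUN throttling at all */

	busy = atomic_fetch_add(&sdev->device_busy, 1);
	if (busy >= sdev->queue_depth) {
		atomic_fetch_sub(&sdev->device_busy, 1);
		return 0;	/* HDD at queue_depth: push back */
	}
	return 1;
}

/* completion path, modeled on scsi_device_unbusy() */
static void dev_unbusy(struct fake_sdev *sdev)
{
	if (!sdev->nonrot)
		atomic_fetch_sub(&sdev->device_busy, 1);
}

static int demo(void)
{
	struct fake_sdev hdd = { .queue_depth = 2, .nonrot = 0 };
	struct fake_sdev ssd = { .queue_depth = 2, .nonrot = 1 };
	int i, ok = 1;

	ok &= dev_queue_ready(&hdd);       /* 1 in flight */
	ok &= dev_queue_ready(&hdd);       /* 2 in flight */
	ok &= !dev_queue_ready(&hdd);      /* throttled at queue_depth */
	dev_unbusy(&hdd);
	ok &= dev_queue_ready(&hdd);       /* slot freed, admitted again */
	for (i = 0; i < 100; i++)
		ok &= dev_queue_ready(&ssd);  /* SSD is never throttled */
	return ok;
}
```

The sketch also shows the hazard Hannes objects to: every reader and writer of device_busy must remember to test nonrot, and the flag must not flip while commands are in flight - which is exactly what the queue freeze in the patch is for.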
On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens if
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am
> wondering about is how important these results are in the context of
> this discussion. Are there any modern SCSI devices for which a SCSI LLD
> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
> device responds with BUSY? What surprised me is that only three SCSI
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
> mean that BUSY responses from a SCSI device or HBA are rare?
> 

Some FC HBAs end up returning busy from ->queuecommand(), but I think
this is more commonly due to there being an issue with the rport rather
than the device.

-Ewan
On 11/25/2019 10:28 AM, Ewan D. Milne wrote:
> On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
>> Hi Ming,
>>
>> Thanks for having shared these numbers. I think this is very useful
>> information. Do these results show the performance drop that happens if
>> /sys/block/.../device/queue_depth exceeds .can_queue? What I am
>> wondering about is how important these results are in the context of
>> this discussion. Are there any modern SCSI devices for which a SCSI LLD
>> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
>> device responds with BUSY? What surprised me is that only three SCSI
>> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
>> mean that BUSY responses from a SCSI device or HBA are rare?
>>
> Some FC HBAs end up returning busy from ->queuecommand(), but I think
> this is more commonly due to there being an issue with the rport rather
> than the device.
> 
> -Ewan
> 

True - but I would assume busy from queuecommand() is different from
BUSY/QUEUE_FULL via a SCSI response. Adapter queuecommand busies can be
due to out-of-resource limits in the driver - such as I_T IO count
limits enforced by the driver being reached, or some other adapter
resource limit being reached as well. can_queue covers most of those -
but we sometimes overcommit the adapter with a can_queue on the
physical port as well as per NPIV port, or SCSI and NVMe on the same
port.

Going back to Bart's question - with SANs and multiple initiators
sharing a target and lots of LUNs on that target, it's very common to
hit bursty conditions where the target may reply with QUEUE_FULL. Many
arrays provide tuning guides on how to set up values on multiple hosts,
but it's mainly to help the target avoid being completely overrun, as
some didn't do so well. In the end, it's very hard to predict
multi-initiator load, and in a lot of cases things are usually left a
bit overcommitted, as the performance downside otherwise is
significant.

-- james
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..9cc0dd5f88a1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
+		const char *page, size_t count)
+{
+	size_t ret;
+
+	blk_mq_freeze_queue(q);
+	ret = queue_store_nonrot(q, page, count);
+	blk_mq_unfreeze_queue(q);
+
+	return ret;
+}
+
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
 	switch (blk_queue_zoned_model(q)) {
@@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
-	.store = queue_store_nonrot,
+	.store = queue_store_nonrot_freeze,
 };
 
 static struct queue_sysfs_entry queue_zoned_entry = {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 62a86a82c38d..72655a049efd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!blk_queue_nonrot(sdev->request_queue) &&
+			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = blk_queue_nonrot(sdev->request_queue);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!blk_queue_nonrot(sdev->request_queue) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0744c34468e1..c3d47117700d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 	rot = get_unaligned_be16(&buffer[4]);
 
 	if (rot == 1) {
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
 	}
 
@@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 * cause this to be updated correctly and any device which
 	 * doesn't support it should be treated as rotational.
 	 */
+	blk_mq_freeze_queue(q);
 	blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+	blk_mq_unfreeze_queue(q);
 	blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
 
 	if (scsi_device_supports_vpd(sdp)) {
The SCSI core uses the atomic variable sdev->device_busy to track
in-flight IO requests dispatched to a SCSI device. IO requests may be
submitted from any CPU, so the cost of maintaining this shared atomic
counter can be very high on a big NUMA machine with many CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO
merging; 2) fairly scattering IO requests among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs), see hctx_may_queue().

NVMe doesn't have such a per-request-queue (namespace) queue depth, so
it is reasonable to ignore the limit for SCSI SSDs too. Also, IO
merging won't play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests
via an atomic variable can be saved.

Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the
queue before changing this flag.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c       | 14 +++++++++++++-
 drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
 drivers/scsi/sd.c       |  4 ++++
 3 files changed, 35 insertions(+), 7 deletions(-)