From patchwork Tue Oct 8 10:09:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ming Lei X-Patchwork-Id: 11179271 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 748F61575 for ; Tue, 8 Oct 2019 10:10:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4978520673 for ; Tue, 8 Oct 2019 10:10:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730165AbfJHKKC (ORCPT ); Tue, 8 Oct 2019 06:10:02 -0400 Received: from mx1.redhat.com ([209.132.183.28]:40886 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730106AbfJHKKB (ORCPT ); Tue, 8 Oct 2019 06:10:01 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id E6289C056808; Tue, 8 Oct 2019 10:10:00 +0000 (UTC) Received: from localhost (ovpn-8-31.pek2.redhat.com [10.72.8.31]) by smtp.corp.redhat.com (Postfix) with ESMTP id B91385D772; Tue, 8 Oct 2019 10:09:57 +0000 (UTC) From: Ming Lei To: linux-scsi@vger.kernel.org, "Martin K . Petersen" Cc: James Bottomley , Ming Lei , Jens Axboe , "Ewan D . Milne" , Omar Sandoval , James Bottomley , Christoph Hellwig , Kashyap Desai , Hannes Reinecke , Laurence Oberman , Bart Van Assche Subject: [PATCH V3 1/2] scsi: core: avoid host-wide host_busy counter for scsi_mq Date: Tue, 8 Oct 2019 18:09:44 +0800 Message-Id: <20191008100945.24951-2-ming.lei@redhat.com> In-Reply-To: <20191008100945.24951-1-ming.lei@redhat.com> References: <20191008100945.24951-1-ming.lei@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.32]); Tue, 08 Oct 2019 10:10:01 +0000 (UTC) Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org It isn't necessary to check the host depth in scsi_queue_rq() any more since it has been respected by blk-mq before calling scsi_queue_rq() via getting driver tag. Lots of LUNs may attach to same host and per-host IOPS may reach millions, so we should avoid expensive atomic operations on the host-wide counter in the IO path. This patch implements scsi_host_busy() via blk_mq_tagset_busy_iter() with one scsi command state for reading the count of busy IOs for scsi_mq. It is observed that IOPS is increased by 15% in IO test on scsi_debug (32 LUNs, 32 submit queues, 1024 can_queue, libaio/dio) in a dual-socket system. V2: - introduce SCMD_STATE_INFLIGHT for getting accurate host busy via blk_mq_tagset_busy_iter() - verified that original Jens's report[1] is fixed - verified that SCSI timeout/abort works fine [1] https://www.spinics.net/lists/linux-scsi/msg122867.html [2] V1 & its revert: d772a65d8a6c Revert "scsi: core: avoid host-wide host_busy counter for scsi_mq" 23aa8e69f2c6 Revert "scsi: core: fix scsi_host_queue_ready" 265d59aacbce scsi: core: fix scsi_host_queue_ready 328728630d9f scsi: core: avoid host-wide host_busy counter for scsi_mq Cc: Jens Axboe Cc: Ewan D. Milne Cc: Omar Sandoval , Cc: "Martin K. Petersen" , Cc: James Bottomley , Cc: Christoph Hellwig , Cc: Kashyap Desai Cc: Hannes Reinecke Cc: Laurence Oberman Cc: Bart Van Assche Signed-off-by: Ming Lei --- drivers/scsi/hosts.c | 19 ++++++++++++++++++- drivers/scsi/scsi.c | 2 +- drivers/scsi/scsi_lib.c | 35 +++++++++++++++++------------------ drivers/scsi/scsi_priv.h | 2 +- include/scsi/scsi_cmnd.h | 1 + include/scsi/scsi_host.h | 1 - 6 files changed, 38 insertions(+), 22 deletions(-) diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 55522b7162d3..1d669e47b692 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -38,6 +38,7 @@ #include #include #include +#include #include "scsi_priv.h" #include "scsi_logging.h" @@ -554,13 +555,29 @@ struct Scsi_Host *scsi_host_get(struct Scsi_Host *shost) } EXPORT_SYMBOL(scsi_host_get); +static bool scsi_host_check_in_flight(struct request *rq, void *data, + bool reserved) +{ + int *count = data; + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq); + + if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) + (*count)++; + + return true; +} + /** * scsi_host_busy - Return the host busy counter * @shost: Pointer to Scsi_Host to inc. **/ int scsi_host_busy(struct Scsi_Host *shost) { - return atomic_read(&shost->host_busy); + int cnt = 0; + + blk_mq_tagset_busy_iter(&shost->tag_set, + scsi_host_check_in_flight, &cnt); + return cnt; } EXPORT_SYMBOL(scsi_host_busy); diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index 1f5b5c8a7f72..ddc4ec6ea2a1 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -186,7 +186,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd) struct scsi_driver *drv; unsigned int good_bytes; - scsi_device_unbusy(sdev); + scsi_device_unbusy(sdev, cmd); /* * Clear the flags that say that the device/target/host is no longer diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index dc210b9d4896..b6f66dcb15a5 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -189,7 +189,7 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy) * active on the host/device. */ if (unbusy) - scsi_device_unbusy(device); + scsi_device_unbusy(device, cmd); /* * Requeue this command. It will go before all other commands @@ -329,12 +329,12 @@ static void scsi_init_cmd_errh(struct scsi_cmnd *cmd) * host_failed counter or that it notices the shost state change made by * scsi_eh_scmd_add(). */ -static void scsi_dec_host_busy(struct Scsi_Host *shost) +static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd) { unsigned long flags; rcu_read_lock(); - atomic_dec(&shost->host_busy); + __clear_bit(SCMD_STATE_INFLIGHT, &cmd->state); if (unlikely(scsi_host_in_recovery(shost))) { spin_lock_irqsave(shost->host_lock, flags); if (shost->host_failed || shost->host_eh_scheduled) @@ -344,12 +344,12 @@ static void scsi_dec_host_busy(struct Scsi_Host *shost) rcu_read_unlock(); } -void scsi_device_unbusy(struct scsi_device *sdev) +void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd) { struct Scsi_Host *shost = sdev->host; struct scsi_target *starget = scsi_target(sdev); - scsi_dec_host_busy(shost); + scsi_dec_host_busy(shost, cmd); if (starget->can_queue > 0) atomic_dec(&starget->target_busy); @@ -430,9 +430,6 @@ static inline bool scsi_target_is_busy(struct scsi_target *starget) static inline bool scsi_host_is_busy(struct Scsi_Host *shost) { - if (shost->can_queue > 0 && - atomic_read(&shost->host_busy) >= shost->can_queue) - return true; if (atomic_read(&shost->host_blocked) > 0) return true; if (shost->host_self_blocked) @@ -1139,6 +1136,7 @@ void scsi_init_command(struct scsi_device *dev, struct scsi_cmnd *cmd) unsigned int flags = cmd->flags & SCMD_PRESERVED_FLAGS; unsigned long jiffies_at_alloc; int retries; + bool in_flight; if (!blk_rq_is_scsi(rq) && !(flags & SCMD_INITIALIZED)) { flags |= SCMD_INITIALIZED; @@ -1147,6 +1145,7 @@ void scsi_init_command(struct scsi_device *dev, struct scsi_cmnd *cmd) jiffies_at_alloc = cmd->jiffies_at_alloc; retries = cmd->retries; + in_flight = test_bit(SCMD_STATE_INFLIGHT, &cmd->state); /* zero out the cmd, except for the embedded scsi_request */ memset((char *)cmd + sizeof(cmd->req), 0, sizeof(*cmd) - sizeof(cmd->req) + dev->host->hostt->cmd_size); @@ -1158,6 +1157,8 @@ void scsi_init_command(struct scsi_device *dev, struct scsi_cmnd *cmd) INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler); cmd->jiffies_at_alloc = jiffies_at_alloc; cmd->retries = retries; + if (in_flight) + __set_bit(SCMD_STATE_INFLIGHT, &cmd->state); scsi_add_cmd_to_list(cmd); } @@ -1367,16 +1368,14 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost, */ static inline int scsi_host_queue_ready(struct request_queue *q, struct Scsi_Host *shost, - struct scsi_device *sdev) + struct scsi_device *sdev, + struct scsi_cmnd *cmd) { - unsigned int busy; - if (scsi_host_in_recovery(shost)) return 0; - busy = atomic_inc_return(&shost->host_busy) - 1; if (atomic_read(&shost->host_blocked) > 0) { - if (busy) + if (scsi_host_busy(shost) > 0) goto starved; /* @@ -1390,8 +1389,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q, "unblocking host at zero depth\n")); } - if (shost->can_queue > 0 && busy >= shost->can_queue) - goto starved; if (shost->host_self_blocked) goto starved; @@ -1403,6 +1400,8 @@ static inline int scsi_host_queue_ready(struct request_queue *q, spin_unlock_irq(shost->host_lock); } + __set_bit(SCMD_STATE_INFLIGHT, &cmd->state); + return 1; starved: @@ -1411,7 +1410,7 @@ static inline int scsi_host_queue_ready(struct request_queue *q, list_add_tail(&sdev->starved_entry, &shost->starved_list); spin_unlock_irq(shost->host_lock); out_dec: - scsi_dec_host_busy(shost); + scsi_dec_host_busy(shost, cmd); return 0; } @@ -1665,7 +1664,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, ret = BLK_STS_RESOURCE; if (!scsi_target_queue_ready(shost, sdev)) goto out_put_budget; - if (!scsi_host_queue_ready(q, shost, sdev)) + if (!scsi_host_queue_ready(q, shost, sdev, cmd)) goto out_dec_target_busy; if (!(req->rq_flags & RQF_DONTPREP)) { @@ -1697,7 +1696,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, return BLK_STS_OK; out_dec_host_busy: - scsi_dec_host_busy(shost); + scsi_dec_host_busy(shost, cmd); out_dec_target_busy: if (scsi_target(sdev)->can_queue > 0) atomic_dec(&scsi_target(sdev)->target_busy); diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h index cc2859d76d81..3bff9f7aa684 100644 --- a/drivers/scsi/scsi_priv.h +++ b/drivers/scsi/scsi_priv.h @@ -87,7 +87,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd); extern void scsi_add_cmd_to_list(struct scsi_cmnd *cmd); extern void scsi_del_cmd_from_list(struct scsi_cmnd *cmd); extern int scsi_maybe_unblock_host(struct scsi_device *sdev); -extern void scsi_device_unbusy(struct scsi_device *sdev); +extern void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd); extern void scsi_queue_insert(struct scsi_cmnd *cmd, int reason); extern void scsi_io_completion(struct scsi_cmnd *, unsigned int); extern void scsi_run_host_queues(struct Scsi_Host *shost); diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h index 91bd749a02f7..9c22e85902ec 100644 --- a/include/scsi/scsi_cmnd.h +++ b/include/scsi/scsi_cmnd.h @@ -63,6 +63,7 @@ struct scsi_pointer { /* for scmd->state */ #define SCMD_STATE_COMPLETE 0 +#define SCMD_STATE_INFLIGHT 1 struct scsi_cmnd { struct scsi_request req; diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h index 31e0d6ca1eba..b3fa32df40f8 100644 --- a/include/scsi/scsi_host.h +++ b/include/scsi/scsi_host.h @@ -551,7 +551,6 @@ struct Scsi_Host { /* Area to keep a shared tag map */ struct blk_mq_tag_set tag_set; - atomic_t host_busy; /* commands actually active on low-level */ atomic_t host_blocked; unsigned int host_failed; /* commands that failed. From patchwork Tue Oct 8 10:09:45 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ming Lei X-Patchwork-Id: 11179273 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 19BAB139A for ; Tue, 8 Oct 2019 10:10:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 03A0C2070B for ; Tue, 8 Oct 2019 10:10:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730170AbfJHKKJ (ORCPT ); Tue, 8 Oct 2019 06:10:09 -0400 Received: from mx1.redhat.com ([209.132.183.28]:57620 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730106AbfJHKKJ (ORCPT ); Tue, 8 Oct 2019 06:10:09 -0400 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id EFAF251EF1; Tue, 8 Oct 2019 10:10:08 +0000 (UTC) Received: from localhost (ovpn-8-31.pek2.redhat.com [10.72.8.31]) by smtp.corp.redhat.com (Postfix) with ESMTP id 46D9C5C219; Tue, 8 Oct 2019 10:10:02 +0000 (UTC) From: Ming Lei To: linux-scsi@vger.kernel.org, "Martin K . Petersen" Cc: James Bottomley , Ming Lei , Jens Axboe , "Ewan D . Milne" , Omar Sandoval , James Bottomley , Christoph Hellwig , Kashyap Desai , Hannes Reinecke , Laurence Oberman , Bart Van Assche Subject: [RFC PATCH V3 2/2] scsi: core: don't limit per-LUN queue depth for SSD Date: Tue, 8 Oct 2019 18:09:45 +0800 Message-Id: <20191008100945.24951-3-ming.lei@redhat.com> In-Reply-To: <20191008100945.24951-1-ming.lei@redhat.com> References: <20191008100945.24951-1-ming.lei@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Tue, 08 Oct 2019 10:10:09 +0000 (UTC) Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org SCSI core uses the atomic variable of sdev->device_busy to track in-flight IO requests dispatched to this scsi device. IO request may be submitted from any CPU, so the cost for maintaining the shared atomic counter can be very big on big NUMA machine with lots of CPU cores. sdev->queue_depth is usually used for two purposes: 1) improve IO merge; 2) fair IO request scattered among all LUNs. blk-mq already provides fair request allocation among all active shared request queues(LUNs), see hctx_may_queue(). NVMe doesn't have such per-request-queue(namespace) queue depth, so it is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't play big role for reaching top SSD performance. With this patch, big cost for tracking in-flight per-LUN requests via atomic variable can be saved. Cc: Jens Axboe Cc: Ewan D. Milne Cc: Omar Sandoval , Cc: "Martin K. Petersen" , Cc: James Bottomley , Cc: Christoph Hellwig , Cc: Kashyap Desai Cc: Hannes Reinecke Cc: Laurence Oberman Cc: Bart Van Assche Signed-off-by: Ming Lei Reported-by: kernel test robot --- drivers/scsi/scsi_lib.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index b6f66dcb15a5..0f8e7a522060 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd) if (starget->can_queue > 0) atomic_dec(&starget->target_busy); - atomic_dec(&sdev->device_busy); + if (!blk_queue_nonrot(sdev->request_queue)) + atomic_dec(&sdev->device_busy); } static void scsi_kick_queue(struct request_queue *q) @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev) static inline bool scsi_device_is_busy(struct scsi_device *sdev) { - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth) + if (!blk_queue_nonrot(sdev->request_queue) && + atomic_read(&sdev->device_busy) >= sdev->queue_depth) return true; if (atomic_read(&sdev->device_blocked) > 0) return true; @@ -1283,8 +1285,10 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, struct scsi_device *sdev) { unsigned int busy; + bool bypass = blk_queue_nonrot(sdev->request_queue); - busy = atomic_inc_return(&sdev->device_busy) - 1; + if (!bypass) + busy = atomic_inc_return(&sdev->device_busy) - 1; if (atomic_read(&sdev->device_blocked)) { if (busy) goto out_dec; @@ -1298,12 +1302,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, "unblocking device at zero depth\n")); } + if (bypass) + return 1; + if (busy >= sdev->queue_depth) goto out_dec; return 1; out_dec: - atomic_dec(&sdev->device_busy); + if (!bypass) + atomic_dec(&sdev->device_busy); return 0; } @@ -1624,7 +1632,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) struct request_queue *q = hctx->queue; struct scsi_device *sdev = q->queuedata; - atomic_dec(&sdev->device_busy); + if (!blk_queue_nonrot(sdev->request_queue)) + atomic_dec(&sdev->device_busy); } static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) @@ -1706,7 +1715,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, case BLK_STS_OK: break; case BLK_STS_RESOURCE: - if (atomic_read(&sdev->device_busy) || + if ((!blk_queue_nonrot(sdev->request_queue) && + atomic_read(&sdev->device_busy)) || scsi_device_blocked(sdev)) ret = BLK_STS_DEV_RESOURCE; break;