| Message ID | 20200119071432.18558-6-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | scsi: support bypass device busy check for some high end HBA with SSD |
On 2020-01-18 23:14, Ming Lei wrote:
> +static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
> +{
> +	struct Scsi_Host *shost = sdev->host;
> +
> +	if (!shost->hostt->no_device_queue_for_ssd)
> +		return false;
> +
> +	return blk_queue_nonrot(sdev->request_queue);
> +}

In other words, sdev->device_busy is maintained for all SCSI devices except for those SSDs controlled by a SCSI LLD that has no_device_queue_for_ssd set in its host template. I'd like to see different behavior, namely that sdev->device_busy is not maintained for any SSD except those SSDs that really need the sdev->device_busy counter. The blacklist mechanism may be more appropriate for marking such SSDs than the SCSI host template.

What I also noticed is that most scsi_bypass_device_busy() calls have an exclamation mark (!) in front of them. I think that inverting the return value and renaming this function to e.g. scsi_maintain_device_busy() would result in code that is easier to read.

Thanks,

Bart.
Ming,

> NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> is reasonable to ignore the limit for SCSI SSD too.

It is really not. A free host tag does not guarantee that the target device can queue the command.

The assumption that SSDs are somehow special because they are "fast" is not valid. Given the common hardware queue depth for a SAS device of ~128 it is often trivial to drive a device into a congestion scenario. We see it all the time for non-rotational devices, SSDs and arrays alike. The SSD heuristic is simply not going to fly.

Don't get me wrong, I am very sympathetic to obliterating device_busy in the hot path. I just don't think it is as easy as ignoring the counter and hoping for the best. Dynamic queue depth management is an integral part of the SCSI protocol, not something we can just decide to bypass because a device claims to be of a certain media type or speed.

I would prefer not to touch drivers that rely on cmd_per_lun / untagged operation and focus exclusively on the ones that use .track_queue_depth. For those we could consider an adaptive queue depth management scheme. Something like not maintaining device_busy until we actually get a QUEUE_FULL condition. And then rely on the existing queue depth ramp up heuristics to determine when to disable the busy counter again. Maybe with an additional watermark or time limit to avoid flip-flopping.

If that approach turns out to work, we should convert all remaining non-legacy drivers to .track_queue_depth so we only have two driver queuing flavors to worry about.
Hi Martin,

On Mon, Jan 20, 2020 at 11:52:06PM -0500, Martin K. Petersen wrote:
> Ming,
>
> > NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> > is reasonable to ignore the limit for SCSI SSD too.
>
> It is really not. A free host tag does not guarantee that the target
> device can queue the command.

Yeah, I agree.

> The assumption that SSDs are somehow special because they are "fast" is
> not valid. Given the common hardware queue depth for a SAS device of
> ~128 it is often trivial to drive a device into a congestion
> scenario. We see it all the time for non-rotational devices, SSDs and
> arrays alike. The SSD heuristic is simply not going to fly.

In reality, the number of channels in an SSD is very limited (dozens even in enterprise grade SSDs), so the device itself can be saturated by a limited number of in-flight IOs.

However, it depends on whether the target device returns the congestion to the host. From my observation, it looks like there is no such feedback from NVMe targets.

If the SSD target device doesn't provide that kind of congestion feedback, tracking in-flight per-LUN IO via .device_busy doesn't make any difference.

Even if there were an SSD target that provided congestion feedback, bypassing .device_busy wouldn't have a big effect either, since blk-mq's SCHED_RESTART will retry an IO that returned STS_RESOURCE only after another in-flight IO has completed.

> Don't get me wrong, I am very sympathetic to obliterating device_busy in
> the hot path. I just don't think it is as easy as ignoring the counter
> and hoping for the best. Dynamic queue depth management is an integral
> part of the SCSI protocol, not something we can just decide to bypass
> because a device claims to be of a certain media type or speed.

At least, the Broadcom guys tested this patch on megaraid and the results show a big improvement; that is why the flag is only set on the megaraid host.
> I would prefer not to touch drivers that rely on cmd_per_lun / untagged
> operation and focus exclusively on the ones that use .track_queue_depth.
> For those we could consider an adaptive queue depth management scheme.
> Something like not maintaining device_busy until we actually get a
> QUEUE_FULL condition. And then rely on the existing queue depth ramp up
> heuristics to determine when to disable the busy counter again. Maybe
> with an additional watermark or time limit to avoid flip-flopping.
>
> If that approach turns out to work, we should convert all remaining
> non-legacy drivers to .track_queue_depth so we only have two driver
> queuing flavors to worry about.

In theory, .track_queue_depth may only improve sequential IO performance for HDDs; it is not very effective for SSDs, beyond saving a few CPU cycles.

I will study the related drivers/devices a bit more wrt. .track_queue_depth.

Thanks,
Ming
Ming,

> However, it depends on whether the target device returns the congestion
> to the host. From my observation, it looks like there is no such
> feedback from NVMe targets.

It happens all the time with SCSI devices. It is imperative that this keeps working.

> Even if there were an SSD target that provided congestion feedback,
> bypassing .device_busy wouldn't have a big effect either, since blk-mq's
> SCHED_RESTART will retry an IO that returned STS_RESOURCE only after
> another in-flight IO has completed.

The reason we back off is that it allows the device to recover by temporarily reducing its workload. In addition, the lower queue depth alleviates the risk of commands timing out and leading to application I/O failures.

> At least, the Broadcom guys tested this patch on megaraid and the
> results show a big improvement; that is why the flag is only set on the
> megaraid host.

I do not question that it improves performance. That's not my point.

> In theory, .track_queue_depth may only improve sequential IO performance
> for HDDs; it is not very effective for SSDs, beyond saving a few CPU
> cycles.

This is not about performance. This is about how the system behaves when a device is starved for resources or experiencing transient failures.
Hi Martin,

On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> Ming,
>
> > However, it depends on whether the target device returns the congestion
> > to the host. From my observation, it looks like there is no such
> > feedback from NVMe targets.
>
> It happens all the time with SCSI devices. It is imperative that this
> keeps working.
>
> > Even if there were an SSD target that provided congestion feedback,
> > bypassing .device_busy wouldn't have a big effect either, since blk-mq's
> > SCHED_RESTART will retry an IO that returned STS_RESOURCE only after
> > another in-flight IO has completed.
>
> The reason we back off is that it allows the device to recover by
> temporarily reducing its workload. In addition, the lower queue depth
> alleviates the risk of commands timing out and leading to application
> I/O failures.

The timeout risk may only arise when the driver/device doesn't return congestion feedback while the host queue depth is big enough.

So far we don't see such an issue on NVMe, whose hw queue depth is 1023 and whose hw queue count is often 32+, and we have not seen such timeout reports even with that many in-flight IOs (32 * 1023) on a single LUN.

Also, megaraid_sas's queue depth is much less than 32 * 1023, so this seems very unlikely to happen.

Megaraid guys, could you clarify whether this is an issue? Kashyap, Sumit and Shivasharan?

> > At least, the Broadcom guys tested this patch on megaraid and the
> > results show a big improvement; that is why the flag is only set on
> > the megaraid host.
>
> I do not question that it improves performance. That's not my point.
>
> > In theory, .track_queue_depth may only improve sequential IO
> > performance for HDDs; it is not very effective for SSDs, beyond saving
> > a few CPU cycles.
>
> This is not about performance. This is about how the system behaves when
> a device is starved for resources or experiencing transient failures.

Could you explain a bit how this patch changes the system behavior? I understand the EH just retries the incomplete requests, whose total number is less than the host queue depth.

Thanks,
Ming
On Fri, Jan 24, 2020 at 7:30 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Martin,
>
> On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> > Ming,
> >
> > > However, it depends on whether the target device returns the
> > > congestion to the host. From my observation, it looks like there is
> > > no such feedback from NVMe targets.
> >
> > It happens all the time with SCSI devices. It is imperative that this
> > keeps working.
> >
> > > Even if there were an SSD target that provided congestion feedback,
> > > bypassing .device_busy wouldn't have a big effect either, since
> > > blk-mq's SCHED_RESTART will retry an IO that returned STS_RESOURCE
> > > only after another in-flight IO has completed.
> >
> > The reason we back off is that it allows the device to recover by
> > temporarily reducing its workload. In addition, the lower queue depth
> > alleviates the risk of commands timing out and leading to application
> > I/O failures.
>
> The timeout risk may only arise when the driver/device doesn't return
> congestion feedback while the host queue depth is big enough.
>
> So far we don't see such an issue on NVMe, whose hw queue depth is 1023
> and whose hw queue count is often 32+, and we have not seen such timeout
> reports even with that many in-flight IOs (32 * 1023) on a single LUN.
>
> Also, megaraid_sas's queue depth is much less than 32 * 1023, so this
> seems very unlikely to happen.
>
> Megaraid guys, could you clarify whether this is an issue? Kashyap,
> Sumit and Shivasharan?

Hi Ming, Martin,

The megaraid_sas driver does not enable .track_queue_depth, so megaraid_sas adapters have never used the QUEUE FULL interface of the Linux SCSI layer. Most of the QUEUE FULL handling is managed by the MegaRAID controller firmware, which also manages reducing drive-level queue depth (ramp up/down).

mpt3sas adapters support QUEUE FULL based on the IOCCapabilities of the firmware. In the default configuration the firmware manages QUEUE FULL. This is not the same as Linux SCSI level handling; it is a delayed retry in firmware. This means we should not expect IO timeouts on QUEUE FULL from the device, since the firmware handles it as a delayed retry.

Users can disable firmware handling of the QUEUE FULL condition (through customized firmware) and allow QUEUE FULL to be returned to the SCSI layer. This feature is called MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING. So for the mpt3sas driver we may use the QUEUE FULL handling of the OS. We can opt to enable no_device_queue_for_ssd for the mpt3sas driver only if the firmware does not expose MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING.

Thanks,
Sumit

> > > At least, the Broadcom guys tested this patch on megaraid and the
> > > results show a big improvement; that is why the flag is only set on
> > > the megaraid host.
> >
> > I do not question that it improves performance. That's not my point.
> >
> > > In theory, .track_queue_depth may only improve sequential IO
> > > performance for HDDs; it is not very effective for SSDs, beyond
> > > saving a few CPU cycles.
> >
> > This is not about performance. This is about how the system behaves
> > when a device is starved for resources or experiencing transient
> > failures.
>
> Could you explain a bit how this patch changes the system behavior? I
> understand the EH just retries the incomplete requests, whose total
> number is less than the host queue depth.
>
> Thanks,
> Ming
Hi Martin,

>> Please read what I proposed. megaraid VDs don't report QUEUE_FULL so
>> you would not trigger the device_busy counter.

Yes, megaraid does not set track_queue_depth and never reports QUEUE_FULL.

Instead of relying on QUEUE_FULL and some complex heuristics of when to start tracking device_busy, why can't we simply use track_queue_depth (along with the other flag that Ming added) to decide which devices need queue depth tracking, and track device_busy only for them?

>> Something like not maintaining device_busy until we actually get a
>> QUEUE_FULL condition. And then rely on the existing queue depth ramp up
>> heuristics to determine when to disable the busy counter again.

I am not sure how we can suddenly start tracking device_busy on the fly if we do not know how many IOs are already pending for that device. Won't device_busy have a high chance of going negative (or at least holding a wrong value) when pending IOs start completing?

Thanks,
Sumanesh

-----Original Message-----
From: Martin K. Petersen <martin.petersen@oracle.com>
Sent: Thursday, January 23, 2020 6:59 PM
To: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: linux-scsi@vger.kernel.org; Martin K. Petersen <martin.petersen@oracle.com>
Subject: Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs

Sumanesh,

> The high-end controllers might expose a SCSI interface, but can have
> all kind of devices (NVMe/SCSI/SATA) behind it, and has its own
> capability to queue IO and feed to devices as needed. Those devices
> should not be penalized with the overhead of the device_busy counter,
> just because they chose to expose themselves as SCSI devices (for
> historical and backward compatibility reasons).

Please read what I proposed. megaraid VDs don't report QUEUE_FULL so you would not trigger the device_busy counter.
Sumit,

> mpt3sas adapters support QUEUE FULL based on the IOCCapabilities of the
> firmware. In the default configuration the firmware manages QUEUE FULL.

Regrettably...
Sumanesh,

> Instead of relying on QUEUE_FULL and some complex heuristics of when
> to start tracking device_busy, why can't we simply use
> track_queue_depth (along with the other flag that Ming added) to
> decide which devices need queue depth tracking, and track device_busy
> only for them?

Because I am interested in addressing the device_busy contention problem for all of our non-legacy drivers, i.e. not just for controllers that happen to queue internally.

> I am not sure how we can suddenly start tracking device_busy on the fly
> if we do not know how many IOs are already pending for that device.

We know that from the tags. It's just not hot path material.
Hi Martin,

On Tue, Jan 28, 2020 at 12:24 PM Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
> Sumanesh,
>
> > Instead of relying on QUEUE_FULL and some complex heuristics of when
> > to start tracking device_busy, why can't we simply use
> > track_queue_depth (along with the other flag that Ming added) to
> > decide which devices need queue depth tracking, and track device_busy
> > only for them?
>
> Because I am interested in addressing the device_busy contention problem
> for all of our non-legacy drivers, i.e. not just for controllers that
> happen to queue internally.

Can we just do it for controllers without 'track_queue_depth' and with SSDs for now?

> > I am not sure how we can suddenly start tracking device_busy on the
> > fly if we do not know how many IOs are already pending for that
> > device.
>
> We know that from the tags. It's just not hot path material.

In the case of 'track_queue_depth', the cost of tracking queue depth has to be paid, and it can be too big to reach the expected performance on high end HBAs. An sbitmap might be used for this purpose, but I am not sure it can scale well enough.

Thanks,
Ming Lei
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..0903697dd843 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -344,6 +344,16 @@ static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 	rcu_read_unlock();
 }
 
+static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+
+	if (!shost->hostt->no_device_queue_for_ssd)
+		return false;
+
+	return blk_queue_nonrot(sdev->request_queue);
+}
+
 void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -354,7 +364,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +421,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!scsi_bypass_device_busy(sdev) &&
+	    atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1295,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = scsi_bypass_device_busy(sdev);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1314,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1644,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1706,7 +1727,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!scsi_bypass_device_busy(sdev) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7a97fb8104cf..8e80edfd5a6d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -426,6 +426,9 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* True if the controller needn't maintain a device queue for SSDs */
+	unsigned no_device_queue_for_ssd:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
The SCSI core uses the atomic variable sdev->device_busy to track in-flight IO requests dispatched to a SCSI device. IO requests may be submitted from any CPU, so the cost of maintaining the shared atomic counter can be very big on big NUMA machines with lots of CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO merging; 2) fairly scattering IO requests among all LUNs.

blk-mq already provides fair request allocation among all active shared request queues (LUNs), see hctx_may_queue().

NVMe doesn't have such a per-request-queue (namespace) queue depth, so it is reasonable to ignore the limit for SCSI SSDs too. Also, IO merging won't play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests via an atomic variable can be saved for some high end controllers with SSDs.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c  | 34 ++++++++++++++++++++++++++++------
 include/scsi/scsi_host.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)