[5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs

Message ID 20200119071432.18558-6-ming.lei@redhat.com
State New
Series
  • scsi: support bypass device busy check for some high end HBA with SSD

Commit Message

Ming Lei Jan. 19, 2020, 7:14 a.m. UTC
The SCSI core uses the atomic variable sdev->device_busy to track
in-flight IO requests dispatched to each SCSI device. IO requests may be
submitted from any CPU, so the cost of maintaining this shared atomic
counter can be very high on a large NUMA machine with many CPU cores.

sdev->queue_depth usually serves two purposes: 1) improving IO merging;
2) distributing IO requests fairly among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs); see hctx_may_queue().
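
As a rough model of that fairness check (a simplified user-space sketch,
not the kernel's exact hctx_may_queue() code — the real version also
enforces a minimum per-queue share and rounds differently):

```c
#include <assert.h>

/* Simplified model of blk-mq's hctx_may_queue() fairness check: each
 * active shared queue (LUN) may use at most its share of the host-wide
 * tag depth. Names and rounding here are illustrative only. */
static int may_queue(unsigned int total_depth,
		     unsigned int active_queues,
		     unsigned int queue_in_flight)
{
	unsigned int share;

	if (active_queues <= 1)
		return 1;		/* sole user: no per-queue cap */
	/* Round up so every active queue gets at least one tag. */
	share = (total_depth + active_queues - 1) / active_queues;
	return queue_in_flight < share;
}
```

With a 128-tag host and four active LUNs, each LUN's share works out to
32 tags in this model.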

NVMe doesn't have such a per-request-queue (namespace) queue depth, so
it is reasonable to ignore the limit for SCSI SSDs too. IO merging also
doesn't play a big role in reaching top SSD performance.

With this patch, the significant cost of tracking in-flight per-LUN
requests via an atomic variable can be avoided on some high-end
controllers with SSDs.
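
The cost being avoided is the classic shared-counter cache-line bounce;
a minimal user-space illustration (hypothetical demo code using C11
atomics and pthreads, not kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* User-space illustration of the shared-counter pattern this patch
 * avoids: every submission path increments one atomic, so the cache
 * line bounces between the CPUs submitting IO. */
#define NTHREADS 4
#define NITERS   100000

static atomic_int device_busy;		/* stand-in for sdev->device_busy */

static void *submit_io(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++) {
		atomic_fetch_add(&device_busy, 1);	/* dispatch */
		atomic_fetch_sub(&device_busy, 1);	/* completion */
	}
	return NULL;
}

static int run_demo(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, submit_io, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	/* Increments and decrements balance, so the counter ends at 0;
	 * the expense is in the contended atomic ops, not the result. */
	return atomic_load(&device_busy);
}
```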

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c  | 34 ++++++++++++++++++++++++++++------
 include/scsi/scsi_host.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)

Comments

Bart Van Assche Jan. 19, 2020, 8:58 p.m. UTC | #1
On 2020-01-18 23:14, Ming Lei wrote:
> +static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
> +{
> +	struct Scsi_Host *shost = sdev->host;
> +
> +	if (!shost->hostt->no_device_queue_for_ssd)
> +		return false;
> +
> +	return blk_queue_nonrot(sdev->request_queue);
> +}

In other words, sdev->device_busy is maintained for all SCSI devices
except for those SSDs controlled by a SCSI LLD driver that has
no_device_queue_for_ssd set in its host template. I'd like to see
different behavior, namely that sdev->device_busy is not maintained for
any SSD except if that SSD really needs the sdev->device_busy counter.
The blacklist mechanism may be more appropriate to mark such SSDs than
the SCSI host template.

What I also noticed is that most scsi_bypass_device_busy() calls have an
exclamation mark (!) in front of these calls. I think that inverting the
return value and renaming this function into e.g.
scsi_maintain_device_busy() would result in code that is easier to read.
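
Bart's suggested inversion might look roughly like this (a hypothetical
sketch; stub user-space types stand in for the kernel's struct
scsi_device, Scsi_Host, and request_queue):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub types modeling the fields the helper touches. */
struct host_template { unsigned no_device_queue_for_ssd:1; };
struct scsi_host { struct host_template *hostt; };
struct request_queue { bool nonrot; };
struct scsi_device {
	struct scsi_host *host;
	struct request_queue *request_queue;
};

/* Inverted form of scsi_bypass_device_busy(): returns true when
 * sdev->device_busy must be maintained, so callers read positively
 * instead of negating the result. */
static inline bool scsi_maintain_device_busy(struct scsi_device *sdev)
{
	if (!sdev->host->hostt->no_device_queue_for_ssd)
		return true;			/* host never opts out */
	return !sdev->request_queue->nonrot;	/* rotational: keep counter */
}
```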

Thanks,

Bart.
Martin K. Petersen Jan. 21, 2020, 4:52 a.m. UTC | #2
Ming,

> NVMe doesn't have such a per-request-queue (namespace) queue depth, so
> it is reasonable to ignore the limit for SCSI SSDs too.

It is really not. A free host tag does not guarantee that the target
device can queue the command.

The assumption that SSDs are somehow special because they are "fast" is
not valid. Given the common hardware queue depth for a SAS device of
~128 it is often trivial to drive a device into a congestion
scenario. We see it all the time for non-rotational devices, SSDs and
arrays alike. The SSD heuristic is simply not going to fly.

Don't get me wrong, I am very sympathetic to obliterating device_busy in
the hot path. I just don't think it is as easy as just ignoring the
counter and hope for the best. Dynamic queue depth management is an
integral part of the SCSI protocol, not something we can just decide to
bypass because a device claims to be of a certain media type or speed.

I would prefer not to touch drivers that rely on cmd_per_lun / untagged
operation and focus exclusively on the ones that use .track_queue_depth.
For those we could consider an adaptive queue depth management scheme.
Something like not maintaining device_busy until we actually get a
QUEUE_FULL condition. And then rely on the existing queue depth ramp up
heuristics to determine when to disable the busy counter again. Maybe
with an additional watermark or time limit to avoid flip-flopping.

If that approach turns out to work, we should convert all remaining
non-legacy drivers to .track_queue_depth so we only have two driver
queuing flavors to worry about.
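
The adaptive scheme described above might be modeled as a small state
machine (purely illustrative of the proposal; the names, the grace
watermark, and the hook points are assumptions, not an implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of adaptive device_busy tracking: stay untracked until the
 * device reports QUEUE_FULL, then keep tracking until the existing
 * ramp-up heuristics restore full depth, with a grace watermark to
 * avoid flip-flopping. All names and thresholds are illustrative. */
struct adaptive_qd {
	bool track_busy;	/* is device_busy maintained right now? */
	unsigned int grace;	/* ramp-up events before untracking */
};

static void on_queue_full(struct adaptive_qd *a)
{
	a->track_busy = true;	/* device pushed back: start tracking */
	a->grace = 8;		/* arbitrary watermark against flapping */
}

static void on_ramp_up(struct adaptive_qd *a)
{
	if (a->track_busy && a->grace && --a->grace == 0)
		a->track_busy = false;	/* depth restored: stop tracking */
}
```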
Ming Lei Jan. 23, 2020, 2:54 a.m. UTC | #3
Hi Martin,

On Mon, Jan 20, 2020 at 11:52:06PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > NVMe doesn't have such a per-request-queue (namespace) queue depth, so
> > it is reasonable to ignore the limit for SCSI SSDs too.
> 
> It is really not. A free host tag does not guarantee that the target
> device can queue the command.

Yeah, I agree.

> 
> The assumption that SSDs are somehow special because they are "fast" is
> not valid. Given the common hardware queue depth for a SAS device of
> ~128 it is often trivial to drive a device into a congestion
> scenario. We see it all the time for non-rotational devices, SSDs and
> arrays alike. The SSD heuristic is simply not going to fly.

In reality, the number of channels in an SSD is very limited, perhaps
dozens in an enterprise-grade SSD, so the device itself can be
saturated by a limited number of in-flight IOs.

However, it depends on whether the target device reports congestion
back to the host. From my observation, there doesn't appear to be such
feedback from NVMe targets.

If an SSD target doesn't provide that kind of congestion feedback,
tracking in-flight per-LUN IO via .device_busy doesn't make any
difference.

Even if there were an SSD target that provided such congestion
feedback, bypassing .device_busy wouldn't have a big effect either,
since blk-mq's SCHED_RESTART retries an IO that returned STS_RESOURCE
only after another in-flight request completes.

> 
> Don't get me wrong, I am very sympathetic to obliterating device_busy in
> the hot path. I just don't think it is as easy as just ignoring the
> counter and hope for the best. Dynamic queue depth management is an
> integral part of the SCSI protocol, not something we can just decide to
> bypass because a device claims to be of a certain media type or speed.

At least, Broadcom engineers tested this patch on megaraid RAID and the
results showed a big improvement; that is why the flag is only set for
the megaraid host.

> 
> I would prefer not to touch drivers that rely on cmd_per_lun / untagged
> operation and focus exclusively on the ones that use .track_queue_depth.
> For those we could consider an adaptive queue depth management scheme.
> Something like not maintaining device_busy until we actually get a
> QUEUE_FULL condition. And then rely on the existing queue depth ramp up
> heuristics to determine when to disable the busy counter again. Maybe
> with an additional watermark or time limit to avoid flip-flopping.
> 
> If that approach turns out to work, we should convert all remaining
> non-legacy drivers to .track_queue_depth so we only have two driver
> queuing flavors to worry about.

In theory, .track_queue_depth may only improve sequential IO
performance on HDDs; it is not very effective for SSDs, or at best
saves a few CPU cycles in the SSD case.

I will study the related drivers/devices a bit more with respect to
.track_queue_depth.


Thanks,
Ming
Martin K. Petersen Jan. 24, 2020, 1:21 a.m. UTC | #4
Ming,

> However, it depends on whether the target device reports congestion
> back to the host. From my observation, there doesn't appear to be such
> feedback from NVMe targets.

It happens all the time with SCSI devices. It is imperative that this
keeps working.

> Even if there were an SSD target that provided such congestion
> feedback, bypassing .device_busy wouldn't have a big effect either,
> since blk-mq's SCHED_RESTART retries an IO that returned STS_RESOURCE
> only after another in-flight request completes.

The reason we back off is that it allows the device to recover by
temporarily reducing its workload. In addition, the lower queue depth
alleviates the risk of commands timing out leading to application I/O
failures.

> At least, Broadcom engineers tested this patch on megaraid RAID and
> the results showed a big improvement; that is why the flag is only set
> for the megaraid host.

I do not question that it improves performance. That's not my point.

> In theory, .track_queue_depth may only improve sequential IO
> performance on HDDs; it is not very effective for SSDs, or at best
> saves a few CPU cycles in the SSD case.

This is not about performance. This is about how the system behaves when
a device is starved for resources or experiencing transient failures.
Ming Lei Jan. 24, 2020, 1:59 a.m. UTC | #5
Hi Martin,

On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > However, it depends on whether the target device reports congestion
> > back to the host. From my observation, there doesn't appear to be
> > such feedback from NVMe targets.
> 
> It happens all the time with SCSI devices. It is imperative that this
> keeps working.
> 
> > Even if there were an SSD target that provided such congestion
> > feedback, bypassing .device_busy wouldn't have a big effect either,
> > since blk-mq's SCHED_RESTART retries an IO that returned STS_RESOURCE
> > only after another in-flight request completes.
> 
> The reason we back off is that it allows the device to recover by
> temporarily reducing its workload. In addition, the lower queue depth
> alleviates the risk of commands timing out leading to application I/O
> failures.

The timeout risk should only arise when the driver/device doesn't
return congestion feedback and, at the same time, the host queue depth
is big enough.

So far we haven't seen such an issue on NVMe, whose hw queue depth is
1023 and whose hw queue count is often 32+, and we haven't seen such
timeout reports even with that many in-flight IOs (32 * 1023) on a
single LUN.

Also, megaraid_sas's queue depth is much less than 32 * 1023, so this
seems very unlikely to happen.

Megaraid folks, could you clarify whether this is an issue? Kashyap,
Sumit and Shivasharan?

> 
> > At least, Broadcom engineers tested this patch on megaraid RAID and
> > the results showed a big improvement; that is why the flag is only
> > set for the megaraid host.
> 
> I do not question that it improves performance. That's not my point.
> 
> > In theory, .track_queue_depth may only improve sequential IO
> > performance on HDDs; it is not very effective for SSDs, or at best
> > saves a few CPU cycles in the SSD case.
> 
> This is not about performance. This is about how the system behaves when
> a device is starved for resources or experiencing transient failures.

Could you explain a bit how this patch changes the system behavior? My
understanding is that EH just retries the incomplete requests, whose
total number is less than the host queue depth.


Thanks,
Ming
Sumit Saxena Jan. 24, 2020, 12:43 p.m. UTC | #6
On Fri, Jan 24, 2020 at 7:30 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Martin,
>
> On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> >
> > Ming,
> >
> > > However, it depends on whether the target device reports congestion
> > > back to the host. From my observation, there doesn't appear to be
> > > such feedback from NVMe targets.
> >
> > It happens all the time with SCSI devices. It is imperative that this
> > keeps working.
> >
> > > Even if there were an SSD target that provided such congestion
> > > feedback, bypassing .device_busy wouldn't have a big effect either,
> > > since blk-mq's SCHED_RESTART retries an IO that returned STS_RESOURCE
> > > only after another in-flight request completes.
> >
> > The reason we back off is that it allows the device to recover by
> > temporarily reducing its workload. In addition, the lower queue depth
> > alleviates the risk of commands timing out leading to application I/O
> > failures.
>
> The timeout risk should only arise when the driver/device doesn't
> return congestion feedback and, at the same time, the host queue depth
> is big enough.
>
> So far we haven't seen such an issue on NVMe, whose hw queue depth is
> 1023 and whose hw queue count is often 32+, and we haven't seen such
> timeout reports even with that many in-flight IOs (32 * 1023) on a
> single LUN.
>
> Also, megaraid_sas's queue depth is much less than 32 * 1023, so this
> seems very unlikely to happen.
>
> Megaraid folks, could you clarify whether this is an issue? Kashyap,
> Sumit and Shivasharan?

Hi Ming, Martin,

The megaraid_sas driver does not enable .track_queue_depth, so
megaraid_sas adapters have never used the QUEUE FULL interface of the
Linux SCSI layer. Most QUEUE FULL handling is managed by the MegaRAID
controller firmware, which also manages reducing drive-level QD (like
ramp up/down).

"mpt3sas" adapters support QUEUE FULL based on the IOCCapabilities of
the firmware. In the default configuration the firmware manages QUEUE
FULL. This is not the same as Linux SCSI-level handling; it is a
delayed retry in firmware. That means we should not expect IO timeouts
in case of QUEUE FULL from the device, since the firmware handles it as
a delayed retry. The user can disable firmware handling of the QUEUE
FULL condition (through customized firmware) and let QUEUE FULL return
back to the SCSI layer. This feature is called
MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING. So for the mpt3sas
driver, we may use the OS's QUEUE FULL handling. We can opt to enable
no_device_queue_for_ssd for the mpt3sas driver only if the FW does not
expose MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING.

Thanks,
Sumit

>
> >
> > > At least, Broadcom engineers tested this patch on megaraid RAID
> > > and the results showed a big improvement; that is why the flag is
> > > only set for the megaraid host.
> >
> > I do not question that it improves performance. That's not my point.
> >
> > > In theory, .track_queue_depth may only improve sequential IO
> > > performance on HDDs; it is not very effective for SSDs, or at best
> > > saves a few CPU cycles in the SSD case.
> >
> > This is not about performance. This is about how the system behaves when
> > a device is starved for resources or experiencing transient failures.
>
> Could you explain a bit how this patch changes the system behavior? My
> understanding is that EH just retries the incomplete requests, whose
> total number is less than the host queue depth.
>
>
> Thanks,
> Ming
>
Sumanesh Samanta Jan. 24, 2020, 7:41 p.m. UTC | #7
Hi Martin,

>> Please read what I proposed. megaraid VDs don't report QUEUE_FULL so
>> you would not trigger the device_busy counter.

Yes, megaraid does not set track_queue_depth and never reports
QUEUE_FULL. Instead of relying on QUEUE_FULL and some complex heuristic
of when to start tracking device_busy, why can't we simply use
"track_queue_depth" (along with the other flag that Ming added) to decide
which devices need queue depth tracking, and track device_busy only for
them?

>> Something like not maintaining device_busy until we actually get a
>> QUEUE_FULL condition. And then rely on the existing queue depth ramp up
>> heuristics to determine when to disable the busy counter again.

I am not sure how we can suddenly start tracking device_busy on the fly
if we do not know how many IOs are already pending for that device.
Won't device_busy have a high chance of going negative (or at least
holding a wrong value) when those pending IOs start completing?

Thanks,
Sumanesh


-----Original Message-----
From: Martin K. Petersen <martin.petersen@oracle.com>
Sent: Thursday, January 23, 2020 6:59 PM
To: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: linux-scsi@vger.kernel.org; Martin K. Petersen
<martin.petersen@oracle.com>
Subject: Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for
SSD when HBA needs


Sumanesh,

> High-end controllers might expose a SCSI interface but can have all
> kinds of devices (NVMe/SCSI/SATA) behind it, and they have their own
> capability to queue IO and feed it to devices as needed. Those devices
> should not be penalized with the overhead of the device_busy counter
> just because they chose to expose themselves as SCSI devices (for
> historical and backward-compatibility reasons).

Please read what I proposed. megaraid VDs don't report QUEUE_FULL so you
would not trigger the device_busy counter.
Martin K. Petersen Jan. 28, 2020, 4:04 a.m. UTC | #8
Sumit,

> "mpt3sas" adapters support QUEUE FULL based on the IOCCapabilities of
> the firmware. In the default configuration the firmware manages QUEUE FULL.

Regrettably...
Martin K. Petersen Jan. 28, 2020, 4:22 a.m. UTC | #9
Sumanesh,

> Instead of relying on QUEUE_FULL and some complex heuristic of when
> to start tracking device_busy, why can't we simply use
> "track_queue_depth" (along with the other flag that Ming added) to
> decide which devices need queue depth tracking, and track device_busy
> only for them?

Because I am interested in addressing the device_busy contention problem
for all of our non-legacy drivers. I.e. not just for controllers that
happen to queue internally.

> I am not sure how we can suddenly start tracking device_busy on the fly
> if we do not know how many IOs are already pending for that device.

We know that from the tags. It's just not hot path material.
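Reconstructing the count from tag state off the hot path could be
sketched like this (a hypothetical user-space model with a plain array
standing in for the blk-mq tag set; in the kernel the walk would
presumably go through something like blk_mq_tagset_busy_iter()):

```c
#include <assert.h>

/* Off-hot-path reconstruction of a device's in-flight count from tag
 * state, as suggested above. Each entry models one host tag and, if
 * in use, records which LUN owns the in-flight command. */
struct tag_entry {
	int in_use;
	int lun;	/* owner of the in-flight command */
};

static unsigned int count_busy_for_lun(const struct tag_entry *tags,
				       unsigned int depth, int lun)
{
	unsigned int busy = 0;

	/* A linear walk over the whole tag space: too slow for the
	 * submission path, fine as a one-off when tracking is enabled. */
	for (unsigned int i = 0; i < depth; i++)
		if (tags[i].in_use && tags[i].lun == lun)
			busy++;
	return busy;
}
```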
Ming Lei Jan. 31, 2020, 11:39 a.m. UTC | #10
Hi Martin,

On Tue, Jan 28, 2020 at 12:24 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Sumanesh,
>
> > Instead of relying on QUEUE_FULL and some complex heuristic of when
> > to start tracking device_busy, why can't we simply use
> > "track_queue_depth" (along with the other flag that Ming added) to
> > decide which devices need queue depth tracking, and track device_busy
> > only for them?
>
> Because I am interested in addressing the device_busy contention problem
> for all of our non-legacy drivers. I.e. not just for controllers that
> happen to queue internally.

Can we just do it now for controllers without 'track_queue_depth'
combined with SSDs?

>
> > I am not sure how we can suddenly start tracking device_busy on the fly
> > if we do not know how many IOs are already pending for that device.
>
> We know that from the tags. It's just not hot path material.

With 'track_queue_depth', the cost of tracking queue depth has to be
paid, which can be too high to reach the expected performance on a
high-end HBA.

sbitmap might be used here, but I am not sure it can scale well enough
for this purpose.


Thanks,
Ming Lei

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..0903697dd843 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -344,6 +344,16 @@  static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 	rcu_read_unlock();
 }
 
+static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+
+	if (!shost->hostt->no_device_queue_for_ssd)
+		return false;
+
+	return blk_queue_nonrot(sdev->request_queue);
+}
+
 void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -354,7 +364,8 @@  void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +421,8 @@  static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!scsi_bypass_device_busy(sdev) &&
+			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1295,12 @@  static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = scsi_bypass_device_busy(sdev);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1314,16 @@  static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1644,8 @@  static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1706,7 +1727,8 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!scsi_bypass_device_busy(sdev) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7a97fb8104cf..8e80edfd5a6d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -426,6 +426,9 @@  struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* True if the controller need not maintain a per-device queue for SSDs */
+	unsigned no_device_queue_for_ssd:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */