[v3,02/18] block: introduce BLK_STS_DURATION_LIMIT

Message ID 20230124190308.127318-3-niklas.cassel@wdc.com (mailing list archive)
State Changes Requested
Series Add Command Duration Limits support

Commit Message

Niklas Cassel Jan. 24, 2023, 7:02 p.m. UTC
From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
report commands that failed due to a command duration limit being
exceeded. This new status is mapped to the ETIME error code to allow
users to differentiate "soft" duration limit failures from other more
serious hardware-related errors.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c          | 3 +++
 include/linux/blk_types.h | 6 ++++++
 2 files changed, 9 insertions(+)

Comments

Bart Van Assche Jan. 24, 2023, 7:29 p.m. UTC | #1
On 1/24/23 11:02, Niklas Cassel wrote:
> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
> report command that failed due to a command duration limit being
> exceeded. This new status is mapped to the ETIME error code to allow
> users to differentiate "soft" duration limit failures from other more
> serious hardware related errors.

What makes exceeding the duration limit different from an I/O timeout 
(BLK_STS_TIMEOUT)? Why is it important to tell the difference between an 
I/O timeout and exceeding the command duration limit?

Thanks,

Bart.
Keith Busch Jan. 24, 2023, 7:59 p.m. UTC | #2
On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
> > Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
> > report command that failed due to a command duration limit being
> > exceeded. This new status is mapped to the ETIME error code to allow
> > users to differentiate "soft" duration limit failures from other more
> > serious hardware related errors.
> 
> What makes exceeding the duration limit different from an I/O timeout
> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
> timeout and exceeding the command duration limit?

BLK_STS_TIMEOUT should be used if the target device doesn't provide any
response to the command. The DURATION_LIMIT status is used when the device
completes a command with that status.
Bart Van Assche Jan. 24, 2023, 8:32 p.m. UTC | #3
On 1/24/23 11:59, Keith Busch wrote:
> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>> On 1/24/23 11:02, Niklas Cassel wrote:
>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>> report command that failed due to a command duration limit being
>>> exceeded. This new status is mapped to the ETIME error code to allow
>>> users to differentiate "soft" duration limit failures from other more
>>> serious hardware related errors.
>>
>> What makes exceeding the duration limit different from an I/O timeout
>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>> timeout and exceeding the command duration limit?
> 
> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
> response to the command. The DURATION_LIMIT status is used when the device
> completes a command with that status.

Hi Keith,

From SPC-6: "The MAX ACTIVE TIME field specifies an upper limit on the
time that elapses from the time at which the device server initiates 
actions to access, transfer, or act upon the specified data until the 
time the device server returns status for the command."

My interpretation of the above text is that the SCSI command duration 
limit specifies a hard limit, the same type of limit reported by the 
status code BLK_STS_TIMEOUT. It is not clear to me from the patch 
description why a new status code is needed for reporting that the 
command duration limit has been exceeded.

Thanks,

Bart.
Damien Le Moal Jan. 24, 2023, 9:34 p.m. UTC | #4
On 1/25/23 04:29, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>> report command that failed due to a command duration limit being
>> exceeded. This new status is mapped to the ETIME error code to allow
>> users to differentiate "soft" duration limit failures from other more
>> serious hardware related errors.
> 
> What makes exceeding the duration limit different from an I/O timeout 
> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an 
> I/O timeout and exceeding the command duration limit?

If the device fails to execute a command in time, it will either:
1) Fail the command with an error and sense data set (policy 0xf for the
time limit)
2) Return a success status for the command with sense data set telling the
host "data not available". This (weird) case is in essence equivalent to
(1) but was defined to avoid the penalty of a queue abort with SATA drives
(NCQ command errors always result in all on-going commands being aborted).

In both cases, the drive is still responsive and operational.
BLK_STS_TIMEOUT is used if a command timed out, indicating that the drive
is *not* responding. BLK_STS_TIMEOUT thus generally means "something is
wrong" (not always, but most of the time).

So we certainly do not want to overload BLK_STS_TIMEOUT to indicate failed
CDL IOs, as that would not allow the user to distinguish them from more
serious hardware issues.
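
To illustrate the distinction from the user's side, here is a minimal
userspace sketch, assuming the mapping proposed in this patch (ETIME for a
duration limit failure, ETIMEDOUT for BLK_STS_TIMEOUT). The enum and the
function name classify_io_errno() are hypothetical and only serve the
example; they are not part of the series.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification used only for this illustration. */
enum io_failure {
	IO_FAIL_DURATION_LIMIT,	/* ETIME: drive responded, limit exceeded */
	IO_FAIL_TIMEOUT,	/* ETIMEDOUT: drive did not respond */
	IO_FAIL_OTHER,		/* any other error, e.g. EIO */
};

/*
 * Classify the errno of a failed read/write. Assumes the block layer
 * maps duration limit failures to ETIME, as proposed in this patch.
 */
static enum io_failure classify_io_errno(int err)
{
	assert(err > 0);	/* expects a positive errno value */

	switch (err) {
	case ETIME:
		return IO_FAIL_DURATION_LIMIT;
	case ETIMEDOUT:
		return IO_FAIL_TIMEOUT;
	default:
		return IO_FAIL_OTHER;
	}
}
```

An application could, for example, immediately retry an
IO_FAIL_DURATION_LIMIT failure on another replica, while treating
IO_FAIL_TIMEOUT as a sign that the drive itself may need attention.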
Damien Le Moal Jan. 24, 2023, 9:36 p.m. UTC | #5
On 1/25/23 04:59, Keith Busch wrote:
> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>> On 1/24/23 11:02, Niklas Cassel wrote:
>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>> report command that failed due to a command duration limit being
>>> exceeded. This new status is mapped to the ETIME error code to allow
>>> users to differentiate "soft" duration limit failures from other more
>>> serious hardware related errors.
>>
>> What makes exceeding the duration limit different from an I/O timeout
>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>> timeout and exceeding the command duration limit?
> 
> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
> response to the command. The DURATION_LIMIT status is used when the device
> completes a command with that status.

Yes, exactly :)
Damien Le Moal Jan. 24, 2023, 9:39 p.m. UTC | #6
On 1/25/23 05:32, Bart Van Assche wrote:
> On 1/24/23 11:59, Keith Busch wrote:
>> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>>> On 1/24/23 11:02, Niklas Cassel wrote:
>>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>>> report command that failed due to a command duration limit being
>>>> exceeded. This new status is mapped to the ETIME error code to allow
>>>> users to differentiate "soft" duration limit failures from other more
>>>> serious hardware related errors.
>>>
>>> What makes exceeding the duration limit different from an I/O timeout
>>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>>> timeout and exceeding the command duration limit?
>>
>> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
>> response to the command. The DURATION_LIMIT status is used when the device
>> completes a command with that status.
> 
> Hi Keith,
> 
>  From SPC-6: "The MAX ACTIVE TIME field specifies an upper limit on the 
> time that elapses from the time at which the device server initiates 
> actions to access, transfer, or act upon the specified data until the 
> time the device server returns status for the command."
> 
> My interpretation of the above text is that the SCSI command duration 
> limit specifies a hard limit, the same type of limit reported by the 
> status code BLK_STS_TIMEOUT. It is not clear to me from the patch 
> description why a new status code is needed for reporting that the 
> command duration limit has been exceeded.

As explained, this allows differentiating the "drive gave a response"
(BLK_STS_DURATION_LIMIT) from the "drive is not responding" case with
BLK_STS_TIMEOUT. We took care of mapping BLK_STS_DURATION_LIMIT to ETIME
(timer expired) for user space too, to not overload ETIMEDOUT used with
BLK_STS_TIMEOUT.

We can certainly improve the commit message to describe all of this in
more detail.
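
As a rough illustration of the mapping, the following is a simplified
userspace model of a status-to-errno table. It is a sketch only, not the
kernel code: the toy_* names and the enum values are made up for the
example, and only the ETIME/ETIMEDOUT pairing reflects what is described
above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* The whole point of the separate status: the errno values must differ. */
static_assert(ETIME != ETIMEDOUT, "ETIME and ETIMEDOUT must be distinct");

/* Hypothetical status codes; values do not match the kernel's. */
enum toy_status {
	TOY_STS_OK,
	TOY_STS_TIMEOUT,	/* drive is not responding */
	TOY_STS_DURATION_LIMIT,	/* drive gave a response: limit exceeded */
	TOY_STS_IOERR,
	TOY_STS_COUNT,
};

static const struct {
	int err;
	const char *name;
} toy_errors[TOY_STS_COUNT] = {
	[TOY_STS_OK]		 = { 0,		 "" },
	[TOY_STS_TIMEOUT]	 = { -ETIMEDOUT, "timeout" },
	[TOY_STS_DURATION_LIMIT] = { -ETIME,	 "duration limit exceeded" },
	[TOY_STS_IOERR]		 = { -EIO,	 "I/O" },
};

/* Translate a status to a negative errno; unknown values become -EIO. */
static int toy_status_to_errno(enum toy_status sts)
{
	if ((size_t)sts >= TOY_STS_COUNT)
		return -EIO;
	return toy_errors[sts].err;
}
```

With such a table, the two "time" related failures reach user space as
different error codes, so neither overloads the other.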

> 
> Thanks,
> 
> Bart.
Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 46d12b3344c9..9ca31b779fc1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -170,6 +170,9 @@  static const struct {
 	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
 	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
 
+	/* Command duration limit device-side timeout */
+	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
+
 	/* everything else not covered above: */
 	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
 };
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..cde997590765 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -166,6 +166,12 @@  typedef u16 blk_short_t;
  */
 #define BLK_STS_OFFLINE		((__force blk_status_t)17)
 
+/*
+ * BLK_STS_DURATION_LIMIT is returned from the driver when the target device
+ * aborted the command because it exceeded one of its Command Duration Limits.
+ */
+#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)18)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with