diff mbox series

block: cleanup and fix batch completion adding conditions

Message ID 20575f0a-656e-4bb3-9d82-dec6c7e3a35c@kernel.dk (mailing list archive)
State New
Headers show
Series block: cleanup and fix batch completion adding conditions | expand

Commit Message

Jens Axboe Feb. 13, 2025, 3:18 p.m. UTC
The conditions for whether or not a request is allowed adding to a
completion batch are a bit hard to read, and they also have a few
issues. One is that ioerror may indeed be a random value on passthrough,
and it's being checked unconditionally of whether or not the given
request is a passthrough request or not.

Rewrite the conditions to be separate for easier reading, and only check
ioerror for non-passthrough requests. This fixes an issue with bio
unmapping on passthrough, where it fails getting added to a batch. This
both leads to suboptimal performance, and may trigger a potential
schedule-under-atomic condition for polled passthrough IO.

Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Comments

Shinichiro Kawasaki Feb. 19, 2025, 7:26 a.m. UTC | #1
CC+: Alan,

On Feb 13, 2025 / 08:18, Jens Axboe wrote:
> The conditions for whether or not a request is allowed adding to a
> completion batch are a bit hard to read, and they also have a few
> issues. One is that ioerror may indeed be a random value on passthrough,
> and it's being checked unconditionally of whether or not the given
> request is a passthrough request or not.
> 
> Rewrite the conditions to be separate for easier reading, and only check
> ioerror for non-passthrough requests. This fixes an issue with bio
> unmapping on passthrough, where it fails getting added to a batch. This
> both leads to suboptimal performance, and may trigger a potential
> schedule-under-atomic condition for polled passthrough IO.
> 
> Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

I observed the blktests test case nvme/039 failure with v6.14-rc3 kernel. I
bisected and found that this patch in the v6.14-rc3 is the trigger.

The test run output is as follows:

nvme/039 => nvme0n1 (test error logging)                     [failed]
    runtime  5.378s  ...  5.354s
    --- tests/nvme/039.out      2024-09-20 11:20:26.405380875 +0900
    +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2025-02-19 16:13:05.061387179 +0900
    @@ -1,7 +1,3 @@
     Running nvme/039
    - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
    - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
    - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
      Identify(0x6), Access Denied (sct 0x2 / sc 0x86) DNR cdw10=0x1 cdw11=0x0 cdw12=0x0 cdw13=0x0 cdw14=0x0 cdw15=0x0
    - Read(0x2), Invalid Command Opcode (sct 0x0 / sc 0x1) DNR cdw10=0x0 cdw11=0x0 cdw12=0x1 cdw13=0x0 cdw14=0x0 cdw15=0x0
     Test complete


The test case does error injection. Test method requires reconsideration to
adjust to this kernel change, probably. Help for fix will be appreciated.
Alan Adamson Feb. 19, 2025, 7:17 p.m. UTC | #2
On 2/18/25 11:26 PM, Shinichiro Kawasaki wrote:
> CC+: Alan,
>
> On Feb 13, 2025 / 08:18, Jens Axboe wrote:
>> The conditions for whether or not a request is allowed adding to a
>> completion batch are a bit hard to read, and they also have a few
>> issues. One is that ioerror may indeed be a random value on passthrough,
>> and it's being checked unconditionally of whether or not the given
>> request is a passthrough request or not.
>>
>> Rewrite the conditions to be separate for easier reading, and only check
>> ioerror for non-passthrough requests. This fixes an issue with bio
>> unmapping on passthrough, where it fails getting added to a batch. This
>> both leads to suboptimal performance, and may trigger a potential
>> schedule-under-atomic condition for polled passthrough IO.
>>
>> Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()")
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> I observed the blktests test case nvme/039 failure with v6.14-rc3 kernel. I
> bisected and found that this patch in the v6.14-rc3 is the trigger.
>
> The test run output is as follows:
>
> nvme/039 => nvme0n1 (test error logging)                     [failed]
>      runtime  5.378s  ...  5.354s
>      --- tests/nvme/039.out      2024-09-20 11:20:26.405380875 +0900
>      +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2025-02-19 16:13:05.061387179 +0900
>      @@ -1,7 +1,3 @@
>       Running nvme/039
>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
>      - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>      - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>        Identify(0x6), Access Denied (sct 0x2 / sc 0x86) DNR cdw10=0x1 cdw11=0x0 cdw12=0x0 cdw13=0x0 cdw14=0x0 cdw15=0x0
>      - Read(0x2), Invalid Command Opcode (sct 0x0 / sc 0x1) DNR cdw10=0x0 cdw11=0x0 cdw12=0x1 cdw13=0x0 cdw14=0x0 cdw15=0x0
>       Test complete
>
>
> The test case does error injection. Test method requires reconsideration to
> adjust to this kernel change, probably. Help for fix will be appreciated.

Not really an issue with the test, rather the error injector is broken. 
I'll investigate.

Alan
Alan Adamson Feb. 20, 2025, 1:04 a.m. UTC | #3
On 2/19/25 11:17 AM, alan.adamson@oracle.com wrote:
>
> On 2/18/25 11:26 PM, Shinichiro Kawasaki wrote:
>> CC+: Alan,
>>
>> On Feb 13, 2025 / 08:18, Jens Axboe wrote:
>>> The conditions for whether or not a request is allowed adding to a
>>> completion batch are a bit hard to read, and they also have a few
>>> issues. One is that ioerror may indeed be a random value on 
>>> passthrough,
>>> and it's being checked unconditionally of whether or not the given
>>> request is a passthrough request or not.
>>>
>>> Rewrite the conditions to be separate for easier reading, and only 
>>> check
>>> ioerror for non-passthrough requests. This fixes an issue with bio
>>> unmapping on passthrough, where it fails getting added to a batch. This
>>> both leads to suboptimal performance, and may trigger a potential
>>> schedule-under-atomic condition for polled passthrough IO.
>>>
>>> Fixes: f794f3351f26 ("block: add support for 
>>> blk_mq_end_request_batch()")
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> I observed the blktests test case nvme/039 failure with v6.14-rc3 
>> kernel. I
>> bisected and found that this patch in the v6.14-rc3 is the trigger.
>>
>> The test run output is as follows:
>>
>> nvme/039 => nvme0n1 (test error logging) [failed]
>>      runtime  5.378s  ...  5.354s
>>      --- tests/nvme/039.out      2024-09-20 11:20:26.405380875 +0900
>>      +++ 
>> /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad 
>> 2025-02-19 16:13:05.061387179 +0900
>>      @@ -1,7 +1,3 @@
>>       Running nvme/039
>>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / 
>> sc 0x81) DNR
>>      - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>>      - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>>        Identify(0x6), Access Denied (sct 0x2 / sc 0x86) DNR cdw10=0x1 
>> cdw11=0x0 cdw12=0x0 cdw13=0x0 cdw14=0x0 cdw15=0x0
>>      - Read(0x2), Invalid Command Opcode (sct 0x0 / sc 0x1) DNR 
>> cdw10=0x0 cdw11=0x0 cdw12=0x1 cdw13=0x0 cdw14=0x0 cdw15=0x0
>>       Test complete
>>
>>
>> The test case does error injection. Test method requires 
>> reconsideration to
>> adjust to this kernel change, probably. Help for fix will be 
>> appreciated.
>
> Not really an issue with the test, rather the error injector is 
> broken. I'll investigate.

The following change allows the test to pass.

- The check of (ioerror < 0) should be (ioerror != 0)

- passthroughs can also have ioerror set, so false needs to be returned. 
This needs to be resolved.


diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fa2a76cc2f73..b2bd2ac40441 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -865,17 +865,17 @@ static inline bool blk_mq_add_to_batch(struct 
request *req,
          * 1) No batch container
          * 2) Has scheduler data attached
          * 3) Not a passthrough request and end_io set
-        * 4) Not a passthrough request and an ioerror
+        * 4) An ioerror
          */
         if (!iob)
                 return false;
         if (req->rq_flags & RQF_SCHED_TAGS)
                 return false;
+       if (ioerror)
+               return false;
         if (!blk_rq_is_passthrough(req)) {
                 if (req->end_io)
                         return false;
-               if (ioerror < 0)
-                       return false;
         }

Alan
Shinichiro Kawasaki Feb. 20, 2025, 7:28 a.m. UTC | #4
On Feb 19, 2025 / 17:04, alan.adamson@oracle.com wrote:
> 
> On 2/19/25 11:17 AM, alan.adamson@oracle.com wrote:
> > 
> > On 2/18/25 11:26 PM, Shinichiro Kawasaki wrote:
> > > CC+: Alan,
> > > 
> > > On Feb 13, 2025 / 08:18, Jens Axboe wrote:
> > > > The conditions for whether or not a request is allowed adding to a
> > > > completion batch are a bit hard to read, and they also have a few
> > > > issues. One is that ioerror may indeed be a random value on
> > > > passthrough,
> > > > and it's being checked unconditionally of whether or not the given
> > > > request is a passthrough request or not.
> > > > 
> > > > Rewrite the conditions to be separate for easier reading, and
> > > > only check
> > > > ioerror for non-passthrough requests. This fixes an issue with bio
> > > > unmapping on passthrough, where it fails getting added to a batch. This
> > > > both leads to suboptimal performance, and may trigger a potential
> > > > schedule-under-atomic condition for polled passthrough IO.
> > > > 
> > > > Fixes: f794f3351f26 ("block: add support for
> > > > blk_mq_end_request_batch()")
> > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > I observed the blktests test case nvme/039 failure with v6.14-rc3
> > > kernel. I
> > > bisected and found that this patch in the v6.14-rc3 is the trigger.
> > > 
> > > The test run output is as follows:
> > > 
> > > nvme/039 => nvme0n1 (test error logging) [failed]
> > >      runtime  5.378s  ...  5.354s
> > >      --- tests/nvme/039.out      2024-09-20 11:20:26.405380875 +0900
> > >      +++
> > > /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad
> > > 2025-02-19 16:13:05.061387179 +0900
> > >      @@ -1,7 +1,3 @@
> > >       Running nvme/039
> > >      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2
> > > / sc 0x81) DNR
> > >      - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
> > >      - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
> > >        Identify(0x6), Access Denied (sct 0x2 / sc 0x86) DNR
> > > cdw10=0x1 cdw11=0x0 cdw12=0x0 cdw13=0x0 cdw14=0x0 cdw15=0x0
> > >      - Read(0x2), Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
> > > cdw10=0x0 cdw11=0x0 cdw12=0x1 cdw13=0x0 cdw14=0x0 cdw15=0x0
> > >       Test complete
> > > 
> > > 
> > > The test case does error injection. Test method requires
> > > reconsideration to
> > > adjust to this kernel change, probably. Help for fix will be
> > > appreciated.
> > 
> > Not really an issue with the test, rather the error injector is broken.
> > I'll investigate.

Alan, thank you for looking into this. Then it sounds that a fix is needed in
kernel side.

> 
> The following change allows the test to pass.
> 
> - The check of (ioerror < 0) should be (ioerror != 0)
> 
> - passthroughs can also have ioerror set, so false needs to be returned.
> This needs to be resolved.
> 
> 
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index fa2a76cc2f73..b2bd2ac40441 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -865,17 +865,17 @@ static inline bool blk_mq_add_to_batch(struct request
> *req,
>          * 1) No batch container
>          * 2) Has scheduler data attached
>          * 3) Not a passthrough request and end_io set
> -        * 4) Not a passthrough request and an ioerror
> +        * 4) An ioerror
>          */
>         if (!iob)
>                 return false;
>         if (req->rq_flags & RQF_SCHED_TAGS)
>                 return false;
> +       if (ioerror)
> +               return false;
>         if (!blk_rq_is_passthrough(req)) {
>                 if (req->end_io)
>                         return false;
> -               if (ioerror < 0)
> -                       return false;
>         }

I think the modification above essentially reverts the trigger commit, so it
means that we can not achieve what the commit aimed at.

I took a look in the call chain. Before the commit, the call chain was
like this for passthrough commands:

nvme_handle_cqe()
 blk_mq_add_to_batch() ... false is returned, then call nvme_pci_complete_rq()
 nvme_pci_complete_rq()
  nvme_complete_rq()
   nvme_end_req()
    nvme_log_err_passthrough() ... log is printed
    __nvme_end_req()           ... the nvme command ends

After the commit, it looks like this:

nvme_handle_cqe()
 blk_mq_add_to_batch() ... true is returned. Set nvme_pci_complete_batch()
nvme_pci_complete_batch()
 nvme_complete_batch()
  nvme_complete_batch()
   nvme_complete_batch_req()
    __nvme_end_req()         ... log is not printed but the command ends,
                                 then nvme/039 failed

Assuming the change by the trigger commit is required, I guess adding
nvme_log_err_passthrough() call in nvme_complete_batch_req() will restore
the log print feature for the passthrough case.

As a trial, I created a patch below [*]. I confirmed it avoids the nvme/039
failure. (As Alan noted, still I also find the check for "ioerror != 0" is
required for non-passthrough case, to make the test case pass.)


Jens, if you have comments, please share.


[*]

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 818d4e49aab5..33030dd9b76a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -496,6 +496,9 @@ void nvme_complete_batch_req(struct request *req)
 {
 	trace_nvme_complete_rq(req);
 	nvme_cleanup_cmd(req);
+	if (unlikely(nvme_req(req)->status && !(req->rq_flags & RQF_QUIET) &&
+		     blk_rq_is_passthrough(req)))
+		nvme_log_err_passthru(req);
 	__nvme_end_req(req);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_batch_req);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fa2a76cc2f73..b891ed113306 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -874,7 +874,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 	if (!blk_rq_is_passthrough(req)) {
 		if (req->end_io)
 			return false;
-		if (ioerror < 0)
+		if (ioerror)
 			return false;
 	}
diff mbox series

Patch

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a0a9007cc1e3..cc1cb5de8fb6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -861,12 +861,22 @@  static inline bool blk_mq_add_to_batch(struct request *req,
 				       void (*complete)(struct io_comp_batch *))
 {
 	/*
-	 * blk_mq_end_request_batch() can't end request allocated from
-	 * sched tags
+	 * Check various conditions that exclude batch processing:
+	 * 1) No batch container
+	 * 2) Has scheduler data attached
+	 * 3) Not a passthrough request and end_io set
+	 * 4) Not a passthrough request and an ioerror
 	 */
-	if (!iob || (req->rq_flags & RQF_SCHED_TAGS) || ioerror ||
-			(req->end_io && !blk_rq_is_passthrough(req)))
+	if (!iob)
 		return false;
+	if (req->rq_flags & RQF_SCHED_TAGS)
+		return false;
+	if (!blk_rq_is_passthrough(req)) {
+		if (req->end_io)
+			return false;
+		if (ioerror < 0)
+			return false;
+	}
 
 	if (!iob->complete)
 		iob->complete = complete;