
[0/1] Fix not fully initialized SCSI commands

Message ID 20250324084933.15932-1-a.kovaleva@yadro.com (mailing list archive)

Message

Anastasia Kovaleva March 24, 2025, 8:49 a.m. UTC
We have encountered log messages of the following type on initiators:

kernel: sd 16:0:1:84: [sdts] tag#405 timing out command, waited 720s
kernel: sd 16:0:1:84: [sdts] tag#405 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=66636s

The initiator uses dm-mpath for multipathing, the SCSI mid layer, and
the QLogic FC HBA driver (qla2xxx). After debugging, the following call
stack was identified:

blk_mq_sched_dispatch_requests()
  blk_mq_dispatch_rq_list()
    dm_mq_queue_rq()
      map_request()
        ti->type->clone_and_map_rq()    // New cloned request with tag 405
        blk_insert_cloned_request()
          scsi_queue_rq()
            qla2xxx_mqueuecommand()
              qla2xxx_dif_start_scsi_mq()

If qla2xxx_dif_start_scsi_mq() returns an error for any reason (e.g.,
due to extremely heavy traffic causing the driver to exhaust its
handles), scsi_done() -> scsi_end_request() is not called within
qla2xxx_mqueuecommand(). As a result, the SCMD_INITIALIZED flag
remains set. Next, map_request() releases the cloned request and
requeues the original request. Although the cloned request is released,
the associated SCSI command retains stale data from the previous
command.
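
The failure path described above can be modeled in userspace. This is
only a minimal sketch, not kernel code: every model_* name and the
init_time field are hypothetical, and only the flag name mirrors
SCMD_INITIALIZED. It shows the flag surviving a rejected submission,
while a normal completion clears it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; only the flag name mirrors the kernel. */
#define MODEL_SCMD_INITIALIZED (1u << 0)

struct model_cmd {
    unsigned int flags;
    unsigned long init_time;    /* per-command state written only at init */
};

/* Initialize per-command state only when the flag is clear. */
static void model_init_command(struct model_cmd *cmd, unsigned long now)
{
    if (!(cmd->flags & MODEL_SCMD_INITIALIZED)) {
        cmd->flags |= MODEL_SCMD_INITIALIZED;
        cmd->init_time = now;
    }
}

/* Completion path: the only place in this model that clears the flag. */
static void model_end_request(struct model_cmd *cmd)
{
    cmd->flags &= ~MODEL_SCMD_INITIALIZED;
}

/* Queue a command; when the driver rejects it, no completion path runs. */
static bool model_queue_rq(struct model_cmd *cmd, unsigned long now,
                           bool driver_ok)
{
    model_init_command(cmd, now);
    return driver_ok;
}

/* Walk through the rejected submission, then a normal completion. */
void model_flag_selfcheck(void)
{
    struct model_cmd cmd = { 0, 0 };

    /* Driver rejects the command at t=100: flag and state are left behind. */
    assert(!model_queue_rq(&cmd, 100, false));
    assert(cmd.flags & MODEL_SCMD_INITIALIZED);
    assert(cmd.init_time == 100);

    /* A completed command goes through the completion path instead. */
    model_end_request(&cmd);
    assert(!(cmd.flags & MODEL_SCMD_INITIALIZED));
}
```

Releasing the cloned request is deliberately not modeled as touching
the command, which is the point: nothing on that path resets the flag.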

If all I/O traffic stops for an extended period and later resumes, the
following scenario may occur:

blk_mq_sched_dispatch_requests()
  blk_mq_dispatch_rq_list()
    dm_mq_queue_rq()
      map_request()
        ti->type->clone_and_map_rq()    // New cloned request uses tag 405 again
        blk_insert_cloned_request()
          scsi_queue_rq()


Within scsi_queue_rq(), scsi_init_command() does not call
scsi_initialize_rq() because the SCMD_INITIALIZED flag is already set.
The command therefore keeps the stale state from its previous use, and
when it completes in scsi_complete(), the scsi_cmd_runtime_exceeded()
check returns true, causing the command to fail.
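
The arithmetic behind the log lines can be sketched as follows. This is
a userspace illustration under stated assumptions, not the kernel
implementation: the sketch_* names are hypothetical, time is counted in
seconds rather than jiffies, the runtime budget is assumed to be
(allowed + 1) * timeout, and allowed = 5 with a 120 s timeout is chosen
only because it reproduces the 720 s from the log.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_INITIALIZED (1u << 0)   /* stands in for SCMD_INITIALIZED */

struct sketch_cmd {
    unsigned int flags;
    unsigned long alloc_s;     /* time of last initialization, in seconds */
    unsigned int allowed;      /* permitted retries (assumed value) */
    unsigned long timeout_s;   /* per-request timeout (assumed value) */
};

/* Refresh the timestamp only when the flag is clear. */
static void sketch_init_command(struct sketch_cmd *cmd, unsigned long now_s)
{
    if (!(cmd->flags & SKETCH_INITIALIZED)) {
        cmd->flags |= SKETCH_INITIALIZED;
        cmd->alloc_s = now_s;
    }
}

/* Assumed runtime budget: (allowed + 1) * timeout. */
static unsigned long sketch_wait_for(const struct sketch_cmd *cmd)
{
    return (cmd->allowed + 1UL) * cmd->timeout_s;
}

/* True when the command's age exceeds its runtime budget. */
static bool sketch_runtime_exceeded(const struct sketch_cmd *cmd,
                                    unsigned long now_s)
{
    return now_s - cmd->alloc_s > sketch_wait_for(cmd);
}

/* Compare a reused command with a stale flag against a fresh one. */
void sketch_age_selfcheck(void)
{
    /* Flag left set by an earlier rejected submission at t=0. */
    struct sketch_cmd stale = { SKETCH_INITIALIZED, 0, 5, 120 };
    struct sketch_cmd fresh = { 0, 0, 5, 120 };

    sketch_init_command(&stale, 66636);   /* skipped: flag already set */
    sketch_init_command(&fresh, 66636);   /* timestamp refreshed */

    assert(sketch_wait_for(&stale) == 720);
    assert(sketch_runtime_exceeded(&stale, 66636));
    assert(!sketch_runtime_exceeded(&fresh, 66640));
}
```

With the timestamp never refreshed, the command looks 66636 s old at
completion, far beyond the 720 s budget, even though the new request
itself has only just been issued.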

This issue appeared after commit 4abafdc4360d ("block: remove the
initialize_rq_fn blk_mq_ops method"). Before this change, the
initialize_rq_fn method forcibly initialized the SCSI command in
blk_get_request(). There may be other places where a command is queued
in scsi_queue_rq() but scsi_done() is not called.

Anastasia Kovaleva (1):
  scsi: uninit not completed scsi cmd

 drivers/scsi/scsi_lib.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--
2.40.3

Comments

James Bottomley March 24, 2025, 11:52 a.m. UTC | #1
On Mon, 2025-03-24 at 11:49 +0300, Anastasia Kovaleva wrote:
> We have encountered the following type of logs on initiators:

Hey, Anna,

Just so you know what's going on, Yadro is on the US SDN sanctions
list:

https://sanctionssearch.ofac.treas.gov/Details.aspx?id=39140

Which means that under the latest LF guidance:

https://www.linuxfoundation.org/blog/navigating-global-regulations-and-open-source-us-ofac-sanctions

We could take your patch as is, but we can't discuss changing it with
you (Point 3: avoid two-way engagement).

Really sorry about this,

Regards,

James Bottomley