Message ID | 20201007145326.56850-3-mlombard@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | fix race conditions with task aborts | expand |
On 10/7/20 9:53 AM, Maurizio Lombardi wrote: > The iscsit_release_commands_from_conn() function does the following > operations: > > 1) locks the cmd_lock spinlock > 2) Scans the list of commands and sets the CMD_T_FABRIC_STOP flag > 3) Releases the cmd_lock spinlock > 4) Rescans the list again and clears the i_conn_node link of each command > > If an abort task timer is fired between 3) and 4), it will find > the CMD_T_FABRIC_STOP flag set and won't call list_del_init(); > therefore it may end up calling __iscsit_free_cmd() with a > non-empty i_conn_node list, thus triggering the warning. > > Considering that: > > - we expect list_del_init() to be executed by > iscsit_release_commands_from_conn() when the CMD_T_FABRIC_STOP is set. > - iscsit_aborted_task() is the only function that calls __iscsit_free_cmd() > directly, while all the other functions call iscsit_free_cmd(). > - the warning in __iscsit_free_cmd() is a duplicate (the same warning > can be found in iscsit_free_cmd(). > For iscsi, what does the last put on the cmd for a successful abort. Ignore this conn stop case. If we abort a command successfully, the abort path does: target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd When we create the cmd we do transport_init_se_cmd and that sets the refcount=1. We do target_get_sess_cmd with ack_kref=true so that sets refcount=2. On the abort completion path, target_handle_abort does target_put_sess_cmd (refcount=1 after), check_stop_free ends up doing a put (refcount=0 after). If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd The cmd is now freed. 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released.
Hello Mike, Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): > If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: > > 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. > 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: > > target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd > > The cmd is now freed. > 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. > > Thanks for the review! There are definitely some problems with task aborts and commands' refcounting * but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); unless you are saying that abort tasks should never be executed when the connection is going down and we have to prevent such cases from happening at all. * I did a test a few days ago: 1) blocked the backstore device on the target 2) issued a write command on the initiator 3) waited for abort task and connection shutdown 4) switched the target backstore device to state "running" I triggered the following warnings in target_handle_abort(): } else { /* * Allow the fabric driver to unmap any resources before * releasing the descriptor via TFO->release_cmd(). */ cmd->se_tfo->aborted_task(cmd); if (ack_kref) WARN_ON_ONCE(target_put_sess_cmd(cmd) != 0); <----------------------- /* * To do: establish a unit attention condition on the I_T * nexus associated with cmd. See also the paragraph "Aborting * commands" in SAM. */ } WARN_ON_ONCE(kref_read(&cmd->cmd_kref) == 0); <---------------------- transport_lun_remove_cmd(cmd); transport_cmd_check_stop_to_fabric(cmd); } target_handle_abort() is executed against a command which has already been released. [ 240.883865] iSCSI/iqn.1994-05.com.redhat:9489e049fb5e: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION. [ 353.848593] ABORT_TASK: Found referenced iSCSI task_tag: 101 [ 386.521582] iSCSI Login timeout on Network Portal 0.0.0.0:3260 [ 396.537133] WARNING: CPU: 1 PID: 2936 at drivers/target/target_core_transport.c:805 target_handle_abort+0xce/0x180 [target_core_mod] [ 396.537139] Modules linked in: iscsi_target_mod target_core_user uio target_core_pscsi target_core_file target_core_iblock target_core_mod uinput nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ppdev drm joydev pcspkr i2c_piix4 parport_pc parport xfs libcrc32c sr_mod cdrom sd_mod sg ata_generic ata_piix libata virtio_net net_failover failover serio_raw fuse [last unloaded: ip_tables] [ 396.537302] CPU: 1 PID: 2936 Comm: kworker/1:2 Kdump: loaded Not tainted 4.18.0-240.7.el8.x86_64 #1 [ 396.537306] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [ 396.537353] Workqueue: target_completion target_abort_work [target_core_mod] [ 396.537394] RIP: 0010:target_handle_abort+0xce/0x180 [target_core_mod] [ 396.537399] Code: 00 e8 e6 db 00 f1 41 81 e4 00 00 40 00 74 9d 48 8d bb 0c 01 00 00 f0 ff 8b 0c 01 00 00 0f 88 ed cd 00 00 75 87 e8 32 ed ff ff <0f> 0b e9 7b ff ff ff c6 03 40 0f 1f 44 00 00 0f 1f 44 00 00 48 8b [ 396.537402] RSP: 0018:ffffac1b818bbe80 EFLAGS: 00010286 [ 396.537407] RAX: 0000312f0022d690 RBX: ffff9aec21595350 RCX: 0000000000000000 [ 396.537410] RDX: 0000000000000000 RSI: 0000000000000282 RDI: 0000000000000282 [ 396.537413] RBP: 0000000000000000 R08: ffff9aebf4d65a00 R09: ffff9aebf4c12d98 [ 396.537416] R10: 0000000000000082 R11: 000000000000002e R12: 0000000000400000 [ 396.537418] R13: 0000000000000000 R14: ffff9aec02cd9c00 R15: ffff9aec21595488 [ 396.537423] FS: 0000000000000000(0000) GS:ffff9aec7fa80000(0000) knlGS:0000000000000000 [ 396.537427] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 396.537430] CR2: 000055bfa25e9110 CR3: 00000000614c6000 CR4: 00000000000006e0 [ 396.537444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 396.537447] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 396.537449] Call Trace: [ 396.537504] process_one_work+0x1a7/0x360 [ 396.537525] worker_thread+0x30/0x390 [ 396.537532] ? create_worker+0x1a0/0x1a0 [ 396.537536] kthread+0x112/0x130 [ 396.537542] ? kthread_flush_work_fn+0x10/0x10 [ 396.537550] ret_from_fork+0x35/0x40 [ 396.537570] ---[ end trace d6b5bf9aea218e55 ]--- [ 396.537734] WARNING: CPU: 1 PID: 2936 at drivers/target/target_core_transport.c:813 target_handle_abort+0x178/0x180 [target_core_mod] [ 396.537737] Modules linked in: iscsi_target_mod target_core_user uio target_core_pscsi target_core_file target_core_iblock target_core_mod uinput nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ppdev drm joydev pcspkr i2c_piix4 parport_pc parport xfs libcrc32c sr_mod cdrom sd_mod sg ata_generic ata_piix libata virtio_net net_failover failover serio_raw fuse [last unloaded: ip_tables] [ 396.537811] CPU: 1 PID: 2936 Comm: kworker/1:2 Kdump: loaded Tainted: G W --------- - - 4.18.0-240.7.el8.x86_64 #1 [ 396.537814] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [ 396.537860] Workqueue: target_completion target_abort_work [target_core_mod] [ 396.537903] RIP: 0010:target_handle_abort+0x178/0x180 [target_core_mod] [ 396.537909] Code: 67 03 00 48 85 ed 74 1d 48 8b 45 00 48 8b 7d 08 48 83 c5 18 48 89 de e8 26 db 00 f1 48 8b 45 00 48 85 c0 75 e7 e9 6a ff ff ff <0f> 0b e9 df fe ff ff 90 0f 1f 44 00 00 55 48 8d af 08 01 00 00 53 [ 396.537912] RSP: 0018:ffffac1b818bbe80 EFLAGS: 00010246 [ 396.537916] RAX: 0000000000000000 RBX: ffff9aec21595350 RCX: 0000000000000000 [ 396.537919] RDX: 0000000000000000 RSI: 0000000000000282 RDI: 0000000000000282 [ 396.537922] RBP: 0000000000000000 R08: ffff9aebf4d65a00 R09: ffff9aebf4c12d98 [ 396.537925] R10: 0000000000000082 R11: 000000000000002e R12: 0000000000400000 [ 396.537928] R13: 0000000000000000 R14: ffff9aec02cd9c00 R15: ffff9aec21595488 [ 396.537933] FS: 0000000000000000(0000) GS:ffff9aec7fa80000(0000) knlGS:0000000000000000 [ 396.537936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 396.537939] CR2: 000055bfa25e9110 CR3: 00000000614c6000 CR4: 00000000000006e0 [ 396.537951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 396.537954] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 396.537956] Call Trace: [ 396.537971] process_one_work+0x1a7/0x360 [ 396.537979] worker_thread+0x30/0x390 [ 396.537985] ? create_worker+0x1a0/0x1a0 [ 396.537989] kthread+0x112/0x130 [ 396.537994] ? kthread_flush_work_fn+0x10/0x10 [ 396.538009] ret_from_fork+0x35/0x40 [ 396.538016] ---[ end trace d6b5bf9aea218e56 ]--- [ 396.538221] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 101 Maurizio
On 10/27/20 8:49 AM, Maurizio Lombardi wrote: > Hello Mike, > > Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): >> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: >> >> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. >> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: >> >> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd >> >> The cmd is now freed. >> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. >> >> > > Thanks for the review! > > There are definitely some problems with task aborts and commands' refcounting * > but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); > unless you are saying that abort tasks should never be executed when the connection > is going down and we have to prevent such cases from happening at all. Yeah, I think if we prevent the race then we fix the refcount issue and your issue. Here is a patch that is only compile tested: From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001 From: Mike Christie <michael.christie@oracle.com> Date: Tue, 27 Oct 2020 12:30:53 -0500 Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race The abort and cmd stop paths can race where: 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd The cmd is now freed. 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the CMD_T_ABORTED if the driver is not cleaning up the cmd because of a session shutdown. However, iscsit_release_commands_from_conn only sets the CMD_T_FABRIC_STOP and does not check to see if the abort path has claimed completion ownership of the command. This adds a check in iscsit_release_commands_from_conn so only the abort or fabric stop path cleanup the command. --- drivers/target/iscsi/iscsi_target.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index f77e5ee..85027d3 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd) void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) { spin_lock_bh(&conn->cmd_lock); - if (!list_empty(&cmd->i_conn_node) && - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) + if (!list_empty(&cmd->i_conn_node)) list_del_init(&cmd->i_conn_node); spin_unlock_bh(&conn->cmd_lock); @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) if (se_cmd->se_tfo != NULL) { spin_lock_irq(&se_cmd->t_state_lock); + if (se_cmd->transport_state & CMD_T_ABORTED) { + /* + * LIO's abort path owns the cleanup for this, + * so put it back on the list and let + * aborted_task handle it. + */ + list_add_tail(&cmd->i_conn_node, + &conn->conn_cmd_list); + continue; + } se_cmd->transport_state |= CMD_T_FABRIC_STOP; spin_unlock_irq(&se_cmd->t_state_lock); }
> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote: > > On 10/27/20 8:49 AM, Maurizio Lombardi wrote: >> Hello Mike, >> >> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): >>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: >>> >>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. >>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: >>> >>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd >>> >>> The cmd is now freed. >>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. >>> >>> >> >> Thanks for the review! >> >> There are definitely some problems with task aborts and commands' refcounting * >> but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); >> unless you are saying that abort tasks should never be executed when the connection >> is going down and we have to prevent such cases from happening at all. > > Yeah, I think if we prevent the race then we fix the refcount issue and your issue. > Here is a patch that is only compile tested: > > From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001 > From: Mike Christie <michael.christie@oracle.com> > Date: Tue, 27 Oct 2020 12:30:53 -0500 > Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race > > The abort and cmd stop paths can race where: > > 1. thread1 runs iscsit_release_commands_from_conn and sets > CMD_T_FABRIC_STOP. > 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It > then returns from the aborted_task callout and we finish > target_handle_abort and do: > > target_handle_abort -> transport_cmd_check_stop_to_fabric -> > lio_check_stop_free -> target_put_sess_cmd > > The cmd is now freed. > 3. thread1 now finishes iscsit_release_commands_from_conn and runs > iscsit_free_cmd while accessing a command we just released. > > In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the > CMD_T_ABORTED if the driver is not cleaning up the cmd because of > a session shutdown. However, iscsit_release_commands_from_conn only > sets the CMD_T_FABRIC_STOP and does not check to see if the abort path > has claimed completion ownership of the command. > > This adds a check in iscsit_release_commands_from_conn so only the > abort or fabric stop path cleanup the command. > --- > drivers/target/iscsi/iscsi_target.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c > index f77e5ee..85027d3 100644 > --- a/drivers/target/iscsi/iscsi_target.c > +++ b/drivers/target/iscsi/iscsi_target.c > @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd) > void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) > { > spin_lock_bh(&conn->cmd_lock); > - if (!list_empty(&cmd->i_conn_node) && > - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) > + if (!list_empty(&cmd->i_conn_node)) > list_del_init(&cmd->i_conn_node); > spin_unlock_bh(&conn->cmd_lock); > > @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) > > if (se_cmd->se_tfo != NULL) { > spin_lock_irq(&se_cmd->t_state_lock); > + if (se_cmd->transport_state & CMD_T_ABORTED) { > + /* > + * LIO's abort path owns the cleanup for this, > + * so put it back on the list and let > + * aborted_task handle it. > + */ > + list_add_tail(&cmd->i_conn_node, > + &conn->conn_cmd_list); That should have been a move from the tmp list back to the conn_cmd_list. > + continue; > + } > se_cmd->transport_state |= CMD_T_FABRIC_STOP; > spin_unlock_irq(&se_cmd->t_state_lock); > } > -- > 1.8.3.1 >
Dne 27. 10. 20 v 21:03 Michael Christie napsal(a): > > >> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote: >> >> On 10/27/20 8:49 AM, Maurizio Lombardi wrote: >>> Hello Mike, >>> >>> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): >>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: >>>> >>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. >>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: >>>> >>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd >>>> >>>> The cmd is now freed. >>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. >>>> >>>> >>> >>> Thanks for the review! >>> >>> There are definitely some problems with task aborts and commands' refcounting * >>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); >>> unless you are saying that abort tasks should never be executed when the connection >>> is going down and we have to prevent such cases from happening at all. >> >> Yeah, I think if we prevent the race then we fix the refcount issue and your issue. >> Here is a patch that is only compile tested: >> >> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001 >> From: Mike Christie <michael.christie@oracle.com> >> Date: Tue, 27 Oct 2020 12:30:53 -0500 >> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race >> >> The abort and cmd stop paths can race where: >> >> 1. thread1 runs iscsit_release_commands_from_conn and sets >> CMD_T_FABRIC_STOP. >> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It >> then returns from the aborted_task callout and we finish >> target_handle_abort and do: >> >> target_handle_abort -> transport_cmd_check_stop_to_fabric -> >> lio_check_stop_free -> target_put_sess_cmd >> >> The cmd is now freed. >> 3. thread1 now finishes iscsit_release_commands_from_conn and runs >> iscsit_free_cmd while accessing a command we just released. >> >> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the >> CMD_T_ABORTED if the driver is not cleaning up the cmd because of >> a session shutdown. However, iscsit_release_commands_from_conn only >> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path >> has claimed completion ownership of the command. >> >> This adds a check in iscsit_release_commands_from_conn so only the >> abort or fabric stop path cleanup the command. >> --- >> drivers/target/iscsi/iscsi_target.c | 13 +++++++++++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c >> index f77e5ee..85027d3 100644 >> --- a/drivers/target/iscsi/iscsi_target.c >> +++ b/drivers/target/iscsi/iscsi_target.c >> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >> void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >> { >> spin_lock_bh(&conn->cmd_lock); >> - if (!list_empty(&cmd->i_conn_node) && >> - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) >> + if (!list_empty(&cmd->i_conn_node)) >> list_del_init(&cmd->i_conn_node); >> spin_unlock_bh(&conn->cmd_lock); >> >> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) >> >> if (se_cmd->se_tfo != NULL) { >> spin_lock_irq(&se_cmd->t_state_lock); >> + if (se_cmd->transport_state & CMD_T_ABORTED) { >> + /* >> + * LIO's abort path owns the cleanup for this, >> + * so put it back on the list and let >> + * aborted_task handle it. >> + */ >> + list_add_tail(&cmd->i_conn_node, >> + &conn->conn_cmd_list); > > > That should have been a move from the tmp list back to the conn_cmd_list. Nice, it looks simple and I will test it. I am a bit worried there could be other possible race conditions. Example: thread1: connection is going to be closed, iscsit_release_commands_from_conn() finds a command that is about to be aborted, re-adds it to conn_cmd_list and proceeds. iscsit_close_connection() decreases the conn usage count and finally calls iscsit_free_conn(conn) that destroys the conn structure. thread2: iscsit_aborted_task() gets called and tries to lock the conn->cmd_lock spinlock, dereferencing an invalid pointer. Possible solutions that I can think of: - Make iscsit_release_commands_from_conn() wait for the abort task to finish or - abort handler could hold a reference to the conn structure so that iscsit_close_connection() will sleep when calling iscsit_check_conn_usage_count(conn) until abort finishes. Maurizio
On 10/28/20 12:09 PM, Maurizio Lombardi wrote: > > > Dne 27. 10. 20 v 21:03 Michael Christie napsal(a): >> >> >>> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote: >>> >>> On 10/27/20 8:49 AM, Maurizio Lombardi wrote: >>>> Hello Mike, >>>> >>>> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): >>>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: >>>>> >>>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. >>>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: >>>>> >>>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd >>>>> >>>>> The cmd is now freed. >>>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. >>>>> >>>>> >>>> >>>> Thanks for the review! >>>> >>>> There are definitely some problems with task aborts and commands' refcounting * >>>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); >>>> unless you are saying that abort tasks should never be executed when the connection >>>> is going down and we have to prevent such cases from happening at all. >>> >>> Yeah, I think if we prevent the race then we fix the refcount issue and your issue. >>> Here is a patch that is only compile tested: >>> >>> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001 >>> From: Mike Christie <michael.christie@oracle.com> >>> Date: Tue, 27 Oct 2020 12:30:53 -0500 >>> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race >>> >>> The abort and cmd stop paths can race where: >>> >>> 1. thread1 runs iscsit_release_commands_from_conn and sets >>> CMD_T_FABRIC_STOP. >>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It >>> then returns from the aborted_task callout and we finish >>> target_handle_abort and do: >>> >>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> >>> lio_check_stop_free -> target_put_sess_cmd >>> >>> The cmd is now freed. >>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs >>> iscsit_free_cmd while accessing a command we just released. >>> >>> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the >>> CMD_T_ABORTED if the driver is not cleaning up the cmd because of >>> a session shutdown. However, iscsit_release_commands_from_conn only >>> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path >>> has claimed completion ownership of the command. >>> >>> This adds a check in iscsit_release_commands_from_conn so only the >>> abort or fabric stop path cleanup the command. >>> --- >>> drivers/target/iscsi/iscsi_target.c | 13 +++++++++++-- >>> 1 file changed, 11 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c >>> index f77e5ee..85027d3 100644 >>> --- a/drivers/target/iscsi/iscsi_target.c >>> +++ b/drivers/target/iscsi/iscsi_target.c >>> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >>> void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >>> { >>> spin_lock_bh(&conn->cmd_lock); >>> - if (!list_empty(&cmd->i_conn_node) && >>> - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) >>> + if (!list_empty(&cmd->i_conn_node)) >>> list_del_init(&cmd->i_conn_node); >>> spin_unlock_bh(&conn->cmd_lock); >>> >>> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) >>> >>> if (se_cmd->se_tfo != NULL) { >>> spin_lock_irq(&se_cmd->t_state_lock); >>> + if (se_cmd->transport_state & CMD_T_ABORTED) { >>> + /* >>> + * LIO's abort path owns the cleanup for this, >>> + * so put it back on the list and let >>> + * aborted_task handle it. >>> + */ >>> + list_add_tail(&cmd->i_conn_node, >>> + &conn->conn_cmd_list); >> >> >> That should have been a move from the tmp list back to the conn_cmd_list. > > Nice, it looks simple and I will test it. > I am a bit worried there could be other possible race conditions. > > Example: > > thread1: connection is going to be closed, > iscsit_release_commands_from_conn() finds a command that is about > to be aborted, re-adds it to conn_cmd_list and proceeds. > iscsit_close_connection() decreases the conn usage count and finally calls iscsit_free_conn(conn) > that destroys the conn structure. > > thread2: iscsit_aborted_task() gets called and tries to lock the conn->cmd_lock spinlock, dereferencing > an invalid pointer. Nice catch. You are right. > > Possible solutions that I can think of: > > - Make iscsit_release_commands_from_conn() wait for the abort task to finish Yeah you could set a completion in there then have aborted_task do the complete() call maybe? There is also the transport_cmd_check_stop_to_fabric t_transport_stop_comp completion and transport_wait_for_tasks pairing that seems close to what we want. I think we would have to change it to make it work for this case. Oh yeah, that is more of a brain stroming guess than a review type of comment :) > or > - abort handler could hold a reference to the conn structure so that iscsit_close_connection() > will sleep when calling iscsit_check_conn_usage_count(conn) until abort finishes. >
Dne 28. 10. 20 v 21:37 Mike Christie napsal(a): >> >> Possible solutions that I can think of: >> >> - Make iscsit_release_commands_from_conn() wait for the abort task to finish > > Yeah you could set a completion in there then have aborted_task do the complete() call maybe? > We could do something like this, what do you think? diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c index 067074ef50818..ffd3dbc53a42f 100644 --- a/drivers/target/iscsi/iscsi_target.c +++ b/drivers/target/iscsi/iscsi_target.c @@ -490,13 +490,16 @@ EXPORT_SYMBOL(iscsit_queue_rsp); void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) { + struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL; + spin_lock_bh(&conn->cmd_lock); - if (!list_empty(&cmd->i_conn_node) && - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) + if (!list_empty(&cmd->i_conn_node)) list_del_init(&cmd->i_conn_node); spin_unlock_bh(&conn->cmd_lock); __iscsit_free_cmd(cmd, true); + if (se_cmd && se_cmd->abrt_task_compl) + complete(se_cmd->abrt_task_compl); } EXPORT_SYMBOL(iscsit_aborted_task); @@ -4080,6 +4083,7 @@ int iscsi_target_rx_thread(void *arg) static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) { + DECLARE_COMPLETION_ONSTACK(compl); LIST_HEAD(tmp_list); struct iscsi_cmd *cmd = NULL, *cmd_tmp = NULL; struct iscsi_session *sess = conn->sess; @@ -4096,8 +4100,24 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) if (se_cmd->se_tfo != NULL) { spin_lock_irq(&se_cmd->t_state_lock); + if (se_cmd->transport_state & CMD_T_ABORTED) { + /* + * LIO's abort path owns the cleanup for this, + * so put it back on the list and let + * aborted_task handle it. + */ + list_move_tail(&cmd->i_conn_node, &conn->conn_cmd_list); + WARN_ON_ONCE(se_cmd->abrt_task_compl); + se_cmd->abrt_task_compl = &compl; + } se_cmd->transport_state |= CMD_T_FABRIC_STOP; spin_unlock_irq(&se_cmd->t_state_lock); + + if (se_cmd->abrt_task_compl) { + spin_unlock_bh(&conn->cmd_lock); + wait_for_completion(&compl); + spin_lock_bh(&conn->cmd_lock); + } } } spin_unlock_bh(&conn->cmd_lock); diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index db53a0d649da7..5611e6c00f18c 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -1391,6 +1391,7 @@ void transport_init_se_cmd( init_completion(&cmd->t_transport_stop_comp); cmd->free_compl = NULL; cmd->abrt_compl = NULL; + cmd->abrt_task_compl = NULL; spin_lock_init(&cmd->t_state_lock); INIT_WORK(&cmd->work, NULL); kref_init(&cmd->cmd_kref); diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h index 549947d407cfd..25cc451930281 100644 --- a/include/target/target_core_base.h +++ b/include/target/target_core_base.h @@ -491,6 +491,7 @@ struct se_cmd { struct list_head se_cmd_list; struct completion *free_compl; struct completion *abrt_compl; + struct completion *abrt_task_compl; const struct target_core_fabric_ops *se_tfo; sense_reason_t (*execute_cmd)(struct se_cmd *); sense_reason_t (*transport_complete_callback)(struct se_cmd *, bool, int *);
On 11/10/20 3:29 PM, Maurizio Lombardi wrote: > > > Dne 28. 10. 20 v 21:37 Mike Christie napsal(a): >>> >>> Possible solutions that I can think of: >>> >>> - Make iscsit_release_commands_from_conn() wait for the abort task to finish >> >> Yeah you could set a completion in there then have aborted_task do the complete() call maybe? >> > > We could do something like this, what do you think? > > diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c > index 067074ef50818..ffd3dbc53a42f 100644 > --- a/drivers/target/iscsi/iscsi_target.c > +++ b/drivers/target/iscsi/iscsi_target.c > @@ -490,13 +490,16 @@ EXPORT_SYMBOL(iscsit_queue_rsp); > > void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) > { > + struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL; > + > spin_lock_bh(&conn->cmd_lock); > - if (!list_empty(&cmd->i_conn_node) && > - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) > + if (!list_empty(&cmd->i_conn_node)) > list_del_init(&cmd->i_conn_node); > spin_unlock_bh(&conn->cmd_lock); > > __iscsit_free_cmd(cmd, true); > + if (se_cmd && se_cmd->abrt_task_compl) > + complete(se_cmd->abrt_task_compl); > } > EXPORT_SYMBOL(iscsit_aborted_task); > > @@ -4080,6 +4083,7 @@ int iscsi_target_rx_thread(void *arg) > > static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) > { > + DECLARE_COMPLETION_ONSTACK(compl); > LIST_HEAD(tmp_list); > struct iscsi_cmd *cmd = NULL, *cmd_tmp = NULL; > struct iscsi_session *sess = conn->sess; > @@ -4096,8 +4100,24 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) > > if (se_cmd->se_tfo != NULL) { > spin_lock_irq(&se_cmd->t_state_lock); > + if (se_cmd->transport_state & CMD_T_ABORTED) { > + /* > + * LIO's abort path owns the cleanup for this, > + * so put it back on the list and let > + * aborted_task handle it. > + */ > + list_move_tail(&cmd->i_conn_node, &conn->conn_cmd_list); > + WARN_ON_ONCE(se_cmd->abrt_task_compl); > + se_cmd->abrt_task_compl = &compl; > + } > se_cmd->transport_state |= CMD_T_FABRIC_STOP; > spin_unlock_irq(&se_cmd->t_state_lock); > + > + if (se_cmd->abrt_task_compl) { > + spin_unlock_bh(&conn->cmd_lock); > + wait_for_completion(&compl); > + spin_lock_bh(&conn->cmd_lock); You can still hit your freed conn race I think. The aborted_task callout is not the last time we are referencing the iscsi cmd and conn. That code path still has to do the release_cmd callout for example. Once we get past this wait_for_completion the abort code path could be still running. If this iscsit_release_commands_from_conn completes first then we can hit your case where we free the conn before the abort code path has done release_cmd and we could hit that cmd->conn->sess reference. I think if you do the complete in iscsit_release_cmd then we would not hit that issue. There might be a second issue though. What happens if aborted_task ran first and deleted the cmd from the conn_cmd_list. It would then be running while iscsit_release_commands_from_conn is running. We would then not do the wait_for_completion above. > + } > } > } > spin_unlock_bh(&conn->cmd_lock); > diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c > index db53a0d649da7..5611e6c00f18c 100644 > --- a/drivers/target/target_core_transport.c > +++ b/drivers/target/target_core_transport.c > @@ -1391,6 +1391,7 @@ void transport_init_se_cmd( > init_completion(&cmd->t_transport_stop_comp); > cmd->free_compl = NULL; > cmd->abrt_compl = NULL; > + cmd->abrt_task_compl = NULL; > spin_lock_init(&cmd->t_state_lock); > INIT_WORK(&cmd->work, NULL); > kref_init(&cmd->cmd_kref); > diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h > index 549947d407cfd..25cc451930281 100644 > --- a/include/target/target_core_base.h > +++ b/include/target/target_core_base.h > @@ -491,6 +491,7 @@ struct se_cmd { > struct list_head se_cmd_list; > struct completion *free_compl; > struct completion *abrt_compl; > + struct completion *abrt_task_compl; This should be on the iscsi cmd since only iscsi uses it. > const struct target_core_fabric_ops *se_tfo; > sense_reason_t (*execute_cmd)(struct se_cmd *); > sense_reason_t (*transport_complete_callback)(struct se_cmd *, bool, int *); >
On 10/28/20 12:09 PM, Maurizio Lombardi wrote: > > > Dne 27. 10. 20 v 21:03 Michael Christie napsal(a): >> >> >>> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@oracle.com> wrote: >>> >>> On 10/27/20 8:49 AM, Maurizio Lombardi wrote: >>>> Hello Mike, >>>> >>>> Dne 22. 10. 20 v 4:42 Mike Christie napsal(a): >>>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do: >>>>> >>>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP. >>>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do: >>>>> >>>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd >>>>> >>>>> The cmd is now freed. >>>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released. >>>>> >>>>> >>>> >>>> Thanks for the review! >>>> >>>> There are definitely some problems with task aborts and commands' refcounting * >>>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init()); >>>> unless you are saying that abort tasks should never be executed when the connection >>>> is going down and we have to prevent such cases from happening at all. >>> >>> Yeah, I think if we prevent the race then we fix the refcount issue and your issue. >>> Here is a patch that is only compile tested: >>> >>> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001 >>> From: Mike Christie <michael.christie@oracle.com> >>> Date: Tue, 27 Oct 2020 12:30:53 -0500 >>> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race >>> >>> The abort and cmd stop paths can race where: >>> >>> 1. thread1 runs iscsit_release_commands_from_conn and sets >>> CMD_T_FABRIC_STOP. >>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It >>> then returns from the aborted_task callout and we finish >>> target_handle_abort and do: >>> >>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> >>> lio_check_stop_free -> target_put_sess_cmd >>> >>> The cmd is now freed. >>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs >>> iscsit_free_cmd while accessing a command we just released. >>> >>> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the >>> CMD_T_ABORTED if the driver is not cleaning up the cmd because of >>> a session shutdown. However, iscsit_release_commands_from_conn only >>> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path >>> has claimed completion ownership of the command. >>> >>> This adds a check in iscsit_release_commands_from_conn so only the >>> abort or fabric stop path cleanup the command. >>> --- >>> drivers/target/iscsi/iscsi_target.c | 13 +++++++++++-- >>> 1 file changed, 11 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c >>> index f77e5ee..85027d3 100644 >>> --- a/drivers/target/iscsi/iscsi_target.c >>> +++ b/drivers/target/iscsi/iscsi_target.c >>> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >>> void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd) >>> { >>> spin_lock_bh(&conn->cmd_lock); >>> - if (!list_empty(&cmd->i_conn_node) && >>> - !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP)) >>> + if (!list_empty(&cmd->i_conn_node)) >>> list_del_init(&cmd->i_conn_node); >>> spin_unlock_bh(&conn->cmd_lock); >>> >>> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn) >>> >>> if (se_cmd->se_tfo != NULL) { >>> spin_lock_irq(&se_cmd->t_state_lock); >>> + if (se_cmd->transport_state & CMD_T_ABORTED) { >>> + /* >>> + * LIO's abort path owns the cleanup for this, >>> + * so put it back on the list and let >>> + * aborted_task handle it. >>> + */ >>> + list_add_tail(&cmd->i_conn_node, >>> + &conn->conn_cmd_list); >> >> >> That should have been a move from the tmp list back to the conn_cmd_list. > > Nice, it looks simple and I will test it. > I am a bit worried there could be other possible race conditions. > > Example: > > thread1: connection is going to be closed, > iscsit_release_commands_from_conn() finds a command that is about > to be aborted, re-adds it to conn_cmd_list and proceeds. > iscsit_close_connection() decreases the conn usage count and finally calls iscsit_free_conn(conn) > that destroys the conn structure. > > thread2: iscsit_aborted_task() gets called and tries to lock the conn->cmd_lock spinlock, dereferencing > an invalid pointer. Hey, I tested this out and I do not think this will happen. We will get stuck waiting on the TMF completion for the affected cmd/cmds. In conn_cmd_list we would have [CMD1 -> ABORT TMF]. Those cmds get moved to the tmp list. It might happen where CMD1's CMD_T_ABORTED bit is set, and iscsit_release_commands_from_conn will would put it back onto the conn_cmd_list. But then it will see the ABORT on the list. We will then wait on the ABORT in: iscsit_release_commands_from_conn -> iscsit_free_cmd -> transport_generic_free_cmd. The abort at this time would be waiting on CMD1 in target_put_cmd_and_wait in target_core_tmr.c. Eventually CMD1 completes, aborted_task is run and the cmd is released. target_release_cmd_kref will then do complete on abort_compl. The abort will then wake from target_put_cmd_and_wait and will complete. We will then return from transport_generic_free_cmd for the abort. A LUN RESET should work similarly but we would be waiting on the cmds and aborts and then the LUN RESET would complete.
Dne 11. 11. 20 v 3:16 Mike Christie napsal(a): > Hey, I tested this out and I do not think this will happen. We will get stuck waiting on the TMF completion for the affected cmd/cmds. > > In conn_cmd_list we would have [CMD1 -> ABORT TMF]. Those cmds get moved to the tmp list. It might happen where CMD1's CMD_T_ABORTED bit is set, and iscsit_release_commands_from_conn will would put it back onto the conn_cmd_list. But then it will see the ABORT on the list. We will then wait on the ABORT in: > > iscsit_release_commands_from_conn -> iscsit_free_cmd -> transport_generic_free_cmd. Hi Mike, I'm not sure if I understood this part. The commands are moved to the tmp_list; we check for CMD_T_ABORTED and eventually move the commands from tmp_list back to conn_cmd_list because it's the abort task the one that should do the cleanup. iscsit_release_commands_from_conn() then scans the tmp_list and calls iscsit_free_cmd()... but not against those commands with CMD_T_ABORTED flag set because we just moved them back to conn_cmd_list and aren't linked to tmp_list anymore. Am I missing something? Maurizio
> On Nov 11, 2020, at 8:58 AM, Maurizio Lombardi <mlombard@redhat.com> wrote: > > > > Dne 11. 11. 20 v 3:16 Mike Christie napsal(a): >> Hey, I tested this out and I do not think this will happen. We will get stuck waiting on the TMF completion for the affected cmd/cmds. >> >> In conn_cmd_list we would have [CMD1 -> ABORT TMF]. Those cmds get moved to the tmp list. It might happen where CMD1's CMD_T_ABORTED bit is set, and iscsit_release_commands_from_conn will would put it back onto the conn_cmd_list. But then it will see the ABORT on the list. We will then wait on the ABORT in: >> >> iscsit_release_commands_from_conn -> iscsit_free_cmd -> transport_generic_free_cmd. > > Hi Mike, > > I'm not sure if I understood this part. > > The commands are moved to the tmp_list; > we check for CMD_T_ABORTED and eventually move the commands from tmp_list back to conn_cmd_list > because it's the abort task the one that should do the cleanup. I’m not sure what you mean here. Are you saying both CMD1’s se_cmd and the ABORT’s se_cmd will have the CMD_T_ABORTED bit set and will both go through the aborted_task callout? > > iscsit_release_commands_from_conn() then scans the tmp_list and calls iscsit_free_cmd()... but not against > those commands with CMD_T_ABORTED flag set because we just moved them back to conn_cmd_list > and aren't linked to tmp_list anymore. > > Am I missing something? If you have a SCSI READ/WRITE se_cmd (CMD1 in my example) and a ABORT se_cmd (ABORT TMF in my example) on the conn_cmd_list, then the ABORT’s se_cmd would not have the CMD_T_ABORTED bit set, right? If so, what sets it? If the SCSI R/W has the CMD_T_ABORTED bit set, we move it it back to the conn_cmd_list and the abort code path cleans it up. But then we still have the ABORT’s se_cmd on the tmp_list. We will then call transport_generic_free_cmd(wait_for_tasks=true) -> __transport_wait_for_tasks(fabric_stop=true) And wait for the ABORT to complete, and the ABORT does not complete until the last ref on the command it’s aborting completes. If you have a LUN RESET in the mix like: [CMD1 -> ABORT TMF -> LUN RESET TMF] Then CMD1 and the ABORT could have their CMD_T_ABORTED bit set. core_tmr_drain_tmr_list would call __target_check_io_state during the RESET processing. However, in this case, the LUN RESET’s se_cmd would not have the bit set, so we would end up waiting like I described for that to complete. In that case though the RESET waits for the cmds and tmfs it is cleaning up.
Dne 11. 11. 20 v 16:37 Michael Christie napsal(a): > If the SCSI R/W has the CMD_T_ABORTED bit set, we move it it back to the conn_cmd_list and the abort code path cleans it up. But then we still have the ABORT’s se_cmd on the tmp_list. We will then call > > transport_generic_free_cmd(wait_for_tasks=true) -> __transport_wait_for_tasks(fabric_stop=true) > > And wait for the ABORT to complete, and the ABORT does not complete until the last ref on the command it’s aborting completes. Right. now I understand it. Thanks. Maurizio
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index 3082f5bde9fa..61233755ece0 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -741,8 +741,6 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool check_queues) { struct iscsi_conn *conn = cmd->conn; - WARN_ON(!list_empty(&cmd->i_conn_node)); - if (cmd->data_direction == DMA_TO_DEVICE) { iscsit_stop_dataout_timer(cmd); iscsit_free_r2ts_from_list(cmd);
The iscsit_release_commands_from_conn() function does the following operations: 1) locks the cmd_lock spinlock 2) Scans the list of commands and sets the CMD_T_FABRIC_STOP flag 3) Releases the cmd_lock spinlock 4) Rescans the list again and clears the i_conn_node link of each command If an abort task timer is fired between 3) and 4), it will find the CMD_T_FABRIC_STOP flag set and won't call list_del_init(); therefore it may end up calling __iscsit_free_cmd() with a non-empty i_conn_node list, thus triggering the warning. Considering that: - we expect list_del_init() to be executed by iscsit_release_commands_from_conn() when the CMD_T_FABRIC_STOP is set. - iscsit_aborted_task() is the only function that calls __iscsit_free_cmd() directly, while all the other functions call iscsit_free_cmd(). - the warning in __iscsit_free_cmd() is a duplicate (the same warning can be found in iscsit_free_cmd(). We can fix the bug by simply removing the warning from __iscsit_free_cmd() kernel: ------------[ cut here ]------------ kernel: WARNING: CPU: 1 PID: 21173 at drivers/target/iscsi/iscsi_target_util.c:720 __iscsit_free_cmd+0x26e/0x290 [iscsi_target_mod] ... kernel: CPU: 1 PID: 21173 Comm: kworker/u8:3 Kdump: loaded Not tainted 3.10.0-1062.4.1.el7.x86_64 #1 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015 kernel: Workqueue: tmr-user target_tmr_work [target_core_mod] kernel: Call Trace: kernel: [<ffffffff91d78ba4>] dump_stack+0x19/0x1b kernel: [<ffffffff9169a958>] __warn+0xd8/0x100 kernel: [<ffffffff9169aa9d>] warn_slowpath_null+0x1d/0x20 kernel: [<ffffffffc0a6f69e>] __iscsit_free_cmd+0x26e/0x290 [iscsi_target_mod] kernel: [<ffffffffc0a70bc4>] iscsit_aborted_task+0x64/0x70 [iscsi_target_mod] kernel: [<ffffffffc0a7830a>] lio_aborted_task+0x2a/0x30 [iscsi_target_mod] kernel: [<ffffffffc09fa516>] transport_cmd_finish_abort+0x66/0xb0 [target_core_mod] kernel: [<ffffffffc09f4d92>] core_tmr_abort_task+0x102/0x180 [target_core_mod] kernel: [<ffffffffc09f7bb2>] target_tmr_work+0x152/0x170 [target_core_mod] kernel: [<ffffffff916bd1df>] process_one_work+0x17f/0x440 kernel: [<ffffffff916be2f6>] worker_thread+0x126/0x3c0 kernel: [<ffffffff916be1d0>] ? manage_workers.isra.26+0x2a0/0x2a0 kernel: [<ffffffff916c51b1>] kthread+0xd1/0xe0 kernel: [<ffffffff916c50e0>] ? insert_kthread_work+0x40/0x40 kernel: [<ffffffff91d8bd37>] ret_from_fork_nospec_begin+0x21/0x21 kernel: [<ffffffff916c50e0>] ? insert_kthread_work+0x40/0x40 kernel: ---[ end trace ed2119501826ec7a ]--- Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> --- drivers/target/iscsi/iscsi_target_util.c | 2 -- 1 file changed, 2 deletions(-)