Message ID | 20230309223312.94595-8-michael.christie@oracle.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | target: TMF and recovery fixes | expand |
On Thu, Mar 09, 2023 at 04:33:01PM -0600, Mike Christie wrote: > > iscsit will set CMD_T_FABRIC_STOP on running commands when its transport > connection is down and it can't send/recv IO (tx/rx threads are killed > or the cleanup thread is run from the one thats up). It will then loop > over running commands and wait for LIO core to complete them or clean > them up if they were on an internal queue waiting to be sent or ackd. The current usage of CMD_T_FABRIC_STOP and CMD_T_ABORTED is to distinguish will command be aborted or finished at the connection release. Technically that means who is in charge to decrease the command's kref. The current usage of CMD_T_FABRIC_STOP and CMD_T_ABORTED is race free - it checks and *changes* the state under a lock. They are mutually exclusive. > Currently, CMD_T_FABRIC_STOP only stops TMRs from operating on the > command but for isert we need to prevent LIO core from calling into > iscsit callouts when the connection is being brought down. If LIO core > queues commands to iscsit and it ends up adding to an internal queue > instead of passing back to the driver then we can end up hanging waiting > on command completion that never occurs because it's stuck on the internal > list (the tx thread is stopped at this time, so it will never loop over > the response list and call into isert). We also want to sync up on a > point where we no longer call into isert so it can cleanup it's structs. If fabric driver knows that responses will not be completed by HW then the fabric driver shall itself complete such responses. Please do not shift this responsibility to LIO core. > This has LIO core treat CMD_T_FABRIC_STOP like CMD_T_STOP during > command execution and also fixes the locking around the > target_cmd_interrupted calls so fabric modules can make sure cmds are > never marked both CMD_T_COMPLETE and CMD_T_STOP|CMD_T_FABRIC_STOP. CMD_T_STOP is some ancient logic that is used to move responses from a failed connection to a new one during recovery in ERL=2. I believe that CMT_T_STOP logic was reused at connection release just to reduce conn/session use-after-free cases at command release. Thanks to this patchset all commands in the connection are waited for the completion in iscsit_release_commands_from_conn(). Is there any sense to use CMD_T_STOP mechanism there now? I believe it's time to remove it and to become like other fabric drivers - just wait for commands in async manner. For connection release CMT_T_STOP is definitely superfluous and error prone now. The long story short, at connection release with ERL=0 I propose to completely avoid CMD_T_STOP logic instead of reusing CMD_T_STOP logic. > > Signed-off-by: Mike Christie <michael.christie@oracle.com> > --- > drivers/target/target_core_sbc.c | 2 +- > drivers/target/target_core_transport.c | 27 +++++++++++++++----------- > 2 files changed, 17 insertions(+), 12 deletions(-) > > diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c > index c1cf37a1b4ce..ff1ae779543f 100644 > --- a/drivers/target/target_core_sbc.c > +++ b/drivers/target/target_core_sbc.c > @@ -457,7 +457,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes > * we don't have to perform the write operation. > */ > WARN_ON(!(cmd->transport_state & > - (CMD_T_ABORTED | CMD_T_STOP))); > + (CMD_T_ABORTED | CMD_T_STOP | CMD_T_FABRIC_STOP))); > goto out; > } > /* > diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c > index 86adff2a86ed..1c23079a5d7f 100644 > --- a/drivers/target/target_core_transport.c > +++ b/drivers/target/target_core_transport.c > @@ -737,8 +737,8 @@ static int transport_cmd_check_stop_to_fabric(struct se_cmd *cmd) > * Determine if frontend context caller is requesting the stopping of > * this command for frontend exceptions. > */ > - if (cmd->transport_state & CMD_T_STOP) { > - pr_debug("%s:%d CMD_T_STOP for ITT: 0x%08llx\n", > + if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) { > + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOP for ITT: 0x%08llx\n", > __func__, __LINE__, cmd->tag); For example, this snippet forbids kref decrement for CMD_TFABRIC_STOP commands although it is supposed to happen - that is a decrement from Core meaning that Core is not needed in this command any more. > > spin_unlock_irqrestore(&cmd->t_state_lock, flags); > @@ -889,7 +889,7 @@ static bool target_cmd_interrupted(struct se_cmd *cmd) > INIT_WORK(&cmd->work, target_abort_work); > queue_work(target_completion_wq, &cmd->work); > return true; > - } else if (cmd->transport_state & CMD_T_STOP) { > + } else if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) { > if (cmd->transport_complete_callback) > cmd->transport_complete_callback(cmd, false, &post_ret); > complete_all(&cmd->t_transport_stop_comp); > @@ -907,13 +907,15 @@ void target_complete_cmd_with_sense(struct se_cmd *cmd, u8 scsi_status, > int success, cpu; > unsigned long flags; > > - if (target_cmd_interrupted(cmd)) > + spin_lock_irqsave(&cmd->t_state_lock, flags); > + if (target_cmd_interrupted(cmd)) { > + spin_unlock_irqrestore(&cmd->t_state_lock, flags); > return; > + } > > cmd->scsi_status = scsi_status; > cmd->sense_reason = sense_reason; > > - spin_lock_irqsave(&cmd->t_state_lock, flags); > switch (cmd->scsi_status) { > case SAM_STAT_CHECK_CONDITION: > if (cmd->se_cmd_flags & SCF_TRANSPORT_TASK_SENSE) > @@ -2277,10 +2279,12 @@ void target_execute_cmd(struct se_cmd *cmd) > * > * If the received CDB has already been aborted stop processing it here. > */ > - if (target_cmd_interrupted(cmd)) > + spin_lock_irq(&cmd->t_state_lock); > + if (target_cmd_interrupted(cmd)) { > + spin_unlock_irq(&cmd->t_state_lock); > return; > + } > > - spin_lock_irq(&cmd->t_state_lock); > cmd->t_state = TRANSPORT_PROCESSING; > cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT; > spin_unlock_irq(&cmd->t_state_lock); > @@ -2847,9 +2851,9 @@ transport_generic_new_cmd(struct se_cmd *cmd) > * Determine if frontend context caller is requesting the stopping of > * this command for frontend exceptions. > */ > - if (cmd->transport_state & CMD_T_STOP && > + if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP) && > !cmd->se_tfo->write_pending_must_be_called) { > - pr_debug("%s:%d CMD_T_STOP for ITT: 0x%08llx\n", > + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOPfor ITT: 0x%08llx\n", > __func__, __LINE__, cmd->tag); > > spin_unlock_irqrestore(&cmd->t_state_lock, flags); > @@ -2880,11 +2884,12 @@ static void transport_write_pending_qf(struct se_cmd *cmd) > bool stop; > > spin_lock_irqsave(&cmd->t_state_lock, flags); > - stop = (cmd->transport_state & (CMD_T_STOP | CMD_T_ABORTED)); > + stop = (cmd->transport_state & > + (CMD_T_STOP | CMD_T_FABRIC_STOP | CMD_T_ABORTED)); > spin_unlock_irqrestore(&cmd->t_state_lock, flags); > > if (stop) { > - pr_debug("%s:%d CMD_T_STOP|CMD_T_ABORTED for ITT: 0x%08llx\n", > + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOP|CMD_T_ABORTED for ITT: 0x%08llx\n", > __func__, __LINE__, cmd->tag); > complete_all(&cmd->t_transport_stop_comp); > return; > -- > 2.31.1 > >
On 3/15/23 5:47 AM, Dmitry Bogdanov wrote: > On Thu, Mar 09, 2023 at 04:33:01PM -0600, Mike Christie wrote: >> >> iscsit will set CMD_T_FABRIC_STOP on running commands when its transport >> connection is down and it can't send/recv IO (tx/rx threads are killed >> or the cleanup thread is run from the one thats up). It will then loop >> over running commands and wait for LIO core to complete them or clean >> them up if they were on an internal queue waiting to be sent or ackd. > > The current usage of CMD_T_FABRIC_STOP and CMD_T_ABORTED is to > distinguish will command be aborted or finished at the connection release. > Technically that means who is in charge to decrease the command's kref. > > The current usage of CMD_T_FABRIC_STOP and CMD_T_ABORTED is race free - > it checks and *changes* the state under a lock. They are mutually > exclusive I wasn't sure about that because __transport_wait_for_tasks will always set CMD_T_FABRIC_STOP if fabric_stop is true. It will do it even if CMD_T_ABORTED is set, so you can have them both set. I changed that in patch 14 so we do what you described above. > >> Currently, CMD_T_FABRIC_STOP only stops TMRs from operating on the >> command but for isert we need to prevent LIO core from calling into >> iscsit callouts when the connection is being brought down. If LIO core >> queues commands to iscsit and it ends up adding to an internal queue >> instead of passing back to the driver then we can end up hanging waiting >> on command completion that never occurs because it's stuck on the internal >> list (the tx thread is stopped at this time, so it will never loop over >> the response list and call into isert). We also want to sync up on a >> point where we no longer call into isert so it can cleanup it's structs. > > If fabric driver knows that responses will not be completed by HW > then the fabric driver shall itself complete such responses. > Please do not shift this responsibility to LIO core. This patch makes it so the fabric driver can tell LIO core to not do the last put, so the fabric driver can do some cleanup on on the cmd when it completes on the backend but before we free it. Basically for isert iscsit wants to call aborted_task to clean up the command. iscsit tcp will also re-call __iscsit_free_cmd to cleanup queue_status calls that were done. Are you saying that is shifting the responsibility to complete the cmd on LIO core? I think it's the opposite since the fabric driver is doing the last put and cleanup. However, are you saying that just adding the check is putting it on LIO core? If so, we can always call check_stop_free in transport_cmd_check_stop_to_fabric. The fabric driver's release_cmd then has to figure out what needs to be cleaned up then. I can make that work. It's more complicated than just what I wrote because of how isert handles failed cmds and tmrs, but it's doable. > >> This has LIO core treat CMD_T_FABRIC_STOP like CMD_T_STOP during >> command execution and also fixes the locking around the >> target_cmd_interrupted calls so fabric modules can make sure cmds are >> never marked both CMD_T_COMPLETE and CMD_T_STOP|CMD_T_FABRIC_STOP. > > CMD_T_STOP is some ancient logic that is used to move responses from a failed > connection to a new one during recovery in ERL=2. > I believe that CMT_T_STOP logic was reused at connection release just > to reduce conn/session use-after-free cases at command release. > > Thanks to this patchset all commands in the connection are waited for > the completion in iscsit_release_commands_from_conn(). Is there any > sense to use CMD_T_STOP mechanism there now? CMD_T_STOP does 2 things for us in iscsit_release_commands_from_conn: 1. It waits for the cmd to complete so we don't release the conn/session from under a cmd. 2. It prevents transport_cmd_check_stop_to_fabric from doing the last put if the command is completing normally (not aborted). iscsit does it instead so we can do some extra cleanup. For iscsit tcp, we don't need 1 or 2. Funnily enough if you never brought up isert in the other patchset, we could just do target_wait_for_cmds and be done :) For isert, we don't need 1. We are using it for #2 due to how the driver handles failed cmds and tmrs. I either need the some check so I can tell transport_cmd_check_stop_to_fabric that iscsit wants to do the last put, or we can do release_cmd based.
On 3/15/23 5:54 PM, Mike Christie wrote: > However, are you saying that just adding the check is putting it on LIO core? > If so, we can always call check_stop_free in transport_cmd_check_stop_to_fabric. > The fabric driver's release_cmd then has to figure out what needs to be cleaned > up then. I can make that work. It's more complicated than just what I wrote > because of how isert handles failed cmds and tmrs, but it's doable. I missed the point of your mail before. CMD_T_STOP sucks so don't replicate it with another bit :) Agree with that :) I'll work on a patch like I described above.
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c index c1cf37a1b4ce..ff1ae779543f 100644 --- a/drivers/target/target_core_sbc.c +++ b/drivers/target/target_core_sbc.c @@ -457,7 +457,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes * we don't have to perform the write operation. */ WARN_ON(!(cmd->transport_state & - (CMD_T_ABORTED | CMD_T_STOP))); + (CMD_T_ABORTED | CMD_T_STOP | CMD_T_FABRIC_STOP))); goto out; } /* diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 86adff2a86ed..1c23079a5d7f 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -737,8 +737,8 @@ static int transport_cmd_check_stop_to_fabric(struct se_cmd *cmd) * Determine if frontend context caller is requesting the stopping of * this command for frontend exceptions. */ - if (cmd->transport_state & CMD_T_STOP) { - pr_debug("%s:%d CMD_T_STOP for ITT: 0x%08llx\n", + if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) { + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOP for ITT: 0x%08llx\n", __func__, __LINE__, cmd->tag); spin_unlock_irqrestore(&cmd->t_state_lock, flags); @@ -889,7 +889,7 @@ static bool target_cmd_interrupted(struct se_cmd *cmd) INIT_WORK(&cmd->work, target_abort_work); queue_work(target_completion_wq, &cmd->work); return true; - } else if (cmd->transport_state & CMD_T_STOP) { + } else if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) { if (cmd->transport_complete_callback) cmd->transport_complete_callback(cmd, false, &post_ret); complete_all(&cmd->t_transport_stop_comp); @@ -907,13 +907,15 @@ void target_complete_cmd_with_sense(struct se_cmd *cmd, u8 scsi_status, int success, cpu; unsigned long flags; - if (target_cmd_interrupted(cmd)) + spin_lock_irqsave(&cmd->t_state_lock, flags); + if (target_cmd_interrupted(cmd)) { + spin_unlock_irqrestore(&cmd->t_state_lock, flags); return; + } cmd->scsi_status = scsi_status; cmd->sense_reason = sense_reason; - spin_lock_irqsave(&cmd->t_state_lock, flags); switch (cmd->scsi_status) { case SAM_STAT_CHECK_CONDITION: if (cmd->se_cmd_flags & SCF_TRANSPORT_TASK_SENSE) @@ -2277,10 +2279,12 @@ void target_execute_cmd(struct se_cmd *cmd) * * If the received CDB has already been aborted stop processing it here. */ - if (target_cmd_interrupted(cmd)) + spin_lock_irq(&cmd->t_state_lock); + if (target_cmd_interrupted(cmd)) { + spin_unlock_irq(&cmd->t_state_lock); return; + } - spin_lock_irq(&cmd->t_state_lock); cmd->t_state = TRANSPORT_PROCESSING; cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT; spin_unlock_irq(&cmd->t_state_lock); @@ -2847,9 +2851,9 @@ transport_generic_new_cmd(struct se_cmd *cmd) * Determine if frontend context caller is requesting the stopping of * this command for frontend exceptions. */ - if (cmd->transport_state & CMD_T_STOP && + if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP) && !cmd->se_tfo->write_pending_must_be_called) { - pr_debug("%s:%d CMD_T_STOP for ITT: 0x%08llx\n", + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOPfor ITT: 0x%08llx\n", __func__, __LINE__, cmd->tag); spin_unlock_irqrestore(&cmd->t_state_lock, flags); @@ -2880,11 +2884,12 @@ static void transport_write_pending_qf(struct se_cmd *cmd) bool stop; spin_lock_irqsave(&cmd->t_state_lock, flags); - stop = (cmd->transport_state & (CMD_T_STOP | CMD_T_ABORTED)); + stop = (cmd->transport_state & + (CMD_T_STOP | CMD_T_FABRIC_STOP | CMD_T_ABORTED)); spin_unlock_irqrestore(&cmd->t_state_lock, flags); if (stop) { - pr_debug("%s:%d CMD_T_STOP|CMD_T_ABORTED for ITT: 0x%08llx\n", + pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOP|CMD_T_ABORTED for ITT: 0x%08llx\n", __func__, __LINE__, cmd->tag); complete_all(&cmd->t_transport_stop_comp); return;
iscsit will set CMD_T_FABRIC_STOP on running commands when its transport connection is down and it can't send/recv IO (tx/rx threads are killed or the cleanup thread is run from the one thats up). It will then loop over running commands and wait for LIO core to complete them or clean them up if they were on an internal queue waiting to be sent or ackd. Currently, CMD_T_FABRIC_STOP only stops TMRs from operating on the command but for isert we need to prevent LIO core from calling into iscsit callouts when the connection is being brought down. If LIO core queues commands to iscsit and it ends up adding to an internal queue instead of passing back to the driver then we can end up hanging waiting on command completion that never occurs because it's stuck on the internal list (the tx thread is stopped at this time, so it will never loop over the response list and call into isert). We also want to sync up on a point where we no longer call into isert so it can cleanup it's structs. This has LIO core treat CMD_T_FABRIC_STOP like CMD_T_STOP during command execution and also fixes the locking around the target_cmd_interrupted calls so fabric modules can make sure cmds are never marked both CMD_T_COMPLETE and CMD_T_STOP|CMD_T_FABRIC_STOP. Signed-off-by: Mike Christie <michael.christie@oracle.com> --- drivers/target/target_core_sbc.c | 2 +- drivers/target/target_core_transport.c | 27 +++++++++++++++----------- 2 files changed, 17 insertions(+), 12 deletions(-)