Message ID | 541C287D.1050900@acm.org (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On 9/19/2014 3:58 PM, Bart Van Assche wrote: > Attempting to connect three times may be insufficient after an > initiator system that was using multiple RDMA channels tries to > relogin. Additionally, this login retry mechanism is a workaround > for particular behavior of the IB/CM. Since the srp_daemon retries > a failed login attempt anyway, remove the stale connection retry > mechanism. > I agree, this work-around doesn't even always work for some targets. In case the RDMA stack restarted while IB ports were down and srp initiator didn't manage to teardown the connections, some targets are keeping cm_ids online returning "stale connection" rejects longer then expected. let user-space retry from scratch... Reviewed-by: Sagi Grimberg <sagig@mellanox.com> > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > --- > drivers/infiniband/ulp/srp/ib_srp.c | 16 +++------------- > 1 file changed, 3 insertions(+), 13 deletions(-) > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index d3c712f..9608e7a 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -904,7 +904,6 @@ static void srp_rport_delete(struct srp_rport *rport) > > static int srp_connect_target(struct srp_target_port *target) > { > - int retries = 3; > int ret; > > WARN_ON_ONCE(target->connected); > @@ -945,19 +944,10 @@ static int srp_connect_target(struct srp_target_port *target) > break; > > case SRP_STALE_CONN: > - /* Our current CM id was stale, and is now in timewait. > - * Try to reconnect with a new one. > - */ > - if (!retries-- || srp_new_cm_id(target)) { > - shost_printk(KERN_ERR, target->scsi_host, PFX > - "giving up on stale connection\n"); > - target->status = -ECONNRESET; > - return target->status; > - } > - > shost_printk(KERN_ERR, target->scsi_host, PFX > - "retrying stale connection\n"); > - break; > + "giving up on stale connection\n"); > + target->status = -ECONNRESET; > + return target->status; > > default: > return target->status; > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> wrote: > Attempting to connect three times may be insufficient after an > initiator system that was using multiple RDMA channels tries to > relogin. Additionally, this login retry mechanism is a workaround > for particular behavior of the IB/CM. Can you be more specific re the particular behavior of the IB CM? added Sean, the CM maintainer. > Since the srp_daemon retries > a failed login attempt anyway, remove the stale connection retry > mechanism. > > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > --- > drivers/infiniband/ulp/srp/ib_srp.c | 16 +++------------- > 1 file changed, 3 insertions(+), 13 deletions(-) > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index d3c712f..9608e7a 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -904,7 +904,6 @@ static void srp_rport_delete(struct srp_rport *rport) > > static int srp_connect_target(struct srp_target_port *target) > { > - int retries = 3; > int ret; > > WARN_ON_ONCE(target->connected); > @@ -945,19 +944,10 @@ static int srp_connect_target(struct srp_target_port *target) > break; > > case SRP_STALE_CONN: > - /* Our current CM id was stale, and is now in timewait. > - * Try to reconnect with a new one. > - */ > - if (!retries-- || srp_new_cm_id(target)) { > - shost_printk(KERN_ERR, target->scsi_host, PFX > - "giving up on stale connection\n"); > - target->status = -ECONNRESET; > - return target->status; > - } > - > shost_printk(KERN_ERR, target->scsi_host, PFX > - "retrying stale connection\n"); > - break; > + "giving up on stale connection\n"); > + target->status = -ECONNRESET; > + return target->status; > > default: > return target->status; > -- > 1.8.4.5 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/20/14 19:45, Or Gerlitz wrote: > On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> wrote: >> Attempting to connect three times may be insufficient after an >> initiator system that was using multiple RDMA channels tries to >> relogin. Additionally, this login retry mechanism is a workaround >> for particular behavior of the IB/CM. > > Can you be more specific re the particular behavior of the IB CM? > added Sean, the CM maintainer. Let's focus on the software behavior instead of the people who are involved. What I have observed several times is that after a power cycle of the initiator system the first few login attempts are rejected. I was assuming that this was due to the IB/CM implementation but now that I have had another look at the logs I see that there is not enough information in the system logs to draw this conclusion. I will add additional logging statements in the initiator and target kernel code such that I can determine the root cause of this behavior. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/02/14 12:34, Bart Van Assche wrote: > On 09/20/14 19:45, Or Gerlitz wrote: >> On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> >> wrote: >>> Attempting to connect three times may be insufficient after an >>> initiator system that was using multiple RDMA channels tries to >>> relogin. Additionally, this login retry mechanism is a workaround >>> for particular behavior of the IB/CM. >> >> Can you be more specific re the particular behavior of the IB CM? >> added Sean, the CM maintainer. > > Let's focus on the software behavior instead of the people who are > involved. What I have observed several times is that after a power cycle > of the initiator system the first few login attempts are rejected. I was > assuming that this was due to the IB/CM implementation but now that I > have had another look at the logs I see that there is not enough > information in the system logs to draw this conclusion. I will add > additional logging statements in the initiator and target kernel code > such that I can determine the root cause of this behavior. (replying to my own e-mail / removed linux-scsi from CC-list) So far I have been able to reproduce this behavior once after pushing the reset button of the initiator system while it was in the connected state. After the initiator system had finished rebooting I started ibdump on both IB ports of the target system (attached to this e-mail). What surprised me is that I found all the messages I expected in the ibdump output (e.g. IB MAD device management query) but no CM messages. Both sides were running FW 2.32.5100. The following messages were logged at the initiator side while ibdump was running at the target side: Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: REJ received Oct 02 17:43:42 msi kernel: scsi host14: REJ reason: stale connection Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: giving up on stale connection Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: Connection 0/12 failed Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: REJ received Oct 02 17:43:42 msi kernel: scsi host15: REJ reason: stale connection Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: giving up on stale connection Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: Connection 0/12 failed After a few more login attempts SRP login succeeded. Bart.
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index d3c712f..9608e7a 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -904,7 +904,6 @@ static void srp_rport_delete(struct srp_rport *rport) static int srp_connect_target(struct srp_target_port *target) { - int retries = 3; int ret; WARN_ON_ONCE(target->connected); @@ -945,19 +944,10 @@ static int srp_connect_target(struct srp_target_port *target) break; case SRP_STALE_CONN: - /* Our current CM id was stale, and is now in timewait. - * Try to reconnect with a new one. - */ - if (!retries-- || srp_new_cm_id(target)) { - shost_printk(KERN_ERR, target->scsi_host, PFX - "giving up on stale connection\n"); - target->status = -ECONNRESET; - return target->status; - } - shost_printk(KERN_ERR, target->scsi_host, PFX - "retrying stale connection\n"); - break; + "giving up on stale connection\n"); + target->status = -ECONNRESET; + return target->status; default: return target->status;
Attempting to connect three times may be insufficient after an initiator system that was using multiple RDMA channels tries to relogin. Additionally, this login retry mechanism is a workaround for particular behavior of the IB/CM. Since the srp_daemon retries a failed login attempt anyway, remove the stale connection retry mechanism. Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- drivers/infiniband/ulp/srp/ib_srp.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-)