diff mbox

[5/8] IB/srp: Remove stale connection retry mechanism

Message ID 541C287D.1050900@acm.org (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Bart Van Assche Sept. 19, 2014, 12:58 p.m. UTC
Attempting to connect three times may be insufficient after an
initiator system that was using multiple RDMA channels tries to
relogin. Additionally, this login retry mechanism is a workaround
for particular behavior of the IB/CM. Since the srp_daemon retries
a failed login attempt anyway, remove the stale connection retry
mechanism.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

Comments

Sagi Grimberg Sept. 19, 2014, 6:25 p.m. UTC | #1
On 9/19/2014 3:58 PM, Bart Van Assche wrote:
> Attempting to connect three times may be insufficient after an
> initiator system that was using multiple RDMA channels tries to
> relogin. Additionally, this login retry mechanism is a workaround
> for particular behavior of the IB/CM. Since the srp_daemon retries
> a failed login attempt anyway, remove the stale connection retry
> mechanism.
>

I agree, this work-around doesn't even always work for some targets. In
case the RDMA stack restarted while IB ports were down and srp initiator
didn't manage to teardown the connections, some targets are keeping
cm_ids online returning "stale connection" rejects longer then expected.

let user-space retry from scratch...

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>

> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   drivers/infiniband/ulp/srp/ib_srp.c | 16 +++-------------
>   1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index d3c712f..9608e7a 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -904,7 +904,6 @@ static void srp_rport_delete(struct srp_rport *rport)
>
>   static int srp_connect_target(struct srp_target_port *target)
>   {
> - int retries = 3;
>   int ret;
>
>   WARN_ON_ONCE(target->connected);
> @@ -945,19 +944,10 @@ static int srp_connect_target(struct srp_target_port *target)
>   break;
>
>   case SRP_STALE_CONN:
> - /* Our current CM id was stale, and is now in timewait.
> - * Try to reconnect with a new one.
> - */
> - if (!retries-- || srp_new_cm_id(target)) {
> - shost_printk(KERN_ERR, target->scsi_host, PFX
> -     "giving up on stale connection\n");
> - target->status = -ECONNRESET;
> - return target->status;
> - }
> -
>   shost_printk(KERN_ERR, target->scsi_host, PFX
> -     "retrying stale connection\n");
> - break;
> +     "giving up on stale connection\n");
> + target->status = -ECONNRESET;
> + return target->status;
>
>   default:
>   return target->status;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Sept. 20, 2014, 5:45 p.m. UTC | #2
On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> Attempting to connect three times may be insufficient after an
> initiator system that was using multiple RDMA channels tries to
> relogin. Additionally, this login retry mechanism is a workaround
> for particular behavior of the IB/CM.

Can you be more specific re the particular behavior of the IB CM?
added Sean, the CM maintainer.

> Since the srp_daemon retries
> a failed login attempt anyway, remove the stale connection retry
> mechanism.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/infiniband/ulp/srp/ib_srp.c | 16 +++-------------
>  1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index d3c712f..9608e7a 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -904,7 +904,6 @@ static void srp_rport_delete(struct srp_rport *rport)
>
>  static int srp_connect_target(struct srp_target_port *target)
>  {
> -       int retries = 3;
>         int ret;
>
>         WARN_ON_ONCE(target->connected);
> @@ -945,19 +944,10 @@ static int srp_connect_target(struct srp_target_port *target)
>                         break;
>
>                 case SRP_STALE_CONN:
> -                       /* Our current CM id was stale, and is now in timewait.
> -                        * Try to reconnect with a new one.
> -                        */
> -                       if (!retries-- || srp_new_cm_id(target)) {
> -                               shost_printk(KERN_ERR, target->scsi_host, PFX
> -                                            "giving up on stale connection\n");
> -                               target->status = -ECONNRESET;
> -                               return target->status;
> -                       }
> -
>                         shost_printk(KERN_ERR, target->scsi_host, PFX
> -                                    "retrying stale connection\n");
> -                       break;
> +                                    "giving up on stale connection\n");
> +                       target->status = -ECONNRESET;
> +                       return target->status;
>
>                 default:
>                         return target->status;
> --
> 1.8.4.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Oct. 2, 2014, 10:34 a.m. UTC | #3
On 09/20/14 19:45, Or Gerlitz wrote:
> On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> wrote:
>> Attempting to connect three times may be insufficient after an
>> initiator system that was using multiple RDMA channels tries to
>> relogin. Additionally, this login retry mechanism is a workaround
>> for particular behavior of the IB/CM.
>
> Can you be more specific re the particular behavior of the IB CM?
> added Sean, the CM maintainer.

Let's focus on the software behavior instead of the people who are 
involved. What I have observed several times is that after a power cycle 
of the initiator system the first few login attempts are rejected. I was 
assuming that this was due to the IB/CM implementation but now that I 
have had another look at the logs I see that there is not enough 
information in the system logs to draw this conclusion. I will add 
additional logging statements in the initiator and target kernel code 
such that I can determine the root cause of this behavior.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Oct. 3, 2014, 8:51 a.m. UTC | #4
On 10/02/14 12:34, Bart Van Assche wrote:
> On 09/20/14 19:45, Or Gerlitz wrote:
>> On Fri, Sep 19, 2014 at 3:58 PM, Bart Van Assche <bvanassche@acm.org> 
>> wrote:
>>> Attempting to connect three times may be insufficient after an
>>> initiator system that was using multiple RDMA channels tries to
>>> relogin. Additionally, this login retry mechanism is a workaround
>>> for particular behavior of the IB/CM.
>>
>> Can you be more specific re the particular behavior of the IB CM?
>> added Sean, the CM maintainer.
> 
> Let's focus on the software behavior instead of the people who are 
> involved. What I have observed several times is that after a power cycle 
> of the initiator system the first few login attempts are rejected. I was 
> assuming that this was due to the IB/CM implementation but now that I 
> have had another look at the logs I see that there is not enough 
> information in the system logs to draw this conclusion. I will add 
> additional logging statements in the initiator and target kernel code 
> such that I can determine the root cause of this behavior.

(replying to my own e-mail / removed linux-scsi from CC-list)

So far I have been able to reproduce this behavior once after pushing 
the reset button of the initiator system while it was in the connected 
state. After the initiator system had finished rebooting I started 
ibdump on both IB ports of the target system (attached to this e-mail). 
What surprised me is that I found all the messages I expected in the 
ibdump output (e.g. IB MAD device management query) but no CM messages. Both 
sides were running FW 2.32.5100. The following messages were logged at 
the initiator side while ibdump was running at the target side:

Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: REJ received
Oct 02 17:43:42 msi kernel: scsi host14:   REJ reason: stale connection
Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: giving up on stale connection
Oct 02 17:43:42 msi kernel: scsi host14: ib_srp: Connection 0/12 failed
Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: REJ received
Oct 02 17:43:42 msi kernel: scsi host15:   REJ reason: stale connection
Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: giving up on stale connection
Oct 02 17:43:42 msi kernel: scsi host15: ib_srp: Connection 0/12 failed

After a few more login attempts SRP login succeeded.

Bart.
diff mbox

Patch

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index d3c712f..9608e7a 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -904,7 +904,6 @@  static void srp_rport_delete(struct srp_rport *rport)
 
 static int srp_connect_target(struct srp_target_port *target)
 {
-	int retries = 3;
 	int ret;
 
 	WARN_ON_ONCE(target->connected);
@@ -945,19 +944,10 @@  static int srp_connect_target(struct srp_target_port *target)
 			break;
 
 		case SRP_STALE_CONN:
-			/* Our current CM id was stale, and is now in timewait.
-			 * Try to reconnect with a new one.
-			 */
-			if (!retries-- || srp_new_cm_id(target)) {
-				shost_printk(KERN_ERR, target->scsi_host, PFX
-					     "giving up on stale connection\n");
-				target->status = -ECONNRESET;
-				return target->status;
-			}
-
 			shost_printk(KERN_ERR, target->scsi_host, PFX
-				     "retrying stale connection\n");
-			break;
+				     "giving up on stale connection\n");
+			target->status = -ECONNRESET;
+			return target->status;
 
 		default:
 			return target->status;