xprtrdma: Wake up re_connect_wait on disconnect

Message ID 20200620171805.1748399-1-dan@kernelim.com (mailing list archive)
State New, archived

Commit Message

Dan Aloni June 20, 2020, 5:18 p.m. UTC
Given that rpcrdma_xprt_connect() happens from workqueue context, in cases where
connections don't succeed, something needs to wake it up. In my case, this has
been observed when the CM callback received `RDMA_CM_EVENT_REJECTED`, and
`rpcrdma_xprt_connect()` slept forever.

This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
rpcrdma_cm_event_handler()').

Signed-off-by: Dan Aloni <dan@kernelim.com>
CC: Chuck Lever <chuck.lever@oracle.com>
---

Notes:
    Hi Chuck,
    
    Maybe I missed something, as it is not clear to me how otherwise (without this
    patch), re_connect_wait can be woken up in this situation. Please explain?

 net/sunrpc/xprtrdma/verbs.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Chuck Lever June 20, 2020, 6:46 p.m. UTC | #1
Hi Dan-

> On Jun 20, 2020, at 1:18 PM, Dan Aloni <dan@kernelim.com> wrote:
> 
> Given that rpcrdma_xprt_connect() happens from workqueue context, in cases where
> connections don't succeed, something needs to wake it up. In my case, this has
> been observed when the CM callback received `RDMA_CM_EVENT_REJECTED`, and
> `rpcrdma_xprt_connect()` slept forever.

Interesting. My development and testing generates plenty of REJECTED connection
requests, but I never saw this particular failure mode.


> This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
> rpcrdma_cm_event_handler()').

The patch looks sensible. I'll pull it into my test harness.


> Signed-off-by: Dan Aloni <dan@kernelim.com>
> CC: Chuck Lever <chuck.lever@oracle.com>
> ---
> 
> Notes:
>    Hi Chuck,
> 
>    Maybe I missed something, as it is not clear to me how otherwise (without this
>    patch), re_connect_wait can be woken up in this situation. Please explain?
> 
> net/sunrpc/xprtrdma/verbs.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 2ae348377806..8bd76a47a91f 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> 		ep->re_connect_status = -ECONNABORTED;
> disconnected:
> 		xprt_force_disconnect(xprt);
> +		wake_up_all(&ep->re_connect_wait);
> 		return rpcrdma_ep_destroy(ep);
> 	default:
> 		break;
> -- 
> 2.25.4
> 

--
Chuck Lever

Chuck Lever June 21, 2020, 2:49 p.m. UTC | #2
Hi Dan-

> On Jun 20, 2020, at 2:46 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> Hi Dan-
> 
>> On Jun 20, 2020, at 1:18 PM, Dan Aloni <dan@kernelim.com> wrote:
>> 
>> Given that rpcrdma_xprt_connect() happens from workqueue context, in cases where
>> connections don't succeed, something needs to wake it up. In my case, this has
>> been observed when the CM callback received `RDMA_CM_EVENT_REJECTED`, and
>> `rpcrdma_xprt_connect()` slept forever.
> 
> Interesting. My development and testing generates plenty of REJECTED connection
> requests, but I never saw this particular failure mode.

Correction: My testing _used_ _to_ generate REJECTED events regularly. It does
not seem to any more, even after client crashes. So that explains why I haven't
seen this before.

I haven't reproduced the problem here, but the fix still looks proper to me,
and doesn't appear to introduce any regressions. I do have some issues with your
proposed patch, though.

The first paragraph of the patch description is incorrect. RDMA_CM_EVENT_DISCONNECTED
can occur only once a connection has been established. That guarantees there are no
waiters on re_connect_wait in that case. It's connect errors that need to wake up
the connect worker.


>> This continues the fix in commit 58bd6656f808 ('xprtrdma: Restore wake-up-all to
>> rpcrdma_cm_event_handler()').

IMO this paragraph needs to be replaced by:

Fixes: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")


>> Signed-off-by: Dan Aloni <dan@kernelim.com>
>> CC: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> 
>> Notes:
>>   Hi Chuck,
>> 
>>   Maybe I missed something, as it is not clear to me how otherwise (without this
>>   patch), re_connect_wait can be woken up in this situation. Please explain?
>> 
>> net/sunrpc/xprtrdma/verbs.c | 1 +
>> 1 file changed, 1 insertion(+)
>> 
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 2ae348377806..8bd76a47a91f 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>> 		ep->re_connect_status = -ECONNABORTED;
>> disconnected:
>> 		xprt_force_disconnect(xprt);
>> +		wake_up_all(&ep->re_connect_wait);
>> 		return rpcrdma_ep_destroy(ep);
>> 	default:
>> 		break;

This hunk does not apply on top of fixes I've already sent to Anna for 5.8-rc1.

So, if you don't object, I'll adjust your patch (this hunk and the description)
before sending it along to Anna.


--
Chuck Lever

Dan Aloni June 21, 2020, 3:11 p.m. UTC | #3
On Sun, Jun 21, 2020 at 10:49:53AM -0400, Chuck Lever wrote:
> >> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> >> index 2ae348377806..8bd76a47a91f 100644
> >> --- a/net/sunrpc/xprtrdma/verbs.c
> >> +++ b/net/sunrpc/xprtrdma/verbs.c
> >> @@ -289,6 +289,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> >> 		ep->re_connect_status = -ECONNABORTED;
> >> disconnected:
> >> 		xprt_force_disconnect(xprt);
> >> +		wake_up_all(&ep->re_connect_wait);
> >> 		return rpcrdma_ep_destroy(ep);
> >> 	default:
> >> 		break;
> 
> This hunk does not apply on top of fixes I've already sent to Anna for 5.8-rc1.
> 
> So, if you don't object, I'll adjust your patch (this hunk and the description)
> before sending it along to Anna.

Sure, go ahead. Thanks for working on this!

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 2ae348377806..8bd76a47a91f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -289,6 +289,7 @@  rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
 		ep->re_connect_status = -ECONNABORTED;
 disconnected:
 		xprt_force_disconnect(xprt);
+		wake_up_all(&ep->re_connect_wait);
 		return rpcrdma_ep_destroy(ep);
 	default:
 		break;