diff mbox series

[v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL

Message ID 20240506093759.2934591-1-dan.aloni@vastdata.com (mailing list archive)
State New, archived
Headers show
Series [v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL | expand

Commit Message

Dan Aloni May 6, 2024, 9:37 a.m. UTC
Under the scenario of IB device bonding, when bringing down one of the
ports, or all ports, we saw xprtrdma entering a non-recoverable state
where it is not even possible to complete the disconnect and shut it
down the mount, requiring a reboot. Following debug, we saw that
transport connect never ended after receiving the
RDMA_CM_EVENT_DEVICE_REMOVAL callback.

The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
connected, and ESTABLISHED may not have happened. So need to work with
each of these states accordingly.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
Cc: Sagi Grimberg <sagi.grimberg@vastdata.com>
Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
---
 net/sunrpc/xprtrdma/verbs.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Sagi Grimberg May 6, 2024, 3:09 p.m. UTC | #1
On 06/05/2024 12:37, Dan Aloni wrote:
> Under the scenario of IB device bonding, when bringing down one of the
> ports, or all ports, we saw xprtrdma entering a non-recoverable state
> where it is not even possible to complete the disconnect and shut it
> down the mount, requiring a reboot. Following debug, we saw that
> transport connect never ended after receiving the
> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
>
> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> connected, and ESTABLISHED may not have happened. So need to work with
> each of these states accordingly.
>
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Is this actually the offending commit ?

commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an 
NFS mount")
is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting 
that the
cm_id may not be ESTABLISHED (where we need to wake the connect waiter?

Question though, in DEVICE_REMOVAL the device is going away as soon as the
cm handler callback returns. Shouldn't nfs release all the device 
resources (related to this
cm_id)? afaict it was changed in:
e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")

The patch itself looks reasonable (although I do think that the rdma 
stack expects the
ulp to have the rdma resources released when the callback returns).

FWIW in nvme we avoided the problem altogether by registering an 
ib_client that is
called on .remove() and its a separate context that doesn't have all the 
intricacies with
rdma_cm...
Chuck Lever III May 6, 2024, 3:55 p.m. UTC | #2
On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
> 
> On 06/05/2024 12:37, Dan Aloni wrote:
> > Under the scenario of IB device bonding, when bringing down one of the
> > ports, or all ports, we saw xprtrdma entering a non-recoverable state
> > where it is not even possible to complete the disconnect and shut it
> > down the mount, requiring a reboot. Following debug, we saw that
> > transport connect never ended after receiving the
> > RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> > 
> > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> > connected, and ESTABLISHED may not have happened. So need to work with
> > each of these states accordingly.
> > 
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> 
> Is this actually the offending commit ?
> 
> commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS
> mount")
> is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that
> the
> cm_id may not be ESTABLISHED (where we need to wake the connect waiter?

I'd be OK with discussing possible culprits in the patch description
but leaving off a Fixes: tag for now.

It would be reasonable to demand that the proposed fix be applied
to each LTS kernel and tested individually to ensure there are no
side-effects or pre-requisites.


> Question though, in DEVICE_REMOVAL the device is going away as soon as the
> cm handler callback returns. Shouldn't nfs release all the device resources
> (related to this
> cm_id)? afaict it was changed in:
> e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")

In the case where a DEVICE_REMOVAL event fires and a connection
hasn't yet been established, my guess is the ep reference count will
go to zero when rpcrdma_ep_put() is called.


> The patch itself looks reasonable (although I do think that the rdma stack
> expects the
> ulp to have the rdma resources released when the callback returns).

Thanks for the review! Should we add:

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


> FWIW in nvme we avoided the problem altogether by registering an ib_client
> that is
> called on .remove() and its a separate context that doesn't have all the
> intricacies with
> rdma_cm...

I looked at ib_client, years ago, and thought it would be a lot of
added complexity. With a code sample (NVMe host) maybe I can put
something together.
Sagi Grimberg May 6, 2024, 7:10 p.m. UTC | #3
On 06/05/2024 18:55, Chuck Lever wrote:
> On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
>> On 06/05/2024 12:37, Dan Aloni wrote:
>>> Under the scenario of IB device bonding, when bringing down one of the
>>> ports, or all ports, we saw xprtrdma entering a non-recoverable state
>>> where it is not even possible to complete the disconnect and shut it
>>> down the mount, requiring a reboot. Following debug, we saw that
>>> transport connect never ended after receiving the
>>> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
>>>
>>> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
>>> connected, and ESTABLISHED may not have happened. So need to work with
>>> each of these states accordingly.
>>>
>>> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
>> Is this actually the offending commit ?
>>
>> commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS
>> mount")
>> is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that
>> the
>> cm_id may not be ESTABLISHED (where we need to wake the connect waiter?
> I'd be OK with discussing possible culprits in the patch description
> but leaving off a Fixes: tag for now.
Agreed.
>
> It would be reasonable to demand that the proposed fix be applied
> to each LTS kernel and tested individually to ensure there are no
> side-effects or pre-requisites.
>
>
>> Question though, in DEVICE_REMOVAL the device is going away as soon as the
>> cm handler callback returns. Shouldn't nfs release all the device resources
>> (related to this
>> cm_id)? afaict it was changed in:
>> e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
> In the case where a DEVICE_REMOVAL event fires and a connection
> hasn't yet been established, my guess is the ep reference count will
> go to zero when rpcrdma_ep_put() is called.

Yes, I was actually referring to the case where the connection was 
established.
It looks like rpcrdma_force_disconnect -> xprt_force_disconnect 
schedules async
work to tear things down no?

>
>
>> The patch itself looks reasonable (although I do think that the rdma stack
>> expects the
>> ulp to have the rdma resources released when the callback returns).
> Thanks for the review! Should we add:
>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Yes.

>
>
>> FWIW in nvme we avoided the problem altogether by registering an ib_client
>> that is
>> called on .remove() and its a separate context that doesn't have all the
>> intricacies with
>> rdma_cm...
> I looked at ib_client, years ago, and thought it would be a lot of
> added complexity. With a code sample (NVMe host) maybe I can put
> something together.
>
>

The plus is that there is no need to handle the DEVICE_REMOVAL cm event, 
which is
always nice...
Chuck Lever III May 6, 2024, 8:06 p.m. UTC | #4
On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote:
> Under the scenario of IB device bonding, when bringing down one of the
> ports, or all ports, we saw xprtrdma entering a non-recoverable state
> where it is not even possible to complete the disconnect and shut it
> down the mount, requiring a reboot. Following debug, we saw that
> transport connect never ended after receiving the
> RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> 
> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> connected, and ESTABLISHED may not have happened. So need to work with
> each of these states accordingly.
> 
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> Cc: Sagi Grimberg <sagi.grimberg@vastdata.com>
> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..432557a553e7 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  	case RDMA_CM_EVENT_DEVICE_REMOVAL:
>  		pr_info("rpcrdma: removing device %s for %pISpc\n",
>  			ep->re_id->device->name, sap);
> -		fallthrough;
> +		switch (xchg(&ep->re_connect_status, -ENODEV)) {
> +		case 0: goto wake_connect_worker;
> +		case 1: goto disconnected;
> +		}
> +		return 0;
>  	case RDMA_CM_EVENT_ADDR_CHANGE:
>  		ep->re_connect_status = -ENODEV;
>  		goto disconnected;
> -- 
> 2.39.3
> 

Hi Anna,

Please apply this patch with:

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever III May 6, 2024, 8:08 p.m. UTC | #5
On Mon, May 06, 2024 at 10:10:12PM +0300, Sagi Grimberg wrote:
> 
> On 06/05/2024 18:55, Chuck Lever wrote:
> > On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote:
> > > Question though, in DEVICE_REMOVAL the device is going away as soon as the
> > > cm handler callback returns. Shouldn't nfs release all the device resources
> > > (related to this
> > > cm_id)? afaict it was changed in:
> > > e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
> > In the case where a DEVICE_REMOVAL event fires and a connection
> > hasn't yet been established, my guess is the ep reference count will
> > go to zero when rpcrdma_ep_put() is called.
> 
> Yes, I was actually referring to the case where the connection was
> established.
> It looks like rpcrdma_force_disconnect -> xprt_force_disconnect schedules
> async
> work to tear things down no?

Yes because we can't rip out the hardware resources while the
RPC client is still using the transport. Converting to use an
ib_client might help by first triggering a disconnect, done
with proper serialization.


> > > FWIW in nvme we avoided the problem altogether by registering an ib_client
> > > that is
> > > called on .remove() and its a separate context that doesn't have all the
> > > intricacies with
> > > rdma_cm...
> > I looked at ib_client, years ago, and thought it would be a lot of
> > added complexity. With a code sample (NVMe host) maybe I can put
> > something together.
> 
> The plus is that there is no need to handle the DEVICE_REMOVAL cm event,
> which is always nice...

I'll have a look, thanks for the suggestion.
Trond Myklebust May 6, 2024, 8:25 p.m. UTC | #6
On Mon, 2024-05-06 at 16:06 -0400, Chuck Lever wrote:
> On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote:
> > Under the scenario of IB device bonding, when bringing down one of
> > the
> > ports, or all ports, we saw xprtrdma entering a non-recoverable
> > state
> > where it is not even possible to complete the disconnect and shut
> > it
> > down the mount, requiring a reboot. Following debug, we saw that
> > transport connect never ended after receiving the
> > RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> > 
> > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> > connected, and ESTABLISHED may not have happened. So need to work
> > with
> > each of these states accordingly.
> > 
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep
> > after it is freed')
> > Cc: Sagi Grimberg <sagi.grimberg@vastdata.com>
> > Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
> > ---
> >  net/sunrpc/xprtrdma/verbs.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/sunrpc/xprtrdma/verbs.c
> > b/net/sunrpc/xprtrdma/verbs.c
> > index 4f8d7efa469f..432557a553e7 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id
> > *id, struct rdma_cm_event *event)
> >  	case RDMA_CM_EVENT_DEVICE_REMOVAL:
> >  		pr_info("rpcrdma: removing device %s for
> > %pISpc\n",
> >  			ep->re_id->device->name, sap);
> > -		fallthrough;
> > +		switch (xchg(&ep->re_connect_status, -ENODEV)) {
> > +		case 0: goto wake_connect_worker;
> > +		case 1: goto disconnected;
> > +		}
> > +		return 0;
> >  	case RDMA_CM_EVENT_ADDR_CHANGE:
> >  		ep->re_connect_status = -ENODEV;
> >  		goto disconnected;
> > -- 
> > 2.39.3
> > 
> 
> Hi Anna,
> 
> Please apply this patch with:
> 
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
> 
> 
Anna is back on leave for a few weeks, so I'll take it.
diff mbox series

Patch

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4f8d7efa469f..432557a553e7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -244,7 +244,11 @@  rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
 	case RDMA_CM_EVENT_DEVICE_REMOVAL:
 		pr_info("rpcrdma: removing device %s for %pISpc\n",
 			ep->re_id->device->name, sap);
-		fallthrough;
+		switch (xchg(&ep->re_connect_status, -ENODEV)) {
+		case 0: goto wake_connect_worker;
+		case 1: goto disconnected;
+		}
+		return 0;
 	case RDMA_CM_EVENT_ADDR_CHANGE:
 		ep->re_connect_status = -ENODEV;
 		goto disconnected;