Message ID | 20240506093759.2934591-1-dan.aloni@vastdata.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL | expand |
On 06/05/2024 12:37, Dan Aloni wrote: > Under the scenario of IB device bonding, when bringing down one of the > ports, or all ports, we saw xprtrdma entering a non-recoverable state > where it is not even possible to complete the disconnect and shut it > down the mount, requiring a reboot. Following debug, we saw that > transport connect never ended after receiving the > RDMA_CM_EVENT_DEVICE_REMOVAL callback. > > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is > connected, and ESTABLISHED may not have happened. So need to work with > each of these states accordingly. > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') Is this actually the offending commit ? commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount") is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that the cm_id may not be ESTABLISHED (where we need to wake the connect waiter? Question though, in DEVICE_REMOVAL the device is going away as soon as the cm handler callback returns. Shouldn't nfs release all the device resources (related to this cm_id)? afaict it was changed in: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") The patch itself looks reasonable (although I do think that the rdma stack expects the ulp to have the rdma resources released when the callback returns). FWIW in nvme we avoided the problem altogether by registering an ib_client that is called on .remove() and its a separate context that doesn't have all the intricacies with rdma_cm...
On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote: > > On 06/05/2024 12:37, Dan Aloni wrote: > > Under the scenario of IB device bonding, when bringing down one of the > > ports, or all ports, we saw xprtrdma entering a non-recoverable state > > where it is not even possible to complete the disconnect and shut it > > down the mount, requiring a reboot. Following debug, we saw that > > transport connect never ended after receiving the > > RDMA_CM_EVENT_DEVICE_REMOVAL callback. > > > > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is > > connected, and ESTABLISHED may not have happened. So need to work with > > each of these states accordingly. > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > > Is this actually the offending commit ? > > commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS > mount") > is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that > the > cm_id may not be ESTABLISHED (where we need to wake the connect waiter? I'd be OK with discussing possible culprits in the patch description but leaving off a Fixes: tag for now. It would be reasonable to demand that the proposed fix be applied to each LTS kernel and tested individually to ensure there are no side-effects or pre-requisites. > Question though, in DEVICE_REMOVAL the device is going away as soon as the > cm handler callback returns. Shouldn't nfs release all the device resources > (related to this > cm_id)? afaict it was changed in: > e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") In the case where a DEVICE_REMOVAL event fires and a connection hasn't yet been established, my guess is the ep reference count will go to zero when rpcrdma_ep_put() is called. > The patch itself looks reasonable (although I do think that the rdma stack > expects the > ulp to have the rdma resources released when the callback returns). Thanks for the review! Should we add: Reviewed-by: Sagi Grimberg <sagi@grimberg.me> > FWIW in nvme we avoided the problem altogether by registering an ib_client > that is > called on .remove() and its a separate context that doesn't have all the > intricacies with > rdma_cm... I looked at ib_client, years ago, and thought it would be a lot of added complexity. With a code sample (NVMe host) maybe I can put something together.
On 06/05/2024 18:55, Chuck Lever wrote: > On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote: >> On 06/05/2024 12:37, Dan Aloni wrote: >>> Under the scenario of IB device bonding, when bringing down one of the >>> ports, or all ports, we saw xprtrdma entering a non-recoverable state >>> where it is not even possible to complete the disconnect and shut it >>> down the mount, requiring a reboot. Following debug, we saw that >>> transport connect never ended after receiving the >>> RDMA_CM_EVENT_DEVICE_REMOVAL callback. >>> >>> The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is >>> connected, and ESTABLISHED may not have happened. So need to work with >>> each of these states accordingly. >>> >>> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') >> Is this actually the offending commit ? >> >> commit bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS >> mount") >> is the one assuming DEVICE_REMOVAL triggers a disconnect not accounting that >> the >> cm_id may not be ESTABLISHED (where we need to wake the connect waiter? > I'd be OK with discussing possible culprits in the patch description > but leaving off a Fixes: tag for now. Agreed. > > It would be reasonable to demand that the proposed fix be applied > to each LTS kernel and tested individually to ensure there are no > side-effects or pre-requisites. > > >> Question though, in DEVICE_REMOVAL the device is going away as soon as the >> cm handler callback returns. Shouldn't nfs release all the device resources >> (related to this >> cm_id)? afaict it was changed in: >> e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") > In the case where a DEVICE_REMOVAL event fires and a connection > hasn't yet been established, my guess is the ep reference count will > go to zero when rpcrdma_ep_put() is called. Yes, I was actually referring to the case where the connection was established. It looks like rpcrdma_force_disconnect -> xprt_force_disconnect schedules async work to tear things down no? > > >> The patch itself looks reasonable (although I do think that the rdma stack >> expects the >> ulp to have the rdma resources released when the callback returns). > Thanks for the review! Should we add: > > Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Yes. > > >> FWIW in nvme we avoided the problem altogether by registering an ib_client >> that is >> called on .remove() and its a separate context that doesn't have all the >> intricacies with >> rdma_cm... > I looked at ib_client, years ago, and thought it would be a lot of > added complexity. With a code sample (NVMe host) maybe I can put > something together. > > The plus is that there is no need to handle the DEVICE_REMOVAL cm event, which is always nice...
On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote: > Under the scenario of IB device bonding, when bringing down one of the > ports, or all ports, we saw xprtrdma entering a non-recoverable state > where it is not even possible to complete the disconnect and shut it > down the mount, requiring a reboot. Following debug, we saw that > transport connect never ended after receiving the > RDMA_CM_EVENT_DEVICE_REMOVAL callback. > > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is > connected, and ESTABLISHED may not have happened. So need to work with > each of these states accordingly. > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > Cc: Sagi Grimberg <sagi.grimberg@vastdata.com> > Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> > --- > net/sunrpc/xprtrdma/verbs.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c > index 4f8d7efa469f..432557a553e7 100644 > --- a/net/sunrpc/xprtrdma/verbs.c > +++ b/net/sunrpc/xprtrdma/verbs.c > @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) > case RDMA_CM_EVENT_DEVICE_REMOVAL: > pr_info("rpcrdma: removing device %s for %pISpc\n", > ep->re_id->device->name, sap); > - fallthrough; > + switch (xchg(&ep->re_connect_status, -ENODEV)) { > + case 0: goto wake_connect_worker; > + case 1: goto disconnected; > + } > + return 0; > case RDMA_CM_EVENT_ADDR_CHANGE: > ep->re_connect_status = -ENODEV; > goto disconnected; > -- > 2.39.3 > Hi Anna, Please apply this patch with: Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
On Mon, May 06, 2024 at 10:10:12PM +0300, Sagi Grimberg wrote: > > On 06/05/2024 18:55, Chuck Lever wrote: > > On Mon, May 06, 2024 at 06:09:51PM +0300, Sagi Grimberg wrote: > > > Question though, in DEVICE_REMOVAL the device is going away as soon as the > > > cm handler callback returns. Shouldn't nfs release all the device resources > > > (related to this > > > cm_id)? afaict it was changed in: > > > e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") > > In the case where a DEVICE_REMOVAL event fires and a connection > > hasn't yet been established, my guess is the ep reference count will > > go to zero when rpcrdma_ep_put() is called. > > Yes, I was actually referring to the case where the connection was > established. > It looks like rpcrdma_force_disconnect -> xprt_force_disconnect schedules > async > work to tear things down no? Yes because we can't rip out the hardware resources while the RPC client is still using the transport. Converting to use an ib_client might help by first triggering a disconnect, done with proper serialization. > > > FWIW in nvme we avoided the problem altogether by registering an ib_client > > > that is > > > called on .remove() and its a separate context that doesn't have all the > > > intricacies with > > > rdma_cm... > > I looked at ib_client, years ago, and thought it would be a lot of > > added complexity. With a code sample (NVMe host) maybe I can put > > something together. > > The plus is that there is no need to handle the DEVICE_REMOVAL cm event, > which is always nice... I'll have a look, thanks for the suggestion.
On Mon, 2024-05-06 at 16:06 -0400, Chuck Lever wrote: > On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote: > > Under the scenario of IB device bonding, when bringing down one of > > the > > ports, or all ports, we saw xprtrdma entering a non-recoverable > > state > > where it is not even possible to complete the disconnect and shut > > it > > down the mount, requiring a reboot. Following debug, we saw that > > transport connect never ended after receiving the > > RDMA_CM_EVENT_DEVICE_REMOVAL callback. > > > > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is > > connected, and ESTABLISHED may not have happened. So need to work > > with > > each of these states accordingly. > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep > > after it is freed') > > Cc: Sagi Grimberg <sagi.grimberg@vastdata.com> > > Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> > > --- > > net/sunrpc/xprtrdma/verbs.c | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/net/sunrpc/xprtrdma/verbs.c > > b/net/sunrpc/xprtrdma/verbs.c > > index 4f8d7efa469f..432557a553e7 100644 > > --- a/net/sunrpc/xprtrdma/verbs.c > > +++ b/net/sunrpc/xprtrdma/verbs.c > > @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id > > *id, struct rdma_cm_event *event) > > case RDMA_CM_EVENT_DEVICE_REMOVAL: > > pr_info("rpcrdma: removing device %s for > > %pISpc\n", > > ep->re_id->device->name, sap); > > - fallthrough; > > + switch (xchg(&ep->re_connect_status, -ENODEV)) { > > + case 0: goto wake_connect_worker; > > + case 1: goto disconnected; > > + } > > + return 0; > > case RDMA_CM_EVENT_ADDR_CHANGE: > > ep->re_connect_status = -ENODEV; > > goto disconnected; > > -- > > 2.39.3 > > > > Hi Anna, > > Please apply this patch with: > > Reviewed-by: Sagi Grimberg <sagi@grimberg.me> > Reviewed-by: Chuck Lever <chuck.lever@oracle.com> > > Anna is back on leave for a few weeks, so I'll take it.
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 4f8d7efa469f..432557a553e7 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) case RDMA_CM_EVENT_DEVICE_REMOVAL: pr_info("rpcrdma: removing device %s for %pISpc\n", ep->re_id->device->name, sap); - fallthrough; + switch (xchg(&ep->re_connect_status, -ENODEV)) { + case 0: goto wake_connect_worker; + case 1: goto disconnected; + } + return 0; case RDMA_CM_EVENT_ADDR_CHANGE: ep->re_connect_status = -ENODEV; goto disconnected;
Under the scenario of IB device bonding, when bringing down one of the ports, or all ports, we saw xprtrdma entering a non-recoverable state where it is not even possible to complete the disconnect and shut it down the mount, requiring a reboot. Following debug, we saw that transport connect never ended after receiving the RDMA_CM_EVENT_DEVICE_REMOVAL callback. The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is connected, and ESTABLISHED may not have happened. So need to work with each of these states accordingly. Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') Cc: Sagi Grimberg <sagi.grimberg@vastdata.com> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> --- net/sunrpc/xprtrdma/verbs.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)