Message ID | 20240505124910.1877325-1-dan.aloni@vastdata.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | rpcrdma: don't decref EP if a ESTABLISHED did not happen | expand |
On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote: > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount > underflow. > > The specific scenario that caused this to happen is IB device bonding, > when bringing down one of the ports, or all ports. The situation is not > just a print - it also causes a non-recoverable state it is not even > possible to complete the disconnect and shut it down the mount, > requiring a reboot, suggesting that tear-down is also incomplete in this > state. > > The trivial fix seems to work as such - if we did not receive a > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of > `rpcrdma_xprt_disconnect`. > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') Hi Dan, thanks for the bug report! In the future, note that the Fixes: tag needs to go down by the Signed-off-by: tag. But I wonder if 2acc5 introduced this issue -- ep reference counting seems to have been added before this commit. Second, I wonder if, when bonding interfaces, there's an opportunity for the verbs consumer to take an additional transport reference. Cc'ing linux-rdma for input on that issue. Can you show a brief diagram of when the ep reference count changes when bonding? Also, I note that the WARNING below happened on a RHEL9 kernel: 5.14.0-284.11.1.el9_2.x86_64 Can you confirm that this issue reproduces on v6.8 and newer ? > Example crash: > > rpcrdma: removing device mlx5_3 for 172.21.208.2:20049 > ------------[ cut here ]------------ > refcount_t: underflow; use-after-free. > WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 > Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_> > isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx> > CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1 > Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022 > Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] > RIP: 0010:refcount_warn_saturate+0xba/0x110 > Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7 > RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282 > RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027 > RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0 > RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff > R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000 > R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928 > FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > rpcrdma_ep_put+0x42/0x50 [rpcrdma] > rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma] > xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma] > process_one_work+0x1e5/0x3c0 > ? rescuer_thread+0x3a0/0x3a0 > worker_thread+0x50/0x3b0 > ? rescuer_thread+0x3a0/0x3a0 > kthread+0xd6/0x100 > ? kthread_complete_and_exit+0x20/0x20 > ret_from_fork+0x1f/0x30 > </TASK> > > Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> > --- > net/sunrpc/xprtrdma/verbs.c | 5 ++++- > net/sunrpc/xprtrdma/xprt_rdma.h | 1 + > 2 files changed, 5 insertions(+), 1 deletion(-) > > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c > index 4f8d7efa469f..19996515da94 100644 > --- a/net/sunrpc/xprtrdma/verbs.c > +++ b/net/sunrpc/xprtrdma/verbs.c > @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) > goto disconnected; > case RDMA_CM_EVENT_ESTABLISHED: > rpcrdma_ep_get(ep); > + ep->re_connect_ref = true; > ep->re_connect_status = 1; > rpcrdma_update_cm_private(ep, &event->param.conn); > trace_xprtrdma_inline_thresh(ep); > @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) > ep->re_connect_status = -ECONNABORTED; > disconnected: > rpcrdma_force_disconnect(ep); > - return rpcrdma_ep_put(ep); > + if (ep->re_connect_ref) > + return rpcrdma_ep_put(ep); > + return 0; > default: > break; > } > diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h > index da409450dfc0..1553ef69a844 100644 > --- a/net/sunrpc/xprtrdma/xprt_rdma.h > +++ b/net/sunrpc/xprtrdma/xprt_rdma.h > @@ -84,6 +84,7 @@ struct rpcrdma_ep { > unsigned int re_max_inline_recv; > int re_async_rc; > int re_connect_status; > + bool re_connect_ref; > atomic_t re_receiving; > atomic_t re_force_disconnect; > struct ib_qp_init_attr re_attr; > -- > 2.39.3 >
On 2024-05-05 10:35:46, Chuck Lever wrote: > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote: > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount > > underflow. > > > > The specific scenario that caused this to happen is IB device bonding, > > when bringing down one of the ports, or all ports. The situation is not > > just a print - it also causes a non-recoverable state it is not even > > possible to complete the disconnect and shut it down the mount, > > requiring a reboot, suggesting that tear-down is also incomplete in this > > state. > > > > The trivial fix seems to work as such - if we did not receive a > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of > > `rpcrdma_xprt_disconnect`. > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > > Hi Dan, thanks for the bug report! > [..] > Second, I wonder if, when bonding interfaces, there's an opportunity > for the verbs consumer to take an additional transport reference. > Cc'ing linux-rdma for input on that issue. Can you show a brief > diagram of when the ep reference count changes when bonding? Not sure we need an additional reference here. I understand regarding rpcrdma managing its internal EP refcount, that it is having three in total: one for init, another one for ESTABLISHED mode, and a third for 'outstanding receives', so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me, and I found some more about it, please read on. > Also, I note that the WARNING below happened on a RHEL9 kernel: > > 5.14.0-284.11.1.el9_2.x86_64 > > Can you confirm that this issue reproduces on v6.8 and newer ? Some context: I originally tested with a version that also has implemented a timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat 'culprit' in the negative decref, but there is the other issue and second order effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes the unkillable transport, so we are not reaching teardown at all. With both of these changes applied both problems are gone. As for upstream version testing, with the elrepo build of 6.8.9 which matches vanilla, I don't see the negative kref but I do see the other effect of `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below), which does not release even after ports are back up online. Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);` for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery on bonding, I'll post in a reply. ``` INFO: task kworker/u128:6:1688 blocked for more than 122 seconds. Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000 Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] Call Trace: <TASK> __schedule+0x21c/0x680 schedule+0x31/0xd0 schedule_timeout+0x148/0x160 ? mutex_lock+0xe/0x30 ? cma_process_remove+0x218/0x260 [rdma_cm] ? preempt_count_add+0x70/0xa0 __wait_for_common+0x90/0x1e0 ? __pfx_schedule_timeout+0x10/0x10 cma_remove_one+0x5c/0xd0 [rdma_cm] remove_client_context+0x88/0xd0 [ib_core] disable_device+0x8a/0x160 [ib_core] ? _raw_spin_unlock+0x15/0x30 __ib_unregister_device+0x42/0xb0 [ib_core] ib_unregister_device+0x22/0x30 [ib_core] mlx5r_remove+0x36/0x60 [mlx5_ib] auxiliary_bus_remove+0x18/0x30 device_release_driver_internal+0x193/0x200 bus_remove_device+0xbf/0x130 device_del+0x157/0x3e0 ? devl_param_driverinit_value_get+0x29/0xa0 mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core] mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core] mlx5_do_bond+0x2cb/0x390 [mlx5_core] mlx5_do_bond_work+0x96/0xf0 [mlx5_core] process_one_work+0x174/0x340 worker_thread+0x27e/0x390 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> INFO: task fio:10584 blocked for more than 122 seconds. Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 ```
On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote: > On 2024-05-05 10:35:46, Chuck Lever wrote: > > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote: > > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount > > > underflow. > > > > > > The specific scenario that caused this to happen is IB device bonding, > > > when bringing down one of the ports, or all ports. The situation is not > > > just a print - it also causes a non-recoverable state it is not even > > > possible to complete the disconnect and shut it down the mount, > > > requiring a reboot, suggesting that tear-down is also incomplete in this > > > state. > > > > > > The trivial fix seems to work as such - if we did not receive a > > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise > > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of > > > `rpcrdma_xprt_disconnect`. > > > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > > > > Hi Dan, thanks for the bug report! > > > [..] > > Second, I wonder if, when bonding interfaces, there's an opportunity > > for the verbs consumer to take an additional transport reference. > > Cc'ing linux-rdma for input on that issue. Can you show a brief > > diagram of when the ep reference count changes when bonding? > > Not sure we need an additional reference here. We already have two mechanisms in play: - the ep reference count - the re_connect_status value I would prefer not to add a boolean here. Seems like re_connect_status could do that job if we need it to. I seem to recall that when a device is removed, the verbs consumer doesn't get a DISCONNECT first. Or does it? Having a sequence of CM events that you see in the bonded case would help a lot. > I understand regarding rpcrdma > managing its internal EP refcount, that it is having three in total: one for > init, another one for ESTABLISHED mode, and a third for 'outstanding receives', > so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me, > and I found some more about it, please read on. > > > Also, I note that the WARNING below happened on a RHEL9 kernel: > > > > 5.14.0-284.11.1.el9_2.x86_64 > > > > Can you confirm that this issue reproduces on v6.8 and newer ? > > Some context: I originally tested with a version that also has implemented a > timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat > 'culprit' in the negative decref, but there is the other issue and second order > effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes > the unkillable transport, so we are not reaching teardown at all. With both of > these changes applied both problems are gone. > > As for upstream version testing, with the elrepo build of 6.8.9 which matches > vanilla, I don't see the negative kref but I do see the other effect of > `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below), > which does not release even after ports are back up online. > > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);` > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery > on bonding, I'll post in a reply. It looks like you are trying to fix several issues in one patch. So I need you to separate these issues and the fixes, and let's focus on the upstream kernel (v6.9-rc6) because there's nothing I can do about the RHEL9 kernel, which is clearly a different source base than the one I work on. If we need a "wake_up_all(&ep->re_connect_wait);" during DEVICE_REMOVAL, that should be a separate patch. And you need to figure out if ADDR_CHANGE needs the same treatment: the v2 you just sent changes the behavior of ADDR_CHANGE too but does not mention whether it intended that change. Without a reproducer or a detailed explanation of how the ep reference count changes in step with CM events, I can't properly assess your proposed fix. > ``` > INFO: task kworker/u128:6:1688 blocked for more than 122 seconds. > Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000 > Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] > Call Trace: > <TASK> > __schedule+0x21c/0x680 > schedule+0x31/0xd0 > schedule_timeout+0x148/0x160 > ? mutex_lock+0xe/0x30 > ? cma_process_remove+0x218/0x260 [rdma_cm] > ? preempt_count_add+0x70/0xa0 > __wait_for_common+0x90/0x1e0 > ? __pfx_schedule_timeout+0x10/0x10 > cma_remove_one+0x5c/0xd0 [rdma_cm] > remove_client_context+0x88/0xd0 [ib_core] > disable_device+0x8a/0x160 [ib_core] > ? _raw_spin_unlock+0x15/0x30 > __ib_unregister_device+0x42/0xb0 [ib_core] > ib_unregister_device+0x22/0x30 [ib_core] > mlx5r_remove+0x36/0x60 [mlx5_ib] > auxiliary_bus_remove+0x18/0x30 > device_release_driver_internal+0x193/0x200 > bus_remove_device+0xbf/0x130 > device_del+0x157/0x3e0 > ? devl_param_driverinit_value_get+0x29/0xa0 > mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core] > mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core] > mlx5_do_bond+0x2cb/0x390 [mlx5_core] > mlx5_do_bond_work+0x96/0xf0 [mlx5_core] > process_one_work+0x174/0x340 > worker_thread+0x27e/0x390 > ? __pfx_worker_thread+0x10/0x10 > kthread+0xee/0x120 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x2d/0x50 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1b/0x30 > </TASK> > INFO: task fio:10584 blocked for more than 122 seconds. > Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 > ``` > > -- > Dan Aloni
On 2024-05-05 15:00:10, Chuck Lever wrote: > On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote: > > On 2024-05-05 10:35:46, Chuck Lever wrote: > > [..] > > > Second, I wonder if, when bonding interfaces, there's an opportunity > > > for the verbs consumer to take an additional transport reference. > > > Cc'ing linux-rdma for input on that issue. Can you show a brief > > > diagram of when the ep reference count changes when bonding? > > > > Not sure we need an additional reference here. > > We already have two mechanisms in play: > > - the ep reference count > - the re_connect_status value > > I would prefer not to add a boolean here. Seems like > re_connect_status could do that job if we need it to. That's possible. So before overrides e.g. `ep->re_connect_status = -ENODEV` we would need to remember whether it was `1` and across the call to `rpcrdma_force_disconnect()`, to do the putref() after it. Anyway, the patch will be a big larger I guess. >[..] > > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);` > > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery > > on bonding, I'll post in a reply. > > It looks like you are trying to fix several issues in one patch. So > I need you to separate these issues and the fixes, and let's focus > on the upstream kernel (v6.9-rc6) because there's nothing I can do > about the RHEL9 kernel, which is clearly a different source base > than the one I work on. > > If we need a "wake_up_all(&ep->re_connect_wait);" during > DEVICE_REMOVAL, that should be a separate patch. And you need to > figure out if ADDR_CHANGE needs the same treatment: the v2 you just > sent changes the behavior of ADDR_CHANGE too but does not mention > whether it intended that change. Was it just the `rpcrdma_ep_put` change in this case? Will feel less comfortable changing it without a repro of the `ADDR_CHANGE` case. Going to isolate it. > Without a reproducer or a detailed explanation of how the ep > reference count changes in step with CM events, I can't properly > assess your proposed fix. With 6.8 I only see a single `RDMA_CM_EVENT_DEVICE_REMOVAL` event arriving, driving a single `rpcrdma_ep_put` call to put the reference back at 1, and it waits forever. Getting late here, I'll check on 6.9-rc tomorrow. Maybe for a repro, you can generate a single interface bond? I haven't tried, but possibly it would occur on a single one too if the underlying port is down.
Hi Dan, kernel test robot noticed the following build warnings: [auto build test WARNING on trondmy-nfs/linux-next] [also build test WARNING on linus/master v6.9-rc7 next-20240510] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Dan-Aloni/rpcrdma-don-t-decref-EP-if-a-ESTABLISHED-did-not-happen/20240505-205016 base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next patch link: https://lore.kernel.org/r/20240505124910.1877325-1-dan.aloni%40vastdata.com patch subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen config: parisc-randconfig-r081-20240512 (https://download.01.org/0day-ci/archive/20240513/202405130122.S5dgt6et-lkp@intel.com/config) compiler: hppa-linux-gcc (GCC) 13.2.0 If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202405130122.S5dgt6et-lkp@intel.com/ smatch warnings: net/sunrpc/xprtrdma/verbs.c:254 rpcrdma_cm_event_handler() warn: inconsistent indenting vim +254 net/sunrpc/xprtrdma/verbs.c 87cfb9a0c85ce4 Chuck Lever 2016-09-15 213 ae38288eb73c52 Chuck Lever 2018-10-01 214 /** ae38288eb73c52 Chuck Lever 2018-10-01 215 * rpcrdma_cm_event_handler - Handle RDMA CM events ae38288eb73c52 Chuck Lever 2018-10-01 216 * @id: rdma_cm_id on which an event has occurred ae38288eb73c52 Chuck Lever 2018-10-01 217 * @event: details of the event ae38288eb73c52 Chuck Lever 2018-10-01 218 * ae38288eb73c52 Chuck Lever 2018-10-01 219 * Called with @id's mutex held. Returns 1 if caller should ae38288eb73c52 Chuck Lever 2018-10-01 220 * destroy @id, otherwise 0. ae38288eb73c52 Chuck Lever 2018-10-01 221 */ c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 222 static int ae38288eb73c52 Chuck Lever 2018-10-01 223 rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 224 { 745b734c9bb805 Chuck Lever 2020-02-21 225 struct sockaddr *sap = (struct sockaddr *)&id->route.addr.dst_addr; e28ce90083f032 Chuck Lever 2020-02-21 226 struct rpcrdma_ep *ep = id->context; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 227 ae38288eb73c52 Chuck Lever 2018-10-01 228 might_sleep(); ae38288eb73c52 Chuck Lever 2018-10-01 229 c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 230 switch (event->event) { c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 231 case RDMA_CM_EVENT_ADDR_RESOLVED: c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 232 case RDMA_CM_EVENT_ROUTE_RESOLVED: 93aa8e0a9de80e Chuck Lever 2020-02-21 233 ep->re_async_rc = 0; 93aa8e0a9de80e Chuck Lever 2020-02-21 234 complete(&ep->re_done); 316a616e788658 Chuck Lever 2018-10-01 235 return 0; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 236 case RDMA_CM_EVENT_ADDR_ERROR: 93aa8e0a9de80e Chuck Lever 2020-02-21 237 ep->re_async_rc = -EPROTO; 93aa8e0a9de80e Chuck Lever 2020-02-21 238 complete(&ep->re_done); 316a616e788658 Chuck Lever 2018-10-01 239 return 0; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 240 case RDMA_CM_EVENT_ROUTE_ERROR: 93aa8e0a9de80e Chuck Lever 2020-02-21 241 ep->re_async_rc = -ENETUNREACH; 93aa8e0a9de80e Chuck Lever 2020-02-21 242 complete(&ep->re_done); 316a616e788658 Chuck Lever 2018-10-01 243 return 0; bebd031866caa4 Chuck Lever 2017-04-11 244 case RDMA_CM_EVENT_DEVICE_REMOVAL: 745b734c9bb805 Chuck Lever 2020-02-21 245 pr_info("rpcrdma: removing device %s for %pISpc\n", 745b734c9bb805 Chuck Lever 2020-02-21 246 ep->re_id->device->name, sap); df561f6688fef7 Gustavo A. R. Silva 2020-08-23 247 fallthrough; e28ce90083f032 Chuck Lever 2020-02-21 248 case RDMA_CM_EVENT_ADDR_CHANGE: 93aa8e0a9de80e Chuck Lever 2020-02-21 249 ep->re_connect_status = -ENODEV; e28ce90083f032 Chuck Lever 2020-02-21 250 goto disconnected; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 251 case RDMA_CM_EVENT_ESTABLISHED: 2acc5cae292355 Chuck Lever 2020-06-15 252 rpcrdma_ep_get(ep); c58f2c9a4198e3 Dan Aloni 2024-05-05 253 ep->re_connect_ref = true; 93aa8e0a9de80e Chuck Lever 2020-02-21 @254 ep->re_connect_status = 1; 745b734c9bb805 Chuck Lever 2020-02-21 255 rpcrdma_update_cm_private(ep, &event->param.conn); 745b734c9bb805 Chuck Lever 2020-02-21 256 trace_xprtrdma_inline_thresh(ep); 93aa8e0a9de80e Chuck Lever 2020-02-21 257 wake_up_all(&ep->re_connect_wait); 31e62d25b5b815 Chuck Lever 2018-10-01 258 break; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 259 case RDMA_CM_EVENT_CONNECT_ERROR: 93aa8e0a9de80e Chuck Lever 2020-02-21 260 ep->re_connect_status = -ENOTCONN; af667527b0e349 Chuck Lever 2020-06-27 261 goto wake_connect_worker; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 262 case RDMA_CM_EVENT_UNREACHABLE: 93aa8e0a9de80e Chuck Lever 2020-02-21 263 ep->re_connect_status = -ENETUNREACH; af667527b0e349 Chuck Lever 2020-06-27 264 goto wake_connect_worker; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 265 case RDMA_CM_EVENT_REJECTED: 93aa8e0a9de80e Chuck Lever 2020-02-21 266 ep->re_connect_status = -ECONNREFUSED; 0a90487bf7182c Chuck Lever 2017-02-08 267 if (event->status == IB_CM_REJ_STALE_CONN) 4cf44be6f1e86d Chuck Lever 2020-06-27 268 ep->re_connect_status = -ENOTCONN; af667527b0e349 Chuck Lever 2020-06-27 269 wake_connect_worker: af667527b0e349 Chuck Lever 2020-06-27 270 wake_up_all(&ep->re_connect_wait); af667527b0e349 Chuck Lever 2020-06-27 271 return 0; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 272 case RDMA_CM_EVENT_DISCONNECTED: 93aa8e0a9de80e Chuck Lever 2020-02-21 273 ep->re_connect_status = -ECONNABORTED; 31e62d25b5b815 Chuck Lever 2018-10-01 274 disconnected: c487eb7d8e4157 Chuck Lever 2020-06-15 275 rpcrdma_force_disconnect(ep); c58f2c9a4198e3 Dan Aloni 2024-05-05 276 if (ep->re_connect_ref) 2acc5cae292355 Chuck Lever 2020-06-15 277 return rpcrdma_ep_put(ep); c58f2c9a4198e3 Dan Aloni 2024-05-05 278 return 0; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 279 default: c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 280 break; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 281 } c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 282 c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 283 return 0; c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 284 } c56c65fb67d646 \"Talpey, Thomas\ 2007-09-10 285
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 4f8d7efa469f..19996515da94 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) goto disconnected; case RDMA_CM_EVENT_ESTABLISHED: rpcrdma_ep_get(ep); + ep->re_connect_ref = true; ep->re_connect_status = 1; rpcrdma_update_cm_private(ep, &event->param.conn); trace_xprtrdma_inline_thresh(ep); @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) ep->re_connect_status = -ECONNABORTED; disconnected: rpcrdma_force_disconnect(ep); - return rpcrdma_ep_put(ep); + if (ep->re_connect_ref) + return rpcrdma_ep_put(ep); + return 0; default: break; } diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index da409450dfc0..1553ef69a844 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -84,6 +84,7 @@ struct rpcrdma_ep { unsigned int re_max_inline_recv; int re_async_rc; int re_connect_status; + bool re_connect_ref; atomic_t re_receiving; atomic_t re_force_disconnect; struct ib_qp_init_attr re_attr;
We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount underflow. The specific scenario that caused this to happen is IB device bonding, when bringing down one of the ports, or all ports. The situation is not just a print - it also causes a non-recoverable state it is not even possible to complete the disconnect and shut it down the mount, requiring a reboot, suggesting that tear-down is also incomplete in this state. The trivial fix seems to work as such - if we did not receive a `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise `rpcrdma_xprt_drain` kills the EP prematurely in from the context of `rpcrdma_xprt_disconnect`. Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') Example crash: rpcrdma: removing device mlx5_3 for 172.21.208.2:20049 ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_> isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx> CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G W OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022 Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] RIP: 0010:refcount_warn_saturate+0xba/0x110 Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7 RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027 RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0 RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000 R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928 FS: 0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> rpcrdma_ep_put+0x42/0x50 [rpcrdma] rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma] xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma] process_one_work+0x1e5/0x3c0 ? rescuer_thread+0x3a0/0x3a0 worker_thread+0x50/0x3b0 ? rescuer_thread+0x3a0/0x3a0 kthread+0xd6/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> --- net/sunrpc/xprtrdma/verbs.c | 5 ++++- net/sunrpc/xprtrdma/xprt_rdma.h | 1 + 2 files changed, 5 insertions(+), 1 deletion(-)