Message ID | 166118222093.3250511.11454048195824271658.stgit@morisot.1015granger.net (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Jason Gunthorpe |
Series | [v1] RDMA/core: Fix check_flush_dependency splat on addr_wq |
On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
> While setting up a new lab, I accidentally misconfigured the
> Ethernet port for a system that tried an NFS mount using RoCE.
> This made the NFS server unreachable. The following WARNING
> popped on the NFS client while waiting for the mount attempt to
> time out:
>
> Aug 20 17:12:05 bazille kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
> Aug 20 17:12:05 bazille kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca

<...>

> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> prevent a priority inversion.

But why do you have WQ_MEM_RECLAIM in xprtiod?

1270         wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);

IMHO, it would be nicer to remove WQ_MEM_RECLAIM instead of adding it.

Thanks

> Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  drivers/infiniband/core/addr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index f253295795f0..5c36d01ebf0b 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -872,7 +872,7 @@ static struct notifier_block nb = {
>
>  int addr_init(void)
>  {
> -	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
> +	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
>  	if (!addr_wq)
>  		return -ENOMEM;
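The dependency the splat complains about can be reduced to a small sketch. The module below is illustrative only (names are made up; this is not rpcrdma or ib_core code): a work item running on a WQ_MEM_RECLAIM queue synchronously cancels work owned by a queue without that flag, which is the xprtiod -> ib_addr relationship in the trace above. Whether the WARN actually fires depends on timing, since the target work must still be pending or running.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

/*
 * Sketch of the pattern behind the splat: a work item on a
 * WQ_MEM_RECLAIM queue (like xprtiod) synchronously cancels work
 * owned by a queue without that flag (like ib_addr).
 */
static struct workqueue_struct *reclaim_wq;	/* stands in for xprtiod */
static struct workqueue_struct *plain_wq;	/* stands in for ib_addr */

static void plain_handler(struct work_struct *work)
{
	msleep(1000);	/* keep the work in flight so the cancel must wait */
}
static DECLARE_WORK(plain_work, plain_handler);

static void reclaim_handler(struct work_struct *work)
{
	/* Waiting across the reclaim boundary: this is what
	 * check_flush_dependency() warns about when plain_wq
	 * lacks WQ_MEM_RECLAIM.
	 */
	cancel_work_sync(&plain_work);
}
static DECLARE_WORK(reclaim_work, reclaim_handler);

static int __init flushdep_demo_init(void)
{
	reclaim_wq = alloc_workqueue("demo_reclaim",
				     WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	plain_wq = alloc_ordered_workqueue("demo_plain", 0);
	if (!reclaim_wq || !plain_wq)
		goto out_free;

	queue_work(plain_wq, &plain_work);
	queue_work(reclaim_wq, &reclaim_work);
	return 0;

out_free:
	if (reclaim_wq)
		destroy_workqueue(reclaim_wq);
	if (plain_wq)
		destroy_workqueue(plain_wq);
	return -ENOMEM;
}
module_init(flushdep_demo_init);

static void __exit flushdep_demo_exit(void)
{
	destroy_workqueue(reclaim_wq);
	destroy_workqueue(plain_wq);
}
module_exit(flushdep_demo_exit);

MODULE_LICENSE("GPL");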
> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:

<...>

>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>> prevent a priority inversion.
>
> But why do you have WQ_MEM_RECLAIM in xprtiod?

Because RPC is under a filesystem (NFS). Therefore it has to handle
writeback demanded by direct reclaim. All of the storage ULPs have
this constraint, in fact.

<...>

--
Chuck Lever
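To make that constraint concrete, the dependency Chuck describes looks roughly like the chain below (function names are approximate and given only for illustration, in the same style as the call chain quoted later in this thread). Once direct reclaim needs dirty NFS pages written back, the RPC work carrying that writeback must be able to run, which is why its queue is created with WQ_MEM_RECLAIM and therefore has a guaranteed rescuer thread:

direct reclaim (task under PF_MEMALLOC / memalloc_noio)
 nfs_writepages()                          /* filesystem writeback */
  rpc_execute()                            /* the WRITE is an RPC */
   queue_work(xprtiod_workqueue, ...)      /* transport connect/IO work */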
On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
> > On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:

<...>

> >> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
> >> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> >> prevent a priority inversion.
> >
> > But why do you have WQ_MEM_RECLAIM in xprtiod?
>
> Because RPC is under a filesystem (NFS). Therefore it has to handle
> writeback demanded by direct reclaim. All of the storage ULPs have
> this constraint, in fact.

I don't know, this ib_addr workqueue is used when a connection is created.

Jason, what do you think?

Thanks
> On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:

<...>

>>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>> prevent a priority inversion.
>>>
>>> But why do you have WQ_MEM_RECLAIM in xprtiod?
>>
>> Because RPC is under a filesystem (NFS). Therefore it has to handle
>> writeback demanded by direct reclaim. All of the storage ULPs have
>> this constraint, in fact.
>
> I don't know, this ib_addr workqueue is used when a connection is created.

Reconnection is exactly when we need to ensure that creating a new
connection won't trigger more memory allocation, because that will
immediately deadlock.

Again, all network storage ULPs have this constraint.

> Jason, what do you think?

--
Chuck Lever
On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote:
> > On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:

<...>

> > I don't know, this ib_addr workqueue is used when a connection is created.
>
> Reconnection is exactly when we need to ensure that creating a new
> connection won't trigger more memory allocation, because that will
> immediately deadlock.
>
> Again, all network storage ULPs have this constraint.

IMHO this whole concept is broken.

The RDMA stack does not operate globally under RECLAIM, nor should it.

If you attempt to do a reconnection/etc from within a RECLAIM context
it will deadlock on one of the many allocations that are made to
support opening the connection.

The general idea of reclaim is that the entire task context working
under the reclaim is marked with an override of the gfp flags to make
all allocations under that call chain reclaim safe.

But rdmacm does allocations outside this, e.g. in the WQs processing
the CM packets. So this doesn't work and we will deadlock.

Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM here
and there.

For instance, this patch is just incorrect: you can't use
WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything
useful to happen:

 addr_resolve()
  addr_resolve_neigh()
   fetch_ha()
    ib_nl_fetch_ha()
     ib_nl_ip_send_msg()
      skb = nlmsg_new(len, GFP_KERNEL);

So regardless of the MEM_RECLAIM, the *actual work* being canceled can
be stuck on a non-reclaim-safe allocation.

Jason
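As a sketch of what that means in practice (the handler below is hypothetical, but the nlmsg_new() call mirrors the chain above): even if the queue running this work is WQ_MEM_RECLAIM, the work itself still allocates with GFP_KERNEL, so anything that cancels or flushes it from a reclaim context can end up waiting on that allocation.

#include <linux/workqueue.h>
#include <net/netlink.h>

/* Hypothetical work handler, not actual ib_core code. */
static void resolve_work_handler(struct work_struct *work)
{
	struct sk_buff *skb;

	/* A GFP_KERNEL allocation inside the work item: the queue's
	 * WQ_MEM_RECLAIM flag does nothing to make this reclaim safe.
	 */
	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
	if (!skb)
		return;

	/* ... fill in and send the netlink request ... */
	nlmsg_free(skb);
}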
> On Aug 26, 2022, at 9:29 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote:

<...>

> For instance, this patch is just incorrect: you can't use
> WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything
> useful to happen:
>
>  addr_resolve()
>   addr_resolve_neigh()
>    fetch_ha()
>     ib_nl_fetch_ha()
>      ib_nl_ip_send_msg()
>       skb = nlmsg_new(len, GFP_KERNEL);
>
> So regardless of the MEM_RECLAIM, the *actual work* being canceled can
> be stuck on a non-reclaim-safe allocation.

I see recent commits that do exactly what I've done for the reason I've
done it:

4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")
533d2e8b4d5e ("nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown")

I accept that this might be a long chain to pull, but we need a plan to
resolve this. Storage ULPs go to a lot of trouble to pre-allocate
resources to avoid deadlocking in reclaim -- handling reclaim properly
is supposed to be designed into this stack.

--
Chuck Lever
On Fri, Aug 26, 2022 at 02:02:55PM +0000, Chuck Lever III wrote:

> I see recent commits that do exactly what I've done for the reason I've
> done it:
>
> 4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")

No, this one says:

  The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
  entangled with memory reclaim, so this flag is appropriate.

So it is OK; it is not the same thing as adding WQ_MEM_RECLAIM to a WQ
that allocates memory.

> I accept that this might be a long chain to pull, but we need a plan
> to resolve this.

It is not just a long chain, it is something that was never designed to
even work or be thought through. People put storage ULPs on top of this
and just ignored the problem.

If someone wants to tackle this then we need a comprehensive patch
series identifying which functions are safe to call under memory
reclaim contexts and then fully auditing them to confirm that they are
actually safe.

Right now I don't even have the basic information about which functions
the storage community needs to be reclaim safe.

Jason
> On Aug 26, 2022, at 10:08 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 26, 2022 at 02:02:55PM +0000, Chuck Lever III wrote:

<...>

> If someone wants to tackle this then we need a comprehensive patch
> series identifying which functions are safe to call under memory
> reclaim contexts and then fully auditing them to confirm that they are
> actually safe.
>
> Right now I don't even have the basic information about which functions
> the storage community needs to be reclaim safe.

The connect APIs would be a place to start. In the meantime, though...

Two or three years ago I spent some effort to ensure that closing an
RDMA connection leaves a client-side RPC/RDMA transport with no RDMA
resources associated with it. It releases the CQs, QP, and all the MRs.
That makes initial connect and reconnect both behave exactly the same,
and guarantees that a reconnect does not get stuck with an old CQ that
is no longer working or a QP that is in TIMEWAIT.

However, that does mean that substantial resource allocation is done on
every reconnect.

One way to resolve the check_flush_dependency() splat would be to have
rpcrdma.ko allocate its own workqueue for handling connections and MR
allocation, and leave WQ_MEM_RECLAIM disabled for it. Basically,
replace the use of the xprtiod workqueue for RPC/RDMA transports.

--
Chuck Lever
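A rough sketch of that alternative, with hypothetical names (this is not the actual rpcrdma implementation): give the transport its own connect workqueue without WQ_MEM_RECLAIM, so its work items can flush or cancel the !WQ_MEM_RECLAIM ib_addr queue without tripping check_flush_dependency().

#include <linux/workqueue.h>

/* Hypothetical: a dedicated rpcrdma connect workqueue that makes no
 * reclaim-safety claim, since connection setup allocates memory anyway.
 */
static struct workqueue_struct *rpcrdma_connect_wq;

int rpcrdma_connect_wq_create(void)
{
	rpcrdma_connect_wq = alloc_workqueue("rpcrdma_connect",
					     WQ_UNBOUND, 0);
	return rpcrdma_connect_wq ? 0 : -ENOMEM;
}

void rpcrdma_connect_wq_destroy(void)
{
	destroy_workqueue(rpcrdma_connect_wq);
}

/*
 * Connect work would then be queued here instead of on xprtiod,
 * e.g. (hypothetical call site):
 *
 *	queue_delayed_work(rpcrdma_connect_wq, &ep->re_connect_worker, 0);
 */

The trade-off is that connect work then has no guaranteed rescuer thread, i.e. it is not promised to make progress under memory pressure, which is exactly the tension the rest of the thread debates.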
On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:

> The connect APIs would be a place to start. In the meantime, though...
>
> Two or three years ago I spent some effort to ensure that closing an
> RDMA connection leaves a client-side RPC/RDMA transport with no RDMA
> resources associated with it. It releases the CQs, QP, and all the MRs.
> That makes initial connect and reconnect both behave exactly the same,
> and guarantees that a reconnect does not get stuck with an old CQ that
> is no longer working or a QP that is in TIMEWAIT.
>
> However, that does mean that substantial resource allocation is done on
> every reconnect.

And if the resource allocations fail, then what happens? The storage
ULP retries forever and is effectively deadlocked?

How much allocation can you safely do under GFP_NOIO?

Jason
> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:

<...>

>> However, that does mean that substantial resource allocation is done on
>> every reconnect.
>
> And if the resource allocations fail, then what happens? The storage
> ULP retries forever and is effectively deadlocked?

The reconnection attempt fails, and any resources allocated during that
attempt are released. The ULP waits a bit then tries again until it
works or is interrupted.

A deadlock might occur if one of those allocations triggers additional
reclaim activity.

> How much allocation can you safely do under GFP_NOIO?

My naive take is that doing those allocations under NOIO would help
avoid recursion during memory exhaustion.

--
Chuck Lever
On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:

> The reconnection attempt fails, and any resources allocated during that
> attempt are released. The ULP waits a bit then tries again until it
> works or is interrupted.
>
> A deadlock might occur if one of those allocations triggers additional
> reclaim activity.

No, you are deadlocked now.

If a direct reclaim calls back into NFS we are already at the point
where normal allocations fail, and we are accessing the emergency
reserve.

When reclaim does this it marks the entire task with
memalloc_noio_save(), which forces GFP_NOIO on every allocation that
task makes, meaning every allocation comes from the emergency reserve
already.

This is why it (barely) works *at all* with RDMA.

If during the writeback the reserve is exhausted and memory allocation
fails, then the IO stack is in trouble: either it fails the writeback
(then what?) or it deadlocks the kernel because it *cannot* make
forward progress without those memory allocations.

The fact that we have cases where the storage thread under
memalloc_noio_save() becomes contingent on the forward progress of
other contexts that don't have memalloc_noio_save() is a fairly serious
problem I can't see a solution to.

Even a simple case like mlx5 may cause the NIC to trigger a host memory
allocation, which is done in another thread and done as a normal
GFP_KERNEL allocation. This memory allocation must progress before a
CQ/QP/MR/etc. can be created. So now we are deadlocked again.

Jason
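For reference, a small sketch of the per-task mechanism Jason describes; the transport type and connect helper below are hypothetical. Between memalloc_noio_save() and memalloc_noio_restore(), every allocation made by this task is implicitly limited to GFP_NOIO, even in callees that pass GFP_KERNEL. Because the override is per task, allocations done on the caller's behalf by other threads (CM workqueues, a firmware pager) are not covered, which is the gap being pointed out.

#include <linux/sched/mm.h>

struct example_xprt;					/* hypothetical transport */
int example_xprt_connect(struct example_xprt *xprt);	/* hypothetical helper */

static int example_reconnect_for_writeback(struct example_xprt *xprt)
{
	unsigned int noio_flags;
	int rc;

	/* From here on, this task's GFP_KERNEL allocations behave as GFP_NOIO. */
	noio_flags = memalloc_noio_save();

	rc = example_xprt_connect(xprt);

	memalloc_noio_restore(noio_flags);
	return rc;
}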
> On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:

<...>

>> A deadlock might occur if one of those allocations triggers additional
>> reclaim activity.
>
> No, you are deadlocked now.

GFP_KERNEL can and will give up eventually, in which case the
connection attempt fails and any previously allocated memory is
released. Something else can then make progress.

Single page allocation nearly always succeeds. It's the larger-order
allocations that can block for long periods, and that's not necessarily
because memory is low -- it can happen when one NUMA node's memory is
heavily fragmented.

<...>

> The fact that we have cases where the storage thread under
> memalloc_noio_save() becomes contingent on the forward progress of
> other contexts that don't have memalloc_noio_save() is a fairly serious
> problem I can't see a solution to.

This issue seems to be addressed in the socket stack, so I don't
believe there's _no_ solution for RDMA. Usually the trick is to
communicate the memalloc_noio settings somehow to other allocating
threads. We could use cgroups, for example, to collect these threads
and resources under one GFP umbrella. /eyeroll /ducks

If nothing else, we can talk with the MM folks about planning
improvements. We've just gone through this with NFS on the socket
stack.

> Even a simple case like mlx5 may cause the NIC to trigger a host memory
> allocation, which is done in another thread and done as a normal
> GFP_KERNEL allocation. This memory allocation must progress before a
> CQ/QP/MR/etc. can be created. So now we are deadlocked again.

That sounds to me like a bug in mlx5. The driver is supposed to respect
the caller's GFP settings. Again, if the request is small, it's likely
to succeed anyway, but larger requests are not reliable and need to
fail quickly so the system can move on to other fishing spots.

I would like to at least get rid of the check_flush_dependency splat,
which will fire a lot more often than we will get stuck in a reclaim
allocation corner. I'm testing a patch that converts rpcrdma not to use
MEM_RECLAIM work queues and notes how extensive the problem actually is.

--
Chuck Lever
On Mon, Aug 29, 2022 at 06:15:28PM +0000, Chuck Lever III wrote:

> GFP_KERNEL can and will give up eventually, in which case the
> connection attempt fails and any previously allocated memory is
> released. Something else can then make progress.

Something else, maybe for a time, but likely the storage stack is
forever stuck.

> Single page allocation nearly always succeeds. It's the larger-order
> allocations that can block for long periods, and that's not necessarily
> because memory is low -- it can happen when one NUMA node's memory is
> heavily fragmented.

We've done a lot of work in the rdma stack and drivers to avoid
multi-page allocations. But we might need a lot of them, and I'm
skeptical about the claim that they always succeed.

> This issue seems to be addressed in the socket stack, so I don't
> believe there's _no_ solution for RDMA. Usually the trick is to
> communicate the memalloc_noio settings somehow to other allocating
> threads.

And how do you do that when the other threads may have already started
their work before a reclaim writeback is triggered? We may already be
blocked inside GFP_KERNEL -- heck, we may already be inside reclaim
from within one of our own threads!

> If nothing else, we can talk with the MM folks about planning
> improvements. We've just gone through this with NFS on the socket
> stack.

I'm not aware of any work in this area.

>> Even a simple case like mlx5 may cause the NIC to trigger a host memory
>> allocation, which is done in another thread and done as a normal
>> GFP_KERNEL allocation. This memory allocation must progress before a
>> CQ/QP/MR/etc. can be created. So now we are deadlocked again.
>
> That sounds to me like a bug in mlx5. The driver is supposed to respect
> the caller's GFP settings. Again, if the request is small, it's likely
> to succeed anyway, but larger requests are not reliable and need to
> fail quickly so the system can move on to other fishing spots.

It is a design artifact; the FW is the one requesting the memory and it
has no idea about kernel GFP flags. As above, a FW thread could have
already started requesting memory for some other purpose, and we may
already be inside the mlx5 FW page request thread under a GFP_KERNEL
allocation doing reclaim. How can this ever be fixed?

> I would like to at least get rid of the check_flush_dependency splat,
> which will fire a lot more often than we will get stuck in a reclaim
> allocation corner. I'm testing a patch that converts rpcrdma not to use
> MEM_RECLAIM work queues and notes how extensive the problem actually is.

Ok

Jason
> On Aug 29, 2022, at 2:26 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Aug 29, 2022 at 06:15:28PM +0000, Chuck Lever III wrote:

<...>

>> That sounds to me like a bug in mlx5. The driver is supposed to respect
>> the caller's GFP settings. Again, if the request is small, it's likely
>> to succeed anyway, but larger requests are not reliable and need to
>> fail quickly so the system can move on to other fishing spots.
>
> It is a design artifact; the FW is the one requesting the memory and it
> has no idea about kernel GFP flags. As above, a FW thread could have
> already started requesting memory for some other purpose, and we may
> already be inside the mlx5 FW page request thread under a GFP_KERNEL
> allocation doing reclaim. How can this ever be fixed?

I'm willing to admit I'm no expert here. But... IIUC the deadlock
problem is triggered by /waiting/ for memory to become available to
satisfy an allocation request.

So using GFP_NOWAIT, GFP_NOIO/memalloc_noio, and GFP_NOFS/memalloc_nofs
when drivers allocate memory should be enough to prevent a deadlock and
keep the allocations from diving into reserved memory. I believe only
GFP_ATOMIC goes for reserved memory pools. These others are normal
allocations that simply do not wait if a direct reclaim should be
required.

The second-order issue is that the "failed to allocate" recovery paths
are not likely to be well tested, and these other flags make that kind
of failure more likely. Enable memory allocation failure injection and
begin fixing the shit that comes up.

If you've got "can't fail" scenarios, we'll have to look at those
closely.

--
Chuck Lever
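As an illustration of the fail-fast behaviour being suggested (all names and structures here are hypothetical, not code from any driver): connect-path allocations avoid waiting on I/O-dependent reclaim, and larger requests refuse to retry hard, so a failed attempt returns quickly and the transport simply retries the connect later.

#include <linux/slab.h>

/* Hypothetical receive-buffer setup for a connect attempt. */
struct example_ring {
	void *buf;
	size_t len;
};

static struct example_ring *example_ring_alloc(size_t len)
{
	struct example_ring *ring;

	/* Small allocation: GFP_NOIO keeps it from recursing into the I/O path. */
	ring = kzalloc(sizeof(*ring), GFP_NOIO);
	if (!ring)
		return NULL;

	/* Larger allocation: also refuse to retry hard, so it fails fast
	 * under fragmentation instead of stalling the reconnect.
	 */
	ring->buf = kzalloc(len, GFP_NOIO | __GFP_NORETRY);
	if (!ring->buf) {
		kfree(ring);
		return NULL;	/* caller fails this connect attempt and retries later */
	}
	ring->len = len;
	return ring;
}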
diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index f253295795f0..5c36d01ebf0b 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -872,7 +872,7 @@ static struct notifier_block nb = {
 
 int addr_init(void)
 {
-	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
+	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
 	if (!addr_wq)
 		return -ENOMEM;
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

Aug 20 17:12:05 bazille kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
Aug 20 17:12:05 bazille kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
Aug 20 17:12:05 bazille kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
Aug 20 17:12:05 bazille kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
Aug 20 17:12:05 bazille kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
Aug 20 17:12:05 bazille kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
Aug 20 17:12:05 bazille kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
Aug 20 17:12:05 bazille kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
Aug 20 17:12:05 bazille kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
Aug 20 17:12:05 bazille kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
Aug 20 17:12:05 bazille kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
Aug 20 17:12:05 bazille kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
Aug 20 17:12:05 bazille kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
Aug 20 17:12:05 bazille kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
Aug 20 17:12:05 bazille kernel: FS: 0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
Aug 20 17:12:05 bazille kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 17:12:05 bazille kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
Aug 20 17:12:05 bazille kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 20 17:12:05 bazille kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 20 17:12:05 bazille kernel: Call Trace:
Aug 20 17:12:05 bazille kernel:  <TASK>
Aug 20 17:12:05 bazille kernel:  __flush_work.isra.0+0xaf/0x188
Aug 20 17:12:05 bazille kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
Aug 20 17:12:05 bazille kernel:  ? lock_timer_base+0x38/0x5f
Aug 20 17:12:05 bazille kernel:  __cancel_work_timer+0xea/0x13d
Aug 20 17:12:05 bazille kernel:  ? preempt_latency_start+0x2b/0x46
Aug 20 17:12:05 bazille kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
Aug 20 17:12:05 bazille kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
Aug 20 17:12:05 bazille kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
Aug 20 17:12:05 bazille kernel:  ? _raw_spin_unlock+0x14/0x29
Aug 20 17:12:05 bazille kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
Aug 20 17:12:05 bazille kernel:  ? finish_task_switch.isra.0+0x171/0x249
Aug 20 17:12:05 bazille kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
Aug 20 17:12:05 bazille kernel:  process_one_work+0x1d8/0x2d4
Aug 20 17:12:05 bazille kernel:  worker_thread+0x18b/0x24f
Aug 20 17:12:05 bazille kernel:  ? rescuer_thread+0x280/0x280
Aug 20 17:12:05 bazille kernel:  kthread+0xf4/0xfc
Aug 20 17:12:05 bazille kernel:  ? kthread_complete_and_exit+0x1b/0x1b
Aug 20 17:12:05 bazille kernel:  ret_from_fork+0x22/0x30
Aug 20 17:12:05 bazille kernel:  </TASK>

The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/addr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)