
[v1] RDMA/core: Fix check_flush_dependency splat on addr_wq

Message ID: 166118222093.3250511.11454048195824271658.stgit@morisot.1015granger.net (mailing list archive)
State: Changes Requested
Delegated to: Jason Gunthorpe
Series: [v1] RDMA/core: Fix check_flush_dependency splat on addr_wq

Commit Message

Chuck Lever Aug. 22, 2022, 3:30 p.m. UTC
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

Aug 20 17:12:05 bazille kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
Aug 20 17:12:05 bazille kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
Aug 20 17:12:05 bazille kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
Aug 20 17:12:05 bazille kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
Aug 20 17:12:05 bazille kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
Aug 20 17:12:05 bazille kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
Aug 20 17:12:05 bazille kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
Aug 20 17:12:05 bazille kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
Aug 20 17:12:05 bazille kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
Aug 20 17:12:05 bazille kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
Aug 20 17:12:05 bazille kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
Aug 20 17:12:05 bazille kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
Aug 20 17:12:05 bazille kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
Aug 20 17:12:05 bazille kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
Aug 20 17:12:05 bazille kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
Aug 20 17:12:05 bazille kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 17:12:05 bazille kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
Aug 20 17:12:05 bazille kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 20 17:12:05 bazille kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 20 17:12:05 bazille kernel: Call Trace:
Aug 20 17:12:05 bazille kernel:  <TASK>
Aug 20 17:12:05 bazille kernel:  __flush_work.isra.0+0xaf/0x188
Aug 20 17:12:05 bazille kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
Aug 20 17:12:05 bazille kernel:  ? lock_timer_base+0x38/0x5f
Aug 20 17:12:05 bazille kernel:  __cancel_work_timer+0xea/0x13d
Aug 20 17:12:05 bazille kernel:  ? preempt_latency_start+0x2b/0x46
Aug 20 17:12:05 bazille kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
Aug 20 17:12:05 bazille kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
Aug 20 17:12:05 bazille kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
Aug 20 17:12:05 bazille kernel:  ? _raw_spin_unlock+0x14/0x29
Aug 20 17:12:05 bazille kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
Aug 20 17:12:05 bazille kernel:  ? finish_task_switch.isra.0+0x171/0x249
Aug 20 17:12:05 bazille kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
Aug 20 17:12:05 bazille kernel:  process_one_work+0x1d8/0x2d4
Aug 20 17:12:05 bazille kernel:  worker_thread+0x18b/0x24f
Aug 20 17:12:05 bazille kernel:  ? rescuer_thread+0x280/0x280
Aug 20 17:12:05 bazille kernel:  kthread+0xf4/0xfc
Aug 20 17:12:05 bazille kernel:  ? kthread_complete_and_exit+0x1b/0x1b
Aug 20 17:12:05 bazille kernel:  ret_from_fork+0x22/0x30
Aug 20 17:12:05 bazille kernel:  </TASK>

The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/addr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
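
For context, the warning is emitted by check_flush_dependency() in
kernel/workqueue.c. A paraphrased sketch of the rule it enforces
(simplified for illustration, not the verbatim kernel source):

/* Paraphrased sketch of kernel/workqueue.c:check_flush_dependency() */
static void check_flush_dependency(struct workqueue_struct *target_wq,
				   struct work_struct *target_work)
{
	struct worker *worker = current_wq_worker();

	/* Waiting on a WQ_MEM_RECLAIM queue is always safe. */
	if (target_wq->flags & WQ_MEM_RECLAIM)
		return;

	/*
	 * A worker running on a WQ_MEM_RECLAIM queue (here: xprtiod)
	 * must not wait on a !WQ_MEM_RECLAIM queue (here: ib_addr),
	 * because the target queue has no rescuer thread and may never
	 * make progress under memory pressure.
	 */
	WARN_ONCE(worker && (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
		  "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s",
		  worker->current_pwq->wq->name, worker->current_func,
		  target_wq->name);
}

Cancelling the pending ib_addr work from xprt_rdma_connect_worker,
which runs on xprtiod, flushes that work and therefore trips this
check.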

Comments

Leon Romanovsky Aug. 23, 2022, 8:09 a.m. UTC | #1
On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
> While setting up a new lab, I accidentally misconfigured the
> Ethernet port for a system that tried an NFS mount using RoCE.
> This made the NFS server unreachable. The following WARNING
> popped on the NFS client while waiting for the mount attempt to
> time out:
> 
> Aug 20 17:12:05 bazille kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
> Aug 20 17:12:05 bazille kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
> Aug 20 17:12:05 bazille kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
> Aug 20 17:12:05 bazille kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
> Aug 20 17:12:05 bazille kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
> Aug 20 17:12:05 bazille kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> Aug 20 17:12:05 bazille kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
> Aug 20 17:12:05 bazille kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
> Aug 20 17:12:05 bazille kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
> Aug 20 17:12:05 bazille kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
> Aug 20 17:12:05 bazille kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
> Aug 20 17:12:05 bazille kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
> Aug 20 17:12:05 bazille kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
> Aug 20 17:12:05 bazille kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
> Aug 20 17:12:05 bazille kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
> Aug 20 17:12:05 bazille kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 20 17:12:05 bazille kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
> Aug 20 17:12:05 bazille kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 20 17:12:05 bazille kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug 20 17:12:05 bazille kernel: Call Trace:
> Aug 20 17:12:05 bazille kernel:  <TASK>
> Aug 20 17:12:05 bazille kernel:  __flush_work.isra.0+0xaf/0x188
> Aug 20 17:12:05 bazille kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
> Aug 20 17:12:05 bazille kernel:  ? lock_timer_base+0x38/0x5f
> Aug 20 17:12:05 bazille kernel:  __cancel_work_timer+0xea/0x13d
> Aug 20 17:12:05 bazille kernel:  ? preempt_latency_start+0x2b/0x46
> Aug 20 17:12:05 bazille kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
> Aug 20 17:12:05 bazille kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
> Aug 20 17:12:05 bazille kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
> Aug 20 17:12:05 bazille kernel:  ? _raw_spin_unlock+0x14/0x29
> Aug 20 17:12:05 bazille kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
> Aug 20 17:12:05 bazille kernel:  ? finish_task_switch.isra.0+0x171/0x249
> Aug 20 17:12:05 bazille kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
> Aug 20 17:12:05 bazille kernel:  process_one_work+0x1d8/0x2d4
> Aug 20 17:12:05 bazille kernel:  worker_thread+0x18b/0x24f
> Aug 20 17:12:05 bazille kernel:  ? rescuer_thread+0x280/0x280
> Aug 20 17:12:05 bazille kernel:  kthread+0xf4/0xfc
> Aug 20 17:12:05 bazille kernel:  ? kthread_complete_and_exit+0x1b/0x1b
> Aug 20 17:12:05 bazille kernel:  ret_from_fork+0x22/0x30
> Aug 20 17:12:05 bazille kernel:  </TASK>
> 
> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> prevent a priority inversion.

But why do you have WQ_MEM_RECLAIM in xprtiod?

  1270         wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);

IMHO, it would be nicer to remove WQ_MEM_RECLAIM instead of adding it.

Thanks

> 
> Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  drivers/infiniband/core/addr.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index f253295795f0..5c36d01ebf0b 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -872,7 +872,7 @@ static struct notifier_block nb = {
>  
>  int addr_init(void)
>  {
> -	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
> +	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
>  	if (!addr_wq)
>  		return -ENOMEM;
>  
> 
>
Chuck Lever Aug. 23, 2022, 1:58 p.m. UTC | #2
> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
>> While setting up a new lab, I accidentally misconfigured the
>> Ethernet port for a system that tried an NFS mount using RoCE.
>> This made the NFS server unreachable. The following WARNING
>> popped on the NFS client while waiting for the mount attempt to
>> time out:
>> 
>> Aug 20 17:12:05 bazille kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
>> Aug 20 17:12:05 bazille kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
>> Aug 20 17:12:05 bazille kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
>> Aug 20 17:12:05 bazille kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
>> Aug 20 17:12:05 bazille kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
>> Aug 20 17:12:05 bazille kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
>> Aug 20 17:12:05 bazille kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
>> Aug 20 17:12:05 bazille kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
>> Aug 20 17:12:05 bazille kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
>> Aug 20 17:12:05 bazille kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
>> Aug 20 17:12:05 bazille kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
>> Aug 20 17:12:05 bazille kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
>> Aug 20 17:12:05 bazille kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
>> Aug 20 17:12:05 bazille kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
>> Aug 20 17:12:05 bazille kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
>> Aug 20 17:12:05 bazille kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Aug 20 17:12:05 bazille kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
>> Aug 20 17:12:05 bazille kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Aug 20 17:12:05 bazille kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Aug 20 17:12:05 bazille kernel: Call Trace:
>> Aug 20 17:12:05 bazille kernel:  <TASK>
>> Aug 20 17:12:05 bazille kernel:  __flush_work.isra.0+0xaf/0x188
>> Aug 20 17:12:05 bazille kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
>> Aug 20 17:12:05 bazille kernel:  ? lock_timer_base+0x38/0x5f
>> Aug 20 17:12:05 bazille kernel:  __cancel_work_timer+0xea/0x13d
>> Aug 20 17:12:05 bazille kernel:  ? preempt_latency_start+0x2b/0x46
>> Aug 20 17:12:05 bazille kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
>> Aug 20 17:12:05 bazille kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
>> Aug 20 17:12:05 bazille kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
>> Aug 20 17:12:05 bazille kernel:  ? _raw_spin_unlock+0x14/0x29
>> Aug 20 17:12:05 bazille kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
>> Aug 20 17:12:05 bazille kernel:  ? finish_task_switch.isra.0+0x171/0x249
>> Aug 20 17:12:05 bazille kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
>> Aug 20 17:12:05 bazille kernel:  process_one_work+0x1d8/0x2d4
>> Aug 20 17:12:05 bazille kernel:  worker_thread+0x18b/0x24f
>> Aug 20 17:12:05 bazille kernel:  ? rescuer_thread+0x280/0x280
>> Aug 20 17:12:05 bazille kernel:  kthread+0xf4/0xfc
>> Aug 20 17:12:05 bazille kernel:  ? kthread_complete_and_exit+0x1b/0x1b
>> Aug 20 17:12:05 bazille kernel:  ret_from_fork+0x22/0x30
>> Aug 20 17:12:05 bazille kernel:  </TASK>
>> 
>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>> prevent a priority inversion.
> 
> But why do you have WQ_MEM_RECLAIM in xprtiod?

Because RPC is under a filesystem (NFS). Therefore it has to handle
writeback demanded by direct reclaim. All of the storage ULPs have
this constraint, in fact.


>  1270         wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
> 
> IMHO, It will be nicer if we remove WQ_MEM_RECLAIM instead of adding it.
> 
> Thanks
> 
>> 
>> Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> drivers/infiniband/core/addr.c |    2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
>> index f253295795f0..5c36d01ebf0b 100644
>> --- a/drivers/infiniband/core/addr.c
>> +++ b/drivers/infiniband/core/addr.c
>> @@ -872,7 +872,7 @@ static struct notifier_block nb = {
>> 
>> int addr_init(void)
>> {
>> -	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
>> +	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
>> 	if (!addr_wq)
>> 		return -ENOMEM;

--
Chuck Lever
Leon Romanovsky Aug. 24, 2022, 9:20 a.m. UTC | #3
On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:

<...>

> >> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
> >> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> >> prevent a priority inversion.
> > 
> > But why do you have WQ_MEM_RECLAIM in xprtiod?
> 
> Because RPC is under a filesystem (NFS). Therefore it has to handle
> writeback demanded by direct reclaim. All of the storage ULPs have
> this constraint, in fact.

I don't know; this ib_addr workqueue is used when a connection is created.
Jason, what do you think?

Thanks
Chuck Lever Aug. 24, 2022, 2:09 p.m. UTC | #4
> On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> 
>>> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
> 
> <...>
> 
>>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>> prevent a priority inversion.
>>> 
>>> But why do you have WQ_MEM_RECLAIM in xprtiod?
>> 
>> Because RPC is under a filesystem (NFS). Therefore it has to handle
>> writeback demanded by direct reclaim. All of the storage ULPs have
>> this constraint, in fact.
> 
> I don't know, this ib_addr workqueue is used when connection is created.

Reconnection is exactly when we need to ensure that creating
a new connection won't trigger more memory allocation, because
that will immediately deadlock.

Again, all network storage ULPs have this constraint.


> Jason, what do you think?
> 
> Thanks

--
Chuck Lever
Jason Gunthorpe Aug. 26, 2022, 1:29 p.m. UTC | #5
On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
> >> 
> >> 
> >>> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >>> 
> >>> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
> > 
> > <...>
> > 
> >>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
> >>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
> >>>> prevent a priority inversion.
> >>> 
> >>> But why do you have WQ_MEM_RECLAIM in xprtiod?
> >> 
> >> Because RPC is under a filesystem (NFS). Therefore it has to handle
> >> writeback demanded by direct reclaim. All of the storage ULPs have
> >> this constraint, in fact.
> > 
> > I don't know, this ib_addr workqueue is used when connection is created.
> 
> Reconnection is exactly when we need to ensure that creating
> a new connection won't trigger more memory allocation, because
> that will immediately deadlock.
> 
> Again, all network storage ULPs have this constraint.

IMHO this whole concept is broken.

The RDMA stack does not operate globally under RECLAIM, nor should it.

If you attempt to do a reconnection/etc from within a RECLAIM context
it will deadlock on one of the many allocations that are made to
support opening the connection.

The general idea of reclaim is that the entire task context working
under the reclaim is marked with an override of the gfp flags to make
all allocations under that call chain reclaim safe.
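
That override is the memalloc_noio_save()/memalloc_noio_restore() pair
from <linux/sched/mm.h>. A minimal illustration of the pattern (the
transport type and setup helper are hypothetical, not code from this
thread):

#include <linux/sched/mm.h>

/*
 * Hypothetical example: while the override is in effect, every
 * allocation made by *this task* - including nested GFP_KERNEL
 * callers - is implicitly degraded to GFP_NOIO.  Allocations made on
 * the task's behalf by other contexts (workqueue workers, CM packet
 * handlers, firmware threads) are not covered, which is the gap
 * described below.
 */
static int example_reconnect(struct example_transport *xprt)
{
	unsigned int noio_flags;
	int rc;

	noio_flags = memalloc_noio_save();
	rc = example_setup_connection(xprt);
	memalloc_noio_restore(noio_flags);
	return rc;
}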

But rdmacm does allocations outside this, eg in the WQs processing the
CM packets. So this doesn't work and we will deadlock.

Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM here
and there.

For instance, this patch is just incorrect: you can't use
WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything
useful to happen:

addr_resolve()
 addr_resolve_neigh()
  fetch_ha()
   ib_nl_fetch_ha()
     ib_nl_ip_send_msg()
       	skb = nlmsg_new(len, GFP_KERNEL);

So regardless of the MEM_RECLAIM flag, the *actual work* being canceled
can be stuck on a non-reclaim-safe allocation.

Jason
Chuck Lever Aug. 26, 2022, 2:02 p.m. UTC | #6
> On Aug 26, 2022, at 9:29 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Wed, Aug 24, 2022 at 02:09:52PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Aug 24, 2022, at 5:20 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> 
>>> On Tue, Aug 23, 2022 at 01:58:44PM +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Aug 23, 2022, at 4:09 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>>> 
>>>>> On Mon, Aug 22, 2022 at 11:30:20AM -0400, Chuck Lever wrote:
>>> 
>>> <...>
>>> 
>>>>>> The xprtiod work queue is WQ_MEM_RECLAIM, so any work queue that
>>>>>> one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
>>>>>> prevent a priority inversion.
>>>>> 
>>>>> But why do you have WQ_MEM_RECLAIM in xprtiod?
>>>> 
>>>> Because RPC is under a filesystem (NFS). Therefore it has to handle
>>>> writeback demanded by direct reclaim. All of the storage ULPs have
>>>> this constraint, in fact.
>>> 
>>> I don't know, this ib_addr workqueue is used when connection is created.
>> 
>> Reconnection is exactly when we need to ensure that creating
>> a new connection won't trigger more memory allocation, because
>> that will immediately deadlock.
>> 
>> Again, all network storage ULPs have this constraint.
> 
> IMHO this whole concept is broken.
> 
> The RDMA stack does not operate globally under RECLAIM, nor should it.
> 
> If you attempt to do a reconnection/etc from within a RECLAIM context
> it will deadlock on one of the many allocations that are made to
> support opening the connection.
> 
> The general idea of reclaim is that the entire task context working
> under the reclaim is marked with an override of the gfp flags to make
> all allocations under that call chain reclaim safe.
> 
> But rdmacm does allocations outside this, eg in the WQs processing the
> CM packets. So this doesn't work and we will deadlock.
> 
> Fixing it is a big deal and needs more that poking WQ_MEM_RECLAIM here
> and there..
> 
> For instance, this patch is just incorrect, you can't use
> WQ_MEM_RECLAIM on a WQ that is doing allocations and expect anything
> useful to happen:
> 
> addr_resolve()
> addr_resolve_neigh()
>  fetch_ha()
>   ib_nl_fetch_ha()
>     ib_nl_ip_send_msg()
>       	skb = nlmsg_new(len, GFP_KERNEL);
> 
> So regardless of the MEM_RECLAIM the *actual work* being canceled can
> be stuck on an non-reclaim-safe allocation.

I see recent commits that do exactly what I've done for the reason I've done it.

4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")
533d2e8b4d5e ("nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown")

I accept that this might be a long chain to pull, but we need a plan
to resolve this. Storage ULPs go to a lot of trouble to pre-allocate
resources to avoid deadlocking in reclaim -- handling reclaim properly
is supposed to be designed into this stack.


--
Chuck Lever
Jason Gunthorpe Aug. 26, 2022, 2:08 p.m. UTC | #7
On Fri, Aug 26, 2022 at 02:02:55PM +0000, Chuck Lever III wrote:
 
> I see recent commits that do exactly what I've done for the reason I've done it.
> 
> 4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")

No, this one says:

    The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
    entangled with memory reclaim, so this flag is appropriate.
    
So it is OK, it is not the same thing as adding WQ_MEM_RECLAIM to a WQ
that allocates memory.

> I accept that this might be a long chain to pull, but we need a plan
> to resolve this. 

It is not just a long chain; it is something that was never designed
to work or even thought about. People put storage ULPs on top of this
and just ignored the problem.

If someone wants to tackle this then we need a comprehensive patch
series identifying what functions are safe to call under memory
reclaim contexts and then fully auditing them that they are actually
safe.

Right now I don't even know the basic information what functions the
storage community need to be reclaim safe.

Jason
Chuck Lever Aug. 26, 2022, 7:57 p.m. UTC | #8
> On Aug 26, 2022, at 10:08 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Fri, Aug 26, 2022 at 02:02:55PM +0000, Chuck Lever III wrote:
> 
>> I see recent commits that do exactly what I've done for the reason I've done it.
>> 
>> 4c4b1996b5db ("IB/hfi1: Fix WQ_MEM_RECLAIM warning")
> 
> No, this one says:
> 
>    The hfi1_wq does not allocate memory with GFP_KERNEL or otherwise become
>    entangled with memory reclaim, so this flag is appropriate.
> 
> So it is OK, it is not the same thing as adding WQ_MEM_RECLAIM to a WQ
> that allocates memory.
> 
>> I accept that this might be a long chain to pull, but we need a plan
>> to resolve this. 
> 
> It is not just a long chain, it is something that was never designed
> to even work or thought about. People put storage ULPs on top of this
> and just ignored the problem.
> 
> If someone wants to tackle this then we need a comprehensive patch
> series identifying what functions are safe to call under memory
> reclaim contexts and then fully auditing them that they are actually
> safe.
> 
> Right now I don't even know the basic information what functions the
> storage community need to be reclaim safe.

The connect APIs would be a place to start. In the meantime, though...

Two or three years ago I spent some effort to ensure that closing
an RDMA connection leaves a client-side RPC/RDMA transport with no
RDMA resources associated with it. It releases the CQs, QP, and all
the MRs. That makes initial connect and reconnect both behave exactly
the same, and guarantees that a reconnect does not get stuck with
an old CQ that is no longer working or a QP that is in TIMEWAIT.

However that does mean that substantial resource allocation is
done on every reconnect.

One way to resolve the check_flush_dependency() splat would be
to have rpcrdma.ko allocate its own workqueue for handling
connections and MR allocation, and leave WQ_MEM_RECLAIM disabled
for it. Basically, replace the use of the xprtiod workqueue for
RPC/RDMA transports.
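
A minimal sketch of that idea, assuming a queue private to rpcrdma.ko
(the queue name and helper are hypothetical, not an actual patch):

/*
 * Hypothetical: a dedicated connect/MR-allocation workqueue for
 * RPC/RDMA that deliberately omits WQ_MEM_RECLAIM, so cancelling
 * !WQ_MEM_RECLAIM work (such as ib_addr's) from it no longer trips
 * check_flush_dependency().
 */
static struct workqueue_struct *rpcrdma_connect_wq;

int rpcrdma_connect_wq_create(void)
{
	rpcrdma_connect_wq = alloc_workqueue("rpcrdma_connect",
					     WQ_UNBOUND, 0);
	if (!rpcrdma_connect_wq)
		return -ENOMEM;
	return 0;
}

The trade-off is that connect work would then have no rescuer thread
and no forward-progress guarantee under memory pressure.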


--
Chuck Lever
Jason Gunthorpe Aug. 29, 2022, 4:45 p.m. UTC | #9
On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
> The connect APIs would be a place to start. In the meantime, though...
> 
> Two or three years ago I spent some effort to ensure that closing
> an RDMA connection leaves a client-side RPC/RDMA transport with no
> RDMA resources associated with it. It releases the CQs, QP, and all
> the MRs. That makes initial connect and reconnect both behave exactly
> the same, and guarantees that a reconnect does not get stuck with
> an old CQ that is no longer working or a QP that is in TIMEWAIT.
> 
> However that does mean that substantial resource allocation is
> done on every reconnect.

And if the resource allocations fail then what happens? The storage
ULP retries forever and is effectively deadlocked?

How much allocation can you safely do under GFP_NOIO?

Jason
Chuck Lever Aug. 29, 2022, 5:14 p.m. UTC | #10
> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
>> The connect APIs would be a place to start. In the meantime, though...
>> 
>> Two or three years ago I spent some effort to ensure that closing
>> an RDMA connection leaves a client-side RPC/RDMA transport with no
>> RDMA resources associated with it. It releases the CQs, QP, and all
>> the MRs. That makes initial connect and reconnect both behave exactly
>> the same, and guarantees that a reconnect does not get stuck with
>> an old CQ that is no longer working or a QP that is in TIMEWAIT.
>> 
>> However that does mean that substantial resource allocation is
>> done on every reconnect.
> 
> And if the resource allocations fail then what happens? The storage
> ULP retries forever and is effectively deadlocked?

The reconnection attempt fails, and any resources allocated during
that attempt are released. The ULP waits a bit then tries again
until it works or is interrupted.

A deadlock might occur if one of those allocations triggers
additional reclaim activity.


> How much allocation can you safely do under GFP_NOIO?

My naive take is that doing those allocations under NOIO would
help avoid recursion during memory exhaustion.


--
Chuck Lever
Jason Gunthorpe Aug. 29, 2022, 5:22 p.m. UTC | #11
On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
> >> The connect APIs would be a place to start. In the meantime, though...
> >> 
> >> Two or three years ago I spent some effort to ensure that closing
> >> an RDMA connection leaves a client-side RPC/RDMA transport with no
> >> RDMA resources associated with it. It releases the CQs, QP, and all
> >> the MRs. That makes initial connect and reconnect both behave exactly
> >> the same, and guarantees that a reconnect does not get stuck with
> >> an old CQ that is no longer working or a QP that is in TIMEWAIT.
> >> 
> >> However that does mean that substantial resource allocation is
> >> done on every reconnect.
> > 
> > And if the resource allocations fail then what happens? The storage
> > ULP retries forever and is effectively deadlocked?
> 
> The reconnection attempt fails, and any resources allocated during
> that attempt are released. The ULP waits a bit then tries again
> until it works or is interrupted.
> 
> A deadlock might occur if one of those allocations triggers
> additional reclaim activity.

No, you are deadlocked now.

If a direct reclaim calls back into NFS we are already at the point
where normal allocations fail, and we are accessing the emergency
reserve.

When reclaim does this it marks the entire task with
memalloc_noio_save() which forces GFP_NOIO on every allocation that
task makes, meaning every allocation comes from the emergency reserve
already.

This is why it (barely) works *at all* with RDMA.

If during the writeback the reserve is exhausted and memory allocation
fails, then the IO stack is in trouble - either it fails the writeback
(then what?) or it deadlocks the kernel because it *cannot* make
forward progress without those memory allocations.

The fact we have cases where the storage thread under the
memalloc_noio_save() becomes contingent on the forward progress of
other contexts that don't have memalloc_noio_save() is a fairly
serious problem I can't see a solution to.

Even a simple case like mlx5 may cause the NIC to trigger a host
memory allocation, which is done in another thread and done as a
normal GFP_KERNEL. This memory allocation must progress before a
CQ/QP/MR/etc can be created. So now we are deadlocked again.

Jason
Chuck Lever Aug. 29, 2022, 6:15 p.m. UTC | #12
> On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>>> 
>>> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
>>>> The connect APIs would be a place to start. In the meantime, though...
>>>> 
>>>> Two or three years ago I spent some effort to ensure that closing
>>>> an RDMA connection leaves a client-side RPC/RDMA transport with no
>>>> RDMA resources associated with it. It releases the CQs, QP, and all
>>>> the MRs. That makes initial connect and reconnect both behave exactly
>>>> the same, and guarantees that a reconnect does not get stuck with
>>>> an old CQ that is no longer working or a QP that is in TIMEWAIT.
>>>> 
>>>> However that does mean that substantial resource allocation is
>>>> done on every reconnect.
>>> 
>>> And if the resource allocations fail then what happens? The storage
>>> ULP retries forever and is effectively deadlocked?
>> 
>> The reconnection attempt fails, and any resources allocated during
>> that attempt are released. The ULP waits a bit then tries again
>> until it works or is interrupted.
>> 
>> A deadlock might occur if one of those allocations triggers
>> additional reclaim activity.
> 
> No, you are deadlocked now.

GFP_KERNEL can and will give up eventually, in which case
the connection attempt fails and any previously allocated
memory is released. Something else can then make progress.

Single page allocation nearly always succeeds. It's the
larger-order allocations that can block for long periods,
and that's not necessarily because memory is low -- it can
happen when one NUMA node's memory is heavily fragmented.


> If a direct reclaim calls back into NFS we are already at the point
> where normal allocations fail, and we are accessing the emergency
> reserve.
> 
> When reclaim does this it marks the entire task with
> memalloc_noio_save() which forces GFP_NOIO on every allocation that
> task makes, meaning every allocation comes from the emergency reserve
> already.
> 
> This is why it (barely) works *at all* with RDMA.
> 
> If during the writeback the reserve is exhaused and memory allocation
> fails, then the IO stack is in trouble - either it fails the writeback
> (then what?) or it deadlocks the kernel because it *cannot* make
> forward progress without those memory allocations.
> 
> The fact we have cases where the storage thread under the
> memalloc_noio_save() becomes contingent on the forward progress of
> other contexts that don't have memalloc_noio_save() is a fairly
> serious problem I can't see a solution to.

This issue seems to be addressed in the socket stack, so I
don't believe there's _no_ solution for RDMA. Usually the
trick is to communicate the memalloc_noio settings somehow
to other allocating threads.

We could use cgroups, for example, to collect these threads
and resources under one GFP umbrella.  /eyeroll /ducks

If nothing else we can talk with the MM folks about planning
improvements. We've just gone through this with NFS on the
socket stack.


> Even a simple case like mlx5 may cause the NIC to trigger a host
> memory allocation, which is done in another thread and done as a
> normal GFP_KERNEL. This memory allocation must progress before a
> CQ/QP/MR/etc can be created. So now we are deadlocked again.

That sounds to me like a bug in mlx5. The driver is supposed
to respect the caller's GFP settings. Again, if the request
is small, it's likely to succeed anyway, but larger requests
are not reliable and need to fail quickly so the system can
move on to other fishing spots.

I would like to at least get rid of the check_flush_dependency
splat, which will fire a lot more often than we will get stuck
in a reclaim allocation corner. I'm testing a patch that
converts rpcrdma not to use MEM_RECLAIM work queues and notes
how extensive the problem actually is.


--
Chuck Lever
Jason Gunthorpe Aug. 29, 2022, 6:26 p.m. UTC | #13
On Mon, Aug 29, 2022 at 06:15:28PM +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > On Mon, Aug 29, 2022 at 05:14:56PM +0000, Chuck Lever III wrote:
> >> 
> >> 
> >>> On Aug 29, 2022, at 12:45 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>> 
> >>> On Fri, Aug 26, 2022 at 07:57:04PM +0000, Chuck Lever III wrote:
> >>>> The connect APIs would be a place to start. In the meantime, though...
> >>>> 
> >>>> Two or three years ago I spent some effort to ensure that closing
> >>>> an RDMA connection leaves a client-side RPC/RDMA transport with no
> >>>> RDMA resources associated with it. It releases the CQs, QP, and all
> >>>> the MRs. That makes initial connect and reconnect both behave exactly
> >>>> the same, and guarantees that a reconnect does not get stuck with
> >>>> an old CQ that is no longer working or a QP that is in TIMEWAIT.
> >>>> 
> >>>> However that does mean that substantial resource allocation is
> >>>> done on every reconnect.
> >>> 
> >>> And if the resource allocations fail then what happens? The storage
> >>> ULP retries forever and is effectively deadlocked?
> >> 
> >> The reconnection attempt fails, and any resources allocated during
> >> that attempt are released. The ULP waits a bit then tries again
> >> until it works or is interrupted.
> >> 
> >> A deadlock might occur if one of those allocations triggers
> >> additional reclaim activity.
> > 
> > No, you are deadlocked now.
> 
> GFP_KERNEL can and will give up eventually, in which case
> the connection attempt fails and any previously allocated
> memory is released. Something else can then make progress.

Something else, maybe for a time, but likely the storage stack is
forever stuck.

> Single page allocation nearly always succeeds. It's the
> larger-order allocations that can block for long periods,
> and that's not necessarily because memory is low -- it can
> happen when one NUMA node's memory is heavily fragmented.

We've done a lot of work in the rdma stack and drivers to avoid
multi-page allocations. But we might need a lot of them, and I'm
skeptical about this claim they always succeed.

> This issue seems to be addressed in the socket stack, so I
> don't believe there's _no_ solution for RDMA. Usually the
> trick is to communicate the memalloc_noio settings somehow
> to other allocating threads.

And how do you do that when the other threads may have already started
their work before a reclaim writeback is triggered? We may already be
blocked inside GFP_KERNEL - heck we may already be inside reclaim from
within one of our own threads!

> If nothing else we can talk with the MM folks about planning
> improvements. We've just gone through this with NFS on the
> socket stack.

I'm not aware of any work in this area...
 
> > Even a simple case like mlx5 may cause the NIC to trigger a host
> > memory allocation, which is done in another thread and done as a
> > normal GFP_KERNEL. This memory allocation must progress before a
> > CQ/QP/MR/etc can be created. So now we are deadlocked again.
> 
> That sounds to me like a bug in mlx5. The driver is supposed
> to respect the caller's GFP settings. Again, if the request
> is small, it's likely to succeed anyway, but larger requests
> are not reliable and need to fail quickly so the system can
> move onto other fishing spots.

It is a design artifact: the FW is the one requesting the memory and
it has no idea about kernel GFP flags. As above a FW thread could have
already started requesting memory for some other purpose and we may
already be inside the mlx5 FW page request thread under a GFP_KERNEL
allocation doing reclaim. How can this ever be fixed?

> I would like to at least get rid of the check_flush_dependency
> splat, which will fire a lot more often than we will get stuck
> in a reclaim allocation corner. I'm testing a patch that
> converts rpcrdma not to use MEM_RECLAIM work queues and notes
> how extensive the problem actually is.

Ok

Jason
Chuck Lever Aug. 29, 2022, 7:31 p.m. UTC | #14
> On Aug 29, 2022, at 2:26 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Mon, Aug 29, 2022 at 06:15:28PM +0000, Chuck Lever III wrote:
>> 
>>> On Aug 29, 2022, at 1:22 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>>> Even a simple case like mlx5 may cause the NIC to trigger a host
>>> memory allocation, which is done in another thread and done as a
>>> normal GFP_KERNEL. This memory allocation must progress before a
>>> CQ/QP/MR/etc can be created. So now we are deadlocked again.
>> 
>> That sounds to me like a bug in mlx5. The driver is supposed
>> to respect the caller's GFP settings. Again, if the request
>> is small, it's likely to succeed anyway, but larger requests
>> are not reliable and need to fail quickly so the system can
>> move onto other fishing spots.
> 
> It is a design artifact, the FW is the one requesting the memory and
> it has no idea about kernel GFP flags. As above a FW thread could have
> already started requesting memory for some other purpose and we may
> already be inside the mlx5 FW page request thread under a GFP_KERNEL
> allocation doing reclaim. How can this ever be fixed?

I'm willing to admit I'm no expert here. But... IIUC the
deadlock problem is triggered by /waiting/ for memory to
become available to satisfy an allocation request.

So using GFP_NOWAIT, GFP_NOIO/memalloc_noio, and
GFP_NOFS/memalloc_nofs when drivers allocate memory should
be enough to prevent a deadlock and keep the allocations from
diving into reserved memory. I believe only GFP_ATOMIC goes
for reserved memory pools. These others are normal allocations
that simply do not wait if a direct reclaim should be required.
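
To make that distinction concrete, here is a hedged illustration of
the allocation modes being contrasted (generic kmalloc() calls only,
not code from any particular driver):

#include <linux/slab.h>

/* Illustration only; 'len' is an arbitrary size. */
static void example_gfp_modes(size_t len)
{
	void *buf;

	buf = kmalloc(len, GFP_NOWAIT); /* never sleeps; fails fast instead of reclaiming */
	kfree(buf);

	buf = kmalloc(len, GFP_NOIO);   /* may reclaim, but won't recurse into the I/O layer */
	kfree(buf);

	buf = kmalloc(len, GFP_NOFS);   /* may reclaim, but won't recurse into FS writeback */
	kfree(buf);

	buf = kmalloc(len, GFP_ATOMIC); /* never sleeps; may dip into the reserved pools */
	kfree(buf);
}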

The second-order issue is that the "failed to allocate"
recovery paths are not likely to be well tested, and these
other flags make that kind of failure more likely. Enable
memory allocation failure injection and begin fixing the shit
that comes up.

If you've got "can't fail" scenarios, we'll have to look at
those closely.


--
Chuck Lever

Patch

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index f253295795f0..5c36d01ebf0b 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -872,7 +872,7 @@  static struct notifier_block nb = {
 
 int addr_init(void)
 {
-	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
+	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
 	if (!addr_wq)
 		return -ENOMEM;