
[for-rc,1/5] IB/hfi1: Fix WQ_MEM_RECLAIM warning

Message ID 20190318165501.23550.24989.stgit@scvm10.sc.intel.com (mailing list archive)
State Accepted
Delegated to: Jason Gunthorpe
Series IB/hfi1: Fixes for 5.1 rc cycle

Commit Message

Dennis Dalessandro March 18, 2019, 4:55 p.m. UTC
From: Mike Marciniszyn <mike.marciniszyn@intel.com>

The work_item cancels that occur when a QP is destroyed
can elicit the following trace:

[  708.997199] workqueue: WQ_MEM_RECLAIM ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM hfi0_0:_hfi1_do_send [hfi1]
[  708.997209] WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486 check_flush_dependency+0xb1/0x100
[  709.227743] Call Trace:
[  709.230852]  __flush_work.isra.29+0x8c/0x1a0
[  709.235779]  ? __switch_to_asm+0x40/0x70
[  709.240335]  __cancel_work_timer+0x103/0x190
[  709.245253]  ? schedule+0x32/0x80
[  709.249216]  iowait_cancel_work+0x15/0x30 [hfi1]
[  709.254475]  rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
[  709.259554]  rvt_destroy_qp+0x65/0x1f0 [rdmavt]
[  709.264703]  ? _cond_resched+0x15/0x30
[  709.269081]  ib_destroy_qp+0xe9/0x230 [ib_core]
[  709.274223]  ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
[  709.279799]  process_one_work+0x171/0x370
[  709.284425]  worker_thread+0x49/0x3f0
[  709.288695]  kthread+0xf8/0x130
[  709.292450]  ? max_active_store+0x80/0x80
[  709.297050]  ? kthread_bind+0x10/0x10
[  709.301293]  ret_from_fork+0x35/0x40
[  709.305441] ---[ end trace f0e973737146499b ]---

Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.

Fixes: 0a226edd203f ("staging/rdma/hfi1: Use parallel workqueue for SDMA engines")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/init.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

Comments

Jason Gunthorpe March 19, 2019, 7:27 p.m. UTC | #1
On Mon, Mar 18, 2019 at 09:55:09AM -0700, Dennis Dalessandro wrote:
> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
> 
> The work_item cancels that occur when a QP is destroyed
> can elicit the following trace:
> 
> [  708.997199] workqueue: WQ_MEM_RECLAIM ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM hfi0_0:_hfi1_do_send [hfi1]
> [  708.997209] WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486 check_flush_dependency+0xb1/0x100
> [  709.227743] Call Trace:
> [  709.230852]  __flush_work.isra.29+0x8c/0x1a0
> [  709.235779]  ? __switch_to_asm+0x40/0x70
> [  709.240335]  __cancel_work_timer+0x103/0x190
> [  709.245253]  ? schedule+0x32/0x80
> [  709.249216]  iowait_cancel_work+0x15/0x30 [hfi1]
> [  709.254475]  rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
> [  709.259554]  rvt_destroy_qp+0x65/0x1f0 [rdmavt]
> [  709.264703]  ? _cond_resched+0x15/0x30
> [  709.269081]  ib_destroy_qp+0xe9/0x230 [ib_core]
> [  709.274223]  ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
> [  709.279799]  process_one_work+0x171/0x370
> [  709.284425]  worker_thread+0x49/0x3f0
> [  709.288695]  kthread+0xf8/0x130
> [  709.292450]  ? max_active_store+0x80/0x80
> [  709.297050]  ? kthread_bind+0x10/0x10
> [  709.301293]  ret_from_fork+0x35/0x40
> [  709.305441] ---[ end trace f0e973737146499b ]---
> 
> Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.

This seems like the same problem as the nvme patches.. Nobody seems to
know what the rules are for using WQ_MEM_RECLAIM.

AFAIK it has nothing to do with freeing memory though, that is a new one..

Are you sure cm_tx_reap shouldn't lose its reclaim flag?

Jason
Marciniszyn, Mike March 26, 2019, 8:55 p.m. UTC | #2
> > [  708.997199] workqueue: WQ_MEM_RECLAIM
> ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM
> hfi0_0:_hfi1_do_send [hfi1]
> > [  708.997209] WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486
> check_flush_dependency+0xb1/0x100
> > [  709.227743] Call Trace:
> > [  709.230852]  __flush_work.isra.29+0x8c/0x1a0
> > [  709.235779]  ? __switch_to_asm+0x40/0x70
> > [  709.240335]  __cancel_work_timer+0x103/0x190
> > [  709.245253]  ? schedule+0x32/0x80
> > [  709.249216]  iowait_cancel_work+0x15/0x30 [hfi1]
> > [  709.254475]  rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
> > [  709.259554]  rvt_destroy_qp+0x65/0x1f0 [rdmavt]
> > [  709.264703]  ? _cond_resched+0x15/0x30
> > [  709.269081]  ib_destroy_qp+0xe9/0x230 [ib_core]
> > [  709.274223]  ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
> > [  709.279799]  process_one_work+0x171/0x370
> > [  709.284425]  worker_thread+0x49/0x3f0
> > [  709.288695]  kthread+0xf8/0x130
> > [  709.292450]  ? max_active_store+0x80/0x80
> > [  709.297050]  ? kthread_bind+0x10/0x10
> > [  709.301293]  ret_from_fork+0x35/0x40
> > [  709.305441] ---[ end trace f0e973737146499b ]---
> >
> > Since QP destruction frees memory, hfi1_wq should have the
> WQ_MEM_RECLAIM.
> 
> This seems like the same problem as the nvme patches.. Nobody seems to
> know what the rules are for using WQ_MEM_RECLAIM.
> 
> AFAIK it has nothing to do with freeing memory though, that is a new one..
> 
> Are you sure cm_tx_reap shouln'd loose its reclaim flag?
> 

Changing the ipoib workqueue would certainly fix THIS issue.

Here are the uses of WQ_MEM_RECLAIM in drivers/infiniband:
C symbol: WQ_MEM_RECLAIM

  File          Function                        Line
0 cma.c         cma_init                         4687 cma_wq = alloc_ordered_workqueue("rdma_cm", WQ_MEM_RECLAIM);
1 device.c	ib_core_init                     1862 WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
2 device.c	ib_core_init                     1870 WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM |
3 mad.c         ib_mad_port_open                 3214 port_priv->wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM);
4 multicast.c   mcast_init                        886 mcast_wq = alloc_ordered_workqueue("ib_mcast", WQ_MEM_RECLAIM);
5 sa_query.c    ib_sa_init                       2453 ib_nl_wq = alloc_ordered_workqueue("ib_nl_sa_wq", WQ_MEM_RECLAIM);
6 ucma.c        ucma_open                        1733 WQ_MEM_RECLAIM);
7 iwch_cm.c     iwch_cm_init                     2247 workq = alloc_ordered_workqueue("iw_cxgb3", WQ_MEM_RECLAIM);
8 cm.c          c4iw_cm_init                     4435 workq = alloc_ordered_workqueue("iw_cxgb4", WQ_MEM_RECLAIM);
9 chip.c        init_cntrs                      12698 WQ_MEM_RECLAIM, dd->unit);
a init.c        create_workqueues                 808 WQ_MEM_RECLAIM,
b init.c        create_workqueues                 822 WQ_SYSFS | WQ_MEM_RECLAIM | WQ_UNBOUND,
c opfn.c        opfn_init                         309 WQ_MEM_RECLAIM,
d i40iw_cm.c    i40iw_setup_cm_core              3258 WQ_MEM_RECLAIM);
e i40iw_cm.c    i40iw_setup_cm_core              3261 WQ_MEM_RECLAIM);
f i40iw_main.c  i40iw_open                       1695 iwdev->virtchnl_wq = alloc_ordered_workqueue("iwvch", WQ_MEM_RECLAIM);
g i40iw_main.c  i40iw_open                       1708 iwdev->param_wq = alloc_ordered_workqueue("l2params", WQ_MEM_RECLAIM);
h alias_GUID.c  mlx4_ib_init_alias_guid_service   883 alloc_ordered_workqueue(alias_wq_name, WQ_MEM_RECLAIM);
i mad.c         mlx4_ib_alloc_demux_ctx          2186 ctx->wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM);
j mad.c         mlx4_ib_alloc_demux_ctx          2194 ctx->ud_wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM);
k main.c        mlx4_ib_init                     3351 wq = alloc_ordered_workqueue("mlx4_ib", WQ_MEM_RECLAIM);
l mcg.c         mlx4_ib_mcg_port_init            1048 ctx->mcg_wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM);
m mcg.c         mlx4_ib_mcg_init                 1247 clean_wq = alloc_ordered_workqueue("mlx4_ib_mcg", WQ_MEM_RECLAIM);
n mr.c          mlx5_mr_cache_init                647 cache->wq = alloc_ordered_workqueue("mkey_cache", WQ_MEM_RECLAIM);
o odp.c         mlx5_ib_create_pf_eq             1545 WQ_HIGHPRI | WQ_UNBOUND | WQ_MEM_RECLAIM,
p mthca_catas.c mthca_catas_init                  188 catas_wq = alloc_ordered_workqueue("mthca_catas", WQ_MEM_RECLAIM);
q qib_init.c    qib_create_workqueues             590 WQ_MEM_RECLAIM);
r pvrdma_main.c pvrdma_init                      1168 event_wq = alloc_ordered_workqueue("pvrdma_event_wq", WQ_MEM_RECLAIM);
s ipoib_main.c  ipoib_dev_init                   1756 priv->wq = alloc_ordered_workqueue("ipoib_wq", WQ_MEM_RECLAIM);

The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM workqueue.  This seems excessive and pretty gratuitous.

Tejun, what does "mem reclaim" really mean and when should it be used?

I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.

Mike
Tejun Heo March 27, 2019, 3:25 p.m. UTC | #3
Hello,

On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
> The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
> 
> Tejun, what does "mem reclaim" really mean and when should it be used?

That it may be depended upon during memory reclaim.

> I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.

If it can't block memory reclaim, it doesn't need the flag.  Just in
case, if a workqueue is used to issue block IOs, it is depended upon
for memory reclaim as writeback and swap-outs are critical parts of
memory reclaim.

Thanks.
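
To make the flag concrete: it is set when the workqueue is allocated, and its effect is to reserve a rescuer thread so already-queued work can still run when worker creation stalls under memory pressure. A minimal, hypothetical sketch (names invented, not the hfi1 or ipoib code):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_io_wq;

static int example_init(void)
{
	/*
	 * WQ_MEM_RECLAIM reserves a rescuer thread so that already-queued
	 * work can still execute when new worker threads cannot be created
	 * under memory pressure -- the property the reclaim path depends on.
	 */
	example_io_wq = alloc_workqueue("example_io_wq", WQ_MEM_RECLAIM, 0);
	if (!example_io_wq)
		return -ENOMEM;
	return 0;
}

static void example_exit(void)
{
	destroy_workqueue(example_io_wq);
}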
Leon Romanovsky March 27, 2019, 5:02 p.m. UTC | #4
On Wed, Mar 27, 2019 at 08:25:17AM -0700, Tejun Heo (tj@kernel.org) wrote:
> Hello,
>
> On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
> > The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
> >
> > Tejun, what does "mem reclaim" really mean and when should it be used?
>
> That it may be depended during memory reclaim.
>
> > I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.
>
> If it can't block memory reclaim, it doesn't need the flag.  Just in
> case, if a workqueue is used to issue block IOs, it is depended upon
> for memory reclaim as writeback and swap-outs are critical parts of
> memory reclaim.

It looks like WQ_MEM_RECLAIM is needed for IPoIB, because if NFS runs
over IPoIB, it will do those types of IOs.

Thanks

>
> Thanks.
>
> --
> tejun
Jason Gunthorpe March 27, 2019, 5:16 p.m. UTC | #5
On Wed, Mar 27, 2019 at 08:25:17AM -0700, Tejun Heo (tj@kernel.org) wrote:
> Hello,
> 
> On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
> > The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
> > 
> > Tejun, what does "mem reclaim" really mean and when should it be used?
> 
> That it may be depended during memory reclaim.
> 
> > I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.
> 
> If it can't block memory reclaim, it doesn't need the flag.  Just in
> case, if a workqueue is used to issue block IOs, it is depended upon
> for memory reclaim as writeback and swap-outs are critical parts of
> memory reclaim.

But the comments seem to indicate WQ_MEM_RECLAIM is to guarantee
forward progress in cases when GFP_KERNEL allocations will block, by
avoiding an allocation.

So, is it OK for a WQ entry on a WQ_MEM_RECLAIM queue to call
kmalloc(GFP_KERNEL) ?

Jason
Tejun Heo March 27, 2019, 7:07 p.m. UTC | #6
Hello,

On Wed, Mar 27, 2019 at 02:16:11PM -0300, Jason Gunthorpe wrote:
> So, is it OK for a WQ entry on a WQ_MEM_RECLAIM queue to call
> kmalloc(GFP_KERNEL) ?

In most cases, you can answer these questions by mapping them to a
block device driver.  A block device driver which performs GFP_KERNEL
allocation on IO path may cause a deadlock.  If the workqueue is
sitting in a path which can be depended upon for memory reclaim IOs,
the same applies to the workqueue.

Thanks.
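
Applying that mapping to a work item, a minimal hedged sketch (hypothetical names, not taken from any driver in this thread): a function that may run as part of reclaim avoids GFP_KERNEL and copes with allocation failure instead:

#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical work function queued on a WQ_MEM_RECLAIM workqueue. */
static void example_reclaim_work(struct work_struct *work)
{
	void *buf;

	/*
	 * GFP_KERNEL here could recurse into reclaim and wait on the very
	 * workqueue this item belongs to; GFP_NOIO (or GFP_ATOMIC) cannot
	 * recurse, but may fail, so the failure path still has to make
	 * forward progress without allocating.
	 */
	buf = kmalloc(4096, GFP_NOIO);
	if (!buf)
		return;		/* a real driver would defer or retry here */

	/* ... use buf to build and issue the IO ... */
	kfree(buf);
}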
Jason Gunthorpe March 27, 2019, 7:43 p.m. UTC | #7
On Wed, Mar 27, 2019 at 12:07:20PM -0700, Tejun Heo (tj@kernel.org) wrote:
> Hello,
> 
> On Wed, Mar 27, 2019 at 02:16:11PM -0300, Jason Gunthorpe wrote:
> > So, is it OK for a WQ entry on a WQ_MEM_RECLAIM queue to call
> > kmalloc(GFP_KERNEL) ?
> 
> In most cases, you can answer these questions by mapping them to a
> block device driver.  A block device driver which performs GFP_KERNEL
> allocation on IO path may cause a deadlock.  If the workqueue is
> sitting in a path which can be depended upon for memory reclaim IOs,
> the same applies to the workqueue.

Well, at least I'm not super familiar with the block layer and
wouldn't know about this.. 

But your explanation is helpful. I will be watching for WQ_MEM_RECLAIM
users that are allocating memory in their work functions.

Would it be possible to have a lockdep-like debugging under kmalloc
similar to check_flush_dependency() that complains if a WQ_MEM_RECLAIM
work is doing inappropriate allocations?

That would help discourage using it in WQ's it shouldn't be used on.

Thanks,
Jason
Tejun Heo March 27, 2019, 9:25 p.m. UTC | #8
Hello,

On Wed, Mar 27, 2019 at 04:43:47PM -0300, Jason Gunthorpe wrote:
> Well, at least I'm not super familiar with the block layer and
> wouldn't know about this.. 

Yeah, conceptually it's not complex.  Filesystems and IO are needed to
reclaim memory because we need to write back dirty pagecache and swap
pages before reusing them for other purposes.  Recursing on oneself
obviously won't be great, so filesystems need to use GFP_NOFS and
the block layer below that needs to use GFP_NOIO.

It's kinda unfortunate that network devices end up being pushed into
this dependency chain but we do put them in memory reclaim path w/ nfs
and block-over-network things, so it is what it is, I suppose.

> But your explanation is helpful. I will be watching for WQ_MEM_RECLAIM
> users that are allocating memory in their work functions.
> 
> Would it be possible to have a lockdep-like debugging under kmalloc
> similar to check_flush_dependency() that complains if a WQ_MEM_RECLAIM
> work is doing inappropriate allocations?
> 
> That would help discourage using it in WQ's it shouldn't be used on.

Yeah, I was wondering whether that'd trigger warning.  I don't think
it does now.  I fully agree it'd be great to have.

Thanks.
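
For illustration only, such a check might look roughly like the sketch below; nothing like it exists in mainline at the time of this thread, and the name, placement, and use of workqueue internals (current_wq_worker(), struct worker) are all assumptions:

/*
 * Hypothetical sketch only: assumes it lives in kernel/workqueue.c next to
 * check_flush_dependency() and is called from the allocator slow path.
 */
static void check_reclaim_alloc_dependency(gfp_t gfp_mask)
{
	struct worker *worker = current_wq_worker();

	if (!worker || !worker->current_pwq ||
	    !(worker->current_pwq->wq->flags & WQ_MEM_RECLAIM))
		return;

	/* A reclaim-dependent work item must not itself wait on reclaim IO. */
	WARN_ONCE((gfp_mask & __GFP_DIRECT_RECLAIM) && (gfp_mask & __GFP_IO),
		  "workqueue: WQ_MEM_RECLAIM %s:%ps allocating with a reclaim-capable gfp mask\n",
		  worker->current_pwq->wq->name, worker->current_func);
}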
Doug Ledford May 1, 2019, 2:44 p.m. UTC | #9
On Wed, 2019-03-27 at 14:25 -0700, Tejun Heo (tj@kernel.org) wrote:
> Hello,
> 
> On Wed, Mar 27, 2019 at 04:43:47PM -0300, Jason Gunthorpe wrote:
> > Well, at least I'm not super familiar with the block layer and
> > wouldn't know about this.. 
> 
> Yeah, conceptually it's not complex.  Filesytem and IOs are needed to
> reclaim memory because we need to write back dirty pagecache and swap
> pages before reuse them for other purposes.  Recursing on oneself
> obviously won't be great, so filesystems need to use GFP_NOFS and
> block layer below that needs to use GFP_NOIO.
> 
> It's kinda unfortunate that network devices end up being pushed into
> this dependency chain but we do put them in memory reclaim path w/ nfs
> and block-over-network things, so it is what it is, I suppose.

The discussion is helpful, but we still need to decide on the patch:

Correct me if I'm wrong Tejun, but the key issues are:

All WQ_MEM_RECLAIM work queues are eligible to be run when the machine
is under extreme memory pressure and attempting to reclaim memory.  That
means that the workqueue:

1) MUST not perform any GFP_ATOMIC allocations as this could deadlock
2) SHOULD not rely on any GFP_KERNEL allocations as these may fail
3) MUST complete without blocking
4) SHOULD ideally always make some sort of forward progress if at all
possible without needing memory allocations to do so

Mike, does hfi1_do_send() meet these requirements?  If not, we should
not be putting WQ_MEM_RECLAIM on it, and instead should find another
solution to the current trace issue.

> > But your explanation is helpful. I will be watching for WQ_MEM_RECLAIM
> > users that are allocating memory in their work functions.
> > 
> > Would it be possible to have a lockdep-like debugging under kmalloc
> > similar to check_flush_dependency() that complains if a WQ_MEM_RECLAIM
> > work is doing inappropriate allocations?
> > 
> > That would help discourage using it in WQ's it shouldn't be used on.
> 
> Yeah, I was wondering whether that'd trigger warning.  I don't think
> it does now.  I fully agree it'd be great to have.
> 
> Thanks.
>
Doug Ledford May 1, 2019, 3:21 p.m. UTC | #10
On Mar 27, 2019, at 1:02 PM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Wed, Mar 27, 2019 at 08:25:17AM -0700, Tejun Heo (tj@kernel.org) wrote:
>> Hello,
>> 
>> On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
>>> The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
>>> 
>>> Tejun, what does "mem reclaim" really mean and when should it be used?
>> 
>> That it may be depended during memory reclaim.
>> 
>>> I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.
>> 
>> If it can't block memory reclaim, it doesn't need the flag.  Just in
>> case, if a workqueue is used to issue block IOs, it is depended upon
>> for memory reclaim as writeback and swap-outs are critical parts of
>> memory reclaim.
> 
> It looks like WQ_MEM_RECLAIM is needed for IPoIB, because if NFS runs
> over IPoIB, it will do those types of IOs.

Because of what IPoIB does, you’re right that it’s needed.  However, it might be necessary to audit the wq usage in IPoIB to make sure it’s actually eligible for the flag and that it hasn’t been set when the code doesn’t meet the requirements of the flag.

> Thanks
> 
>> 
>> Thanks.
>> 
>> --
>> tejun

--
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
Jason Gunthorpe May 1, 2019, 4:29 p.m. UTC | #11
On Wed, May 01, 2019 at 11:21:08AM -0400, Doug Ledford wrote:
> On Mar 27, 2019, at 1:02 PM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Wed, Mar 27, 2019 at 08:25:17AM -0700, Tejun Heo (tj@kernel.org) wrote:
> >> Hello,
> >> 
> >> On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
> >>> The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
> >>> 
> >>> Tejun, what does "mem reclaim" really mean and when should it be used?
> >> 
> >> That it may be depended during memory reclaim.
> >> 
> >>> I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.
> >> 
> >> If it can't block memory reclaim, it doesn't need the flag.  Just in
> >> case, if a workqueue is used to issue block IOs, it is depended upon
> >> for memory reclaim as writeback and swap-outs are critical parts of
> >> memory reclaim.
> > 
> > It looks like WQ_MEM_RECLAIM is needed for IPoIB, because if NFS runs
> > over IPoIB, it will do those types of IOs.
> 
> Because of what IPoIB does, you’re right that it’s needed.  However,
> it might be necessary to audit the wq usage in IPoIB to make sure
> it’s actually eligible for the flag and that it hasn’t been set when
> the code doesn’t meet the requirements of the flag.

It isn't right - it is doing memory allocations from the work queue
without GFP_NOIO (or memalloc_noio_save).

And I'm not sure it can actually tolerate failure of a memory
allocation anyhow without blocking the dataplane.

In other words, the entire thing hasn't been designed with the idea
that it could be on the IO path..

I'm not sure how things work if NFS is on the critical reclaim path in
general - does it even work with a normal netdev? How does netdev
allocate a neighbor for instance if it becomes required?

Jason
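
The memalloc_noio_save() variant Jason mentions scopes the restriction over a region of code, so allocations made by callees that cannot be annotated are downgraded as well. A minimal sketch under that assumption (hypothetical function name):

#include <linux/errno.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Hypothetical function running somewhere on the reclaim/IO path. */
static int example_issue_request(void)
{
	unsigned int noio_flags;
	void *req;

	noio_flags = memalloc_noio_save();
	/* Inside this scope, GFP_KERNEL allocations behave like GFP_NOIO. */
	req = kzalloc(256, GFP_KERNEL);
	memalloc_noio_restore(noio_flags);

	if (!req)
		return -ENOMEM;
	/* ... build and submit the request ... */
	kfree(req);
	return 0;
}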
Doug Ledford May 1, 2019, 4:38 p.m. UTC | #12
On Wed, 2019-05-01 at 13:29 -0300, Jason Gunthorpe wrote:
> On Wed, May 01, 2019 at 11:21:08AM -0400, Doug Ledford wrote:
> > On Mar 27, 2019, at 1:02 PM, Leon Romanovsky <leon@kernel.org> wrote:
> > > On Wed, Mar 27, 2019 at 08:25:17AM -0700, Tejun Heo (tj@kernel.org) wrote:
> > > > Hello,
> > > > 
> > > > On Tue, Mar 26, 2019 at 08:55:09PM +0000, Marciniszyn, Mike wrote:
> > > > > The latter is the ipoib wq that conflicts with our non-WQ_MEM_RECLAIM.  This seems excessive and pretty gratuitous.
> > > > > 
> > > > > Tejun, what does "mem reclaim" really mean and when should it be used?
> > > > 
> > > > That it may be depended during memory reclaim.
> > > > 
> > > > > I was assuming that since we are freeing QP kernel memory held by user mode programs that could be oom killed, we need the flag.
> > > > 
> > > > If it can't block memory reclaim, it doesn't need the flag.  Just in
> > > > case, if a workqueue is used to issue block IOs, it is depended upon
> > > > for memory reclaim as writeback and swap-outs are critical parts of
> > > > memory reclaim.
> > > 
> > > It looks like WQ_MEM_RECLAIM is needed for IPoIB, because if NFS runs
> > > over IPoIB, it will do those types of IOs.
> > 
> > Because of what IPoIB does, you’re right that it’s needed.  However,
> > it might be necessary to audit the wq usage in IPoIB to make sure
> > it’s actually eligible for the flag and that it hasn’t been set when
> > the code doesn’t meet the requirements of the flag.
> 
> It isn't right - it is doing memory allocations from the work queue
> without the GFP_NOIO (or memalloc_noio_save)
> 
> And I'm not sure it can actually tolerate failure of a memory
> allocation anyhow without blocking the dataplane.
> 
> In other words, the entire thing hasn't been designed with the idea
> that it could be on the IO path..
> 
> I'm not sure how things work if NFS is on the critical reclaim path in
> general - does it even work with a normal netdev? How does netdev
> allocate a neighbor for instance if it becomes required?
> 
> Jason

What we probably need to do (but probably not this late in the rc cycle,
save it for next) is remove WQ_MEM_RECLAIM anywhere that an audit isn't
conclusive that it's safe.  As I understand it, the memory subsystem
will explore other areas of memory reclaim if these items don't have the
flag, which is better than the alternative of setting the flag on a work
queue that isn't safe and ending up in a deadlock or other abnormal
failure scenario.
Marciniszyn, Mike May 6, 2019, 12:31 p.m. UTC | #13
> Correct me if I'm wrong Tejun, but the key issues are:
> 
> All WQ_MEM_RECLAIM work queues are eligible to be run when the
> machine
> is under extreme memory pressure and attempting to reclaim memory.  That
> means that the workqueue:
> 
> 1) MUST not perform any GFP_ATOMIC allocations as this could deadlock

The send engine code WILL do a GFP_ATOMIC allocation, but the code handles
failure just as it does any other resource shortage.

> 2) SHOULD not rely on any GFP_KERNEL allocations as these may fail

There are no GFP_KERNEL allocations in the send engine code.

> 3) MUST complete without blocking

All resource blockages are handled by queuing the current QP being serviced by
the send engine for an interrupt to wake that QP up via the send engine.

> 4) SHOULD ideally always make some sort of forward progress if at all
> possible without needing memory allocations to do so
> 

As noted above.

> Mike, does hfi1_do_send() meet these requirements?  If not, we should
> not be putting WQ_MEM_RECLAIM on it, and instead should find another
> solution to the current trace issue.
> 

I'm not sure I understand the 1) above.

Tejun, can you elaborate?

Mike
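
A rough sketch of the pattern Mike describes (every name below is hypothetical; this is not the actual hfi1/rdmavt send engine): allocate with GFP_ATOMIC, and treat failure like any other resource shortage by parking the QP until an interrupt requeues it, so the work item never blocks:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical QP, standing in for the real rdmavt/hfi1 structures. */
struct example_qp {
	struct work_struct send_work;
	/* ... */
};

/* Arrange for the send/SDMA completion interrupt to requeue send_work. */
static void example_park_qp(struct example_qp *qp)
{
}

static void example_do_send(struct work_struct *work)
{
	struct example_qp *qp = container_of(work, struct example_qp, send_work);
	void *desc;

	/* GFP_ATOMIC never recurses into reclaim, but it can fail. */
	desc = kmalloc(128, GFP_ATOMIC);
	if (!desc) {
		/*
		 * Treat it like any other resource shortage: park the QP and
		 * return without blocking or allocating further.
		 */
		example_park_qp(qp);
		return;
	}

	/* ... build and post the descriptor ... */
	kfree(desc);
}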
Tejun Heo May 6, 2019, 3:09 p.m. UTC | #14
Hello, Doug.

On Wed, May 01, 2019 at 10:44:31AM -0400, Doug Ledford wrote:
> The discussion is helpful, but we still need to decide on the patch:
> 
> Correct me if I'm wrong Tejun, but the key issues are:
> 
> All WQ_MEM_RECLAIM work queues are eligible to be run when the machine
> is under extreme memory pressure and attempting to reclaim memory.  That
> means that the workqueue:
> 
> 1) MUST not perform any GFP_ATOMIC allocations as this could deadlock

GFP_ATOMIC or GFP_NOIO allocations are fine as they wouldn't block on
reclaim recursively; however, they can and will fail under severe
memory pressure.

> 2) SHOULD not rely on any GFP_KERNEL allocations as these may fail

Absolutely not, but not because they may fail but because it can end
up creating a waiting loop.

> 3) MUST complete without blocking

Blocking is fine as long as that blocking isn't memory bound.

> 4) SHOULD ideally always make some sort of forward progress if at all
> possible without needing memory allocations to do so

Yes, this is the only requirement.  All restrictions stem from this
one.

Thanks.
Doug Ledford May 6, 2019, 3:11 p.m. UTC | #15
On Mon, 2019-05-06 at 12:31 +0000, Marciniszyn, Mike wrote:
> > Correct me if I'm wrong Tejun, but the key issues are:
> > 
> > All WQ_MEM_RECLAIM work queues are eligible to be run when the
> > machine
> > is under extreme memory pressure and attempting to reclaim memory.  That
> > means that the workqueue:
> > 
> > 1) MUST not perform any GFP_ATOMIC allocations as this could deadlock
> 
> The send engine code WILL do a GFP_ATOMIC allocation but the code handles failure as
> will any other resource shortage.

You're right.  I was misremembering the flag's full meaning (I double
checked before writing this).  If you are holding a spinlock (or
anything else that means you can't sleep), you must use GFP_ATOMIC and
you must be prepared for failure.  So, before putting WQ_MEM_RECLAIM on
your workqueue, it should use ATOMIC and be prepared for failure.

> > 2) SHOULD not rely on any GFP_KERNEL allocations as these may fail
> 
> There are no GFP_KERNEL allocations in the send engine code.

Right.

> > 3) MUST complete without blocking
> 
> All resource blockages are handled by queuing the current QP being serviced by
> the send engine for and interrupt to wake that QP up via the send engine.

Ok.

> > 4) SHOULD ideally always make some sort of forward progress if at all
> > possible without needing memory allocations to do so
> > 
> 
> As noted above.

Right.

> > Mike, does hfi1_do_send() meet these requirements?  If not, we should
> > not be putting WQ_MEM_RECLAIM on it, and instead should find another
> > solution to the current trace issue.
> > 
> 
> I'm not sure I understand the 1) above.
> 
> Tejun, can you elaborate?

My mistake.  It's been a long while since I coded the stuff I did for
memory reclaim pressure and I had my flag usage wrong in my memory. 
From the description you just gave, the original patch to add
WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
though.
Jason Gunthorpe May 6, 2019, 4:07 p.m. UTC | #16
On Mon, Mar 18, 2019 at 09:55:09AM -0700, Dennis Dalessandro wrote:
> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
> 
> The work_item cancels that occur when a QP is destroyed
> can elicit the following trace:
> 
> [  708.997199] workqueue: WQ_MEM_RECLAIM ipoib_wq:ipoib_cm_tx_reap [ib_ipoib] is flushing !WQ_MEM_RECLAIM hfi0_0:_hfi1_do_send [hfi1]
> [  708.997209] WARNING: CPU: 7 PID: 1403 at kernel/workqueue.c:2486 check_flush_dependency+0xb1/0x100
> [  709.227743] Call Trace:
> [  709.230852]  __flush_work.isra.29+0x8c/0x1a0
> [  709.235779]  ? __switch_to_asm+0x40/0x70
> [  709.240335]  __cancel_work_timer+0x103/0x190
> [  709.245253]  ? schedule+0x32/0x80
> [  709.249216]  iowait_cancel_work+0x15/0x30 [hfi1]
> [  709.254475]  rvt_reset_qp+0x1f8/0x3e0 [rdmavt]
> [  709.259554]  rvt_destroy_qp+0x65/0x1f0 [rdmavt]
> [  709.264703]  ? _cond_resched+0x15/0x30
> [  709.269081]  ib_destroy_qp+0xe9/0x230 [ib_core]
> [  709.274223]  ipoib_cm_tx_reap+0x21c/0x560 [ib_ipoib]
> [  709.279799]  process_one_work+0x171/0x370
> [  709.284425]  worker_thread+0x49/0x3f0
> [  709.288695]  kthread+0xf8/0x130
> [  709.292450]  ? max_active_store+0x80/0x80
> [  709.297050]  ? kthread_bind+0x10/0x10
> [  709.301293]  ret_from_fork+0x35/0x40
> [  709.305441] ---[ end trace f0e973737146499b ]---
> 
> Since QP destruction frees memory, hfi1_wq should have the WQ_MEM_RECLAIM.
> 
> Fixes: 0a226edd203f ("staging/rdma/hfi1: Use parallel workqueue for SDMA engines")
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/init.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)

Since Mike confirms there are no GFP_KERNEL memory allocations in the
hfi1_wq, applied to for-next

Thanks,
Jason
Marciniszyn, Mike May 6, 2019, 5:52 p.m. UTC | #17
> 
> My mistake.  It's been a long while since I coded the stuff I did for
> memory reclaim pressure and I had my flag usage wrong in my memory.
> From the description you just gave, the original patch to add
> WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> though.
> 
> --
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

Don't lose sight of the fact that the addition of WQ_MEM_RECLAIM is to silence
a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine.

The ipoib wq needs to be audited to see if it is in the data path for VM I/O.

Mike
Jason Gunthorpe May 6, 2019, 6:16 p.m. UTC | #18
On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
> > 
> > My mistake.  It's been a long while since I coded the stuff I did for
> > memory reclaim pressure and I had my flag usage wrong in my memory.
> > From the description you just gave, the original patch to add
> > WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> > though.
> > 
> 
> Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
> a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
> 
> The ipoib wq needs to be audited to see if it is in the data path for VM I/O.

Well, it is doing unsafe memory allocations and other stuff, so it
can't be RECLAIM. We should just delete them from IPoIB like Doug says.

Before we get excited about IPoIB I'd love to hear how the netstack is
supposed to handle reclaim as well ie when using NFS/etc.

AFAIK the netstack code is not reclaim safe and can need to do things
like allocate neighbour structures/etc to progress the dataplane.

Jason
Leon Romanovsky May 6, 2019, 7:03 p.m. UTC | #19
On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
> >
> > My mistake.  It's been a long while since I coded the stuff I did for
> > memory reclaim pressure and I had my flag usage wrong in my memory.
> > From the description you just gave, the original patch to add
> > WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> > though.
> >
> > --
> > Doug Ledford <dledford@redhat.com>
> >     GPG KeyID: B826A3330E572FDD
> >     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
>
> Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
> a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
>
> The ipoib wq needs to be audited to see if it is in the data path for VM I/O.

It is, at least we had a test of NFS running over IPoIB and it needed WQ_MEM_RECLAIM back then.

Thanks

>
> Mike
>
>
Leon Romanovsky May 6, 2019, 7:03 p.m. UTC | #20
On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
> On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
> > >
> > > My mistake.  It's been a long while since I coded the stuff I did for
> > > memory reclaim pressure and I had my flag usage wrong in my memory.
> > > From the description you just gave, the original patch to add
> > > WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> > > though.
> > >
> >
> > Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
> > a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
> > rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
> >
> > The ipoib wq needs to be audited to see if it is in the data path for VM I/O.
>
> Well, it is doing unsafe memory allocations and other stuff, so it
> can't be RECLAIM. We should just delete them from IPoIB like Doug says.

Please don't.

>
> Before we get excited about IPoIB I'd love to hear how the netstack is
> supposed to handle reclaim as well ie when using NFS/etc.
>
> AFAIK the netstack code is not reclaim safe and can need to do things
> like allocate neighbour structures/etc to progress the dataplane.
>
> Jason
Jason Gunthorpe May 6, 2019, 8:08 p.m. UTC | #21
On Mon, May 06, 2019 at 10:03:56PM +0300, Leon Romanovsky wrote:
> On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
> > On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
> > > >
> > > > My mistake.  It's been a long while since I coded the stuff I did for
> > > > memory reclaim pressure and I had my flag usage wrong in my memory.
> > > > From the description you just gave, the original patch to add
> > > > WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> > > > though.
> > > >
> > >
> > > Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
> > > a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
> > > rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
> > >
> > > The ipoib wq needs to be audited to see if it is in the data path for VM I/O.
> >
> > Well, it is doing unsafe memory allocations and other stuff, so it
> > can't be RECLAIM. We should just delete them from IPoIB like Doug says.
> 
> Please don't.

Well then fix the broken allocations it does, and I don't really see
how to do that. We can't have it both ways.

I would rather have NFS be broken than normal systems with ipoib
hanging during reclaim.

Jason
Chuck Lever May 6, 2019, 8:18 p.m. UTC | #22
> On May 6, 2019, at 4:08 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, May 06, 2019 at 10:03:56PM +0300, Leon Romanovsky wrote:
>> On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
>>> On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
>>>>> 
>>>>> My mistake.  It's been a long while since I coded the stuff I did for
>>>>> memory reclaim pressure and I had my flag usage wrong in my memory.
>>>>> From the description you just gave, the original patch to add
>>>>> WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
>>>>> though.
>>>>> 
>>>> 
>>>> Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
>>>> a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
>>>> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
>>>> 
>>>> The ipoib wq needs to be audited to see if it is in the data path for VM I/O.
>>> 
>>> Well, it is doing unsafe memory allocations and other stuff, so it
>>> can't be RECLAIM. We should just delete them from IPoIB like Doug says.
>> 
>> Please don't.
> 
> Well then fix the broken allocations it does, and I don't really see
> how to do that. We can't have it both ways.
> 
> I would rather have NFS be broken then normal systems with ipoib
> hanging during reclaim.

TBH, NFS on IPoIB is a hack that I would be happy to see replaced
with NFS/RDMA. However, I don't think we can have it regressing
at this point -- and it sounds like there are other use cases that
do depend on WQ_MEM_RECLAIM remaining in the IPoIB path.

If you are truly curious, bring this up on linux-nfs to find out
what NFS needs and how it works on Ethernet-only network stacks.

--
Chuck Lever
Williams, Gerald S May 7, 2019, 5:28 p.m. UTC | #23
On May 06, 2019 at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> On May 6, 2019, at 4:08 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Mon, May 06, 2019 at 10:03:56PM +0300, Leon Romanovsky wrote:
>>> On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
>>>>> Don't lose sight of the fact that the additional of the
>>>>> WQ_MEM_RECLAIM is to silence a warning BECAUSE ipoib's workqueue is
>>>>> WQ_MEM_RECLAIM.  This happens while
>>>>> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by
>>>>> the QP's send engine
>>>>
>>>> Well, it is doing unsafe memory allocations and other stuff, so it
>>>> can't be RECLAIM. We should just delete them from IPoIB like Doug says.
>>>
>>> Please don't.
>>
>> Well then fix the broken allocations it does, and I don't really see
>> how to do that. We can't have it both ways.
>>
>> I would rather have NFS be broken then normal systems with ipoib
>> hanging during reclaim.
>
> TBH, NFS on IPoIB is a hack that I would be happy to see replaced with
> NFS/RDMA. However, I don't think we can have it regressing at this point -
> - and it sounds like there are other use cases that do depend on
> WQ_MEM_RECLAIM remaining in the IPoIB path.

I don't have an opinion on whether it is better to specify WQ_MEM_RECLAIM or not, although we are encountering a regression in the hfi1 driver when it is set. I have been debugging an issue that started happening during certain high-stress conditions after we added the WQ_MEM_RECLAIM attribute to our workqueue.

We need to understand this issue, especially if it turns out to be a workqueue bug/restriction, if we're going to propagate WQ_MEM_RECLAIM.

Somewhat-gory WQ-specific details:

The error is detected by the "sanity checks" in destroy_workqueue(), which we call when unloading our driver. A stack dump is reported via WARN_ON and the function (which has no return code) exits without destroying the queue, creating additional errors later. The specific test that is failing is on the pwq reference count:

	WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))

That reference count is incremented by get_pwq(), which is called when adding work to the queue. It is also called by "mayday" handling in the send_mayday() function from the pool_mayday_timeout() timer handler and from within rescuer_thread() in some situations. rescuer_thread() also calls put_pwq() later to decrement the reference count.

By instrumenting workqueue.c, I determined that we are not adding any work to the queue at the time of the error (and that the queue is empty when the error occurs), although before that, send_mayday() is occasionally called on our workqueue, followed by the second get_pwq() from rescuer_thread().

This is happening during normal operation, although while the system is heavily loaded. Later, when we try to unload the driver, destroy_workqueue() fails due to the refcnt check, even though the queue is empty.

Here are the relevant checks around the rescuer_thread() call to get_pwq() that increments refcnt:

		if (!list_empty(scheduled)) {
			process_scheduled_works(rescuer);

			/*
			 * The above execution of rescued work items could
			 * have created more to rescue through
			 * pwq_activate_first_delayed() or chained
			 * queueing.  Let's put @pwq back on mayday list so
			 * that such back-to-back work items, which may be
			 * being used to relieve memory pressure, don't
			 * incur MAYDAY_INTERVAL delay inbetween.
			 */
			if (need_to_create_worker(pool)) {
				spin_lock(&wq_mayday_lock);
				get_pwq(pwq);

And here is the relevant code from send_mayday():

	/* mayday mayday mayday */
	if (list_empty(&pwq->mayday_node)) {
		/*
		 * If @pwq is for an unbound wq, its base ref may be put at
		 * any time due to an attribute change.  Pin @pwq until the
		 * rescuer is done with it.
		 */
		get_pwq(pwq);

To me, the comment in the latter seems to imply that the reference count is being incremented to keep the pwq alive. However, looking through the code, that does not appear to be how it works. destroy_workqueue() tries to flush it, but if the queue is empty yet refcnt is still greater than 1, it fails. Maybe there's something else going on behind the scenes that I don't understand here.

I am trying to understand what is going wrong and whether there is something we can do about it. I'm hoping Tejun can provide some guidance into investigating this issue, otherwise we may need to move the discussion over to lkml or something.

--
Gerald Williams
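
In outline, the unload-time sequence being described is the hedged sketch below (hypothetical queue name; whether the extra reference really comes from the mayday path is exactly the open question):

#include <linux/workqueue.h>

/* Hypothetical workqueue allocated elsewhere with WQ_MEM_RECLAIM set. */
extern struct workqueue_struct *example_wq;

static void example_driver_unload(void)
{
	/* All work has completed and nothing requeues itself at this point. */
	flush_workqueue(example_wq);

	/*
	 * Yet destroy_workqueue() can still trip its
	 * WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) sanity check
	 * and bail out without freeing the queue, if a get_pwq() reference
	 * taken during mayday handling was never dropped -- which is the
	 * behaviour reported above.
	 */
	destroy_workqueue(example_wq);
}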
Jason Gunthorpe May 7, 2019, 6:55 p.m. UTC | #24
On Mon, May 06, 2019 at 04:18:44PM -0400, Chuck Lever wrote:
> 
> 
> > On May 6, 2019, at 4:08 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Mon, May 06, 2019 at 10:03:56PM +0300, Leon Romanovsky wrote:
> >> On Mon, May 06, 2019 at 03:16:10PM -0300, Jason Gunthorpe wrote:
> >>> On Mon, May 06, 2019 at 05:52:48PM +0000, Marciniszyn, Mike wrote:
> >>>>> 
> >>>>> My mistake.  It's been a long while since I coded the stuff I did for
> >>>>> memory reclaim pressure and I had my flag usage wrong in my memory.
> >>>>> From the description you just gave, the original patch to add
> >>>>> WQ_MEM_RECLAIM is ok.  I probably still need to audit the ipoib usage
> >>>>> though.
> >>>>> 
> >>>> 
> >>>> Don't lose sight of the fact that the additional of the WQ_MEM_RECLAIM is to silence
> >>>> a warning BECAUSE ipoib's workqueue is WQ_MEM_RECLAIM.  This happens while
> >>>> rdmavt/hfi1 is doing a cancel_work_sync() for the work item used by the QP's send engine
> >>>> 
> >>>> The ipoib wq needs to be audited to see if it is in the data path for VM I/O.
> >>> 
> >>> Well, it is doing unsafe memory allocations and other stuff, so it
> >>> can't be RECLAIM. We should just delete them from IPoIB like Doug says.
> >> 
> >> Please don't.
> > 
> > Well then fix the broken allocations it does, and I don't really see
> > how to do that. We can't have it both ways.
> > 
> > I would rather have NFS be broken then normal systems with ipoib
> > hanging during reclaim.
> 
> TBH, NFS on IPoIB is a hack that I would be happy to see replaced
> with NFS/RDMA.

Does NFS/RDMA even work? Tejun said you can't do a GFP_KERNEL
allocation on the writeback progress path.

So if the system runs out of memory, and NFS/RDMA is in a state where
the connection has glitched and needs to be restarted, it needs to go
through the whole CM stuff to progress writeback. That stuff uses
GFP_KERNEL, so it is not OK.

This is where nvme got to when it started to look at this problem.

> If you are truly curious, bring this up on linux-nfs to find out
> what NFS needs and how it works on Ethernet-only network stacks.

I have a feeling the only robust answer here is that NFS can never be
on the critical path of reclaim.

Jason

Patch

diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 7841a0a..3cd6f07 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -800,7 +800,8 @@  static int create_workqueues(struct hfi1_devdata *dd)
 			ppd->hfi1_wq =
 				alloc_workqueue(
 				    "hfi%d_%d",
-				    WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
+				    WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE |
+				    WQ_MEM_RECLAIM,
 				    HFI1_MAX_ACTIVE_WORKQUEUE_ENTRIES,
 				    dd->unit, pidx);
 			if (!ppd->hfi1_wq)