
[rdma-next] RDMA/addr: create addr_wq with WQ_MEM_RECLAIM flag

Message ID 5f5a1e4e90f3625cea57ffa79fc0e5bcb7efe09d.1548963371.git.swise@opengridcomputing.com (mailing list archive)
State Changes Requested
Series [rdma-next] RDMA/addr: create addr_wq with WQ_MEM_RECLAIM flag

Commit Message

Steve Wise Jan. 31, 2019, 7:30 p.m. UTC
While running NVMe/oF wire unplug tests, we hit this warning in
kernel/workqueue.c:check_flush_dependency():

WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
	  worker->current_pwq->wq->name, worker->current_func,
	  target_wq->name, target_func);

Which I think means we're flushing a workq that doesn't have
WQ_MEM_RECLAIM set, from workqueue context that does have it set.

Looking at rdma_addr_cancel() which is doing the flushing, it flushes
the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
is being called by the nvme host connection timeout/reconnect workqueue
thread that does have WQ_MEM_RECLAIM set.

So set WQ_MEM_RECLAIM on the addr_req workqueue.


Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/core/addr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
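
(For illustration only: below is a minimal, hypothetical module showing the
dependency that check_flush_dependency() warns about, i.e. a work item
running on a WQ_MEM_RECLAIM workqueue flushing a workqueue created without
the flag.  All names in it are made up and are not taken from the drivers
discussed in this thread.)

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;     /* created with WQ_MEM_RECLAIM */
static struct workqueue_struct *plain_wq;       /* created without it */

static void reclaim_work_fn(struct work_struct *work)
{
        /*
         * Flushing a !WQ_MEM_RECLAIM workqueue from a worker that belongs
         * to a WQ_MEM_RECLAIM workqueue trips the WARN_ONCE() quoted in
         * the commit message above.
         */
        flush_workqueue(plain_wq);
}

static DECLARE_WORK(reclaim_work, reclaim_work_fn);

static int __init flush_dep_example_init(void)
{
        reclaim_wq = alloc_workqueue("example_reclaim", WQ_MEM_RECLAIM, 0);
        plain_wq = alloc_ordered_workqueue("example_plain", 0);
        if (!reclaim_wq || !plain_wq)
                goto err;

        queue_work(reclaim_wq, &reclaim_work);
        return 0;

err:
        if (reclaim_wq)
                destroy_workqueue(reclaim_wq);
        if (plain_wq)
                destroy_workqueue(plain_wq);
        return -ENOMEM;
}
module_init(flush_dep_example_init);

static void __exit flush_dep_example_exit(void)
{
        destroy_workqueue(reclaim_wq);
        destroy_workqueue(plain_wq);
}
module_exit(flush_dep_example_exit);

MODULE_LICENSE("GPL");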

Comments

Parav Pandit Jan. 31, 2019, 8:56 p.m. UTC | #1
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Steve Wise
> Sent: Thursday, January 31, 2019 1:31 PM
> To: dledford@redhat.com; Jason Gunthorpe <jgg@mellanox.com>
> Cc: linux-rdma@vger.kernel.org
> Subject: [PATCH rdma-next] RDMA/addr: create addr_wq with
> WQ_MEM_RECLAIM flag
> 
> While running NVMe/oF wire unplug tests, we hit this warning in
> kernel/workqueue.c:check_flush_dependency():
> 
> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
> WQ_MEM_RECLAIM),
> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing
> !WQ_MEM_RECLAIM %s:%pf",
> 	  worker->current_pwq->wq->name, worker->current_func,
> 	  target_wq->name, target_func);
> 
> Which I think means we're flushing a workq that doesn't have
> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> 
> Looking at rdma_addr_cancel() which is doing the flushing, it flushes the
> addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel() is
> being called by the nvme host connection timeout/reconnect workqueue
> thread that does have WQ_MEM_RECLAIM set.
> 
> So set WQ_MEM_RECLAIM on the addr_req workqueue.
>

Please add the Fixes tag below [1].
I removed this flag based on Sagi's commit [2] and discussion [3],
which I now think are incorrect.

The memory reclaim path could have been triggered by a process or by the kernel, and I think it is OK to allocate memory in a work item triggered as part of the reclaim path or otherwise.
Memory allocation in a wq has nothing to do with memory reclaim. I wish Tejun or Christoph would confirm this is right, or correct our understanding.

Based on these past discussions, it appears to me that the WQ_MEM_RECLAIM limitation and its use are not clear.

Commit [2] also requires a fix.
[3] is the past discussion thread where the flag was dropped, incorrectly, as a starting point.

If [3] is right, there are several wqs which need a fix, and the proposed patch doesn't make sense either.

[1] Fixes: 39baf10310e6 ("IB/core: Fix use workqueue without WQ_MEM_RECLAIM")
[2] cb93e597779e ("cm: Don't allocate ib_cm workqueue with WQ_MEM_RECLAIM")
[3] https://www.spinics.net/lists/linux-rdma/msg53314.html

Reviewed-by: Parav Pandit <parav@mellanox.com>
This was reported internally too a while back. It has been on my TODO list for a while now; thanks for the fix.
Let's get clarity on WQ_MEM_RECLAIM once and for all.
I would be happy to submit a documentation update patch once it's clear.

> 
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
>  drivers/infiniband/core/addr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index 0dce94e3c495..1d88ee3ac8c7 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -868,7 +868,7 @@ static int netevent_callback(struct notifier_block
> *self, unsigned long event,
> 
>  int addr_init(void)
>  {
> -	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
> +	addr_wq = alloc_ordered_workqueue("ib_addr",
> WQ_MEM_RECLAIM);
>  	if (!addr_wq)
>  		return -ENOMEM;
> 
> --
> 1.8.3.1
Jason Gunthorpe Jan. 31, 2019, 9 p.m. UTC | #2
On Thu, Jan 31, 2019 at 01:56:15PM -0700, Parav Pandit wrote:
> 
> 
> > From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> > owner@vger.kernel.org> On Behalf Of Steve Wise
> > Sent: Thursday, January 31, 2019 1:31 PM
> > To: dledford@redhat.com; Jason Gunthorpe <jgg@mellanox.com>
> > Cc: linux-rdma@vger.kernel.org
> > Subject: [PATCH rdma-next] RDMA/addr: create addr_wq with
> > WQ_MEM_RECLAIM flag
> > 
> > While running NVMe/oF wire unplug tests, we hit this warning in
> > kernel/workqueue.c:check_flush_dependency():
> > 
> > WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> > 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
> > WQ_MEM_RECLAIM),
> > 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing
> > !WQ_MEM_RECLAIM %s:%pf",
> > 	  worker->current_pwq->wq->name, worker->current_func,
> > 	  target_wq->name, target_func);
> > 
> > Which I think means we're flushing a workq that doesn't have
> > WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> > 
> > Looking at rdma_addr_cancel() which is doing the flushing, it flushes the
> > addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel() is
> > being called by the nvme host connection timeout/reconnect workqueue
> > thread that does have WQ_MEM_RECLAIM set.
> > 
> > So set WQ_MEM_RECLAIM on the addr_req workqueue.
> >
> 
> Please add below [1] fixes by tag.
> I removed this flag based on commit [2] of Sagi and discussion[3].
> Which I think [2] and [3] are incorrect.
> 
> Memory reclaim path could have been triggered by processes or kernel
> and I think it is ok to allocate a memory in a work item trigger as
> part of reclaim path or otherwise.  Memory allocation in a wq has
> nothing to do with memory reclaim. I wish if Tejun or Christoph
> confirm this is right or correct our understanding.

Considering the only thing WQ_MEM_RECLAIM does is pre-allocate an
execution context to guarantee forward progress if the system is
unable to succeed in memory allocations...

It seems a bit goofy to say that it is OK to then block on memory
allocations inside that precious execution context.

I would also like to understand what the rules should be for this flag
before applying this patch.

> Based these past discussions, it appears to me that WQ_MEM_RECLAIM
> limitation and its use is not clear.

I've always thought it was designed to guarantee forward progress if
memory cannot be allocated.

Jason
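
(A hypothetical illustration of that interpretation, not code from any of
the drivers discussed here: a work item running on a WQ_MEM_RECLAIM
workqueue would avoid allocations that can sleep waiting on reclaim, for
example by using GFP_NOWAIT and falling back to a buffer preallocated at
init time.)

#include <linux/slab.h>
#include <linux/workqueue.h>

/*
 * Hypothetical reclaim-path work item: it has to make forward progress
 * even when the system cannot reclaim memory, so it never sleeps waiting
 * for reclaim.
 */
static void reclaim_safe_work_fn(struct work_struct *work)
{
        void *buf;

        /* GFP_NOWAIT never enters direct reclaim; it simply fails instead. */
        buf = kmalloc(512, GFP_NOWAIT);
        if (!buf) {
                /* fall back to an emergency buffer preallocated at init time */
                return;
        }

        /* ... do the real work with buf ... */

        kfree(buf);
}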
Steve Wise Jan. 31, 2019, 9:21 p.m. UTC | #3
+tj@kernel.org
+hch@infradead.org
+sagi@

Hey Tejun, Christoph, and Sagi,

We seem to be (still) unclear on when and why work queues should be
created with WQ_MEM_RECLAIM.  Care to comment on this patch, as well as
the earlier discussion in [3] that led to removing WQ_MEM_RECLAIM from all
the RDMA workqueues?

more below:

On 1/31/2019 2:56 PM, Parav Pandit wrote:
>
>> -----Original Message-----
>> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
>> owner@vger.kernel.org> On Behalf Of Steve Wise
>> Sent: Thursday, January 31, 2019 1:31 PM
>> To: dledford@redhat.com; Jason Gunthorpe <jgg@mellanox.com>
>> Cc: linux-rdma@vger.kernel.org
>> Subject: [PATCH rdma-next] RDMA/addr: create addr_wq with
>> WQ_MEM_RECLAIM flag
>>
>> While running NVMe/oF wire unplug tests, we hit this warning in
>> kernel/workqueue.c:check_flush_dependency():
>>
>> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
>> WQ_MEM_RECLAIM),
>> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing
>> !WQ_MEM_RECLAIM %s:%pf",
>> 	  worker->current_pwq->wq->name, worker->current_func,
>> 	  target_wq->name, target_func);
>>
>> Which I think means we're flushing a workq that doesn't have
>> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
>>
>> Looking at rdma_addr_cancel() which is doing the flushing, it flushes the
>> addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel() is
>> being called by the nvme host connection timeout/reconnect workqueue
>> thread that does have WQ_MEM_RECLAIM set.
>>
>> So set WQ_MEM_RECLAIM on the addr_req workqueue.
>>
> Please add below [1] fixes by tag.
> I removed this flag based on commit [2] of Sagi and discussion[3].
> Which I think [2] and [3] are incorrect.
>
> Memory reclaim path could have been triggered by processes or kernel and I think it is ok to allocate a memory in a work item trigger as part of reclaim path or otherwise.
> Memory allocation in a wq has nothing to do with memory reclaim. I wish if Tejun or Christoph confirm this is right or correct our understanding.
>
> Based these past discussions, it appears to me that WQ_MEM_RECLAIM limitation and its use is not clear.
>
> Commit [2] also requires a fix.
> This [3] is the past discussion thread when flag was dropped incorrectly as starting point.
>
> If [3] is right, there are several wq which need a fix and proposed patch doesn't make sense too.
>
> [1] Fixes: 39baf10310e6 ("IB/core: Fix use workqueue without WQ_MEM_RECLAIM")
> [2] cb93e597779e ("cm: Don't allocate ib_cm workqueue with WQ_MEM_RECLAIM")
> [3] https://www.spinics.net/lists/linux-rdma/msg53314.html
>
> Reviewed-by: Parav Pandit <parav@mellanox.com> 
> This was reported internally too a while back. It was in my TODO list, for a while now, thanks for the fix.
> Lets get the clarity of WQ_MEM_RECLAIM once for all.
> I would be happy to submit a documentation update patch once its clear.

Funny, I reported [3], which led to removing WQ_MEM_RECLAIM from all the
rdma/core wqs [1] and [2].  If that was the correct thing to do, then
perhaps the nvme host workqueue, nvme_wq from nvme/host/core.c, needs
WQ_MEM_RECLAIM removed as well:

...

int __init nvme_core_init(void)
{
        int result = -ENOMEM;

        nvme_wq = alloc_workqueue("nvme-wq",
                        WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
        if (!nvme_wq)
                goto out;
...
Leon Romanovsky Feb. 3, 2019, 6:14 p.m. UTC | #4
On Thu, Jan 31, 2019 at 09:00:42PM +0000, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 01:56:15PM -0700, Parav Pandit wrote:
> >
> >
> > > From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> > > owner@vger.kernel.org> On Behalf Of Steve Wise
> > > Sent: Thursday, January 31, 2019 1:31 PM
> > > To: dledford@redhat.com; Jason Gunthorpe <jgg@mellanox.com>
> > > Cc: linux-rdma@vger.kernel.org
> > > Subject: [PATCH rdma-next] RDMA/addr: create addr_wq with
> > > WQ_MEM_RECLAIM flag
> > >
> > > While running NVMe/oF wire unplug tests, we hit this warning in
> > > kernel/workqueue.c:check_flush_dependency():
> > >
> > > WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> > > 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
> > > WQ_MEM_RECLAIM),
> > > 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing
> > > !WQ_MEM_RECLAIM %s:%pf",
> > > 	  worker->current_pwq->wq->name, worker->current_func,
> > > 	  target_wq->name, target_func);
> > >
> > > Which I think means we're flushing a workq that doesn't have
> > > WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> > >
> > > Looking at rdma_addr_cancel() which is doing the flushing, it flushes the
> > > addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel() is
> > > being called by the nvme host connection timeout/reconnect workqueue
> > > thread that does have WQ_MEM_RECLAIM set.
> > >
> > > So set WQ_MEM_RECLAIM on the addr_req workqueue.
> > >
> >
> > Please add below [1] fixes by tag.
> > I removed this flag based on commit [2] of Sagi and discussion[3].
> > Which I think [2] and [3] are incorrect.
> >
> > Memory reclaim path could have been triggered by processes or kernel
> > and I think it is ok to allocate a memory in a work item trigger as
> > part of reclaim path or otherwise.  Memory allocation in a wq has
> > nothing to do with memory reclaim. I wish if Tejun or Christoph
> > confirm this is right or correct our understanding.
>
> Considering the only thing WQ_MEM_RECLAIM does is to pre-allocate a
> execution progress to guarentee forward progress if the system is
> unable to succed in memory allocations..
>
> It seems a bit goofy to say that it is OK to then block on memory
> allocations inside that precious execution context..
>
> I also would like to understand what the rules should be for this flag
> before applying this patch??
>
> > Based these past discussions, it appears to me that WQ_MEM_RECLAIM
> > limitation and its use is not clear.
>
> I've always thought it was designed to guarentee forward progress if
> memory cannot be allocated.

I understood it very similarly, except for the last part; my version is:
"it was designed to guarantee forward progress and be prioritized
if the memory shrinker is called."

Thanks

>
> Jason
Jason Gunthorpe Feb. 5, 2019, 10:39 p.m. UTC | #5
On Thu, Jan 31, 2019 at 11:30:42AM -0800, Steve Wise wrote:
> While running NVMe/oF wire unplug tests, we hit this warning in
> kernel/workqueue.c:check_flush_dependency():
> 
> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
> 	  worker->current_pwq->wq->name, worker->current_func,
> 	  target_wq->name, target_func);
> 
> Which I think means we're flushing a workq that doesn't have
> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> 
> Looking at rdma_addr_cancel() which is doing the flushing, it flushes
> the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
> is being called by the nvme host connection timeout/reconnect workqueue
> thread that does have WQ_MEM_RECLAIM set.

Since we haven't learned anything more, I think you should look to
remove either the WQ_MEM_RECLAIM or the rdma_addr_cancel() from the
nvme side.

Jason
Steve Wise Feb. 6, 2019, 9:52 p.m. UTC | #6
On 2/5/2019 4:39 PM, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 11:30:42AM -0800, Steve Wise wrote:
>> While running NVMe/oF wire unplug tests, we hit this warning in
>> kernel/workqueue.c:check_flush_dependency():
>>
>> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
>> 	  worker->current_pwq->wq->name, worker->current_func,
>> 	  target_wq->name, target_func);
>>
>> Which I think means we're flushing a workq that doesn't have
>> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
>>
>> Looking at rdma_addr_cancel() which is doing the flushing, it flushes
>> the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
>> is being called by the nvme host connection timeout/reconnect workqueue
>> thread that does have WQ_MEM_RECLAIM set.
> Since we haven't learned anything more, I think you should look to
> remove either the WQ_MEM_RECLAIM or the rdma_addr_cancel() from the
> nvme side.

The nvme code is just calling rdma_destroy_id(), which in turn calls
rdma_addr_cancel(),  so I'll have to remove WQ_MEM_RECLAIM from the
workqueue.

I'll post this patch then.
Steve Wise Feb. 6, 2019, 10:19 p.m. UTC | #7
On 2/6/2019 3:52 PM, Steve Wise wrote:
> On 2/5/2019 4:39 PM, Jason Gunthorpe wrote:
>> On Thu, Jan 31, 2019 at 11:30:42AM -0800, Steve Wise wrote:
>>> While running NVMe/oF wire unplug tests, we hit this warning in
>>> kernel/workqueue.c:check_flush_dependency():
>>>
>>> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>>> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>>> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
>>> 	  worker->current_pwq->wq->name, worker->current_func,
>>> 	  target_wq->name, target_func);
>>>
>>> Which I think means we're flushing a workq that doesn't have
>>> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
>>>
>>> Looking at rdma_addr_cancel() which is doing the flushing, it flushes
>>> the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
>>> is being called by the nvme host connection timeout/reconnect workqueue
>>> thread that does have WQ_MEM_RECLAIM set.
>> Since we haven't learned anything more, I think you should look to
>> remove either the WQ_MEM_RECLAIM or the rdma_addr_cancel() from the
>> nvme side.
> The nvme code is just calling rdma_destroy_id(), which in turn calls
> rdma_addr_cancel(),  so I'll have to remove WQ_MEM_RECLAIM from the
> workqueue.
>
> I'll post this patch then.


What a mess.  If I remove RECLAIM for nvme_wq, then I'll regress this
change, I think:

c669ccdc50c2 ("nvme: queue ns scanning and async request from nvme_wq")
Potnuri Bharat Teja March 13, 2019, 2:08 p.m. UTC | #8
On Thursday, 02/07/2019 at 03:49:47 +0530, Steve Wise wrote:
> 
> On 2/6/2019 3:52 PM, Steve Wise wrote:
> > On 2/5/2019 4:39 PM, Jason Gunthorpe wrote:
> >> On Thu, Jan 31, 2019 at 11:30:42AM -0800, Steve Wise wrote:
> >>> While running NVMe/oF wire unplug tests, we hit this warning in
> >>> kernel/workqueue.c:check_flush_dependency():
> >>>
> >>> WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> >>> 		      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> >>> 	  "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
> >>> 	  worker->current_pwq->wq->name, worker->current_func,
> >>> 	  target_wq->name, target_func);
> >>>
> >>> Which I think means we're flushing a workq that doesn't have
> >>> WQ_MEM_RECLAIM set, from workqueue context that does have it set.
> >>>
> >>> Looking at rdma_addr_cancel() which is doing the flushing, it flushes
> >>> the addr_wq which doesn't have MEM_RECLAIM set.  Yet rdma_addr_cancel()
> >>> is being called by the nvme host connection timeout/reconnect workqueue
> >>> thread that does have WQ_MEM_RECLAIM set.
> >> Since we haven't learned anything more, I think you should look to
> >> remove either the WQ_MEM_RECLAIM or the rdma_addr_cancel() from the
> >> nvme side.
> > The nvme code is just calling rdma_destroy_id(), which in turn calls
> > rdma_addr_cancel(),  so I'll have to remove WQ_MEM_RECLAIM from the
> > workqueue.
> >
> > I'll post this patch then.
> 
> 
> What a mess.  If I remove RECLAIM for nvme_wq, then I'll regress this
> change, I think:
> 
> c669ccdc50c2 ("nvme: queue ns scanning and async request from nvme_wq")
> 
Hi All,
I see this issue with some basic tests. Looks like the discussion ended here.
Please suggest where this ideally needs to be fixed so that I can try fixing it.

Thanks,
Bharat

Patch

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 0dce94e3c495..1d88ee3ac8c7 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -868,7 +868,7 @@  static int netevent_callback(struct notifier_block *self, unsigned long event,
 
 int addr_init(void)
 {
-	addr_wq = alloc_ordered_workqueue("ib_addr", 0);
+	addr_wq = alloc_ordered_workqueue("ib_addr", WQ_MEM_RECLAIM);
 	if (!addr_wq)
 		return -ENOMEM;