[v2,1/6] workqueue: Inherit NOIO and NOFS alloc flags

Message ID	20240515125342.1069999-2-haakon.bugge@oracle.com (mailing list archive)
State	Superseded
Headers	show Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3578E82860; Wed, 15 May 2024 12:54:21 +0000 (UTC) From: =?utf-8?q?H=C3=A5kon_Bugge?= <haakon.bugge@oracle.com> To: linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, rds-devel@oss.oracle.com Cc: Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, Tariq Toukan <tariqt@nvidia.com>, "David S . Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>, Allison Henderson <allison.henderson@oracle.com>, Manjunath Patil <manjunath.b.patil@oracle.com>, Mark Zhang <markzhang@nvidia.com>, =?utf-8?q?H=C3=A5kon_Bugge?= <haakon.bugge@oracle.com>, Chuck Lever <chuck.lever@oracle.com>, Shiraz Saleem <shiraz.saleem@intel.com>, Yang Li <yang.lee@linux.alibaba.com> Subject: [PATCH v2 1/6] workqueue: Inherit NOIO and NOFS alloc flags Date: Wed, 15 May 2024 14:53:37 +0200 Message-Id: <20240515125342.1069999-2-haakon.bugge@oracle.com> In-Reply-To: <20240515125342.1069999-1-haakon.bugge@oracle.com> References: <20240515125342.1069999-1-haakon.bugge@oracle.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	rds: rdma: Add ability to force GFP_NOIO \| expand [v2,0/6] rds: rdma: Add ability to force GFP_NOIO [v2,1/6] workqueue: Inherit NOIO and NOFS alloc flags [v2,2/6] rds: Brute force GFP_NOIO [v2,3/6] RDMA/cma: Brute force GFP_NOIO [v2,4/6] RDMA/cm: Brute force GFP_NOIO [v2,5/6] RDMA/mlx5: Brute force GFP_NOIO [v2,6/6] net/mlx5: Brute force GFP_NOIO

Message ID

20240515125342.1069999-2-haakon.bugge@oracle.com (mailing list archive)

State

Superseded

Headers

From: =?utf-8?q?H=C3=A5kon_Bugge?= <haakon.bugge@oracle.com>
To: linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
        netdev@vger.kernel.org, rds-devel@oss.oracle.com
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
 Saeed Mahameed <saeedm@nvidia.com>, Tariq Toukan <tariqt@nvidia.com>,
 "David S . Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>,
 Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
 Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
 Allison Henderson <allison.henderson@oracle.com>,
 Manjunath Patil <manjunath.b.patil@oracle.com>,
 Mark Zhang <markzhang@nvidia.com>,
 =?utf-8?q?H=C3=A5kon_Bugge?= <haakon.bugge@oracle.com>,
 Chuck Lever <chuck.lever@oracle.com>,
 Shiraz Saleem <shiraz.saleem@intel.com>, Yang Li <yang.lee@linux.alibaba.com>
Subject: [PATCH v2 1/6] workqueue: Inherit NOIO and NOFS alloc flags
Date: Wed, 15 May 2024 14:53:37 +0200
Message-Id: <20240515125342.1069999-2-haakon.bugge@oracle.com>
In-Reply-To: <20240515125342.1069999-1-haakon.bugge@oracle.com>
References: <20240515125342.1069999-1-haakon.bugge@oracle.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Series

rds: rdma: Add ability to force GFP_NOIO | expand

Commit Message

Haakon Bugge May 15, 2024, 12:53 p.m. UTC

For drivers/modules running inside a
memalloc_{noio,nofs}_{save,restore} region, if a work-queue is
created, we make sure work executed on the work-queue inherits the
same flag(s).

This in order to conditionally enable drivers to work aligned with
block I/O devices. This commit makes sure that any work queued later
on work-queues created during module initialization, when current's
flags has PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags.

We do this in order to enable drivers to be used as a network block
I/O device. This in order to support XFS or other file-systems on top
of a raw block device which uses said drivers as the network transport
layer.

Under intense memory pressure, we get memory reclaims. Assume the
file-system reclaims memory, goes to the raw block device, which calls
into said drivers. Now, if regular GFP_KERNEL allocations in the
drivers require reclaims to be fulfilled, we end up in a circular
dependency.

We break this circular dependency by:

1. Force all allocations in the drivers to use GFP_NOIO, by means of a
   parenthetic use of memalloc_noio_{save,restore} on all relevant
   entry points.

2. Make sure work-queues inherits current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue inherits the same flag(s). That is what this commit
   contributes with.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>

---

v1 -> v2:
   * Added missing hunk in alloc_workqueue()
---
 include/linux/workqueue.h |  2 ++
 kernel/workqueue.c        | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

Comments

Tejun Heo May 15, 2024, 4:54 p.m. UTC | #1

> @@ -5583,6 +5600,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>  
>  	/* init wq */
>  	wq->flags = flags;
> +	if (current->flags & PF_MEMALLOC_NOIO)
> +		wq->flags |= __WQ_NOIO;
> +	if (current->flags & PF_MEMALLOC_NOFS)
> +		wq->flags |= __WQ_NOFS;

So, yeah, please don't do this. What if a NOIO callers wants to scheduler a
work item so that it can user GFP_KERNEL allocations. I don't mind a
convenience feature to workqueue for this but this doesn't seem like the
right way. Also, memalloc_noio_save() and memalloc_nofs_save() are
convenience wrappers around memalloc_flags_save(), so it'd probably be
better to deal with gfp flags directly rather than singling out these two
flags.

Thanks.

Haakon Bugge May 16, 2024, 3:27 p.m. UTC | #2

> On 15 May 2024, at 18:54, Tejun Heo <tj@kernel.org> wrote:
> 
>> @@ -5583,6 +5600,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
>> 
>> /* init wq */
>> wq->flags = flags;
>> + if (current->flags & PF_MEMALLOC_NOIO)
>> + wq->flags |= __WQ_NOIO;
>> + if (current->flags & PF_MEMALLOC_NOFS)
>> + wq->flags |= __WQ_NOFS;
> 
> So, yeah, please don't do this. What if a NOIO callers wants to scheduler a
> work item so that it can user GFP_KERNEL allocations.

If one work function want to use GPF_KERNEL and another using GFP_NOIO, queued on the same workqueue, one could create two workqueues. Create one that is surrounded by memalloc_noio_{save,restore}, another surrounded by memalloc_flags_save() + current->flags &= ~PF_MEMALLOC_NOIO and memalloc_flags_restore().

If you imply a work functions that performs combinations of GFP_KERNEL and GFP_NOIO, that sounds a little bit peculiar to me, but if needed, it must be open-coded. But wouldn't that be the same case as a WQ created with WQ_MEM_RECLAIM?

> I don't mind a
> convenience feature to workqueue for this but this doesn't seem like the
> right way. Also, memalloc_noio_save() and memalloc_nofs_save() are
> convenience wrappers around memalloc_flags_save(), so it'd probably be
> better to deal with gfp flags directly rather than singling out these two
> flags.

Actually, based on https://lore.kernel.org/linux-fsdevel/ZZcgXI46AinlcBDP@casper.infradead.org, I am inclided to skip GFP_NOFS. Also because the use-case for this series does not need GFP_NOFS.

When you say "deal with gfp flags directly", do you imply during WQ creation or queuing work on one? I am OK with adding the other per-process memory allocation flags, but that doesn's solve your initial issue ("if a NOIO callers wants to scheduler a work item so that it can user GFP_KERNEL").

Thxs, Håkon

Tejun Heo May 16, 2024, 4:29 p.m. UTC | #3

Hello,

On Thu, May 16, 2024 at 03:27:15PM +0000, Haakon Bugge wrote:
> > So, yeah, please don't do this. What if a NOIO callers wants to scheduler a
> > work item so that it can user GFP_KERNEL allocations.
> 
> If one work function want to use GPF_KERNEL and another using GFP_NOIO,
> queued on the same workqueue, one could create two workqueues. Create one
> that is surrounded by memalloc_noio_{save,restore}, another surrounded by
> memalloc_flags_save() + current->flags &= ~PF_MEMALLOC_NOIO and
> memalloc_flags_restore().

This is too subtle and the default behavior doesn't seem great either - in
most cases, the code path which sets up workqueues would be in GFP_KERNEL
context as init paths usually are, so it's not like this would make things
work automatically in most cases. In addition, now, the memory allocations
for workqueues themselves have to be subject to the same GFP restrictions
even when alloc_workqueue() is called from GFP_KERNEL context. It just
doesn't seem well thought out.

> When you say "deal with gfp flags directly", do you imply during WQ
> creation or queuing work on one? I am OK with adding the other per-process
> memory allocation flags, but that doesn's solve your initial issue ("if a
> NOIO callers wants to scheduler a work item so that it can user
> GFP_KERNEL").

It being a purely convenience feature, I don't think there's hard
requirement on where this should go although I don't know where you'd carry
this information if you tied it to each work item. And, please don't single
out specific GFP flags. Please make the feature generic so that users who
may need different GFP masking can also use it too. The underlying GFP
feature is already like that. There's no reason to restrict it from
workqueue side.

Thanks.

Haakon Bugge May 21, 2024, 2:02 p.m. UTC | #4

Hi,


> On 16 May 2024, at 18:29, Tejun Heo <tj@kernel.org> wrote:
> 
> 
>> When you say "deal with gfp flags directly", do you imply during WQ
>> creation or queuing work on one? I am OK with adding the other per-process
>> memory allocation flags, but that doesn's solve your initial issue ("if a
>> NOIO callers wants to scheduler a work item so that it can user
>> GFP_KERNEL").
> 
> It being a purely convenience feature, I don't think there's hard
> requirement on where this should go although I don't know where you'd carry
> this information if you tied it to each work item. And, please don't single
> out specific GFP flags. Please make the feature generic so that users who
> may need different GFP masking can also use it too. The underlying GFP
> feature is already like that. There's no reason to restrict it from
> workqueue side.


I am preparing a v3 which handles all PF_MEMALLOC* flags. The plan is to send it out tomorrow.


Thxs, Håkon

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@  enum wq_flags {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
+	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
+	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */
 
 	/* BH wq only allows the following flags */
 	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..8eb7562372ce2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -3172,6 +3173,10 @@  __acquires(&pool->lock)
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
+	bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+	bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+	unsigned long noio_flags;
+	unsigned long nofs_flags;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@  __acquires(&pool->lock)
 
 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
+	/* Set inherited alloc flags */
+	if (use_noio_allocs)
+		noio_flags = memalloc_noio_save();
+	if (use_nofs_allocs)
+		nofs_flags = memalloc_nofs_save();
+
 	/* ensure we're on the correct CPU */
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@  __acquires(&pool->lock)
 
 	/* must be the last step, see the function comment */
 	pwq_dec_nr_in_flight(pwq, work_data);
+
+	/* Restore alloc flags */
+	if (use_nofs_allocs)
+		memalloc_nofs_restore(nofs_flags);
+	if (use_noio_allocs)
+		memalloc_noio_restore(noio_flags);
 }
 
 /**
@@ -5583,6 +5600,10 @@  struct workqueue_struct *alloc_workqueue(const char *fmt,
 
 	/* init wq */
 	wq->flags = flags;
+	if (current->flags & PF_MEMALLOC_NOIO)
+		wq->flags |= __WQ_NOIO;
+	if (current->flags & PF_MEMALLOC_NOFS)
+		wq->flags |= __WQ_NOFS;
 	wq->max_active = max_active;
 	wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
 	wq->saved_max_active = wq->max_active;

[v2,1/6] workqueue: Inherit NOIO and NOFS alloc flags

Commit Message

Comments

Patch