diff mbox series

[1/6] workqueue: Inherit NOIO and NOFS alloc flags

Message ID 20240513125346.764076-2-haakon.bugge@oracle.com (mailing list archive)
State Superseded
Headers show
Series rds: rdma: Add ability to force GFP_NOIO | expand

Commit Message

Haakon Bugge May 13, 2024, 12:53 p.m. UTC
For drivers/modules running inside a
memalloc_{noio,nofs}_{save,restore} region, if a work-queue is
created, we make sure work executed on the work-queue inherits the
same flag(s).

This in order to conditionally enable drivers to work aligned with
block I/O devices. This commit makes sure that any work queued later
on work-queues created during module initialization, when current's
flags has PF_MEMALLOC_{NOIO,NOFS} set, will inherit the same flags.

We do this in order to enable drivers to be used as a network block
I/O device. This in order to support XFS or other file-systems on top
of a raw block device which uses said drivers as the network transport
layer.

Under intense memory pressure, we get memory reclaims. Assume the
file-system reclaims memory, goes to the raw block device, which calls
into said drivers. Now, if regular GFP_KERNEL allocations in the
drivers require reclaims to be fulfilled, we end up in a circular
dependency.

We break this circular dependency by:

1. Force all allocations in the drivers to use GFP_NOIO, by means of a
   parenthetic use of memalloc_noio_{save,restore} on all relevant
   entry points.

2. Make sure work-queues inherits current->flags
   wrt. PF_MEMALLOC_{NOIO,NOFS}, such that work executed on the
   work-queue inherits the same flag(s). That is what this commit
   contributes with.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
---
 include/linux/workqueue.h |  2 ++
 kernel/workqueue.c        | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

Comments

Tejun Heo May 13, 2024, 4:48 p.m. UTC | #1
Hello,

On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
>  	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
>  	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
>  	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */

I don't quite understand how this is supposed to be used. The flags are
marked internal but nothing actually sets them. Looking at later patches, I
don't see any usages either. What am I missing?

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
>  
>  	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
>  #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
>  	/* ensure we're on the correct CPU */
>  	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>  		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
>  
>  	/* must be the last step, see the function comment */
>  	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);

Also, this looks like something that the work function can do on entry and
before exit, no?

Thanks.
Haakon Bugge May 14, 2024, 1:48 p.m. UTC | #2
> Hello,
> 
> On Mon, May 13, 2024 at 02:53:41PM +0200, Håkon Bugge wrote:
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 158784dd189ab..09ecc692ffcae 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -398,6 +398,8 @@ enum wq_flags {
> 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
> 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
> 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
> +	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
> +	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */
> 
> I don't quite understand how this is supposed to be used. The flags are
> marked internal but nothing actually sets them. Looking at later patches, I
> don't see any usages either. What am I missing?

Hi Tejun,


Thank you for so quickly looking into this. You are quite right. During re-basing, I missed a hunk from alloc_workqueue(), where I sample current->flags for PF_MALLOC_{NOIO,NOFS} bits and sets the corresponding bits in wq->flags. Fixed in v2.

> @@ -3184,6 +3189,12 @@ __acquires(&pool->lock)
> 
> 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
> +	/* Set inherited alloc flags */
> +	if (use_noio_allocs)
> +		noio_flags = memalloc_noio_save();
> +	if (use_nofs_allocs)
> +		nofs_flags = memalloc_nofs_save();
> +
> 	/* ensure we're on the correct CPU */
> 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> 		     raw_smp_processor_id() != pool->cpu);
> @@ -3320,6 +3331,12 @@ __acquires(&pool->lock)
> 
> 	/* must be the last step, see the function comment */
> 	pwq_dec_nr_in_flight(pwq, work_data);
> +
> +	/* Restore alloc flags */
> +	if (use_nofs_allocs)
> +		memalloc_nofs_restore(nofs_flags);
> +	if (use_noio_allocs)
> +		memalloc_noio_restore(noio_flags);
> 
> Also, this looks like something that the work function can do on entry and
> before exit, no?

It _can_ be done in the work functions, but that will be a code sprawl. Only in RDS, we have the following worker functions:

rds_ib_odp_mr_worker();
rds_ib_mr_pool_flush_worker()
rds_ib_odp_mr_worker()
rds_tcp_accept_worker()
rds_connect_worker()
rds_send_worker()
rds_recv_worker()
rds_shutdown_worker()

adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly prefer to have it in one place.


Thxs, Håkon
Tejun Heo May 14, 2024, 4:49 p.m. UTC | #3
Hello,

On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> > Also, this looks like something that the work function can do on entry and
> > before exit, no?
> 
> It _can_ be done in the work functions, but that will be a code sprawl.
> Only in RDS, we have the following worker functions:
> 
> rds_ib_odp_mr_worker();
> rds_ib_mr_pool_flush_worker()
> rds_ib_odp_mr_worker()
> rds_tcp_accept_worker()
> rds_connect_worker()
> rds_send_worker()
> rds_recv_worker()
> rds_shutdown_worker()
> 
> adding the ones from ib_cm, rdma_cm, mlx5_ib, and mlx5_core, I strongly
> prefer to have it in one place.

I haven't seen the code yet, so can't tell for sure but if you're
automatically inherting these flags from the scheduling site, I don't think
that's gonna work. Note that getting a different, more permissive,
allocation context is one of reasons why one might want to use workqueues,
so it'd have to be explicit whether it's in workqueue or in its users.

Thanks.
Haakon Bugge May 15, 2024, 2:11 p.m. UTC | #4
Hi Tejun,


> On 14 May 2024, at 18:49, Tejun Heo <tj@kernel.org> wrote:
> 
> Hello,
> 
> On Tue, May 14, 2024 at 01:48:24PM +0000, Haakon Bugge wrote:
> 
> I haven't seen the code yet, so can't tell for sure but if you're
> automatically inherting these flags from the scheduling site, I don't think
> that's gonna work. Note that getting a different, more permissive,
> allocation context is one of reasons why one might want to use workqueues,
> so it'd have to be explicit whether it's in workqueue or in its users.

I wanted to hold off sending the v2 if it came in more comments, but I have sent it now.


Thxs, Håkon
diff mbox series

Patch

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 158784dd189ab..09ecc692ffcae 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -398,6 +398,8 @@  enum wq_flags {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 	__WQ_LEGACY		= 1 << 18, /* internal: create*_workqueue() */
+	__WQ_NOIO               = 1 << 19, /* internal: execute work with NOIO */
+	__WQ_NOFS               = 1 << 20, /* internal: execute work with NOFS */
 
 	/* BH wq only allows the following flags */
 	__WQ_BH_ALLOWS		= WQ_BH | WQ_HIGHPRI,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d2dbe099286b9..a1d166a7c0f85 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/mm.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
 #include <linux/delay.h>
@@ -3172,6 +3173,10 @@  __acquires(&pool->lock)
 	unsigned long work_data;
 	int lockdep_start_depth, rcu_start_depth;
 	bool bh_draining = pool->flags & POOL_BH_DRAINING;
+	bool use_noio_allocs = pwq->wq->flags & __WQ_NOIO;
+	bool use_nofs_allocs = pwq->wq->flags & __WQ_NOFS;
+	unsigned long noio_flags;
+	unsigned long nofs_flags;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -3184,6 +3189,12 @@  __acquires(&pool->lock)
 
 	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
+	/* Set inherited alloc flags */
+	if (use_noio_allocs)
+		noio_flags = memalloc_noio_save();
+	if (use_nofs_allocs)
+		nofs_flags = memalloc_nofs_save();
+
 	/* ensure we're on the correct CPU */
 	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 		     raw_smp_processor_id() != pool->cpu);
@@ -3320,6 +3331,12 @@  __acquires(&pool->lock)
 
 	/* must be the last step, see the function comment */
 	pwq_dec_nr_in_flight(pwq, work_data);
+
+	/* Restore alloc flags */
+	if (use_nofs_allocs)
+		memalloc_nofs_restore(nofs_flags);
+	if (use_noio_allocs)
+		memalloc_noio_restore(noio_flags);
 }
 
 /**