
NFS: Fix memory allocation in rpc_malloc()

Message ID: 20220315162052.570677-1-trondmy@kernel.org
State New, archived

Commit Message

Trond Myklebust March 15, 2022, 4:20 p.m. UTC
From: Trond Myklebust <trond.myklebust@hammerspace.com>

When allocating memory, it should be safe to always use GFP_KERNEL,
since both swap tasks and asynchronous tasks will regulate the
allocation mode through the struct task flags.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 net/sunrpc/sched.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
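
To illustrate the mechanism the commit message relies on: the caller's
per-task flags, rather than the gfp mask at each call site, decide which
reclaim operations the allocator may perform.  A minimal sketch using the
helpers from linux/sched/mm.h (where the actual calls sit in the RPC code
is discussed later in the thread):

	unsigned int pflags;

	pflags = memalloc_nofs_save();		/* sets PF_MEMALLOC_NOFS on current */
	buf = kmalloc(size, GFP_KERNEL);	/* allocator now treats this as GFP_NOFS */
	memalloc_nofs_restore(pflags);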

Comments

Chuck Lever March 15, 2022, 4:50 p.m. UTC | #1
> On Mar 15, 2022, at 12:20 PM, trondmy@kernel.org wrote:
> 
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> When allocating memory, it should be safe to always use GFP_KERNEL,
> since both swap tasks and asynchronous tasks will regulate the
> allocation mode through the struct task flags.

Is a similar change necessary in xprt_rdma_allocate() ?


> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
> net/sunrpc/sched.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 7c8f87ebdbc0..c62fcacf7366 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1030,16 +1030,12 @@ int rpc_malloc(struct rpc_task *task)
> 	struct rpc_rqst *rqst = task->tk_rqstp;
> 	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
> 	struct rpc_buffer *buf;
> -	gfp_t gfp = GFP_KERNEL;
> -
> -	if (RPC_IS_ASYNC(task))
> -		gfp = GFP_NOWAIT | __GFP_NOWARN;
> 
> 	size += sizeof(struct rpc_buffer);
> 	if (size <= RPC_BUFFER_MAXSIZE)
> -		buf = mempool_alloc(rpc_buffer_mempool, gfp);
> +		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
> 	else
> -		buf = kmalloc(size, gfp);
> +		buf = kmalloc(size, GFP_KERNEL);
> 
> 	if (!buf)
> 		return -ENOMEM;
> -- 
> 2.35.1
> 

--
Chuck Lever
Trond Myklebust March 15, 2022, 5 p.m. UTC | #2
On Tue, 2022-03-15 at 16:50 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 15, 2022, at 12:20 PM, trondmy@kernel.org wrote:
> > 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When allocating memory, it should be safe to always use GFP_KERNEL,
> > since both swap tasks and asynchronous tasks will regulate the
> > allocation mode through the struct task flags.
> 
> Is a similar change necessary in xprt_rdma_allocate() ?

It looks to me as if that would be a safe change, and we should
probably also change that definition of RPCRDMA_DEF_GFP to match
GFP_KERNEL.
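
For reference, the current definition in xprt_rdma.h is along these lines
(quoted from memory, so worth checking against the tree):

	#define RPCRDMA_DEF_GFP		(GFP_NOIO | __GFP_NOWARN)

so the change would be to make it plain GFP_KERNEL.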
NeilBrown March 16, 2022, 1:53 a.m. UTC | #3
On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> When allocating memory, it should be safe to always use GFP_KERNEL,
> since both swap tasks and asynchronous tasks will regulate the
> allocation mode through the struct task flags.

'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
They cannot prevent waiting altogether.
We need rpciod tasks not to block waiting for memory.  If they all do,
then there will be no thread free to handle the replies to WRITE, which
would allow swapped-out memory to be freed.

At the very least the rescuer thread mustn't block, so the use of
GFP_NOWAIT could depend on current_is_workqueue_rescuer().
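
Concretely, that would look something like this (a sketch only;
current_is_workqueue_rescuer() is the existing helper declared in
linux/workqueue.h):

	gfp_t gfp = GFP_KERNEL;

	if (RPC_IS_ASYNC(task) && current_is_workqueue_rescuer())
		gfp = GFP_NOWAIT | __GFP_NOWARN;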

Was there some problem you saw caused by the use of GFP_NOWAIT ??

Thanks,
NeilBrown

> 
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  net/sunrpc/sched.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 7c8f87ebdbc0..c62fcacf7366 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1030,16 +1030,12 @@ int rpc_malloc(struct rpc_task *task)
>  	struct rpc_rqst *rqst = task->tk_rqstp;
>  	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
>  	struct rpc_buffer *buf;
> -	gfp_t gfp = GFP_KERNEL;
> -
> -	if (RPC_IS_ASYNC(task))
> -		gfp = GFP_NOWAIT | __GFP_NOWARN;
>  
>  	size += sizeof(struct rpc_buffer);
>  	if (size <= RPC_BUFFER_MAXSIZE)
> -		buf = mempool_alloc(rpc_buffer_mempool, gfp);
> +		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
>  	else
> -		buf = kmalloc(size, gfp);
> +		buf = kmalloc(size, GFP_KERNEL);
>  
>  	if (!buf)
>  		return -ENOMEM;
> -- 
> 2.35.1
> 
>
Trond Myklebust March 16, 2022, 1:47 p.m. UTC | #4
On Wed, 2022-03-16 at 12:53 +1100, NeilBrown wrote:
> On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When allocating memory, it should be safe to always use GFP_KERNEL,
> > since both swap tasks and asynchronous tasks will regulate the
> > allocation mode through the struct task flags.
> 
> 'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
> They cannot prevent waiting altogether.
> We need rpciod tasks not to block waiting for memory.  If they all do,
> then there will be no thread free to handle the replies to WRITE
> which
> would allow swapped-out memory to be freed.
> 
> At the very least the rescuer thread mustn't block, so the use of
> GFP_NOWAIT could depend on current_is_workqueue_rescuer().
> 
> Was there some problem you saw caused by the use of GFP_NOWAIT ??
> 

There is no point in trying to give rpciod stronger protection than
what is given to the standard memory reclaim processes. The VM does not
have a requirement that writepages() and its ilk be non-waiting and
non-blocking, nor does it require that the threads that help service
those writebacks be non-blocking.

We wouldn't be able to do things like create a new socket to reconnect
to the server if we couldn't perform blocking allocations from rpciod.
None of the socket creation APIs allow you to pass in a gfp flag mask.

What we do require in this situation is that we must not recurse back
into NFS. This is why we have the memalloc_nofs_save() /
memalloc_nofs_restore() protection in various places, including
rpc_async_schedule() and rpc_execute(). That still allows the VM to
actively perform direct reclaim, and to start I/O against block devices
when in a low memory situation. Why would we not want it to do that? 
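
For reference, that protection is roughly the following pattern (sketched
from memory against net/sunrpc/sched.c; the exact code may differ):

	static void rpc_async_schedule(struct work_struct *work)
	{
		unsigned int pflags = memalloc_nofs_save();

		__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
		memalloc_nofs_restore(pflags);
	}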

The alternative with GFP_NOWAIT is to stall the RPC I/O altogether when
in a low memory situation, and to cross our fingers that something else
causes memory to be freed up.
NeilBrown March 18, 2022, 5:04 a.m. UTC | #5
On Thu, 17 Mar 2022, Trond Myklebust wrote:
> On Wed, 2022-03-16 at 12:53 +1100, NeilBrown wrote:
> > On Wed, 16 Mar 2022, trondmy@kernel.org wrote:
> > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > 
> > > When allocating memory, it should be safe to always use GFP_KERNEL,
> > > since both swap tasks and asynchronous tasks will regulate the
> > > allocation mode through the struct task flags.
> > 
> > 'struct task_struct' flags can only enforce NOFS, NOIO, or MEMALLOC.
> > They cannot prevent waiting altogether.
> > We need rpciod tasks not to block waiting for memory.  If they all do,
> > then there will be no thread free to handle the replies to WRITE
> > which
> > would allow swapped-out memory to be freed.
> > 
> > At the very least the rescuer thread mustn't block, so the use of
> > GFP_NOWAIT could depend on current_is_workqueue_rescuer().
> > 
> > Was there some problem you saw caused by the use of GFP_NOWAIT ??
> > 
> 
> There is no point in trying to give rpciod stronger protection than
> what is given to the standard memory reclaim processes. The VM does not
> have a requirement that writepages() and its ilk be non-waiting and
> non-blocking, nor does it require that the threads that help service
> those writebacks be non-blocking.

My understanding is a little more nuanced.  It is certainly true that
writepages() etc can block, but only for events that are certain to
happen in some finite time.
This is why the block layer uses mempools, and uses a different mempool
at each layer going down the stack.  At each level it is ok to wait for
events at some lower level, but not for events at the same or higher
level. That ensures there are no loops in the waiting dependency.

This is where rpciod falls down without the commit that you want to
revert.
rpciod threads both wait on the rpc_buffer_mempool, and return bufs to
it with mempool_free().  This means that all the available threads could
be blocked waiting on the mempool, so none would be available to free
buffers for which a reply has been received.
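
Spelled out, the dependency loop is (an illustration only, not code from
the tree):

	/* Every rpciod worker can end up parked here ... */
	buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
	/* ... sleeping until an element is returned.  But the mempool_free()
	 * that would satisfy it is driven by reply processing, which also
	 * runs on rpciod, so no worker is left to perform it.
	 */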

If there were separate workqueues (with separate rescuer threads) for
sending requests and for receiving replies, then there wouldn't be a
problem here.  I haven't given that possibility a lot of thought, but I
think it would be messy to separate requests from replies like that.

Without the separation, we need to ensure requests don't block waiting for
some other reply - else we can deadlock.

> 
> We wouldn't be able to do things like create a new socket to reconnect
> to the server if we couldn't perform blocking allocations from rpciod.
> None of the socket creation APIs allow you to pass in a gfp flag mask.

For cases where mempools are not convenient, there is PF_MEMALLOC.  This
allows access to generic reserves.  It is dangerous to use this for
regular allocations as it can still be exhausted and can still cause
deadlocks.  But it is generally safe for rare allocations as you are
unlikely to have lots of them at once resulting in exhaustion.

Allocating a new socket is exactly the sort of thing that PF_MEMALLOC is
good for.  It happens rarely (at intervals of hours or more), in contrast with
rpc_malloc() which happens frequently (multiple times a second).
I'm not certain that socket allocation happens under PF_MEMALLOC - I
should check - but that is the correct solution for that need.
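
A minimal sketch of using that reserve for the rare (re)connect path (the
memalloc helpers are real, from linux/sched/mm.h; the socket call and the
fields used here are only illustrative):

	unsigned int pflags = memalloc_noreclaim_save();	/* sets PF_MEMALLOC */
	err = sock_create_kern(net, xprt->addr.ss_family, SOCK_STREAM,
			       IPPROTO_TCP, &sock);
	memalloc_noreclaim_restore(pflags);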

> 
> What we do require in this situation, is that we must not recurse back
> into NFS. This is why we have the memalloc_nofs_save() /
> memalloc_nofs_restore() protection in various places, including
> rpc_async_schedule() and rpc_execute(). That still allows the VM to
> actively perform direct reclaim, and to start I/O against block devices
> when in a low memory situation. Why would we not want it to do that? 

Certainly it is important to avoid recursion, and the improvements to
properly use memalloc_nofs_save() etc. so that we don't need GFP_NOFS
are good.  Consequently there is no problem with starting reclaim.  I
would not object to __GFP_DIRECT_RECLAIM being passed to kmalloc(),
provided __GFP_NORETRY were passed as well.
But passing __GFP_DIRECT_RECLAIM to mempool_alloc() causes it to wait
indefinitely for success, and that triggers the deadlock because rpciod
is waiting for something that only rpciod can do.
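
In other words, something along these lines would be acceptable (a sketch
of that compromise, not a tested patch):

	gfp_t kmalloc_gfp = GFP_KERNEL;
	gfp_t mempool_gfp = GFP_KERNEL;

	if (RPC_IS_ASYNC(task)) {
		/* direct reclaim is fine, but the allocation must be able to fail */
		kmalloc_gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
		/* must never sleep waiting for an element to be returned */
		mempool_gfp = GFP_NOWAIT | __GFP_NOWARN;
	}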

> 
> The alternative with GFP_NOWAIT is to stall the RPC I/O altogether when
> in a low memory situation, and to cross our fingers that something else
> causes memory to be freed up.

I don't think crossing our fingers is generally a good strategy.  I
don't think "something else" freeing up memory is likely, though the
oom killer might manage it.
PF_MEMALLOC is effectively a "cross your fingers" strategy too, but there
the bet is not "something will free up memory", but rather "we'll never
need all of the reserve at the same time".  That is somewhat more
defensible.

Thanks,
NeilBrown


> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 
>

Patch

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 7c8f87ebdbc0..c62fcacf7366 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1030,16 +1030,12 @@  int rpc_malloc(struct rpc_task *task)
 	struct rpc_rqst *rqst = task->tk_rqstp;
 	size_t size = rqst->rq_callsize + rqst->rq_rcvsize;
 	struct rpc_buffer *buf;
-	gfp_t gfp = GFP_KERNEL;
-
-	if (RPC_IS_ASYNC(task))
-		gfp = GFP_NOWAIT | __GFP_NOWARN;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
-		buf = mempool_alloc(rpc_buffer_mempool, gfp);
+		buf = mempool_alloc(rpc_buffer_mempool, GFP_KERNEL);
 	else
-		buf = kmalloc(size, gfp);
+		buf = kmalloc(size, GFP_KERNEL);
 
 	if (!buf)
 		return -ENOMEM;