
[RFC,1/2] mempool: do not consume memory reserves from the reclaim path

Message ID 1468831285-27242-1-git-send-email-mhocko@kernel.org (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer

Commit Message

Michal Hocko July 18, 2016, 8:41 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

There has been a report about the OOM killer being invoked when swapping out to
a dm-crypt device. The primary reason seems to be that the swapout
IO managed to completely deplete memory reserves. Mikulas was
able to bisect and explain the issue by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path,
which means it allocates from the PF_MEMALLOC context. The dm layer uses
mempool_alloc in order to guarantee forward progress, which used to
inhibit access to memory reserves when using the page allocator. This
changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements"), which dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way to free memory
is to perform swapout, we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete, up to the moment when memory is depleted completely and there
is no way forward but to invoke the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with OOM
situations where the oom victim depends on a mempool allocation to make
forward progress. We can handle that case in a different way, though. We
can check whether the current task has access to memory reserves as an
OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
pool is empty.

David Rientjes objected that such an approach wouldn't help if the
oom victim was blocked on a lock held by a process doing mempool_alloc. This
is very similar to other oom deadlock situations, and we have the oom_reaper
to deal with them, so it is reasonable to rely on the same mechanism
rather than inventing a different one which has negative side effects.

Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
Bisected-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempool.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

David Rientjes July 19, 2016, 2 a.m. UTC | #1
On Mon, 18 Jul 2016, Michal Hocko wrote:

> David Rientjes objected that such an approach wouldn't help if the
> oom victim was blocked on a lock held by a process doing mempool_alloc. This
> is very similar to other oom deadlock situations, and we have the oom_reaper
> to deal with them, so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.
> 

Right, this causes oom livelock as described in the aforementioned thread: 
the oom victim is waiting on a mutex that is held by a thread doing 
mempool_alloc().  The oom reaper is not guaranteed to free any memory, so 
nothing on the system can allocate memory from the page allocator.

I think the better solution here is to allow mempool_alloc() users to set 
__GFP_NOMEMALLOC if they are in a context which allows them to deplete 
memory reserves.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Michal Hocko July 19, 2016, 7:49 a.m. UTC | #2
On Mon 18-07-16 19:00:57, David Rientjes wrote:
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > David Rientjes objected that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by a process doing mempool_alloc. This
> > is very similar to other oom deadlock situations, and we have the oom_reaper
> > to deal with them, so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> 
> Right, this causes oom livelock as described in the aforementioned thread: 
> the oom victim is waiting on a mutex that is held by a thread doing 
> mempool_alloc().

The backtrace you have provided:
schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

is not a PF_MEMALLOC context AFAICS, so clearing __GFP_NOMEMALLOC for such
a task will not help unless that task has TIF_MEMDIE. Could you provide
a trace where a PF_MEMALLOC context holding a lock cannot make
forward progress?

> The oom reaper is not guaranteed to free any memory, so 
> nothing on the system can allocate memory from the page allocator.

Sure, there is no guarantee, but as I've said earlier, 1) the oom_reaper will
allow another victim to be selected in many cases and 2) such a deadlock is
no different from any other where the victim cannot continue because
another context is blocking a lock while waiting for memory. Tweaking
the mempool allocator to potentially catch such a case in a different way
doesn't sound right in principle, not to mention it has other dangerous
side effects.
 
> I think the better solution here is to allow mempool_alloc() users to set 
> __GFP_NOMEMALLOC if they are in a context which allows them to deplete 
> memory reserves.

I am not really sure about that. I agree with Johannes [1] that this
is bending the mempool allocator in an undesirable direction, because
the point of the mempool is to have its own reliably reusable memory
reserves. Now I am not even sure whether the TIF_MEMDIE exception is a
good way forward; maybe a plain revert is more appropriate. Let's CC
Johannes. The patch is [2].

[1] http://lkml.kernel.org/r/20160718151445.GB14604@cmpxchg.org
[2] http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mhocko@kernel.org
Johannes Weiner July 19, 2016, 1:54 p.m. UTC | #3
On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> The original intention of f9054c70d28b was to help with OOM
> situations where the oom victim depends on a mempool allocation to make
> forward progress. We can handle that case in a different way, though. We
> can check whether the current task has access to memory reserves as an
> OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
> pool is empty.
> 
> David Rientjes objected that such an approach wouldn't help if the
> oom victim was blocked on a lock held by a process doing mempool_alloc. This
> is very similar to other oom deadlock situations, and we have the oom_reaper
> to deal with them, so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.

I don't understand how this scenario wouldn't be a flat-out bug.

Mempool guarantees forward progress by having all necessary memory
objects for the guaranteed operation in reserve. Think about it this
way: you should be able to delete the pool->alloc() call entirely and
still make reliable forward progress. It would kill concurrency and be
super slow, but how could it be affected by a system OOM situation?

If our mempool_alloc() is waiting for an object that an OOM victim is
holding, where could that OOM victim get stuck before giving it back?
As I asked in the previous thread, surely you wouldn't do a mempool
allocation first and then rely on an unguarded page allocation to make
forward progress, right? It would defeat the purpose of using mempools
in the first place. And surely the OOM victim wouldn't be waiting for
a lock that somebody doing mempool_alloc() *against the same mempool*
is holding. That'd be an obvious ABBA deadlock.

So maybe I'm just dense, but could somebody please outline the exact
deadlock diagram? Who is doing what, and how are they getting stuck?

cpu0:                     cpu1:
                          mempool_alloc(pool0)
mempool_alloc(pool0)
  wait for cpu1
                          not allocating memory - would defeat mempool
                          not taking locks held by cpu0* - would ABBA
                          ???
                          mempool_free(pool0)

Thanks

* or any other task that does mempool_alloc(pool0) before unlock

Michal Hocko July 19, 2016, 2:19 p.m. UTC | #4
On Tue 19-07-16 09:54:26, Johannes Weiner wrote:
> On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> > The original intention of f9054c70d28b was to help with OOM
> > situations where the oom victim depends on a mempool allocation to make
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
> > pool is empty.
> > 
> > David Rientjes objected that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by a process doing mempool_alloc. This
> > is very similar to other oom deadlock situations, and we have the oom_reaper
> > to deal with them, so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> 
> I don't understand how this scenario wouldn't be a flat-out bug.
> 
> Mempool guarantees forward progress by having all necessary memory
> objects for the guaranteed operation in reserve. Think about it this
> way: you should be able to delete the pool->alloc() call entirely and
> still make reliable forward progress. It would kill concurrency and be
> super slow, but how could it be affected by a system OOM situation?

Yes, this is my understanding of the mempool usage as well. It is much
harder to check whether mempool users are really behaving and do not
request more than the preallocated pool allows, though. That
would be a bug in the consumer, not the mempool as such, of course.

My original understanding of f9054c70d28b was that it acts as
a prevention for issues where the OOM victim loops inside the
mempool_alloc not doing reasonable progress because those who should
refill the pool are stuck for some reason (aka assume that not all
mempool users are behaving or they have unexpected dependencies like WQ
without WQ_MEM_RECLAIM and similar).

My thinking was that the victim has access to memory reserves by default
so it sounds reasonable to preserve this access also when it is in the
mempool_alloc. Therefore I wanted to preserve that particular logic and
came up with this patch which should be safer than f9054c70d28b. But the
more I am thinking about it the more it sounds like papering over a bug
somewhere else.

So I guess we should just go and revert f9054c70d28b and get back to
David's lockup and investigate what exactly went wrong and why. The
current form of f9054c70d28b is simply too dangerous.
David Rientjes July 19, 2016, 8:45 p.m. UTC | #5
On Tue, 19 Jul 2016, Johannes Weiner wrote:

> Mempool guarantees forward progress by having all necessary memory
> objects for the guaranteed operation in reserve. Think about it this
> way: you should be able to delete the pool->alloc() call entirely and
> still make reliable forward progress. It would kill concurrency and be
> super slow, but how could it be affected by a system OOM situation?
> 
> If our mempool_alloc() is waiting for an object that an OOM victim is
> holding, where could that OOM victim get stuck before giving it back?
> As I asked in the previous thread, surely you wouldn't do a mempool
> allocation first and then rely on an unguarded page allocation to make
> forward progress, right? It would defeat the purpose of using mempools
> in the first place. And surely the OOM victim wouldn't be waiting for
> a lock that somebody doing mempool_alloc() *against the same mempool*
> is holding. That'd be an obvious ABBA deadlock.
> 
> So maybe I'm just dense, but could somebody please outline the exact
> deadlock diagram? Who is doing what, and how are they getting stuck?
> 
> cpu0:                     cpu1:
>                           mempool_alloc(pool0)
> mempool_alloc(pool0)
>   wait for cpu1
>                           not allocating memory - would defeat mempool
>                           not taking locks held by cpu0* - would ABBA
>                           ???
>                           mempool_free(pool0)
> 
> Thanks
> 
> * or any other task that does mempool_alloc(pool0) before unlock
> 

I'm approaching this from a perspective of any possible mempool usage, not 
with any single current user in mind.

Any mempool_alloc() user that then takes a contended mutex can do this.  
An example:

	taskA		taskB		taskC
	-----		-----		-----
	mempool_alloc(a)
			mutex_lock(b)
	mutex_lock(b)
					mempool_alloc(a)

Imagine the mempool_alloc() done by taskA depleting all free elements, so
we rely on it to do mempool_free() before any other mempool allocation can
be guaranteed to succeed.

If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
mempool_alloc().  This livelocks the page allocator for all processes.

taskB in this case need only stall after taking mutex_lock() successfully; 
that could be because of the oom livelock, it is contended on another 
mutex held by an allocator, etc.

Obviously taskB stalling while holding a mutex that is contended by a 
mempool user holding an element is not preferred, but it's possible.  (A 
simplified version is also possible with 0-size mempools, which are also 
allowed.)

My point is that I don't think we should be forcing any behavior wrt 
memory reserves as part of the mempool implementation.  In the above, 
taskC mempool_alloc() would succeed and not livelock unless 
__GFP_NOMEMALLOC is forced.  The mempool_alloc() user may construct their 
set of gfp flags as appropriate just like any other memory allocator in 
the kernel.

The alternative would be to ensure no mempool users ever take a lock that 
another thread can hold while contending another mutex or allocating 
memory itself.

Mikulas Patocka July 19, 2016, 9:50 p.m. UTC | #6
On Mon, 18 Jul 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There has been a report about the OOM killer being invoked when swapping out to
> a dm-crypt device. The primary reason seems to be that the swapout
> IO managed to completely deplete memory reserves. Mikulas was
> able to bisect and explain the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> 
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path,
> which means it allocates from the PF_MEMALLOC context. The dm layer uses
> mempool_alloc in order to guarantee forward progress, which used to
> inhibit access to memory reserves when using the page allocator. This
> changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> there are free elements"), which dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
> 
> If we are running out of memory and the only way to free memory
> is to perform swapout, we just keep consuming memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete, up to the moment when memory is depleted completely and there
> is no way forward but to invoke the OOM killer. This is less than
> optimal.
> 
> The original intention of f9054c70d28b was to help with OOM
> situations where the oom victim depends on a mempool allocation to make
> forward progress. We can handle that case in a different way, though. We
> can check whether the current task has access to memory reserves as an
> OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
> pool is empty.
> 
> David Rientjes objected that such an approach wouldn't help if the
> oom victim was blocked on a lock held by a process doing mempool_alloc. This
> is very similar to other oom deadlock situations, and we have the oom_reaper
> to deal with them, so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.
> 
> Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> Bisected-by: Mikulas Patocka <mpatocka@redhat.com>

Bisect was done by Ondrej Kozina.

> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>

> ---
>  mm/mempool.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..ea26d75c8adf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
> +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
> -	if (likely(pool->curr_nr)) {
> -		/*
> -		 * Don't allocate from emergency reserves if there are
> -		 * elements available.  This check is racy, but it will
> -		 * be rechecked each loop.
> -		 */
> -		gfp_temp |= __GFP_NOMEMALLOC;
> -	}
> +	/*
> +	 * Make sure that the OOM victim will get access to memory reserves
> +	 * properly if there are no objects in the pool to prevent from
> +	 * livelocks.
> +	 */
> +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> +		gfp_temp &= ~__GFP_NOMEMALLOC;
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
> -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		gfp_temp = gfp_mask;
>  		goto repeat_alloc;
> -- 
> 2.8.1
> 

Mikulas Patocka July 19, 2016, 10:01 p.m. UTC | #7
On Tue, 19 Jul 2016, Michal Hocko wrote:

> On Tue 19-07-16 09:54:26, Johannes Weiner wrote:
> > On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> > > The original intention of f9054c70d28b was to help with OOM
> > > situations where the oom victim depends on a mempool allocation to make
> > > forward progress. We can handle that case in a different way, though. We
> > > can check whether the current task has access to memory reserves as an
> > > OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
> > > pool is empty.
> > > 
> > > David Rientjes objected that such an approach wouldn't help if the
> > > oom victim was blocked on a lock held by a process doing mempool_alloc. This
> > > is very similar to other oom deadlock situations, and we have the oom_reaper
> > > to deal with them, so it is reasonable to rely on the same mechanism
> > > rather than inventing a different one which has negative side effects.
> > 
> > I don't understand how this scenario wouldn't be a flat-out bug.
> > 
> > Mempool guarantees forward progress by having all necessary memory
> > objects for the guaranteed operation in reserve. Think about it this
> > way: you should be able to delete the pool->alloc() call entirely and
> > still make reliable forward progress. It would kill concurrency and be
> > super slow, but how could it be affected by a system OOM situation?
> 
> Yes this is my understanding of the mempool usage as well. It is much

Yes, that's correct.

> harder to check whether mempool users are really behaving and do not
> request more than the preallocated pool allows, though. That
> would be a bug in the consumer, not the mempool as such, of course.
> 
> My original understanding of f9054c70d28b was that it acts as
> a prevention for issues where the OOM victim loops inside the
> mempool_alloc not doing reasonable progress because those who should
> refill the pool are stuck for some reason (aka assume that not all
> mempool users are behaving or they have unexpected dependencies like WQ
> without WQ_MEM_RECLAIM and similar).

David Rientjes didn't tell us what the configuration of his servers is; we
don't know which dm targets and block device drivers he is using, and we don't
know how they are connected - so it is not really possible to know what
happened for him.

Mikulas

> My thinking was that the victim has access to memory reserves by default
> so it sounds reasonable to preserve this access also when it is in the
> mempool_alloc. Therefore I wanted to preserve that particular logic and
> came up with this patch which should be safer than f9054c70d28b. But the
> more I am thinking about it the more it sounds like papering over a bug
> somewhere else.
> 
> So I guess we should just go and revert f9054c70d28b and get back to
> David's lockup and investigate what exactly went wrong and why. The
> current form of f9054c70d28b is simply too dangerous.
> -- 
> Michal Hocko
> SUSE Labs
> 

Michal Hocko July 20, 2016, 6:44 a.m. UTC | #8
On Tue 19-07-16 17:50:29, Mikulas Patocka wrote:
> 
> 
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > There has been a report about the OOM killer being invoked when swapping out to
> > a dm-crypt device. The primary reason seems to be that the swapout
> > IO managed to completely deplete memory reserves. Mikulas was
> > able to bisect and explain the issue by pointing to f9054c70d28b
> > ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> > 
> > The reason is that the swapout path is not throttled properly because
> > the md-raid layer needs to allocate from the generic_make_request path,
> > which means it allocates from the PF_MEMALLOC context. The dm layer uses
> > mempool_alloc in order to guarantee forward progress, which used to
> > inhibit access to memory reserves when using the page allocator. This
> > changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> > there are free elements"), which dropped the __GFP_NOMEMALLOC
> > protection when the memory pool is depleted.
> > 
> > If we are running out of memory and the only way to free memory
> > is to perform swapout, we just keep consuming memory reserves rather than
> > throttling the mempool allocations and allowing the pending IO to
> > complete, up to the moment when memory is depleted completely and there
> > is no way forward but to invoke the OOM killer. This is less than
> > optimal.
> > 
> > The original intention of f9054c70d28b was to help with OOM
> > situations where the oom victim depends on a mempool allocation to make
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
> > pool is empty.
> > 
> > David Rientjes objected that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by a process doing mempool_alloc. This
> > is very similar to other oom deadlock situations, and we have the oom_reaper
> > to deal with them, so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> > Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> > Bisected-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> Bisect was done by Ondrej Kozina.

OK, fixed

> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>

Let's see whether we decide to go with this patch or a plain revert. In
any case I will mark the patch for stable so it ends up in both 4.6
and 4.7.

Anyway, thanks for your and Ondrej's help here!
Michal Hocko July 20, 2016, 8:15 a.m. UTC | #9
On Tue 19-07-16 13:45:52, David Rientjes wrote:
> On Tue, 19 Jul 2016, Johannes Weiner wrote:
> 
> > Mempool guarantees forward progress by having all necessary memory
> > objects for the guaranteed operation in reserve. Think about it this
> > way: you should be able to delete the pool->alloc() call entirely and
> > still make reliable forward progress. It would kill concurrency and be
> > super slow, but how could it be affected by a system OOM situation?
> > 
> > If our mempool_alloc() is waiting for an object that an OOM victim is
> > holding, where could that OOM victim get stuck before giving it back?
> > As I asked in the previous thread, surely you wouldn't do a mempool
> > allocation first and then rely on an unguarded page allocation to make
> > forward progress, right? It would defeat the purpose of using mempools
> > in the first place. And surely the OOM victim wouldn't be waiting for
> > a lock that somebody doing mempool_alloc() *against the same mempool*
> > is holding. That'd be an obvious ABBA deadlock.
> > 
> > So maybe I'm just dense, but could somebody please outline the exact
> > deadlock diagram? Who is doing what, and how are they getting stuck?
> > 
> > cpu0:                     cpu1:
> >                           mempool_alloc(pool0)
> > mempool_alloc(pool0)
> >   wait for cpu1
> >                           not allocating memory - would defeat mempool
> >                           not taking locks held by cpu0* - would ABBA
> >                           ???
> >                           mempool_free(pool0)
> > 
> > Thanks
> > 
> > * or any other task that does mempool_alloc(pool0) before unlock
> > 
> 
> I'm approaching this from a perspective of any possible mempool usage, not 
> with any single current user in mind.
> 
> Any mempool_alloc() user that then takes a contended mutex can do this.  
> An example:
> 
> 	taskA		taskB		taskC
> 	-----		-----		-----
> 	mempool_alloc(a)
> 			mutex_lock(b)
> 	mutex_lock(b)
> 					mempool_alloc(a)
> 
> Imagine the mempool_alloc() done by taskA depleting all free elements, so
> we rely on it to do mempool_free() before any other mempool allocation can
> be guaranteed to succeed.
> 
> If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> mempool_alloc().  This livelocks the page allocator for all processes.
> 
> taskB in this case need only stall after taking mutex_lock() successfully; 
> that could be because of the oom livelock, it is contended on another 
> mutex held by an allocator, etc.

But that comes down to the deadlock described by Johannes above, because
then the mempool user would _depend_ on an "unguarded page allocation"
via that particular lock, and that is a bug.
 
> Obviously taskB stalling while holding a mutex that is contended by a 
> mempool user holding an element is not preferred, but it's possible.  (A 
> simplified version is also possible with 0-size mempools, which are also 
> allowed.)
> 
> My point is that I don't think we should be forcing any behavior wrt 
> memory reserves as part of the mempool implementation. 

Isn't the reserve management the whole point of the mempool approach?

> In the above, 
> taskC mempool_alloc() would succeed and not livelock unless 
> __GFP_NOMEMALLOC is forced. 

Or it would get stuck because even the page allocator memory reserves got
depleted. Without any way to throttle, there is no guarantee of making
further progress. In fact this is not a theoretical situation. It has
been observed with swap over dm-crypt, and there shouldn't be any of the
lock dependencies you are describing above there AFAIU.

> The mempool_alloc() user may construct their 
> set of gfp flags as appropriate just like any other memory allocator in 
> the kernel.

So which users of mempool_alloc would benefit from not having
__GFP_NOMEMALLOC and why?

> The alternative would be to ensure no mempool users ever take a lock that 
> another thread can hold while contending another mutex or allocating 
> memory itself.

I am not sure how we can enforce that, but surely that would detect a
clear mempool usage bug. Lockdep could probably be extended to do so.

Anyway, I feel we are looping in a circle. We have a clear regression
caused by your patch. It might solve some oom livelock you are seeing,
but there are only very dim details about it, and the patch might very
well paper over a bug in mempool usage somewhere else. We definitely
need more details to know that better.

That being said, f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements") should either be reverted, or
http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mhocko@kernel.org
should be applied as a temporary workaround, because it would make a
lockup less likely for now until we find out more about your issue.

Does that sound like a way forward?
David Rientjes July 20, 2016, 9:06 p.m. UTC | #10
On Wed, 20 Jul 2016, Michal Hocko wrote:

> > Any mempool_alloc() user that then takes a contended mutex can do this.  
> > An example:
> > 
> > 	taskA		taskB		taskC
> > 	-----		-----		-----
> > 	mempool_alloc(a)
> > 			mutex_lock(b)
> > 	mutex_lock(b)
> > 					mempool_alloc(a)
> > 
> > Imagine the mempool_alloc() done by taskA depleting all free elements, so
> > we rely on it to do mempool_free() before any other mempool allocation can
> > be guaranteed to succeed.
> > 
> > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> > mempool_alloc().  This livelocks the page allocator for all processes.
> > 
> > taskB in this case need only stall after taking mutex_lock() successfully; 
> > that could be because of the oom livelock, it is contended on another 
> > mutex held by an allocator, etc.
> 
> But that comes down to the deadlock described by Johannes above, because
> then the mempool user would _depend_ on an "unguarded page allocation"
> via that particular lock, and that is a bug.
>  

It becomes a deadlock because of mempool_alloc(a) forcing 
__GFP_NOMEMALLOC, I agree.

For that not to be the case, it must be required that between 
mempool_alloc() and mempool_free() that we take no mutex that may be held 
by any other thread on the system, in any context, that is allocating 
memory.  If that's a caller's bug as you describe it, and only enabled by 
mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the relevant 
lockdep detection, which would be trivial to add, so we can determine if 
any users are unsafe and prevent this issue in the future.  The 
overwhelming goal here should be to prevent possible problems in the 
future especially if an API does not allow you to opt-out of the behavior.

> > My point is that I don't think we should be forcing any behavior wrt 
> > memory reserves as part of the mempool implementation. 
> 
> Isn't the reserve management the whole point of the mempool approach?
> 

No, the whole point is to maintain the freelist of elements that are 
guaranteed; my suggestion is that we cannot make that guarantee if we are 
blocked from freeing elements.  It's trivial to fix by allowing 
__GFP_NOMEMALLOC from the caller in cases where you cannot possibly be 
blocked by an oom victim.

> Or it would get stuck because even page allocator memory reserves got
> depleted. Without any way to throttle there is no guarantee to make
> further progress. In fact this is not a theoretical situation. It has
> been observed with swap over dm-crypt, and there shouldn't be any of
> the lock dependencies you describe above there, AFAIU.
> 

They should do mempool_alloc(__GFP_NOMEMALLOC), no argument.

> > The mempool_alloc() user may construct their 
> > set of gfp flags as appropriate just like any other memory allocator in 
> > the kernel.
> 
> So which users of mempool_alloc would benefit from not having
> __GFP_NOMEMALLOC and why?
> 

Any mempool_alloc() user that would be blocked on returning the element 
back to the freelist by an oom condition.  I think the dm-crypt case is 
quite unique in how it is able to deplete memory reserves.

> Anyway, I feel we are going in circles. We have a clear regression
> caused by your patch. It might solve some oom livelock you are seeing
> but there are only very dim details about it and the patch might very
> well paper over a bug in mempool usage somewhere else. We definitely
> need more details to know that better.
> 

What is the objection to allowing __GFP_NOMEMALLOC from the caller with 
clear documentation on how to use it?  It can be documented as not 
allowing depletion of memory reserves, with the caveat that the caller 
must ensure mempool_free() cannot be blocked in lowmem situations.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Michal Hocko July 21, 2016, 8:52 a.m. UTC | #11
On Wed 20-07-16 14:06:26, David Rientjes wrote:
> On Wed, 20 Jul 2016, Michal Hocko wrote:
> 
> > > Any mempool_alloc() user that then takes a contended mutex can do this.  
> > > An example:
> > > 
> > > 	taskA		taskB		taskC
> > > 	-----		-----		-----
> > > 	mempool_alloc(a)
> > > 			mutex_lock(b)
> > > 	mutex_lock(b)
> > > 					mempool_alloc(a)
> > > 
> > > Imagine the mempool_alloc() done by taskA depleting all free elements so 
> > > we rely on it to do mempool_free() before any other mempool allocator can 
> > > be guaranteed.
> > > 
> > > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> > > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> > > mempool_alloc().  This livelocks the page allocator for all processes.
> > > 
> > > taskB in this case need only stall after taking mutex_lock() successfully; 
> > > that could be because of the oom livelock, it is contended on another 
> > > mutex held by an allocator, etc.
> > 
> > But that falls down to the deadlock described by Johannes above because
> > then the mempool user would _depend_ on an "unguarded page allocation"
> > via that particular lock and that is a bug.
> >  
> 
> It becomes a deadlock because of mempool_alloc(a) forcing 
> __GFP_NOMEMALLOC, I agree.
> 
> For that not to be the case, it must be required that between 
> mempool_alloc() and mempool_free() we take no mutex that may be held 
> by any other thread on the system, in any context, that is allocating 
> memory.  If that's a caller's bug as you describe it, and only enabled by 
> mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the relevant 
> lockdep detection, which would be trivial to add, so we can determine if 
> any users are unsafe and prevent this issue in the future.

I am sorry but I am neither familiar with the lockdep internals nor do I
have the time to add this support.

> The 
> overwhelming goal here should be to prevent possible problems in the 
> future especially if an API does not allow you to opt-out of the behavior.

The __GFP_NOMEMALLOC enforcement is there since b84a35be0285 ("[PATCH]
mempool: NOMEMALLOC and NORETRY") so more than 10 years ago. So I think
it is quite reasonable to expect that users are familiar with this fact
and handle it properly in the vast majority of cases. In fact mempool
deadlocks are really rare.

[...]

> > Or it would get stuck because even page allocator memory reserves got
> > depleted. Without any way to throttle there is no guarantee to make
> > further progress. In fact this is not a theoretical situation. It has
> > been observed with swap over dm-crypt, and there shouldn't be any of
> > the lock dependencies you describe above there, AFAIU.
> > 
> 
> They should do mempool_alloc(__GFP_NOMEMALLOC), no argument.

How would that be any different from any other mempool user which can be
invoked from the swap-out path - aka any other IO path?

> What is the objection to allowing __GFP_NOMEMALLOC from the caller with 
> clear documentation on how to use it?  It can be documented as not 
> allowing depletion of memory reserves, with the caveat that the caller 
> must ensure mempool_free() cannot be blocked in lowmem situations.

Look, there are
$ git grep mempool_alloc | wc -l
304

many users of this API and we do not want to flip the default behavior
which has been there for more than 10 years. So far you have been arguing
about potential deadlocks and haven't shown any particular path which
would have a direct or indirect dependency between mempool and the normal
allocator and wouldn't be a bug. As a matter of fact, the change
we are discussing here causes a regression. If you want to change the
semantics of the mempool allocator then you are absolutely free to do
so - in a separate patch which would be discussed with IO people and
other users, though. But we _absolutely_ want to fix the regression first
and have a simple fix for the 4.6 and 4.7 backports. At this moment there
are the revert and patch 1 on the table.  The latter should make your
backtrace happy and should be only a temporary fix until we find out
what is actually misbehaving on your systems. If you are not interested
in pursuing that, I will simply go with the revert.
Johannes Weiner July 21, 2016, 12:13 p.m. UTC | #12
On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
> Look, there are
> $ git grep mempool_alloc | wc -l
> 304
> 
> many users of this API and we do not want to flip the default behavior
> which has been there for more than 10 years. So far you have been arguing
> about potential deadlocks and haven't shown any particular path which
> would have a direct or indirect dependency between mempool and the normal
> allocator and wouldn't be a bug. As a matter of fact, the change
> we are discussing here causes a regression. If you want to change the
> semantics of the mempool allocator then you are absolutely free to do
> so - in a separate patch which would be discussed with IO people and
> other users, though. But we _absolutely_ want to fix the regression first
> and have a simple fix for the 4.6 and 4.7 backports. At this moment there
> are the revert and patch 1 on the table.  The latter should make your
> backtrace happy and should be only a temporary fix until we find out
> what is actually misbehaving on your systems. If you are not interested
> in pursuing that, I will simply go with the revert.

+1

It's very unlikely that decade-old mempool semantics are suddenly a
fundamental livelock problem, when all the evidence we have is one
hang and vague speculation. Given that the patch causes regressions,
and that the bug is most likely elsewhere anyway, a full revert rather
than merely-less-invasive mempool changes makes the most sense to me.


Patch

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..ea26d75c8adf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -322,20 +322,20 @@  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
+	/*
+	 * Make sure that the OOM victim will get access to memory reserves
+	 * properly if there are no objects in the pool to prevent from
+	 * livelocks.
+	 */
+	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
+		gfp_temp &= ~__GFP_NOMEMALLOC;
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,7 +359,7 @@  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;