
System freezes after OOM

Message ID 2d5e1f84-e886-7b98-cb11-170d7104fd13@I-love.SAKURA.ne.jp (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer

Commit Message

Tetsuo Handa July 13, 2016, 1:19 p.m. UTC
> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
>> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
>> [...]
>>> The general problem is that the memory allocator does 16 retries to 
>>> allocate a page and then triggers the OOM killer (and it doesn't take into 
>>> account how much swap space is free or how many dirty pages were really 
>>> swapped out while it waited).
>>
>> Well, that is not how it works exactly. We retry as long as there is
>> reclaim progress (at least one page freed) and back off only once the
>> reclaimable memory, an estimate which is scaled down over 16 retries, can
>> no longer exceed the watermarks. The overall size of free swap is not
>> really that important if we
>> cannot swap out like here due to complete memory reserves depletion:
>> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
>> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
>> [   90.491283] lowmem_reserve[]: 0 977 977 977
>> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
>>
>> Look at the amount of free memory. It is completely depleted. So it
>> smells like a process which has access to memory reserves has consumed
>> all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
>> context user which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).

I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
__GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
It says

    If an oom killed thread calls mempool_alloc(), it is possible that it'll
    loop forever if there are no elements on the freelist since
    __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
    oom conditions.

but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
memory reserves via the change below, can't we? The purpose of allowing
ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure the current allocation
request does not loop forever inside the page allocator, isn't it?
Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC) requests to use
ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?
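
To make the flag interplay easier to follow, here is a minimal user-space
model of the watermark decision that the patch at the bottom of this page
proposes. The flag values, struct and helper names are placeholders invented
for this sketch (the real code lives in gfp_to_alloc_flags()), and the
softirq branch is omitted:

#include <stdbool.h>
#include <stdio.h>

#define __GFP_NOMEMALLOC	0x1u	/* placeholder values, not the kernel's */
#define __GFP_MEMALLOC		0x2u
#define ALLOC_NO_WATERMARKS	0x100u

struct task_ctx {
	bool pf_memalloc;	/* current->flags & PF_MEMALLOC */
	bool tif_memdie;	/* test_thread_flag(TIF_MEMDIE) */
	bool in_interrupt;
};

/* Mirrors the proposed hunk: TIF_MEMDIE alone grants ALLOC_NO_WATERMARKS,
 * while __GFP_NOMEMALLOC keeps PF_MEMALLOC callers away from the reserves. */
static unsigned int alloc_flags_proposed(unsigned int gfp,
					 const struct task_ctx *t)
{
	unsigned int alloc_flags = 0;

	if (!t->in_interrupt && t->tif_memdie)
		alloc_flags |= ALLOC_NO_WATERMARKS;
	else if (!(gfp & __GFP_NOMEMALLOC)) {
		if (gfp & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!t->in_interrupt && t->pf_memalloc)
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}
	return alloc_flags;
}

int main(void)
{
	struct task_ctx swapout = { .pf_memalloc = true };	/* swapping, not OOM-killed */
	struct task_ctx victim  = { .tif_memdie = true };	/* OOM victim */

	printf("swapout + __GFP_NOMEMALLOC: %#x\n",
	       alloc_flags_proposed(__GFP_NOMEMALLOC, &swapout));
	printf("victim  + __GFP_NOMEMALLOC: %#x\n",
	       alloc_flags_proposed(__GFP_NOMEMALLOC, &victim));
	return 0;
}

In this model the PF_MEMALLOC swapout case gets no ALLOC_NO_WATERMARKS once
__GFP_NOMEMALLOC is set, while the TIF_MEMDIE victim still does - which is
exactly the distinction the question above hinges on.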



Comments

Mikulas Patocka July 13, 2016, 2:18 p.m. UTC | #1
On Wed, 13 Jul 2016, Michal Hocko wrote:

> [CC David]
> 
> > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > they never exhausted reserved memory. With this commit, mempool 
> > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > process has PF_MEMALLOC, they can bypass all limits).
> > 
> > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing. 
> > It says
> > 
> >     If an oom killed thread calls mempool_alloc(), it is possible that it'll
> >     loop forever if there are no elements on the freelist since
> >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> >     oom conditions.
> 
> I haven't studied the patch very deeply so I might be missing something
> but from a quick look the patch does exactly what the above says.
> 
> mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> in the pool and so we have no fallback for the default __GFP_NORETRY
> request.

The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
to submit the swapping bio to the block driver. The device mapper driver 
uses mempools for all its I/O processing.

Prior to the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc 
never exhausted the reserved memory - it first tried to allocate with 
__GFP_NOMEMALLOC (thus preventing the allocator from allocating below the 
limits), then it tried to allocate from the mempool reserve, and if the 
mempool was exhausted, it waited until some structures were returned to the 
mempool.

After the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, __GFP_NOMEMALLOC 
is not used if the mempool is exhausted - and so repeated use of 
mempool_alloc (together with PF_MEMALLOC, which is implicitly set) can 
exhaust all available memory.
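
A compressed sketch of that before/after difference, based on the behaviour
described here and in the commit's changelog rather than copied from
mm/mempool.c (flag values are placeholders; curr_nr stands for the number of
reserved elements left in the pool):

#include <stdio.h>

#define __GFP_NOMEMALLOC	0x1u	/* placeholders, not the kernel's values */
#define __GFP_NORETRY		0x2u
#define __GFP_NOWARN		0x4u

/* Before f9054c70d28bc214: the reserves are always off limits. */
static unsigned int mempool_gfp_before(unsigned int gfp, int curr_nr)
{
	(void)curr_nr;
	return gfp | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
}

/* After f9054c70d28bc214: the reserves are shielded only while the pool
 * still has free elements, so an empty pool plus PF_MEMALLOC lets the
 * backing page allocation dig into the emergency reserves. */
static unsigned int mempool_gfp_after(unsigned int gfp, int curr_nr)
{
	gfp |= __GFP_NORETRY | __GFP_NOWARN;
	if (curr_nr > 0)
		gfp |= __GFP_NOMEMALLOC;
	return gfp;
}

int main(void)
{
	printf("empty pool, before: %#x\n", mempool_gfp_before(0, 0));
	printf("empty pool, after:  %#x\n", mempool_gfp_after(0, 0));
	return 0;
}

With an empty pool the post-f9054c70 mask no longer carries __GFP_NOMEMALLOC,
so a PF_MEMALLOC caller such as the swapout path above can dig into the
reserves until they are gone.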

The patch f9054c70d28bc214b2857cf8db8269f4f45a5e23 allows more parallelism 
(mempool_alloc waits less and proceeds more often), but the downside is 
that it exhausts all the memory. Bisection showed that those dm-crypt 
swapping failures were caused by that patch.

I think f9054c70d28bc214b2857cf8db8269f4f45a5e23 should be reverted - but 
first, we need to find out why swapping fails when all the memory is 
exhausted - that is a separate bug that should be addressed first.

> > but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> > memory reserves via the change below, can't we?

There are no mempool_alloc(__GFP_NOMEMALLOC) requests - mempool users don't 
use this flag.

Mikulas

Michal Hocko July 13, 2016, 2:56 p.m. UTC | #2
On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Jul 2016, Michal Hocko wrote:
> 
> > [CC David]
> > 
> > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > > they never exhausted reserved memory. With this commit, mempool 
> > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > > process has PF_MEMALLOC, they can bypass all limits).
> > > 
> > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing. 
> > > It says
> > > 
> > >     If an oom killed thread calls mempool_alloc(), it is possible that it'll
> > >     loop forever if there are no elements on the freelist since
> > >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > >     oom conditions.
> > 
> > I haven't studied the patch very deeply so I might be missing something
> > but from a quick look the patch does exactly what the above says.
> > 
> > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > in the pool and so we have no fallback for the default __GFP_NORETRY
> > request.
> 
> The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
> to submit the swapping bio to the block driver. The device mapper driver 
> uses mempools for all its I/O processing.

OK, this is the part I have missed. I didn't realize that the swapout
path, which is indeed PF_MEMALLOC, can get down to blk code which uses
mempools. A quick code traversal shows that at least
	make_request_fn = blk_queue_bio
	blk_queue_bio
	  get_request
	    __get_request

might do that. And in that case I agree that the above mentioned patch
has unintentional side effects and should be re-evaluated. David, what
do you think? An obvious fixup would be considering TIF_MEMDIE in
mempool_alloc explicitly.
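
One possible shape of such a fixup, sketched purely as an illustration of the
idea (not a proposed patch; the flag values are placeholders, as in the other
sketches in this thread):

#include <stdbool.h>

#define __GFP_NOMEMALLOC	0x1u	/* placeholders, not the kernel's values */
#define __GFP_NORETRY		0x2u
#define __GFP_NOWARN		0x4u

/* Keep the reserves off limits unless the pool is empty AND the calling
 * thread is the OOM victim (TIF_MEMDIE). */
static unsigned int mempool_gfp_tif_memdie(unsigned int gfp, int curr_nr,
					   bool tif_memdie)
{
	gfp |= __GFP_NORETRY | __GFP_NOWARN;
	if (curr_nr > 0 || !tif_memdie)
		gfp |= __GFP_NOMEMALLOC;
	return gfp;
}

int main(void)
{
	/* empty pool + OOM victim: the __GFP_NOMEMALLOC shield is dropped */
	return (mempool_gfp_tif_memdie(0, 0, true) & __GFP_NOMEMALLOC) ? 1 : 0;
}

Whether TIF_MEMDIE is a usable predicate for this is questioned further down
in the thread.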
Mikulas Patocka July 13, 2016, 3:11 p.m. UTC | #3
On Wed, 13 Jul 2016, Michal Hocko wrote:

> On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 13 Jul 2016, Michal Hocko wrote:
> > 
> > > [CC David]
> > > 
> > > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > > > they never exhausted reserved memory. With this commit, mempool 
> > > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > > > process has PF_MEMALLOC, they can bypass all limits).
> > > > 
> > > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > > > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing. 
> > > > It says
> > > > 
> > > >     If an oom killed thread calls mempool_alloc(), it is possible that it'll
> > > >     loop forever if there are no elements on the freelist since
> > > >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > > >     oom conditions.
> > > 
> > > I haven't studied the patch very deeply so I might be missing something
> > > but from a quick look the patch does exactly what the above says.
> > > 
> > > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > > in the pool and so we have no fallback for the default __GFP_NORETRY
> > > request.
> > 
> > The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
> > to submit the swapping bio to the block driver. The device mapper driver 
> > uses mempools for all its I/O processing.
> 
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
> 	make_request_fn = blk_queue_bio
> 	blk_queue_bio
> 	  get_request
> 	    __get_request
> 
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
tries to fix?

Do you have a stacktrace where it deadlocked, or was it just a theoretical 
consideration?

Mempool users generally (except for some flawed cases like fs_bio_set) do 
not require memory to proceed. So if you just loop in mempool_alloc, the 
processes that exhausted the mempool reserve will eventually return objects 
to the mempool and you should proceed.

If you can't proceed, it is a bug in the code that uses the mempool.
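
To make that forward-progress argument concrete, here is a small user-space
caricature of the mempool contract: the allocator blocks on an empty pool, and
an in-flight holder returning its element is what wakes the waiter. The names
and sizes are invented; this models only the wait/return handoff, not
mm/mempool.c:

#include <pthread.h>
#include <stdio.h>

#define POOL_MIN 2

static void *pool[POOL_MIN];
static int curr_nr;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_for_element = PTHREAD_COND_INITIALIZER;

static void *toy_mempool_alloc(void)
{
	void *elem;

	pthread_mutex_lock(&lock);
	while (curr_nr == 0)			/* pool empty: wait for a holder */
		pthread_cond_wait(&wait_for_element, &lock);
	elem = pool[--curr_nr];
	pthread_mutex_unlock(&lock);
	return elem;
}

static void toy_mempool_free(void *elem)
{
	pthread_mutex_lock(&lock);
	pool[curr_nr++] = elem;			/* the element comes back ... */
	pthread_cond_signal(&wait_for_element);	/* ... and unblocks a waiter */
	pthread_mutex_unlock(&lock);
}

static void *io_completion(void *elem)
{
	toy_mempool_free(elem);			/* the holder's promise: return it */
	return NULL;
}

int main(void)
{
	static int reserved[POOL_MIN];
	pthread_t t;
	void *a, *b, *c;

	for (curr_nr = 0; curr_nr < POOL_MIN; curr_nr++)
		pool[curr_nr] = &reserved[curr_nr];

	a = toy_mempool_alloc();
	b = toy_mempool_alloc();		/* the pool is now empty */
	pthread_create(&t, NULL, io_completion, a);
	c = toy_mempool_alloc();		/* blocks until 'a' comes back */
	printf("proceeded after the handoff: %p\n", c);
	pthread_join(t, NULL);
	toy_mempool_free(b);
	toy_mempool_free(c);
	return 0;
}

The disagreement later in the thread is about whether the io_completion() side
is guaranteed to run at all when the whole system is out of memory.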

Mikulas

David Rientjes July 13, 2016, 11:53 p.m. UTC | #4
On Wed, 13 Jul 2016, Mikulas Patocka wrote:

> What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> tries to fix?
> 

It prevents the whole system from livelocking due to an oom killed process 
stalling forever waiting for mempool_alloc() to return.  No other threads 
may be oom killed while waiting for it to exit.

> Do you have a stacktrace where it deadlocked, or was it just a theoretical 
> consideration?
> 

schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

> Mempool users generally (except for some flawed cases like fs_bio_set) do 
> not require memory to proceed. So if you just loop in mempool_alloc, the 
> > processes that exhausted the mempool reserve will eventually return objects 
> to the mempool and you should proceed.
> 

That's obviously not the case if we have hundreds of machines timing out 
after two hours waiting for that fault to succeed.  The mempool interface 
cannot require that users return elements to the pool synchronously with all 
allocators so that we can happily loop forever; the only requirement on 
the interface is that mempool_alloc() must succeed.  If the context of the 
thread doing mempool_alloc() allows access to memory reserves, this will 
always be allowed by the page allocator.  This is not a mempool problem.
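
For readers keeping track of the flags: "the context of the thread allows
access to memory reserves" corresponds, in the 4.7 page allocator, to roughly
the following check. This is a toy rendering with placeholder flag values and
the softirq branch omitted; the real test is visible in the removed lines of
the patch at the bottom of this page.

#include <stdbool.h>

#define __GFP_NOMEMALLOC	0x1u	/* placeholders, not the kernel's values */
#define __GFP_MEMALLOC		0x2u

/* Toy model of the 4.7-era decision: PF_MEMALLOC or TIF_MEMDIE gets below
 * the watermarks, but only while __GFP_NOMEMALLOC is not set - which is why
 * f9054c70d28bc214 drops that flag once the mempool runs dry. */
static bool may_use_reserves(unsigned int gfp, bool pf_memalloc,
			     bool tif_memdie, bool in_interrupt)
{
	if (gfp & __GFP_NOMEMALLOC)
		return false;
	if (gfp & __GFP_MEMALLOC)
		return true;
	return !in_interrupt && (pf_memalloc || tif_memdie);
}

int main(void)
{
	/* OOM victim inside mempool_alloc() on an empty pool after
	 * f9054c70d28bc214: no __GFP_NOMEMALLOC, TIF_MEMDIE set. */
	return may_use_reserves(0, false, true, false) ? 0 : 1;
}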

David Rientjes July 14, 2016, 12:01 a.m. UTC | #5
On Wed, 13 Jul 2016, Tetsuo Handa wrote:

> I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> It says
> 
>     If an oom killed thread calls mempool_alloc(), it is possible that it'll
>     loop forever if there are no elements on the freelist since
>     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
>     oom conditions.
> 
> but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> memory reserves via the change below, can't we? The purpose of allowing
> ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure the current allocation
> request does not loop forever inside the page allocator, isn't it?

This would defeat the purpose of __GFP_NOMEMALLOC for oom killed threads, 
so you'd need to demonstrate that isn't a problem for the current users 
and then change the semantics of the gfp flag.

> Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC) requests to use
> ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?
> 

mempool_alloc(__GFP_NOMEMALLOC) is forbidden.

Tetsuo Handa July 14, 2016, 11:01 a.m. UTC | #6
Michal Hocko wrote:
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
> 	make_request_fn = blk_queue_bio
> 	blk_queue_bio
> 	  get_request
> 	    __get_request
> 
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one thread,
there is no guarantee that TIF_MEMDIE is set on the thread that is looping
inside mempool_alloc(). And since __GFP_NORETRY is used (regardless of
f9054c70d28bc214), out_of_memory() is not called via __alloc_pages_may_oom().
This means that the thread which is looping inside mempool_alloc() can't
get TIF_MEMDIE unless TIF_MEMDIE is set by the OOM killer.

Maybe set __GFP_NOMEMALLOC by default at mempool_alloc() and remove it
at mempool_alloc() when fatal_signal_pending() is true? But that behavior
can OOM-kill somebody else when current was not OOM-killed. Sigh...
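
The fatal_signal_pending() variant floated above would look roughly like this
(same caveats as the other sketches in this thread: placeholder flag values,
illustration only, not a proposed patch):

#include <stdbool.h>

#define __GFP_NOMEMALLOC	0x1u	/* placeholders, not the kernel's values */
#define __GFP_NORETRY		0x2u
#define __GFP_NOWARN		0x4u

/* Shield the reserves by default; drop the shield only for a task that
 * already has a fatal signal pending (not necessarily the OOM victim). */
static unsigned int mempool_gfp_fatal_signal(unsigned int gfp, bool fatal_signal)
{
	gfp |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
	if (fatal_signal)
		gfp &= ~__GFP_NOMEMALLOC;
	return gfp;
}

int main(void)
{
	return (mempool_gfp_fatal_signal(0, false) & __GFP_NOMEMALLOC) ? 0 : 1;
}

The caveat above still applies: the task that drops the shield need not be the
one the OOM killer picked.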

David Rientjes wrote:
> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> 
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > tries to fix?
> > 
> 
> It prevents the whole system from livelocking due to an oom killed process 
> stalling forever waiting for mempool_alloc() to return.  No other threads 
> may be oom killed while waiting for it to exit.

Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y case.

Mikulas Patocka July 14, 2016, 12:27 p.m. UTC | #7
On Wed, 13 Jul 2016, David Rientjes wrote:

> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> 
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > tries to fix?
> > 
> 
> It prevents the whole system from livelocking due to an oom killed process 
> stalling forever waiting for mempool_alloc() to return.  No other threads 
> may be oom killed while waiting for it to exit.
> 
> > Do you have a stacktrace where it deadlocked, or was it just a theoretical 
> > consideration?
> > 
> 
> schedule
> schedule_timeout
> io_schedule_timeout
> mempool_alloc
> __split_and_process_bio
> dm_request
> generic_make_request
> submit_bio
> mpage_readpages
> ext4_readpages
> __do_page_cache_readahead
> ra_submit
> filemap_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault

Device mapper should be able to proceed if there is no available memory. 
If it doesn't proceed, there is a bug in it.

I'd like to ask - what device mapper targets did you use in this case? Are 
there any other deadlocked processes? (show sysrq-t, sysrq-w when this 
happened)

Did the machine lock up completely with that stacktrace, or was it just 
slowed down?

> > Mempool users generally (except for some flawed cases like fs_bio_set) do 
> > not require memory to proceed. So if you just loop in mempool_alloc, the 
> > processes that exhausted the mempool reserve will eventually return objects 
> > to the mempool and you should proceed.
> > 
> 
> That's obviously not the case if we have hundreds of machines timing out 
> after two hours waiting for that fault to succeed.  The mempool interface 
> > cannot require that users return elements to the pool synchronously with all 
> > allocators so that we can happily loop forever; the only requirement on 

Mempool users must return objects to the mempool.

> the interface is that mempool_alloc() must succeed.  If the context of the 
> thread doing mempool_alloc() allows access to memory reserves, this will 
> always be allowed by the page allocator.  This is not a mempool problem.

Mikulas

Mikulas Patocka July 14, 2016, 12:29 p.m. UTC | #8
On Thu, 14 Jul 2016, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > OK, this is the part I have missed. I didn't realize that the swapout
> > path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> > mempools. A quick code traversal shows that at least
> > 	make_request_fn = blk_queue_bio
> > 	blk_queue_bio
> > 	  get_request
> > 	    __get_request
> > 
> > might do that. And in that case I agree that the above mentioned patch
> > has unintentional side effects and should be re-evaluated. David, what
> > do you think? An obvious fixup would be considering TIF_MEMDIE in
> > mempool_alloc explicitly.
> 
> TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one thread,
> there is no guarantee that TIF_MEMDIE is set on the thread that is looping
> inside mempool_alloc().

If the device mapper subsystem is not returning objects to the mempool, it 
should be investigated as a bug in the device mapper.

There is no need to work around that bug in mempool_alloc.

Mikulas

> And since __GFP_NORETRY is used (regardless of
> f9054c70d28bc214), out_of_memory() is not called via __alloc_pages_may_oom().
> This means that the thread which is looping inside mempool_alloc() can't
> get TIF_MEMDIE unless TIF_MEMDIE is set by the OOM killer.
> 
> Maybe set __GFP_NOMEMALLOC by default at mempool_alloc() and remove it
> at mempool_alloc() when fatal_signal_pending() is true? But that behavior
> can OOM-kill somebody else when current was not OOM-killed. Sigh...
> 
> David Rientjes wrote:
> > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> > 
> > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > > tries to fix?
> > > 
> > 
> > It prevents the whole system from livelocking due to an oom killed process 
> > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > may be oom killed while waiting for it to exit.
> 
> Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y case.
> 

David Rientjes July 14, 2016, 8:22 p.m. UTC | #9
On Thu, 14 Jul 2016, Mikulas Patocka wrote:

> > schedule
> > schedule_timeout
> > io_schedule_timeout
> > mempool_alloc
> > __split_and_process_bio
> > dm_request
> > generic_make_request
> > submit_bio
> > mpage_readpages
> > ext4_readpages
> > __do_page_cache_readahead
> > ra_submit
> > filemap_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
> 
> Device mapper should be able to proceed if there is no available memory. 
> If it doesn't proceed, there is a bug in it.
> 

With pre-f9054c70d28b behavior, the above stack trace has nothing to do with 
the device mapper as such.  It is simply calling into mempool_alloc() and 
no elements are being returned to the mempool that would allow it to return.

Recall that in the above situation, the whole system is oom; nothing can 
allocate memory.  The oom killer has selected the above process to be oom 
killed, so all other processes on the system trying to allocate memory 
will stall in the page allocator waiting for this process to exit.

The natural response to this situation is to allow access to memory 
reserves, if possible, so that mempool_alloc() may return.  There is no 
guarantee that _anything_ can return memory to the mempool, especially in 
a system oom condition where nothing can make forward progress.  Insisting 
that it should be guaranteed is not helpful.

> I'd like to ask - what device mapper targets did you use in this case? Are 
> there any other deadlocked processes? (show sysrq-t, sysrq-w when this 
> happened)
> 

Every process on the system is deadlocked because none of them can get memory 
through the page allocator until the above process exits.  That is how the 
oom killer works: select a process, kill it, give it access to memory 
reserves so it may exit and free its memory, and wait.

> Did the machine lock up completely with that stacktrace, or was it just 
> slowed down?
> 

Hundreds of machines locked up and rebooted after a two-hour watchdog 
timeout.

> > That's obviously not the case if we have hundreds of machines timing out 
> > after two hours waiting for that fault to succeed.  The mempool interface 
> > cannot require that users return elements to the pool synchronously with all 
> > allocators so that we can happily loop forever; the only requirement on 
> 
> Mempool users must return objects to the mempool.
> 

Not possible when the system is livelocked.

Mikulas Patocka July 15, 2016, 11:21 a.m. UTC | #10
On Thu, 14 Jul 2016, David Rientjes wrote:

> There is no guarantee that _anything_ can return memory to the mempool,

You misunderstand mempools if you make such claims.

There is in fact a guarantee that objects will be returned to the mempool. In 
the past I reviewed device mapper thoroughly to make sure that it can make 
forward progress even if there is no available memory.

I don't know what I should tell you if you keep repeating the same 
false claim over and over again. Should I explain mempool operations to 
you in detail? Or will you figure it out on your own?

Mikulas

David Rientjes July 15, 2016, 9:25 p.m. UTC | #11
On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> > There is no guarantee that _anything_ can return memory to the mempool,
> 
> You misunderstand mempools if you make such claims.
> 
> There is in fact a guarantee that objects will be returned to the mempool. In 
> the past I reviewed device mapper thoroughly to make sure that it can make 
> forward progress even if there is no available memory.
> 
> I don't know what I should tell you if you keep repeating the same 
> false claim over and over again. Should I explain mempool operations to 
> you in detail? Or will you figure it out on your own?
> 

If you are talking about patches you're proposing for 4.8 or any guarantee 
of memory freeing that the oom killer/reaper will provide in 4.8, that's 
fine.  However, the state of the 4.7 kernel is the same as it was when I 
fixed this issue that timed out hundreds of our machines, and your claim is 
contradicted by that evidence.  Our machines time out after two hours with 
the oom victim looping forever in mempool_alloc(), so if there was a 
guarantee that elements would be returned in a completely livelocked 
kernel in 4.7 or earlier kernels, that would not have been the case.  I 
frankly don't care about your patch reviewing of dm mempool usage when 
dm_request() livelocked our kernel.

Feel free to formally propose patches either for 4.7 or 4.8.

Mikulas Patocka July 15, 2016, 9:39 p.m. UTC | #12
On Fri, 15 Jul 2016, David Rientjes wrote:

> On Fri, 15 Jul 2016, Mikulas Patocka wrote:
> 
> > > There is no guarantee that _anything_ can return memory to the mempool,
> > 
> > You misunderstand mempools if you make such claims.
> > 
> > There is in fact a guarantee that objects will be returned to the mempool. In 
> > the past I reviewed device mapper thoroughly to make sure that it can make 
> > forward progress even if there is no available memory.
> > 
> > I don't know what I should tell you if you keep repeating the same 
> > false claim over and over again. Should I explain mempool operations to 
> > you in detail? Or will you figure it out on your own?
> > 
> 
> If you are talking about patches you're proposing for 4.8 or any guarantee 
> of memory freeing that the oom killer/reaper will provide in 4.8, that's 
> fine.  However, the state of the 4.7 kernel is the same as it was when I 
> fixed this issue that timed out hundreds of our machines, and your claim is 
> contradicted by that evidence.  Our machines time out after two hours with 
> the oom victim looping forever in mempool_alloc(), so if there was a 

And what about the oom reaper? It should have freed all of the victim's pages 
even if the victim is looping in mempool_alloc. Why didn't the oom reaper 
free up the memory?

> guarantee that elements would be returned in a completely livelocked 
> kernel in 4.7 or earlier kernels, that would not have been the case.  I 

And what kind of targets do you use in device mapper in the configuration 
that livelocked? Do you use some custom google-developed drivers?

Please describe the whole stack of block I/O devices when this livelock 
happened.

Most device mapper drivers can really make forward progress when they are 
out of memory, so I'm interested in what kind of configuration you have.

> frankly don't care about your patch reviewing of dm mempool usage when 
> dm_request() livelocked our kernel.

If it livelocked, it is a bug in some underlying block driver, not a bug 
in mempool_alloc.

> Feel free to formally propose patches either for 4.7 or 4.8.

Mikulas

David Rientjes July 15, 2016, 9:58 p.m. UTC | #13
On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> And what about the oom reaper? It should have freed all of the victim's pages 
> even if the victim is looping in mempool_alloc. Why didn't the oom reaper 
> free up the memory?
> 

Is that possible with mlock or shared memory?  Nope.  The oom killer does 
not have the benefit of selecting a process to kill that will likely free 
the most memory or reap the most memory; the choice is configurable by the 
user.

> > guarantee that elements would be returned in a completely livelocked 
> > kernel in 4.7 or earlier kernels, that would not have been the case.  I 
> 
> And what kind of targets do you use in device mapper in the configuration 
> that livelocked? Do you use some custom google-developed drivers?
> 
> Please describe the whole stack of block I/O devices when this livelock 
> happened.
> 
> Most device mapper drivers can really make forward progress when they are 
> out of memory, so I'm interested in what kind of configuration you have.
> 

Kworkers are processing writeback; ext4_writepages() relies on kmem that 
is reclaiming memory itself through kmem_getpages(), and they are waiting 
on the oom victim to exit, so they endlessly loop in the page allocator 
themselves.  Same situation with __alloc_skb(), so we can intermittently 
lose access to hundreds of the machines over the network.  There are no 
custom drivers required for this to happen; the stack trace of the livelock 
victim has already been posted, and this can happen for anything in 
filemap_fault() that has TIF_MEMDIE set.

> > frankly don't care about your patch reviewing of dm mempool usage when 
> > dm_request() livelocked our kernel.
> 
> If it livelocked, it is a bug in some underlying block driver, not a bug 
> in mempool_alloc.
> 

Lol, the interface is quite clear and can be modified to allow mempool 
users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can 
guarantee elements will be returned to the freelist in all situations, 
including system oom situations.  We may revert that ourselves if our 
machines time out once we use a post-4.7 kernel and report that as 
necessary.

Mikulas Patocka July 15, 2016, 11:53 p.m. UTC | #14
On Fri, 15 Jul 2016, David Rientjes wrote:

> Kworkers are processing writeback; ext4_writepages() relies on kmem that 

ext4_writepages is above device mapper, not below, so how could it block 
device mapper progress?

Do you use device mapper on top of a block loop device? Writing to loop 
is prone to deadlock anyway; you should avoid that in production code.

> is reclaiming memory itself through kmem_getpages(), and they are waiting 
> on the oom victim to exit, so they endlessly loop in the page allocator 
> themselves.  Same situation with __alloc_skb(), so we can intermittently 
> lose access to hundreds of the machines over the network.  There are no 
> custom drivers required for this to happen; the stack trace of the livelock 
> victim has already been posted, and this can happen for anything in 
> filemap_fault() that has TIF_MEMDIE set.

Again - filemap_fault() is above device mapper, not below (unless you use 
loop).

> > > frankly don't care about your patch reviewing of dm mempool usage when 
> > > dm_request() livelocked our kernel.
> > 
> > If it livelocked, it is a bug in some underlying block driver, not a bug 
> > in mempool_alloc.
> > 
> 
> Lol, the interface is quite clear and can be modified to allow mempool 
> users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can 
> guarantee elements will be returned to the freelist in all situations, 

You still didn't post the configuration of your block stack, so I have no 
clue why entries are not returned to the mempool.

> including system oom situations.  We may revert that ourselves if our 
> machines time out once we use a post-4.7 kernel and report that as 
> necessary.

Mikulas

Johannes Weiner July 18, 2016, 3:14 p.m. UTC | #15
CC Dave Chinner, who I recall had strong opinions on the mempool model

The context is commit f9054c7 ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements"), which gives MEMALLOC/TIF_MEMDIE mempool
allocations access to the system emergency reserves when there is no
reserved object currently residing in the mempool.

On Fri, Jul 15, 2016 at 07:21:59AM -0400, Mikulas Patocka wrote:
> On Thu, 14 Jul 2016, David Rientjes wrote:
> 
> > There is no guarantee that _anything_ can return memory to the mempool,
> 
> You misunderstand mempools if you make such claims.

Uhm, fully agreed.

The point of mempools is that they have their own reserves, separate
from the system reserves, to make forward progress in OOM situations.

All mempool object holders promise to make forward progress, and when
memory is depleted, the mempool allocations serialize against each
other. In this case, every allocation has to wait for in-flight IO to
finish to pass the reserved object on to the next IO. That's how the
mempool model is designed. The commit in question breaks this by not
waiting for outstanding object holders and instead quickly depleting
the system reserves. That's a mempool causing a memory deadlock...

David observed systems hanging 2+h inside mempool allocations. But
where would an object holder get stuck? It can't be taking a lock
that the waiting mempool_alloc() is holding, obviously. It also can't
be waiting for another allocation; it makes no sense to use mempools
to guarantee forward progress, but then have the whole sequence rely
on an unguaranteed allocation to succeed after the mempool ones. So
how could a system-wide OOM situation cause a mempool holder to hang?

These hangs are fishy, but it seems reasonable to assume that somebody
is breaking the mempool contract somewhere. The solution cannot be
to abandon the mempool model. f9054c7 should be reverted.


Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b69..e4e3700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3439,14 +3439,14 @@  gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+	if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+	else if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA