Message ID | 2d5e1f84-e886-7b98-cb11-170d7104fd13@I-love.SAKURA.ne.jp (mailing list archive)
---|---
State | Not Applicable, archived
Delegated to: | Mike Snitzer

On Wed, 13 Jul 2016, Michal Hocko wrote:

> [CC David]
>
> > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so
> > > they never exhausted reserved memory. With this commit, mempool
> > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the
> > > process has PF_MEMALLOC, they can bypass all limits).
> >
> > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> > It says
> >
> >   If an oom killed thread calls mempool_alloc(), it is possible that it'll
> >   loop forever if there are no elements on the freelist since
> >   __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> >   oom conditions.
>
> I haven't studied the patch very deeply so I might be missing something
> but from a quick look the patch does exactly what the above says.
>
> mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> in the pool and so we have no fallback for the default __GFP_NORETRY
> request.

The swapper core sets the flag PF_MEMALLOC and calls generic_make_request
to submit the swapping bio to the block driver. The device mapper driver
uses mempools for all its I/O processing.

Prior to the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
never exhausted the reserved memory - it first tried to allocate with
__GFP_NOMEMALLOC (thus preventing the allocator from allocating below the
limits), then it tried to allocate from the mempool reserve, and if the
mempool was exhausted, it waited until some structures were returned to
the mempool.

After the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, __GFP_NOMEMALLOC
is not used if the mempool is exhausted - and so repeated use of
mempool_alloc (together with PF_MEMALLOC, which is implicitly set) can
exhaust all available memory.

The patch f9054c70d28bc214b2857cf8db8269f4f45a5e23 allows more parallelism
(mempool_alloc waits less and proceeds more often), but the downside is
that it exhausts all the memory.

Bisection showed that those dm-crypt swapping failures were caused by that
patch. I think f9054c70d28bc214b2857cf8db8269f4f45a5e23 should be reverted
- but first, we need to find out why swapping fails when all the memory is
exhausted; that is a separate bug that should be addressed first.

> > but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> > memory reserves via below change, can't we?

There are no mempool_alloc(__GFP_NOMEMALLOC) requests - mempool users
don't use this flag.

Mikulas

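To make the before/after behaviour above concrete, the allocation order can
be sketched like this (a paraphrase of the mm/mempool.c logic of that era,
not the verbatim kernel code; the retry loop and error handling are omitted):

#include <linux/mempool.h>
#include <linux/gfp.h>

/* Paraphrased sketch of mempool_alloc()'s allocation order - not verbatim. */
static void *mempool_alloc_order_sketch(mempool_t *pool, gfp_t gfp_mask)
{
	void *element;
	gfp_t gfp_temp = gfp_mask | __GFP_NORETRY | __GFP_NOWARN;

	/*
	 * Before f9054c70d28b: __GFP_NOMEMALLOC was added unconditionally,
	 * so the first attempt could never dip into the emergency reserves,
	 * even for a PF_MEMALLOC / TIF_MEMDIE caller.
	 *
	 * After f9054c70d28b: __GFP_NOMEMALLOC is only added while the pool
	 * still has free elements, so once the pool is empty a PF_MEMALLOC
	 * caller is allowed below the watermarks.
	 */
	if (pool->curr_nr)		/* the post-patch condition */
		gfp_temp |= __GFP_NOMEMALLOC;

	/* 1) try the regular allocator */
	element = pool->alloc(gfp_temp, pool->pool_data);
	if (element)
		return element;

	/*
	 * 2) otherwise take one of the pool's reserved elements, and
	 * 3) if the pool is empty, sleep until mempool_free() returns an
	 *    element, then retry from the top (not shown here).
	 */
	return NULL;
}
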
On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> On Wed, 13 Jul 2016, Michal Hocko wrote:
>
> > [CC David]
> >
> > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so
> > > > they never exhausted reserved memory. With this commit, mempool
> > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the
> > > > process has PF_MEMALLOC, they can bypass all limits).
> > >
> > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> > > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> > > It says
> > >
> > >   If an oom killed thread calls mempool_alloc(), it is possible that it'll
> > >   loop forever if there are no elements on the freelist since
> > >   __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > >   oom conditions.
> >
> > I haven't studied the patch very deeply so I might be missing something
> > but from a quick look the patch does exactly what the above says.
> >
> > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > in the pool and so we have no fallback for the default __GFP_NORETRY
> > request.
>
> The swapper core sets the flag PF_MEMALLOC and calls generic_make_request
> to submit the swapping bio to the block driver. The device mapper driver
> uses mempools for all its I/O processing.

OK, this is the part I have missed. I didn't realize that the swapout
path, which is indeed PF_MEMALLOC, can get down to blk code which uses
mempools. A quick code traversal shows that at least

  make_request_fn = blk_queue_bio
    blk_queue_bio
      get_request
        __get_request

might do that. And in that case I agree that the above mentioned patch
has unintentional side effects and should be re-evaluated. David, what
do you think? An obvious fixup would be considering TIF_MEMDIE in
mempool_alloc explicitly.

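For reference, the PF_MEMALLOC bracket that puts the swapout path into this
situation looks roughly like the following condensed sketch of direct
reclaim (the flag handling and the try_to_free_pages() call are real; the
call-chain comment is only a summary of the paths named in this thread):

#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/topology.h>

/* Condensed sketch of direct reclaim, not verbatim mm/page_alloc.c code. */
static unsigned long perform_reclaim_sketch(gfp_t gfp_mask, int order)
{
	unsigned long progress;

	current->flags |= PF_MEMALLOC;

	/*
	 * try_to_free_pages() -> shrink_page_list() -> pageout()
	 *   -> swap_writepage() -> generic_make_request()
	 *     -> dm_request() -> __split_and_process_bio() -> mempool_alloc()
	 *     -> blk_queue_bio() -> get_request() -> __get_request()
	 *                                              -> mempool_alloc()
	 *
	 * Both mempool_alloc() calls run while the task still has
	 * PF_MEMALLOC set.
	 */
	progress = try_to_free_pages(node_zonelist(numa_node_id(), gfp_mask),
				     order, gfp_mask, NULL);

	current->flags &= ~PF_MEMALLOC;
	return progress;
}
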
On Wed, 13 Jul 2016, Michal Hocko wrote:

> On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> > On Wed, 13 Jul 2016, Michal Hocko wrote:
> >
> > > [CC David]
> > >
> > > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> > > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so
> > > > > they never exhausted reserved memory. With this commit, mempool
> > > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the
> > > > > process has PF_MEMALLOC, they can bypass all limits).
> > > >
> > > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> > > > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> > > > It says
> > > >
> > > >   If an oom killed thread calls mempool_alloc(), it is possible that it'll
> > > >   loop forever if there are no elements on the freelist since
> > > >   __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > > >   oom conditions.
> > >
> > > I haven't studied the patch very deeply so I might be missing something
> > > but from a quick look the patch does exactly what the above says.
> > >
> > > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > > in the pool and so we have no fallback for the default __GFP_NORETRY
> > > request.
> >
> > The swapper core sets the flag PF_MEMALLOC and calls generic_make_request
> > to submit the swapping bio to the block driver. The device mapper driver
> > uses mempools for all its I/O processing.
>
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
>
>   make_request_fn = blk_queue_bio
>     blk_queue_bio
>       get_request
>         __get_request
>
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23
tries to fix?

Do you have a stacktrace where it deadlocked, or was it just a theoretical
consideration?

Mempool users generally (except for some flawed cases like fs_bio_set) do
not require memory to proceed. So if you just loop in mempool_alloc, the
processes that exhausted the mempool reserve will eventually return
objects to the mempool and you should proceed. If you can't proceed, it is
a bug in the code that uses the mempool.

Mikulas

On Wed, 13 Jul 2016, Mikulas Patocka wrote:

> What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23
> tries to fix?

It prevents the whole system from livelocking due to an oom killed process
stalling forever waiting for mempool_alloc() to return. No other threads
may be oom killed while waiting for it to exit.

> Do you have a stacktrace where it deadlocked, or was it just a theoretical
> consideration?

schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

> Mempool users generally (except for some flawed cases like fs_bio_set) do
> not require memory to proceed. So if you just loop in mempool_alloc, the
> processes that exhausted the mempool reserve will eventually return
> objects to the mempool and you should proceed.

That's obviously not the case if we have hundreds of machines timing out
after two hours waiting for that fault to succeed. The mempool interface
cannot require that users return elements to the pool synchronously with
all allocators so that we can happily loop forever; the only requirement on
the interface is that mempool_alloc() must succeed. If the context of the
thread doing mempool_alloc() allows access to memory reserves, this will
always be allowed by the page allocator. This is not a mempool problem.

On Wed, 13 Jul 2016, Tetsuo Handa wrote:

> I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> It says
>
>   If an oom killed thread calls mempool_alloc(), it is possible that it'll
>   loop forever if there are no elements on the freelist since
>   __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
>   oom conditions.
>
> but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> memory reserves via below change, can't we? The purpose of allowing
> ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure the current allocation
> request does not loop forever inside the page allocator, isn't it?

This would defeat the purpose of __GFP_NOMEMALLOC for oom killed threads,
so you'd need to demonstrate that isn't a problem for the current users
and then change the semantics of the gfp flag.

> Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC) requests to use
> ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?

mempool_alloc(__GFP_NOMEMALLOC) is forbidden.

Michal Hocko wrote:
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
>
>   make_request_fn = blk_queue_bio
>     blk_queue_bio
>       get_request
>         __get_request
>
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one
thread, there is no guarantee that TIF_MEMDIE is set on the thread which
is looping inside mempool_alloc(). And since __GFP_NORETRY is used
(regardless of f9054c70d28bc214), out_of_memory() is not called via
__alloc_pages_may_oom(). This means that the thread which is looping
inside mempool_alloc() can't get TIF_MEMDIE unless TIF_MEMDIE is set by
the OOM killer.

Maybe set __GFP_NOMEMALLOC by default in mempool_alloc() and remove it
when fatal_signal_pending() is true? But that behavior can OOM-kill
somebody else when current was not OOM-killed. Sigh...

David Rientjes wrote:
> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
>
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23
> > tries to fix?
>
> It prevents the whole system from livelocking due to an oom killed process
> stalling forever waiting for mempool_alloc() to return. No other threads
> may be oom killed while waiting for it to exit.

Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y
case.

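Purely as an illustration, the fatal_signal_pending() idea above might look
something like this inside mempool_alloc() (a sketch only, not a patch that
was posted in this thread):

#include <linux/gfp.h>
#include <linux/sched.h>

/* Illustrative sketch of the suggestion above - not a proposed patch. */
static gfp_t mempool_gfp_adjust_sketch(gfp_t gfp_mask)
{
	gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;

	if (fatal_signal_pending(current))
		/* dying (e.g. OOM-killed) task: let it use the reserves */
		gfp_mask &= ~__GFP_NOMEMALLOC;
	else
		/* default: never consume the emergency reserves */
		gfp_mask |= __GFP_NOMEMALLOC;

	return gfp_mask;
}
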
On Wed, 13 Jul 2016, David Rientjes wrote:

> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
>
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23
> > tries to fix?
>
> It prevents the whole system from livelocking due to an oom killed process
> stalling forever waiting for mempool_alloc() to return. No other threads
> may be oom killed while waiting for it to exit.
>
> > Do you have a stacktrace where it deadlocked, or was it just a theoretical
> > consideration?
>
> schedule
> schedule_timeout
> io_schedule_timeout
> mempool_alloc
> __split_and_process_bio
> dm_request
> generic_make_request
> submit_bio
> mpage_readpages
> ext4_readpages
> __do_page_cache_readahead
> ra_submit
> filemap_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault

Device mapper should be able to proceed if there is no available memory.
If it doesn't proceed, there is a bug in it.

I'd like to ask - what device mapper targets did you use in this case? Are
there some other deadlocked processes? (Show sysrq-t and sysrq-w output
from when this happened.)

Did the machine lock up completely with that stacktrace, or was it just
slowed down?

> > Mempool users generally (except for some flawed cases like fs_bio_set) do
> > not require memory to proceed. So if you just loop in mempool_alloc, the
> > processes that exhausted the mempool reserve will eventually return
> > objects to the mempool and you should proceed.
>
> That's obviously not the case if we have hundreds of machines timing out
> after two hours waiting for that fault to succeed. The mempool interface
> cannot require that users return elements to the pool synchronously with
> all allocators so that we can happily loop forever; the only requirement on

Mempool users must return objects to the mempool.

> the interface is that mempool_alloc() must succeed. If the context of the
> thread doing mempool_alloc() allows access to memory reserves, this will
> always be allowed by the page allocator. This is not a mempool problem.

Mikulas

On Thu, 14 Jul 2016, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > OK, this is the part I have missed. I didn't realize that the swapout
> > path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> > mempools. A quick code traversal shows that at least
> >
> >   make_request_fn = blk_queue_bio
> >     blk_queue_bio
> >       get_request
> >         __get_request
> >
> > might do that. And in that case I agree that the above mentioned patch
> > has unintentional side effects and should be re-evaluated. David, what
> > do you think? An obvious fixup would be considering TIF_MEMDIE in
> > mempool_alloc explicitly.
>
> TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one
> thread, there is no guarantee that TIF_MEMDIE is set on the thread which
> is looping inside mempool_alloc().

If the device mapper subsystem is not returning objects to the mempool, it
should be investigated as a bug in the device mapper. There is no need to
add workarounds to mempool_alloc to work around that bug.

Mikulas

> And since __GFP_NORETRY is used (regardless of f9054c70d28bc214),
> out_of_memory() is not called via __alloc_pages_may_oom(). This means
> that the thread which is looping inside mempool_alloc() can't get
> TIF_MEMDIE unless TIF_MEMDIE is set by the OOM killer.
>
> Maybe set __GFP_NOMEMALLOC by default in mempool_alloc() and remove it
> when fatal_signal_pending() is true? But that behavior can OOM-kill
> somebody else when current was not OOM-killed. Sigh...
>
> David Rientjes wrote:
> > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> >
> > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23
> > > tries to fix?
> >
> > It prevents the whole system from livelocking due to an oom killed process
> > stalling forever waiting for mempool_alloc() to return. No other threads
> > may be oom killed while waiting for it to exit.
>
> Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y
> case.

On Thu, 14 Jul 2016, Mikulas Patocka wrote:

> > schedule
> > schedule_timeout
> > io_schedule_timeout
> > mempool_alloc
> > __split_and_process_bio
> > dm_request
> > generic_make_request
> > submit_bio
> > mpage_readpages
> > ext4_readpages
> > __do_page_cache_readahead
> > ra_submit
> > filemap_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
>
> Device mapper should be able to proceed if there is no available memory.
> If it doesn't proceed, there is a bug in it.

The above stack trace has nothing to do with the device mapper with
pre-f9054c70d28b behavior. It simply is calling into mempool_alloc() and
no elements are being returned to the mempool that would allow it to
return.

Recall that in the above situation, the whole system is oom; nothing can
allocate memory. The oom killer has selected the above process to be oom
killed, so all other processes on the system trying to allocate memory
will stall in the page allocator waiting for this process to exit. The
natural response to this situation is to allow access to memory reserves,
if possible, so that mempool_alloc() may return.

There is no guarantee that _anything_ can return memory to the mempool,
especially in a system oom condition where nothing can make forward
progress. Insisting that should be guaranteed is not helpful.

> I'd like to ask - what device mapper targets did you use in this case? Are
> there some other deadlocked processes? (Show sysrq-t and sysrq-w output
> from when this happened.)

Every process on the system is deadlocked because they cannot get memory
through the page allocator until the above process exits. That is how the
oom killer works: select a process, kill it, give it access to memory
reserves so it may exit and free its memory, and wait.

> Did the machine lock up completely with that stacktrace, or was it just
> slowed down?

Hundreds of machines locked up and rebooted after a two hour watchdog
timeout.

> > That's obviously not the case if we have hundreds of machines timing out
> > after two hours waiting for that fault to succeed. The mempool interface
> > cannot require that users return elements to the pool synchronously with
> > all allocators so that we can happily loop forever; the only requirement
> > on the interface is that mempool_alloc() must succeed.
>
> Mempool users must return objects to the mempool.

Not possible when the system is livelocked.

On Thu, 14 Jul 2016, David Rientjes wrote:
> There is no guarantee that _anything_ can return memory to the mempool,
You misunderstand mempools if you make such claims.
There is in fact a guarantee that objects will be returned to the mempool.
In the past I reviewed device mapper thoroughly to make sure that it can
make forward progress even if there is no available memory.
I don't know what I should tell you if you keep repeating the same false
claim over and over again. Should I explain mempool operations to you in
detail? Or will you find it on your own?

Mikulas
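
The forward-progress contract being argued about here can be illustrated
with the usual mempool usage pattern (an illustrative sketch only: struct
io_ctx, io_pool, submit_one_io() and process_io() are made-up names; only
the mempool_*/bio_* calls are the real API):

#include <linux/bio.h>
#include <linux/mempool.h>

struct io_ctx {
	struct bio *bio;
	/* ... per-I/O bookkeeping ... */
};

static mempool_t *io_pool;			/* e.g. mempool_create_slab_pool() */
static void process_io(struct io_ctx *ctx);	/* hypothetical I/O kick-off */

static void submit_one_io(struct bio *bio)
{
	/* May sleep until an element is available, but never fails. */
	struct io_ctx *ctx = mempool_alloc(io_pool, GFP_NOIO);

	ctx->bio = bio;
	process_io(ctx);
}

/* Runs in the I/O completion path, which allocates no new memory ... */
static void complete_one_io(struct io_ctx *ctx)
{
	bio_endio(ctx->bio);
	mempool_free(ctx, io_pool);	/* ... so the element always comes back */
}
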
On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> > There is no guarantee that _anything_ can return memory to the mempool,
>
> You misunderstand mempools if you make such claims.
>
> There is in fact a guarantee that objects will be returned to the mempool.
> In the past I reviewed device mapper thoroughly to make sure that it can
> make forward progress even if there is no available memory.
>
> I don't know what I should tell you if you keep repeating the same false
> claim over and over again. Should I explain mempool operations to you in
> detail? Or will you find it on your own?

If you are talking about patches you're proposing for 4.8 or any guarantee
of memory freeing that the oom killer/reaper will provide in 4.8, that's
fine. However, the state of the 4.7 kernel is the same as it was when I
fixed this issue that timed out hundreds of our machines, and it is
contradicted by that evidence. Our machines time out after two hours with
the oom victim looping forever in mempool_alloc(), so if there was a
guarantee that elements would be returned in a completely livelocked
kernel in 4.7 or earlier kernels, that would not have been the case.

I frankly don't care about your patch reviewing of dm mempool usage when
dm_request() livelocked our kernel. Feel free to formally propose patches
either for 4.7 or 4.8.

On Fri, 15 Jul 2016, David Rientjes wrote:

> On Fri, 15 Jul 2016, Mikulas Patocka wrote:
>
> > > There is no guarantee that _anything_ can return memory to the mempool,
> >
> > You misunderstand mempools if you make such claims.
> >
> > There is in fact a guarantee that objects will be returned to the mempool.
> > In the past I reviewed device mapper thoroughly to make sure that it can
> > make forward progress even if there is no available memory.
> >
> > I don't know what I should tell you if you keep repeating the same false
> > claim over and over again. Should I explain mempool operations to you in
> > detail? Or will you find it on your own?
>
> If you are talking about patches you're proposing for 4.8 or any guarantee
> of memory freeing that the oom killer/reaper will provide in 4.8, that's
> fine. However, the state of the 4.7 kernel is the same as it was when I
> fixed this issue that timed out hundreds of our machines, and it is
> contradicted by that evidence. Our machines time out after two hours with
> the oom victim looping forever in mempool_alloc(), so if there was a

And what about the oom reaper? It should have freed all the victim's pages
even if the victim is looping in mempool_alloc. Why didn't the oom reaper
free up memory?

> guarantee that elements would be returned in a completely livelocked
> kernel in 4.7 or earlier kernels, that would not have been the case. I

And what kind of targets do you use in device mapper in the configuration
that livelocked? Do you use some custom google-developed drivers?

Please describe the whole stack of block I/O devices when this livelock
happened.

Most device mapper drivers can really make forward progress when they are
out of memory, so I'm interested in what kind of configuration you have.

> frankly don't care about your patch reviewing of dm mempool usage when
> dm_request() livelocked our kernel.

If it livelocked, it is a bug in some underlying block driver, not a bug
in mempool_alloc.

> Feel free to formally propose patches either for 4.7 or 4.8.

Mikulas

On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> And what about the oom reaper? It should have freed all the victim's pages
> even if the victim is looping in mempool_alloc. Why didn't the oom reaper
> free up memory?

Is that possible with mlock or shared memory? Nope. The oom killer does
not have the benefit of selecting a process to kill that will likely free
the most memory or reap the most memory; the choice is configurable by the
user.

> > guarantee that elements would be returned in a completely livelocked
> > kernel in 4.7 or earlier kernels, that would not have been the case. I
>
> And what kind of targets do you use in device mapper in the configuration
> that livelocked? Do you use some custom google-developed drivers?
>
> Please describe the whole stack of block I/O devices when this livelock
> happened.
>
> Most device mapper drivers can really make forward progress when they are
> out of memory, so I'm interested in what kind of configuration you have.

Kworkers are processing writeback, ext4_writepages() relies on kmem that
is reclaiming memory itself through kmem_getpages(), and they are waiting
on the oom victim to exit, so they endlessly loop in the page allocator
themselves. Same situation with __alloc_skb(), so we can intermittently
lose access to hundreds of the machines over the network. There are no
custom drivers required for this to happen; the stack trace of the
livelock victim has already been posted, and this can happen for anything
in filemap_fault() that has TIF_MEMDIE set.

> > frankly don't care about your patch reviewing of dm mempool usage when
> > dm_request() livelocked our kernel.
>
> If it livelocked, it is a bug in some underlying block driver, not a bug
> in mempool_alloc.

Lol, the interface is quite clear and can be modified to allow mempool
users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can
guarantee elements will be returned to the freelist in all situations,
including system oom situations. We may revert that ourselves if our
machines time out once we use a post-4.7 kernel and report that as
necessary.

On Fri, 15 Jul 2016, David Rientjes wrote:

> Kworkers are processing writeback, ext4_writepages() relies on kmem that

ext4_writepages is above device mapper, not below, so how could it block
device mapper progress?

Do you use device mapper on top of a block loop device? Writing to loop is
prone to deadlock anyway; you should avoid that in production code.

> is reclaiming memory itself through kmem_getpages(), and they are waiting
> on the oom victim to exit, so they endlessly loop in the page allocator
> themselves. Same situation with __alloc_skb(), so we can intermittently
> lose access to hundreds of the machines over the network. There are no
> custom drivers required for this to happen; the stack trace of the
> livelock victim has already been posted, and this can happen for anything
> in filemap_fault() that has TIF_MEMDIE set.

Again - filemap_fault() is above device mapper, not below (unless you use
loop).

> > > frankly don't care about your patch reviewing of dm mempool usage when
> > > dm_request() livelocked our kernel.
> >
> > If it livelocked, it is a bug in some underlying block driver, not a bug
> > in mempool_alloc.
>
> Lol, the interface is quite clear and can be modified to allow mempool
> users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can
> guarantee elements will be returned to the freelist in all situations,

You still didn't post the configuration of your block stack, so I have no
clue why entries are not returned to the mempool.

> including system oom situations. We may revert that ourselves if our
> machines time out once we use a post-4.7 kernel and report that as
> necessary.

Mikulas

CC Dave Chinner, who I recall had strong opinions on the mempool model

The context is commit f9054c7 ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements"), which gives MEMALLOC/TIF_MEMDIE mempool
allocations access to the system emergency reserves when there is no
reserved object currently residing in the mempool.

On Fri, Jul 15, 2016 at 07:21:59AM -0400, Mikulas Patocka wrote:
> On Thu, 14 Jul 2016, David Rientjes wrote:
> > There is no guarantee that _anything_ can return memory to the mempool,
>
> You misunderstand mempools if you make such claims.

Uhm, fully agreed.

The point of mempools is that they have their own reserves, separate from
the system reserves, to make forward progress in OOM situations. All
mempool object holders promise to make forward progress, and when memory
is depleted, the mempool allocations serialize against each other. In
this case, every allocation has to wait for in-flight IO to finish to
pass the reserved object on to the next IO. That's how the mempool model
is designed.

The commit in question breaks this by not waiting for outstanding object
holders and instead quickly depleting the system reserves. That's a
mempool causing a memory deadlock...

David observed systems hanging 2+h inside mempool allocations. But where
would an object holder get stuck? It can't be taking a lock that the
waiting mempool_alloc() is holding, obviously. It also can't be waiting
for another allocation: it makes no sense to use mempools to guarantee
forward progress, but then have the whole sequence rely on an unguaranteed
allocation to succeed after the mempool ones. So how could a system-wide
OOM situation cause a mempool holder to hang?

These hangs are fishy, but it seems reasonable to assume that somebody is
breaking the mempool contract somewhere. The solution cannot be to abandon
the mempool model. f9054c7 should be reverted.

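The serialization described above happens through the pool's wait queue;
roughly, it works as in the following simplified sketch, which follows the
shape of mm/mempool.c but is not the verbatim source:

#include <linux/mempool.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Simplified sketch of the wait/wake handoff - not verbatim mm/mempool.c. */
static void *mempool_alloc_wait_sketch(mempool_t *pool, gfp_t gfp_mask)
{
	unsigned long flags;
	void *element;
	DEFINE_WAIT(wait);

	for (;;) {
		spin_lock_irqsave(&pool->lock, flags);
		if (pool->curr_nr) {
			/* hand out one of the reserved elements */
			element = pool->elements[--pool->curr_nr];
			spin_unlock_irqrestore(&pool->lock, flags);
			return element;
		}

		/* pool empty: wait for a holder to call mempool_free() */
		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
		spin_unlock_irqrestore(&pool->lock, flags);
		io_schedule_timeout(5 * HZ);
		finish_wait(&pool->wait, &wait);
	}
}

static void mempool_free_wake_sketch(void *element, mempool_t *pool)
{
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	pool->elements[pool->curr_nr++] = element;	/* return the object */
	spin_unlock_irqrestore(&pool->lock, flags);
	wake_up(&pool->wait);			/* pass it to the next waiter */
}
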
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b69..e4e3700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3439,14 +3439,14 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+	if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+	else if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA