Message ID | 20160721145309.GR26379@dhcp22.suse.cz (mailing list archive)
---|---
State | Not Applicable, archived
Delegated to: | Mike Snitzer
On Thu, Jul 21, 2016 at 04:53:10PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
>
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>
> There has been a report about the OOM killer being invoked when swapping
> out to a dm-crypt device. The primary reason seems to be that the
> swapout IO managed to completely deplete the memory reserves. Ondrej was
> able to bisect and explain the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
>
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path,
> which means it allocates from the PF_MEMALLOC context. The dm layer uses
> mempool_alloc in order to guarantee forward progress, which used to
> inhibit access to the memory reserves when using the page allocator.
> This changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
> if there are free elements"), which dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
>
> If we are running out of memory and the only way to free memory is to
> perform swapout, we just keep consuming the memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete, up to the moment when memory is depleted completely and there
> is no way forward but to invoke the OOM killer. This is less than
> optimal.
>
> The original intention of f9054c70d28b was to help with OOM situations
> where the OOM victim depends on mempool allocations to make forward
> progress.
>
> David has mentioned the following backtrace:
>
>   schedule
>   schedule_timeout
>   io_schedule_timeout
>   mempool_alloc
>   __split_and_process_bio
>   dm_request
>   generic_make_request
>   submit_bio
>   mpage_readpages
>   ext4_readpages
>   __do_page_cache_readahead
>   ra_submit
>   filemap_fault
>   handle_mm_fault
>   __do_page_fault
>   do_page_fault
>   page_fault
>
> We do not know more about why the mempool is depleted without being
> replenished in time, though. In any case, the dm layer shouldn't depend
> on any allocations outside of its dedicated pools, so forward progress
> should be guaranteed. If this is not the case, then dm should be fixed
> rather than papering over the problem and postponing it by accessing
> more memory reserves.
>
> mempools are a mechanism to maintain dedicated memory reserves to
> guarantee forward progress. Allowing them unbounded access to the page
> allocator's memory reserves goes against the whole purpose of this
> mechanism.
>
> Bisected-by: Ondrej Kozina <okozina@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks
Michal

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Jul 22 2016, Michal Hocko wrote:
> On Thu 21-07-16 08:13:00, Johannes Weiner wrote:
>> On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
>> > Look, there are
>> > $ git grep mempool_alloc | wc -l
>> > 304
>> >
>> > many users of this API and we do not want to flip a default behavior
>> > that has been there for more than 10 years. So far you have been
>> > arguing about potential deadlocks and haven't shown any particular
>> > path that would have a direct or indirect dependency between the
>> > mempool and the normal allocator and wouldn't be a bug. As a matter
>> > of fact, the change we are discussing here causes a regression. If
>> > you want to change the semantics of the mempool allocator, you are
>> > absolutely free to do so, in a separate patch discussed with the IO
>> > people and other users. But we _absolutely_ want to fix the
>> > regression first and have a simple fix for the 4.6 and 4.7
>> > backports. At this moment there are the revert and patch 1 on the
>> > table. The latter should make your backtrace happy and should serve
>> > only as a temporary fix until we find out what is actually
>> > misbehaving on your systems. If you are not interested in pursuing
>> > that way, I will simply go with the revert.
>>
>> +1
>>
>> It's very unlikely that decade-old mempool semantics are suddenly a
>> fundamental livelock problem, when all the evidence we have is one
>> hang and vague speculation. Given that the patch causes regressions,
>> and that the bug is most likely elsewhere anyway, a full revert rather
>> than merely-less-invasive mempool changes makes the most sense to me.
>
> OK, fair enough. What do you think about the following then? Mikulas, I
> have dropped your Tested-by and Reviewed-by because the patch is
> different, but unless you have hit the OOM killer the testing results
> should be the same.
> ---
> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
>
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>
> There has been a report about the OOM killer being invoked when swapping
> out to a dm-crypt device. The primary reason seems to be that the
> swapout IO managed to completely deplete the memory reserves. Ondrej was
> able to bisect and explain the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
>
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path,
> which means it allocates from the PF_MEMALLOC context. The dm layer uses
> mempool_alloc in order to guarantee forward progress, which used to
> inhibit access to the memory reserves when using the page allocator.
> This changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
> if there are free elements"), which dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
>
> If we are running out of memory and the only way to free memory is to
> perform swapout, we just keep consuming the memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete, up to the moment when memory is depleted completely and there
> is no way forward but to invoke the OOM killer. This is less than
> optimal.
>
> The original intention of f9054c70d28b was to help with OOM situations
> where the OOM victim depends on mempool allocations to make forward
> progress.
> David has mentioned the following backtrace:
>
>   schedule
>   schedule_timeout
>   io_schedule_timeout
>   mempool_alloc
>   __split_and_process_bio
>   dm_request
>   generic_make_request
>   submit_bio
>   mpage_readpages
>   ext4_readpages
>   __do_page_cache_readahead
>   ra_submit
>   filemap_fault
>   handle_mm_fault
>   __do_page_fault
>   do_page_fault
>   page_fault
>
> We do not know more about why the mempool is depleted without being
> replenished in time, though. In any case, the dm layer shouldn't depend
> on any allocations outside of its dedicated pools, so forward progress
> should be guaranteed. If this is not the case, then dm should be fixed
> rather than papering over the problem and postponing it by accessing
> more memory reserves.
>
> mempools are a mechanism to maintain dedicated memory reserves to
> guarantee forward progress. Allowing them unbounded access to the page
> allocator's memory reserves goes against the whole purpose of this
> mechanism.
>
> Bisected-by: Ondrej Kozina <okozina@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/mempool.c | 20 ++++----------------
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..5ba6c8b3b814 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -306,36 +306,25 @@ EXPORT_SYMBOL(mempool_resize);
>   * returns NULL. Note that due to preallocation, this function
>   * *never* fails when called from process contexts. (it might
>   * fail if called from an IRQ context.)
> - * Note: neither __GFP_NOMEMALLOC nor __GFP_ZERO are supported.
> + * Note: using __GFP_ZERO is not supported.
>   */
> -void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> +void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  {
>  	void *element;
>  	unsigned long flags;
>  	wait_queue_t wait;
>  	gfp_t gfp_temp;
>
> -	/* If oom killed, memory reserves are essential to prevent livelock */
> -	VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
> -	/* No element size to zero on allocation */
>  	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> -
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> +	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */

As I was reading through this thread I kept thinking "Surely
mempool_alloc() should never ever allocate from emergency reserves.
Ever." Then I saw this patch. It made me happy.

Thanks.

Acked-by: NeilBrown <neilb@suse.com>
(if you want it)

NeilBrown
On Thu 21-07-16 16:53:09, Michal Hocko wrote:
> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
>
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.

I've noticed that Andrew has already picked this one up. Is anybody
against marking it for stable?
On 07/22/2016 08:37 AM, Michal Hocko wrote:
> On Thu 21-07-16 16:53:09, Michal Hocko wrote:
>> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Thu, 21 Jul 2016 16:40:59 +0200
>> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>>  free elements"
>>
>> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>
> I've noticed that Andrew has already picked this one up. Is anybody
> against marking it for stable?

It would be strange to have different behavior, with a known regression,
in the 4.6 and 4.7 stable series. Actually, is there still time for 4.7
proper?

Vlastimil
On Fri, 22 Jul 2016 14:26:19 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> On 07/22/2016 08:37 AM, Michal Hocko wrote:
> > On Thu 21-07-16 16:53:09, Michal Hocko wrote:
> >> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> >> From: Michal Hocko <mhocko@suse.com>
> >> Date: Thu, 21 Jul 2016 16:40:59 +0200
> >> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
> >>  free elements"
> >>
> >> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> >
> > I've noticed that Andrew has already picked this one up. Is anybody
> > against marking it for stable?
>
> It would be strange to have different behavior with known regression in
> 4.6 and 4.7 stables. Actually, there's still time for 4.7 proper?

I added the cc:stable.

Do we need to bust a gut to rush it into 4.7? It sounds safer to let
it bake for a while and fix it in 4.7.1.
On 07/22/2016 09:44 PM, Andrew Morton wrote:
> On Fri, 22 Jul 2016 14:26:19 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> On 07/22/2016 08:37 AM, Michal Hocko wrote:
>>> On Thu 21-07-16 16:53:09, Michal Hocko wrote:
>>>> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
>>>> From: Michal Hocko <mhocko@suse.com>
>>>> Date: Thu, 21 Jul 2016 16:40:59 +0200
>>>> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>>>>  free elements"
>>>>
>>>> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>>>
>>> I've noticed that Andrew has already picked this one up. Is anybody
>>> against marking it for stable?
>>
>> It would be strange to have different behavior with known regression in
>> 4.6 and 4.7 stables. Actually, there's still time for 4.7 proper?
>
> I added the cc:stable.
>
> Do we need to bust a gut to rush it into 4.7? It sounds safer to let
> it bake for a while, fix it in 4.7.1?

Yeah, I guess it's safer to wait now. It would be different if the
reverted commit had gone in during the same cycle.
diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..5ba6c8b3b814 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -306,36 +306,25 @@ EXPORT_SYMBOL(mempool_resize);
  * returns NULL. Note that due to preallocation, this function
  * *never* fails when called from process contexts. (it might
  * fail if called from an IRQ context.)
- * Note: neither __GFP_NOMEMALLOC nor __GFP_ZERO are supported.
+ * Note: using __GFP_ZERO is not supported.
  */
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 {
 	void *element;
 	unsigned long flags;
 	wait_queue_t wait;
 	gfp_t gfp_temp;
 
-	/* If oom killed, memory reserves are essential to prevent livelock */
-	VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
-	/* No element size to zero on allocation */
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
-
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available. This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,12 +348,11 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round. If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if (gfp_temp != gfp_mask) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;
 	}
-	gfp_temp = gfp_mask;
 
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {