btrfs: Avoid BUG_ON()s because of ENOMEM caused by kmalloc() failure
diff mbox

Message ID 56C16441.5030000@jp.fujitsu.com
State New
Headers show

Commit Message

Satoru Takeuchi Feb. 15, 2016, 5:38 a.m. UTC
There are some BUG_ON()'s after kmalloc() as follows.

Comments

David Sterba Feb. 15, 2016, 5:53 p.m. UTC | #1
On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote:
> There are some BUG_ON()'s after kmalloc() as follows.
> 
> =====
> foo = kmalloc();
> BUG_ON(!foo);	/* -ENOMEM case */
> =====
> 
> A Docker + memory cgroup user hit these BUG_ON()s.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=112101
> 
> Since it's very hard to handle these ENOMEMs properly,
> preventing these kmalloc() failures to avoid these
> BUG_ON()s for now, are a bit better than the current
> implementation anyway.

Beware that the NOFAIL semantics is can cause deadlocks if it's on the
critical writeback path or and can be reentered from itself through the
reclaim. Unless you're sure that this is not the case, please do not add
them just because it would seemingly fix the allocation failures.

In the docker example, the memory is limited by cgroups so the NOFAIL
mode can exhaust all reserves and just loop endlessly waiting for the
OOM killer to get some memory or just waiting without any chance to
progress.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Satoru Takeuchi Feb. 17, 2016, 5:54 a.m. UTC | #2
On 2016/02/16 2:53, David Sterba wrote:
> On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote:
>> There are some BUG_ON()'s after kmalloc() as follows.
>>
>> =====
>> foo = kmalloc();
>> BUG_ON(!foo);	/* -ENOMEM case */
>> =====
>>
>> A Docker + memory cgroup user hit these BUG_ON()s.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=112101
>>
>> Since it's very hard to handle these ENOMEMs properly,
>> preventing these kmalloc() failures to avoid these
>> BUG_ON()s for now, are a bit better than the current
>> implementation anyway.
>
> Beware that the NOFAIL semantics is can cause deadlocks if it's on the
> critical writeback path or and can be reentered from itself through the
> reclaim. Unless you're sure that this is not the case, please do not add
> them just because it would seemingly fix the allocation failures.

About the all cases I changed, kmalloc()s can block
since gfp_flags_allow_blocking() are true. Then no locks
are acquired here and deadlocks don't happen.

Am I missing something?

>
> In the docker example, the memory is limited by cgroups so the NOFAIL
> mode can exhaust all reserves and just loop endlessly waiting for the
> OOM killer to get some memory or just waiting without any chance to
> progress.

I consider triggering OOM killer and killing processes
in a cgroup are better than killing whole system.

About the possibility of endless loop, there are many
such problems in the whole kernel. Of course it can be
said to Btrfs.

==========================================
$ grep -rnH __GFP_NOFAIL fs/btrfs/
fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL);
fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
==========================================

I understand fixing these problems cooperate with
memory cgroup guys is the best in the long run.
However, I consider bypassing this problem for now
is better than the current implementation.

Thanks,
Satoru

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Feb. 17, 2016, 3:11 p.m. UTC | #3
On Wed, Feb 17, 2016 at 02:54:23PM +0900, Satoru Takeuchi wrote:
> On 2016/02/16 2:53, David Sterba wrote:
> > On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote:
> >> There are some BUG_ON()'s after kmalloc() as follows.
> >>
> >> =====
> >> foo = kmalloc();
> >> BUG_ON(!foo);	/* -ENOMEM case */
> >> =====
> >>
> >> A Docker + memory cgroup user hit these BUG_ON()s.
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=112101
> >>
> >> Since it's very hard to handle these ENOMEMs properly,
> >> preventing these kmalloc() failures to avoid these
> >> BUG_ON()s for now, are a bit better than the current
> >> implementation anyway.
> >
> > Beware that the NOFAIL semantics is can cause deadlocks if it's on the
> > critical writeback path or and can be reentered from itself through the
> > reclaim. Unless you're sure that this is not the case, please do not add
> > them just because it would seemingly fix the allocation failures.
> 
> About the all cases I changed, kmalloc()s can block
> since gfp_flags_allow_blocking() are true. Then no locks
> are acquired here and deadlocks don't happen.
> 
> Am I missing something?

Waiting as in GFP_WAIT is not the same as GFP_NOFAIL that can wait
indefinetelly. The locking of NOFAIL is implied. The kmalloc callsite
will block until there's memory available. If another thread of btrfs
waits for this code to progress, it will block as well.

> > In the docker example, the memory is limited by cgroups so the NOFAIL
> > mode can exhaust all reserves and just loop endlessly waiting for the
> > OOM killer to get some memory or just waiting without any chance to
> > progress.
> 
> I consider triggering OOM killer and killing processes
> in a cgroup are better than killing whole system.

The action of OOM killer is not a problem. The cgroup memory limit can
be low or all the memory is unreclaimable. At this point btrfs code will
block.


> About the possibility of endless loop, there are many
> such problems in the whole kernel. Of course it can be
> said to Btrfs.

Many? Examples? In this context we're talking about endless loops caused
by non-failing allocations.

> ==========================================
> $ grep -rnH __GFP_NOFAIL fs/btrfs/
> fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL);
> fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
> fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
> fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
> ==========================================

I'm aware of the existing NOFAIL allocations. There are two at
extent_buffer allocation time, added recently and provoked by the
intended changes to memory allocator that would fail the GFP_NOFS
allocations.

The other two are from year 2010, set_extent_bit, IMHO added so the
error handling does not get complicated and expressing "we don't want to
fail here". But there are many other calls to set_extent_bit that could
fail. This is inconsistent and should be unified. In a way that we're
sure that we're not introducing potential hangs.

> I understand fixing these problems cooperate with
> memory cgroup guys is the best in the long run.

It's not about cgroups, btrfs needs to ideally handle all memory
allocation failures in a way that uses some sort of fallbacks but still
can lead to a transaction abort if there's simply no memory left.

> However, I consider bypassing this problem for now
> is better than the current implementation.

It's more like replacing one problem with another. With every new
NOFAIL, one has to think about the runtime interactions with the others.
I'd rather see this fixed with a particular call path in mind or class,
eg. the extent_map bit settings, than throwing NOFAIL at places that
somebody accidentally hit.

As a temporary fix we can add __GFP_HIGH to the interesting sites so we
can get access to the emergency reserves, and this is on my list of
things to do after the NOFS -> KERNEL updates are done.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Satoru Takeuchi Feb. 18, 2016, 1:59 a.m. UTC | #4
On 2016/02/18 0:11, David Sterba wrote:
> On Wed, Feb 17, 2016 at 02:54:23PM +0900, Satoru Takeuchi wrote:
>> On 2016/02/16 2:53, David Sterba wrote:
>>> On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote:
>>>> There are some BUG_ON()'s after kmalloc() as follows.
>>>>
>>>> =====
>>>> foo = kmalloc();
>>>> BUG_ON(!foo);	/* -ENOMEM case */
>>>> =====
>>>>
>>>> A Docker + memory cgroup user hit these BUG_ON()s.
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=112101
>>>>
>>>> Since it's very hard to handle these ENOMEMs properly,
>>>> preventing these kmalloc() failures to avoid these
>>>> BUG_ON()s for now, are a bit better than the current
>>>> implementation anyway.
>>>
>>> Beware that the NOFAIL semantics is can cause deadlocks if it's on the
>>> critical writeback path or and can be reentered from itself through the
>>> reclaim. Unless you're sure that this is not the case, please do not add
>>> them just because it would seemingly fix the allocation failures.
>>
>> About the all cases I changed, kmalloc()s can block
>> since gfp_flags_allow_blocking() are true. Then no locks
>> are acquired here and deadlocks don't happen.
>>
>> Am I missing something?
>
> Waiting as in GFP_WAIT is not the same as GFP_NOFAIL that can wait
> indefinetelly. The locking of NOFAIL is implied. The kmalloc callsite
> will block until there's memory available. If another thread of btrfs
> waits for this code to progress, it will block as well.
>
>>> In the docker example, the memory is limited by cgroups so the NOFAIL
>>> mode can exhaust all reserves and just loop endlessly waiting for the
>>> OOM killer to get some memory or just waiting without any chance to
>>> progress.
>>
>> I consider triggering OOM killer and killing processes
>> in a cgroup are better than killing whole system.
>
> The action of OOM killer is not a problem. The cgroup memory limit can
> be low or all the memory is unreclaimable. At this point btrfs code will
> block.
>
>
>> About the possibility of endless loop, there are many
>> such problems in the whole kernel. Of course it can be
>> said to Btrfs.
>
> Many? Examples? In this context we're talking about endless loops caused
> by non-failing allocations.
>
>> ==========================================
>> $ grep -rnH __GFP_NOFAIL fs/btrfs/
>> fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL);
>> fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>> fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
>> fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>> ==========================================
>
> I'm aware of the existing NOFAIL allocations. There are two at
> extent_buffer allocation time, added recently and provoked by the
> intended changes to memory allocator that would fail the GFP_NOFS
> allocations.
>
> The other two are from year 2010, set_extent_bit, IMHO added so the
> error handling does not get complicated and expressing "we don't want to
> fail here". But there are many other calls to set_extent_bit that could
> fail. This is inconsistent and should be unified. In a way that we're
> sure that we're not introducing potential hangs.
>
>> I understand fixing these problems cooperate with
>> memory cgroup guys is the best in the long run.
>
> It's not about cgroups, btrfs needs to ideally handle all memory
> allocation failures in a way that uses some sort of fallbacks but still
> can lead to a transaction abort if there's simply no memory left.
>
>> However, I consider bypassing this problem for now
>> is better than the current implementation.
>
> It's more like replacing one problem with another. With every new
> NOFAIL, one has to think about the runtime interactions with the others.
> I'd rather see this fixed with a particular call path in mind or class,
> eg. the extent_map bit settings, than throwing NOFAIL at places that
> somebody accidentally hit.
>
> As a temporary fix we can add __GFP_HIGH to the interesting sites so we
> can get access to the emergency reserves, and this is on my list of
> things to do after the NOFS -> KERNEL updates are done.

OK, got it. I'll reconsider how to fix there problem.

Thanks,
Satoru

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

=====
foo = kmalloc();
BUG_ON(!foo);	/* -ENOMEM case */
=====

A Docker + memory cgroup user hit these BUG_ON()s.

https://bugzilla.kernel.org/show_bug.cgi?id=112101

Since it's very hard to handle these ENOMEMs properly,
preventing these kmalloc() failures to avoid these
BUG_ON()s for now, are a bit better than the current
implementation anyway.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
---
 fs/btrfs/extent_io.c  | 6 ++----
 fs/btrfs/inode.c      | 6 ++----
 fs/btrfs/relocation.c | 3 +--
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2e7c97a..5f92450 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -874,10 +874,8 @@  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,

 	bits |= EXTENT_FIRST_DELALLOC;
 again:
-	if (!prealloc && gfpflags_allow_blocking(mask)) {
-		prealloc = alloc_extent_state(mask);
-		BUG_ON(!prealloc);
-	}
+	if (!prealloc && gfpflags_allow_blocking(mask))
+		prealloc = alloc_extent_state(mask|__GFP_NOFAIL);

 	spin_lock(&tree->lock);
 	if (cached_state && *cached_state) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85afe66..d20e5c5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -357,8 +357,7 @@  static noinline int add_async_extent(struct async_cow *cow,
 {
 	struct async_extent *async_extent;

-	async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS);
-	BUG_ON(!async_extent); /* -ENOMEM */
+	async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS|__GFP_NOFAIL);
 	async_extent->start = start;
 	async_extent->ram_size = ram_size;
 	async_extent->compressed_size = compressed_size;
@@ -1143,8 +1142,7 @@  static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, EXTENT_LOCKED,
 			 1, 0, NULL, GFP_NOFS);
 	while (start < end) {
-		async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
-		BUG_ON(!async_cow); /* -ENOMEM */
+		async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS|__GFP_NOFAIL);
 		async_cow->inode = igrab(inode);
 		async_cow->root = root;
 		async_cow->locked_page = locked_page;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ef6d8fc..6b9f718 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1373,8 +1373,7 @@  static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 	u64 last_snap = 0;
 	int ret;

-	root_item = kmalloc(sizeof(*root_item), GFP_NOFS);
-	BUG_ON(!root_item);
+	root_item = kmalloc(sizeof(*root_item), GFP_NOFS|__GFP_NOFAIL);

 	root_key.objectid = BTRFS_TREE_RELOC_OBJECTID;
 	root_key.type = BTRFS_ROOT_ITEM_KEY;