Message ID | 56C16441.5030000@jp.fujitsu.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote: > There are some BUG_ON()'s after kmalloc() as follows. > > ===== > foo = kmalloc(); > BUG_ON(!foo); /* -ENOMEM case */ > ===== > > A Docker + memory cgroup user hit these BUG_ON()s. > > https://bugzilla.kernel.org/show_bug.cgi?id=112101 > > Since it's very hard to handle these ENOMEMs properly, > preventing these kmalloc() failures to avoid these > BUG_ON()s for now, are a bit better than the current > implementation anyway. Beware that the NOFAIL semantics is can cause deadlocks if it's on the critical writeback path or and can be reentered from itself through the reclaim. Unless you're sure that this is not the case, please do not add them just because it would seemingly fix the allocation failures. In the docker example, the memory is limited by cgroups so the NOFAIL mode can exhaust all reserves and just loop endlessly waiting for the OOM killer to get some memory or just waiting without any chance to progress. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016/02/16 2:53, David Sterba wrote: > On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote: >> There are some BUG_ON()'s after kmalloc() as follows. >> >> ===== >> foo = kmalloc(); >> BUG_ON(!foo); /* -ENOMEM case */ >> ===== >> >> A Docker + memory cgroup user hit these BUG_ON()s. >> >> https://bugzilla.kernel.org/show_bug.cgi?id=112101 >> >> Since it's very hard to handle these ENOMEMs properly, >> preventing these kmalloc() failures to avoid these >> BUG_ON()s for now, are a bit better than the current >> implementation anyway. > > Beware that the NOFAIL semantics is can cause deadlocks if it's on the > critical writeback path or and can be reentered from itself through the > reclaim. Unless you're sure that this is not the case, please do not add > them just because it would seemingly fix the allocation failures. About the all cases I changed, kmalloc()s can block since gfp_flags_allow_blocking() are true. Then no locks are acquired here and deadlocks don't happen. Am I missing something? > > In the docker example, the memory is limited by cgroups so the NOFAIL > mode can exhaust all reserves and just loop endlessly waiting for the > OOM killer to get some memory or just waiting without any chance to > progress. I consider triggering OOM killer and killing processes in a cgroup are better than killing whole system. About the possibility of endless loop, there are many such problems in the whole kernel. Of course it can be said to Btrfs. ========================================== $ grep -rnH __GFP_NOFAIL fs/btrfs/ fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL); fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL); fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL); fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); ========================================== I understand fixing these problems cooperate with memory cgroup guys is the best in the long run. However, I consider bypassing this problem for now is better than the current implementation. Thanks, Satoru > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 02:54:23PM +0900, Satoru Takeuchi wrote: > On 2016/02/16 2:53, David Sterba wrote: > > On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote: > >> There are some BUG_ON()'s after kmalloc() as follows. > >> > >> ===== > >> foo = kmalloc(); > >> BUG_ON(!foo); /* -ENOMEM case */ > >> ===== > >> > >> A Docker + memory cgroup user hit these BUG_ON()s. > >> > >> https://bugzilla.kernel.org/show_bug.cgi?id=112101 > >> > >> Since it's very hard to handle these ENOMEMs properly, > >> preventing these kmalloc() failures to avoid these > >> BUG_ON()s for now, are a bit better than the current > >> implementation anyway. > > > > Beware that the NOFAIL semantics is can cause deadlocks if it's on the > > critical writeback path or and can be reentered from itself through the > > reclaim. Unless you're sure that this is not the case, please do not add > > them just because it would seemingly fix the allocation failures. > > About the all cases I changed, kmalloc()s can block > since gfp_flags_allow_blocking() are true. Then no locks > are acquired here and deadlocks don't happen. > > Am I missing something? Waiting as in GFP_WAIT is not the same as GFP_NOFAIL that can wait indefinetelly. The locking of NOFAIL is implied. The kmalloc callsite will block until there's memory available. If another thread of btrfs waits for this code to progress, it will block as well. > > In the docker example, the memory is limited by cgroups so the NOFAIL > > mode can exhaust all reserves and just loop endlessly waiting for the > > OOM killer to get some memory or just waiting without any chance to > > progress. > > I consider triggering OOM killer and killing processes > in a cgroup are better than killing whole system. The action of OOM killer is not a problem. The cgroup memory limit can be low or all the memory is unreclaimable. At this point btrfs code will block. > About the possibility of endless loop, there are many > such problems in the whole kernel. Of course it can be > said to Btrfs. Many? Examples? In this context we're talking about endless loops caused by non-failing allocations. > ========================================== > $ grep -rnH __GFP_NOFAIL fs/btrfs/ > fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL); > fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL); > fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL); > fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); > ========================================== I'm aware of the existing NOFAIL allocations. There are two at extent_buffer allocation time, added recently and provoked by the intended changes to memory allocator that would fail the GFP_NOFS allocations. The other two are from year 2010, set_extent_bit, IMHO added so the error handling does not get complicated and expressing "we don't want to fail here". But there are many other calls to set_extent_bit that could fail. This is inconsistent and should be unified. In a way that we're sure that we're not introducing potential hangs. > I understand fixing these problems cooperate with > memory cgroup guys is the best in the long run. It's not about cgroups, btrfs needs to ideally handle all memory allocation failures in a way that uses some sort of fallbacks but still can lead to a transaction abort if there's simply no memory left. > However, I consider bypassing this problem for now > is better than the current implementation. It's more like replacing one problem with another. With every new NOFAIL, one has to think about the runtime interactions with the others. I'd rather see this fixed with a particular call path in mind or class, eg. the extent_map bit settings, than throwing NOFAIL at places that somebody accidentally hit. As a temporary fix we can add __GFP_HIGH to the interesting sites so we can get access to the emergency reserves, and this is on my list of things to do after the NOFS -> KERNEL updates are done. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016/02/18 0:11, David Sterba wrote: > On Wed, Feb 17, 2016 at 02:54:23PM +0900, Satoru Takeuchi wrote: >> On 2016/02/16 2:53, David Sterba wrote: >>> On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote: >>>> There are some BUG_ON()'s after kmalloc() as follows. >>>> >>>> ===== >>>> foo = kmalloc(); >>>> BUG_ON(!foo); /* -ENOMEM case */ >>>> ===== >>>> >>>> A Docker + memory cgroup user hit these BUG_ON()s. >>>> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=112101 >>>> >>>> Since it's very hard to handle these ENOMEMs properly, >>>> preventing these kmalloc() failures to avoid these >>>> BUG_ON()s for now, are a bit better than the current >>>> implementation anyway. >>> >>> Beware that the NOFAIL semantics is can cause deadlocks if it's on the >>> critical writeback path or and can be reentered from itself through the >>> reclaim. Unless you're sure that this is not the case, please do not add >>> them just because it would seemingly fix the allocation failures. >> >> About the all cases I changed, kmalloc()s can block >> since gfp_flags_allow_blocking() are true. Then no locks >> are acquired here and deadlocks don't happen. >> >> Am I missing something? > > Waiting as in GFP_WAIT is not the same as GFP_NOFAIL that can wait > indefinetelly. The locking of NOFAIL is implied. The kmalloc callsite > will block until there's memory available. If another thread of btrfs > waits for this code to progress, it will block as well. > >>> In the docker example, the memory is limited by cgroups so the NOFAIL >>> mode can exhaust all reserves and just loop endlessly waiting for the >>> OOM killer to get some memory or just waiting without any chance to >>> progress. >> >> I consider triggering OOM killer and killing processes >> in a cgroup are better than killing whole system. > > The action of OOM killer is not a problem. The cgroup memory limit can > be low or all the memory is unreclaimable. At this point btrfs code will > block. > > >> About the possibility of endless loop, there are many >> such problems in the whole kernel. Of course it can be >> said to Btrfs. > > Many? Examples? In this context we're talking about endless loops caused > by non-failing allocations. > >> ========================================== >> $ grep -rnH __GFP_NOFAIL fs/btrfs/ >> fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL); >> fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL); >> fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL); >> fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); >> ========================================== > > I'm aware of the existing NOFAIL allocations. There are two at > extent_buffer allocation time, added recently and provoked by the > intended changes to memory allocator that would fail the GFP_NOFS > allocations. > > The other two are from year 2010, set_extent_bit, IMHO added so the > error handling does not get complicated and expressing "we don't want to > fail here". But there are many other calls to set_extent_bit that could > fail. This is inconsistent and should be unified. In a way that we're > sure that we're not introducing potential hangs. > >> I understand fixing these problems cooperate with >> memory cgroup guys is the best in the long run. > > It's not about cgroups, btrfs needs to ideally handle all memory > allocation failures in a way that uses some sort of fallbacks but still > can lead to a transaction abort if there's simply no memory left. > >> However, I consider bypassing this problem for now >> is better than the current implementation. > > It's more like replacing one problem with another. With every new > NOFAIL, one has to think about the runtime interactions with the others. > I'd rather see this fixed with a particular call path in mind or class, > eg. the extent_map bit settings, than throwing NOFAIL at places that > somebody accidentally hit. > > As a temporary fix we can add __GFP_HIGH to the interesting sites so we > can get access to the emergency reserves, and this is on my list of > things to do after the NOFS -> KERNEL updates are done. OK, got it. I'll reconsider how to fix there problem. Thanks, Satoru > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
===== foo = kmalloc(); BUG_ON(!foo); /* -ENOMEM case */ ===== A Docker + memory cgroup user hit these BUG_ON()s. https://bugzilla.kernel.org/show_bug.cgi?id=112101 Since it's very hard to handle these ENOMEMs properly, preventing these kmalloc() failures to avoid these BUG_ON()s for now, are a bit better than the current implementation anyway. Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> --- fs/btrfs/extent_io.c | 6 ++---- fs/btrfs/inode.c | 6 ++---- fs/btrfs/relocation.c | 3 +-- 3 files changed, 5 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2e7c97a..5f92450 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -874,10 +874,8 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, bits |= EXTENT_FIRST_DELALLOC; again: - if (!prealloc && gfpflags_allow_blocking(mask)) { - prealloc = alloc_extent_state(mask); - BUG_ON(!prealloc); - } + if (!prealloc && gfpflags_allow_blocking(mask)) + prealloc = alloc_extent_state(mask|__GFP_NOFAIL); spin_lock(&tree->lock); if (cached_state && *cached_state) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 85afe66..d20e5c5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -357,8 +357,7 @@ static noinline int add_async_extent(struct async_cow *cow, { struct async_extent *async_extent; - async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS); - BUG_ON(!async_extent); /* -ENOMEM */ + async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS|__GFP_NOFAIL); async_extent->start = start; async_extent->ram_size = ram_size; async_extent->compressed_size = compressed_size; @@ -1143,8 +1142,7 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page, clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, EXTENT_LOCKED, 1, 0, NULL, GFP_NOFS); while (start < end) { - async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS); - BUG_ON(!async_cow); /* -ENOMEM */ + async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS|__GFP_NOFAIL); async_cow->inode = igrab(inode); async_cow->root = root; async_cow->locked_page = locked_page; diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index ef6d8fc..6b9f718 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1373,8 +1373,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans, u64 last_snap = 0; int ret; - root_item = kmalloc(sizeof(*root_item), GFP_NOFS); - BUG_ON(!root_item); + root_item = kmalloc(sizeof(*root_item), GFP_NOFS|__GFP_NOFAIL); root_key.objectid = BTRFS_TREE_RELOC_OBJECTID; root_key.type = BTRFS_ROOT_ITEM_KEY;