diff mbox

Btrfs: fix easily get into ENOSPC in mixed case

Message ID 20110420005531.6d395333@sf (mailing list archive)
State New, archived
Headers show

Commit Message

Sergei Trofimovich April 19, 2011, 9:55 p.m. UTC
On Mon, 11 Apr 2011 14:29:21 +0800
liubo <liubo2009@cn.fujitsu.com> wrote:

> On 04/09/2011 05:55 AM, Sergei Trofimovich wrote:
> > [  100.500011] Call Trace:
> > [  100.500011]  [<ffffffff810ed3a0>] vfs_unlink+0x80/0xf0
> > [  100.500011]  [<ffffffff810ef6f3>] do_unlinkat+0x173/0x1b0
> > [  100.500011]  [<ffffffff8111727b>] ? fsnotify_find_inode_mark+0x3b/0x50
> > [  100.500011]  [<ffffffff810dff91>] ? filp_close+0x61/0x90
> > [  100.500011]  [<ffffffff810f0c0d>] sys_unlinkat+0x1d/0x40
> > [  100.500011]  [<ffffffff81574c3b>] system_call_fastpath+0x16/0x1b
> > [  100.500011] Code: 4c 8b 65 e0 48 8b 5d d8 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 40 00 4c 89 fe 4c 89 ef e8 05 d0 ff ff 85 c0 74 bb 0f 0b <0f> 0b 89 c3 eb cd 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 
> > [  100.500011] RIP  [<ffffffffa024a011>] btrfs_unlink+0xd1/0xe0 [btrfs]
> > [  100.500011]  RSP <ffff880070b55e28>
> > [  100.525672] ---[ end trace 7e63b9144b7307fe ]---
> > 
> > Looks like I won't be able to test your patch until this thing will go away first.
> 
> Thanks a lot for testing, though.
> 
> I guess something messed up your btrfs metadata, cause when btrfs_unlink() wanted to remove A,
> it found that A was just missing...

Hello again!

I've decided to explore OOps myself a bit. The result is a debugging patch I've attached.
I've applied it and you patch
    [PATCH] Btrfs: fix easily get into ENOSPC in mixed case
on top of vanilla 2.6.39-rc4

The result you can see in attached 'fault.log'. There you can see -ENOSPC propagation.
No idea why it converted to -ENOENT in the end. The result is very similar with vanilla 2.6.39-rc4.
I can run the debugging patch on it as well if you like.


I think interesting the parts are (they were found by Arne before):
[    0.040000] new update_space_info: 
[    0.040000] space_info has 0 free, is not full
[    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0
[    0.040000] new update_space_info: 
[    0.040000] space_info has 0 free, is not full
[    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0
[    0.040000] new update_space_info: 
[    0.040000] space_info has 0 free, is not full
[    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0

and:
[    0.290000] shrink_delalloc: kick writeback_inodes_sb_nr_if_idle to free 0 pages (yet 0 bytes to wb).
[    0.300000] shrink_delalloc: kick writeback_inodes_sb_nr_if_idle to free 0 pages (yet 0 bytes to wb).
[    0.310000] shrink_delalloc: kick writeback_inodes_sb_nr_if_idle to free 0 pages (yet 0 bytes to wb).
[    0.320000] reserve_metadata_bytes: shrink_delalloc(num_bytes=983040) -> 0
[    0.320000] reserve_metadata_bytes: 
[    0.320000] space_info has 18446744073708568576 free, is not full
[    0.320000] space_info total=0, used=0, pinned=0, reserved=983040, may_use=0, readonly=0

Here I see 2 problems:
1 (major) free space is busted (as the whole space_info).
2 (minor) the original loop of shrink_delalloc busy waited 30 seconds, so I had to trim it;
   and it seemed to know that additional IO won't reclaim any pages ('yet 0 bytes' string AFAIU);
   but it's likely a result of busted space_info: it has reserved=983040 more, than total=0.

If I'll revert c59021f846881a957ac5afe456d0f59d6a517b61 from vanilla 2.6.39-rc4 I'll be able to delete
the files from filesystem.

I perform tests in usermode linux,  and run 'rm -rf' on the same faulty partition snapshot.
I've backed it up, so it's easy to reproduce and debug. I can provide more info if you need.
It's a 'readable' (read-only btrfs mount behaves fine) 5GB image.

Comments

Sergei Trofimovich April 21, 2011, 3:19 p.m. UTC | #1
On Wed, 20 Apr 2011 00:55:31 +0300
Sergei Trofimovich <slyich@gmail.com> wrote:

> > Thanks a lot for testing, though.
> > 
> > I guess something messed up your btrfs metadata, cause when btrfs_unlink() wanted to remove A,
> > it found that A was just missing...

Yeah. Now I think I completely figured out what's going on:

My mixed partition was correct except it did'n have MIXED bit in superblock
(I mistakenly used LZO bit for that in btrfs-progs, as I took some old patch from maillist).

Thus I had D+M marked (aka mixed) block groups, but from superblock's features
there was no way to figure out there is yet such groups. That's why your second patch
(looking correct) didn't help me. I've fallen to mixed == 0 case.

> I think interesting the parts are (they were found by Arne before):
> [    0.040000] new update_space_info: 
> [    0.040000] space_info has 0 free, is not full
> [    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0
> [    0.040000] new update_space_info: 
> [    0.040000] space_info has 0 free, is not full
> [    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0
> [    0.040000] new update_space_info: 
> [    0.040000] space_info has 0 free, is not full
> [    0.040000] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, readonly=0

Talking with Arne I was convinced my breakage is not worth fixing in kernel.
After I've added MIXED bit to superblocks' features OOpses gone.

So, once again, please disregard my OOpses.
diff mbox

Patch

From 56668c999cae1b5ca00e94c63dc1c38eba68e3e1 Mon Sep 17 00:00:00 2001
From: Sergei Trofimovich <slyfox@gentoo.org>
Date: Wed, 20 Apr 2011 00:11:40 +0300
Subject: [PATCH 2/2] btrfs: my nice debug

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
---
 fs/btrfs/ctree.c       |    7 +++++++
 fs/btrfs/dir-item.c    |    6 ++++++
 fs/btrfs/extent-tree.c |   25 ++++++++++++++++++++++++-
 fs/btrfs/inode.c       |    5 +++++
 4 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 84d7ca1..41f8581 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -421,7 +421,10 @@  static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 				     root->root_key.objectid, &disk_key,
 				     level, search_start, empty_size);
 	if (IS_ERR(cow))
+	{
+		printk(KERN_INFO "__btrfs_cow_block: btrfs_alloc_free_block ret=%d\n", (int)PTR_ERR(cow));
 		return PTR_ERR(cow);
+	}
 
 	/* cow is set to blocking by btrfs_init_new_buffer */
 
@@ -1648,6 +1651,7 @@  again:
 					      p->slots[level + 1], &b);
 			if (err) {
 				ret = err;
+				printk(KERN_INFO "1. btrfs_search_slot ret=%d\n", ret);
 				goto done;
 			}
 		}
@@ -1692,6 +1696,7 @@  cow_done:
 				goto again;
 			if (err) {
 				ret = err;
+				printk(KERN_INFO "2. btrfs_search_slot ret=%d\n", ret);
 				goto done;
 			}
 			b = p->nodes[level];
@@ -1711,6 +1716,7 @@  cow_done:
 				goto again;
 			if (err) {
 				ret = err;
+				printk(KERN_INFO "3. btrfs_search_slot ret=%d\n", ret);
 				goto done;
 			}
 
@@ -1736,6 +1742,7 @@  cow_done:
 				BUG_ON(err > 0);
 				if (err) {
 					ret = err;
+					printk(KERN_INFO "4. btrfs_search_slot ret=%d\n", ret);
 					goto done;
 				}
 			}
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index dec9348..228a706 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -228,7 +228,10 @@  struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret < 0)
+	{
+		printk (KERN_INFO "btrfs_lookup_dir_item: search slot failed: %d\n", ret);
 		return ERR_PTR(ret);
+	}
 	if (ret > 0) {
 		if (path->slots[0] == 0)
 			return NULL;
@@ -243,6 +246,7 @@  struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
 	    found_key.offset != key.offset)
 		return NULL;
 
+	printk (KERN_INFO "btrfs_lookup_dir_item -> btrfs_match_dir_item_name\n");
 	return btrfs_match_dir_item_name(root, path, name, name_len);
 }
 
@@ -377,6 +381,8 @@  struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_root *root,
 	u32 this_len;
 	struct extent_buffer *leaf;
 
+	printk(KERN_INFO "btrfs_match_dir_item_name (name='%s')\n", name);
+
 	leaf = path->nodes[0];
 	dir_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dir_item);
 	if (verify_dir_item(root, leaf, dir_item))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 31f33ba..e51adfa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3016,6 +3016,7 @@  static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		found->full = 0;
 		spin_unlock(&found->lock);
 		*space_info = found;
+		printk (KERN_INFO "found %s: ", __func__); dump_space_info (*space_info, 0, 0);
 		return 0;
 	}
 	found = kzalloc(sizeof(*found), GFP_NOFS);
@@ -3043,6 +3044,7 @@  static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	*space_info = found;
 	list_add_rcu(&found->list, &info->space_info);
 	atomic_set(&found->caching_threads, 0);
+	printk (KERN_INFO "new %s: ", __func__); dump_space_info (*space_info, 0, 0);
 	return 0;
 }
 
@@ -3418,6 +3420,8 @@  static int shrink_delalloc(struct btrfs_trans_handle *trans,
 	block_rsv = &root->fs_info->delalloc_block_rsv;
 	space_info = block_rsv->space_info;
 
+	printk(KERN_INFO "shrink_delalloc: "); dump_space_info (space_info, 0, 0);
+
 	smp_mb();
 	reserved = space_info->bytes_reserved;
 	progress = space_info->reservation_progress;
@@ -3427,11 +3431,13 @@  static int shrink_delalloc(struct btrfs_trans_handle *trans,
 
 	max_reclaim = min(reserved, to_reclaim);
 
-	while (loops < 1024) {
+	while (loops < 10) {
 		/* have the flusher threads jump in and do some IO */
 		smp_mb();
 		nr_pages = min_t(unsigned long, nr_pages,
 		       root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
+		printk(KERN_INFO "shrink_delalloc: kick writeback_inodes_sb_nr_if_idle to free %d pages (yet %llu bytes to wb).\n"
+		      , nr_pages, (long long unsigned)root->fs_info->delalloc_bytes);
 		writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);
 
 		spin_lock(&space_info->lock);
@@ -3446,7 +3452,10 @@  static int shrink_delalloc(struct btrfs_trans_handle *trans,
 			break;
 
 		if (trans && trans->transaction->blocked)
+		{
+		printk(KERN_INFO "shrink_delalloc: ret -EAGAIN\n");
 			return -EAGAIN;
+		}
 
 		time_left = schedule_timeout_interruptible(1);
 
@@ -3499,11 +3508,15 @@  again:
 	if (reserved)
 		num_bytes = 0;
 
+	printk(KERN_INFO "reserve_metadata_bytes: "); dump_space_info (space_info, 0, 0);
+
 	spin_lock(&space_info->lock);
 	unused = space_info->bytes_used + space_info->bytes_reserved +
 		 space_info->bytes_pinned + space_info->bytes_readonly +
 		 space_info->bytes_may_use;
 
+	printk(KERN_INFO "reserve_metadata_bytes: consider unused=%llu\n", (long long unsigned)unused);
+
 	/*
 	 * The idea here is that we've not already over-reserved the block group
 	 * then we can go ahead and save our reservation first and then start
@@ -3558,6 +3571,8 @@  again:
 	 * metadata until after the IO is completed.
 	 */
 	ret = shrink_delalloc(trans, root, num_bytes, 1);
+	printk(KERN_INFO "reserve_metadata_bytes: shrink_delalloc(num_bytes=%llu) -> %d\n"
+	      , (long long unsigned)num_bytes, ret);
 	if (ret > 0)
 		return 0;
 	else if (ret < 0)
@@ -5773,11 +5788,14 @@  use_block_rsv(struct btrfs_trans_handle *trans,
 		 * the global reserve.
 		 */
 		if (ret && block_rsv != global_rsv) {
+			printk(KERN_INFO "use_block_rsv (nonglobal metadata rsv) ret=%d. retry global\n", ret);
 			ret = block_rsv_use_bytes(global_rsv, blocksize);
 			if (!ret)
 				return global_rsv;
+			printk(KERN_INFO "use_block_rsv (retry global metadata rsv) ret=%d\n", ret);
 			return ERR_PTR(ret);
 		} else if (ret) {
+			printk(KERN_INFO "use_block_rsv (global metadata rsv) ret=%d\n", ret);
 			return ERR_PTR(ret);
 		}
 		return block_rsv;
@@ -5802,6 +5820,7 @@  use_block_rsv(struct btrfs_trans_handle *trans,
 		}
 	}
 
+	printk(KERN_INFO "the very end of use_block_rsv. -ENOSPC\n");
 	return ERR_PTR(-ENOSPC);
 }
 
@@ -5833,12 +5852,16 @@  struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
 
 	block_rsv = use_block_rsv(trans, root, blocksize);
 	if (IS_ERR(block_rsv))
+	{
+		printk(KERN_INFO "btrfs_alloc_free_block: use_block_rsv error=%d\n", (int)PTR_ERR(block_rsv));
 		return ERR_CAST(block_rsv);
+	}
 
 	ret = btrfs_reserve_extent(trans, root, blocksize, blocksize,
 				   empty_size, hint, (u64)-1, &ins, 0);
 	if (ret) {
 		unuse_block_rsv(block_rsv, blocksize);
+		printk(KERN_INFO "btrfs_alloc_free_block: reserve_extent=%d\n", ret);
 		return ERR_PTR(ret);
 	}
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fcd66b6..037afe2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2699,6 +2699,7 @@  static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	di = btrfs_lookup_dir_item(trans, root, path, dir->i_ino,
 				    name, name_len, -1);
 	if (IS_ERR(di)) {
+		printk(KERN_INFO "btrfs_lookup_dir_item: IS_ERR(di) = %d\n", (int)PTR_ERR(di));
 		ret = PTR_ERR(di);
 		goto err;
 	}
@@ -2710,7 +2711,9 @@  static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 	btrfs_dir_item_key_to_cpu(leaf, di, &key);
 	ret = btrfs_delete_one_dir_name(trans, root, path, di);
 	if (ret)
+	{
 		goto err;
+	}
 	btrfs_release_path(root, path);
 
 	ret = btrfs_del_inode_ref(trans, root, name, name_len,
@@ -2762,8 +2765,10 @@  int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 		       const char *name, int name_len)
 {
 	int ret;
+	printk (KERN_INFO"__unlinking '%s'\n", name);
 	ret = __btrfs_unlink_inode(trans, root, dir, inode, name, name_len);
 	if (!ret) {
+		printk (KERN_INFO"unlinking '%s'\n", name);
 		btrfs_drop_nlink(inode);
 		ret = btrfs_update_inode(trans, root, inode);
 	}
-- 
1.7.3.4