4.17-rc1 FS went read-only during balance
Qu Wenruo April 23, 2018, 6:13 a.m. UTC
On 2018-04-23 13:08, Dmitrii Tcvetkov wrote:
> On Mon, 23 Apr 2018 09:23:53 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2018-04-21 22:55, Dmitrii Tcvetkov wrote:
>>> TL;DR: This looks like a regression in 4.17, but I managed to find a
>>> workaround to make the filesystem rw-mountable again.
>>> Kernel built from tag v4.17-rc1
>>> btrfs-progs 4.16
>>> Tonight two of my machines (a PC with ECC RAM and a laptop with
>>> non-ECC RAM) were doing their usual weekly balance with this command
>>> via cron:
>>> btrfs balance start -musage=50 -dusage=50 <mountpoint>
>>> Both machines run the same kernel version.
>>> On the PC this caused the root and "data" filesystems to go read-only.
>>> Root is on an SSD with data single and metadata DUP; the "data"
>>> filesystem is on 2 HDDs with RAID1 for both data and metadata.
>>> On the laptop only /home went ro; it is on an NVMe SSD with data
>>> single and metadata DUP.
>>> Btrfs check of the PC rootfs reported no errors in either mode. I ran
>>> each mode once before reboot, on the read-only filesystem with the
>>> --force flag, and then again from a live USB. Same output, no errors.
>>> After reboot the kernel refused to mount rootfs rw, with the same
>>> error as during the cron balance; an ro mount was accepted. Error
>>> during the rw mount:
>>> BTRFS: error (device dm-17) in merge_reloc_roots:2465: errno=-117  
>> 117 means EUCLEAN, which could be caused by the newly introduced
>> first_key and level check.
>> Please apply this hotfix to fix it.
>> btrfs: Only check first key for committed tree blocks
>> (which is included in the latest pull request)
>> Also, please consider enabling CONFIG_BTRFS_DEBUG to provide extra
>> debug info.
>> Thanks,
>> Qu
> I tried 4.17-rc2 (as the pull request was pulled) with
> CONFIG_BTRFS_DEBUG on an LVM snapshot of the laptop home partition
> (/dev/vdb) in a VM (the VM kernel sees only the snapshot, so there are
> no UUID collisions). Dmesg attached.

Thanks for the info and your previous btrfs-image.

The image itself shows nothing wrong, so it should be a runtime problem.
Would you please apply these two debug patches, along with the attached
diff file?

My guess is the parent node is not initialized correctly in this case.


diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60caa68c3618..79f482578e02 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -458,6 +458,7 @@  static int verify_level_key(struct btrfs_fs_info *fs_info,
 			  eb->start, first_key->objectid, first_key->type,
 			  first_key->offset, found_key.objectid,
 			  found_key.type, found_key.offset);
+		btrfs_print_tree(eb, false);
 	return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 00b7d3231821..cde0cb6c9786 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1870,6 +1870,8 @@  int replace_path(struct btrfs_trans_handle *trans,
 					     level - 1, &first_key);
 			if (IS_ERR(eb)) {
 				ret = PTR_ERR(eb);
+				btrfs_err(fs_info, "parent leaf, slot: %d:", slot);
+				btrfs_print_tree(parent, false);
 			} else if (!extent_buffer_uptodate(eb)) {
 				ret = -EIO;